RELIABLE LEARNING RATE SCHEDULE: K-DECAY
It is often advantageous to lower the learning rate as training of a deep neural network progresses, using either a pre-defined learning rate schedule or an adaptive learning rate method. To compare results, I train a convolutional neural network with several learning rate schedules and adaptive learning rate techniques.
When training a neural network, the most bang for your buck (in terms of accuracy) will come from selecting the correct learning rate and appropriate learning rate schedule.
From here, I will show you how to implement and use a number of learning rate schedules alongside K-decay:
- The decay schedule built into most Keras optimizers
- Step-based learning rate schedules
- Linear learning rate decay
- Polynomial learning rate schedules
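As a preview, the step-based, linear, and polynomial schedules in the list above can each be written as a small function of the epoch number. The sketch below is a minimal, framework-free version (function names and default hyperparameter values are illustrative, not taken from Keras):

```python
import math

def step_decay(epoch, init_lr=0.01, factor=0.5, drop_every=10):
    # Step-based decay: multiply the rate by `factor` every `drop_every` epochs.
    return init_lr * (factor ** math.floor((1 + epoch) / drop_every))

def polynomial_decay(epoch, max_epochs=100, init_lr=0.01, power=1.0):
    # Polynomial decay to zero at `max_epochs`; power=1.0 is plain linear decay.
    return init_lr * (1 - (epoch / float(max_epochs))) ** power
```

Note that with power=1.0 the polynomial schedule reduces to the linear schedule, which is why the two are often implemented as a single function.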
K-decay learning rate schedules are used to decrease overfitting and increase the generalization capability of a model.
In the original K-decay paper, a new method for constructing learning rate schedules was proposed and applied to one of the standard benchmark problems in computer vision, MNIST.
The algorithm was evaluated against other popular algorithms such as the Adam optimization algorithm, LPP (last place predictor), and RMSProp.
Why adjust our learning rate and use learning rate schedules?
Learning rate schedules allow you to adjust the learning rate over the course of training. By choosing a schedule, you influence how long it takes a model to reach a desired accuracy on the training data.
This is especially important when you are dealing with machine learning problems where building a model takes a large amount of time, and evaluation takes even more.
Learning rate schedules can also accelerate convergence by keeping the learning rate within a sensible range throughout training.
The purpose of the learning rate schedule
The purpose of a learning rate schedule is to reduce the amount of training needed to reach a given level of accuracy. Training a neural network is often a computationally intensive process, and some forms of learning require many passes over a large number of training examples for accurate results.
It therefore becomes important to make each update count. Adjusting the learning rate as you train lets the network make large, coarse updates early on and smaller, more careful updates later, so that fewer passes over the data are wasted.
To see why learning rate schedules are a worthwhile method to apply to help increase model accuracy and descend into areas of lower loss, consider the standard weight update formula used by nearly all neural networks:

W += -alpha * gradient

Recall that the learning rate, alpha, controls the “step” we make along the gradient. Larger values of alpha imply that we are taking bigger steps, while smaller values of alpha make tiny steps. If alpha is zero, the network cannot make any steps at all (since the gradient multiplied by zero is zero).
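As a toy illustration of this update rule (the weight and gradient values here are made up for the example):

```python
# One vanilla gradient-descent step: W <- W - alpha * gradient.
W = [0.5, -0.3]          # current weights (made-up values)
gradient = [0.2, -0.1]   # gradient of the loss w.r.t. each weight
alpha = 0.01             # learning rate: scales the step size

W = [w - alpha * g for w, g in zip(W, gradient)]

# With alpha = 0 the update would leave W unchanged, while a larger
# alpha would move W further along the (negative) gradient.
```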
Most initial learning rates (but not all) you encounter are typically in the set {10^-1, 10^-2, 10^-3}.
A network is then trained for a fixed number of epochs without changing the learning rate.
This method may work well in some situations, but it’s often beneficial to decrease our learning rate over time. When training our network, we are trying to find some location along our loss landscape where the network obtains reasonable accuracy. It doesn’t have to be a global minima or even a local minima, but in practice, simply finding an area of the loss landscape with reasonably low loss is “good enough”.
If we constantly keep the learning rate high, we could overshoot these areas of low loss, as we’ll be taking steps that are too large to descend into them.
Instead, what we can do is decrease our learning rate, thereby allowing our network to take smaller steps — this decreased learning rate enables our network to descend into areas of the loss landscape that are “more optimal” and would have otherwise been missed entirely with a fixed learning rate.
We can, therefore, view the process of learning rate scheduling as:
- Finding a set of reasonably “good” weights early in the training process with a larger learning rate.
- Tuning these weights later in the process to find more optimal weights using a smaller learning rate.
Constant Learning Rate
Constant learning rate is the default learning rate schedule in the SGD optimizer in Keras. Momentum and decay rate are both set to zero by default. It is tricky to choose the right learning rate. Experimenting with a range of learning rates in our example,
lr=0.1 shows relatively good performance to start with. This can serve as a baseline for us to experiment with different learning rate strategies.
keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
The mathematical form of time-based decay is

lr = lr0 / (1 + k * t)

where lr0 and k are hyperparameters and t is the iteration number. Looking into the source code of Keras, the SGD optimizer takes decay and lr arguments and updates the learning rate by a decreasing factor at each iteration:

lr *= (1. / (1. + self.decay * self.iterations))
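Unrolled, the learning rate after t iterations is just the closed-form expression above. A quick sketch to check the numbers (the function name and parameter names mirror the formula, not any particular Keras version):

```python
def time_based_decay(t, init_lr=0.1, k=0.001):
    # lr = lr0 / (1 + k * t): the rate shrinks smoothly as iterations grow.
    return init_lr / (1.0 + k * t)
```

For example, with init_lr=0.1 and k=0.001, the rate has halved by iteration 1000.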
Momentum is another argument in the SGD optimizer which we could tweak to obtain faster convergence. Unlike classical SGD, the momentum method helps the parameter vector build up velocity in directions with a consistent gradient, so as to prevent oscillations. A typical choice of momentum is between 0.5 and 0.9.
The SGD optimizer also has an argument called
nesterov, which is set to False by default. Nesterov momentum is a variant of the momentum method that has stronger theoretical convergence guarantees for convex functions. In practice, it works slightly better than standard momentum.
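The difference between the two methods is only where the gradient is evaluated. A minimal one-dimensional sketch (the quadratic toy loss and function names here are illustrative, not Keras internals):

```python
def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    # Classical momentum: the velocity v accumulates past gradients.
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.1, mu=0.9):
    # Nesterov momentum: the gradient is taken at the "looked-ahead"
    # point w + mu*v rather than at the current position w.
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v

grad = lambda w: 2.0 * w  # gradient of the toy loss f(w) = w**2
w, v = 1.0, 0.0
for _ in range(3):
    w, v = nesterov_step(w, v, grad)
```

Because Nesterov momentum looks ahead before computing the gradient, it can correct the velocity slightly earlier when the trajectory starts to overshoot, which is the intuition behind its better practical behavior.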
In Keras, we can implement time-based decay by setting the initial learning rate, decay rate and momentum in the SGD optimizer.
learning_rate = 0.1
decay_rate = learning_rate / epochs
momentum = 0.8
sgd = SGD(lr=learning_rate, momentum=momentum, decay=decay_rate, nesterov=False)