
RELIABLE LEARNING RATE SCHEDULE: K-DECAY

It is frequently advantageous to lower the learning rate as the training of deep neural networks advances. Pre-established learning rate schedules or adaptive learning rate techniques can be used. To compare the model results, I train a convolutional neural network using various learning rate schedules and adaptive learning rate techniques.

When training a neural network, the most bang for your buck (in terms of accuracy) will come from selecting the correct learning rate and appropriate learning rate schedule.

 

From here I will show you how to implement and use a number of learning rate schedules alongside K-decay (a short sketch of the linear and polynomial variants follows the list below):

 

  • The decay schedule built into most Keras optimizers
  • Step-based learning rate schedules
  • Linear learning rate decay
  • Polynomial learning rate schedules
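For reference, the linear and polynomial variants can be written as simple functions of the epoch number. The sketch below is illustrative only; initial_lr, total_epochs and power are assumed example values, not the settings used in the experiments in this post.

def linear_decay(epoch, initial_lr=0.1, total_epochs=50):
    """Linearly decay the learning rate from initial_lr down to 0 over total_epochs."""
    return initial_lr * (1.0 - epoch / float(total_epochs))

def polynomial_decay(epoch, initial_lr=0.1, total_epochs=50, power=2.0):
    """Polynomial decay; linear decay is the special case power = 1."""
    return initial_lr * (1.0 - epoch / float(total_epochs)) ** power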

 

K-decay learning rate schedules were proposed to reduce overfitting and improve the generalization capability of a model.

The paper that introduced K-decay proposed a new method for scheduling the learning rate and applied it to MNIST, one of the standard benchmark problems in computer vision.

The algorithm was evaluated by comparing it against other popular methods such as the Adam optimization algorithm, LPP (last place predictor), and RMSProp.
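As a rough sketch of the idea, the schedule can be expressed as a function of the epoch number. The exact formula below is an assumption based on the commonly cited form of k-decay, which reduces to ordinary polynomial decay when k = 1; consult the original paper for the authoritative definition.

def k_decay(epoch, lr0=0.1, lr_end=1e-4, total_epochs=50, N=2.0, k=2.0):
    # Assumed form: lr(t) = lr_end + (lr0 - lr_end) * (1 - (t/T)^k)^N.
    # Larger k keeps the learning rate high for longer before it drops off.
    progress = min(epoch / float(total_epochs), 1.0)
    return lr_end + (lr0 - lr_end) * (1.0 - progress ** k) ** N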

 

Why adjust our learning rate and use learning rate schedules?

 


Learning rate schedules allow you to adjust the learning rate over the course of training. By using a learning rate schedule, you can influence how long it takes for a model to reach its desired accuracy on the training data.

This can especially be important when you are dealing with machine learning algorithms that require large amounts of time for building models and even more time for evaluation.

Learning rate schedules can also help accelerate training by keeping the learning rate between 0 and 1 when experiments show that this is necessary.

The purpose of the learning rate schedule

The purpose of a learning rate schedule is to reduce the amount of training needed to reach accurate results. Evaluating a neural network is often computationally intensive, and some forms of learning require a large number of training examples for accurate results.

Therefore, it becomes important to determine how much training is sufficient at each stage along the way. Adjusting the learning rate as you train lets the model make better use of each pass through the training data.

To see why learning rate schedules are a worthwhile method to apply to help increase model accuracy and descend into areas of lower loss, consider the standard weight update formula used by nearly all neural networks:

W -= \alpha * gradient

Recall that the learning rate, \alpha, controls the size of the “step” we take along the gradient. Larger values of \alpha mean bigger steps, while smaller values of \alpha produce tiny steps. If \alpha is zero, the network cannot take any steps at all (since the gradient multiplied by zero is zero).
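As a quick numerical illustration of how \alpha scales the step (the gradient and weight values below are made up purely for illustration):

gradient = 4.0    # gradient of the loss with respect to a single weight
w = 1.0           # current weight value

for alpha in (1e-1, 1e-2, 1e-3, 0.0):
    step = alpha * gradient
    print("alpha=%g: step=%g, new w=%g" % (alpha, step, w - step))
# alpha=0.1 moves the weight by 0.4, alpha=0.001 by only 0.004,
# and alpha=0 leaves the weight unchanged.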


Most initial learning rates (but not all) you encounter are typically in the set \alpha \in \{10^{-1}, 10^{-2}, 10^{-3}\}.

A network is then trained for a fixed number of epochs without changing the learning rate.

This method may work well in some situations, but it’s often beneficial to decrease our learning rate over time. When training our network, we are trying to find some location along our loss landscape where the network obtains reasonable accuracy. It doesn’t have to be a global minimum or even a local minimum, but in practice, simply finding an area of the loss landscape with reasonably low loss is “good enough”.

If we constantly keep the learning rate high, we could overshoot these areas of low loss, as we’ll be taking steps that are too large to descend into those regions.

Instead, what we can do is decrease our learning rate, thereby allowing our network to take smaller steps — this decreased learning rate enables our network to descend into areas of the loss landscape that are “more optimal” and would otherwise have been missed entirely with our initial learning rate.

We can, therefore, view the process of learning rate scheduling as:

  1. Finding a set of reasonably “good” weights early in the training process with a larger learning rate.
  2. Tuning these weights later in the process to find more optimal weights using a smaller learning rate.

Constant Learning Rate

A constant learning rate is the default learning rate schedule in the SGD optimizer in Keras. Momentum and decay rate are both set to zero by default. It is tricky to choose the right learning rate. Experimenting with a range of learning rates in our example, lr=0.1 shows relatively good performance to start with. This can serve as a baseline for us to experiment with different learning rate strategies.

keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
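As a usage sketch, the optimizer is simply passed to model.compile(); model stands for any Keras model you have already defined, and the loss shown is just an example (neither is part of this post's code).

import keras

sgd = keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
# `model` is assumed to be an existing Keras model defined elsewhere;
# the loss below is an example choice for a multi-class problem such as MNIST.
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])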
Fig 1: Constant Learning Rate

Time-Based Decay

The mathematical form of time-based decay is lr = lr0 / (1 + k*t), where lr0 and k are hyperparameters and t is the iteration number. Looking into the source code of Keras, the SGD optimizer takes decay and lr arguments and updates the learning rate by a decreasing factor on each update.

lr *= (1. / (1. + self.decay * self.iterations))

Momentum is another argument of the SGD optimizer that we can tweak to obtain faster convergence. Unlike plain SGD, the momentum method helps the parameter vector build up velocity in directions where the gradient is consistent, which dampens oscillations. A typical choice of momentum is between 0.5 and 0.9.


The SGD optimizer also has an argument called nesterov, which is set to False by default. Nesterov momentum is a variant of the momentum method that has stronger theoretical convergence guarantees for convex functions. In practice, it often works slightly better than standard momentum.
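Written out in plain Python, the two momentum variants look roughly like this; grad_fn is a hypothetical function returning the gradient at a given point, and lr and mu are illustrative values.

def momentum_step(w, v, grad_fn, lr=0.1, mu=0.9, nesterov=False):
    # Classical momentum:  v = mu*v - lr*grad(w);          w = w + v
    # Nesterov momentum:   v = mu*v - lr*grad(w + mu*v);   w = w + v
    lookahead = w + mu * v if nesterov else w
    v = mu * v - lr * grad_fn(lookahead)
    return w + v, v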

In Keras, we can implement time-based decay by setting the initial learning rate, decay rate and momentum in the SGD optimizer.

from keras.optimizers import SGD

learning_rate = 0.1
decay_rate = learning_rate / epochs   # `epochs` is the total number of training epochs, defined elsewhere
momentum = 0.8
sgd = SGD(lr=learning_rate, momentum=momentum, decay=decay_rate, nesterov=False)
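To sanity-check what this configuration does over the course of training, the decay rule lr = lr0 / (1 + decay * iterations) can be evaluated directly; the iteration counts and the 50-epoch assumption below are only illustrative.

def time_based_lr(iteration, lr0=0.1, decay=0.1 / 50):   # decay = lr0 / epochs, assuming 50 epochs
    """Effective learning rate after `iteration` parameter updates."""
    return lr0 * (1.0 / (1.0 + decay * iteration))

for it in (0, 100, 500, 1000):
    print(it, time_based_lr(it))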
Fig 2: Time-based Decay Schedule

Step Decay

Step decay schedule drops the learning rate by a factor every few epochs. The mathematical form of step decay is :

lr = lr0 * drop^floor(epoch / epochs_drop) 

A typical approach is to drop the learning rate by half every 10 epochs. To implement this in Keras, we can define a step decay function and use the LearningRateScheduler callback, which takes the step decay function as an argument and returns the updated learning rate for use in the SGD optimizer.

import math
from keras.callbacks import LearningRateScheduler

def step_decay(epoch):
    initial_lrate = 0.1
    drop = 0.5
    epochs_drop = 10.0
    lrate = initial_lrate * math.pow(drop,
            math.floor((1 + epoch) / epochs_drop))
    return lrate

lrate = LearningRateScheduler(step_decay)

As a digression, a callback is a set of functions to be applied at given stages of the training procedure. We can use callbacks to get a view on internal states and statistics of the model during training. In our example, we create a custom callback by extending the base class keras.callbacks.Callback to record loss history and learning rate during the training procedure.

class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.losses = []
        self.lr = []

    def on_epoch_end(self, epoch, logs={}):
        self.losses.append(logs.get('loss'))
        self.lr.append(step_decay(len(self.losses)))

Putting everything together, we can pass a callback list consisting of LearningRateScheduler callback and our custom callback to fit the model. We can then visualize the learning rate schedule and the loss history by accessing loss_history.lr and loss_history.losses.

loss_history = LossHistory()
lrate = LearningRateScheduler(step_decay)
callbacks_list = [loss_history, lrate]

history = model.fit(X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=epochs,
    batch_size=batch_size,
    callbacks=callbacks_list,
    verbose=2)
Fig 3a: Step Decay Schedule

Fig 3b: Step Decay Schedule

Exponential Decay

Another common schedule is exponential decay. It has the mathematical form lr = lr0 * e^(-k*t), where lr0 and k are hyperparameters and t is the iteration number. Similarly, we can implement this by defining an exponential decay function and passing it to LearningRateScheduler. In fact, any custom decay schedule can be implemented in Keras using this approach; the only difference is defining a different custom decay function.

import math

def exp_decay(epoch):
    initial_lrate = 0.1
    k = 0.1
    lrate = initial_lrate * math.exp(-k * epoch)
    return lrate

lrate = LearningRateScheduler(exp_decay)
Fig 4a: Exponential Decay Schedule

Fig 4b: Exponential Decay Schedule

Let us now compare the model accuracy using different learning rate schedules in our example.

Fig 5: Comparing Performances of Different Learning Rate Schedules
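For reference, a comparison like this can be reproduced by training the same architecture once per schedule and collecting the histories. The sketch below assumes a build_model() helper returning a compiled model, the MNIST arrays X_train/y_train/X_test/y_test, and the epochs and batch_size values are all defined elsewhere; none of them are part of this post's code.

from keras.callbacks import LearningRateScheduler

schedules = {
    'constant': None,                                # rely on the optimizer's fixed lr
    'step': LearningRateScheduler(step_decay),
    'exponential': LearningRateScheduler(exp_decay),
}

histories = {}
for name, scheduler in schedules.items():
    model = build_model()                            # assumed helper returning a compiled model
    callbacks = [scheduler] if scheduler is not None else []
    histories[name] = model.fit(X_train, y_train,
                                validation_data=(X_test, y_test),
                                epochs=epochs, batch_size=batch_size,
                                callbacks=callbacks, verbose=0)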

 
