Make exponential learning-rate decay per step. #145
Conversation
@@ -57,13 +57,13 @@ impl ExponentialDecay {
 ///
 /// If `staircase` is true, the exponent of the decay is
 /// computed using integer division. This has the effect that
-/// the learning rate only changes every `decay_epochs` epochs.
+/// the learning rate only changes every `decay_steps` steps.
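For readers skimming the diff, a minimal sketch of the behaviour this doc comment describes (assumed names and signature, not the actual sticker API):

fn exponential_decay(
    initial_lr: f32,
    decay_rate: f32,
    decay_steps: usize,
    staircase: bool,
    global_step: usize,
) -> f32 {
    // With staircase, the exponent only increases every `decay_steps` steps
    // (integer division); without it, the decay is applied continuously.
    let exponent = if staircase {
        (global_step / decay_steps) as f32
    } else {
        global_step as f32 / decay_steps as f32
    };
    initial_lr * decay_rate.powf(exponent)
}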
Not part of this change, but would generally be informative: Note that decay_steps
not only influences the width of the steps but also directly affects the decay rate.
Agreed. I'd also prefer a more descriptive name than decay_steps, which mostly makes sense in the staircase case. FWIW, the formula is included in the doc of the struct:
lr = initial_lr * decay_rate ^ (global_step / decay_steps)
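To illustrate the point about decay_steps also affecting the decay rate, with made-up numbers:

// Made-up numbers: at global_step = 1000, halving decay_steps decays twice as fast.
let (initial_lr, decay_rate): (f32, f32) = (0.01, 0.9);
let lr_wide = initial_lr * decay_rate.powf(1000.0 / 1000.0);  // decay_steps = 1000 -> 0.009
let lr_narrow = initial_lr * decay_rate.powf(1000.0 / 500.0); // decay_steps = 500  -> 0.0081
assert!(lr_narrow < lr_wide);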
Anyway, the names we have now mirror those of https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/exponential_decay.
OTOH, PyTorch's exponential decay completely avoids decay_steps
by not offering staircasing:
class ExponentialLR(_LRScheduler):
    """Set the learning rate of each parameter group to the initial lr decayed
    by gamma every epoch. When last_epoch=-1, sets initial lr as lr.

    Args:
        optimizer (Optimizer): Wrapped optimizer.
        gamma (float): Multiplicative factor of learning rate decay.
        last_epoch (int): The index of last epoch. Default: -1.
    """

    def __init__(self, optimizer, gamma, last_epoch=-1):
        self.gamma = gamma
        super(ExponentialLR, self).__init__(optimizer, last_epoch)

    def get_lr(self):
        return [base_lr * self.gamma ** self.last_epoch
                for base_lr in self.base_lrs]
I am now wondering whether we should just opt for the simple version.
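For comparison, a hypothetical "simple version" in the PyTorch style would drop decay_steps and staircase entirely (a sketch, not a proposed API):

fn simple_exponential_decay(initial_lr: f32, gamma: f32, step: usize) -> f32 {
    // Hypothetical simple variant: multiply by gamma once per step.
    initial_lr * gamma.powi(step as i32)
}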
If simple means no staircasing, then no, please leave staircasing in. I had some models in the past where staircasing improved performance a bit.
With per-epoch or per-batch decay?
I guess it doesn't hurt to have it.
That was per epoch. But now it can be per batch. In the end it shouldn't matter too much: one has to adjust the hyperparameters anyway to get the same schedule as before (old steps * number of batches in the training data).
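A quick worked example of that conversion, with made-up numbers:

// Made-up numbers: a schedule that previously decayed every 10 epochs,
// with 500 batches per epoch, becomes a per-step schedule over 5000 steps.
let old_decay_epochs: usize = 10;
let batches_per_epoch: usize = 500;
let new_decay_steps = old_decay_epochs * batches_per_epoch;
assert_eq!(new_decay_steps, 5000);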
Maybe I am missing something, but right now exponential decay is only available when using sticker as a library?
Yes. Plateau replaced it as the default, but I never added an option to switch to exponential decay.
269114a to d8b5ad4: Make the decay in the ExponentialRate schedule based on the global step instead of training epoch.
d8b5ad4 to e7466e9: Make the decay in the ExponentialRate schedule based on the global step instead of the training epoch.