Do the weight decay before using grad #33
Decoupled weight decay was applied after the gradient had already been used to calculate the momentum and variance. Fixed it. Found this while writing a tutorial implementation.
What's the difference between doing the weight decay before or after the gradient step? I think they are equivalent: a decay applied before the gradient update in step 2 is equivalent to a decay applied after the gradient update in step 1.
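To make this equivalence concrete, here is a toy sketch (plain gradient descent with decoupled decay; `update` and `decay` are made-up stand-ins, not the optimizer's actual code). The two orderings generate the same sequence of primitive operations, just shifted by one decay at each end:

```python
lr, wd, T = 0.1, 0.01, 100
update = lambda w: w - lr * w           # gradient step on f(w) = w**2 / 2
decay  = lambda w: w * (1 - lr * wd)    # decoupled decay: never touches the gradient

w_a = decay(1.0)                        # variant A: decay AFTER each update,
for _ in range(T):                      # seeded with one extra decay
    w_a = decay(update(w_a))

w_b = 1.0                               # variant B: decay BEFORE each update
for _ in range(T):
    w_b = update(decay(w_b))

assert w_a == decay(w_b)                # identical op sequence, so exactly equal
```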
Thanks for the implementation. I quickly skimmed through your code; the eps is used differently from our implementation. Not sure how much difference it would cause.
@juntang-zhuang
Adabelief-Optimizer/pypi_packages/adabelief_pytorch0.1.0/adabelief_pytorch/AdaBelief.py, line 164 in bdb1c31
Uncoupled weight decay has no effect (won't work) if it's done after calculating the `grad_residual`.
The tutorial has two options for how to use eps. In the other, we first calculate the denominator with epsilon, which I think is equivalent to yours.
@vpj Thanks for answering. Regarding the eps: eps is actually used twice in our algorithm in each step; it appears both within and outside the sqrt of s_t. Please see the updated readme or the paper on arXiv, where a comparison of the Adam and AdaBelief algorithms is shown with the differences highlighted in blue (there are two differences). Though I'm not sure how much difference the extra eps will cause. It seems your code only uses eps once. Regarding the decoupled weight decay, I still don't understand why you said it does not take effect after calculating grad_residual. Decoupled weight decay basically multiplies the weight by a constant factor smaller than 1; it's not related to the gradient. I don't think there will be such a big difference if you consider the optimization process as "update - rescale - update - rescale ..." vs. "rescale - update - rescale - update ...", where by "update" I mean an update using only the gradient, not the weight.
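For reference, a minimal sketch of the two eps uses being described (assuming the algorithm as written in the paper; simplified with no bias correction or weight decay, and with illustrative names rather than the package's exact code):

```python
import torch

def adabelief_step(param, grad, exp_avg, exp_avg_var,
                   beta1=0.9, beta2=0.999, eps=1e-8, lr=1e-3):
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)         # m_t
    residual = grad - exp_avg                               # g_t - m_t
    # First eps: inside the variance recursion, i.e. inside the sqrt of s_t.
    exp_avg_var.mul_(beta2).addcmul_(residual, residual, value=1 - beta2).add_(eps)
    # Second eps: outside the sqrt, added to the denominator.
    denom = exp_avg_var.sqrt().add_(eps)
    param.addcdiv_(exp_avg, denom, value=-lr)
```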
@juntang-zhuang Thanks. Sorry, I hadn't noticed the two uses of epsilon; I will change my code. About the weight decay: again my bad, I had been referring to coupled weight decay.
@vpj Thanks for the clarification. I see, that's an error with the coupled weight decay in the new version of the code; thanks for pointing it out. Will correct it in the next release.
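A sketch of the distinction, under the same assumed names as above (not the package's code): coupled decay folds into the gradient itself, so it has to run before the moment estimates consume the gradient, while decoupled decay only rescales the weight.

```python
import torch

def apply_weight_decay(param, grad, lr=1e-3, wd=1e-2, coupled=True):
    if coupled:
        # Coupled (Adam-style) decay changes the gradient, so it must be
        # applied BEFORE grad feeds the momentum/variance estimates.
        grad = grad.add(param, alpha=wd)
    else:
        # Decoupled (AdamW-style) decay never touches the gradient; placing
        # it before vs. after the moment updates only shifts the
        # rescale/update schedule by one step.
        param.mul_(1 - lr * wd)
    return param, grad
```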
Awesome. While fixing my code I noticed there might be an issue here:
Adabelief-Optimizer/pypi_packages/adabelief_pytorch0.1.0/adabelief_pytorch/AdaBelief.py, line 175 in bdb1c31
The epsilon is added and assigned to `exp_avg_var` in place.
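In other words (a small sketch of the two variants, with illustrative names):

```python
import torch

exp_avg_var, eps = torch.ones(3), 1e-8

# In-place: add_ writes eps back into the optimizer state, so the extra eps
# is carried into the next step's EMA and accumulates over time.
denom_inplace = exp_avg_var.add_(eps).sqrt()

# Out-of-place: the state is left untouched; eps affects this step only.
denom_pure = (exp_avg_var + eps).sqrt()
```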
Fixed my code: https://lab-ml.com/labml_nn/optimizers/ada_belief.html#section-24. Thanks!
@vpj Thanks a lot!
I saw this in the previous version also. Since this gets accumulated, wouldn't this be a significant numerical difference?
It does cause a numerical difference, and the in-place version is tested in experiments while the non-in-place version is not. That's why we prefer to keep the current version unless many experiments show that non-in-place helps.
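For a rough sense of scale (a back-of-envelope sketch, assuming the in-place order shown earlier): the accumulated eps component of the state follows `e <- beta2 * e + eps` per step, which converges to `eps / (1 - beta2)`.

```python
beta2, eps = 0.999, 1e-8
e = 0.0
for _ in range(20_000):        # iterate the recursion e <- beta2 * e + eps
    e = beta2 * e + eps
print(e)                       # ~= eps / (1 - beta2) = 1e-5 at steady state
```

So with the default beta2 = 0.999, the state carries roughly a thousand times eps inside the sqrt, which is why the two variants diverge numerically.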
@vpj Fix the coupled weight decay in |