Do the weight decay before using grad #33
Decoupled weight decay was applied after the gradient had already been used to calculate the momentum and variance. Fixed it. Found this while writing a tutorial implementation.
What's the difference between doing the weight decay before or after the gradient step? I think they are equivalent: a decay applied before the gradient update in step 2 is equivalent to a decay applied after the gradient update in step 1.
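To make this equivalence concrete, here is a toy sketch (plain gradient descent with decoupled decay; `update` and `decay` are made-up stand-ins, not the optimizer's actual code). The two orderings generate the same sequence of primitive operations, just shifted by one decay at each end:

```python
lr, wd, T = 0.1, 0.01, 100
update = lambda w: w - lr * w           # gradient step on f(w) = w**2 / 2
decay  = lambda w: w * (1 - lr * wd)    # decoupled decay: never touches the gradient

w_a = decay(1.0)                        # variant A: decay AFTER each update,
for _ in range(T):                      # seeded with one extra decay
    w_a = decay(update(w_a))

w_b = 1.0                               # variant B: decay BEFORE each update
for _ in range(T):
    w_b = update(decay(w_b))

assert w_a == decay(w_b)                # identical op sequence, so exactly equal
```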
Thanks for the implementation. I quickly skimmed through your code; the eps is used differently from our implementation. Not sure how much difference it would cause.
@juntang-zhuang
Adabelief-Optimizer/pypi_packages/adabelief_pytorch0.1.0/adabelief_pytorch/AdaBelief.py, line 164 in bdb1c31
Uncoupled weight decay has no effect (won't work) if it's done after calculating the `grad_residual`.
The tutorial has two options for how to use eps. In the other, we first calculate the denominator with epsilon, which I think is equivalent to yours.
@vpj Thanks for answering. Regarding the eps: eps is actually used twice in our algorithm in each step; it appears both within and outside the sqrt of s_t. Please see the updated readme or the paper on arXiv, where a comparison of the Adam and AdaBelief algorithms is shown with the differences highlighted in blue (there are two differences). Though I'm not sure how much difference the extra eps will cause. It seems your code only uses eps once. Regarding the decoupled weight decay, I still don't understand why you said it does not take effect after calculating grad_residual. Decoupled weight decay basically multiplies the weight by a constant factor smaller than 1; it's not related to the gradient. I don't think there will be such a big difference if you consider the optimization process as "update - rescale - update - rescale ..." vs. "rescale - update - rescale - update ...", where by "update" I mean an update using only the gradient, not the weight.
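For reference, a minimal sketch of the two eps uses being described (assuming the algorithm as written in the paper; simplified with no bias correction or weight decay, and with illustrative names rather than the package's exact code):

```python
import torch

def adabelief_step(param, grad, exp_avg, exp_avg_var,
                   beta1=0.9, beta2=0.999, eps=1e-8, lr=1e-3):
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)         # m_t
    residual = grad - exp_avg                               # g_t - m_t
    # First eps: inside the variance recursion, i.e. inside the sqrt of s_t.
    exp_avg_var.mul_(beta2).addcmul_(residual, residual, value=1 - beta2).add_(eps)
    # Second eps: outside the sqrt, added to the denominator.
    denom = exp_avg_var.sqrt().add_(eps)
    param.addcdiv_(exp_avg, denom, value=-lr)
```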
@juntang-zhuang Thanks. Sorry, I hadn't noticed the two uses of epsilon; I will change my code. About the weight decay: again my bad, I had been referring to coupled weight decay.
@vpj Thanks for the clarification. I see, that's an error with the coupled weight decay in the new version of the code; thanks for pointing it out. Will correct it in the next release.
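A sketch of the distinction, under the same assumed names as above (not the package's code): coupled decay folds into the gradient itself, so it has to run before the moment estimates consume the gradient, while decoupled decay only rescales the weight.

```python
import torch

def apply_weight_decay(param, grad, lr=1e-3, wd=1e-2, coupled=True):
    if coupled:
        # Coupled (Adam-style) decay changes the gradient, so it must be
        # applied BEFORE grad feeds the momentum/variance estimates.
        grad = grad.add(param, alpha=wd)
    else:
        # Decoupled (AdamW-style) decay never touches the gradient; placing
        # it before vs. after the moment updates only shifts the
        # rescale/update schedule by one step.
        param.mul_(1 - lr * wd)
    return param, grad
```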
Awesome. While fixing my code I noticed there might be an issue here:
Adabelief-Optimizer/pypi_packages/adabelief_pytorch0.1.0/adabelief_pytorch/AdaBelief.py, line 175 in bdb1c31
The epsilon is added and assigned to `exp_avg_var` in place.
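In other words (a small sketch of the two variants, with illustrative names):

```python
import torch

exp_avg_var, eps = torch.ones(3), 1e-8

# In-place: add_ writes eps back into the optimizer state, so the extra eps
# is carried into the next step's EMA and accumulates over time.
denom_inplace = exp_avg_var.add_(eps).sqrt()

# Out-of-place: the state is left untouched; eps affects this step only.
denom_pure = (exp_avg_var + eps).sqrt()
```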
Fixed my code: https://lab-ml.com/labml_nn/optimizers/ada_belief.html#section-24. Thanks!
@vpj Thanks a lot!
I saw this in the previous version also. Since this gets accumulated, wouldn't this be a significant numerical difference?
It does cause a numerical difference, and the in-place version is tested in experiments while the non-in-place version is not. That's why we prefer to keep the current version unless many experiments show that non-in-place helps.
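For a rough sense of scale (a back-of-envelope sketch, assuming the in-place order shown earlier): the accumulated eps component of the state follows `e <- beta2 * e + eps` per step, which converges to `eps / (1 - beta2)`.

```python
beta2, eps = 0.999, 1e-8
e = 0.0
for _ in range(20_000):        # iterate the recursion e <- beta2 * e + eps
    e = beta2 * e + eps
print(e)                       # ~= eps / (1 - beta2) = 1e-5 at steady state
```

So with the default beta2 = 0.999, the state carries roughly a thousand times eps inside the sqrt, which is why the two variants diverge numerically.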
@vpj Fix the coupled weight decay in |