
Do the weight decay before using grad #33

Closed

Conversation

@vpj commented Dec 14, 2020

Uncoupled weight decay was applied after the gradient was used to calculate the momentum and variance. Fixed it. I found this while writing a tutorial implementation.

@juntang-zhuang (Owner)

What's the difference between doing weight decay before or after using the grad? I think they are equivalent, since updating before the grad in step 2 is equivalent to updating after the grad in step 1.

@juntang-zhuang (Owner)

Thanks for the implementation. Quickly skimming through your code, the eps is used differently from our implementation. Not sure how much difference it would cause.

@vpj (Author) commented Dec 16, 2020

@juntang-zhuang grad_residual is computed before the uncoupled weight decay changes the gradient.

Uncoupled weight decay has no effect (it won't work) if it's done after calculating grad_residual.

@vpj (Author) commented Dec 16, 2020

The tutorial has two options for how to use eps. One uses the optimization of computing the step size from scalar terms first, before multiplying by the momentum and dividing by the variance tensors.
https://lab-ml.com/labml_nn/optimizers/radam.html#section-22

The other first calculates the denominator with the epsilon, which I think is equivalent to yours.
https://lab-ml.com/labml_nn/optimizers/radam.html#section-25
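
Roughly, the two options look like this (a simplified sketch, not the tutorial's exact code; exp_avg and exp_avg_sq are the usual first/second moment buffers, and the bias corrections are stand-in scalars):

```python
import math
import torch

torch.manual_seed(0)
exp_avg = torch.randn(4)                  # first-moment (momentum) buffer
exp_avg_sq = torch.rand(4)                # second-moment (variance) buffer
lr, eps = 1e-3, 1e-8
bias_correction1, bias_correction2 = 0.9, 0.99   # stand-ins for 1 - beta**t

# Option 1 (section 22): fold the bias corrections into a scalar step size,
# then add eps directly to sqrt(v).
step_size = lr * math.sqrt(bias_correction2) / bias_correction1
update1 = step_size * exp_avg / (exp_avg_sq.sqrt() + eps)

# Option 2 (section 25): build the bias-corrected denominator first, then add eps,
# i.e. denom = sqrt(v_hat) + eps.
denom = exp_avg_sq.sqrt() / math.sqrt(bias_correction2) + eps
update2 = (lr / bias_correction1) * exp_avg / denom

# The two differ only in where eps sits relative to the sqrt(bias_correction2) factor.
```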

@juntang-zhuang (Owner)

@vpj Thanks for answering. Regarding the eps: it is actually used twice in our algorithm in each step, appearing both inside and outside the sqrt of s_t. Please see the updated readme or the paper on arXiv, where a comparison between the Adam and AdaBelief algorithms is shown with the differences highlighted in blue (there are two differences). Though I'm not sure how much difference the extra eps will cause. It seems your code only uses eps once.
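
A minimal sketch of the "eps used twice" denominator (consistent with the snippet quoted later in this thread; the buffer name follows the repo's convention, the values are stand-ins):

```python
import math
import torch

exp_avg_var = torch.rand(4)     # s_t: EMA of (g_t - m_t)**2
eps = 1e-16
bias_correction2 = 0.99         # stand-in for 1 - beta2**t

# First eps: inside the sqrt, added to s_t.
# Second eps: outside the sqrt, added to the bias-corrected denominator.
denom = (exp_avg_var + eps).sqrt() / math.sqrt(bias_correction2) + eps
```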

Regarding the decoupled weight decay, I still don't understand why you said it does not take effect after calculating grad_residual. Decoupled weight decay basically multiplies the weight by a constant factor smaller than 1; it's not related to the gradient. I don't think there will be such a big difference if you consider the optimization process as "update - rescale - update - rescale ..." vs. "rescale - update - rescale - update", where by "update" I mean updating only with the gradient, not the weight.
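
For reference, a minimal sketch of decoupled (AdamW-style) weight decay as described here: the weight is rescaled directly and the gradient is never touched, so moving the rescale before or after the gradient step only shifts it by one iteration (names are illustrative):

```python
import torch

param = torch.randn(4)
lr, weight_decay = 1e-3, 1e-2

# Decoupled weight decay: shrink the weight by a constant factor; grad,
# exp_avg and exp_avg_var never see this term, so "rescale - update" vs.
# "update - rescale" differ only in which step the rescale lands in.
param.mul_(1.0 - lr * weight_decay)

# ... gradient-based update (momentum / variance) would go here ...
```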

@vpj (Author) commented Dec 16, 2020

@juntang-zhuang Thanks. Sorry, I hadn't noticed the two uses of epsilon; I will change my code.

About the weight decay: again my bad, I had been referring to coupled weight decay, grad.add_(p.data, alpha=group['weight_decay']), where you change the gradient. But grad_residual and exp_avg are calculated before that.
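
A minimal sketch of the ordering point for the coupled form (simplified from the step logic; not the repo's exact code):

```python
import torch

param = torch.randn(4)
grad = torch.randn(4)                          # stand-in for p.grad.data
exp_avg = torch.zeros(4)
exp_avg_var = torch.zeros(4)
beta1, beta2, weight_decay = 0.9, 0.999, 1e-2

# Coupled weight decay folds the decay term into the gradient, so it has to
# run BEFORE the moment buffers are updated from grad ...
grad = grad.add(param, alpha=weight_decay)

# ... otherwise exp_avg and grad_residual below never see the decay term.
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
grad_residual = grad - exp_avg
exp_avg_var.mul_(beta2).addcmul_(grad_residual, grad_residual, value=1 - beta2)
```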

@juntang-zhuang (Owner)

@vpj Thanks for the clarification. I see, that's an error with the coupled weight decay in the new version of the code, thanks for pointing it out. Will correct it in the next release.

@vpj (Author) commented Dec 16, 2020

Awesome. While fixing my code I noticed there might be an issue here:

denom = (exp_avg_var.add_(group['eps']).sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])

The epsilon is added in place and therefore written back into exp_avg_var, which is not the expected behavior.
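
A minimal sketch of the non-in-place alternative being suggested (an out-of-place add so eps is not written back into the exp_avg_var buffer; this is a sketch, not the repo's actual fix):

```python
import math
import torch

# Stand-ins for the optimizer state inside step()
exp_avg_var = torch.rand(4)
group = {'eps': 1e-16}
bias_correction2 = 0.99

# Out-of-place .add() returns a new tensor, so eps is NOT accumulated
# into exp_avg_var (unlike the in-place .add_() quoted above).
denom = (exp_avg_var.add(group['eps']).sqrt()
         / math.sqrt(bias_correction2)).add_(group['eps'])
```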

@vpj (Author) commented Dec 16, 2020

@juntang-zhuang (Owner)

@vpj Thanks a lot!

@vpj (Author) commented Dec 16, 2020

I saw this in the previous version also. Since this gets accumulated, wouldn't it cause a significant numerical difference?

@juntang-zhuang (Owner)

It does cause a numerical difference, but the in-place version is what was tested in our experiments, and the non-in-place version is not. That's why we prefer to keep the current version unless many experiments show the non-in-place version helps.
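
For intuition on the accumulation, a small illustration (my own, not from the repo): the in-place eps is carried into the next step's EMA, so the buffer ends up roughly eps / (1 - beta2) larger than a clean buffer at steady state.

```python
import torch

torch.manual_seed(0)
beta2, eps, steps = 0.999, 1e-8, 20000

v_inplace = torch.zeros(1)
v_clean = torch.zeros(1)
for _ in range(steps):
    r2 = torch.randn(1) ** 2          # stand-in for grad_residual**2
    for v in (v_inplace, v_clean):
        v.mul_(beta2).add_(r2, alpha=1 - beta2)
    v_inplace.add_(eps)               # in-place eps, as in the quoted denom line

# The gap approaches eps / (1 - beta2) = 1e-5 with these values.
print((v_inplace - v_clean).item())
```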

@juntang-zhuang (Owner)

@vpj Fixed the coupled weight decay in adabelief-pytorch==0.2.0; it can now be installed via pip. Thanks a lot.
