Training Reproducibility #8
Open · reginehartwig opened this issue Sep 25, 2023 · 3 comments

@reginehartwig

I am currently analyzing the training process of your model.
I noticed that the results are only partially reproducible, as there seems to be some randomness in the training.

Do you know which parts of the code affect reproducibility? Could it be PyTorch3D, perhaps related to facebookresearch/pytorch3d#659?
It would be great if you could tell me more about it, the tests you might have run, and whether you plan to work on this.

@monniert (Owner)

Hi @reginehartwig! Which experiments are you having trouble reproducing? As stated in the readme, for complex real images like birds and horses, we observed that the model can still converge to a bad local minimum where the prototypical shape is wrong; in that case you should try another random seed and check the results after the first stage. It is difficult to reproduce this kind of experiment exactly, even when setting the random seed: it can depend on the issue you pointed out, but also on the hardware and on the versions of the libraries you installed.

@reginehartwig (Author)

Hi @monniert! Thanks for the fast reply!
I ran experiments with p3dcar, cub and shapenetnmr. The problem is that I still get different results across multiple runs, even when using the same random seed and the same settings. The plotted loss curves already reflect this in the first few epochs.
[Figure: loss_uniform, loss curves of same-seed runs over the first few epochs]

Later on, the results can become very different. This means I cannot run the code twice (with the same seed) and expect the same outcome.
[Figure: loss_uniform_500_epochs, loss curves over 500 epochs]
[Figure: chamfer-l1-icp, Chamfer-L1-ICP metric over training]
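
Concretely, what I mean by running with the same setting is just training twice from the same seed and comparing the losses; a minimal sketch of that check (seed_all and train_one_epoch are placeholder names, not the actual entry points of this repo):

```python
import random

import numpy as np
import torch


def seed_all(seed: int) -> None:
    # Placeholder seeding helper: fix the Python, NumPy and PyTorch RNGs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def runs_match(train_one_epoch, seed: int = 0) -> bool:
    # train_one_epoch is a placeholder for the actual training code and is
    # assumed to return the list of per-iteration losses for one epoch.
    seed_all(seed)
    losses_a = train_one_epoch()
    seed_all(seed)
    losses_b = train_one_epoch()
    # With a fully deterministic setup the two lists should be identical.
    return losses_a == losses_b
```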

@monniert (Owner) commented Oct 1, 2023

From what I remember, in my case the beginning of training runs was mostly identical when fixing the seed; I am not really sure about performance in the long run, though. Are you always running the experiments on the same machine? The randomness can come from many small things; you should investigate the common sources of randomness listed at https://pytorch.org/docs/stable/notes/randomness.html, and in particular set torch.backends.cudnn.benchmark = False (L27 in src.trainer.py).
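
For reference, here is a minimal sketch of those settings for a single-process PyTorch setup (the make_deterministic name is just illustrative, and torch.use_deterministic_algorithms(True) will raise an error for any op that has no deterministic implementation, so you may need to relax it):

```python
import os
import random

import numpy as np
import torch


def make_deterministic(seed: int = 0) -> None:
    # Seed the Python, NumPy and PyTorch (CPU + all CUDA devices) RNGs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Disable the cuDNN autotuner (it selects kernels non-deterministically)
    # and request deterministic cuDNN algorithms.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

    # Fail loudly on any op that has no deterministic implementation.
    torch.use_deterministic_algorithms(True)

    # Required by some cuBLAS ops on CUDA >= 10.2 when determinism is enforced.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```

DataLoader workers keep their own RNG state, so with num_workers > 0 you also need to pass a seeded generator (and possibly a worker_init_fn) to the DataLoader, as described on that page.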

It could also be related to the issue you mentioned. I do not plan to work on this, but I would be interested to hear about the root cause if you manage to make training completely deterministic.
