Training Reproducibility #8
Open · reginehartwig opened this issue Sep 25, 2023 · 3 comments

@reginehartwig

I am currently analyzing the training process of your model.
I noticed that the results are only partially reproducible, as there seems to be some randomness in the training.

Do you know which parts of the code affect reproducibility? Could it be PyTorch3D, perhaps related to facebookresearch/pytorch3d#659?
It would be great if you could tell me more about it, the tests you might have run, and whether you plan to work on this.

@monniert (Owner)

Hi @reginehartwig! Which experiments are you having trouble reproducing? As stated in the readme, for complex real images like birds and horses, we observed that the model can still converge to a bad local minimum where the prototypical shape is wrong; in that case you should try another random seed and check the results after the first stage. It is difficult to reproduce this kind of experiment exactly, even when setting the random seed: it can depend on the issue you pointed out, but also on the hardware and on the versions of the libraries you installed.

@reginehartwig (Author)

Hi @monniert! Thanks for the fast reply!
I ran experiments with p3dcar, cub and shapenetnmr. The problem is that I still get different results across multiple runs, even when using the same random seed and the same settings. The plotted loss curves already reflect this in the first few epochs.
[Figure: loss_uniform, loss curves of same-seed runs over the first few epochs]

Later on, the results can become very different. This means I cannot run the code twice (with the same seed) and expect the same outcome.
[Figure: loss_uniform_500_epochs, loss curves over 500 epochs]
[Figure: chamfer-l1-icp, Chamfer-L1-ICP metric over training]
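
Concretely, what I mean by running with the same setting is just training twice from the same seed and comparing the losses; a minimal sketch of that check (seed_all and train_one_epoch are placeholder names, not the actual entry points of this repo):

```python
import random

import numpy as np
import torch


def seed_all(seed: int) -> None:
    # Placeholder seeding helper: fix the Python, NumPy and PyTorch RNGs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def runs_match(train_one_epoch, seed: int = 0) -> bool:
    # train_one_epoch is a placeholder for the actual training code and is
    # assumed to return the list of per-iteration losses for one epoch.
    seed_all(seed)
    losses_a = train_one_epoch()
    seed_all(seed)
    losses_b = train_one_epoch()
    # With a fully deterministic setup the two lists should be identical.
    return losses_a == losses_b
```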

@monniert (Owner) commented Oct 1, 2023

From what I remember, in my case the beginning of training runs was mostly identical when fixing the seed; I am not really sure about performance in the long run, though. Are you always running the experiments on the same machine? The randomness can come from many small things; you should investigate the common sources of randomness listed at https://pytorch.org/docs/stable/notes/randomness.html, and in particular set torch.backends.cudnn.benchmark = False (L27 in src.trainer.py).
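
For reference, here is a minimal sketch of those settings for a single-process PyTorch setup (the make_deterministic name is just illustrative, and torch.use_deterministic_algorithms(True) will raise an error for any op that has no deterministic implementation, so you may need to relax it):

```python
import os
import random

import numpy as np
import torch


def make_deterministic(seed: int = 0) -> None:
    # Seed the Python, NumPy and PyTorch (CPU + all CUDA devices) RNGs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Disable the cuDNN autotuner (it selects kernels non-deterministically)
    # and request deterministic cuDNN algorithms.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

    # Fail loudly on any op that has no deterministic implementation.
    torch.use_deterministic_algorithms(True)

    # Required by some cuBLAS ops on CUDA >= 10.2 when determinism is enforced.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```

DataLoader workers keep their own RNG state, so with num_workers > 0 you also need to pass a seeded generator (and possibly a worker_init_fn) to the DataLoader, as described on that page.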

It could also be related to the issue you mentioned. I do not plan to work on this, but I would be interested to hear about the root cause if you manage to make training completely deterministic.
