
Training step is too slow #25

guoyang9 opened this issue Oct 14, 2019 · 6 comments

@guoyang9
Hi,
Thank you for your code.
As I dug deeper into this code, I found the training step to be particularly slow. The problem here (I guess) is the dataset construction, where too many functions (e.g., padding sequences, getting the history) are implemented in __getitem__.
I wonder, have you tried wrapping these functions in __init__ instead? This would lead to higher memory consumption but should definitely accelerate training.
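Something like the following is what I have in mind (a rough, hypothetical sketch with made-up names, not the actual dataset class in this repo):

```python
import torch
from torch.utils.data import Dataset

class PreprocessedDialogDataset(Dataset):
    """Hypothetical example: do all padding/indexing once in __init__."""

    def __init__(self, dialogs, vocab, max_len=20):
        self.vocab = vocab
        # Pad and index everything up front, so __getitem__ is just a lookup.
        self.questions = [
            self._pad(self._to_indices(d["question"]), max_len) for d in dialogs
        ]

    def _to_indices(self, tokens):
        return [self.vocab.get(t, 0) for t in tokens]

    def _pad(self, indices, max_len):
        indices = indices[:max_len]
        return indices + [0] * (max_len - len(indices))

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        return {"ques": torch.tensor(self.questions[idx], dtype=torch.long)}
```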
Thanks.

@abhshkdz
Member

Hey @guoyang9, thanks for trying out the code! No, I don't think we've tried moving all the padding + repackaging to __init__, but that should definitely lead to a significant speed-up (cc @kdexd). Let us know (and please send a PR) if you happen to give that a shot.

@kdexd
Member

kdexd commented Oct 14, 2019

Hi @guoyang9, thanks for trying it out!

I agree that it should accelerate training; it is mostly a matter of design choice, since I prioritized memory consumption while developing the code.

The __getitem__ execution gets parallelized across multiple workers (one example per call) for reasonably small batch sizes, and the next batch is collected during the forward pass on the current batch (for num_workers > 0 in the dataloader).
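To make that overlap concrete, here is a tiny self-contained sketch (toy dataset and model, nothing from this repo):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in dataset: per-example work (think padding/indexing) in __getitem__."""
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        x = torch.randn(8)               # pretend this is the on-the-fly preprocessing
        y = torch.randint(0, 2, ())
        return x, y

if __name__ == "__main__":
    model = nn.Linear(8, 2)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # num_workers > 0: __getitem__ runs in background worker processes, so the
    # next batch is assembled while the current one goes through forward/backward.
    loader = DataLoader(ToyDataset(), batch_size=32, shuffle=True, num_workers=4)

    for x, y in loader:
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```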

On the other hand, padding all the sequences with zeros up front would grow the memory requirements roughly in proportion to the number of examples. This was also one of the motivations for doing the pre-processing on the fly instead of reading tokens from H5 files (others being flexible switching of vocabulary and such).
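As a rough back-of-the-envelope (all numbers below are made up purely for illustration):

```python
# Pre-padding to a fixed length keeps O(num_examples) token ids in memory.
num_examples = 1_000_000   # hypothetical number of training examples
max_len = 20               # hypothetical padded sequence length
bytes_per_id = 8           # int64 token indices
print(num_examples * max_len * bytes_per_id / 1e9, "GB")  # 0.16 GB per padded field
# ...and several such fields (history, options, ...) multiply this further.
```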

However, this is purely my intuition and I haven't tried moving things to __init__. I may get to this a bit later. In the meantime, if you try it out yourself and it improves the speed with a decent trade-off in memory, I would encourage you to make a PR; happy to accept your contribution! :)

@guoyang9
Author

guoyang9 commented Oct 15, 2019

Hi @abhshkdz @kdexd, thanks for your reply.
I will try to move the padding function from __getitem__ to __init__ and check the memory consumption.
I will get back to you later.
Thanks! :)

@guoyang9
Author

Interestingly, after I moved the to_indices() and _pad_sequence() functions into the __init__ of the reader (at some cost to readability), I only saw marginal speed improvements in my tests.
This is really a hard issue for me.
Have you figured out where the speed bottleneck is? @abhshkdz @kdexd
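For reference, this is roughly how I am splitting the timing between the dataloader and the model step (a generic sketch; the batch keys here are guesses, not the repo's exact field names):

```python
import time
import torch

def time_one_epoch(loader, model, criterion, optimizer, device):
    """Split wall-clock time into dataloader wait vs. forward/backward."""
    data_time, step_time = 0.0, 0.0
    end = time.time()
    for batch in loader:
        data_time += time.time() - end           # time spent waiting on workers

        start = time.time()
        inputs = batch["ques"].to(device)        # field name is a guess
        targets = batch["target"].to(device)     # field name is a guess
        loss = criterion(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if device.type == "cuda":
            torch.cuda.synchronize(device)       # count GPU time in the wall clock
        step_time += time.time() - start

        end = time.time()
    print(f"dataloader wait: {data_time:.1f}s, model step: {step_time:.1f}s")
```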

@shubhamagarwal92
Contributor

Two suggestions to speed up the code and to avoid a memory leak on another GPU:

  1. Call torch.cuda.set_device(device) before torch.cuda.empty_cache().
  2. Call torch.cuda.empty_cache() after every epoch instead of after every batch; this gave me a major speedup (see the sketch below).
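
Concretely, something along these lines (a sketch only; `device` is whatever GPU the model lives on):

```python
import torch

def release_cached_memory(device: torch.device) -> None:
    """Call once per epoch, not per batch; empty_cache() is relatively expensive."""
    if device.type == "cuda":
        # Select the intended GPU first, so the cache is cleared there and no
        # CUDA context gets created on GPU 0 as a side effect.
        torch.cuda.set_device(device)
        torch.cuda.empty_cache()

# In the training loop:
# for epoch in range(num_epochs):
#     for batch in dataloader:
#         ...                              # no empty_cache() inside the batch loop
#     release_cached_memory(device)
```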

Please let me know if you want me to raise a PR. Thanks.

@abhshkdz
Member

@shubhamagarwal92 Thanks for the suggestions! Both make sense to me. If you could send in a pull request, that'd be great, thanks!
