
Training step is too slow #25

guoyang9 opened this issue Oct 14, 2019 · 6 comments

@guoyang9
Hi,
Thank you for your code.
As I dug deeper into this code, I found the training step to be particularly slow. The problem here (I guess) is the dataset construction, where too many functions (e.g., padding sequences, getting the history) are implemented in __getitem__.
I wonder, have you tried wrapping these functions in __init__ instead? This would lead to higher memory consumption but should definitely accelerate training.
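Something like the following is what I have in mind (a rough, hypothetical sketch with made-up names, not the actual dataset class in this repo):

```python
import torch
from torch.utils.data import Dataset

class PreprocessedDialogDataset(Dataset):
    """Hypothetical example: do all padding/indexing once in __init__."""

    def __init__(self, dialogs, vocab, max_len=20):
        self.vocab = vocab
        # Pad and index everything up front, so __getitem__ is just a lookup.
        self.questions = [
            self._pad(self._to_indices(d["question"]), max_len) for d in dialogs
        ]

    def _to_indices(self, tokens):
        return [self.vocab.get(t, 0) for t in tokens]

    def _pad(self, indices, max_len):
        indices = indices[:max_len]
        return indices + [0] * (max_len - len(indices))

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        return {"ques": torch.tensor(self.questions[idx], dtype=torch.long)}
```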
Thanks.

@abhshkdz
Member

Hey @guoyang9, thanks for trying out the code! No, I don't think we've tried moving all the padding + repackaging to __init__, but that should definitely lead to a significant speed-up (cc @kdexd). Let us know (and please send a PR) if you happen to give that a shot.

@kdexd
Member

kdexd commented Oct 14, 2019

Hi @guoyang9, thanks for trying it out!

I agree that it should accelerate training; it is mostly a matter of design choice, since I prioritized memory consumption while developing the code.

The __getitem__ execution gets parallelized across multiple workers (one example per call) for reasonably small batch sizes, and the next batch is collected during the forward pass on the current batch (for num_workers > 0 in the dataloader).
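To make that overlap concrete, here is a tiny self-contained sketch (toy dataset and model, nothing from this repo):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in dataset: per-example work (think padding/indexing) in __getitem__."""
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        x = torch.randn(8)               # pretend this is the on-the-fly preprocessing
        y = torch.randint(0, 2, ())
        return x, y

if __name__ == "__main__":
    model = nn.Linear(8, 2)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # num_workers > 0: __getitem__ runs in background worker processes, so the
    # next batch is assembled while the current one goes through forward/backward.
    loader = DataLoader(ToyDataset(), batch_size=32, shuffle=True, num_workers=4)

    for x, y in loader:
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```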

On the other hand, padding all the sequences with zeros up front would grow the memory requirements roughly in proportion to the number of examples. This was also one of the motivations for doing the pre-processing on the fly instead of reading tokens from H5 files (others being flexible switching of vocabulary and such).
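As a rough back-of-the-envelope (all numbers below are made up purely for illustration):

```python
# Pre-padding to a fixed length keeps O(num_examples) token ids in memory.
num_examples = 1_000_000   # hypothetical number of training examples
max_len = 20               # hypothetical padded sequence length
bytes_per_id = 8           # int64 token indices
print(num_examples * max_len * bytes_per_id / 1e9, "GB")  # 0.16 GB per padded field
# ...and several such fields (history, options, ...) multiply this further.
```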

However, this is purely my intuition and I haven't tried moving things to __init__. I may get to this a bit later. In the meantime, if you try it out yourself and it improves the speed with a decent trade-off in memory, I would encourage you to make a PR; happy to accept your contribution! :)

@guoyang9
Author

guoyang9 commented Oct 15, 2019

Hi @abhshkdz @kdexd, thanks for your reply.
I will try to move the padding function from __getitem__ to __init__ and check the memory consumption.
I will get back to you later.
Thanks! :)

@guoyang9
Author

Interestingly, after I moved the to_indices() and _pad_sequence() functions into the __init__ of the reader (at some cost to readability), I only saw marginal speed improvements in my tests.
This is really a hard issue for me.
Have you figured out where the speed bottleneck is? @abhshkdz @kdexd
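For reference, this is roughly how I am splitting the timing between the dataloader and the model step (a generic sketch; the batch keys here are guesses, not the repo's exact field names):

```python
import time
import torch

def time_one_epoch(loader, model, criterion, optimizer, device):
    """Split wall-clock time into dataloader wait vs. forward/backward."""
    data_time, step_time = 0.0, 0.0
    end = time.time()
    for batch in loader:
        data_time += time.time() - end           # time spent waiting on workers

        start = time.time()
        inputs = batch["ques"].to(device)        # field name is a guess
        targets = batch["target"].to(device)     # field name is a guess
        loss = criterion(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if device.type == "cuda":
            torch.cuda.synchronize(device)       # count GPU time in the wall clock
        step_time += time.time() - start

        end = time.time()
    print(f"dataloader wait: {data_time:.1f}s, model step: {step_time:.1f}s")
```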

@shubhamagarwal92
Contributor

Two suggestions to speed up the code and to avoid a memory leak on another GPU:

  1. Call torch.cuda.set_device(device) before torch.cuda.empty_cache().
  2. Call torch.cuda.empty_cache() after every epoch instead of after every batch; this gave me a major speedup (see the sketch below).
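
Concretely, something along these lines (a sketch only; `device` is whatever GPU the model lives on):

```python
import torch

def release_cached_memory(device: torch.device) -> None:
    """Call once per epoch, not per batch; empty_cache() is relatively expensive."""
    if device.type == "cuda":
        # Select the intended GPU first, so the cache is cleared there and no
        # CUDA context gets created on GPU 0 as a side effect.
        torch.cuda.set_device(device)
        torch.cuda.empty_cache()

# In the training loop:
# for epoch in range(num_epochs):
#     for batch in dataloader:
#         ...                              # no empty_cache() inside the batch loop
#     release_cached_memory(device)
```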

Please let me know if you want me to raise a PR. Thanks.

@abhshkdz
Member

@shubhamagarwal92 Thanks for the suggestions! Both make sense to me. If you could send in a pull request, that'd be great, thanks!
