
Fix GPU memory leak in TextPairRegressor #3490

Conversation

MattGPT-ai
Contributor

#3487

This addresses the above memory leak by saving a reference on the TextPair to the concatenated Sentence when it is created. The concatenated sentence's embeddings are then explicitly cleared when clear_embeddings is called.

This also adds type hints to files relevant to this change.
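The fix can be sketched with minimal stand-in classes (a sketch only: `Sentence` here is a stub for `flair.data.Sentence`, and the attribute name `concatenated_sentence` is illustrative, not necessarily the one used in the PR):

```python
from typing import Dict, List, Optional


class Sentence:
    """Minimal stand-in for flair.data.Sentence: embeddings keyed by name."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.embeddings: Dict[str, List[float]] = {}

    def clear_embeddings(self) -> None:
        self.embeddings.clear()


class TextPair:
    """Sketch of the fix: the pair keeps a reference to the concatenated
    sentence created during joint embedding, so clear_embeddings can also
    free its (GPU) embeddings instead of leaking them."""

    def __init__(self, first: Sentence, second: Sentence) -> None:
        self.first = first
        self.second = second
        self.concatenated_sentence: Optional[Sentence] = None

    def clear_embeddings(self) -> None:
        self.first.clear_embeddings()
        self.second.clear_embeddings()
        # The fix: also clear the concatenated sentence, if one was created.
        if self.concatenated_sentence is not None:
            self.concatenated_sentence.clear_embeddings()
```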

@MattGPT-ai MattGPT-ai force-pushed the gh-3487/fix-gpu-memory-leak-text-pair-regressor branch from 27c0638 to a4e7223 Compare July 8, 2024 17:28
@MattGPT-ai
Contributor Author

Just pushed a type hint change for a file that I didn't really touch, but that I think is now being type-checked because I added type hints to Sentence:
predicted: List[List[Union[int, float]]] = [[] for _ in range(number_tokens)]

I'm not sure whether this should just be List[List[int]], with the line predicted[i].append(input_indices[i].item()) casting explicitly via int(input_indices[i].item()), but that seemed riskier since it changes code rather than just the type hint.
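For reference, the two alternatives look roughly like this (a sketch; in the real code `input_indices` is a torch tensor, and `.item()` on an integer-dtype tensor already returns a Python int, which is why the stricter hint is plausible):

```python
from typing import List, Union

number_tokens = 3

# The permissive hint used in the PR: values may be int or float.
predicted: List[List[Union[int, float]]] = [[] for _ in range(number_tokens)]

# The stricter alternative: narrow the hint and cast at the append site.
predicted_strict: List[List[int]] = [[] for _ in range(number_tokens)]
predicted_strict[0].append(int(7))  # stands in for int(input_indices[0].item())
```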
Any comments here?

Review comment on flair/data.py (outdated, resolved)
Collaborator

@alanakbik alanakbik left a comment


Thanks for adding this! Mostly looks good. Just please correct the return signature of the embedding property, as indicated in the comment above.

I also added a suggestion on checking for the concatenated data point, which may give a minimal speed-up during training (though likely negligible so up to you if you want to include it).

Regarding the lemmatizer, I'm not so sure. Probably best to err on the side of caution and use the old, underspecified signature.

Review comment on flair/models/pairwise_regression_model.py (outdated, resolved)
Keep a reference to the concatenated sentence that is created when data points in a DataPair are not embedded separately. Those embeddings can then be cleared in clear_embeddings, freeing the GPU memory.
@MattGPT-ai MattGPT-ai force-pushed the gh-3487/fix-gpu-memory-leak-text-pair-regressor branch from 3accd91 to 1a6f110 Compare July 9, 2024 01:53
@MattGPT-ai MattGPT-ai requested a review from alanakbik July 9, 2024 09:50
@mattb-zip

I believe I have addressed all of the requests; let me know if you can approve, or if you'd like any further changes.

@MattGPT-ai
Contributor Author

MattGPT-ai commented Jul 12, 2024

I just realized that this same issue exists for TextPairClassifier.

The simple fix is to apply the same logic to _get_embedding_for_data_point -- however, since these two functions are identical, it might be good to do a small refactor where both call a common function for embedding a TextPair.

Since these models inherit from different base classes, doing this via inheritance doesn't seem like the best idea.
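A sketch of the common-function idea, assuming a module-level helper rather than inheritance (`Sentence` is a stub, and the helper name and separator are assumptions, not Flair's API):

```python
class Sentence:
    """Minimal stand-in for flair.data.Sentence for this sketch."""

    def __init__(self, text: str) -> None:
        self.text = text


def make_concatenated_sentence(
    first: Sentence, second: Sentence, sep: str = " [SEP] "
) -> Sentence:
    """One shared function that both TextPairRegressor and
    TextPairClassifier could call from _get_embedding_for_data_point,
    removing the duplicated concatenation logic without touching
    either class hierarchy."""
    return Sentence(first.text + sep + second.text)
```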

@alanakbik
Collaborator

@MattGPT-ai thanks for fixing this and adding the type hints!

I'll merge this now. The same bug in TextPairClassifier could be fixed similarly. Agree that addressing this via inheritance would be overkill since this only affects two classes (and doing too much via inheritance tends to make code less intuitive).

@alanakbik alanakbik merged commit 7887678 into flairNLP:master Jul 13, 2024
1 check passed
@MattGPT-ai
Contributor Author

I actually pushed an additional commit applying the same fix to TextPairClassifier. I also have an idea for refactoring this: make _get_embedding_for_data_point a method of TextPair. Right now TextPair is just a type definition, but it could be made a class that inherits from DataPair[Sentence, Sentence]. The models that use TextPair would then call its own method to embed, either as a single function or as two functions for embedding separately vs. jointly.

This would give better control over how these embeddings are set in the TextPair. One problem I thought of: if we save the concatenated sentence, you could end up with an inconsistency if you change one of the texts in the pair, since the concatenated sentence would not be updated accordingly. This refactor would allow a setter for each member of the pair that automatically updates the concatenated sentence. I think right now we're relying too much on the models to handle all of this; this functionality should exist in the data class, independent of the models.

Currently in DataPair, the embedding property returns the concatenation of the two members' embeddings, which is more in line with embedding them separately. That might be a decent default, but I think it makes sense to override it for TextPair, and it might even be good to have a property indicating whether the saved embedding is separate or joint. I'm not sure about that, though; your thoughts would be much appreciated. If you think it makes sense, I could push a PR and maybe create a separate issue.
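One way to sketch the setter idea: cache the concatenated sentence and invalidate it whenever either member changes (all names here are illustrative, not the actual Flair API):

```python
from typing import Optional


class Sentence:
    """Minimal stand-in for flair.data.Sentence."""

    def __init__(self, text: str) -> None:
        self.text = text


class TextPair:
    """Sketch: a class (rather than a type alias) that owns its
    concatenated form, so models no longer have to manage it."""

    def __init__(self, first: Sentence, second: Sentence) -> None:
        self._first = first
        self._second = second
        self._concatenated: Optional[Sentence] = None

    @property
    def first(self) -> Sentence:
        return self._first

    @first.setter
    def first(self, value: Sentence) -> None:
        self._first = value
        self._concatenated = None  # invalidate the stale concatenation

    @property
    def second(self) -> Sentence:
        return self._second

    @second.setter
    def second(self, value: Sentence) -> None:
        self._second = value
        self._concatenated = None

    def concatenated(self, sep: str = " [SEP] ") -> Sentence:
        # Build lazily and cache, so repeated joint embedding reuses it.
        if self._concatenated is None:
            self._concatenated = Sentence(self._first.text + sep + self._second.text)
        return self._concatenated
```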

@alanakbik
Collaborator

Hi @MattGPT-ai, I think it is a good idea, and I agree that saving the concatenated data point may lead to problematic inconsistencies.

I'm a bit worried though that moving too much logic into the data class will cause the logic to become too distributed. Right now, the TextPairClassifier makes all model-relevant decisions through _get_embedding_for_data_point: whether or not to concat, what separator to use, etc. Perhaps you could prepare a PR and we can discuss it?
