
Cache hashes on filesystem #87

Open
leggewie opened this issue Sep 24, 2020 · 5 comments
Labels
✨ enhancement: Improvement or change to an existing feature
🎁 feature request: Not existing yet and need to be implemented
🙏 help wanted: I can't do this alone and need contributors

Comments

@leggewie
Contributor

When hashing about 20,000 to 30,000 mails, mdedup used up 2 GB of RAM. I think it would be a good idea to offload storing the hashes to a temporary file, one that could potentially be reused on a subsequent run.
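A minimal sketch of that idea, assuming Python's standard-library dbm module as the on-disk store (hashing only the Message-ID header here as a stand-in for mdedup's real header hashing):

```python
# Sketch only: keep the hash -> mail map on disk via the standard-library dbm
# module instead of holding every hash in an in-memory dict.
import dbm
import hashlib
import mailbox

def iter_hashes(maildir_path):
    # Illustrative: hash only the Message-ID header; mdedup's actual hashing
    # normalizes a larger set of headers.
    box = mailbox.Maildir(maildir_path, create=False)
    for key, msg in box.iteritems():
        header = (msg.get("Message-ID") or "").encode()
        yield key, hashlib.sha224(header).hexdigest()

# Mode "c" creates the cache file if missing; the file survives the run, so a
# subsequent invocation can reuse the already-computed hashes.
with dbm.open("mdedup-hash-cache", "c") as cache:
    for mail_key, mail_hash in iter_hashes("Maildir"):
        if mail_hash in cache:
            print(f"duplicate: {mail_key} matches {cache[mail_hash].decode()}")
        else:
            cache[mail_hash] = mail_key
```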

@kdeldycke
Owner

kdeldycke commented Oct 1, 2020

Thanks for the bug report. I'm not surprised the CLI has limited performance. The goal is to make it work before making it fast. Once the project gets more stable and better tested, I guess we can invest some time in making it fast.

@kdeldycke kdeldycke changed the title RFE: please consider to keep hashes in a log file instead of in RAM Cache hashes on filesystem Oct 1, 2020
@stweil

stweil commented Nov 5, 2020

I just tried to deduplicate a mailbox with about 300,000 e-mails. mdedup allocated all available memory, including swap space, making the system unusable until the kernel finally killed it with an out-of-memory error.

So the question is not whether to make it work or to make it fast. Currently it is neither fast nor does it work for really large mailboxes.

Is at least 3 GB of RAM for 300,000 e-mails, i.e. more than 10 KB per e-mail, a reasonable size? The numbers above even indicate about 100 KB per e-mail.
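For reference, a back-of-the-envelope check of those figures (taking ~25,000 mails as the midpoint of the first report):

```python
# Back-of-the-envelope memory cost per mail, from the numbers in this thread.
GiB = 1024 ** 3
print(2 * GiB / 25_000 / 1024)    # leggewie: 2 GB / ~25k mails  -> ~84 KB/mail
print(3 * GiB / 300_000 / 1024)   # stweil:   3 GB / 300k mails  -> ~10 KB/mail
```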

@leggewie
Contributor Author

@stweil To be fair, swap exists, so this does indeed become a question of optimization as @kdeldycke correctly pointed out.

@leggewie
Contributor Author

FWIW, I was able to run this on a 10-year-old thin client (!) with 2 GB of RAM by adding 5 to 10 GB of swap on an external USB HDD, to analyze a maildir account with about 150' mails. This really is only a question of convenience. Just add temporary swap space and drink copious amounts of coffee while waiting, for now ;-)

@kdeldycke
Owner

The thing is, this project was always a hack for single users with small data sets. The fact that people are now using it across several mail sources and much bigger boxes is evidence that it fills real user needs.

But a wider audience playing with it also exposes its weaknesses. We need some more developers to tackle that issue, and a filesystem cache is indeed a good feature to have.

Let's bootstrap that by discussing the implementation.

I propose to simply persist the hash <=> email UID map in a local SQLite database. That way we don't need to reinvent yet another file format, and it's easy to do thanks to the sqlite3 module included in every Python distribution.
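A minimal sketch of that proposal, assuming one hash and one maildir UID per message; the table and column names are illustrative, not a settled schema:

```python
# Sketch of persisting the hash <=> UID map in SQLite via the stdlib module.
import sqlite3

conn = sqlite3.connect("mdedup-cache.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS hashes (hash TEXT PRIMARY KEY, uid TEXT NOT NULL)"
)

def cache_hash(mail_hash, uid):
    # INSERT OR REPLACE keeps the cache idempotent across re-runs.
    conn.execute("INSERT OR REPLACE INTO hashes VALUES (?, ?)", (mail_hash, uid))

def lookup(mail_hash):
    row = conn.execute(
        "SELECT uid FROM hashes WHERE hash = ?", (mail_hash,)
    ).fetchone()
    return row[0] if row else None

cache_hash("d2a8...", "1601900000.M123P456.host")  # made-up example values
print(lookup("d2a8..."))                           # -> 1601900000.M123P456.host
conn.commit()
conn.close()
```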

@kdeldycke kdeldycke added the ✨ enhancement, 🎁 feature request and 🙏 help wanted labels and removed the enhancement label Nov 23, 2022