
Cache hashes on filesystem #87

Open
leggewie opened this issue Sep 24, 2020 · 5 comments
Labels
✨ enhancement: Improvement or change to an existing feature
🎁 feature request: Not existing yet and need to be implemented
🙏 help wanted: I can't do this alone and need contributors

Comments

@leggewie
Contributor

When hashing about 20,000 to 30,000 mails, mdedup used up 2 GB of RAM. I think it would be a good idea to offload storing the hashes to a temporary file, one that could potentially be reused on a subsequent run.
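A minimal sketch of that idea, assuming Python's standard-library dbm module as the on-disk store (hashing only the Message-ID header here as a stand-in for mdedup's real header hashing):

```python
# Sketch only: keep the hash -> mail map on disk via the standard-library dbm
# module instead of holding every hash in an in-memory dict.
import dbm
import hashlib
import mailbox

def iter_hashes(maildir_path):
    # Illustrative: hash only the Message-ID header; mdedup's actual hashing
    # normalizes a larger set of headers.
    box = mailbox.Maildir(maildir_path, create=False)
    for key, msg in box.iteritems():
        header = (msg.get("Message-ID") or "").encode()
        yield key, hashlib.sha224(header).hexdigest()

# Mode "c" creates the cache file if missing; the file survives the run, so a
# subsequent invocation can reuse the already-computed hashes.
with dbm.open("mdedup-hash-cache", "c") as cache:
    for mail_key, mail_hash in iter_hashes("Maildir"):
        if mail_hash in cache:
            print(f"duplicate: {mail_key} matches {cache[mail_hash].decode()}")
        else:
            cache[mail_hash] = mail_key
```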

@kdeldycke
Owner

kdeldycke commented Oct 1, 2020

Thanks for the bug report. I'm not surprised the CLI has limited performance. The goal is to make it work before making it fast. Once the project gets more stable and better tested, I guess we can invest some time in making it fast.

@kdeldycke kdeldycke changed the title RFE: please consider to keep hashes in a log file instead of in RAM Cache hashes on filesystem Oct 1, 2020
@stweil

stweil commented Nov 5, 2020

I just tried to deduplicate a mailbox with about 300,000 e-mails. mdedup allocated all available memory, including swap space, making the system unusable until the kernel finally killed it with an out-of-memory error.

So the question is not whether to make it work or to make it fast. Currently it is neither fast nor does it work for really large mailboxes.

Is at least 3 GB of RAM for 300,000 e-mails, i.e. more than 10 KB per e-mail, a reasonable size? The numbers above even indicate about 100 KB per e-mail.
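For reference, a back-of-the-envelope check of those figures (taking ~25,000 mails as the midpoint of the first report):

```python
# Back-of-the-envelope memory cost per mail, from the numbers in this thread.
GiB = 1024 ** 3
print(2 * GiB / 25_000 / 1024)    # leggewie: 2 GB / ~25k mails  -> ~84 KB/mail
print(3 * GiB / 300_000 / 1024)   # stweil:   3 GB / 300k mails  -> ~10 KB/mail
```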

@leggewie
Contributor Author

@stweil To be fair, swap exists, so this does indeed become a question of optimization as @kdeldycke correctly pointed out.

@leggewie
Contributor Author

FWIW, I was able to run this on a 10-year-old thin client (!) with 2 GB of RAM by adding 5 to 10 GB of swap on an external USB HDD, to analyze a maildir account with about 150' mails. This really is only a question of convenience. Just add temporary swap space and drink copious amounts of coffee while waiting, for now ;-)

@kdeldycke
Owner

The thing is, this project was always a hack for single users with small data sets. The fact that people are now using it across several mail sources and much bigger boxes is evidence that it fills real user needs.

But a wider audience playing with it also exposes its weaknesses. We need some more developers to tackle that issue, and a filesystem cache is indeed a good feature to have.

Let's bootstrap that by discussing the implementation.

I propose to simply persist the hash <=> email UID map in a local SQLite database. That way we don't need to reinvent yet another file format, and it's easy to do thanks to the sqlite3 module included in every Python distribution.
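A minimal sketch of that proposal, assuming one hash and one maildir UID per message; the table and column names are illustrative, not a settled schema:

```python
# Sketch of persisting the hash <=> UID map in SQLite via the stdlib module.
import sqlite3

conn = sqlite3.connect("mdedup-cache.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS hashes (hash TEXT PRIMARY KEY, uid TEXT NOT NULL)"
)

def cache_hash(mail_hash, uid):
    # INSERT OR REPLACE keeps the cache idempotent across re-runs.
    conn.execute("INSERT OR REPLACE INTO hashes VALUES (?, ?)", (mail_hash, uid))

def lookup(mail_hash):
    row = conn.execute(
        "SELECT uid FROM hashes WHERE hash = ?", (mail_hash,)
    ).fetchone()
    return row[0] if row else None

cache_hash("d2a8...", "1601900000.M123P456.host")  # made-up example values
print(lookup("d2a8..."))                           # -> 1601900000.M123P456.host
conn.commit()
conn.close()
```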

@kdeldycke kdeldycke added the ✨ enhancement, 🎁 feature request and 🙏 help wanted labels and removed the enhancement label Nov 23, 2022