Subfolders are not processed #123

Closed
tichaczech opened this issue Nov 4, 2020 · 8 comments
Labels
🐛 bug Something isn't working, or a fix is proposed

@tichaczech
Only 69 emails are processed, although I have 30k+ in my .Maildir. Those 69 are just the ones in my INBOX folder; the rest are in subfolders, which leads me to assume that subfolders are not processed.

Deduplication command on .Maildir

$ mdedup --hash-header date --hash-header from --hash-header to --hash-header message-id --strategy select-smallest --dry-run --export .Maildir.tmp .Maildir
● Phase #0 - Load mails
Opening /(...)/.Maildir ...
maildir detected.
69 mails found.
(...)

File count in .Maildir

$ find .Maildir -type f | wc -l
36674

All data on execution context as provided by $ mdedup --version:

$ mdedup --version
mdedup 6.0.1
{'username': '-', 'guid': '77601cc7d08d280102bb709be48d881', 'hostname': '-', 'hostfqdn': '-', 'uname': {'system': 'Linux', 'node': '-', 'release': '3.10.105', 'version': '#25426 SMP Wed Jul 8 03:10:21 CST 2020', 'machine': 'armv7l', 'processor': ''}, 'linux_dist_name': '', 'linux_dist_version': '', 'cpu_count': 2, 'fs_encoding': 'utf-8', 'ulimit_soft': 1024, 'ulimit_hard': 4096, 'cwd': '-', 'umask': '0o2', 'python': {'argv': '-', 'bin': '-', 'version': '3.8.2 (tags/Contacts-1.0.0-0232-200617:57e5f51, Jun 29 2020, 09:34:08) [GCC 4.9.3 20150311 (prerelease)]', 'compiler': 'GCC 4.9.3 20150311 (prerelease)', 'build_date': 'Jun 29 2020 09:34:08', 'version_info': [3, 8, 2, 'final', 0], 'features': {'openssl': 'OpenSSL 1.0.2u-fips  20 Dec 2019', 'expat': 'expat_2.2.1', 'sqlite': '3.10.2', 'tkinter': '', 'zlib': '1.2.8', 'unicode_wide': True, 'readline': True, '64bit': False, 'ipv6': True, 'threading': True, 'urandom': True}}, 'time_utc': '2020-11-04 00:19:10.939181', 'time_utc_offset': 1.0, '_eco_version': '1.0.1'}

@kdeldycke
Owner

Oh yes, you're right. That is a legitimate issue. Users expect all mails in all subfolders to be part of the initial deduplication pool by default.
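
For illustration only, here is a minimal sketch of how a Maildir and its subfolders could be walked with Python's standard `mailbox` module. It is not mdedup's actual code; the function name `iter_maildir_messages` and the `INBOX` label for the top-level folder are assumptions made for this example.

```python
import mailbox


def iter_maildir_messages(path):
    """Yield (folder_name, key, message) for a Maildir and all its subfolders.

    Hypothetical helper, not part of mdedup. "INBOX" is just a label chosen
    here for the top-level folder.
    """
    root = mailbox.Maildir(path, create=False)
    pending = [("INBOX", root)]
    while pending:
        name, box = pending.pop()
        # Mails stored directly in this folder (its cur/ and new/ directories).
        for key in box.iterkeys():
            yield name, key, box[key]
        # list_folders() returns the dot-prefixed subdirectories, without the dot.
        for sub in box.list_folders():
            pending.append((f"{name}/{sub}", box.get_folder(sub)))
```

Counting the items this generator yields, rather than calling `len()` on the top-level `mailbox.Maildir`, gives a total that includes subfolders, which is the number the report above expected to see.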

@kdeldycke
Owner

I hacked something together in b82accc. It is available in the brand new 6.0.2 release.

Can you try it out and share the results here, please?

@tichaczech
Author

tichaczech commented Nov 5, 2020

Seeing the changes you made, I was more or less expecting the following results:

  • it processes subfolders 👍
  • it processes all of them 👍
  • it takes looooooong :(

So as far as this bug report goes, I consider it fixed and closed!

What I didn't expect:

  • it dies prematurely 👎

On my system (AL212 1.4GHz, 1GB RAM) it processed about 1/3 of all emails before it was killed (probably for using up all the memory). I do not know how big the mailboxes are that others are deduplicating, but ours are huge (100+ GB, 10+ million emails), and after a f*cked-up migration we have a plethora of duplicates, so some incremental offloading to disk may help (if my assumption is right).

As a developer myself, I can offer some help, but it would take some time to jump in (I am more of a C++/C#/TypeScript guy), and I don't currently have any. The only help I can offer for now is testing and "light debugging".

May I open another bug report for that, or would you rather not bother with it?

@kdeldycke
Owner

Ah yeah, I'm not surprised at all. mdedup is quite naive and loads up all mails in memory. Some real refactoring is required here to improve performance.
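
As a rough sketch of what disk offloading could look like (an illustration under stated assumptions, not how mdedup works or is planned to work): read only the header block of each file, hash the same four headers passed as `--hash-header` in the report above, and keep the resulting index in an on-disk SQLite database instead of holding parsed mails in memory. The helper names, the table layout, and the choice of SHA-224 are all hypothetical.

```python
import hashlib
import itertools
import os
import sqlite3
from email.parser import BytesParser


def read_headers(path):
    """Parse only the header block of a mail file, never reading its body."""
    lines = []
    with open(path, "rb") as fp:
        for line in fp:
            if line in (b"\n", b"\r\n"):  # blank line ends the headers
                break
            lines.append(line)
    return BytesParser().parsebytes(b"".join(lines), headersonly=True)


def header_hash(msg, headers=("date", "from", "to", "message-id")):
    """Hash the selected headers; missing headers count as empty strings."""
    digest = hashlib.sha224()
    for name in headers:
        value = str(msg.get(name, "")).strip()
        digest.update(f"{name}:{value}\n".encode("utf-8", "replace"))
    return digest.hexdigest()


def build_index(paths, db_path):
    """Store one (hash, path, size) row per mail in an on-disk SQLite file."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS mails "
        "(hash TEXT, path TEXT PRIMARY KEY, size INTEGER)"
    )
    for path in paths:
        row = (header_hash(read_headers(path)), path, os.path.getsize(path))
        db.execute("INSERT OR REPLACE INTO mails VALUES (?, ?, ?)", row)
    db.commit()
    return db


def duplicate_groups(db):
    """Yield (hash, [paths]) for hashes seen more than once, smallest file first."""
    rows = db.execute(
        "SELECT hash, path FROM mails WHERE hash IN "
        "(SELECT hash FROM mails GROUP BY hash HAVING COUNT(*) > 1) "
        "ORDER BY hash, size, path"
    )
    for digest, group in itertools.groupby(rows, key=lambda row: row[0]):
        yield digest, [path for _, path in group]
```

With this layout, memory use stays roughly constant per mail, and the duplicate sets come out of a single SQL query. The trade-off is more disk I/O, which may be slow on a small NAS-class machine.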

I know there's lots of low hanging fruit lying around (like mail's double copies). And there's also: #87. So no need to create new tickets.

To be honest, I'm past any personal need for that tool; I don't use it anymore. These last few weeks were probably the last effort I'll invest to get the tool and the project into good shape (stable features, good-enough unit tests). Now we both need a strong contributor to step in if we want anything bigger.

@kdeldycke
Owner

That being said, and now that I think about it, you can hire me to implement better performance! 👨‍💻

@tichaczech
Author

tichaczech commented Nov 5, 2020

My boss doesn't like the idea of spending money on such things (bloody him), and that is exactly why I am here and not already running some paid tool (with all due respect to the work you have done).

I even lifted my *ss off the chair to ask again, but unluckily for both of us, he didn't change his mind :/.

Anyway, thank you, and good luck with whatever else you are doing right now!

@leggewie
Contributor

leggewie commented Jan 11, 2021

@tichaczech I'd suggest setting up a massive swap partition while you are weeding out those dupes. That worked/works for me.

As a datapoint, I'm using about 3 GB of memory for 50,000 mails.
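
To put that datapoint next to the mailbox size mentioned earlier in the thread, here is a back-of-the-envelope calculation. It assumes memory use grows linearly with the number of mails and treats 1 GB as 10^9 bytes; both are simplifications, so the result is only an order of magnitude.

```python
# Figures taken from the thread: ~3 GB for 50,000 mails, 10+ million mails to process.
observed_bytes = 3 * 10**9
observed_mails = 50_000
target_mails = 10_000_000

# Assumption: linear scaling of memory with mail count.
bytes_per_mail = observed_bytes // observed_mails
projected_bytes = bytes_per_mail * target_mails

print(f"{bytes_per_mail / 1000:.0f} kB per mail")
print(f"{projected_bytes / 10**9:.0f} GB projected for {target_mails:,} mails")
```

Under those assumptions the projection is about 60 kB per mail and about 600 GB for 10 million mails, so swap alone would have to be very large for a mailbox of that size.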

@github-actions

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 12, 2021
@kdeldycke kdeldycke added 🐛 bug Something isn't working, or a fix is proposed and removed bug labels Nov 23, 2022
3 participants