OOM: mdedup hangs then exits with message Killed #362

Closed
3 tasks done
danielhatton opened this issue Oct 2, 2022 · 8 comments
Labels
🐛 bug Something isn't working, or a fix is proposed 🙏 help wanted I can't do this alone and need contributors

Comments

@danielhatton

danielhatton commented Oct 2, 2022

Preliminary checks

Describe the bug

When running mdedup on a largeish maildir tree (tens of thousands of messages across dozens of folders, occupying about 3GB in total), the process gets about two-thirds of the way (by number of messages) through the "Compute hashes and group duplicates" phase, then hangs (at which point everything else running on the system also becomes very slow) for a few minutes, before exiting with the word "Killed" printed to the shell.

I suspect that the hang results from mdedup's memory usage spiking to the point where lots of swap is being used, and the exit results from mdedup being zapped by the OOM killer. Immediately before the hang, mdedup's resident memory usage as reported by top is 1.8GB.
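One way to test that suspicion without letting the whole machine grind into swap is to run the command under a hard memory cap, so that it fails fast instead of waiting for the OOM killer. The following is only a sketch, assuming a Linux/Unix system; `run_with_memory_cap` is a hypothetical helper, not part of mdedup, and `RLIMIT_AS` caps virtual address space, which is related to but not the same as the resident size reported by top.

```python
import resource
import subprocess


def run_with_memory_cap(command, max_bytes):
    """Run `command` with its virtual address space capped at `max_bytes`.

    Returns the child's exit code. A Python child that hits the cap
    typically fails with MemoryError rather than being killed.
    """

    def apply_limit():
        # Executed in the child process just before the command starts.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(command, preexec_fn=apply_limit).returncode


# Hypothetical usage, with a 2 GiB cap and placeholder folder names:
# run_with_memory_cap(
#     ["mdedup", "-s", "discard-all-but-one", "-a", "delete-discarded",
#      "a_maildir_folder", "another_maildir_folder"],
#     2 * 1024**3,
# )
```

If the capped run ends with a Python `MemoryError` traceback at roughly the same point in the hashing phase, that would support the memory-exhaustion explanation.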

To reproduce

Steps to reproduce the behavior:

  1. The full mdedup CLI invocation you used.

    mdedup -s discard-all-but-one -a delete-discarded a_maildir_folder another_maildir_folder etcet era a_directory_which_as_well_as_being_a_maildir_folder_contains_subdirectories_which_are_also_maildir_folders

  2. The data set leading to the bug.
    Cannot provide this due to risk of confidential data leak (and the likelihood that the problem is specifically associated with the data set not being "minimal").

Expected behavior

mdedup runs to completion.

CLI output


Cannot provide this due to risk of confidential data leak.

Environment

All data on execution context as provided by $ mdedup --version:

$ mdedup --version
mdedup 6.2.0
{'username': '-', 'guid': '5ab0da572cb8b705a3025278ab30147', 'hostname': '-', 'hostfqdn': '-', 'uname': {'system': 'Linux', 'node': '-', 'release': '5.15.0-43-generic', 'version': '#46~20.04.1-Ubuntu SMP Thu Jul 14 15:20:17 UTC 2022', 'machine': 'x86_64', 'processor': 'x86_64'}, 'linux_dist_name': '', 'linux_dist_version': '', 'cpu_count': 4, 'fs_encoding': 'utf-8', 'ulimit_soft': 1024, 'ulimit_hard': 1048576, 'cwd': '-', 'umask': '0o2', 'python': {'argv': '-', 'bin': '-', 'version': '3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]', 'compiler': 'GCC 9.4.0', 'build_date': 'Jun 22 2022 20:18:18', 'version_info': [3, 8, 10, 'final', 0], 'features': {'openssl': 'OpenSSL 1.1.1f  31 Mar 2020', 'expat': 'expat_2.2.9', 'sqlite': '3.31.1', 'tkinter': '8.6', 'zlib': '1.2.11', 'unicode_wide': True, 'readline': True, '64bit': True, 'ipv6': True, 'threading': True, 'urandom': True}}, 'time_utc': '2022-10-02 13:33:07.424109', 'time_utc_offset': 0.0, '_eco_version': '1.0.1'}


@danielhatton
Author

Thinking a bit further about those numbers, it looks like mdedup is trying to hold the entire contents of the maildir tree in RAM at once, which is surprising given that what it's trying to do is compare hashes of certain headers.
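To illustrate why comparing header hashes should not need the whole tree in memory, here is a minimal sketch using only Python's standard library. It is not mdedup's actual implementation: the header list in `HEADERS` and both function names are assumptions made for illustration. Each message is parsed, hashed, and dropped, so only digests and short references stay resident.

```python
import hashlib
import mailbox
from collections import defaultdict

# Assumed header set for illustration; not necessarily what mdedup hashes.
HEADERS = ("Date", "From", "To", "Subject", "Message-ID")


def header_digest(message, headers=HEADERS):
    """Hash only the selected headers of a single message."""
    hasher = hashlib.sha256()
    for name in headers:
        value = message.get(name, "")
        hasher.update(f"{name}:{value}\n".encode("utf-8", "replace"))
    return hasher.hexdigest()


def group_by_header_hash(maildir_paths):
    """Return {digest: [(maildir_path, key), ...]} for digests seen more than once."""
    groups = defaultdict(list)
    for path in maildir_paths:
        box = mailbox.Maildir(path, create=False)
        for key in box.iterkeys():
            # The parsed message goes out of scope after each iteration;
            # only its digest and (path, key) reference are retained.
            message = box.get_message(key)
            groups[header_digest(message)].append((path, key))
    return {digest: refs for digest, refs in groups.items() if len(refs) > 1}
```

With this shape, memory use grows with the number of messages (one digest and one reference each), not with their total size on disk.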

@kdeldycke
Owner

Yes, the mail-deduplicate implementation is quite naive and chokes on a non-trivial volume of mail. The goal of this CLI was first to have it work before making it performant. We haven't reached that stage yet, which is why implementing a cache has been proposed for several years; see #87.

I do not have any time to work on mail-deduplicate right now. But feel free to propose PRs! :)

@danielhatton
Author

Thanks. Being a bit of a kludger, I might just write a wrapper script that takes my list of n maildir folders, and invokes mdedup the requisite n(n-1)/2 times to compare them all pairwise. Although n(n-1)/2 for my dataset is about 500, so it'll take a while to run.
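The wrapper described above could look roughly like the sketch below. The `-s` and `-a` values are copied from the invocation in this report; the function names are hypothetical, and the folder list is assumed to be flat (a directory that contains nested maildir folders would first have to be expanded into its individual folders).

```python
import itertools
import subprocess

# Strategy and action flags copied from the invocation in this report.
MDEDUP_ARGS = ["mdedup", "-s", "discard-all-but-one", "-a", "delete-discarded"]


def pairwise_commands(folders, base_args=MDEDUP_ARGS):
    """Yield one mdedup command line per unordered pair of folders."""
    for first, second in itertools.combinations(folders, 2):
        yield [*base_args, first, second]


def run_pairwise(folders):
    """Run mdedup on every pair, so each run only sees two folders at a time."""
    for command in pairwise_commands(folders):
        # check=True stops at the first failing run, e.g. one that was killed.
        subprocess.run(command, check=True)
```

`itertools.combinations(folders, 2)` produces exactly n(n-1)/2 pairs; for 32 folders, for example, that is 496 runs, in line with the "about 500" estimate. Peak memory is then bounded by the largest pair of folders rather than by the whole tree.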

@kdeldycke
Owner

When in doubt, brute force it. If it works, it's not a kludge. And machine time is cheaper than developer time. 😁

@kdeldycke
Owner

Still, the commit history of this project indicates there's a non-zero chance of me refreshing the code base once a year. So if you're patient, you might see a new release of mail-deduplicate in a couple of months.

@kdeldycke kdeldycke added 🐛 bug Something isn't working, or a fix is proposed and removed bug labels Nov 23, 2022
@kdeldycke kdeldycke changed the title mdedup hangs then exits with message "Killed" OOM: mdedup hangs then exits with message Killed Dec 9, 2022
@kdeldycke kdeldycke added the 🙏 help wanted I can't do this alone and need contributors label Dec 9, 2022
@kdeldycke
Owner

@shirosaki just proposed PR #562 to reduce the memory usage of mail-deduplicate. I just merged it upstream and will try to cut a release today.

@kdeldycke
Owner

Just released mail-deduplicate 7.3.0, with performance enhancements from @shirosaki .

I will close this issue for now, then.


This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 19, 2024