
Dupefinder

Main Workflow

dupefinder-all-in-one.py

This script handles the entire process in one step.

Note: the all-in-one process is best suited to situations where there's minimal chance the process will be interrupted, such as scanning a local drive.

  1. Choose a folder you'd like to scan for duplicates
  2. Copy the path to that folder. This will be the first parameter (e.g. "/Folder")
  3. Choose a name for the output .csv file that will list all suspected duplicate files (e.g. "folder-2024-05-01-dupes.csv"). This is the second parameter
  4. Choose a name for this job (e.g. "folder-2024-05-01"). This is the third parameter
  5. Run it:
python3 dupefinder-all-in-one.py /Folder folder-2024-05-01-dupes.csv folder-2024-05-01

Optional parameters

  • --csvfile - List of all filetypes the script encounters. Default: "filetypes.csv".
  • --maxhashreps - How many 16kb chunks to hash for each file. Lower values generate a hash from a smaller portion of each file, which can save time but may also produce false positives; see the sketch after this list. Default: the entire file.
  • --skipuntil - For continuing incomplete previous scans. Will skip until this file number is encountered. Default: no skips.
  • --maxfiles - Only count this many files. Default: scans all files.
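
The chunked hashing that --maxhashreps controls works roughly like this (a minimal sketch of the idea, not the script's actual code; the 16kb chunk size is from the description above, while MD5 and the function name are assumptions):

import hashlib

def partial_hash(path, maxhashreps=None, chunk_size=16 * 1024):
    # Hash a file in 16kb chunks, stopping after maxhashreps chunks.
    # maxhashreps=None hashes the entire file, matching the default.
    h = hashlib.md5()  # assumption: the README doesn't name the algorithm
    reps = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
            reps += 1
            if maxhashreps is not None and reps >= maxhashreps:
                break  # only a prefix was hashed: faster, but can false-positive
    return h.hexdigest()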

Alternate workflow

In the event that the process might be interrupted (such as with a network share), it may be desirable to run the steps one at a time.

  1. Run filetype-finder.py. Specify the folder you'd like to scan as the first parameter. You may also want to specify a nickname for this scan, e.g. "folder-2024-05-01".
    • If the file-finder operation is interrupted, you can restart the scan at the last successful file number using --skipuntil. Be sure to save it to a new .csv file and append the .csv files before running dupefinder.py.
  2. Run dupefinder.py. The first parameter is the nickname specified above with "-all.csv" appended to it, e.g. "folder-2024-05-01-all.csv". The second parameter is the output file that will list all detected duplicates, e.g. "folder-2024-05-01-dupes.csv".
  3. Run dupe_dupe_checker.py. The dupefinder process sometimes finds the same duplicate twice, but in opposite directions; in other words, file X matches file Y and file Y matches file X. This step detects and eliminates such entries. The only parameter is the output of the dupefinder step; the original file is overwritten. Example commands for all three steps are shown below.
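
For example (a sketch of the three steps; it assumes the nickname is filetype-finder.py's second parameter, which the utility reference below doesn't spell out):

python3 filetype-finder.py /Folder folder-2024-05-01
python3 dupefinder.py folder-2024-05-01-all.csv folder-2024-05-01-dupes.csv
python3 dupe_dupe_checker.py folder-2024-05-01-dupes.csv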

Utilities

csvscramble.py - CSV Scramble

Reads a CSV file and outputs it in random row order. Technically works with any line-delimited file.

  • in_name - Input file
  • out_name - Scrambled output file
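
Row scrambling of this kind can be done in a few lines (a sketch of the idea, not necessarily csvscramble.py's actual implementation):

import random
import sys

# Read every line of in_name, shuffle, and write them to out_name in
# random order. Note that a header row, if present, gets shuffled too.
with open(sys.argv[1]) as f:       # in_name
    rows = f.readlines()
random.shuffle(rows)
with open(sys.argv[2], "w") as f:  # out_name
    f.writelines(rows)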

dupe_dupe_checker.py - Dupe Dupe Checker

Removes all dupes from a list that are simply a reverse of another dupe.

  • dupecsvfile - The CSV file containing the list of dupes. Required. Overwrites the existing file.
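
The reversed-pair check amounts to treating each row's two paths as an unordered pair and keeping only the first occurrence (a sketch under the assumption that the first two CSV columns hold the two file paths):

def drop_reversed_dupes(rows):
    # Keep one row per unordered pair of paths; a (Y, X) row is dropped
    # once an (X, Y) row has been seen. Assumes columns 0 and 1 are the
    # two file paths; the real CSV layout may differ.
    seen = set()
    kept = []
    for row in rows:
        key = frozenset(row[:2])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept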

dupefinder-all-in-one.py

  • filepath - The path to scan
  • dupefile - Output CSV containing all duplicates
  • filetypefolder - Folder where lists of files of each type will be written. Do not include trailing slash. Default: "filetypes"
  • --csvfile - List of all filetypes the script encounters. Default: "filetypes.csv".
  • --maxhashreps - How many 16kb chunks to hash for each file. Lower values generate a hash from a smaller portion of each file, which can save time but may also produce false positives. Default: the entire file.
  • --skipuntil - For continuing incomplete previous scans. Will skip until this file number is encountered. Default: no skips.
  • --maxfiles - Only count this many files. Default: scans all files.

dupefinder.py - Dupefinder

Given a CSV of file hashes generated by Hashmaker, finds all duplicates. The output will probably need a pass through dupe_dupe_checker.py afterwards.

  • hashcsvfile - Input CSV file containing hashes
  • dupefilepath - Output CSV containing all duplicates
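
Finding duplicates from a hash CSV boils down to grouping paths by hash and pairing up each group's members (a sketch of the idea; the (path, hash) column layout is an assumption):

from collections import defaultdict
from itertools import combinations

def find_dupes(hash_rows):
    # Group file paths by hash, then emit every pair sharing a hash.
    # combinations() yields each unordered pair once; the real script
    # apparently can emit both directions, hence dupe_dupe_checker.py.
    by_hash = defaultdict(list)
    for path, digest in hash_rows:
        by_hash[digest].append(path)
    for digest, paths in by_hash.items():
        for a, b in combinations(paths, 2):
            yield a, b, digest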

file-types-compare.py - File Types Compare

Runs the whole sequence at once: makes a hashfile, checks for dupes, and removes the duplicate dupes.

  • filepath - Path to search for duplicates
  • --hashcsvfile - Optional path for the hash CSV file. Default: hashfile.csv

filetype-finder.py

Makes a CSV list of every file type in a path. Additionally, writes a separate list of every file of each type into the specified folder (defaults to filetypes/).

  • filepath - The path to scan
  • --csvfile - Output filetype list. Default: "filetypes.csv"
  • --filetypefolder - Folder where lists of files of each type will be written. Do not include trailing slash. Default: "filetypes"
  • --maxhashreps - How many 16kb chunks to generate hash for each file. Default: the entire file
  • --skipuntil - For continuing incomplete previous scans. Will skip until this file number is encountered
  • --maxfiles - Only count this many files
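
Collecting the file types amounts to walking the tree and bucketing paths by extension (a sketch of the core idea only; the real script also writes the per-type lists and the filetypes.csv summary):

import os
from collections import defaultdict

def collect_filetypes(filepath):
    # Walk the tree under filepath and bucket full paths by extension.
    by_ext = defaultdict(list)
    for root, _dirs, files in os.walk(filepath):
        for name in files:
            ext = os.path.splitext(name)[1].lower()  # e.g. ".jpg"
            by_ext[ext].append(os.path.join(root, name))
    return by_ext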

folder-compare.py

Given a CSV dupe file, generates a summary of the number of matching files in every permutation of folders. Useful for determining whether entire *folders* are duplicates, or close to duplicates, of one another.

  • dupefile - The CSV dupe file to use
  • outfile - File name of the report to generate
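
Counting matches per folder pair can be done by reducing each dupe row to its two parent folders (a sketch; it assumes the first two columns of the dupe file are file paths):

import os
from collections import Counter

def folder_match_counts(dupe_rows):
    # Count how many dupe pairs fall into each folder/folder combination.
    # Sorting the pair makes (A, B) and (B, A) count as the same bucket.
    counts = Counter()
    for row in dupe_rows:
        a, b = os.path.dirname(row[0]), os.path.dirname(row[1])
        counts[tuple(sorted((a, b)))] += 1
    return counts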

hashmaker.py - Hashmaker

Generates a CSV list of hashes, filtered by file type. The command-line function is no longer part of the main workflow, but it's included because it may still be useful.

  • filetypes - Comma-separated list of all file types you'd like to check, e.g. ".jpg,.gif,.bmp"
  • path - The path for which to make hashes
  • --hashcsvfile - Optional path for hash CSV file; hashfile.csv is used by default

mergecsvfiles.py - Merge All CSV Files

Merges an entire folder of file type CSVs into a single one with an added column for the CSV file from which each row came.

  • path - Folder of file type CSVs to merge
  • outfile - Destination for merged CSV file
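
The merge is essentially a concatenation with one extra provenance column (a sketch, not necessarily the script's exact behavior):

import csv
import glob
import os
import sys

# Concatenate every CSV in the folder given as the first argument,
# appending each source file's name as an extra column, and write the
# result to the outfile given as the second argument.
path, outfile = sys.argv[1], sys.argv[2]
with open(outfile, "w", newline="") as out:
    writer = csv.writer(out)
    for name in sorted(glob.glob(os.path.join(path, "*.csv"))):
        with open(name, newline="") as f:
            for row in csv.reader(f):
                writer.writerow(row + [os.path.basename(name)])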

mergecsv.py - CSV Merge

Merges two CSV files.

  • file1 - The first CSV file
  • file1name - Text to be added to each row of file 1, to tell them apart
  • file2 - The second CSV file
  • file2name - Text to be added to each row of file 2
  • outfile - The output CSV file

time_estimate.py

Not a command-line utility. Provides the function used to estimate remaining time during a file scan.
