This script handles the entire process in one step.
Note: the all-in-one process is ideal when there is minimal chance the process will be interrupted, such as when scanning a local drive.
- Choose a folder you'd like to scan for duplicates
- Copy the path to that folder. This will be the first parameter (e.g. "/Folder")
- Choose a name for the output .csv file that will list all suspected duplicate files (e.g. "folder-2024-05-01-dupes.csv"). This is the second parameter
- Choose a name for this job (e.g. "folder-2024-05-01"). This is the third parameter
- Run it...
python3 dupefinder-all-in-one.py /Folder folder-2024-05-01-dupes.csv folder-2024-05-01
- --csvfile - List of all filetypes the script encounters. Default: "filetypes.csv".
- --maxhashreps - How many 16 KB chunks to hash for each file. Lower values generate a hash from a smaller portion of each file, which can save time but may also result in false positives. Default: the entire file.
- --skipuntil - For continuing incomplete previous scans. Will skip until this file number is encountered. Default: no skips.
- --maxfiles - Only count this many files. Default: scans all files.
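The chunk-limited hashing behind --maxhashreps can be sketched as follows. This is an illustration only: the function name, the use of MD5, and the chunk handling are assumptions, not the script's actual implementation.

```python
import hashlib

def partial_hash(path, max_chunks=None, chunk_size=16 * 1024):
    """Hash a file in 16 KB chunks, stopping after max_chunks chunks
    (the idea behind --maxhashreps); None hashes the entire file."""
    h = hashlib.md5()
    chunks_read = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
            chunks_read += 1
            if max_chunks is not None and chunks_read >= max_chunks:
                break
    return h.hexdigest()
```

Two files with different tails but identical opening chunks would collide under a low max_chunks, which is the false-positive risk mentioned above.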
In the event that the process might be interrupted (such as with a network share), it may be desirable to run the steps one at a time.
- Run filetype_finder.py. Specify the folder you'd like to scan as the first parameter. You may also want to specify a nickname for this scan, e.g. "folder-2024-05-01".
- If the file-finder operation is interrupted, you can restart the scan from the last successful file number using --skipuntil. Be sure to save the restarted scan to a new .csv file and append the .csv files together before running dupefinder.py.
- Run dupefinder.py. The first parameter is the nickname specified above with "-all.csv" appended to it, e.g. "folder-2024-05-01-all.csv". The second parameter is the name of the output file listing all detected duplicates, e.g. "folder-2024-05-01-dupes.csv".
- Run dupe-dupe-checker.py. The dupefinder process sometimes finds the same duplicate twice, but in opposite directions; in other words, it reports that file X is the same as file Y and also that file Y is the same as file X. This process detects and eliminates such entries. The only parameter is the output of the dupefinder step; the original file is overwritten.
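The reverse-pair cleanup that dupe-dupe-checker.py performs amounts to the following (a minimal sketch with an assumed function name; the real script reads and rewrites the CSV in place):

```python
def drop_reversed_dupes(pairs):
    """Keep the first occurrence of each duplicate pair; drop (b, a)
    when (a, b) has already been seen."""
    seen = set()
    kept = []
    for a, b in pairs:
        key = frozenset((a, b))  # unordered key, so (a, b) == (b, a)
        if key not in seen:
            seen.add(key)
            kept.append((a, b))
    return kept
```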
Takes a CSV file and outputs its rows in random order. Technically works with any newline-delimited file.
- in_name - Input file
- out_name - Scrambled output file
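A minimal sketch of the scrambler (function name assumed):

```python
import random

def scramble_rows(in_name, out_name):
    """Write the lines of in_name to out_name in random order."""
    with open(in_name) as f:
        rows = f.readlines()
    random.shuffle(rows)
    with open(out_name, "w") as f:
        f.writelines(rows)
```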
Removes all dupes from a list that are simply a reverse of another dupe.
- dupecsvfile - The CSV file containing the list of dupes. Required. Overwrites the existing file.
- filepath - The path to scan
- dupefile - Output CSV containing all duplicates
- filetypefolder - Folder where lists of files of each type will be written. Do not include trailing slash. Default: "filetypes"
- --csvfile - List of all filetypes the script encounters. Default: "filetypes.csv".
- --maxhashreps - How many 16 KB chunks to hash for each file. Lower values generate a hash from a smaller portion of each file, which can save time but may also result in false positives. Default: the entire file.
- --skipuntil - For continuing incomplete previous scans. Will skip until this file number is encountered. Default: no skips.
- --maxfiles - Only count this many files. Default: scans all files.
Given a CSV of file hashes generated by Hashmaker, will find all duplicates. Will probably need a pass through dupe-dupe-checker afterwards.
- hashcsvfile - Input CSV file containing hashes
- dupefilepath - Output CSV containing all duplicates
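Grouping by hash is the core of this step. A sketch, assuming (path, hash) rows; the real script's pairing logic may differ, which is why a dupe-dupe-checker pass can still be needed:

```python
from collections import defaultdict

def find_dupes(hash_rows):
    """Given (path, hash) rows, return every pair of paths sharing a hash."""
    by_hash = defaultdict(list)
    for path, digest in hash_rows:
        by_hash[digest].append(path)
    pairs = []
    for paths in by_hash.values():
        # emit each unordered pair once
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):
                pairs.append((paths[i], paths[j]))
    return pairs
```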
Runs the whole sequence at once: makes a hashfile, checks for dupes, and removes the duplicate dupes.
- filepath - Path to search for duplicates
- --hashcsvfile - Optional path for the hash CSV file. Default: hashfile.csv
Makes a CSV list of every file type in a path. Additionally, creates a separate list of every file of each file type in the specified folder (defaults to filetypes/)
- filepath - The path to scan
- --csvfile - Output filetype list. Default: "filetypes.csv"
- --filetypefolder - Folder where lists of files of each type will be written. Do not include trailing slash. Default: "filetypes"
- --maxhashreps - How many 16kb chunks to generate hash for each file. Default: the entire file
- --skipuntil - For continuing incomplete previous scans. Will skip until this file number is encountered
- --maxfiles - Only count this many files
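The extension grouping this script performs can be sketched as follows (the function name, grouping key, and return structure are assumptions; the real script also writes its results out as CSVs):

```python
import os
from collections import defaultdict

def list_filetypes(filepath):
    """Walk filepath and group full file paths by lowercased extension."""
    by_type = defaultdict(list)
    for root, _dirs, files in os.walk(filepath):
        for name in files:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            by_type[ext].append(os.path.join(root, name))
    return by_type
```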
Given a CSV dupe file, will generate a summary of the number of matching files in every permutation of folders. Useful for determining whether entire *folders* are duplicates, or close to duplicates, of one another.
- dupefile - The CSV dupe file to use
- outfile - File name of the report to generate
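The folder-level summary amounts to counting dupe pairs per folder combination. A sketch, assuming each dupe row is a pair of file paths (function name and report format assumed):

```python
import os
from collections import Counter

def folder_match_counts(dupe_pairs):
    """Count duplicate pairs per unordered (folder, folder) combination."""
    counts = Counter()
    for a, b in dupe_pairs:
        # sort so (/x, /y) and (/y, /x) count as the same folder pair
        key = tuple(sorted((os.path.dirname(a), os.path.dirname(b))))
        counts[key] += 1
    return counts
```

A folder pair whose count approaches the number of files in either folder suggests the folders are duplicates, or close to it.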
Generates a CSV list of hashes, but filtered by file type. Command line function is no longer part of the main workflow, but included because it may still be useful.
- filetypes - Comma-separated list of all file types you'd like to check, e.g. ".jpg,.gif,.bmp"
- path - The path for which to make hashes
- --hashcsvfile - Optional path for hash CSV file; hashfile.csv is used by default
Merges an entire folder of file type CSVs into a single one with an added column for the CSV file from which each row came.
- path - Folder full of file types
- outfile - Destination for merged CSV file
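A sketch of the merge, assuming every .csv in the folder shares a compatible layout and the source filename goes in a trailing column (function name and column position assumed):

```python
import csv
import glob
import os

def merge_filetype_csvs(path, outfile):
    """Concatenate every .csv in path into outfile, appending a trailing
    column naming the source file each row came from."""
    with open(outfile, "w", newline="") as out:
        writer = csv.writer(out)
        for name in sorted(glob.glob(os.path.join(path, "*.csv"))):
            with open(name, newline="") as f:
                for row in csv.reader(f):
                    writer.writerow(row + [os.path.basename(name)])
```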
Merges two CSV files.
- file1 - The first CSV file
- file1name - Text to be added to each row of file 1, to tell them apart
- file2 - The second CSV file
- file2name - Text to be added to each row of file 2
- outfile - The output CSV file
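A minimal sketch, assuming the label goes in a trailing column (function name assumed):

```python
import csv

def merge_two_csvs(file1, file1name, file2, file2name, outfile):
    """Write file1's rows then file2's rows to outfile, tagging each
    row with the matching label in a trailing column."""
    with open(outfile, "w", newline="") as out:
        writer = csv.writer(out)
        for name, label in ((file1, file1name), (file2, file2name)):
            with open(name, newline="") as f:
                for row in csv.reader(f):
                    writer.writerow(row + [label])
```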
Not a command-line utility. Provides the function used to estimate remaining time during a file scan.
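A simple linear estimator of this kind (the actual model the module uses is unknown; this is just one plausible form, with an assumed function name):

```python
import time

def estimate_remaining(start_time, files_done, files_total, now=None):
    """Estimate seconds remaining, assuming future files take as long
    per file as those already scanned; None until at least one is done."""
    if files_done == 0:
        return None
    elapsed = (time.time() if now is None else now) - start_time
    return elapsed / files_done * (files_total - files_done)
```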