
Dupefinder

Main Workflow

dupefinder-all-in-one.py

This script handles the entire process in one step.

Note: the all-in-one process is best suited to situations where there's minimal chance the process will be interrupted, such as scanning a local drive.

  1. Choose a folder you'd like to scan for duplicates
  2. Copy the path to that folder. This will be the first parameter (e.g. "/Folder")
  3. Choose a name for the output .csv file that will list all suspected duplicate files (e.g. "folder-2024-05-01-dupes.csv"). This is the second parameter
  4. Choose a name for this job (e.g. "folder-2024-05-01"). This is the third parameter
  5. Run it:
python3 dupefinder-all-in-one.py /Folder folder-2024-05-01-dupes.csv folder-2024-05-01

Optional parameters

  • --csvfile - List of all filetypes the script encounters. Default: "filetypes.csv".
  • --maxhashreps - How many 16kb chunks to hash for each file. Lower values generate a hash from a smaller portion of each file, which can save time but may also produce false positives; see the sketch after this list. Default: the entire file.
  • --skipuntil - For continuing incomplete previous scans. Will skip until this file number is encountered. Default: no skips.
  • --maxfiles - Only count this many files. Default: scans all files.
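
The chunked hashing that --maxhashreps controls works roughly like this (a minimal sketch of the idea, not the script's actual code; the 16kb chunk size is from the description above, while MD5 and the function name are assumptions):

import hashlib

def partial_hash(path, maxhashreps=None, chunk_size=16 * 1024):
    # Hash a file in 16kb chunks, stopping after maxhashreps chunks.
    # maxhashreps=None hashes the entire file, matching the default.
    h = hashlib.md5()  # assumption: the README doesn't name the algorithm
    reps = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
            reps += 1
            if maxhashreps is not None and reps >= maxhashreps:
                break  # only a prefix was hashed: faster, but can false-positive
    return h.hexdigest()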

Alternate workflow

In the event that the process might be interrupted (such as with a network share), it may be desirable to run the steps one at a time.

  1. Run filetype-finder.py. Specify the folder you'd like to scan as the first parameter. You may also want to specify a nickname for this scan, e.g. "folder-2024-05-01".
    • If the file-finder operation is interrupted, you can restart the scan at the last successful file number using --skipuntil. Be sure to save it to a new .csv file and append the .csv files before running dupefinder.py.
  2. Run dupefinder.py. The first parameter is the nickname specified above with "-all.csv" appended to it, e.g. "folder-2024-05-01-all.csv". The second parameter is the output file that will list all detected duplicates, e.g. "folder-2024-05-01-dupes.csv".
  3. Run dupe_dupe_checker.py. The dupefinder process sometimes finds the same duplicate twice, but in opposite directions; in other words, file X matches file Y and file Y matches file X. This step detects and eliminates such entries. The only parameter is the output of the dupefinder step; the original file is overwritten. Example commands for all three steps are shown below.
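
For example (a sketch of the three steps; it assumes the nickname is filetype-finder.py's second parameter, which the utility reference below doesn't spell out):

python3 filetype-finder.py /Folder folder-2024-05-01
python3 dupefinder.py folder-2024-05-01-all.csv folder-2024-05-01-dupes.csv
python3 dupe_dupe_checker.py folder-2024-05-01-dupes.csv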

Utilities

csvscramble.py - CSV Scramble

Reads a CSV file and outputs it in random row order. Technically works with any line-delimited file.

  • in_name - Input file
  • out_name - Scrambled output file
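
Row scrambling of this kind can be done in a few lines (a sketch of the idea, not necessarily csvscramble.py's actual implementation):

import random
import sys

# Read every line of in_name, shuffle, and write them to out_name in
# random order. Note that a header row, if present, gets shuffled too.
with open(sys.argv[1]) as f:       # in_name
    rows = f.readlines()
random.shuffle(rows)
with open(sys.argv[2], "w") as f:  # out_name
    f.writelines(rows)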

dupe_dupe_checker.py - Dupe Dupe Checker

Removes all dupes from a list that are simply a reverse of another dupe.

  • dupecsvfile - The CSV file containing the list of dupes. Required. Overwrites the existing file.
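
The reversed-pair check amounts to treating each row's two paths as an unordered pair and keeping only the first occurrence (a sketch under the assumption that the first two CSV columns hold the two file paths):

def drop_reversed_dupes(rows):
    # Keep one row per unordered pair of paths; a (Y, X) row is dropped
    # once an (X, Y) row has been seen. Assumes columns 0 and 1 are the
    # two file paths; the real CSV layout may differ.
    seen = set()
    kept = []
    for row in rows:
        key = frozenset(row[:2])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept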

dupefinder-all-in-one.py

  • filepath - The path to scan
  • dupefile - Output CSV containing all duplicates
  • filetypefolder - Folder where lists of files of each type will be written. Do not include trailing slash. Default: "filetypes"
  • --csvfile - List of all filetypes the script encounters. Default: "filetypes.csv".
  • --maxhashreps - How many 16kb chunks to hash for each file. Lower values generate a hash from a smaller portion of each file, which can save time but may also produce false positives. Default: the entire file.
  • --skipuntil - For continuing incomplete previous scans. Will skip until this file number is encountered. Default: no skips.
  • --maxfiles - Only count this many files. Default: scans all files.

dupefinder.py - Dupefinder

Given a CSV of file hashes generated by Hashmaker, finds all duplicates. The output will probably need a pass through dupe_dupe_checker.py afterwards.

  • hashcsvfile - Input CSV file containing hashes
  • dupefilepath - Output CSV containing all duplicates
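
Finding duplicates from a hash CSV boils down to grouping paths by hash and pairing up each group's members (a sketch of the idea; the (path, hash) column layout is an assumption):

from collections import defaultdict
from itertools import combinations

def find_dupes(hash_rows):
    # Group file paths by hash, then emit every pair sharing a hash.
    # combinations() yields each unordered pair once; the real script
    # apparently can emit both directions, hence dupe_dupe_checker.py.
    by_hash = defaultdict(list)
    for path, digest in hash_rows:
        by_hash[digest].append(path)
    for digest, paths in by_hash.items():
        for a, b in combinations(paths, 2):
            yield a, b, digest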

file-types-compare.py - File Types Compare

Runs the whole sequence at once: makes a hashfile, checks for dupes, and removes the duplicate dupes.

  • filepath - Path to search for duplicates
  • --hashcsvfile - Optional path for the hash CSV file. Default: hashfile.csv

filetype-finder.py

Makes a CSV list of every file type in a path. Additionally, writes a separate list of every file of each type into the specified folder (defaults to filetypes/).

  • filepath - The path to scan
  • --csvfile - Output filetype list. Default: "filetypes.csv"
  • --filetypefolder - Folder where lists of files of each type will be written. Do not include trailing slash. Default: "filetypes"
  • --maxhashreps - How many 16kb chunks to generate hash for each file. Default: the entire file
  • --skipuntil - For continuing incomplete previous scans. Will skip until this file number is encountered
  • --maxfiles - Only count this many files
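
Collecting the file types amounts to walking the tree and bucketing paths by extension (a sketch of the core idea only; the real script also writes the per-type lists and the filetypes.csv summary):

import os
from collections import defaultdict

def collect_filetypes(filepath):
    # Walk the tree under filepath and bucket full paths by extension.
    by_ext = defaultdict(list)
    for root, _dirs, files in os.walk(filepath):
        for name in files:
            ext = os.path.splitext(name)[1].lower()  # e.g. ".jpg"
            by_ext[ext].append(os.path.join(root, name))
    return by_ext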

folder-compare.py

Given a CSV dupe file, generates a summary of the number of matching files in every permutation of folders. Useful for determining whether entire *folders* are duplicates, or close to duplicates, of one another.

  • dupefile - The CSV dupe file to use
  • outfile - File name of the report to generate
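
Counting matches per folder pair can be done by reducing each dupe row to its two parent folders (a sketch; it assumes the first two columns of the dupe file are file paths):

import os
from collections import Counter

def folder_match_counts(dupe_rows):
    # Count how many dupe pairs fall into each folder/folder combination.
    # Sorting the pair makes (A, B) and (B, A) count as the same bucket.
    counts = Counter()
    for row in dupe_rows:
        a, b = os.path.dirname(row[0]), os.path.dirname(row[1])
        counts[tuple(sorted((a, b)))] += 1
    return counts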

hashmaker.py - Hashmaker

Generates a CSV list of hashes, filtered by file type. The command-line function is no longer part of the main workflow, but it's included because it may still be useful.

  • filetypes - Comma-separated list of all file types you'd like to check, e.g. ".jpg,.gif,.bmp"
  • path - The path for which to make hashes
  • --hashcsvfile - Optional path for hash CSV file; hashfile.csv is used by default

mergecsvfiles.py - Merge All CSV Files

Merges an entire folder of file type CSVs into a single one with an added column for the CSV file from which each row came.

  • path - Folder of file type CSVs to merge
  • outfile - Destination for merged CSV file
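
The merge is essentially a concatenation with one extra provenance column (a sketch, not necessarily the script's exact behavior):

import csv
import glob
import os
import sys

# Concatenate every CSV in the folder given as the first argument,
# appending each source file's name as an extra column, and write the
# result to the outfile given as the second argument.
path, outfile = sys.argv[1], sys.argv[2]
with open(outfile, "w", newline="") as out:
    writer = csv.writer(out)
    for name in sorted(glob.glob(os.path.join(path, "*.csv"))):
        with open(name, newline="") as f:
            for row in csv.reader(f):
                writer.writerow(row + [os.path.basename(name)])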

mergecsv.py - CSV Merge

Merges two CSV files.

  • file1 - The first CSV file
  • file1name - Text to be added to each row of file 1, to tell them apart
  • file2 - The second CSV file
  • file2name - Text to be added to each row of file 2
  • outfile - The output CSV file

time_estimate.py

Not a command-line utility. Provides the function used to estimate remaining time during a file scan.
