gutengrep

Find whole sentences matching a regex in Project Gutenberg plain text files.
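As a rough illustration of the idea, the sketch below splits text into sentences and keeps the ones in which a regex matches. The splitter is a deliberately naive stand-in and the names `split_sentences` and `find_sentences` are hypothetical; this is not gutengrep's own code.

```python
import re

# Naive rule (an assumption for this sketch): a sentence ends at '.', '!' or '?'
# followed by whitespace. Abbreviations such as "Mr." will be split wrongly.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Collapse line breaks and runs of spaces, then split into rough sentences."""
    flat = " ".join(text.split())
    return [s for s in SENTENCE_END.split(flat) if s]


def find_sentences(pattern, text, flags=0):
    """Return every whole sentence in which the regex matches somewhere."""
    regex = re.compile(pattern, flags)
    return [s for s in split_sentences(text) if regex.search(s)]


sample = "Call me Ishmael. The whale\nsurfaced at dawn! Nobody spoke."
matches = find_sentences("whale", sample)
print(matches)
```

The match is reported as a whole sentence even though the source text wraps it across two lines, which is the main difference from a line-based grep.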

Example commands

gutengrep.py "^[^\w]*And then" "*.txt" --cache --sort --correct -o output/and-then.txt

gutengrep.py "^[^\w]*But why" "*.txt" --cache --sort --correct -o output/but-why.txt

gutengrep.py -i "whale" moby11.txt --sort --correct -o out\mobydick-whale.txt
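The pattern `^[^\w]*And then` in the first command anchors the match to the start of a sentence while allowing leading non-word characters such as an opening quotation mark. The snippet below checks this with Python's `re` module on a few made-up sentences. Treating `-i` as a case-insensitive search (`re.IGNORECASE`) is an assumption based on the usual grep convention.

```python
import re

pattern = r"^[^\w]*And then"

# Made-up sentences, for illustration only.
sentences = [
    "And then the rain stopped.",
    '"And then what?" she asked.',
    "He waited, and then he left.",
    "and then nothing.",
]

# Case-sensitive: only sentences that begin with "And then",
# optionally preceded by punctuation or quotes.
hits = [s for s in sentences if re.search(pattern, s)]

# Case-insensitive (assumed equivalent of -i): also accepts "and then".
hits_ignore_case = [s for s in sentences if re.search(pattern, s, re.IGNORECASE)]

print(hits)
print(hits_ignore_case)
```

The third sentence is rejected in both cases because "and then" appears mid-sentence, after word characters.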

Example output

| Name | Sorted | Regex | Input | Word count |
|---|---|---|---|---|
| But why? | But why? | `^[^\w]*But why` | `*.txt` | 7,572 |
| And then! | And then! | `[^\w]*And then` | `*.txt` | 85,014 |
| The whale | The whale | `whale` | `moby11.txt` | 50,913 |
| Why | Why | `[^\w]*Why` | `*.txt` | 184,832 |
| Once upon a time | Once upon a time | `-i once upon a time` | `*.txt` | 6,195 |
| The End | The End | `-i the end\.` | `*.txt` | 142,94 |
| Happily ever after | Happily ever after | `-i happily ever after` | `*.txt` | 271 |
| Moonlit | Moonlit | `-i moonlit` | `*.txt` | 52,345 |
| Moonlight | Moonlight | `-i moonlight` | `*.txt` | 3,186 |

See also nanogenmo.md.

Tips

Download the Project Gutenberg August 2003 CD (download and mount the ISO file), then copy all the text files from its 'etext' directories into a single directory on your hard drive.
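The copy step can be scripted. The sketch below is one possible way to do it, assuming the mounted CD has top-level directories whose names start with `etext` and that the books are `.txt` files (in any letter case). The function name `collect_texts` and the example paths are hypothetical.

```python
import shutil
from pathlib import Path


def collect_texts(cd_root, dest):
    """Copy every .txt file found under the 'etext*' directories into dest."""
    cd_root, dest = Path(cd_root), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # Assumption: the directories of interest sit at the top level of the CD.
    for etext_dir in sorted(cd_root.glob("etext*")):
        if not etext_dir.is_dir():
            continue
        for src in sorted(etext_dir.rglob("*")):
            if not src.is_file() or src.suffix.lower() != ".txt":
                continue
            target = dest / src.name
            if target.exists():
                continue  # keep the first copy rather than overwrite it
            shutil.copy2(src, target)
            copied.append(target)
    return copied


# Example call with hypothetical paths:
# collect_texts("/Volumes/PG2003", "texts")
```

Flattening everything into one directory is what lets a single file spec such as `"*.txt"` cover the whole corpus.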

When working on the whole corpus, use --cache to cut down on file operations. The first run builds a cache file of all tokenised sentences. On my MacBook Pro this first pass takes about 5 minutes to go through the 597 books of the Project Gutenberg CD and extract their 3,583,390 sentences. Subsequent runs using the cache take about 40 seconds.
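The general shape of such a cache can be sketched as follows. This is a minimal illustration using `pickle`, not gutengrep's actual cache format; `tokenise`, `load_sentences` and the cache file name are hypothetical.

```python
import pickle
import re
from pathlib import Path


def tokenise(text):
    """Very rough sentence splitter, used only for this sketch."""
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def load_sentences(paths, cache_file="sentences.pickle"):
    """Return all sentences from paths, reusing cache_file when it exists."""
    cache = Path(cache_file)
    if cache.exists():
        # Fast path: one read instead of opening and tokenising every book.
        with cache.open("rb") as f:
            return pickle.load(f)
    sentences = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        sentences.extend(tokenise(text))
    with cache.open("wb") as f:
        pickle.dump(sentences, f)
    return sentences
```

Note that this sketch ignores `paths` once the cache file exists, so a later call with a different set of files still returns the cached sentences.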

If searching just a single file, or a subset of files, do not use --cache: it will use the cache file generated from the initial file spec rather than the files you name.