GitHub - pealco/neighborhooding-coarsecoding: Some scripts for finding the phonological neighborhood and coarsecode cache sizes for words in a corpus.

pealco / neighborhooding-coarsecoding Public

Notifications You must be signed in to change notification settings
Fork 0
Star 5

Some scripts for finding the phonological neighborhood and coarsecode cache sizes for words in a corpus.

pealco.net/code

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
README		README
corpus.txt		corpus.txt
mapper.py		mapper.py
output.txt		output.txt
reducer.py		reducer.py
test.txt		test.txt

Repository files navigation

These are some scripts for finding phonological neighborhood and PFNA hash cache sizes.

It expects a corpus of words in the format below. That is, a word's spelling followed by its phonetic transcription in ASCII IPA. In this example, the phones in the transcription have a space between them, but this is not strictly necessary.

REALLOWANCE r ij @ l aU @ n s 
REALLY r ij l ij 
REALM r E l m 
REALPOLITIK r ij l p O l @ t I k 
REALTOR r ij l t R 

The scripts are named mapper.py and reducer.py because they were originally written to be used in a MapReduce context. They work just fine (and quickly) without MapReduce, however.

mapper.py
This script does the neighborhooding. Run alone, it will output the size of a word's lexical neighborhood. This is defined as all words in the corpus that are within edit distance (Levenshtein distance) 1.

reducer.py
This script translates ASCII IPA into PFNA coarsecode and calculates the size of a code's hash cache. This script requires input from mapper.py. It cannot be run alone. 

Usage:

cat corpus.txt | mapper.py > output.txt
cat corpus.txt | mapper.py | reducer.py > output.txt