Analysis

Aurelie Herbelot edited this page Jul 8, 2022 · 6 revisions

Once the Multilingual PeARS pipeline has run, its output can be inspected with the script analyze.py. Below we show various aspects of the analysis, together with example outputs from a run of the system on the Simple English Wikipedia.

Cluster analysis

The following command lists the clusters found by the Birch algorithm, applied to the UMAP representations. Each cluster comes with a list of keywords that characterize the topic it encodes.

    python3 analyze.py clusters --lang=simple

The above returns the following output:

## Showing cluster labels retrieved by Birch over hacked UMAP representations (training set) ##
300  clusters found:
166 ▁leap ▁calendar ▁friday ▁month ▁week ▁gregorian ▁day ▁tuesday ▁year ▁wednesday
187 ▁paintings ▁exhibition ▁art ▁museum ▁painting ▁gallery ▁portrait ▁painter ▁abstract ▁painted
87 eolithic ▁cave ▁stones ▁stonehenge ▁linear ▁rh ▁papyrus ▁paint ▁stone ▁archaeolog
28 ▁ozone ethyl ▁fiber ▁furnace ▁bio ▁fuel ▁poly ▁cellulose ▁carbon still
167 ▁leon ▁basque ▁catalan ese ▁juan ▁josé ▁madrid ▁barcelona ▁catalonia mont
98 ▁editor ▁news ▁newspaper ▁magazine ▁journal ▁gill ▁chronic ▁phil ▁journalist ▁newspapers
169 ▁chart billboard ▁debut ▁song ▁album ▁singles ▁released ▁gag ▁rapper ▁charts
75 ▁dam ▁indonesian ▁emirate ▁nile ▁aztec ▁indonesia ▁dhabi ▁nahuatl ▁java ▁tenochtitlan
3 ▁geometry euclidean ▁polygon ▁triangle angular ▁numeral ▁numbers ▁parallel ▁plane ▁binary
296 ▁theorem ▁formula ▁integer ▁function ▁numbers ▁derivative ▁algebra ▁variable ▁equation ▁linear
...
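
The cluster listing above can be post-processed with a few lines of Python. The sketch below is not part of analyze.py: the function name `parse_cluster_line` is hypothetical, and it assumes that the `▁` prefix is a subword marker for word-initial pieces (as in SentencePiece-style vocabularies), which this page does not confirm.

```python
def parse_cluster_line(line):
    """Split a line such as '166 ▁leap ▁calendar' into (166, ['leap', 'calendar'])."""
    cluster_id, *tokens = line.split()
    # Assumption: '▁' (U+2581) marks a word-initial subword piece; strip it for display.
    keywords = [token.lstrip("\u2581") for token in tokens]
    return int(cluster_id), keywords


# Two lines copied from the example output above.
for line in ["166 ▁leap ▁calendar ▁friday ▁month", "87 eolithic ▁cave ▁stones"]:
    print(parse_cluster_line(line))
```

Keywords without the marker, such as `eolithic`, would then be word-internal pieces rather than full words, which explains why some labels look truncated.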

UMAP vector analysis

The following command provides two different outputs.

    python3 analyze.py umap --lang=en

The first output shows the distribution of documents across clusters in a single, randomly chosen Wiki dump file. Here is a truncated example for the Simple English Wikipedia:

## Showing distribution of articles across clusters (UMAP representations) ##

166 ['april', 'august', 'december', 'february', 'january', 'june', 'july', 'leap year', 'march', 'may', ...]
187 ['art', 'frida kahlo', 'vincent van gogh', 'jackson pollock', 'artist', 'joseph beuys', ...]
87 ['a', 'archaeology', 'geography', 'writing', 'stone age', 'stonehenge', 'iron age', 'prehistory', ...]
28 ['air', 'glass', 'oil', 'plastic', 'soap', 'water', 'argon', 'fuel', 'coal', 'petroleum', 'natural gas', ...]
167 ['autonomous communities of spain', 'madrid', 'o canada', 'basque country (greater region)', 'barcelona', ...]
98 ['alan turing', 'alexander graham bell', 'michael moore', 'the salvation army', 'spelling bee', 'dan brown', ...]
169 ['alanis morissette', 'bob marley', 'abba', 'michael jackson', 'christina aguilera', '50 cent', 'westlife', ...]
3 ['arithmetic', 'circle', 'cube', 'dimension', 'geometry', 'graph theory', 'mathematics', 'movement', 'number', ...]
 ... 
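
A distribution like the one above amounts to grouping article titles by their cluster label. The following is a minimal sketch of that grouping, not the code used by analyze.py; the function name `group_by_cluster` and the parallel-list input format are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_cluster(titles, labels):
    """Map each cluster label to the list of article titles assigned to it."""
    clusters = defaultdict(list)
    for title, label in zip(titles, labels):
        clusters[label].append(title)
    return dict(clusters)


# Titles and labels taken from the example output above.
titles = ["april", "art", "august", "frida kahlo"]
labels = [166, 187, 166, 187]
for label, members in group_by_cluster(titles, labels).items():
    print(label, members)
```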

The second output shows the 10 nearest neighbours for a document query, taken out of a small subset of 10,000 Simple Wikipedia documents (in unsorted order):

## Showing articles similar to some query (UMAP representations) ##
1 august
['june', 'leap year', 'gregorian calendar', 'march', 'may', 'julian calendar', 'september', 'common year', 'january']

2 art
['hans sebald beham', 'surrealism', 'artist', 'poster', 'tempera', 'graffiti', 'johannes gutenberg', 'expressionism', 
'joseph beuys']

3 a
['writing', 'omega', 'nato phonetic alphabet', 'letter', 'alphabetical order', 'orthography', 'romanization', 'z', 'ß']

4 air
['biodiesel', 'water', 'water (molecule)', 'liquid', 'fuel cell', 'states of matter', 'heat conduction', 'greenhouse gas',
 'carbon dioxide']

5 autonomous communities of spain
['seville', 'lesotho', 'san marino', 'galician language', 'cotonou', 'galicia (spain)', 'aragonese language', 
'basque country (greater region)', 'basque language']

6 alan turing
['kaspar hauser', 'zimmermann telegram', 'remembrance day', 'knight bachelor', 'guild', 'charles dickens', 'alexander graham bell', 
'horatio nelson', 'samuel pepys']

7 alanis morissette
['nine inch nails', 'garth brooks', 'gwen stefani', 'whitney houston', 'n-dubz', 'christina aguilera', 'olivia newton-john', 
'justin timberlake', 'bob marley']

8 farming
['venus figurines', 'plantation', 'neolithic revolution', 'nile', 'nomadic people', 'steppe', 'crop', 'history of asia', 
'brick']

9 arithmetic
['matrix (mathematics)', 'divisor', 'exponential function', 'boolean algebra', 'remainder', 'factorization', 'division (mathematics)', 
'order of operations', 'mental calculation']
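
Nearest-neighbour retrieval of this kind can be sketched as follows. This page does not state which similarity measure the system uses over UMAP vectors, so cosine similarity is an assumption here, as are the function names and the two-dimensional toy vectors, which are made up purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, vectors, k=10):
    """Return the k titles most similar to `query`, excluding the query itself."""
    scores = [(title, cosine(vectors[query], vec))
              for title, vec in vectors.items() if title != query]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [title for title, _ in scores[:k]]


# Toy vectors, invented for this example only.
toy = {"august": [1.0, 0.1], "june": [0.9, 0.2], "art": [0.1, 1.0]}
print(nearest("august", toy, k=2))
```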

Fruit Fly analysis

Finally, it is possible to inspect the nearest neighbours of documents using their binary hashes. In this case, similarity is computed using Hamming distance, which makes the computation particularly efficient. Again, we show examples of neighbours computed over the Simple English Wikipedia, with hashes of 256 bits:

april
[('february', 0.984375), ('november', 0.984375), ('1700', 0.984375), ('june', 0.9765625), ('september', 0.9765625), 
('december', 0.9765625), ('leap year', 0.9765625), ('easter', 0.9765625), ('christmas', 0.9765625)]

australia
[('ancient australia', 0.984375), ('tasmania', 0.9609375), ('torres strait islanders', 0.9609375), ('emu', 0.9609375),
('kangaroo', 0.96875), ('thylacine', 0.96875), ('band (anthropology)', 0.96875), ('history of australia', 0.9609375), 
('benalla, victoria', 0.953125)]

american english
[('african-american vernacular english', 0.984375), ('ido', 0.984375), ('grammar', 0.9765625), ('tense (grammar)', 0.9765625), 
('accent', 0.9765625), ('demonym', 0.9765625), ('interlingua', 0.9765625), ('vowel', 0.9765625), ('tatar language', 0.9765625)]

abbreviation
[('morphology (linguistics)', 0.9921875), ('synonym', 0.9921875), ('radical (chinese character)', 0.9921875), 
('braille', 0.9921875), ('interpreter', 0.9921875), ('sight-reading', 0.984375), ('categorical imperative', 0.984375), 
('readability', 0.984375), ('audiolingual method', 0.984375)]

angel
[('paul the apostle', 0.9921875), ('epistle to the galatians', 0.9921875), ('book of judith', 0.9921875), ('prayer', 0.984375), 
('gospel of john', 0.984375), ('fall of man', 0.984375), ('book of genesis', 0.984375), ('bible', 0.984375), ('devil', 0.984375)]

ad hominem
[('truth', 0.9765625), ('socrates', 0.9765625), ('fallacy', 0.9765625), ('knowledge', 0.96875), ('the republic', 0.96875), 
('phrase', 0.96875), ('reductio ad absurdum', 0.96875), ('aesthetics', 0.96875), ('logic', 0.96875)]
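
The scores above are consistent with a similarity of 1 - d/256, where d is the number of differing bits: 0.984375 corresponds to 4 differing bits and 0.9921875 to 2. The sketch below illustrates that computation under the assumption that each hash is stored as a Python integer; the function name `hamming_similarity` is hypothetical and the actual implementation may differ.

```python
def hamming_similarity(a, b, n_bits=256):
    """Fraction of the n_bits positions on which two binary hashes agree."""
    # XOR leaves a 1 wherever the two hashes differ; count those bits.
    differing = bin(a ^ b).count("1")
    return 1 - differing / n_bits


# Two illustrative 256-bit hashes that differ in exactly four positions.
h1 = (1 << 256) - 1
h2 = h1 ^ 0b1111
print(hamming_similarity(h1, h2))  # 0.984375
```

Because the comparison reduces to one XOR and a bit count, it stays cheap even when a query is compared against many documents.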