
user0706/Zero


About Zero

Zero is a simple tool for creating a dataset from a known corpus and a set of desired keywords. To create a dataset, you need to define the corpus, the output file, a label, and one or more keywords.

- Input corpus

Select a directory containing one or more .txt documents, preferably UTF-8 encoded. To avoid memory problems, the directory should contain many smaller documents rather than one large one.
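As a rough illustration of why many small files are easier on memory, the following Python sketch reads a corpus directory one document at a time. It is not taken from Zero's source: the function name `iter_documents` is hypothetical, and the choice to replace undecodable bytes instead of failing is an assumption.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_documents(corpus_dir: str) -> Iterator[Tuple[str, str]]:
    """Yield (file name, text) for each .txt file, one file at a time."""
    # A generator holds only the current document in memory.
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # Assumption: bytes that are not valid UTF-8 are replaced, not fatal.
        yield path.name, path.read_text(encoding="utf-8", errors="replace")
```

Because each file is read in full, peak memory use follows the size of the largest single document, not the size of the whole corpus.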

- Output file

The output file must be a comma-delimited, UTF-8 encoded .csv file.
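A minimal sketch of writing such a file with Python's standard `csv` module is shown here. The function name `write_dataset` and the two-column layout with a `sentence,label` header are assumptions for illustration; the README does not specify the column layout Zero produces.

```python
import csv
from typing import Iterable, Tuple


def write_dataset(rows: Iterable[Tuple[str, str]], output_path: str) -> int:
    """Write (sentence, label) rows to a comma-delimited UTF-8 CSV file."""
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(output_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter=",")
        writer.writerow(["sentence", "label"])  # assumed header row
        for sentence, label in rows:
            # Sentences containing commas are quoted automatically.
            writer.writerow([sentence, label])
            count += 1
    return count
```

Using `csv.writer` instead of joining strings by hand keeps sentences that contain commas or quotes from breaking the column structure.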

- Label

The label defines the class, i.e. the category assigned to every sentence that contains a desired keyword.

- Keyword/s

A keyword entry may consist of one or more words. Words must be separated by a punctuation mark (preferably a comma).
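The way a label and keywords combine can be sketched as follows. This is an illustration under stated assumptions, not Zero's actual implementation: `parse_keywords` and `label_sentences` are hypothetical names, the set of accepted separators is a guess, sentences are split naively on `.`, `!` and `?`, and matching is case-insensitive on whole words.

```python
import re
from typing import Iterator, List, Tuple


def parse_keywords(raw: str) -> List[str]:
    """Split a keyword entry on punctuation and drop empty pieces."""
    # Assumed separators; a comma is the preferred one.
    parts = re.split(r"[,;:.!?|/]+", raw)
    return [part.strip().lower() for part in parts if part.strip()]


def label_sentences(
    text: str, keywords: List[str], label: str
) -> Iterator[Tuple[str, str]]:
    """Yield (sentence, label) for every sentence containing a keyword."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lowered = sentence.lower()
        # Whole-word match, so "cat" does not match "category".
        if any(re.search(rf"\b{re.escape(k)}\b", lowered) for k in keywords):
            yield sentence, label
```

For example, with the entry `"cat, dog"` and the label `"animal"`, the text `"The cat sat. A category exists."` yields only `("The cat sat.", "animal")`.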


To-Do

  • Duplicate keyword detection
  • Code adaptation for big data