uiuc-sst/crawl-wiki

crawl-wiki

Webscrape text for making a word list and a language model suitable for Kaldi ASR.

To collect about 60 files named wikipedia/*/yyyymmdd.txt, run crawl_wikipedia_all_lang.

Todo

Filter the scraped text further, so that it is appropriate for pseudo-Swahili ASR:

  • Replace numbers with newlines.
  • Cull lines with fewer than 3 words?
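The two filtering ideas above could be sketched roughly as follows. This is only an illustration, not code from this repository: the function name `filter_text`, the `min_words` parameter, and the choice to treat digit runs with embedded `.` or `,` as a single number are all assumptions.

```python
import re

# Assumption: a "number" is a run of digits, optionally with embedded
# '.' or ',' separators (e.g. "1963", "1,234.5").
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def filter_text(text, min_words=3):
    """Replace each number with a newline, then drop short lines.

    Lines with fewer than `min_words` whitespace-separated words are culled.
    """
    split_on_numbers = _NUMBER.sub("\n", text)
    kept = []
    for line in split_on_numbers.splitlines():
        words = line.split()
        if len(words) >= min_words:
            # Normalize internal whitespace while we are here.
            kept.append(" ".join(words))
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "Nairobi ni mji mkuu wa Kenya 1963 uhuru\nhabari 2 zaidi"
    print(filter_text(sample))
```

Splitting on numbers first means a long sentence containing a year or a quantity becomes two shorter lines, and the word-count cull then discards whichever fragments are too short to be useful.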

Scrape more than just the top page?
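One possible way to go beyond the top page is to collect the same-site links from the page that was already fetched and queue them for scraping. The sketch below is an assumption about how that might look, not part of this repository: `LinkCollector` and `same_site_links` are hypothetical names, the URL in the usage example is only illustrative, and fetching, rate limiting, and crawl depth are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute, same-host links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(self.base_url, href).split("#")[0]
        same_host = urlparse(absolute).netloc == urlparse(self.base_url).netloc
        if same_host and absolute not in self.links:
            self.links.append(absolute)


def same_site_links(html, base_url):
    """Return the de-duplicated same-host links found in `html`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    page = '<a href="/wiki/Kenya">Kenya</a> <a href="https://example.org/">x</a>'
    print(same_site_links(page, "https://sw.wikipedia.org/wiki/Mwanzo"))
```

Restricting links to the same host keeps the crawl inside one language edition, which matters if each output file is meant to hold text from a single language.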
