Skip to content

wmotte/niqqud

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 

Repository files navigation

Niqqud prediction

DOI

Background

For purposes of disambiguation, a system of diacritical signs (niqqud) is used to represent vowels or distinguish between alternative pronunciations of letters of the Hebrew alphabet. Natural language processing (NLP) methodology, including speech recognition and speech-to-text algorithms would benefit from a model that accurately predicts niqqud.

Dataset

This repository contains scripts to acquire a dataset of corresponding sentences with and without niqqud. The dataset is devided into a set of training (90%), development (5%) and testing (5%) sentences. This will help to train, optimize and compare NLP models. Sentences are extracted from The Sefaria Library, a free and growing library of Jewish texts.

  • Total sentences in training set: 211,638
  • Total sentences in test set: 11,758
  • Total sentences in dev set: 11,758