Building a Hebrew FrameNet Lexical Resource from Parallel Movie Subtitles

Summary

This work sets out to acquire Hebrew exemplar sentences with FrameNet annotations by projecting annotations from English. To this end, we use the OpenSubtitles 2016 dataset of aligned English-Hebrew subtitles of movies and television shows.
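To make the idea of annotation projection concrete, here is a minimal sketch of the simplest possible variant: copying each labeled English span onto whichever Hebrew tokens are word-aligned to it. This is not the method used in the thesis. The function name `project_spans`, the toy sentence pair, the alignment, and the labels are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def project_spans(
    english_spans: Dict[str, List[int]],
    alignment: List[Tuple[int, int]],
) -> Dict[str, List[int]]:
    """Map each labeled English token span to the Hebrew tokens aligned to it.

    english_spans: label -> English token indices (hypothetical format).
    alignment: (english_index, hebrew_index) word-alignment pairs.
    """
    # Index the alignment by English token position.
    en_to_he = {}  # english index -> set of hebrew indices
    for en_idx, he_idx in alignment:
        en_to_he.setdefault(en_idx, set()).add(he_idx)

    projected = {}
    for label, en_indices in english_spans.items():
        he_indices = set()
        for en_idx in en_indices:
            he_indices |= en_to_he.get(en_idx, set())
        if he_indices:  # drop labels with no aligned Hebrew token
            projected[label] = sorted(he_indices)
    return projected


# Toy sentence pair (illustrative only, not taken from the dataset).
english_tokens = ["She", "bought", "a", "book"]
hebrew_tokens = ["היא", "קנתה", "ספר"]
alignment = [(0, 0), (1, 1), (3, 2)]  # "a" has no Hebrew counterpart

# Illustrative labels in the style of the FrameNet Commerce_buy frame.
english_spans = {"Buyer": [0], "TARGET": [1], "Goods": [2, 3]}

projected = project_spans(english_spans, alignment)
print(projected)
```

Direct projection of this kind inherits every alignment error, which is one reason a pipeline might filter the projected annotations, for example with a classifier.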

Contents

Thesis in PDF format
Source code for the SRL visualization tool
Jupyter notebooks for dataset construction, dataset statistics, and feature extraction
Two SQLite databases: manual_annotations.sqlite3 (the original seed) and data.sqlite3 (the result of classifier annotation on a subset of the dataset)
Code for the classifier, and a pickled trained classifier
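The schema of the two databases is not described here, so a reasonable first step is to list the tables they contain. The sketch below uses only the standard-library `sqlite3` module. The helper name `list_tables` is hypothetical, and nothing is assumed about table or column names.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables("manual_annotations.sqlite3")` or `list_tables("data.sqlite3")` from the repository root should show what is available to query. Note that `sqlite3.connect` creates an empty file if the path does not exist, so check the path first.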

Aligned SRL Visualization

To run the SRL visualization tool, you need:

Python 3.5 or later (Python 2 is not supported)
The conllu package, for parsing the CoNLL-U format
scikit-learn and imbalanced-learn, for the classifier
The dataset, downloaded from here and placed in the static/dataset directory

The best way to get started is to create a new virtual environment (with either Conda or python3 -m venv <venv name>), run pip install -r requirements.txt inside it, and then run python3 viz.py.
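Before launching the tool, it can help to confirm that the environment meets the requirements. The following is a small sketch using only the standard library. The function names are hypothetical. The module names are the import names of the listed packages: scikit-learn is imported as `sklearn` and imbalanced-learn as `imblearn`.

```python
import importlib.util
import sys

# Import names of the required packages (not their pip names).
REQUIRED_MODULES = ("conllu", "sklearn", "imblearn")


def python_is_supported(version_info=sys.version_info):
    """True if the interpreter is Python 3.5 or later."""
    return tuple(version_info[:2]) >= (3, 5)


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level modules from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.5 or later is required")
    for name in missing_modules():
        print("Missing module: {}".format(name))
```

An empty result from `missing_modules()` only shows that the packages are importable. It does not check that their versions match requirements.txt.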

Data

Statistics

# of subtitles	# of sentences	# of English tokens	# of Hebrew tokens (before segmentation)	# of Hebrew tokens (after segmentation)	English vocabulary size	Hebrew vocabulary size
30,789	23,062,193	194,217,249	118,236,346	188,375,525	1,540,672	894,759
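A few derived figures make the table easier to interpret. The sketch below computes them directly from the numbers above. The dictionary keys are illustrative names, not identifiers from the notebooks.

```python
# Figures copied from the statistics table.
stats = {
    "subtitles": 30789,
    "sentences": 23062193,
    "english_tokens": 194217249,
    "hebrew_tokens_before_segmentation": 118236346,
    "hebrew_tokens_after_segmentation": 188375525,
    "english_vocabulary": 1540672,
    "hebrew_vocabulary": 894759,
}

derived = {
    # Average number of sentences per subtitle file.
    "sentences_per_subtitle": stats["sentences"] / stats["subtitles"],
    # Average sentence length in tokens, per language.
    "english_tokens_per_sentence": stats["english_tokens"] / stats["sentences"],
    "hebrew_tokens_per_sentence_before": (
        stats["hebrew_tokens_before_segmentation"] / stats["sentences"]
    ),
    # How much segmentation increases the Hebrew token count.
    "segmentation_expansion": (
        stats["hebrew_tokens_after_segmentation"]
        / stats["hebrew_tokens_before_segmentation"]
    ),
}

for name, value in sorted(derived.items()):
    print("{}: {:.2f}".format(name, value))
```

This gives roughly 749 sentences per subtitle file, about 8.4 English tokens and 5.1 unsegmented Hebrew tokens per sentence, and a segmentation expansion factor of about 1.59, which brings the Hebrew token count close to the English one.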

Processing

The following diagram shows the data pipeline in our work:

Specifications

The data was processed on a machine with an Intel Xeon E5645 CPU @ 2.40 GHz, 24 cores, and 128 GB of RAM.

Runtime

English POS-tagging and dependency parsing: less than one day
Hebrew segmentation, morphological analysis, morphological disambiguation, and dependency parsing using YAP: one month
Automatic English-Hebrew alignment: one day
English SRL: one week
All other steps: a few minutes each

Repository: bgunlp/hebrew_srl
