Maha-J-Althobaiti/Ara_Eng_Parallel_Corpus

Arabic-English Parallel Corpus

We used a comparable corpus to automatically construct a parallel corpus between Arabic and English.

We used the Arabic and English editions of Wikipedia, a freely available source of comparable text. The resulting Arabic-English corpus consists of 105,010 parallel sentences, totalling 4.6M words.

The Arabic-English parallel corpus is available as plain text files in Moses format.
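Moses format conventionally means two plain-text files, one per language, where line N of one file is the translation of line N of the other. The following is a minimal sketch of reading such a pair of files in Python. It assumes UTF-8 encoding and one sentence per line; the function name `read_parallel` and the file names in the usage note are hypothetical and are not taken from the corpus distribution.

```python
from itertools import zip_longest


def read_parallel(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumes the Moses convention: line N of src_path is parallel to
    line N of tgt_path. Raises ValueError if the files differ in length.
    """
    with open(src_path, encoding=encoding) as src, \
            open(tgt_path, encoding=encoding) as tgt:
        # zip_longest pads the shorter file with None, which lets us
        # detect misaligned files instead of silently truncating.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")
```

For example, with hypothetical files `corpus.ar` and `corpus.en`, `for ar, en in read_parallel("corpus.ar", "corpus.en"): ...` iterates over the sentence pairs in order.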

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).

https://creativecommons.org/licenses/by-nc-nd/4.0/

You are free to:
	Share — copy and redistribute the material in any medium or format 

Under the following terms:
	Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
	NonCommercial — You may not use the material for commercial purposes.
	NoDerivatives — If you remix, transform, or build upon the material, you may not distribute the modified material. 

The corpus is available on request

We will be glad to share the corpus with you. Please contact us by email: (m.j.althobaiti AT gmail dot com) or (maha.j AT tu dot edu dot sa)

Below is a sample of Arabic sentences and their corresponding English sentences from our parallel corpus:

1- يتطلب أداء علم العلاج عن بعد أن يختار أخصائي علم الأمراض صور الفيديو للتحليل وتقديم التشخيصات .
2- في عام 1921 أصبح أصغر طبيب يرأس القسم الطبي في جامعة ألمانية .
3- ولقد كان السهل الساحلي في ذلك الزمان أضيق مما هو عليه اليوم وكان مغطى بنباتات السافانا .
4- يلعب الفريق مبارياته في ملعب مصطفى تشاكر في البليدة أو في ملعب 5 يوليو الذي يقع في الجزائر العاصمة .

1- Performance of telepathology requires that a pathologist selects the video images for analysis and the rendering of diagnoses .
2- In 1921 he became the youngest doctor to chair the medical department of a German university .
3- The Coastal Plain was then narrower than it is today and was covered with savannah vegetation .
4- The team plays their home matches at the Mustapha Tchaker Stadium in Blida and Stade du 5 Juillet in Algiers .

Published Work

More information about how the parallel corpus was collected, along with its statistics, is available in our published paper, "A Simple Yet Robust Algorithm for Automatic Extraction of Parallel Sentences: A Case Study on Arabic-English Wikipedia Articles".

@article{althobaiti2021simple,
  title={A Simple Yet Robust Algorithm for Automatic Extraction of Parallel Sentences: A Case Study on Arabic-English Wikipedia Articles},
  author={Althobaiti, Maha Jarallah},
  journal={IEEE Access},
  volume={10},
  pages={401--420},
  year={2021},
  publisher={IEEE}
}

Althobaiti, Maha Jarallah. "A Simple Yet Robust Algorithm for Automatic Extraction of Parallel Sentences: A Case Study on Arabic-English Wikipedia Articles." IEEE Access 10 (2021): 401-420.
