Zarharan/PersianStanceDetection

Persian Stance Classification

This repository releases a Persian dataset that can be used for several NLP tasks related to fact-checking. Although the dataset also supports fact-checking and summarization, this work focuses on stance classification as a stepping stone toward fake news detection in Persian.

To build the dataset, we first collected articles and then assigned three labels for each claim: the stance of the article body toward the claim (article-claim stance), the stance of the article's headline toward the claim (headline-claim stance), and the stance of the article body toward its own headline (article-headline stance). The article-claim stance is released as ArticleToClaim.txt and the headline-claim stance as HeadlineToClaim.txt. We have also released FullDataset.txt, which can be used for stance detection and for fake news or rumor detection in Persian.

Embedding

For text embedding, we built an embedding matrix from fastText vectors using the create_embedding_matrix function in LSTMPersianStance_HeadToClaim.ipynb and saved the resulting dictionary as w2v_persian.pkl. The saved embeddings are then loaded from this file whenever they are needed.
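
The save-and-reload workflow described here can be illustrated with a minimal sketch. This is not the notebook's create_embedding_matrix function: build_embedding_dict, the toy vectors, and the zero-vector fallback for unknown words are all assumptions made for illustration, and a plain dictionary stands in for the fastText model so that the example runs without extra dependencies.

```python
import os
import pickle
import tempfile


def build_embedding_dict(vocabulary, lookup, dim):
    """Map each word to its vector (hypothetical helper, not the notebook's code).

    Words missing from `lookup` get a zero vector of length `dim`;
    this fallback is an assumption, not the repository's behavior.
    """
    embeddings = {}
    for word in vocabulary:
        vector = lookup.get(word)
        embeddings[word] = list(vector) if vector is not None else [0.0] * dim
    return embeddings


def save_embeddings(embeddings, path):
    """Serialize the word-to-vector dictionary with pickle."""
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)


def load_embeddings(path):
    """Load a previously pickled word-to-vector dictionary."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy stand-in for fastText vectors (made-up values, 3 dimensions).
toy_vectors = {"خبر": [0.1, 0.2, 0.3], "ادعا": [0.4, 0.5, 0.6]}
vocabulary = ["خبر", "ادعا", "ناشناخته"]

embeddings = build_embedding_dict(vocabulary, toy_vectors, dim=3)

# Write to a temporary directory so the sketch does not touch the real file.
path = os.path.join(tempfile.mkdtemp(), "w2v_persian.pkl")
save_embeddings(embeddings, path)
restored = load_embeddings(path)
```

Computing the embeddings once and reloading the pickle avoids rebuilding them from the full fastText model on every run. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.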

Annotation Guideline

We prepared an annotation guideline in both English and Persian, consisting of notes, suggestions, and examples for the stance labels. GuidLine_FA.pdf contains the Persian guideline and GuideLine_EN.pdf contains the English guideline.

The Dataset License

Our Persian stance classification dataset is provided under the CC BY-NC license. You can read more about this license here.

The Related Paper

Our paper, which describes the process of building the dataset in detail and reports full results, is available at https://truthandtrustonline.files.wordpress.com/2019/10/paper_30.pdf