Colloquial Arabic Tweets: Collection, Automatic Annotation, and Classification


Abstract:

In this paper, we evaluate our curated datasets of Arabic dialect tweets, covering dialects from four main regions: Gulf, Levant, North Africa, and Egypt. There are two main datasets: the Twitter Arabic Dialect dataset and the Twitter Arabic Dialect Emoji (TADE) dataset. The tweets are automatically annotated with one of the four selected dialects using a manually prepared lexicon for each dialect. To validate the resulting corpus, we train traditional (shallow) and deep learning classifiers for dialect classification on a modified version of the TADE dataset. The shallow classifiers include Gradient Boosting, Logistic Regression, Nearest Centroid, Decision Tree, MultinomialNB, SVM, XGB, Random Forest, and AdaBoost. For the deep learning classifiers, we use MLP and CNN. We experiment with TF-IDF and word embeddings as feature representations. We validate the usefulness of our dataset through these experiments. It will be made available to the research community after publication.
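
The abstract does not spell out how the per-dialect lexicons are matched against tweets. As a rough illustration only, the idea of lexicon-based annotation can be sketched as below, assuming whitespace tokenization, exact word matching, and a rule that leaves a tweet unlabeled when no lexicon matches or when two dialects tie. The `LEXICONS` entries and the `annotate` function are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical toy lexicons for illustration; the paper's manually
# prepared lexicons are not reproduced here.
LEXICONS = {
    "Gulf": {"شلونك", "وايد"},
    "Levant": {"هلق", "هيك"},
    "North Africa": {"بزاف", "واش"},
    "Egypt": {"ازيك", "دلوقتي"},
}


def annotate(tweet, lexicons=LEXICONS):
    """Return the dialect whose lexicon matches the most tokens.

    Returns None when no lexicon word occurs in the tweet or when the
    top two dialects have the same number of matches (assumed rule).
    """
    tokens = tweet.split()  # assumption: simple whitespace tokenization
    counts = Counter()
    for dialect, words in lexicons.items():
        counts[dialect] = sum(1 for token in tokens if token in words)

    ranked = counts.most_common(2)
    best_dialect, best_hits = ranked[0]
    if best_hits == 0:
        return None  # no dialect-specific word found
    if len(ranked) > 1 and ranked[1][1] == best_hits:
        return None  # ambiguous: two dialects tie
    return best_dialect
```

With these toy lexicons, a tweet containing only Gulf markers is labeled "Gulf", while a tweet with no lexicon words, or with an equal number of markers from two dialects, is left unlabeled. A real pipeline would likely also need text normalization before matching, which this sketch omits.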
Date of Conference: 04-06 December 2020
Date Added to IEEE Xplore: 07 January 2021
Conference Location: Kuala Lumpur, Malaysia

I. Introduction

The Arabic language presents many challenges to the NLP field in general. As the literature focuses mostly on English, many researchers working on Arabic NLP seek to adapt existing methodologies to better fit the needs of the Arabic language.
