Abstract:
Social media platforms are heavily used by people to express their views in their native languages. Besides positive views, people often use abusive or offensive language to express their anger or frustration. Resource-rich languages have offensive language detection systems that automatically monitor and block offensive content; however, such systems are very rare for low-resource languages because of the unavailability of datasets in those languages. This article proposes a model that automatically detects offensive language for a very low-resource language, namely Pashto. A Roman Pashto dataset is created by collecting 60 thousand comments from different social media platforms and labeling them manually. The proposed model is trained and tested using three different feature extraction approaches, i.e., bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF), and sequence integer encoding. Four traditional classifiers and a deep sequence model are trained on this task. Experimental results show that, among the traditional classifiers, random forest works best, giving 94.07% testing accuracy on a combination of unigrams, bigrams, and trigrams, and a maximum accuracy of 93.90% with TF-IDF. However, the overall highest testing accuracy of 97.21% is achieved using bidirectional long short-term memory (BLSTM). The corpus created in this work is made available to researchers working in this domain.
Published in: IEEE Transactions on Computational Social Systems (Volume: 11, Issue: 4, August 2024)
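
As an illustration of the kind of pipeline the abstract describes, the following is a minimal sketch (not the authors' released code) of a TF-IDF feature extractor over unigrams, bigrams, and trigrams combined with a random forest classifier, using scikit-learn. The dataset file name and column names are placeholders, not the actual Roman Pashto corpus files.

# Illustrative sketch only: TF-IDF (1-3 grams) + random forest for offensive
# language classification. File and column names below are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Load manually labeled comments (hypothetical CSV with "comment" and "label" columns).
df = pd.read_csv("roman_pashto_comments.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["comment"], df["label"], test_size=0.2, random_state=42
)

# TF-IDF over unigrams, bigrams, and trigrams feeding a random forest classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
model.fit(X_train, y_train)
print("Testing accuracy:", accuracy_score(y_test, model.predict(X_test)))

The same train/test split could be reused with a BoW vectorizer or with an integer-encoded sequence model such as a BLSTM; the sketch above only covers the TF-IDF plus random forest configuration mentioned in the abstract.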