Abstract:
With the growth of social media, where users can share comments instantly, systems that identify offensive content have become a necessity in every language. Because publicly available resources for offensive language identification are lacking for Farsi, which has more than 110 million speakers, we present Pars-OFF, a three-layered annotated corpus for offensive language detection in Farsi, to fill this gap. The corpus contains 10,563 data samples. The tweets were collected with a combination of similarity-based and keyword-based data selection techniques to avoid severe class imbalance. As a baseline, this article also reports the performance of traditional machine learning approaches and Transformer-based models on the Pars-OFF dataset. The best performance was obtained by the BERT+fastText model, which yields an F1-Macro score of 89.57.
Published in: IEEE Transactions on Affective Computing (Volume: 14, Issue: 4, 01 Oct.-Dec. 2023)