Deep Learning Based Cross Domain Sentiment Classification for Urdu Language

Sentiment analysis is a widely researched area due to its various applications in customer services, brand monitoring, and market research. Automatic sentiment classification is an important but challenging task. Contrary to the English language, sentiment analysis for low-resource languages like Urdu is an under-explored research area. Most of the work on sentiment analysis in the Urdu language is domain-dependent, where models are mostly trained and tested on the same dataset on limited domains. However, sentiments in different domains are expressed differently, and manually annotating datasets for all possible domains is infeasible. Training a sentiment classifier using annotated data from one domain and testing it on another domain results in poor performance, as terms appearing in the source domain (training data) might not appear in the target (testing) domain. In this paper, we present a baseline method for cross-domain sentiment analysis in the Urdu language using two different domains. Feature extraction is performed using n-grams and word embedding techniques. Sentiment classification is performed using machine learning and deep learning classifiers. The proposed method achieves accuracy, precision, recall, and F1 scores of 0.77, 0.83, 0.68, and 0.75, respectively.


I. INTRODUCTION
In this technology-driven era, online social networks (OSNs) such as Twitter and Facebook are actively involved in enabling global connectivity. Users freely consume and generate information, leading to unprecedented amounts of data [1]. Due to the explosion of this data, the internet has become a huge dynamic repository of public views on a large variety of topics or genres (movie reviews, sports reviews, electronic reviews, etc.) [2]. Sentiment classification has become a key enabler of opinion summarization and extraction that automatically categorizes the sentiment in a piece of text on any topic or entity [3]. Some examples of such content include merchandise buyer reviews, product reviews, hotel customer reviews, etc. The emotional tendency exhibited by categorizing sentiment polarity can be a helpful indicator of consumer behavior and opinions, leading to improved efficiency in information sharing among users and improved business services and solutions [4].

(The associate editor coordinating the review of this manuscript and approving it for publication was Xinyu Du.)

Sentiment analysis (SA) is performed at different levels, i.e., document level, sentence level, and aspect level. In document-level SA, the whole document is considered as a basic unit discussing a single topic. The whole document is considered positive if there are more positive sentences than negative sentences, and vice versa. Sentence-level SA categorizes the sentiment in each sentence as positive, negative, or neutral. Aspect-level SA is a more fine-grained analysis that classifies sentiments based on aspects that are already identified [5].

Urdu is a low-resource language with a large number of native Urdu speakers [7]. Existing works on Urdu SA [18] are mostly domain-dependent, and manually annotating datasets for every possible domain is impractical. To the best of our knowledge, no work has been done on CDSA for the Urdu language using Urdu script.
So, there is a need to develop models that are flexible enough to train on one domain and predict the sentiments in another domain, minimizing the time and effort of manually annotating the dataset. To overcome this problem, the main aim of this study is to develop models that can adapt across domains and minimize the manual labeling effort. This paper presents a baseline method for CDSA in the Urdu language and performs classification tasks using ML and DL methods. The main contributions of this paper are summarized as follows:

• Development of annotated datasets for two different but related domains, i.e., cricket and football. A total of 9221 tweets are collected from Twitter in this regard. Of these tweets, 6221 belong to the cricket domain while the remaining 3000 belong to the football domain.

• Extraction of n-gram features at both the word level and character level. Moreover, for DL methods, word embedding is generated using the one-hot encoder.

• To the best of our knowledge, no study exists on CDSA for the Urdu language. Therefore, this study presents a baseline method for CDSA and evaluates the proposed method using four evaluation measures, i.e., accuracy, precision, recall, and F1-score. A DL model is also developed that is adaptable to handle other domains.

• Use of two existing, widely used, standard Urdu datasets to validate the proposed predictive model.

The rest of the paper is organized as follows. Section II presents the literature survey, followed by Section III, which explains the research methodology of this study. Section IV presents the experimental setup and explains the results and discussion. Section V concludes the research.

II. LITERATURE SURVEY
Research in the field of SA has received increasing attention lately, and many studies have been conducted in this field. The following section discusses different studies on SA using the lexicon-based approach, the ML approach, and the DL approach.

A. LEXICON-BASED APPROACH
Mukhtar et al. [22] perform SA in the Urdu language using a lexicon-based approach by enhancing an existing SA lexicon and introducing context-dependent words into it. These words are used with or without conjunctions. Moreover, the authors develop rules to assign sentiments to these context-dependent words, and these rules are further combined with a sentiment analyzer. The results show a significant performance improvement of the Urdu sentiment analyzer from 83% to 89%. Hossein et al. [23] introduce a lexicon-based method for SA in the Persian language using a dataset of mobile reviews. The authors extract aspects from the reviews using the combination of 'noun adjective' pairs or 'noun adverb adjective' pairs using a lexicon. They also consider the impact of intensifiers on the reviews and present a visual summary of aspects in the reviews.

B. MACHINE LEARNING APPROACH
Additionally, they also extract feature unions based on the best-performing features in the character grams and word grams. To improve the performance of the system, they select the three best ML classifiers to form an ensemble classifier that uses voting and weighted voting techniques. Furthermore, they apply LSTM and CNN DL classifiers over the entire dataset to further enhance the system performance. T-tests are applied to show the statistical significance of the proposed approach.
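The voting ensemble described above can be sketched with scikit-learn's VotingClassifier. The toy corpus, labels, classifier choices, and weights below are illustrative placeholders, not the cited study's actual setup.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for annotated tweets (1 = positive, 0 = negative).
texts = ["great match today", "terrible performance", "what a win",
         "awful loss", "brilliant innings", "poor bowling"]
labels = [1, 0, 1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Three base classifiers combined by weighted (soft) voting.
ensemble = VotingClassifier(
    estimators=[("mnb", MultinomialNB()),
                ("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    voting="soft",      # average predicted class probabilities
    weights=[2, 1, 1],  # weighted-voting variant
)
ensemble.fit(X, labels)
print(ensemble.predict(vec.transform(["great win"])))
```

With `voting="hard"` instead, the ensemble takes a (weighted) majority over predicted labels rather than averaging probabilities.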

Noor et al. [21] perform SA in Roman Urdu for automobile reviews by extracting features from the data using the bag-of-words model and then assigning term frequency-inverse document frequency (TF-IDF) weights to these features. For experimentation purposes, SVM with linear and cubic kernels is used. Moreover, one-vs-all and one-vs-one techniques are used to perform ternary classification. The SVM cubic kernel outperforms the linear kernel in multi-class classification. Bibi et al. [28] propose a technique to perform SA in the Urdu language using tweets. Features are extracted from the data using POS tags, i.e., adjectives, and the count of positive and negative words in a sentence. Moreover, the proposed methodology is evaluated using 10-fold cross-validation. A decision table applied to the extracted features achieves an accuracy of 90%. Although the proposed method produces good results, the size of the corpus is very small, and some other important POS tags like nouns, verbs, etc. are not considered during the feature extraction process. In another reported study, the proposed features perform best for the SA task with an F1 score of 82%.

C. DEEP LEARNING APPROACH
Dashtipour et al. [33] propose a SA framework in the Persian language for hotel reviews that detects the polarity of a sentence using linguistic rules and DL models. Upon pattern detection, this method allows polarity to flow from words to concepts based on the symbolic dependency relations. Furthermore, when no pattern is detected, the method uses its sub-symbolic counterpart and applies DL for sentiment classification. The proposed method achieves up to 15% higher accuracy than the baseline methods. Li et al. [34] propose a bidirectional LSTM with a self-attention mechanism and multi-channel features (SAMF-BiLSTM). This approach comprises two parts, i.e., multi-channel features and a self-attention mechanism.
In the first phase, existing sentiment resources and linguistic knowledge are modeled, and various features are extracted as input to the model. Then, BiLSTM is used to extract the information regarding sentiments. Additionally, the BiLSTM-D model is also developed for document-level SA. The proposed method performs better than the baseline methods.

A comprehensive literature survey highlights that most of the works on Urdu-language SA are domain-dependent, where the annotators annotate the datasets for multiple domains. However, annotating datasets for different domains is a tedious as well as time-consuming process. This study applies a DL model for CDSA that is flexible enough to adapt to a new domain without annotation. To the best of our knowledge, no work has been reported on the CDSA problem for the Urdu language.

III. RESEARCH METHODOLOGY
This section discusses the methodology adopted to solve the problem of CDSA. The architecture of the proposed methodology is illustrated in Figure 1.

To have a cross-domain dataset for different but related domains, we develop a dataset for two domains, i.e., cricket and football, using the Twitter intelligence tool (TWINT) library. The dataset comprises a total of 9221 sentences, of which 6221 belong to the cricket domain while the remaining 3000 belong to the football domain. To evaluate the performance of our approach, two datasets are used, i.e., balanced and unbalanced. The balanced dataset comprises an equal number of positive, negative, and neutral sentences. The statistics for both balanced and unbalanced datasets are shown in Table 1. The keywords used to search cricket and football tweets are illustrated in Table 2.

To preprocess the dataset for annotation, (i) special characters like @, #, $, !, (ii) hyperlinks, (iii) emoticons, (iv) unwanted characters, and (v) extra spaces are removed. Additional information like username, location, language, source, etc. that was collected at the time of data crawling is also removed.

Comprehensive annotation guidelines are formulated with the mutual consent of the annotators, and the following guidelines are used to annotate the dataset.

Guidelines for positive polarity: A sentence is assigned positive polarity if it has a greater number of positive sentiments as compared to negative sentiments; for instance:

Zafar Gohar is a specialist bowler and a good batsman.
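The preprocessing steps above can be sketched as follows; the regular-expression patterns and the example tweet are illustrative, not the exact cleaning rules used in this study.

```python
import re

def preprocess_tweet(text: str) -> str:
    """Clean a raw tweet before annotation (illustrative sketch)."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)                 # (ii) hyperlinks
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", " ", text)  # (iii) emoticons/emoji
    text = re.sub(r"[@#$!]", " ", text)                                # (i) special characters
    text = re.sub(r"\s+", " ", text)                                   # (v) extra spaces
    return text.strip()

print(preprocess_tweet("Great win! @PCB #cricket https://t.co/xyz"))
# → Great win PCB cricket
```

Metadata fields such as username and location would be dropped at crawling time rather than by this text-level function.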

Guidelines for negative polarity: A sentence is assigned negative polarity if it has a greater number of negative sentiments as compared to positive sentiments; for instance:

And who has been taken in the test team to do a fashion show?

Sentences with conjunctions: Sentences with conjunctions have two clauses, i.e., the subsequent clause (present before the conjunction) and the consequent clause (present after the conjunction). The overall sentiment of the sentence is assigned based on the sentiment of the consequent clause. For example, in the following example, (while) is a conjunction, and the sentiment this conjunction shows is positive, so the overall polarity of the sentence will be positive.

The agreement between the annotators is measured using inter-rater reliability, which is defined as the extent to which the two annotators assign the same score to a variable [30], and is computed as

kappa = (P_o - P_e) / (1 - P_e)

where P_o is the relative observed agreement among annotators and P_e is the hypothetical probability of chance agreement.

For DL models, numeric input is required for the classification tasks, so as part of feature engineering, the tweets are transformed into one-hot vectors. A one-hot vector is a 1 x N vector consisting of 0s in all cells except the one cell that uniquely identifies a word in the document, which has a value of 1. For encoding, each token of the tweet is separately encoded and then padded to make sure that all vectors are of the same length [36].

For the classification of sentiments, several classical ML models, i.e., MNB, BNB, LR, RF, and linear SVC, and the RNN, LSTM, and GRU DL models are used. The details of these classifiers and models are discussed in the following subsections.
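The one-hot encoding and padding step can be sketched in plain Python as follows; the vocabulary and example tweets are made up for illustration.

```python
def build_vocab(tweets):
    """Map each distinct token to an integer index."""
    vocab = {}
    for tweet in tweets:
        for token in tweet.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def one_hot(token, vocab):
    """1 x N vector: zeros everywhere except the cell identifying the token."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

def encode_and_pad(tweet, vocab, max_len):
    """Encode each token separately, then pad with zero vectors to max_len."""
    vecs = [one_hot(tok, vocab) for tok in tweet.split()]
    while len(vecs) < max_len:
        vecs.append([0] * len(vocab))
    return vecs

tweets = ["match was great", "great match"]
vocab = build_vocab(tweets)          # {'match': 0, 'was': 1, 'great': 2}
encoded = encode_and_pad("great match", vocab, max_len=3)
print(encoded)  # [[0, 0, 1], [1, 0, 0], [0, 0, 0]]
```

Padding to a common length is what allows a batch of variable-length tweets to be fed to the DL models as a fixed-shape tensor.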
For MNB, P(t_k|p) is the conditional probability that a term t_k appears in a new document of polarity p, estimated with Laplace smoothing as

P(t_k|p) = (count(t_k, p) + 1) / (count(t_p) + |V|)

where count(t_k, p) is the number of occurrences of t_k in documents of polarity p, count(t_p) is the total number of terms in documents of polarity p, and |V| is the vocabulary size. For BNB, P(t_k|p) is the conditional probability that term t_k occurs in a new document of polarity p, and 1 - P(t_k|p) the probability that it does not, estimated as

P(t_k|p) = (count(t_k, p) + 1) / (N_p + 2)     (8)

where N_p is the total number of documents having polarity p.
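The Laplace-smoothed term probability for multinomial naive Bayes can be computed as in the sketch below; the documents and vocabulary size are toy values, not the paper's data.

```python
def mnb_term_prob(term, docs_of_polarity, vocab_size):
    """P(term | polarity) = (count(term, p) + 1) / (total terms in p + |V|)."""
    tokens = [tok for doc in docs_of_polarity for tok in doc.split()]
    return (tokens.count(term) + 1) / (len(tokens) + vocab_size)

positive_docs = ["great great match", "great win"]
# 5 tokens total in the positive class; assume a vocabulary of 10 terms.
p = mnb_term_prob("great", positive_docs, vocab_size=10)
print(p)  # (3 + 1) / (5 + 10) = 4/15
```

The "+1" in the numerator and "+|V|" in the denominator ensure that a term unseen in the training data still receives a small nonzero probability instead of zeroing out the whole product of term probabilities.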
LR models the probability of a class using the logistic function

P = 1 / (1 + e^-(beta_0 + beta_1 X_1 + beta_2 X_2 + ... + beta_k X_k))

where P is the probability of occurrence of the feature, X_1, X_2, ..., X_k are the values of the predictors, beta_0 is the model's intercept, and beta_1, beta_2, ..., beta_k are the coefficients [39].
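The logistic function above can be evaluated directly; the coefficients here are arbitrary placeholders for illustration.

```python
import math

def logistic_prob(x, beta0, betas):
    """P = 1 / (1 + exp(-(beta0 + sum_i beta_i * x_i)))."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

# With all coefficients zero, the model is maximally uncertain: P = 0.5.
print(logistic_prob([1.0, 2.0], beta0=0.0, betas=[0.0, 0.0]))  # 0.5
```

Training the classifier amounts to fitting beta_0, ..., beta_k so that these probabilities match the observed labels; scikit-learn's LogisticRegression does this internally.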

RNN is a DL model and is mostly used for classification.

It assigns weights to the sequence of previous data points. RNN performs better for semantic analysis of data as it considers the information of previous nodes. It usually contains three layers, i.e., an input layer, a hidden layer, and an output layer, which can be formulated as

s_t = f(U x_t + W s_(t-1))

where x_t represents the input at step t and s_t the hidden state at time t, and the weight matrices U and W parameterize the network.

In the LSTM model, the forget gate f is used to show the information that needs to be removed from the cell state C, whereas the input gate exhibits the new information that is to be added to the cell state C. The output gate O determines the output based on the sigmoid function [43].
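The recurrent update s_t = f(U x_t + W s_(t-1)) can be sketched in NumPy as follows; the dimensions, random weights, and toy sequence are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3
U = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
W = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights

def rnn_step(x_t, s_prev):
    """s_t = tanh(U x_t + W s_{t-1}): the hidden state carries past context."""
    return np.tanh(U @ x_t + W @ s_prev)

# Unroll over a toy sequence of 5 time steps.
s = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):
    s = rnn_step(x_t, s)
print(s.shape)  # (3,)
```

The final hidden state s summarizes the whole sequence; for sentiment classification it would be passed to an output layer that produces class scores.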
It is a variant of LSTM that uses two gates and fewer parameters. One gate is the update gate u_t, which is a combination of the input and forget gates, while the other is the reset gate r_t, which shows the relevance of the previous cell state for calculating the next candidate. The cell state is equivalent to the hidden state, i.e., a tanh layer creates a new candidate vector C_t using r_t. The equations for GRU are formulated as [43]

u_t = sigma(W_u x_t + U_u h_(t-1))
r_t = sigma(W_r x_t + U_r h_(t-1))
C_t = tanh(W_c x_t + U_c (r_t * h_(t-1)))
h_t = (1 - u_t) * h_(t-1) + u_t * C_t

where sigma is the sigmoid function and * denotes element-wise multiplication.
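A toy GRU cell matching these equations can be written as follows; weight matrices are random placeholders rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_h = 4, 3
Wu, Wr, Wc = (rng.normal(size=(n_h, n_in)) for _ in range(3))
Uu, Ur, Uc = (rng.normal(size=(n_h, n_h)) for _ in range(3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev):
    u = sigmoid(Wu @ x_t + Uu @ h_prev)        # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)        # reset gate
    c = np.tanh(Wc @ x_t + Uc @ (r * h_prev))  # candidate state C_t
    return (1.0 - u) * h_prev + u * c          # new hidden state h_t

h = np.zeros(n_h)
for x_t in rng.normal(size=(5, n_in)):  # toy sequence of 5 steps
    h = gru_step(x_t, h)
print(h.shape)  # (3,)
```

Because h_t is a convex combination of the previous state and a tanh candidate, each hidden unit stays bounded in (-1, 1), which helps gradients flow over long sequences with fewer parameters than an LSTM.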

IV. EXPERIMENTAL SETUP AND RESULTS
This section contains the details of the experimental setup, evaluation parameters, and the discussion of experimental results.

In this study, the experiments on the dataset are carried out using the Scikit-learn toolkit, and tweets are classified into positive, negative, and neutral sentiments. For ML models, features are extracted using a TF-IDF vectorizer, and then the five classifiers BNB, MNB, LR, RF, and linear SVC are applied for the classification of tweets. For the DL methods, features are extracted using the one-hot encoder, and the classification of tweets is performed using RNN, LSTM, and GRU. These classifiers are trained on the cricket data and tested on the football dataset. Moreover, the aforementioned models are applied to both the balanced and unbalanced datasets. The ratio of the training and testing data is 80% and 20%, respectively. Default parameters are used for all experiments on ML classifiers. The embedding dimension is set to 300. Table 3 illustrates the parameter tuning for DL models.

Results on the union of word grams and character grams are shown in Table 5. The maximum length of character grams is selected from 3 to 12. MNB performs best on the balanced dataset in the case of combined unigrams and character grams, with accuracy, precision, recall, and F1 score of 0.61, 0.66, 0.52, and 0.58, respectively. For the unbalanced dataset, BNB with combined unigrams and character grams performs best, with accuracy, precision, recall, and F1 score of 0.64, 0.55, 0.67, and 0.60, respectively.
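The cross-domain ML setup with a union of word and character n-gram TF-IDF features can be sketched in scikit-learn as below; the tweets, labels, and n-gram ranges are placeholders for the actual cricket and football data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, make_pipeline

# Toy source-domain (cricket) training data and target-domain (football) test data.
cricket_texts = ["great innings today", "terrible bowling display",
                 "what a brilliant catch", "awful fielding again"]
cricket_labels = [1, 0, 1, 0]
football_texts = ["brilliant goal today", "awful defending again"]

# Union of word unigrams and character n-grams (length 3 to 12), TF-IDF weighted.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 12))),
])
model = make_pipeline(features, MultinomialNB())
model.fit(cricket_texts, cricket_labels)  # train on the source domain
print(model.predict(football_texts))      # predict on the target domain
```

Character n-grams help here because sentiment-bearing subword units (e.g., shared stems of evaluative words) can transfer across domains even when full words do not.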

The performance of MNB for combined trigrams and character grams is the lowest, with accuracy, precision, recall, and F1 score of 0.35, 0.29, 0.84, and 0.44, respectively. A possible reason that MNB and BNB show the best performance is that our dataset comprises tweets that are small in length.

Similarly, by changing the number of hidden layers, the batch size, and the number of epochs, the difference in the results of the LSTM model can be seen in Table 7. When the number of hidden layers is set to 2 with a batch size of 32 and 3 epochs, LSTM gives better results as compared to RNN. The highest accuracy, precision, recall, and F1-score of 0.76, 0.79, 0.72, and 0.75 are achieved, respectively.

By using different numbers of hidden layers, batch sizes, and numbers of epochs, the results of the GRU model are shown in Table 8. GRU performs best with one hidden layer.

In addition to this, the deep learning models, i.e., RNN, LSTM, and GRU, are also used for the classification task, where GRU performs best with accuracy, precision, recall, and F1-score of 0.77, 0.83, 0.68, and 0.75, respectively. Overall, the performance of the deep learning models is higher than that of the machine learning classifiers. It is also observed that increasing the number of hidden layers has no significant effect on accuracy. To validate our proposed model, our best-performing model is tested on two standard, widely used Urdu datasets. This study serves as a baseline for future research in cross-domain sentiment analysis in low-resource languages like Urdu. In the future, we intend to increase the accuracy of cross-domain sentiment analysis for Urdu.