Sentiment Analysis of Reviews in Natural Language: Roman Urdu as a Case Study

Opinion Mining from user reviews is an emerging field. Sentiment Analysis of Natural Language text helps in finding the opinion of customers. These reviews can be in any language, e.g. English, Chinese, Arabic, Japanese, Urdu, and Hindi. This research presents a model to classify the polarity of reviews written in Roman Urdu. For this purpose, raw data was scraped from the reviews of 20 songs from the Indo-Pak music industry, and a new dataset of 24,000 Roman Urdu reviews was created. Nine Machine Learning algorithms (Naïve Bayes, Support Vector Machine, Logistic Regression, K-Nearest Neighbors, Artificial Neural Network, Convolutional Neural Network, Recurrent Neural Network, ID3 and Gradient Boost Tree) were evaluated. Logistic Regression outperformed the rest, with testing and cross-validation accuracies of 92.25% and 91.47% respectively.


I. INTRODUCTION
Although some companies still collect reviews about an entity through market surveys, this is an outdated way to acquire feedback [1]. The boom of technology has shrunk distances and digitized the world. Products are now shelved online. To learn how customers respond to different aspects of a product, a feedback mechanism is provided in the form of comments or reviews [2]. This feedback is valuable for both the user and the displayer of the product [3]. Such reviews have replaced the old style of surveys that were critical to revamping quality standards. The drastic increase in the number of e-users has caused exponential growth in these reviews [4].
Recently, the showbiz of South Asia has evolved enormously [5], and internet technology has played a vital role in this [6]. Using this paradigm, people have easy access to songs, movies, plays and much more. Interestingly, the artists have direct connections with their fans through social media. Viewers not only watch the material but also give their feedback on the same page or channel [7]. Different websites manage such material; YouTube is one of them [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Hiu Yung Wong.
YouTube is one of the most popular platforms [9] for hosting videos of different kinds, such as entertainment, education, and sports. People use YouTube professionally to earn money by creating channels and posting videos on them. YouTuber [10] is a well-known and successful profession nowadays. YouTubers make videos and post them on their pages/channels to engage the audience. These professionals judge the popularity of the content [6] by the number of likes, dislikes and comments on the video [9]. A simple formula to check the content quality is:
$$CQ = \begin{cases} \text{good} & \text{if } TL - TD > 0 \\ \text{bad} & \text{otherwise} \end{cases} \qquad (1)$$
where TL is the total number of likes and TD is the total number of dislikes. This formula provides very limited insight into the content. Another way to check the credibility of the content of a video is by analyzing the comments on that video [11]. Like many other websites, YouTube also facilitates writing reviews in the comments section [12]. Previously, it was easy to judge the content by manually reading these reviews, as they were limited in quantity. Now that the number of reviews has increased tremendously, it is not humanly possible to read and analyze all of them. This raises the need for automated processes and tools for the purpose.
Sentiment Analysis is one such tool. People post their opinions about an item (good, bad or neutral) in the comments section [13]. Sentiment Analysis assigns a polarity to each review [14]: positive, negative, or neutral [15].
Although educated people often review an item in English, people in general prefer to write reviews in a language they feel comfortable with. Because most websites originate from English-speaking countries, they support only Roman script [16] for reviews [17]. That is why most Sentiment Analysis research has been done on the English language. When people from non-English-speaking countries visit these websites, they write comments in their native languages (such as Chinese, Arabic, Japanese, Urdu, and Hindi) using Roman script. Most of the antecedent work in the domain of Sentiment Analysis has been done on the English and Chinese languages [18]-[20]. Urdu is one of the popular and widely used languages of the sub-continent. There are no spelling standards for writing Urdu words in Roman script [21]. People who are not proficient in English use Roman script to write their reviews.
Limited research has been done on Sentiment Analysis of Roman Urdu text, so there is a need to extend the research in this direction. It can help people improve their business strategies and address consumers' needs. For this research, Roman Urdu text data was required. Because it is widely followed, the music content of South Asia was chosen for scraping reviews from YouTube. Reviews of a song help in understanding different quality aspects and people's sentiments about the content [22].
To perform Roman Urdu Sentiment Analysis, a good dataset was required [23]. In existing research works, the datasets were not big enough. This paper contributes a benchmark dataset of Roman Urdu text extracted from song reviews (hereafter DRU). The entire corpus was manually annotated to perform Sentiment Analysis. To build DRU, reviews were scraped from YouTube for the twenty selected songs listed in Table 1. A total of 321,504 reviews were scraped. Urdu reviews were extracted by applying different filters. DRU contains 24,000 manually annotated reviews.
The rest of the paper is organized into six sections. Section II presents existing works. Section III discusses the methodology adopted to meet the objectives of this research. Section IV describes the entire corpus generation process of DRU. Section V presents the modeling and experimentation. Experimental results are discussed in Section VI. Section VII concludes the research and presents possible future enhancements.

II. LITERATURE REVIEW
In [17], researchers performed Roman Urdu Sentiment Analysis on a dataset of just 1,600 hotel reviews. The maximum accuracy that they achieved was 91% using the Support Vector Machine. Reference [24] presented Sentiment Analysis of Roman Urdu reviews on mobiles. The researchers tried Decision Tree, Naïve Bayes and KNN algorithms for this purpose. They achieved 97.50% accuracy using Naïve Bayes. The researchers of [25] used the N-gram model. A maximum accuracy of 72.37% was achieved through Naïve Bayes using uni-Bigram. The authors of [26] performed aspect-level Sentiment Analysis using different Machine Learning techniques to classify the products into three categories (Low, Medium and High).
In [27], discourse-based Sentiment Analysis was performed. Reviews were scraped from different websites providing social services, and the analysis used discourse-based Part of Speech tagging. In [28], the ''Bag of Words Meets Bag of Popcorn'' dataset was used to perform Sentiment Analysis, and a maximum accuracy of 90.90% was achieved by using word vector weighted averaging + Random Forest. In [29], reviews of mobiles were collected from Amazon and aspect-level Sentiment Analysis was performed on the dataset. Association rule mining was used for the segmentation of sentences. After determining the opinion orientation of the words, the authors simply counted the total numbers of positive and negative comments for each aspect and finally ranked the aspects by the number of positive reviews. In [30], the 5-Gram technique and a feature-based heuristic scheme were used to perform aspect-level Sentiment Analysis on movie reviews.
In [31], Sentiment Analysis was performed using dimensionality reduction after applying different preprocessing techniques such as slang handling, stop-word removal, stemming, and lemmatization. The dataset, ''Bag of Words Meets Bag of Popcorn'', was taken from Kaggle. A total of 256 experiments were conducted. The accuracies before and after applying the preprocessing techniques were 83.417% and 84.90% respectively. In [32], researchers worked on finding hidden patterns in raw text data using preprocessing techniques. To avoid outliers in the data, preprocessing steps such as removal of HTML tags and stop words were applied. The dataset was standardized by applying stemming and lemmatization. The features were extracted by applying N-gram and TF/IDF.
Datasets have been developed previously, but most of them are small. For example, [24], [25] and [33] used datasets of 300, 600 and 1,600 Roman Urdu reviews. Benchmark datasets are available for English, Chinese, and other resource-rich languages. In [34], a dataset of English reviews on Salsa music is presented. In [35], a dataset of 1,000 English reviews is presented and emotional analysis is performed. This creates a need to develop a benchmark dataset of Roman Urdu reviews for Sentiment Analysis.

III. METHODOLOGY
To generate the corpus (DRU) and to perform Sentiment Analysis on it, the methodology shown in Figure 1 was adopted.

A. DATASET COLLECTION
For the dataset, reviews were collected from YouTube. Different songs of Indian and Pakistani singers were selected to collect Roman Urdu reviews. Table 1 lists the names of the songs and the number of scraped reviews.
After scraping, these reviews were saved in CSV file format. This raw dataset contained reviews in Roman Urdu as well as in other languages. It also contained a lot of noisy data (such as special characters, numbers, slang, and emojis). To clean the data, extract the Roman Urdu reviews and make them ready for analysis, different pre-processing techniques were applied.

B. PRE-PROCESSING
To get good analytical results using Machine Learning techniques, data should be well refined and of high quality [38]. For this purpose, different pre-processing techniques [32], [39] were applied to the raw dataset to make it ready for high-standard analysis. The raw dataset had different types of issues, such as noise, special characters, emojis, text in different languages, and varying lengths of reviews. The properties of the raw data are shown in Table 2. To get the required quality of data, the pre-processing techniques applied were data integration, noise removal (including emoji removal), lowercasing, data filtration, and length standardization. Details are given in the following subsections.

1) DATA INTEGRATION
The raw data was in twenty different files, one for each song. To analyze the data together, all of these files were integrated into one. The combined number of records (i.e. reviews) reached 321,504, with a file size of 107 KB.

2) NOISE REMOVAL
Noise is a factor that affects the analysis badly. It was observed that the collected data had a lot of noise, i.e. special characters, punctuation, numbers and emojis. The presence of noise badly impacts the quality of classification results [40]. The focus of this research was only on text analysis, so all noise was removed from the data. The changes in the dataset DRU before and after noise removal can be seen in Table 3.
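The paper does not give the exact rules used for noise removal. As a minimal sketch, assuming that Roman Urdu reviews consist of ASCII letters and that everything else (digits, punctuation, special characters, emojis) counts as noise, the step could look like this in Python. The function name `remove_noise` is illustrative and not taken from the original work.

```python
import re

def remove_noise(text: str) -> str:
    """Keep only ASCII letters and whitespace, then collapse repeated spaces.

    Assumption: anything that is not a letter or whitespace is noise.
    """
    # Replace digits, punctuation, special characters and emojis with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse the runs of whitespace left behind by the removal.
    return re.sub(r"\s+", " ", letters_only).strip()

# "\U0001F60D" is an emoji code point, written as an escape for portability.
print(remove_noise("Kamal ka gana h!!! \U0001F60D 100%"))  # Kamal ka gana h
```

Replacing noise with a space rather than deleting it keeps neighbouring words from being glued together when they were separated only by punctuation.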

3) LOWERCASING
The raw dataset contained both uppercase and lowercase text. When this type of data is used for classification, classifiers find different variations [41] of the same input class [42]. Because classifiers are case sensitive, they treat ''mast'' and ''MAST'' as two different inputs. To overcome this problem, the complete dataset was converted into lowercase.
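The effect of lowercasing on the vocabulary can be illustrated with a small sketch. The three sample reviews below are made up for illustration, and whitespace tokenization is an assumption.

```python
from collections import Counter

# Hypothetical reviews that differ only in letter case.
reviews = ["MAST gana", "mast gana", "Mast Gana"]

# Vocabulary built from the raw text: every case variant is a separate token.
raw_vocab = Counter(tok for r in reviews for tok in r.split())

# Vocabulary built after lowercasing: case variants collapse into one token.
lower_vocab = Counter(tok for r in reviews for tok in r.lower().split())

print(len(raw_vocab))    # 5 distinct tokens: MAST, mast, Mast, gana, Gana
print(len(lower_vocab))  # 2 distinct tokens: mast, gana
```

A smaller vocabulary means fewer, denser features for the classifiers to learn from.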

4) ROMAN URDU TEXT FILTRATION
The raw dataset contained reviews in multiple languages. The focus of this research was only on Roman Urdu, so the Urdu reviews in Roman script were extracted using a data filtration technique. Data filtration was performed in Microsoft Excel by applying different filters. The filtered data was filtered further during data annotation, where annotators were asked to remove any record carrying non-Urdu text. Table 4 illustrates sample filters that were used to extract the Urdu reviews.

5) LIMITING TEXT STRING SIZE
In the dataset, some of the reviews were very long; for example, one review had 5,578 characters. Long reviews cause issues that reduce the performance of the classifiers [43]. To avoid performance issues while keeping data loss to a minimum, the maximum review length was set to 150 characters. The impact of this limit was small: only 6.75% of the data was cropped. Before the limit was applied, the dataset had 1,240,168 characters; afterwards, the number of characters was reduced to 1,156,473. The data traits of DRU before and after applying this preprocessing function are shown in Table 5.
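The length limit and the reported percentage can be checked with a short sketch. It assumes that over-long reviews were truncated by simple slicing (the paper does not say how the crop was implemented); the name `crop` is illustrative. The character counts are the ones reported above.

```python
MAX_LEN = 150  # maximum review length in characters, as stated in the text

def crop(review: str, max_len: int = MAX_LEN) -> str:
    """Truncate a review to at most max_len characters (assumed strategy)."""
    return review[:max_len]

# Character counts reported before and after applying the limit.
chars_before = 1_240_168
chars_after = 1_156_473

# Share of characters removed by the limit, in percent.
cropped_pct = 100 * (chars_before - chars_after) / chars_before
print(round(cropped_pct, 2))  # 6.75
```

The computed share of removed characters agrees with the 6.75% figure quoted in the text.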

IV. CORPUS GENERATION
This section illustrates the measures that were adopted to build a labelled dataset of Roman Urdu reviews for Sentiment Analysis. The steps taken to construct DRU were as follows:
Step 1: Collection of data, as discussed in Section III.A.
Step 2: Preprocessing, as discussed in Section III.B.
Step 3: Data extraction, as discussed in Section III.B.5.
Step 4: Data annotation, as discussed below.

A. ANNOTATION GUIDELINES
Data annotation is a static component in which a label (i.e. a class) is assigned to each text according to the subjectivity it expresses. Annotation can be performed using three schemes: manual annotation, auto-annotation and semi-auto-annotation. In this study, manual annotation was performed to label the reviews with their target class according to the guidelines presented in [44] and [45]. Each review was labelled with one of two classes: positive or negative.
A review was marked as positive if the sentiment it expressed was positive [46], [47]. A review that showed both positive and neutral expression was classified as positive [48], [49]. The presence of any expression of agreement or approval made the review positive. For example, in the review ''kamal ka gana h'' (the song is awesome), the word ''kamal'' is the key word that defines the polarity of the whole sentiment. In another review, ''wow bhut acha sound voice hai'' (wow, the voice is very good), ''wow'' and ''bhut acha'' are the illocutionary words that classify this comment as positive.
A review that was negative in terms of sentiment or expression was classified as negative [50], [51]. A review that contained a negative word was classified as a negative review. For example, the review ''na hero acha na herion'' (neither hero nor heroine is decent) was classified as negative. In another review, ''faltu gana mud khrab kr diya mera'' (useless song, spoiled my mood), the word ''faltu'' (rubbish) made the review negative. Some sample annotated reviews are illustrated in Table 6.

V. MODELING AND EXPERIMENTATION
For the best classification results, experimentation was performed using different classification algorithms. The details of the experimentation setup are as follows:

A. EXPERIMENTAL SETUP
As discussed above, different classification algorithms and techniques were used to design the models. The ML algorithms were applied in Python and results were generated by applying TF/IDF. Cross-validation of the models was done using the K-Fold procedure with k = 10. The experimental design is shown in Figure 2.
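The paper does not state which TF/IDF variant or library was used. As a minimal sketch, assuming the plain form tf(t, d) × log(N / df(t)) with whitespace tokenization, the weighting can be written in standard-library Python as follows. The function name `tf_idf` and the sample reviews are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df).

    Assumptions: whitespace tokenization, relative term frequency,
    and an unsmoothed inverse document frequency.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

reviews = ["kamal ka gana h", "faltu gana", "bhut acha gana"]
vectors = tf_idf(reviews)
print(vectors[1])
```

In this toy corpus the word ''gana'' occurs in every review, so its weight is zero, while a polarity-bearing word such as ''faltu'' receives a positive weight. Library implementations often add smoothing and normalization, so their values will differ from this sketch.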

B. DESIGN MODEL
This research targets binary classification [52]. The Machine Learning algorithms used to design the model were Naïve Bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR), K-Nearest Neighbors (KNN), Artificial Neural Network (ANN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), ID3 and Gradient Boost Tree (GB). To classify the reviews and to check the performance of the model, the data was split in a 9:1 ratio, i.e. 90% of the data was used to train the model and 10% was used to test it.
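A 9:1 split of this kind can be sketched as follows. The shuffling step, the fixed seed and the function name `split_9_1` are assumptions made for illustration; the paper does not describe how the split was drawn.

```python
import random

def split_9_1(items, seed=0):
    """Shuffle the items and return (train, test) in a 9:1 ratio."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10  # integer arithmetic avoids rounding issues
    return shuffled[:cut], shuffled[cut:]

# Using review indices 0..23999 as stand-ins for the 24,000 reviews of DRU.
train, test = split_9_1(range(24_000))
print(len(train), len(test))  # 21600 2400
```

With 24,000 reviews, this gives 21,600 training reviews and 2,400 test reviews.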

1) NAÏVE BAYES
Naïve Bayes is a simple probabilistic classifier based on Bayes' theorem. It is a fast and efficient classification technique and is well known for text classification. It can also handle both continuous and discrete types of data [53]-[56]. Details of the test classification results can be seen in the confusion matrix shown in figure 2(a). It achieved a test accuracy of 89.67%, while the cross-validation accuracy was 85.37%.

2) SUPPORT VECTOR MACHINE
Support Vector Machine is a linear model for classification. It learns to discriminate between two classes when a labelled dataset is provided for training. SVM represents the data as points in space and separates the categories by a clear gap that is as wide as possible [57]. Details of the test classification results can be seen in the confusion matrix shown in figure 2(b). It achieved a test accuracy of 88.46%, while the cross-validation accuracy was 86.96%.

3) LOGISTIC REGRESSION
Logistic Regression is one of the most popular Machine Learning algorithms. Despite its name, it is used here for classification; it models the relationship between the input variables and the probability of a class [58]. Details of the test classification results can be seen in the confusion matrix shown in figure 2(c). It achieved a test accuracy of 92.25%, while the cross-validation accuracy was 91.47%.
VOLUME 10, 2022

4) DECISION TREE
A Decision Tree is a hierarchy-based classification model that uses the divide-and-conquer approach. It performs well if the data is discontinuous. From the tree family, ID3 and Gradient Boost were chosen for this study because these algorithms perform better on text data [59]. Decision trees tend to overfit when the data is noisy, and a small variation in the data can make the model's predictions unstable [60], [61]. ID3 and Gradient Boost Tree achieved accuracies of 87.92% and 85.79% respectively on the testing data. Details of the test classification results of ID3 and Gradient Boost Tree can be seen in the confusion matrices shown in figures 2(d) and 2(e) respectively. The accuracies on 10-Fold cross-validation were 86.31% and 85.52% respectively, which showed that the models were well fitted.

5) K-NEAREST NEIGHBOR
The K-nearest-neighbor algorithm is based on instance-based learning. It compares a given test instance with the training set using a distance measure such as Manhattan, Hamming, Minkowski or Euclidean distance. It is simple to understand and easy to implement. It is a lazy learner because it memorizes the training data, and it does not perform well with missing values [62], [63]. Details of the test classification results can be seen in the confusion matrix shown in figure 2(f). It achieved a test accuracy of 86.13%, while the cross-validation accuracy was 86.64%. The model is well fitted.
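The distance measures named above, together with a majority-vote prediction, can be sketched in plain Python. The two-dimensional feature vectors, the labels and the choice k = 3 are hypothetical; the paper does not report which distance or value of k was used.

```python
from collections import Counter

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    return minkowski(a, b, 1)  # Minkowski with p = 1

def euclidean(a, b):
    return minkowski(a, b, 2)  # Minkowski with p = 2

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k=3, distance=euclidean):
    """Return the majority label among the k training points nearest to query."""
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Hypothetical 2-D feature vectors with their labels.
train = [((1.0, 0.0), "positive"), ((0.9, 0.1), "positive"),
         ((0.0, 1.0), "negative"), ((0.1, 0.8), "negative")]
print(knn_predict(train, (0.8, 0.2), k=3))  # positive
```

No model is fitted in advance; all distance computations happen at prediction time, which is why the method is called a lazy learner.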

6) ARTIFICIAL NEURAL NETWORK
Artificial neural networks (ANNs), generally called neural networks (NNs), are computing systems inspired by the biological neural networks that constitute animal brains. Neural networks learn (or get trained) by processing examples with well-defined inputs and results. They form probabilistic weighted associations between the two, and these weights are stored within the net itself. Details of the test classification results can be seen in the confusion matrix shown in figure 2(g). It achieved a test accuracy of 90.38%, while the cross-validation accuracy was 88.00%.

7) CONVOLUTIONAL NEURAL NETWORK
Convolutional Neural Network (CNN) belongs to the family of neural networks. It takes the input, assigns learnable weights to the objects, and classifies them [64]. Although CNN performs well on image data [65], it did not perform well on DRU. Details of the test classification results can be seen in the confusion matrix shown in figure 2(h). It achieved a test accuracy of 66.54%, while the cross-validation accuracy was 67.19%.

8) RECURRENT NEURAL NETWORK
Recurrent Neural Network (RNN) also belongs to the family of Neural Networks. In an RNN, the connections between nodes form a graph along a temporal sequence. Recurrent means that the current output becomes an input at the next step, unlike in a Feed Forward Neural Network. RNN helps to make good predictions in Sentiment Classification [66]. Details of the test classification results can be seen in the confusion matrix shown in figure 2(i). It achieved a test accuracy of 91.71%, while the cross-validation accuracy was 90.88%.

C. MODEL VALIDATION
To check the performance of the model, the reviews were classified using the 9:1 train/test ratio. After the classification results were obtained, the model was cross-validated using the K-Fold cross-validation technique. The value of K was set to 10.
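The fold construction behind 10-Fold cross-validation can be sketched as follows. Contiguous, unshuffled folds and the function name `k_fold_indices` are assumptions for illustration; the paper does not describe how the folds were formed.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(24_000, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 21600 2400
```

Each review is used for validation exactly once, and the cross-validation accuracy is the mean of the ten per-fold accuracies.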

VI. DISCUSSION OF RESULTS
This research had two goals: building a dataset and finding a better model for Sentiment Analysis of Roman Urdu text. A detailed discussion follows.

A. DATASET OF ROMAN URDU (DRU)
This dataset contains a collection of about 24,000 song reviews from YouTube. Only polarized Urdu reviews written in Roman script were considered. The positive and negative reviews are equal in number. The entire corpus contains 1,117,303 characters. After the annotation phase, the inter-annotator agreement was calculated with the help of the ''Kappa coefficient''. The value of 0.87 showed that the annotation quality is excellent. Statistics of the corpus DRU are shown in Table 7.
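The paper does not say which form of the Kappa coefficient was computed. As a minimal sketch, assuming two annotators and Cohen's kappa, the agreement can be calculated as below. The label lists are made up for illustration and are not taken from DRU.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = set(labels_a) | set(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / n ** 2
    # Undefined (division by zero) if both annotators use a single label.
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of eight reviews by two annotators.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_b = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(kappa)
```

Here the annotators agree on 7 of 8 reviews (0.875) against a chance agreement of 0.5, giving a kappa of 0.75. Kappa ranges up to 1, so it is reported as a plain number rather than a percentage.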

B. FINDING A BETTER MODEL FOR ROMAN URDU TEXT SENTIMENT ANALYSIS
DRU was used for binary classification. Four evaluation measures were calculated and considered for the purpose, but in view of its importance and the trend in the reviewed literature, the highest importance was given to accuracy. Results were generated using the TF/IDF method in Python. Table 8 shows a comparative analysis of the test results of all nine models.
As shown in Table 8, ANN and RNN showed good performance, but Logistic Regression outperformed all models with 92.25% accuracy and a 92.39% F-score on the testing data. On validation, LR again outperformed the others and attained 91.47% accuracy. ANN and RNN attained 90.38% and 91.71% accuracy on the testing data, with F-scores of 90.22% and 91.80% respectively. On recall, SVM outperformed the other classifiers with a value of 93.68%. On precision, KNN outperformed all with a value of 93.79%; NB also performed very well and was close to KNN, with a precision of 93.62%. As stated above, the models were validated using the 10-Fold cross-validation technique to see whether a model is over-fitted or under-fitted. There was no big difference between testing accuracy and validation accuracy, which confirms that the models are well fitted (see Figure 4).
Looking at the literature review, the best accuracy of closely related work on Roman Urdu Sentiment Analysis, presented in [17], was 91% using the Support Vector Machine. This research shows 91.47% validation accuracy and 92.25% test accuracy on a dataset of 24,000 reviews, which is bigger than the dataset of [17] (1,600 reviews).
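The four measures discussed above (accuracy, precision, recall and F-score) follow from the counts in a binary confusion matrix. The sketch below uses hypothetical counts, not the values behind Table 8, and the function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f_score = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

# Hypothetical counts for a test set of 200 reviews.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The example shows how a model can lead on one measure without leading on another: precision depends on false positives and recall on false negatives, so they rank classifiers differently.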

VII. CONCLUSION
This research was carried out to define a mechanism for observing people's sentiments about an entity through their reviews in Roman Urdu. The target audience of this research was the people of the sub-continent, and the target language was Urdu written in Roman script. To generate a dataset of Roman Urdu text, reviews of 20 video songs were collected from YouTube. After preprocessing and data labelling, different Machine Learning algorithms were applied. The implementation was done in Python and results were generated by applying TF/IDF. Nine classifiers (NB, ID3, GB, SVM, LR, KNN, ANN, CNN and RNN) were applied to the labelled data. Experiments were conducted on a binomial dataset, where Logistic Regression outperformed the other classifiers with 92.25% accuracy. For cross-validation, K-Fold was applied with k = 10. In cross-validation, LR beat all other algorithms with an accuracy of 91.47%. Based on the experimental results shown above, this study recommends LR and RNN for the classification of Roman Urdu datasets.
This study opens a window for further research to improve classification results. This research can also be enhanced further with the standardization of the Roman Urdu corpus. This study highlights the importance of Part of Speech tagging in Roman Urdu. The dataset DRU used for this research can be accessed at: https://drive.google.com/file/d/1bml7fMTjJ1ZBxaDgx-AWpJjXLOK1vGsN/view?usp=sharing

MUHAMMAD AASIM QURESHI is a seasoned academician and researcher with 20 years of professional experience. He holds a Ph.D. degree in algorithms and has more than 30 publications to his credit. His current areas of interest include artificial intelligence, algorithms, machine learning, and fuzzy logic. He has supervised numerous projects and theses related to virtual/augmented reality, recommender systems, sentiment analysis, robot navigation, etc. He has been a session chair at several national and international conferences. He has also been an invited speaker at various research events. He is also a reviewer for various international journals and conferences.