Convolutional Neural Network Based Classification of App Reviews

An app store (e.g., Google Play) is a platform that offers mobile apps for almost every kind of software and service. App stores allow users to browse and download apps, and help developers keep an eye on their apps by providing ratings and reviews. App reviews may include the user's experience, information about bugs, requests for new features, or a rating of the app in words. Manual categorization of app reviews is critical but time-consuming for developers. Automatic classification of app reviews may help developers, especially in fixing bugs on time. To this end, several approaches have been proposed for the automatic classification of reviews. However, none of them exploits the non-textual information of app reviews. In this paper, we propose a deep learning based approach for the classification of app reviews. It not only leverages non-textual information of app reviews but also exploits a deep learning technique that has proved more accurate for text classification in various domains. The approach first extracts textual and non-textual information of each app review, preprocesses the textual information, computes the sentiment of app reviews using Senti4SD, and determines the history of the reviewer, which includes the total number of reviews posted by the reviewer and the reviewer's submission rate (i.e., what percentage of the reviewer's reviews have been submitted for the associated app). Second, we create a numerical vector for each app review. Finally, we train a deep learning based multi-class classifier to classify app reviews. The proposed approach is evaluated on a public dataset, and the results suggest that it significantly improves the state of the art. It improves average precision from 75.72% to 95.49%, average recall from 69.40% to 93.94%, and f-measure from 72.41% to 94.71%.


I. INTRODUCTION
In this digital world, software has moved from computers to mobile phones. App stores (e.g., Google Play Store and Apple AppStore) provide mobile apps (noted as apps for short in the rest of this paper) for almost every field of life. In the fourth quarter of 2019, Google's Play Store and Apple's AppStore were the two largest app stores with 2.57 million and 1.84 million apps, respectively. Such app stores enable users to browse and download apps, and collect users' reviews (i.e., star rating and textual feedback) that developers can use to improve their apps. To this end, a number of automated approaches have been proposed for the classification of reviews [7]-[10]. Such approaches consider app reviews as plain text, together with their metadata (e.g., length of text) and sentiment, and employ traditional machine learning techniques to make the prediction. To further improve their performance, in this paper, we propose a Convolutional Neural Network (CNN) based approach for the classification of app reviews. On the one hand, it leverages non-textual information of app reviews that has not yet been employed by existing approaches. On the other hand, it exploits a deep learning based classifier that has proved more accurate for text classification in various domains.
The approach works as follows: 1) it extracts textual features (i.e., the textual information and the sentiment of the textual information computed by Senti4SD) and non-textual features (i.e., the statistics of the reviewer, namely the total number of reviews posted by the reviewer and the reviewer's submission rate, i.e., what percentage of the reviewer's reviews have been submitted for the associated app, and the statistics (metadata) of each app review); 2) it preprocesses the textual information and transforms it into a numerical vector; and 3) it trains a CNN classifier to classify reviews into multiple classes. The proposed approach is evaluated on a public dataset, and the results suggest that it significantly improves the state of the art. It improves average precision from 75.72% to 95.49%, average recall from 69.40% to 93.94%, and f-measure from 72.41% to 94.71%.
The rest of the paper is organized as follows: Section II provides the details of the proposed approach. Section III presents the evaluation of the proposed approach. Section IV introduces related work, and Section V concludes the paper and indicates future interests.

II. APPROACH
A. OVERVIEW
Identifying the class of an app review (noted as a review) is essentially a multi-class classification problem. All submitted reviews are automatically classified into four classes, i.e., bug reports, enhancement reports, user experiences, and ratings. Fig. 1 illustrates an overview of the proposed approach. A brief introduction is presented as follows:
1) For each review $r$, we extract its textual information (noted as $t_r$), i.e., the text of the review.
2) We extract the non-textual features (noted as $nt_r$) of the review, e.g., the statistical data of the app (i.e., app size and the number of app installations) and the statistical data of the reviewer.
3) The textual information $t_r$ is preprocessed with natural language processing techniques and converted into numerical vectors.
4) We calculate the sentiment (noted as $s_r$) of each review based on its textual information $t_r$.
5) We extract the labels of the reviews, i.e., whether they belong to bug reports, enhancement reports, user experiences, or ratings.
6) We train a multi-class classifier with the labeled data collected in the previous steps.
7) Finally, for new reviews, we extract their $t_r$ and $nt_r$ (as mentioned in Steps 1-4) and feed them to the trained multi-class classifier to generate the label (bug reports, enhancement reports, user experiences, or ratings) of each new review.
Details of the proposed approach are presented in the following sections.

B. DATA EXTRACTION
A review $r$ from a set of reviews $R$ can be defined as follows:
$$r = \langle t_r, nt_r, l \rangle$$
where $t_r$ is the textual information (i.e., text) of the review, $nt_r$ is the non-textual information of the review, and $l$ is the category (label) of $r$, i.e., whether it belongs to bug reports, enhancement reports, user experiences, or ratings. Non-textual information holds statistics of the reviewer who submitted the review and statistics of the app associated with the review: $nt_r$ consists of two metrics, $SRev$ and $SApp$. $SRev$ further consists of two metrics, $Rev_n$ and $Rev_r$. $Rev_n$ is the number of reviews (either bug reports, enhancement reports, user experiences, or ratings) associated with the given app, and $Rev_r$ is the average rate for reviews (corresponding to the same review category) associated with the app. $SApp$ also consists of two metrics, $App_s$ and $App_i$, where $App_s$ is the size of the app and $App_i$ is the number of installations of the app. $SApp$ is collected from the metadata of the app.
Labels (bug report, enhancement report, user experience, and rating) are assigned with a simple keyword-matching technique that checks each review against a category-specific list of words. To this end, we use SQL queries with the LIKE operator. Table 1 shows a list of words for each category of review (the complete lists are too long to present, so we only show the most influential words). These lists are based on existing research [3], [5], [11], which considers nouns (aspects), adjectives, verbs, and adverbs as keywords. Note that one review may belong to more than one category, e.g., ''Doesn't work at the moment. Was quite satisfied before the last update. Will change the rating once it's functional again'' can be categorized as a bug report or as a rating review.
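The review structure defined above can be sketched as a small Python data model. This is only an illustration: the class names, field names, the unit of the app size, and the example values are assumptions made for this sketch and are not taken from the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class ReviewStats:
    """SRev: review statistics (names are illustrative)."""
    rev_n: int    # Rev_n: number of reviews associated with the app
    rev_r: float  # Rev_r: average rate for reviews of the same category


@dataclass
class AppStats:
    """SApp: statistics taken from the app metadata."""
    app_s: float  # App_s: size of the app (unit assumed to be MB)
    app_i: int    # App_i: number of installations


@dataclass
class Review:
    """A review r = <t_r, nt_r, l>."""
    t_r: str            # textual information
    s_rev: ReviewStats  # first part of nt_r
    s_app: AppStats     # second part of nt_r
    label: str          # category l

    def nt_r(self) -> list:
        """Flatten the non-textual features into one numeric list."""
        return [self.s_rev.rev_n, self.s_rev.rev_r,
                self.s_app.app_s, self.s_app.app_i]


# Hypothetical example values, not drawn from the dataset.
example = Review(
    t_r="Crashes every time I open the camera.",
    s_rev=ReviewStats(rev_n=12, rev_r=0.25),
    s_app=AppStats(app_s=48.5, app_i=1_000_000),
    label="bug report",
)
```

Keeping the two groups of non-textual metrics in separate records mirrors the split between $SRev$ and $SApp$ in the text, while `nt_r()` produces the flat numeric form that a classifier would consume.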
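The keyword-based labeling can be sketched with Python's built-in sqlite3 module. The table schema and the keyword lists below are hypothetical placeholders (only ''good'' and ''bad'' are mentioned in the text for ratings; the real lists are those of Table 1), so this is a sketch of the LIKE-based idea rather than the authors' queries.

```python
import sqlite3

# Hypothetical keyword lists; the actual lists are those of Table 1.
KEYWORDS = {
    "bug report": ["crash", "bug", "error"],
    "enhancement report": ["add", "wish", "please"],
    "user experience": ["helps", "use it"],
    "rating": ["good", "bad"],
}


def label_reviews(texts):
    """Return, for each review text, the list of matching categories."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE review (id INTEGER PRIMARY KEY, text TEXT)")
    conn.executemany("INSERT INTO review (text) VALUES (?)",
                     [(t,) for t in texts])
    labels = {i: [] for i in range(1, len(texts) + 1)}
    for category, words in KEYWORDS.items():
        # One "text LIKE ?" term per keyword, combined with OR.
        clause = " OR ".join("text LIKE ?" for _ in words)
        params = [f"%{w}%" for w in words]
        query = f"SELECT id FROM review WHERE {clause} ORDER BY id"
        for (review_id,) in conn.execute(query, params):
            labels[review_id].append(category)
    conn.close()
    return [labels[i] for i in range(1, len(texts) + 1)]
```

Because a review can match several keyword lists, the function returns a list of categories per review, which is consistent with the observation that one review may belong to more than one category. Note that LIKE performs substring matching, so a short keyword can also match inside longer, unrelated words.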

C. PREPROCESSING
We preprocess the textual information $t_r$ to convert it into a numerical vector. Fig. 3 illustrates an overview of the preprocessing. The details of the preprocessing are as follows.
First, we apply a spelling check to each review. Then, we remove stop-words and special characters from the reviews and convert the remaining text into lowercase to improve the comparison. After that, we tokenize each review into tokens (words) using Python NLTK [12]. Next, we lemmatize the tokens to turn inflected words into their base forms (e.g., liked turns into like). Finally, we convert the preprocessed text into vectors (embeddings). To this end, we leverage word2vec [13], which takes each token from the preprocessed text and converts it into a fixed-length numeric vector. Notably, we exploit the standard libraries Gensim and TensorFlow to implement word2vec. We use the Skip-gram architecture of word2vec for the given dataset with the following settings: window_size = 2, n = 300 (dimensions of word embeddings), epoch = 50, and learning_rate = 0.01. A concatenated vector $WV_r$ is created for each review from the vectors of its tokens.
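The cleaning steps above can be sketched in plain Python. This is a simplified stand-in, not the authors' pipeline: the spelling check is omitted, the stop-word set is a tiny illustrative sample, and the lemma table is a toy replacement for a real lemmatizer such as the one in NLTK.

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a
# full list such as the one shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "to", "and", "i"}

# Toy lemma table standing in for a real lemmatizer (assumption).
LEMMAS = {"liked": "like", "crashes": "crash", "better": "good", "best": "good"}


def preprocess(text):
    """Lowercase, strip special characters, tokenize, drop stop-words, lemmatize."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove special characters and digits
    tokens = text.split()                  # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]
```

The resulting token lists are what an embedding model would be trained on. With Gensim 4.x, the stated settings would roughly correspond to `Word2Vec(sentences, sg=1, vector_size=300, window=2, epochs=50, alpha=0.01)`; this mapping of the paper's settings to Gensim parameters is an assumption.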

D. SENTIMENT ANALYSIS
Existing studies [14], [15] and the word list of each category (as shown in Table 1) suggest that sentiment-related words may help in the classification of reviews. For example, 79.6% of the rating reviews contain the keywords ''good'' and ''bad''.
We compute the sentiment $RSen$ of each review and consider it as one of the key features of the review. It can be represented as:
$$RSen(r) = CalRSen(r.t_r)$$
where $r$ is a review, $r.t_r$ is the textual feature of the review, and $RSen(r)$ is the sentiment of the review. The function $CalRSen$ computes the sentiment of each review $r$ using Senti4SD [16]. It exploits three different kinds of features: 1) a sentiment lexicon; 2) n-grams extracted from the given dataset (uni-grams and bi-grams in our case); and 3) semantic features. Semantic features depend on the word representation in a distributional semantic model. They capture the similarity between the vector representations of the Stack Overflow documents and prototype vectors representing the polarity classes in a distributional semantic model, built using positive, negative, and neutral words from the sentiment lexicon. Senti4SD considers the emotion words, modifiers, and negations in the review for the calculation and returns the sentiment of the review. Note that we leverage Senti4SD because of its strong performance on software engineering text in contrast to the most commonly used resources, e.g., SentiWordNet [17].

E. CONVOLUTIONAL NEURAL NETWORK BASED CLASSIFIER
The proposed Convolutional Neural Network (CNN) based classifier first extracts the features from preprocessed textual information WV r and combines the extracted textual features and non-textual features nt r , and predicts labels based on the combined features. The details of both steps are as follows. VOLUME 8, 2020

1) EXTRACTION OF TEXTUAL FEATURES
The CNN based classifier takes $WV_r$ and returns a feature map $c$ that contains the maximum values of the features. The classifier exploits three convolution layers for the extraction of textual features, with filter sizes 3, 4, and 5, respectively. Each filter refines the feature vectors by performing a convolution on the corresponding layer and creates a feature map for the next layer.
The initial convolution layer takes a word vector $WV_i$ from $WV_r$ as a $k$-dimensional vector, where $k$ is equal to 300. Let $x_i \in \mathbb{R}^k$ be the $k$-dimensional vector corresponding to $WV_i$, and let $x_{i:i+j}$ be the concatenation of the feature vectors $x_i, \ldots, x_{i+j}$. A filter $w \in \mathbb{R}^{dk}$ applied to a window of $d$ words creates a new feature $c_i$ that can be defined as
$$c_i = f(w \cdot x_{i:i+d-1} + b)$$
where $b$ represents a bias and $f$ represents the hyperbolic tangent non-linear function. Applying the filter to every window of the review creates a feature map $c$ that can be defined as
$$c = [c_1, c_2, \ldots, c_{n-d+1}]$$
where $n$ is the number of words in the review. Notably, we pass the numerical vectors into a CNN with dropout = 0.2 to prevent overfitting.
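The convolution step can be illustrated with a dependency-free sketch. It assumes the standard formulation for text CNNs, in which each feature is $c_i = f(w \cdot x_{i:i+d-1} + b)$ with $f = \tanh$, followed by max-over-time pooling; the function names and the toy dimensions are assumptions for illustration (the paper uses k = 300 and filter sizes 3, 4, and 5).

```python
import math


def conv_feature_map(x, w, b, d):
    """Apply one filter over every window of d consecutive word vectors.

    x: list of n word vectors, each of length k
    w: flat filter of length d * k
    b: scalar bias
    Returns the feature map [c_1, ..., c_{n-d+1}].
    """
    n = len(x)
    feature_map = []
    for i in range(n - d + 1):
        # Concatenate the d word vectors of the current window.
        window = [value for vec in x[i:i + d] for value in vec]
        z = sum(wj * xj for wj, xj in zip(w, window)) + b
        feature_map.append(math.tanh(z))
    return feature_map


def max_over_time(feature_map):
    """Keep the strongest response of the filter across all windows."""
    return max(feature_map)


# Toy input: n = 3 words, k = 2 dimensions, window d = 2 (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, 0.0, 0.0, 0.5]
c = conv_feature_map(x, w, b=0.0, d=2)
```

Max pooling makes the extracted feature independent of where in the review the matching window occurs, which is why the feature map is said to contain the maximum values of the features.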

2) CLASSIFICATION BASED ON MERGED FEATURES
The classifier takes two additional inputs, the sentiment of each review $RSen$ and the statistical information of the app $SApp$, and merges them with the processed textual information using the merge layer. The merge layer directly fuses $RSen$ and $SApp$ into the processed textual information $c$. Note that we merge the additional information directly, rather than passing it through a separate network, to avoid the loss introduced by convolutions. Then, the flatten layer converts the integrated feature map matrix into a vector. Finally, the dense layer takes the flattened information, computes a weighted sum of the integrated features with weights $w$ and a bias $b$, and applies the non-linear activation function relu to predict the classes.
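A minimal sketch of the merge and dense steps follows. The weights, the feature values, and the use of an argmax over the relu outputs to pick the class are assumptions for illustration; in the actual model the weights are learned during training.

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


def merge(c, r_sen, s_app):
    """Fuse pooled textual features with the sentiment and app statistics."""
    return list(c) + [r_sen] + list(s_app)


def dense(features, weights, biases):
    """One output per class: relu(w . features + b)."""
    return [relu(sum(w * x for w, x in zip(row, features)) + b)
            for row, b in zip(weights, biases)]


def predict(features, weights, biases, classes):
    """Pick the class with the largest activation (assumed decision rule)."""
    scores = dense(features, weights, biases)
    return classes[scores.index(max(scores))]


CLASSES = ["bug reports", "enhancement reports", "user experiences", "ratings"]

# Hypothetical inputs: two pooled textual features, a negative sentiment
# score, and two app statistics assumed to be scaled beforehand.
features = merge([0.76, 0.46], r_sen=-1.0, s_app=[0.5, 0.2])

# Toy weights: only the first class reacts, and it reacts to negative sentiment.
weights = [
    [0.0, 0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
biases = [0.0, 0.0, 0.0, 0.0]
```

Concatenating the non-textual values after pooling means they reach the dense layer unchanged, which matches the stated motivation for fusing them directly instead of routing them through a separate network.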

III. EVALUATION
In this section, the proposed deep learning-based classification approach (DCAR) for reviews is evaluated with the real-world reviews from Google Play and Apple app stores.

A. RESEARCH QUESTIONS
We investigate the following research questions for the evaluation of DCAR:
• RQ1: Does DCAR outperform the state-of-the-art approaches?
• RQ2: How do the different features influence the performance of DCAR?
• RQ3: How does preprocessing influence the performance of DCAR?
• RQ4: Does the proposed classifier outperform other machine learning and deep learning classifiers?

B. DATASET
We exploit the review dataset extracted by Maalej et al. [7] from the Google store and the Apple store. Only reviews of the top apps were crawled. In total, the dataset contains 1,126,453 reviews for 1100 apps from the Apple store and 146,057 reviews for 80 apps from the Google store. Each review contains text, title, app name, category, store, date of submission, reviewer-id, and rating. We follow Maalej et al. [7] and use their manually labeled dataset of 4400 reviews for the evaluation of DCAR. The data is selected in two phases. In the first phase, 2000 reviews are randomly selected from the collected reviews of both stores (1000 reviews from each store). In the second phase, the top 3 apps are first selected from each store. Then, 400 reviews are randomly selected for each app. In total, four sets of sample reviews (named 1100 apps, Dropbox, Evernote, and TripAdvisor) are created from the Apple store reviews, where the sets contain 1000, 400, 400, and 400 reviews, respectively. Similarly, four sets of sample reviews (named 80 apps, PicsArt, Pinterest, and WhatsApp) are created from the Google store reviews, where the sets contain 1000, 400, 400, and 400 reviews, respectively. The statistics of the dataset are shown in Table 2.
Moreover, they conducted a paid, peer-based manual content analysis of the selected reviews to create the truth set. They sent every review to 2 coders randomly selected from 10 computer science experts. They briefly explained the manual classification task to the coders in a meeting and provided a tool for classification. The manually analyzed reviews comprise 2000 randomly selected reviews (1000 from Apple apps and 1000 from Google apps) and 2400 reviews of the manually selected apps (1200 from Apple apps and 1200 from Google apps).

C. EXPERIMENTAL DESIGN
1) RQ1: COMPARISON AGAINST THE STATE OF THE ART
The first research question (RQ1) provides a comparison between the proposed approach (DCAR) and the state of the art. To answer RQ1, we compare DCAR against the approach of Maalej et al. [7] (noted as Maalej's for short in the rest of this paper). To the best of our knowledge, Maalej's reports the best results for the classification of reviews. We also compare DCAR against Umer's approach [15] and Ramay's approach [18], as both report the best results for the classification of software engineering text.
The comparison employs ten-fold cross-validation. On each fold, reviews from a single set of sample reviews (10%) are taken for testing (noted as sTest), whereas the others (90%) are taken for training (noted as sTrain). We train DCAR and Maalej's separately with the same sTrain. After that, the trained models are evaluated separately with the same sTest. Note that we evaluate the performance of DCAR and Maalej's using the well-known and most widely adopted metrics for machine learning classification [7], [14], [15], [19], i.e., precision, recall, and f-measure.
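The evaluation protocol can be sketched as follows. The contiguous, unshuffled fold assignment and the one-vs-rest computation of the metrics are assumptions of this sketch; the paper does not specify how the folds are drawn or how per-class scores are averaged.

```python
def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def precision_recall_f(y_true, y_pred, positive):
    """Precision, recall, and f-measure for one class (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def f_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

With 4400 labeled reviews, each fold holds 440 test reviews and 3960 training reviews. As a sanity check on the formula, `f_from(95.49, 93.94)` evaluates to roughly 94.71, which agrees with the average f-measure reported for DCAR.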

2) RQ2: INFLUENCE OF DIFFERENT FEATURES
The second research question (RQ2) examines the influence of the different features employed by DCAR, as mentioned in Section II-B. DCAR leverages the $t_r$ and $nt_r$ features of reviews. We disable each of them in turn and repeat the evaluation (only for DCAR) as mentioned in Section III-C1. Such an evaluation measures the impact of each feature on DCAR.

3) RQ3: INFLUENCE OF PREPROCESSING
The third research question (RQ3) examines the influence of the preprocessing (Section II-C) on the given dataset by comparing the performance of DCAR with and without preprocessing. We remove all the preprocessing steps and repeat the evaluation (as mentioned in Section III-C1) to examine the impact of preprocessing.

4) RQ4: COMPARISON AMONG DIFFERENT CLASSIFIERS
The fourth research question (RQ4) compares the proposed classifier with other machine learning and deep learning classifiers. To this end, we exploit Naive Bayes (NB), Multi-nomial Naive Bayes (MNB), Decision Tree (DT), Support Vector Machine (SVM), Convolutional Neural Network (CNN), and Long Short Term Memory (LSTM), and repeat the evaluation as mentioned in Section III-C1. Note that we select these classifiers because of their strong performance in text classification [7], [14], [15], [18], [19].

D. RESULTS
1) RQ1: COMPARISON AGAINST THE STATE OF THE ART
To answer the research question RQ1, we compare DCAR against Maalej's, Umer's, and Ramay's. Table 3 presents the evaluation results. The first column represents the approaches, the second column represents the evaluation metrics, and columns 3-6 present the performance on each of the given categories. The last column presents the average performance of the approaches of each testing category. The first row represents the testing category, and the rest of the rows present the performance of the approaches on the given category. The table presents the best performance for each testing category in bold.
From Table 3, the following observations are made. Consequently, DCAR has a significant improvement in f-measure on Ratings. We perform a one-way ANOVA on f-measure to further investigate the performance improvement of DCAR. ANOVA examines the difference between the performance of the given approaches. Fig. 4 illustrates the results of the ANOVA analysis.
The results suggest that the F-ratio is 61.2748 and the p-value is 1.5116E-07, which is less than 0.05. From Fig. 4, we conclude that ANOVA indicates a significant difference among the f-measures of the given approaches. Note that we also conduct ANOVA on precision and recall, which confirms the significant improvement of DCAR.
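For readers unfamiliar with the test, the F-ratio of a one-way ANOVA can be computed as sketched below. The sample groups are toy numbers chosen for easy arithmetic; they are not the f-measures of the compared approaches, and the function name is an assumption.

```python
def one_way_anova_f(groups):
    """Return the F-ratio: between-group variance over within-group variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Toy data: three hypothetical approaches with three scores each.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_ratio = one_way_anova_f(toy_groups)
```

A large F-ratio means the approaches differ from each other much more than repeated measurements of the same approach differ among themselves. The p-value additionally requires the F distribution; `scipy.stats.f_oneway` returns both the statistic and the p-value.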
Based on the preceding analysis, we conclude that DCAR significantly improves the state of the art in classification of reviews.

2) RQ2: INFLUENCE OF DIFFERENT FEATURES
To answer the research question RQ2, we evaluate the performance reduction of DCAR by disabling a few of the given features. Table 4 presents the evaluation results. The first column presents the disabled features. The rest of the columns present the performance of DCAR against each disabled setting.
From Table 4, the following observations are made.
• First, the performance of DCAR reduces upon disabling any of the employed features. The default setting of DCAR (i.e., none of the features is disabled) achieves the highest average performance.
• Second, the textual features (i.e., those extracted from reviews) are critical for DCAR. Disabling the textual features (2nd row) results in the highest reduction in the performance of DCAR. The performance comparison of both cases indicates that statistics of reviews are more appropriate for the classification of reviews.
• Finally, we notice that disabling $Rev_n$ (10th row) results in a more significant reduction in performance than disabling $Rev_r$ (11th row). The performance comparison of both cases indicates that $Rev_n$ is more appropriate for the classification of reviews.
Based on the preceding analysis, we conclude that all of the employed features are useful. Leveraging the non-textual features, particularly the statistics of reviewers, results in a significant increase in performance.

3) RQ3: INFLUENCE OF PREPROCESSING
To answer the research question RQ3, we examine the influence of the preprocessing. We remove the preprocessing step and repeat the evaluation (as mentioned in Section III-C1). Table 5 presents the evaluation results. The second row presents the performance of DCAR with the default setting (i.e., preprocessing enabled). The third row presents the performance of DCAR without preprocessing (i.e., preprocessing disabled). The last row presents the performance improvement of DCAR.
From Table 5, the following observations are made.

4) RQ4: COMPARISON AGAINST DIFFERENT CLASSIFIERS
To answer the research question RQ4, we compare our CNN based classifier with a deep learning-based classifier (i.e., LSTM) and machine learning-based classifiers (i.e., Random Forest (RF), Support Vector Machine (SVM), Multi-nomial Naive Bayes (MNB), and Naive Bayes (NB)). Table 6 presents the evaluation results. The maximum performance of each classifier on each testing category is shown in bold.
From Table 6, the following observations are made.
• First, the proposed classifier (CNN) achieves the highest performance among the selected deep learning classifiers. One reason is that CNN is better at extracting position-invariant features than LSTM. Another reason is that CNN performs exceptionally well with high-dimensional features [20].
• Second, the CNN classifier also achieves higher performance than the selected machine learning classifiers. One reason is that CNN transforms the non-linear and inter-dependent features into a high-dimensional plane.
• Third, although the state of the art [7], [21] suggests that Bayesian classifiers are effective in the classification of reviews, they result in a performance reduction with the proposed approach. One possible reason is that some of the non-textual features given to the classifier ($Rev_n$ and $Rev_r$) are inter-related, whereas Bayesian classifiers perform well with independent features [15], [18]. In contrast to SVM and RF, Bayesian classifiers are not appropriate for the proposed approach.
• Fourth, we observe only a slight difference between the performance of SVM and RF. Evaluating the proposed approach with these classifiers on other datasets may change the reported performance.
Based on the preceding analysis, we conclude that the proposed classifier is appropriate for the performance improvement of the proposed approach.

E. THREATS TO VALIDITY
A threat to external validity is that only a limited number of reviews from the selected apps are considered for the evaluation of the proposed approach. Although we observe only a slight change in the performance of the proposed approach among the given categories, the results may not hold for other apps or when more categories are added.
A threat to construct validity is that the labels in the exploited dataset could be incorrect. Maalej et al. [7] manually labeled the selected reviews, and these labels could be incorrect for different reasons [7]. As a result, such incorrect labeling could produce inaccurate results.
A threat to internal validity is that we re-implement Maalej's approach with different evaluation criteria (i.e., ten-fold cross-validation). Consequently, the average results of Maalej's are slightly different. To mitigate the threat, we double-check the implementation and evaluation results.

IV. RELATED WORK
It is evident that one of the major success factors for software projects is the users' involvement and their feedback. For example, the positive impact of user involvement is highlighted in [22], where the authors suggested managing user involvement carefully because otherwise it may cause more problems than benefits. Similarly, Pagano and Brügge [23] conducted an empirical case study on software evolution and showed that user feedback carries important information for developers, not only to improve the software quality but also to identify missing features. Similar to traditional requirement engineering, crowd-based requirement engineering is also a focus of researchers. In [24], [25], the authors addressed the scalability issue in the case of multiple users and the significance of the tools required for the analysis of their feedback. In [26], the authors focused on getting user feedback from a mobile device, which includes implicit information. To develop and maintain software projects, bug repositories are considered one of the most scrutinized tools for the collection of user feedback [27].
The analysis of reviews for app stores (e.g., Google Play store and Apple store) has received significant research attention in recent years because reviews are usually difficult to understand due to their unstructured textual information and frequency; moreover, only a third of them are informative. Therefore, Ciurumelea et al. [28] developed a tool for developers to analyze the direct and valuable feedback provided through user reviews, to better plan maintenance and evolution activities for their mobile-apps. To facilitate the analysis of reviews, Maalej et al. [7] have classified it into three major categories: i) general exploratory studies, ii) app feature extraction, and iii) review filtering and summarization. Furthermore, the authors provided a relationship between customer, business, and technical characteristics of mobile-apps from the BlackBerry store [2]. They exploited NLP techniques with data mining to extract the correlations and trends. The results suggest a strong correlation between mobile-app popularity and customer rating, whereas the correlation between the number of features and price is mild. Similarly, broad exploratory studies for the Apple store were conducted by Hoon et al. [29] and Pagano and Maalej [3]. These studies identified the trends for the rating, topics discussed in reviews, quality of mobile-apps, and quality of the review. Zhang et al. [30] proposed a novel approach to automatically tag unlabelled issue reports. This approach computes the similarity between each unlabelled issue report and user reviews related to bugs and features, and also calculates the textual similarity scores between each unlabelled issue report and the labeled ones.
Some studies mined user opinions and mobile-app features from application stores. In [31], Harman et al. used a greedy algorithm to extract mobile-app features from the official pages presenting the description of applications, in order to analyze business and technical aspects of mobile-apps. In [32], Chandy et al. exploited a latent model to classify spam in mobile app stores and categorized the reviews into malicious and normal groups. To group and extract the feature requests from mobile-app reviews, MARA (Mobile App Review Analyzer) [5] was introduced, which exploits Latent Dirichlet Allocation (LDA) and linguistic rules for identifying common topics. In [6], the authors proposed an automatic solution for topic extraction. Moreover, they used LDA for the summarization of user reviews.
Wiscom [33] analyzed user comments and ratings at three different levels. At the first level, inconsistencies in reviews are discovered. Then, the reasons for liking and disliking mobile-apps are identified. Finally, an insight into the major concerns of the users is provided. The study exploited a linear regression model to identify negative words with the help of user reviews and ratings. Furthermore, these words are used as an input to the LDA model to find the reasons why people dislike mobile-apps. Li et al. [1] proposed another method to analyze user satisfaction with the help of user reviews. The authors employed a predefined dictionary to match words or phrases of user reviews. The App Review Mining (AR-Miner) [34] approach extracts the most informative user reviews by using Naive Bayes. The approach first removes irrelevant and noisy reviews and groups informative reviews with the help of topic modeling. Then, it uses a ranking scheme to prioritize informative reviews. Finally, it presents the most informative reviews using an intuitive visualization approach.
Automatic approaches for the classification of different software artifacts have gained significant research attention. In [8], the authors exploited machine learning with linguistic rules to classify user reviews into a taxonomy. This taxonomy is created by analyzing developers' emails. Bacchelli et al. [9] leveraged a natural language parser and Naive Bayes to classify useful information from developers' emails. To classify structured and unstructured data, Zhou et al. [10] also leveraged machine learning techniques. Martens and Maalej [35] exploited a machine learning classifier to classify fake reviews and achieved a recall of 91% and an AUC/ROC value of 98%. Similarly, Ekanata and Budi [36] exploited Naive Bayes, Support Vector Machine, Logistic Regression, and Decision Tree for the classification of reviews. The results suggest that Logistic Regression provides the best f-measure of 85% when unigram, sentence length, and sentiment score are combined.
For sentiment analysis, Cambria [37] reported that automatically capturing the sentiments of the general public about social events, political movements, marketing campaigns, and product preferences has raised interest in the scientific community and the business world, owing to the exciting open challenges and the remarkable fallouts in marketing and financial market prediction. Later, Amir and Erik [38] explored the potential of a novel semi-supervised learning model based on the combined use of random projection scaling as part of a vector space model, and support vector machines to perform reasoning on a knowledge base. To this end, they combined a graph representation of commonsense with a linguistic resource for the lexical representation of affect. The evaluation results suggest a significant improvement in tasks such as emotion recognition and polarity detection, and pave the way for the development of semi-supervised learning approaches to big social data analytics. Furthermore, Wang et al. [39] proposed an automatic method for the construction of a domain-specific sentiment lexicon to avoid sentimental ambiguity. It incorporates the sentiment information not only from the existing lexicons but also from the corpus. They exploited an improved TF-IDF to calculate the sentiment of words. The evaluation results suggest that the constructed lexicon reduces sentimental ambiguity and outperforms the state-of-the-art approaches. Moreover, Ma et al. [40] proposed a novel solution for aspect-based sentiment analysis by exploiting commonsense knowledge. They incorporated the commonsense knowledge of sentiment-related concepts into the end-to-end training of an LSTM network that outperforms the state-of-the-art methods.
Based on the preceding literature analysis, we conclude that a number of approaches have been proposed for the classification of reviews. However, DCAR differs in that it not only leverages a deep learning classifier but also employs the statistical information of reviewers and mobile-apps. Compared to the state of the art (discussed in this section), the additional features (i.e., statistical information of reviewers) make DCAR different from off-the-shelf text classification approaches.

V. CONCLUSION AND FUTURE WORK
Automated classification of reviews from various apps is highly desirable. In this paper, we propose a deep learning based approach for the classification of reviews. Compared to existing approaches, the proposed approach not only leverages the statistics of reviews as non-textual features, which have not been employed before, but also exploits a deep learning technique to classify reviews. The results of the ten-fold cross-validation evaluation on real-world reviews indicate that the proposed approach significantly surpasses the state of the art.
The results obtained with the additional features, i.e., the statistics of reviews, suggest that these features are appropriate for the classification of reviews. However, temporal and location-based features are not yet examined, e.g., reviews from different geographical regions or cultures may differ in their sentiment. In future, examining such features could be a possible research direction to improve the results of review classification. Moreover, it could be interesting to evaluate the related approaches with a larger dataset of real-world reviews.