Predicting the Helpfulness Score of Product Reviews Using an Evidential Score Fusion Method

Every day, many online product sales websites and specialized reviewing forums publish a massive volume of human-generated product reviews. People use these reviews as a valuable free source of knowledge when deciding to buy products. Therefore, an accurate automated system for distinguishing useful reviews from non-useful ones is of great importance. This article presents a new model for specifying the usefulness of comments using the textual features extracted from the reviews. Various types of features are exploited in this study, including emotion-related, linguistic, and text-related features, valence, arousal, and dominance (VAD) values, review length, and polarity of comments. Moreover, two new algorithms are presented: an improved evidential algorithm for emotion recognition, and an algorithm for extracting VAD values for each review. Finally, the usefulness of reviews is predicted using the mentioned features and an improved Dempster–Shafer score fusion algorithm. The proposed method is applied to the Amazon Books and Video Games review datasets. The results show that combining the features associated with emotions, VAD features, and text-related features improves the accuracy of predicting the usefulness of reviews. Also, in comparison with the original Dempster–Shafer method, the precision of the improved Dempster–Shafer algorithm is 15% and 11% higher for the two datasets, respectively.


I. INTRODUCTION
With the advent of the Web and the expansion of e-commerce, users are expressing their views on products and services on many specialized and commercial sites to interact and work together. Through online reviews, customers share their personal beliefs, experiences of purchasing decisions, and evaluations towards services or products [1]. These reviews contain valuable information and can be used to analyze people's attitudes and interests. Moreover, they can be used to identify and analyze people's positive and negative views on a variety of targets such as locations, products, and specific events [2]. Such informative reviews are valuable for both consumers and producers. Consumers read product reviews before making a purchase decision to reduce search costs and purchase uncertainty [3]. A well-written review identifies the strengths and weaknesses of products for producers and identifies what can be learned about new product development [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Shiqing Zhang.
User reviews already take up a lot of space on the web, and the volume of user-generated textual data is increasing every day. Therefore, with the rapid increase of online reviews, it is impossible for people to review all comments related to a product or service in a limited time [3]-[5]. To alleviate this problem, Amazon and some other online retailers allow consumers to evaluate opinions by implementing helpfulness voting systems. In these systems, beside each comment, there is a question such as: ''Was this comment helpful to you?''. Also, usefulness information is usually reported in the form of a usefulness score to help the consumer evaluate the review [3]. This usefulness score is equal to the ratio of the number of useful votes to the total number of votes for a given opinion and is written as ''n out of t people found this helpful'' [6].
Although the usefulness score calculated on the basis of user votes may be informative, it suffers from the ''cold-start'' problem. In other words, it cannot be used properly for recent (newly written) reviews, which have not yet been voted on. Hence, due to a lack of information, computing the usefulness score of such newly written reviews is more complicated [7], [8]. Another problem is that the comments that are posted earlier attract the most readers' attention, and are consequently placed at the top of the list, while comments that are posted later are listed at the bottom of the list and ignored [4], [5]. This phenomenon is known as the Matthew effect [9]. Another problem with the usefulness score is that, as reviews are posted quickly, some useful reviews are likely to be covered by useless reviews before being considered [8]. Therefore, it is important to use content-based methods for automatically analyzing reviews that can accurately predict the usefulness score of reviews as soon as they are posted on a website.
The importance of automatically recognizing useful comments has been examined in previous studies [1], [4]-[6], [10], [11]. However, most existing studies have some limitations. For example, few studies have examined the effect of various emotions such as sadness, happiness, trust, expectation, etc. on the usefulness of reviews, although many researchers have acknowledged the role of emotion in the online environment and emphasized the importance of examining the role of emotion in interpreting online reviews [6], [10]. Moreover, it has been shown that emotions of the same polarity (e.g., anger and fear) may have different effects on consumers' activities, including their perceived usefulness of a review [10]. Consequently, it is necessary to consider the effect of different emotions on the usefulness of reviews. Another problem of the existing methods is that they do not consider the intensity of the words in each emotion category. This is important because two different words can represent the same emotion with different intensities. For example, the two words ''good'' and ''great'' both convey a sense of pleasure, but each conveys a different intensity of that emotion. To our knowledge, previous studies on review usefulness prediction did not address this problem. In the current study, this issue is considered when extracting an emotion from a review. Also, taking into account the three semantic dimensions of valence, arousal, and dominance (VAD) along with the dimensions of emotion for the context of each review can lead to a more detailed analysis because, according to [12], analyzing these three dimensions is very effective in understanding the meaning of the text. It is also expected that the more accurate the meaning of the review is, the more useful it will be.
In addition to the abovementioned problems, the results of some previous studies need to be investigated more carefully. For example, the results of previous research indicate that surprise and expectation do not play a critical role in the classification of usefulness of reviews [10]. Moreover, the two emotions of surprise and anticipation fall into the category of positive emotions. However, examining the vocabulary of these emotions shows that they are two-sided (i.e., they may be positive or negative). In fact, positive surprise and expectation can produce a positive opinion, while negative surprise and expectation can lead to a negative opinion. These aspects have not been considered in previous studies on review helpfulness prediction. Therefore, in this study, each of these two emotions is divided into two groups, positive and negative. This increases the number of emotion categories in the NRC lexicon [13], [14] (i.e., sadness, pleasure, fear, anger, hatred, trust, surprise and expectation) to 12 by adding the four emotions ''positive surprise'', ''negative surprise'', ''positive anticipation'', and ''negative anticipation''.
Having extracted the above features, to help avoid biasing the review helpfulness classifier, we train seven classifiers and select the top three to combine their results. Specifically, each of the three classifiers assigns a probability to each usefulness category. These probabilities may be interpreted as the confidence level of the classifiers in assigning categories to reviews. To obtain the final helpfulness category of reviews, we propose an improved evidential fusion method. This method is based on the Dempster-Shafer (D-S) theory, which has been studied extensively in decision-making and sentiment analysis in recent years [15], [16]. This algorithm is used to fuse information obtained from different sources, especially when there is uncertainty in the sources [16]-[18].
Considering the mentioned aspects of the current study, in the proposed model, we seek to answer the following questions to determine the usefulness of reviews:
1) Does considering the three dimensions of valence, arousal, and dominance (VAD) along with contextual and emotion-related features affect the usefulness of the comments?
2) Does the improvement of the D-S algorithm increase the accuracy of the review usefulness prediction system?
The main contributions of this study are as follows:
1) Given that both surprise and expectation in the NRC emotion lexicon can have either negative or positive polarity, the NRC lexicon is enhanced by adding four emotions, namely positive surprise, negative surprise, positive expectation, and negative expectation.
2) A new algorithm for extracting 12 separate emotion vectors from the reviews is presented that considers the following aspects:
a. To increase the accuracy of the extracted emotion, the intensity of the emotion of words in separate emotional groups is also considered.
b. To extract the emotion vector correctly, the effect of negative structures in the sentence is investigated.
c. To extract the emotion vector more accurately, the effect of structures that intensify or decrease the emotion of words is also investigated.
3) The three features of valence, arousal, and dominance (VAD) are considered for each text for detecting the helpfulness score of reviews.
4) To determine the usefulness of the reviews, the triplet structure is used to improve the D-S algorithm.
The findings of this study can be used to improve e-shops, opinion mining, review summarization, and recommender systems. Also, the emotion and sentiment extraction parts of the proposed system may be used separately in sentiment and emotion analysis.
VOLUME 8, 2020
The remainder of the paper is organized as follows. Section II presents related work. Section III presents the proposed method in more detail. Section IV analyzes the results of the implementation and discusses the findings. Section V concludes the paper and presents some directions for future work.

II. RELATED WORK
A. EXISTING STUDIES ON THE USEFULNESS OF REVIEWS
The usefulness of a review means the objective evaluation of the quality of the review by others [10]. Consumers can hardly identify useful reviews manually among the high volume of product reviews on websites, so finding useful reviews automatically and understanding the factors influencing the usefulness of reviews are important [6]. The usefulness of online reviews is a multi-dimensional concept that can be affected by a variety of factors [10]. Many researchers have suggested various factors that may influence the usefulness of reviews. For example, review length, that is, the number of words that constitute the review, may be an influential factor [1], [19]-[23]. Several studies have shown the positive effects of review length [1], [21], [22], [24]-[29]. This effect varies with regard to product type; review length has a greater effect on the usefulness of reviews of search goods than of experience goods [1], [27].
In 2016, Qazi et al. [30] presented a conceptual model for predicting the usefulness of reviews. This study not only examines quantitative factors (such as the number of concepts), but also focuses on the qualitative aspects of reviews, including types of review (i.e., regular, comparative, and suggestive reviews). A comparative review expresses a relationship of similarity or difference between two or more entities, or describes an individual's preference for common features of entities. Regular reviews express a general view, and suggestive reviews give advice on whether or not to buy a product [30]. The researchers found that all three types of reviews had significant effects on the purchase decision [4].
There are different opinions about the impact of subjectivity/objectivity on the usefulness of reviews. Facts contain objective statements about entities, events, and their attributes, while subjective expressions describe individuals' emotions and evaluations of entities, events, and their attributes [31]. The effect of subjectivity on the usefulness of opinions has been demonstrated in [8], [32]-[35].
Researchers also examined several aspects of review text such as various readability measures, spelling errors, subjectivity levels, and average usefulness of reviews [33]. It has been shown that opinions with a combination of objective and subjective sentences have a negative relationship with product sales compared to only subjective or only objective ones [33]. The researchers also found that mid-length reviews with a few misspellings were more effective from customers' view than very long or very short reviews with more misspellings [4], [33]. However, in [20] the authors concluded that legibility is more effective than review length [4]. In [8], the characteristics of noun, verb, and adjective have been used as effective predictors and in [6], the adverb feature is also defined alongside the noun, verb, and adjective. Linguistic features are also used to predict usefulness [35].
Several studies have shown that the readability of the review text is an effective factor in predicting the usefulness of the review [6], [8], [35], [36]. Reviews with high readability are likely to be read and receive more votes from users [6]. In addition, previous studies have shown that total votes and total number of reviewers are also effective features for predicting the usefulness of online reviews [37]. A new reviewer feature, namely the reviewer's activity length, is also proposed along with total votes and total number of reviews [6]. Disclosed personal information such as real name, location, and nickname has also been shown to be an effective feature for predicting the usefulness of online reviews [38].
In [8], a few models were designed to predict the usefulness of consumer reviews using several contextual features such as polarity, subjectivity, entropy, and ease of reading. These machine learning models automatically determine the usefulness of each initial review as soon as it is posted on the website so that reviews have an equal chance of being viewed by others. The authors in [39] examine how online consumers' reviews interact with one another and how consumers' beliefs evolve over time. The researchers proposed a dynamic model of opinion evolution that is applicable to the opinions of online consumers in the e-commerce environment and examined influencing factors such as visitor readability, sorting and dissemination strategies, convergence parameters, feedback, and thresholds of trust. In [40], a review usefulness prediction framework is proposed that uses multilingual reviews to generate relevant business insights and predicts the usefulness of reviews with the help of non-English comments.
Previous studies show that product features also play an important role in predicting the usefulness of online reviews [1], [6], [37]. The type of product (i.e., experience or search) can play an important role in the usefulness of reviews [6], [41]. Determining the quality of experience products before use is not easy, so consumers looking for such products rely on others' experiences, whereas search products can be judged on the basis of product specifications before purchase [6], [27], [42]. The researchers found that positive and negative emotions were more effective for search products than for experience products. In addition, for search products, the combination of features with positive and negative emotions performs better than for experience goods [6].
In [6], four product features, namely product emergency index, Amazon sales rank, Amazon product price list, and time elapsed since the product release date, were proposed. In [5], product description and question-answer features are used along with contextual features. The results show that using product description features and customer question-answer data improves the accuracy of predicting usefulness scores. The characteristics of product reviews among five different products were investigated in [43], and their effects on the usefulness of the review were identified. Four data mining methods have also been explored to determine the best way to predict the usefulness of comments for each product using five Amazon datasets. The results show that opinions for different types of product have different linguistic and psychological characteristics and that their influencing factors are different.
The model presented in [44] assumes that product feedback sends signals to buyers. Using Amazon product reviews, the researchers tested their model and found that signals related to comment content (e.g., specific comment content and writing style) and comment-related signals (such as reviewer experience and popularity) both influence review usefulness. In addition, they showed that the signaling environment was influenced by the signal and that the motivations given to the reviewers influenced the signals sent.

B. EMOTIONAL FEATURES AND REVIEW USEFULNESS
Several psychological theories have suggested basic human emotions of various dimensions [45]-[47]. In [45], six emotional dimensions are suggested: pleasure, sadness, anger, fear, disgust, and surprise. Plutchik depicted eight basic emotions (sadness, pleasure, fear, anger, disgust, trust, surprise, and anticipation) using a wheel [48]. The Plutchik framework has strong foundations in psychological studies [10], and unlike some other models [45] in which negative emotions are predominant, this framework establishes a balance between negative and positive emotions.
Several studies investigated the effect of different emotions on the usefulness of reviews. For example, the researchers in [1] examined the relationship between a few negative emotions (anger, fear, sadness) and the usefulness of online reviews. The results of this study showed that fear in a review has a positive effect on its usefulness, and that anger has a stronger negative effect on the perceived usefulness of the review for experience goods than for search goods. As the level of sadness in reviews increases, the perceived usefulness of the review decreases.
It has been shown that negative consumer reviews have a stronger effect on film choice than positive ones [49]. In contrast, positive reviews have a greater impact on film evaluation than negative ones. Also, consumer expectations moderate the effect of the capacity of consumer opinion on film selection and subsequent evaluations. Similarly, in [50], a statistical approach was proposed to establish the relationship of review quality and emotions with review usefulness. The results showed that high-quality positive reviews increase product sales compared to low-quality reviews [4]. In [6], the effect of four positive and four negative emotions on perceived usefulness is investigated, and a binary classification model based on a deep neural network is developed to predict usefulness. In addition, product type features, visibility, readability, and linguistic and sentiment characteristics of reviews are also used to compare and predict the usefulness. Experimental results showed that positive emotional traits perform better. However, negative emotions and visibility are also affected. Also, a combination of features with positive emotional traits provides the best performance for online feedback. Finally, trust, pleasure, and anticipation (positive emotions) and anxiety and sadness (negative emotions) are the most effective emotional dimensions and have a greater effect on perceived usefulness.
Emotional content was assessed with the eight emotional dimensions (pleasure, sadness, anger, fear, trust, hatred, expectation, and surprise) of the Plutchik emotional wheel in [10]. The results of this study were analyzed using a negative binomial model, which showed that anger, hate, and fear in reviews had a positive effect on the usefulness of the reviews, and that pleasure, trust, and sadness had negative effects on the usefulness of the reviews. Anticipation and surprise in the reviews had no effect on the usefulness. In [11], the influence of emotions on the helpfulness of hotel-related online reviews is examined. The results of this study showed that negative online reviews are more useful than positive ones. It has also been shown that quantitative and capacity-based approaches are not sufficient to identify and evaluate the quality and performance of hotel services in information seeking and decision-making processes for consumers. Table 1 summarizes the literature and the major determinants of perceived review usefulness in these studies.

C. SCORE FUSION METHOD USING AN EVIDENTIAL APPROACH
One way of improving the efficiency and accuracy of machine learning systems is to fuse the results of various classifiers [51]. This can be achieved through the use of score fusion algorithms [51]. In the existing review usefulness systems, most of the employed fusion methods have no theoretical basis and are not designed for these systems. To address this problem, we exploit an evidential fusion method based on the Dempster-Shafer (D-S) theory. The D-S method is one of the most prominent score fusion methods and has been exploited in recent years for polarity detection [52], rating prediction [15], multimodal emotion recognition [53], and project risk assessment [54]. To deal with uncertainty in the validity of hypotheses, Dempster and Shafer presented a general form of Bayesian theory in which multiple probabilities (e.g., derived from multiple classifiers' outputs) are used to determine the final output on the basis of evidence from uncertain outputs [51]. Using the D-S theory, pieces of evidence are first extracted from the classifiers' outputs. This evidence is then used as basic knowledge in finding the degree of membership of the input to each class. Based on this evidence, the probability masses of each class are determined. Finally, the masses are fused to determine the final class [51].
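As a point of reference for the fusion step described above, the classical Dempster rule of combination can be sketched in a few lines of Python. This is the original rule, not the improved variant proposed in this paper. The function name `dempster_combine`, the class labels, and the mass values are illustrative assumptions and are not taken from the study.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a focal element (a frozenset of class labels)
    to its mass; the masses of one function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass that falls on the empty set measures the conflict.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("Sources are in total conflict; the rule is undefined.")
    normalizer = 1.0 - conflict
    return {focal: mass / normalizer for focal, mass in combined.items()}


# Hypothetical outputs of two classifiers for one review (made-up numbers).
HELPFUL = frozenset({"helpful"})
UNHELPFUL = frozenset({"unhelpful"})
THETA = HELPFUL | UNHELPFUL  # frame of discernment, carries the uncertainty

m1 = {HELPFUL: 0.6, UNHELPFUL: 0.1, THETA: 0.3}
m2 = {HELPFUL: 0.5, UNHELPFUL: 0.2, THETA: 0.3}

fused = dempster_combine(m1, m2)
decision = max((HELPFUL, UNHELPFUL), key=lambda c: fused.get(c, 0.0))
```

Because the rule is associative, a third classifier can be fused by calling `dempster_combine(fused, m3)`.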
In [51], a D-S theory-based approach for combining multiple classifiers was designed, and the class outputs were modeled by a 2-point triplet structure. The results showed that the combination of the first- and second-best classifiers performs better than the separate classifiers and hybrid classifiers. The triplet structure also performs better than the simplet and quartet structures. In [55], a hybrid method for text classification was designed based on the D-S theory that used a combination of the best outputs. This method modeled the outputs of the classifiers using the quartet, or 3-point, structure. The authors compared the performance of separate classifiers with the hybrid SM (the combination of SVM and KNNM) classifier. The researchers found that the best hybrid classifiers had higher performance than the best independent ones.
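One way to read the triplet structure mentioned above is that a classifier's output is reduced to three focal elements: the top-ranked class, the runner-up, and the whole frame of discernment, which absorbs the remaining mass. The sketch below follows that reading. The helper name `triplet_mass` and the three-class scores are hypothetical, and the exact construction in [51] may differ in detail.

```python
def triplet_mass(scores):
    """Build a triplet mass function from per-class classifier scores.

    `scores` maps class labels to non-negative scores. The two highest
    scoring classes become singleton focal elements and the leftover mass
    is assigned to the whole frame (ignorance).
    """
    if len(scores) < 2:
        raise ValueError("At least two classes are needed for a triplet.")
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("Scores must have a positive sum.")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (second, second_score) = ranked[0], ranked[1]
    frame = frozenset(scores)
    m_best = best_score / total
    m_second = second_score / total
    return {
        frozenset({best}): m_best,
        frozenset({second}): m_second,
        # Guard against tiny negative values caused by rounding.
        frame: max(1.0 - m_best - m_second, 0.0),
    }


# Made-up scores for a hypothetical three-class problem.
example = triplet_mass({"a": 0.5, "b": 0.3, "c": 0.2})
```

With only two classes, as in binary helpfulness prediction, normalized scores leave no mass for the frame, so a discounting step or a different mass assignment would be needed to express uncertainty; that choice is not specified here.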
The original D-S theory is used in [18] to design a method to detect sentiment in online reviews. The researchers showed that the D-S theory-based fusion method performs better than simple fusion methods such as weighted average and sum using both lexicon-based and machine learning-based methods. They improved their method for detecting document-level review polarity in [17]. In this study, the TripAdvisor and CitySearch datasets were used to evaluate the performance of the improved D-S-based fusion method. The researchers found that their proposed method was more accurate than the original D-S fusion method.
In [56], a D-S theory-based approach was designed to combine conflicting evidence with different weighting coefficients and provide a high-performance decision support system that can effectively solve the collision problem. In [57], a method based on the D-S theory was designed that used absolute and relative difference factors for two pieces of evidence. The resources were divided into two categories, collision and non-collision, and the cumulative probabilities in the collision category were combined with those of the non-collision group. The advantages of this method are better management of evidence and improved reliability.
In [58], a new method is proposed to incorporate social media comments with audio-visual contents in video. For the fusion stage, a decision-level fusion method based on the D-S theory of evidence was used. The results showed that the D-S-based fusion system performs much better than the baseline method, which uses only audio-visual content for emotional video retrieval. In a similar study [59], a new lexicon-based method for information fusion based on D-S theory was proposed. This method does not require a human-coded corpus for training and operates much faster than the supervised method. The results showed that inclusion of song lyrics with audio-visual content had no positive effect on the retrieval performance, but utilizing users' comments significantly improved the emotional retrieval system. To address the main drawback of combining multimodal information in emotional video retrieval systems, which is assigning equal weights to modalities, a new D-S method was proposed in [60]. This method gives different weights based on the correlation and the level of confidence. This method has recently been improved for the same task using a hybrid architecture consisting of latent information obtained through canonical correlation analysis (CCA) [61] and Marginal Fisher Analysis (MFA) [53]. As stated in some previous studies, the original D-S theory has some limitations [51]. One of the most influential limitations is the production of contradictory results. To solve this problem, the triplet structure is used to improve the D-S algorithm [51].

III. THE PROPOSED METHOD
The overall view of the proposed model is shown in Figure 1.
In this system, reviews are first preprocessed. Then, using the proposed emotion extraction algorithm, the emotion-related features are extracted by considering the emotion intensity of the words in different emotional categories and the affective shifters. Additionally, other features such as linguistic features, readability, valence, arousal, dominance values, review length, and polarity are also extracted for each review. In the next step, using different machine learning algorithms, different models are developed to predict the usefulness of reviews and the best three models are selected based on the model evaluation criteria. Finally, to improve the learning accuracy, the results of the best classifiers are fused using the D-S fusion algorithm. The proposed method is explained in more detail in the following sections.

A. PROBLEM FORMULATION
Assume R is the initial unprocessed dataset of size n × 3, where n is the number of reviews in R and the columns contain the review text, title, and helpfulness score of the review. The goal is to predict the helpfulness score s ∈ {0, 1} of a given test review r using the proposed classification method trained on R. To this aim, the problem is first reduced to extracting k representative features from the first and second columns of R and then using the proposed method to classify reviews. The feature matrix F of size n × k is constructed by extracting different features, such as emotion vectors, linguistic features, and other derivable features, from the review text and title.

B. FEATURES
The final features extracted from the initial dataset are presented in Table 2. In the following section, each of these features is described in detail.

C. EMOTION VECTOR EXTRACTION FROM EACH REVIEW
The NRC emotion lexicon [13], [14] contains 14,183 words and provides a distinct emotion value for each of the 8 dimensions. In this study, the four emotions of positive surprise, negative surprise, positive expectation, and negative expectation were added to the NRC lexicon. If a word has a surprise emotion, it is considered a positive surprise if it has positive polarity; otherwise it is considered a negative surprise. If the surprise word has no polarity, these two new emotions are assigned zero. Also, if the word does not have a surprise emotion, the two new emotions are assigned zero. The same procedure is applied to the expectation emotion. Ultimately, the upgraded dictionary contains 12 distinct emotions for each word. Also, since many of the words in the NRC lexicon have zero values for all of their separate emotions and hence have no effect on the final feature vector, these words have been removed from the improved dictionary. This changed the size of the improved NRC lexicon to 6469 distinct words. The details of the improved lexicon are listed in Table 3.
To improve the accuracy of computing the emotion of review texts, the NRC Affect Intensity Lexicon was also used in this study [62]. Specifically, for each review, a 12-element vector is extracted, each element of which is equivalent to the amount of emotion in a separate emotional dimension. This vector can be expressed as EV = (Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise, Trust, Positive-Surprise, Negative-Surprise, Positive-Anticipation, Negative-Anticipation). To calculate this vector for each review, after preprocessing, affective words are extracted from the text using Algorithm 1. Then, using Algorithm 2, the emotion vector is extracted for this set of words, which is equivalent to the whole emotion vector of the text being processed.
To correctly extract the sentiment of the text, it is necessary to identify the emotionally effective words as accurately as possible. In this study, the following preprocessing steps were applied to the text of the reviews.
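The lexicon extension described above can be sketched as follows. The sketch assumes each lexicon entry is a dictionary of 0/1 flags keyed by emotion name, with `positive` and `negative` polarity flags stored alongside the eight emotions. The function names and the toy entries are hypothetical and do not reproduce actual NRC values.

```python
BASE_EMOTIONS = [
    "anger", "anticipation", "disgust", "fear",
    "joy", "sadness", "surprise", "trust",
]


def extend_entry(entry):
    """Return a 12-emotion entry derived from an 8-emotion entry plus polarity."""
    extended = {emotion: entry.get(emotion, 0) for emotion in BASE_EMOTIONS}
    is_positive = bool(entry.get("positive", 0))
    is_negative = bool(entry.get("negative", 0))
    for emotion in ("surprise", "anticipation"):
        has_emotion = extended[emotion] == 1
        # Positive polarity takes precedence; a word without polarity gets
        # zero for both derived emotions.
        extended["positive_" + emotion] = int(has_emotion and is_positive)
        extended["negative_" + emotion] = int(
            has_emotion and not is_positive and is_negative
        )
    return extended


def extend_lexicon(lexicon):
    """Extend every entry and drop words whose 12 emotion values are all zero."""
    result = {}
    for word, entry in lexicon.items():
        extended = extend_entry(entry)
        if any(extended.values()):
            result[word] = extended
    return result


# Toy entries for illustration only (not real lexicon data).
toy_lexicon = {
    "shock": {"surprise": 1, "fear": 1, "negative": 1},
    "gift": {"surprise": 1, "joy": 1, "positive": 1},
    "await": {"anticipation": 1},
    "table": {},
}
improved = extend_lexicon(toy_lexicon)
```

Here `table` is removed because it carries no emotion, and `await` keeps its anticipation flag but receives zero for both derived anticipation emotions because it has no polarity.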

1) THE EFFECT OF NEGATION
In any linguistic structure, negation words such as ''not'' may be used to reverse the emotion of words with a positive affective sense [63]. In this study, the spaCy library for Python was used to identify the words that were affected by negation. Then, each such word was replaced with its best antonym using WordNet, and finally the negation word was removed.

2) INFLUENCE OF QUALIFIERS ON TEXT
Qualifiers in English reduce the impact of the word they are directly associated with. For example, in the phrase ''hardly crashes'', the word ''hardly'' reduces the semantic impact of the verb ''crashes'', and if the verb has a negative affect, it is more appropriate to reduce the negative effect by considering the effect of the word ''hardly''. The list of words of this category considered in the current study is as follows:
Qualifiers = ['hardly', 'rarely', 'infrequently', 'seldom', 'sporadically', 'scarcely']
To identify words that are affected by qualifiers in the review text, when examining the negation relation for each word, if the word has an adverb modal relation with another word and it (i.e., the original word) has a POS tag of ''adverb'', it is checked whether it is in the list of qualifiers; if so, the word that is directly affected is identified and added to the list. The relationships and POS tags are also extracted using the spaCy library for Python. Finally, to reduce the effect of emotion in practice, when calculating the emotion, a constant value of -0.2 is added to the emotional value of each affected word.

3) THE EFFECT OF POSITIVE INTENSIFIER IN THE TEXT
In English, there are words that, when they come before another word, increase the semantic effect of the affected word. For example, in the phrase ''very good'', the word ''very'' intensifies the semantic effect of the word ''good''. In this study, the following words are considered in the intensifier list:
Positive intensifiers = ['very', 'extremely', 'absolutely', 'completely', 'greatly', 'too', 'so', 'totally', 'utterly', 'highly', 'rather', 'really', 'exceptionally', 'particularly', 'seriously']
To amplify the emotion value of the words affected by the intensifiers in the review text, an approach similar to the one described for qualifiers is used, except that the words affected by this list are kept in a separate list and, when the emotion value is calculated, a constant value of 0.2 is added to the emotional value of each affected word [63].

4) THE EFFECT OF BOTH POSITIVE INTENSIFIERS AND NEGATIONS
If a word is simultaneously affected by both a negation and a positive intensifier, its emotion value should be reduced rather than increased. For example, to extract the correct emotion value of the phrase ''not very good'', the word ''good'' is, as mentioned before, replaced by its opposite (i.e., ''bad''). Then, the intensifier ''very'' is treated as a qualifier, which changes the emotion value by −0.2.
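The three modifier rules above can be summarized in a few lines of Python. This is only a minimal sketch: the offsets 0.2 and -0.2 come from the text, while the function name, its arguments, and the idea of passing the flags explicitly are assumptions made for illustration.

```python
# Offsets taken from the text; all names below are hypothetical.
INTENSIFY_DELTA = 0.2
DIMINISH_DELTA = -0.2


def adjusted_emotion_value(base_value, intensified=False, diminished=False):
    """Return the emotion value of a word after applying modifier offsets.

    A word that is both negated and intensified ("not very good") is assumed
    to arrive here already replaced by its opposite and flagged as
    diminished, not intensified.
    """
    value = base_value
    if intensified:
        value += INTENSIFY_DELTA
    if diminished:
        value += DIMINISH_DELTA
    return value
```

For ''not very good'', the lookup would be made for ''bad'' with `diminished=True`, so the value of ''bad'' is lowered by 0.2 rather than raised.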
According to Algorithm 1, all relationships between pairs of words in the sentence are first extracted using the Spacy library in Python (line 7). Then, in lines 8-11, if there is a negation relationship between two words w_i and w_j (i.e., the word w_j is affected by the negation word w_i), it is checked whether w_j is also affected by an intensifier w_z; if so, w_j is added to the list of diminished words. Then, according to lines 13-14, w_j is replaced by its WordNet opposite and the word w_i is deleted.
In lines 16-19, if the relationship between the two words is an adverb modal relation and the POS tag of w_i is ADV, it is first checked whether w_i is in the list of intensifiers. If it is, w_j is added to the list of intensified words. If w_i is instead in the list of qualifiers, w_j is added to the list of diminished words (lines 20-21). Finally, in lines 25-29, each word in the sentence s_i is examined, and the words that are not stop words are added to the final word list, which is the output of the algorithm.
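The steps of Algorithm 1 described above can be sketched in Python as follows. The sketch does not call Spacy or WordNet; it assumes the dependency relations have already been extracted as (modifier, label, head, modifier POS) tuples, with labels `neg` and `advmod` in the style of Spacy's English models, and that opposites come from a plain dictionary standing in for WordNet. The function name, the input format, the choice to store the replaced form of a negated word in the diminished list, and the choice not to mark a negated word as intensified are all assumptions of this sketch.

```python
QUALIFIERS = {"hardly", "rarely", "infrequently", "seldom", "sporadically", "scarcely"}
INTENSIFIERS = {
    "very", "extremely", "absolutely", "completely", "greatly", "too", "so",
    "totally", "utterly", "highly", "rather", "really", "exceptionally",
    "particularly", "seriously",
}


def preprocess_sentence(words, relations, antonyms, stop_words):
    """Return (final_words, intensified, diminished) for one sentence.

    words:      tokens of the sentence, in order
    relations:  (modifier, label, head, modifier_pos) tuples (hypothetical format)
    antonyms:   dict mapping a word to its opposite (stand-in for WordNet)
    stop_words: set of words to drop from the final list
    """
    words = list(words)
    intensified, diminished = set(), set()
    negated = {head for _mod, label, head, _pos in relations if label == "neg"}
    boosted = {head for mod, label, head, _pos in relations
               if label == "advmod" and mod in INTENSIFIERS}

    for mod, label, head, pos in relations:
        if label == "neg":
            opposite = antonyms.get(head, head)
            if head in boosted:
                # Negation plus intensifier: treat the word as diminished.
                diminished.add(opposite)
            # Replace the negated word by its opposite and drop the negation word.
            words = [opposite if w == head else w for w in words if w != mod]
        elif label == "advmod" and pos == "ADV":
            if mod in INTENSIFIERS and head not in negated:
                intensified.add(head)
            if mod in QUALIFIERS:
                diminished.add(antonyms.get(head, head) if head in negated else head)

    final_words = [w for w in words if w not in stop_words]
    return final_words, intensified, diminished
```

Words are matched by surface form here, so a word that occurs twice in one sentence is not distinguished; a token-index-based version would be needed for that.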
Algorithm 1 (lines 9-29)
9: If relation is negation relation then
10:   If w_j has adverb modal relation with w_z and w_z is in Intensifiers list then
11:     Add w_j to Diminished words
12:   End if
13:   Replace affected word w_j in s_i with its opposite using WordNet
14:   Delete negation word w_i
15: End if
16: If relation is adverb modal and w_i POS is ADV then
17:   If w_i is in Intensifiers list then
18:     Add w_j to Intensified words
19:   End if
20:   If w_i is in Qualifiers list then
21:     Add w_j to Diminished words
22:   End if
23: End if
24: End for
25: For each w_i in s_i do
26:   If w_i is not stopWord then
27:     Add w_i to finalWords
28:   End if
29: End for

The objective of Algorithm 2 is to extract the emotion vector (EV) for each review u_i. This vector has 12 elements, each corresponding to a separate emotion. Therefore, each element of this vector indicates how much of the corresponding emotion is present in the review. To extract EV_ui, the review is first split into sentences and Algorithm 1 is executed for each sentence. This provides the list of effective words of the sentence, as well as the words that are affected by intensifiers and reducers (lines 1-15). Then, for each word in the list, the vector EV_wi, which holds the values of the individual emotions for the word, is extracted. This is done using the values available for the word in the NRC emotion lexicon or, if such values do not exist, the values for its lemma, together with the intensity values available for the word in the NRC affect lexicon or, if such values do not exist, those for its lemma (lines 16-21). The calculatedEmotionEffectForWord function takes each word and its emotion vector and adds the corresponding emotion intensity values of the different groups to the emotion vector of the word. If neither the word nor its lemma is present in the NRC lexicons, its presence in each of the affect dictionary groups is checked, and its values are added when they are found (lines 22-30).
Algorithm 2 (fragment)
7: For word w_i in s_i final words do
8:   Intensify value = 0;
9:   Diminishing value = 0;
10:   w_i_lemma = lemma of w_i;
11:   If w_i in Diminished words then
12:     Diminishing value = -0.

According to [12], in addition to the emotional dimensions conveyed by words, three semantic dimensions, namely valence, arousal, and dominance (VAD), are also conveyed by the words of a text. The valence (V) dimension (positive versus negative, or pleasant versus unpleasant) measures how favorable a word is. For example, the word ''party'' is more positive than ''funeral''. The arousal (A) dimension (excited versus calm) measures how energetic or sluggish a word feels; it is not a measure of emotion intensity. Sadness and depression can be intense emotions with low arousal, whereas anger and wrath, although unpleasant, have higher arousal than laziness. The dominance (D) dimension (dominant versus submissive) represents the sense of being in control or being controlled. For example, ''battle'' is more dominant than ''delicate''.
Considering these three semantic dimensions along with the dimensions of emotion for the context of each review can lead to a more detailed analysis. The analysis of these three dimensions is very effective in understanding the meaning of the text [12]. As for the usefulness of comments, it is also expected that the more precise the meaning of a review is, the more useful the review will be. Algorithm 3 extracts the VAD value for each text.
To extract the VAD vector of a text, the list of effective words is extracted first; each of these words is then looked up in the VAD dictionary and, if it is present, its VAD vector is extracted (lines 1-10). Next, the VAD vector of each sentence is obtained by averaging over the vectors of its words (lines 11-15). Finally, the VAD vector of the text is obtained by averaging the VAD vectors of its sentences (lines 16-18).
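The two-level averaging described for Algorithm 3 can be illustrated with a short Python sketch. The function names and the dictionary-based lexicon format are hypothetical, and the decision to skip sentences in which no word is found in the lexicon is an assumption of this sketch, since the text does not say how such sentences are handled.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of (V, A, D) tuples."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(3))


def text_vad(sentences, vad_lexicon):
    """Average word vectors per sentence, then sentence vectors per text.

    sentences:   list of lists of effective words (output of preprocessing)
    vad_lexicon: dict mapping a word to its (valence, arousal, dominance) tuple
    Returns None when no word of the text is in the lexicon.
    """
    sentence_vectors = []
    for words in sentences:
        word_vectors = [vad_lexicon[w] for w in words if w in vad_lexicon]
        if word_vectors:  # assumption: sentences without lexicon hits are skipped
            sentence_vectors.append(mean_vector(word_vectors))
    if not sentence_vectors:
        return None
    return mean_vector(sentence_vectors)
```

Averaging per sentence first gives every sentence the same weight in the text-level vector, regardless of how many of its words appear in the lexicon.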
The process of pre-processing and extracting the emotion vector for the text in Example 1 is as follows: Example 1: ''it is not good, beautiful and very cold but very delicious and rarely uses''.

1) PRE-PROCESSING
1. Determine qualifier-affected words: First, it is determined that the word ''uses'' in the sentence is affected by a reducer. As can be seen, the words that are simultaneously negated and intensified can be obtained by intersecting the list of positive-intensifier-affected words with the list of negation-affected words.
Finally, after replacing the negation-affected words with their antonyms and extracting the correct effective words, the final list output by the preprocessing algorithm is obtained.
The emotion vector is not extracted for the word ''uses'' because neither the word nor its lemma is present in the NRC lexicon; therefore, it has no effect even though the word is in the scope of a reducer.
After performing this process for all reviews in the dataset, each sentiment column is normalized using (1).

E. EXTRACTING TITLE EMOTIONS
An appropriate title can encourage readers to read the whole review, which can have a positive effect on the usefulness of the review. Therefore, since titles are available in the original dataset, the emotion conveyed by the review title is used as a feature in this study. To extract this feature, the same emotion extraction algorithm is used, except that the input is the review title instead of the review text. Since a title is usually short and contains only a few words, its emotion vector is sparse (i.e., most of its elements are zero). Therefore, the emotion with the largest value in this vector is chosen as the title's emotion. If all the emotions have a value of zero, ''no sense'' is assigned to the feature.
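The selection rule for the title feature can be written as a small Python function. This is an illustrative sketch: the function name and the list-based representation of the emotion vector are assumptions, and ties are resolved in favor of the first emotion in the list, which the text does not specify.

```python
def title_emotion(emotion_vector, emotion_names):
    """Return the name of the strongest emotion in a title, or ''no sense''.

    emotion_vector: list of non-negative emotion values for the title
    emotion_names:  list of emotion labels in the same order
    """
    if all(value == 0 for value in emotion_vector):
        return "no sense"
    best = max(range(len(emotion_vector)), key=lambda i: emotion_vector[i])
    return emotion_names[best]
```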

F. D-S FUSION METHOD FOR PREDICTING THE USEFULNESS OF REVIEWS
Predicting the usefulness of reviews using the D-S score fusion method has the following steps.
• Definition of evidence: At this stage, according to the output of different classification algorithms, evidence is extracted. This is used as basic knowledge in finding the probability of belonging to each class for each review.
• Definition of mass function: According to (2), a mass function is a basic probability assignment over all subsets A of θ [15], [57]. In predicting the usefulness of comments, θ is the set of the five classes to which a review may belong.
• Score fusion: The classifiers output L pieces of evidence, which are then fused using (3) and (4) [51], [57].
Here, k_ij is the conflict factor, a number between 0 and 1 that shows the degree of inconsistency between evidence i and j.
k_ij = 0 means that evidence i and j have no conflict, whereas 0 < k_ij < 1 and k_ij = 1 indicate that the two pieces of evidence conflict partially and completely, respectively, in supporting a review.
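To make the fusion step concrete, the following Python sketch applies the standard form of Dempster's rule to classifier outputs. It assumes that each piece of evidence assigns mass only to the five single classes (a probability vector), in which case the conflict factor is the total mass placed on disagreeing class pairs. The function names are hypothetical, and the sketch is not claimed to reproduce equations (3) and (4) of the paper term by term.

```python
from functools import reduce


def conflict_factor(m1, m2):
    """Mass assigned to disagreeing class pairs: 1 - sum_a m1(a) * m2(a)."""
    return 1.0 - sum(p * q for p, q in zip(m1, m2))


def dempster_combine(m1, m2):
    """Combine two singleton-only mass vectors with Dempster's rule."""
    agreement = sum(p * q for p, q in zip(m1, m2))
    if agreement == 0:
        raise ValueError("evidence is in total conflict and cannot be combined")
    # Dividing by the agreement is the same as dividing by 1 - k.
    return [p * q / agreement for p, q in zip(m1, m2)]


def fuse(evidence):
    """Fuse a list of L mass vectors pairwise, from left to right."""
    return reduce(dempster_combine, evidence)
```

With singleton-only evidence, classes on which the classifiers agree gain mass after fusion, while classes with low support from every classifier lose mass.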
As an example of how the D-S fusion method is applied to review helpfulness prediction, consider the following review.
''Color Confidence is subtitled ''the digital photographer's guide to color management,'' and is a good overview of the subject. If you want to buy only one book, then Colour Confidence is a good choice. If you want lots of detail, then you're better off buying three separate books -''
The probabilities produced by the three classifiers based on the evidence are shown in Table 5. There are five classes in the table, and the objective is to use the D-S combination rule to obtain the probability that the text belongs to each class. The conflict factor is calculated according to (4). After fusion, the probability of belonging to classes 2 and 4 is greater than that of the other three classes. This is reasonable given the initial probabilities for classes 2 and 4. The likelihood of scores 1, 3, and 5 decreased after the fusion.
G. TRIPLET STRUCTURE [51]
The D-S fusion rule may produce contradictory results when the evidence is contradictory. In the review helpfulness prediction problem, contradictory evidence may occur when one classifier predicts the helpfulness score of a review to be close to zero (i.e., one star) while the other classifiers predict it to be close to one (i.e., five stars). To address this problem and to increase the accuracy of review usefulness prediction, the triplet structure is used in this study to fuse the evidence. This structure employs the second-best decision in combining the classifiers. The improved D-S method using the triplet structure is given below [51]:
Definition: If {θ} and {β} are focal elements, C is the frame of discernment, and m is the mass function, then an expression of the form Y = <{θ}, {β}, C> is a triplet. The mass function m is called a triplet mass function [51].
Definition: If C is the frame of discernment and |n| ≥ 2, then ϕ(d) is partitioned by the rule $m^{\sigma}$ according to the following equation:
$$\{\beta\} = \arg\max\, m(\{a\} \mid a \in \{a_1, \ldots, a_n\} - \{\theta\}) \qquad (8)$$
In particular, $m^{\sigma}$ is a triplet mass function and is also known as a two-point mass function [51]. The resulting expression is written in a shortened form for the sake of simplicity.
To improve the equation for combining two triplet mass functions, the relationship between the two pairs of singletons in both triplets must be considered. For example, for two triplets with the corresponding triplet mass functions $m_1$ and $m_2$:
$$(m_1 \oplus m_2)(C) = k\, m_1(C)\, m_2(C) \qquad (21)$$

3-Quite different focal points
A general formula for computing the combination of two triplet mass functions is:
A practical example of applying the improved D-S fusion method with the triplet structure to the problem of predicting review usefulness is shown in Table 4. For the following review, the probabilities with which the three classifiers predict each rating are shown in the table.
''There is nothing majorly wrong with this game. The plot is well-developed, the characters are customizable, and the battles are strategic. This is probably primarily subjective, but I just didn't enjoy this game. It's not because there were too many movies-it's because I didn't like the movie. I also didn't like the characters or the villains or for that matter the aesthetics. There were some minor but annoying flaws in the game which further contributed to my displeasure. Which button to press was often counterintuitive, and so I often found myself pressing the wrong button. Also, the game badly needs a journal and/or a destination guide so you know where to go-I once spent one hour doing nothing other than walking around a space ship trying to figure out where to go''
According to Table 6, the first- and second-ranked classes in S_1 are the fifth and fourth classes, respectively. Next, the fused result of S_1 and S_2 is fused with S_3. The first- and second-ranked classes (from the output of the previous step) are again the fifth and fourth classes, respectively. So, in the final result, Index_Final, Class 5 is more likely than the other classes.
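The triplet idea used in this example can be sketched in Python under stated assumptions. Each classifier output is reduced to its best class, its second-best class, and the whole frame C, with the mass left over from the two singletons assigned to C; the triplets are then combined with the general form of Dempster's rule over set-valued focal elements. The function names and the input probabilities are invented for illustration, and the sketch does not reproduce the specialized case-by-case formulas of [51]; in particular, the combined result is not reduced back to a triplet.

```python
def to_triplet(probs):
    """Build a triplet mass function from a classifier's class probabilities.

    Keeps the best and second-best classes as singletons; the remaining mass
    is assigned to the whole frame C (an assumption of this sketch).
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    first, second = ranked[0], ranked[1]
    frame = frozenset(range(len(probs)))
    rest = max(0.0, 1.0 - probs[first] - probs[second])
    return {frozenset([first]): probs[first],
            frozenset([second]): probs[second],
            frame: rest}


def combine(m1, m2):
    """Dempster's rule for mass functions whose focal elements are frozensets."""
    raw, conflict = {}, 0.0
    for a, pa in m1.items():
        for b, pb in m2.items():
            common = a & b
            if common:
                raw[common] = raw.get(common, 0.0) + pa * pb
            else:
                conflict += pa * pb
    if conflict >= 1.0:
        raise ValueError("evidence is in total conflict and cannot be combined")
    return {s: v / (1.0 - conflict) for s, v in raw.items()}


def best_class(mass):
    """Return the 1-based class label with the largest singleton mass."""
    singletons = {next(iter(s)): v for s, v in mass.items() if len(s) == 1}
    return max(singletons, key=singletons.get) + 1
```

Because part of each classifier's mass stays on the whole frame C, a single confident but dissenting classifier cannot drive the fused mass of another class to zero, which is the behavior the triplet structure is meant to provide.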

H. DESCRIPTION OF THE DATASET
In this study, two datasets, video_games and books, were extracted from amazon.com [64]. For each review, the rating (on a 5-point scale from 1 to 5 stars), the review content, and the review title were collected. In total, the first dataset contains 20,000 reviews of books and related products, and the second dataset contains 20,000 reviews of video games. The datasets are described in Figures 2 and 3.

I. MODELING AND EVALUATION METHODS
In this research, different machine learning methods, including decision tree, SVM, random forest, bagging, naïve Bayes, J48, and AdaBoost, were used to construct the classification models. These models were developed in Python using the sklearn library [65]. In the evaluations, 10-fold cross-validation is used to guard against overfitting. The evaluation criteria for the models are precision, accuracy, F1-score, recall, and Mean Squared Error (MSE) [15].
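For reference, the evaluation criteria listed above can be computed as in the following plain-Python sketch. The function name is hypothetical, and macro-averaging of precision, recall, and F1 over the five classes is an assumption, since the averaging scheme used in the study is not stated here.

```python
def evaluate(y_true, y_pred, labels=(1, 2, 3, 4, 5)):
    """Return accuracy, macro precision/recall/F1, and MSE for rating labels."""
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    accuracy = sum(t == p for t, p in pairs) / n
    mse = sum((t - p) ** 2 for t, p in pairs) / n

    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    k = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,  # macro average (assumption)
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
        "mse": mse,
    }
```

MSE treats the star ratings as numbers, so predicting 4 stars for a 5-star review is penalized less than predicting 1 star, which accuracy alone does not capture.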

A. MODELS' PERFORMANCE
The performance of review rating prediction (from 1 to 5 stars) on the two datasets using different machine learning methods is shown in Figure 4. These results are compared, and the best algorithms are selected for later use by the improved D-S fusion method. All the features were used in this experiment.
As can be seen, random forest, AdaBoost, and bagging performed best on both datasets. On dataset 1, the accuracies of random forest, AdaBoost, and bagging are 0.43, 0.42, and 0.41, respectively. On dataset 2, they are 0.47, 0.5, and 0.48, respectively.
In subsequent evaluations, in order to compare the fusion algorithms with the separate classifiers, the reviews are considered in two different scenarios: five classes and two classes. Based on the features used, four classification models were also created: Case 1, Case 2, Case 3, and Case 4. In Case 1, only text-related features are used; in Case 2, VAD features are added to the text-related features; and in Case 3, emotion-related features are added to the text-related features. In Case 4, all features are employed. Text-related features are thus included in all four cases.

B. CLASSIFICATION OF 5 CLASSES
A comparison of the results of the 5-class classification using all features (Case 4) by the machine learning algorithms, the original D-S method, and the improved D-S method on the two datasets is shown in Figures 5 and 6.
According to Figures 5 and 6, the improved D-S algorithm is superior to the original D-S fusion algorithm and the machine learning algorithms in ranking the reviews. As shown in the figures, the accuracy, F1-measure, recall, and precision on dataset 1 using the improved D-S fusion method are 0.66, 0.63, 0.67, and 0.61, respectively. On dataset 2, these criteria are 0.72, 0.71, 0.73, and 0.69, respectively. Also, the MSE criterion on dataset 1 is 0.58, which was improved on dataset 2 by 0.32 using the improved D-S method. Thus, in response to Question 2, it can be said that improving the D-S algorithm with the triplet structure has improved the fusion system and increased the accuracy of the review usefulness prediction system. Hence, the other three comparison models were evaluated only with the improved D-S fusion method.
The test results for Case 1, Case 2, Case 3, and Case 4 are shown in Figures 7 and 8 for datasets 1 and 2, respectively. As shown in Figures 7 and 8, for the first dataset, the accuracy and F1-measure were 0.53 and 0.52, respectively, using text-related features (Case 1). In Case 2, these criteria are 0.58 and 0.55, respectively, and in Case 3, using textual and emotion-related features, they are 0.61 and 0.60. In Case 4, they are 0.66 and 0.63, respectively. Case 4, where all the features were used, obtained the best results. Similarly, for the second dataset, in Case 1, using text-related features, the accuracy and F1-measure were 0.60 and 0.56, respectively. In Case 2, these criteria were 0.66 and 0.64, and in Case 3, they were 0.69 and 0.66, respectively. Again, the best result was obtained in Case 4, where the accuracy and F1-measure were 0.72 and 0.71, respectively (the highest scores). Thus, in answer to Question 1, it can be said that considering the three dimensions of valence, arousal, and dominance (VAD) along with contextual and emotion-related features helps in determining the usefulness of a review and significantly improves classification accuracy.
VOLUME 8, 2020 F. Fouladfar et al.: Predicting the Helpfulness Score of Product Reviews

C. BINARY CLASSIFICATION
Since the outputs of this study are 5 classes, classes 1, 2, and 3 are considered non-useful and are denoted by 0, while classes 4 and 5 are interpreted as useful and are denoted by 1. Figure 9 shows the results of binary classification based on all features on the two datasets using the original and improved D-S fusion algorithms and the separate classifiers.
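The mapping from the five classes to the two binary labels is simple enough to state as code. The function name below is hypothetical.

```python
def to_binary(rating):
    """Map a 1-5 class to 0 (non-useful: 1, 2, 3) or 1 (useful: 4, 5)."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError("rating must be an integer from 1 to 5")
    return 1 if rating >= 4 else 0
```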
From the graph, it is clear that the improved D-S fusion algorithm has the best performance in predicting review usefulness on the two datasets and has achieved effective results in improving the fusion system.
As can be seen, the accuracy, F-measure, recall, and precision on dataset 1 using the improved D-S method are 0.89, 0.88, 0.88, and 0.88, respectively. On dataset 2, these values are 0.94, 0.93, 0.92, and 0.92, respectively. Figure 10 illustrates the MSE criterion obtained using the original and improved D-S fusion methods and the machine learning algorithms for binary classification. The MSE criterion for the books and video_games datasets using the improved D-S method decreased by 15% and 11%, respectively, compared to the original D-S method. The results of the experiments for Case 1, Case 2, Case 3, and Case 4 using the improved D-S fusion algorithm on dataset 1 and dataset 2 are shown in Figures 11 and 12.
According to Figures 11 and 12, for the first dataset, the accuracy was 0.80 in Case 1 using the features associated with the review text. Accuracy in Case 2 is 0.83, and in Case 3 it increased to 0.86 using contextual and emotion-related features. The best model is Case 4, where the contextual, VAD, and emotion-related features were used and an accuracy of 0.89 was obtained. Similarly, for the second dataset, the accuracy was 0.85 in Case 1, 0.89 in Case 2, and 0.90 in Case 3. The best result was obtained in Case 4, where the accuracy is 0.94. Table 7 shows the feature-wise analysis of 2-class and 5-class classification on the two datasets. In this analysis, the performance criteria of the improved Dempster-Shafer model are computed for each feature or group of features in turn, in order to determine which features or feature combinations are decisive.
As shown in Table 7, the results of the analysis are similar for the 2-class and 5-class classification on the two datasets, and it can be seen that combining the features associated with emotions, the VAD features, and the text-related features gives a better helpfulness recognition ability.

V. DISCUSSION
This paper presents a model for predicting review usefulness. To determine review usefulness, emotion-related features such as the title's emotion and 12 distinct emotions are used together with other features such as linguistic features, context-related attributes, valence, arousal, and dominance (VAD) values for each review, review length, and opinion polarity. After extracting the required features, review usefulness is predicted from these features using the improved D-S fusion algorithms and the separate classifiers.
We created four classification models to show the effectiveness of different types of features: Case 1, Case 2, Case 3, and Case 4, which differ in the set of features included, as follows. Case 1: classification using text-related features without VAD values and emotion-related features. Case 2: classification with text-related features and the three VAD dimensions. Case 3: classification with text-related features and emotion-related features. Case 4: classification using all features.
The results of the 5-class classification show that the original and improved D-S fusion algorithm and machine learning algorithms have achieved effective results in improving the review helpfulness prediction system. The best result is obtained in case 4, using all features, where on the books and video_games dataset the improved D-S algorithm obtained 15% and 9% higher accuracy than the original D-S algorithm, respectively. The MSE criterion for books and video_games datasets also decreased by 14% and 20%, respectively, compared to the original D-S method.
The best results for 2-class review helpfulness problem were obtained using the improved D-S algorithm with all the features. The MSE criterion for the books and video_games dataset decreased by 15% and 11%, respectively, compared to the baseline method. The accuracy of the classification of books and video_games datasets using the improved D-S algorithm is 14% and 11% higher than the baseline, respectively. Therefore, it can be concluded that these improved results are obtained by exploiting the improved D-S fusion algorithm.
In Table 8, a comparison of the 2-class and 5-class classification results on the two datasets shows that, on average, 2-class classification outperforms 5-class classification on both datasets. This may be because, in the 2-class problem, the sensitivity to class membership is reduced: with only two classes, a review is more likely to fall into the predicted class. Table 9 summarizes and compares the two-class classification results.
In Table 9, the effect of the proposed features on the two datasets is shown. For the first dataset, the accuracy was 0.8 using text-related features. The accuracy increased to 0.83 in Case 2 using textual and VAD features and to 0.86 in Case 3 using textual and emotion-related features. The best feature set for the model is Case 4, which used all the features and resulted in an accuracy of 0.89. Similarly, for the second dataset, the accuracy was 0.85 in Case 1, which increased to 0.89 in Case 2 and 0.9 in Case 3. The best result on this dataset was again obtained in Case 4, where the accuracy reached 0.94.
The results show that considering the three semantic dimensions of valence, arousal, and dominance (VAD) along with the emotion dimensions and context-related features improves the accuracy of predicting review usefulness scores.
The results were compared with four previous works (Table 6). Ren and Hong [1] used text-related features as well as emotion-related features to predict the usefulness of online consumer opinions, used regression classifiers to classify comments into two categories, and reported accuracies of 0.60 and 0.63 on the books and video_games datasets from Amazon. Zhang and Tran worked on text-related features of digital camera reviews from the Amazon site and achieved an accuracy of 0.76 [66]. Ghose and Ipeirotis [33] obtained accuracies of 0.78 and 0.87 on the DVD, audio, and video and digital camera datasets from the Amazon site. Similarly, Krishnamoorthy [35] used linguistic features along with metadata features to predict the usefulness of online consumer opinions and obtained accuracies of 0.77 and 0.87 on the Amazon dataset and the dataset of Blitzer et al. [67]. The results show that for both datasets, our method performs better in classifying comments into two categories. Thus, it can be said that improving the D-S algorithm with the triplet structure has improved the fusion system and increased the accuracy of the review usefulness prediction system. This structure employs the second-best decision in combining the classifiers. The benefits of this method are that it not only provides valuable information that is ignored in class labels but also partially avoids the deterioration of performance created by a single prominent class that produces high confidence values.

VI. CONCLUSION
In this study, a model was presented to identify the usefulness of online reviews using 12 distinct emotions, the valence, arousal, and dominance (VAD) vector of each review, and other context-related features such as linguistic features, length, and review polarity. Of the 12 emotions, 8 are from the NRC lexicon and 4 are proposed and added in this study: positive surprise, negative surprise, positive expectation, and negative expectation.
In this study, an algorithm was proposed to extract distinct emotions from the text that also takes into account the emotion intensity of words in different emotion groups. An algorithm for extracting VAD values for each text was also presented. Then, using different machine learning algorithms and the original and improved D-S algorithms, different models were developed to predict review helpfulness. Two datasets were used in this study, and precision, accuracy, F-score, recall, and Mean Squared Error (MSE) were used to evaluate the results.
According to the results of the five-class and two-class classification, the improved D-S algorithm with triplet structure outperforms the original D-S method and machine learning algorithms. It also improves the accuracy of predicting the usefulness of reviews by combining emotions-related and text-related features.
The overall results for the 2-class scenario are higher than those for the 5-class problem.
Finally, based on the results of the 5-class and 2-class classification, the advantages of the proposed approach can be stated as follows:
• Confirming the effectiveness of using a word emotion intensity lexicon for different emotion groups in identifying emotions.
• Confirming the effectiveness of using a VAD lexicon to extract the VAD vector of each text.
• Confirming the effectiveness of improved algorithms that take into account emotion modifiers and the emotion intensity in different emotion groups when identifying emotions.
• Confirming the effectiveness of emotion-related and VAD features in determining review usefulness.
• Increasing the precision of the review usefulness prediction system by improving the basic Dempster-Shafer score fusion algorithm.
In future work, we plan to introduce new emotional features and investigate their effect on review usefulness prediction. Applying deep neural networks to improve the emotion recognition system will also be investigated. Another direction is to identify review usefulness separately for different types of reviews and products, in order to examine more precisely how much effect different features have on various reviews and products. In addition, applying hybrid evolutionary algorithms to increase the accuracy of the review usefulness prediction system is proposed for future research. Further research could also apply the proposed architecture to other languages and use the proposed variables in other domains such as sentiment analysis, text summarization, and recommendation systems.
MOHAMMAD NADERI DEHKORDI received the bachelor's and master's degrees in computer engineering, in 1999 and 2001, respectively, and the Ph.D. degree in computer engineering from the Science and Research Branch, Islamic Azad University, Tehran, Iran, in 2009. His Ph.D. Thesis focused on privacy-preserving data mining. He is currently an Assistant Professor and the Dean of the Faculty. He has published over 60 articles in the journal and refereed conference proceedings. He is the author of two research books in mobile database and privacy preserving data mining. His research interests include data mining and knowledge discovery, privacy-preserving data mining/publishing, big data analytics (in the context of scalable, distributed, and mobile platforms), design and configuration of meta-heuristic algorithms in optimization problems, and data analytics, such as novel data mining algorithms, distributed/mobile databases, and social network analysis. He is a Reviewer of some top-ranked journals.
MOHAMMAD EHSAN BASIRI (Member, IEEE) received the B.S. degree in software engineering from Shiraz University, Shiraz, Iran, in 2006, and the M.S. and Ph.D. degrees in artificial intelligence from the University of Isfahan, Isfahan, Iran, in 2009 and 2014, respectively.
Since 2014, he has been an Assistant Professor with the Computer Engineering Department, Shahrekord University, Shahrekord, Iran. He is the author of three books and more than 35 articles. His current research interests include sentiment analysis, natural language processing, machine learning, and data mining.