Aggregating Customer Review Attributes for Online Reputation Generation

In this paper, we address the problem of generating reputation for movies, products, hotels, restaurants and services by mining customer reviews expressed in natural language. To the best of our knowledge, previous studies on reputation generation for online entities have primarily examined the semantic and sentiment orientation of customer reviews, disregarding other useful information that could be extracted from reviews, such as review helpfulness and review time. Therefore, we propose a new approach that combines review helpfulness, review time, review attached rating and review sentiment orientation for the purpose of generating a single reputation value toward various entities. The contribution of the paper is threefold. First, we design two equations to compute review helpfulness and review time scores, and we fine-tune a Bidirectional Encoder Representations from Transformers (BERT) model to predict the review sentiment orientation probability. Second, we design a formula to assign a numerical score to each review, and we propose a new formula to compute the reputation value toward the target entity (movie, product, hotel, restaurant, service, etc.). Finally, we propose a new form to visualize reputation that depicts the numerical reputation value, opinion categories, top positive review and top negative review. Experimental results on several real-world data sets from miscellaneous domains, collected from the IMDb, TripAdvisor and Amazon websites, show the effectiveness of the proposed method in generating and visualizing reputation compared to three state-of-the-art reputation systems.


I. INTRODUCTION
The exponential growth of Web 2.0 has dramatically impacted the evolution of e-commerce platforms [1]- [4]. Recent online shopping statistics showed that the number of users of some well-known e-commerce websites such as Jingdong, Alibaba and Amazon has exceeded 1 billion [5]. As a result, the number of customer reviews attached to a product can easily reach the thousands [2], [6], [7]. While a large number of reviews can give a hint about the quality of an item, a potential customer may not have the time or patience to read all of them before making a decision [8]. Thus, the right tools and technologies to help in such a task become a necessity for the buyer as well as for the seller.
The associate editor coordinating the review of this manuscript and approving it for publication was Biju Issac.
Currently, little work has been performed to support customer decision making in E-commerce using natural language processing techniques. We identify two principal techniques. The first is feature-based summarization, which identifies the features of the target entity (product, movie, hotel, restaurant, service) and the polarity (positive/negative) of the corresponding opinions, and then generates a feature-based summary of the reviews [3], [4], [7]. The second is reputation generation, whose main focus is to estimate the esteem in which an entity is held by mining customer reviews expressed in natural language [5], [9]- [11].
Previous studies on reputation generation have primarily focused on using semantic and sentiment analysis [5], [9]- [11], disregarding other useful information that could be extracted from user reviews, such as ''review helpfulness, which implies that reviews that receive higher votes from other users typically provide more information'', and ''review time, which implies that more recent reviews generally provide users with more up-to-date information ''.
An accurate and reliable reputation system should exploit more features of online reviews, such as review attached rating, review helpfulness, review time and review sentiment orientation. For that reason, we propose a reputation system that incorporates all these attributes during the process of generating and visualizing reputation for various entities (movies, products, hotels, restaurants and services). This study therefore addresses the following research question: by considering review helpfulness, review time, review sentiment orientation probability and review attached rating, can the proposed reputation system offer better results in terms of reputation generation and visualization than previous reputation systems, which consider only semantic and sentiment relations?
The contributions of this work are summarized as follows:
• Firstly, we propose a novel system that incorporates review time, review helpfulness, review sentiment orientation and review attached rating for the purpose of generating a numerical reputation value toward various entities (movies, products, hotels, restaurants, services, etc.).
• Secondly, we propose a new holistic form to visualize reputation by showing the numerical reputation value, opinion categories, top positive review and top negative review in order to support customers during their decision making process in E-commerce (buying, renting, booking).
The article is organized as follows: Section 2 gives a literature review of related work on document level sentiment analysis and natural language processing techniques for decision making in E-commerce. Section 3 presents the problem statement. In section 4, we elaborate our reputation system. The conducted experiments and discussion are presented in section 5. In section 6, the conclusion of this work is provided.

II. LITERATURE REVIEW
This section describes and examines previous research work done in the area of natural language processing (NLP) techniques for decision making in E-commerce and document level sentiment analysis.

A. NLP TECHNIQUES FOR DECISION MAKING IN E-COMMERCE
The BusinessDictionary defines decision making as: ''The thought process of selecting a logical choice from the available options''. During the last twenty years, few approaches have been proposed to help potential customers make decisions on E-commerce websites, using mainly two NLP techniques: feature-based summarization of customer reviews and reputation generation. Hu and Liu (2004) [7] were the first to design and build a system that produces a feature-based summary from customer reviews. The proposed system performs three tasks: (1) association rule mining [12] is used to extract product features from customer reviews, (2) WordNet [13] is utilized to predict the semantic orientations of opinion words, (3) a feature-based summary is produced. Over the last two decades, few systems have been proposed to perform feature-based summarization. The summarizers are applied to various domains: product reviews [7], [14]- [17], movie reviews [4], local services reviews [18] and hotel reviews [2], [19], etc.
We now return to reputation generation. The pioneering work on reputation generation based on mining opinions expressed in natural language was proposed by Yan et al. (2017) [5]. In that work, reviews are fused into different opinion sets based on their semantic relations; then, a single reputation value is generated by aggregating the statistics of the fused and grouped opinions (sum of similarities, sum of ratings, number of reviews). In [9], the authors applied the K-means clustering algorithm to group similar reviews into the same cluster using Latent Semantic Analysis (LSA) before producing a reputation value using the statistics of each cluster. However, both approaches rely on extracting semantic relations between reviews and disregard the fact that the majority of online customer and user reviews are opinionated. Benlahbib and Nfaoui (2019) [10] proposed a fourfold approach to improve [5]. First, Naïve Bayes and Linear Support Vector Machines classifiers were applied to separate reviews into positives and negatives by predicting their sentiment polarity. Second, positive and negative reviews were fused into different sets based on their semantic similarity (Latent Semantic Analysis and cosine similarity). Third, a custom reputation value is computed separately for both positive opinion sets and negative opinion sets. Finally, a single reputation value is calculated using the weighted arithmetic mean.
Since all of the above-mentioned reputation systems exploit only semantic and sentiment features, we propose a new reputation system that incorporates more features for the purpose of generating an accurate and reliable reputation value toward various entities.

B. DOCUMENT LEVEL SENTIMENT ANALYSIS
Ahlgren (2016) [20] defines sentiment analysis as: ''the process of identifying and detecting subjective information using natural language processing, text analysis, and computational linguistics''. Generally, sentiment analysis can be divided into three levels: sentence level opinion mining, document level opinion mining and fine-grained opinion mining. Since we have applied document level sentiment analysis to extract the sentiment orientation of customer and user reviews, this section will mainly focus on previous research work done in the area of document level opinion mining.
According to [21], document level opinion mining is: ''a task of extracting the overall sentiment polarities of given documents, such as movie reviews, product reviews, tweets and blogs.''.
Many approaches have been used to handle the task of document level sentiment analysis:
• Supervised approaches: These approaches require an annotated corpus to train machine learning models. The first work on supervised document level opinion mining was proposed by Pang et al. (2002) [22]. Three machine learning classifiers (Support Vector Machines (SVMs) [23], Naïve Bayes classifier [24] and Maximum Entropy classifier [25]) were trained with movie reviews labeled by sentiment (positive/negative). The authors trained the three models on various kinds of features (unigrams, bigrams, parts of speech and position) and found that the sentiment classification task performs well when adopting unigrams as features. Kennedy and Inkpen (2006) [26] trained Support Vector Machine classifiers on unigrams and bigrams by incorporating three types of context valence shifters: ''intensifiers'', ''negations'' and ''diminishers''. The trained model achieved an accuracy of 0.859 on movie review data [27]. Koppel & Schler (2006) [28] defined the sentiment classification task as a three-category problem (positive, negative and neutral) and used different learning algorithms: SVM, J48 Decision Tree [29], Naïve Bayes, Linear Regression [30] [36], Logistic Regression and AdaBoost [37], [38]. They conducted experiments on the Amazon reviews dataset [39] and found that the Logistic Regression classifier outperforms the other classifiers in predicting the sentiment polarity of product reviews.
• Unsupervised approaches: They attempt to determine the sentiment orientation of a text by applying a set of rules and heuristics obtained from language knowledge. Turney (2002) [40] was the first to propose an unsupervised sentiment analysis technique to classify reviews as ''recommended'' or ''not recommended''. The semantic orientation of a phrase is computed as the pointwise mutual information (PMI) [41] between the given phrase and the word ''excellent'' minus the pointwise mutual information between the given phrase and the word ''poor''. The proposed algorithm achieved an accuracy of 84% for automobile reviews, 80% for bank reviews, 71% for travel destination reviews and 66% for movie reviews. In [42], the authors proposed a lexicon-based method for opinion mining that uses a dictionary of sentiment words whose semantic orientations vary between −5 and +5. The authors also incorporated amplifiers, downtoners and negation words to compute a sentiment score for each document. Vashishtha and Susan (2020) [43] proposed a fuzzy rule-based approach to perform opinion mining of tweets. The authors use a novel unsupervised system based on nine fuzzy rules to predict the sentiment orientation of the post (positive, negative or neutral). In [44], Fernández-Gavilanes et al. (2016) proposed a sentiment analysis approach to predict the polarity of online textual messages such as tweets and reviews using an unsupervised dependency parsing-based text classification method.
• Deep learning approaches: Over the past few years, deep learning models have greatly improved the state-of-the-art of opinion mining. Moraes et al. (2013) [45] made a comparative study between Support Vector Machines (SVM) and Artificial Neural Networks (ANN) for document-level opinion mining and found that ANN results are at least comparable or superior to SVMs.
To overcome the weakness of the bag-of-words (BoW) model, the authors of [46] proposed an unsupervised algorithm named paragraph vector (doc2vec), an extension of the word2vec approach [47]. The proposed algorithm learns vector representations for variable-length texts such as sentences, paragraphs, and documents. Experimental results show that the doc2vec algorithm achieved new state-of-the-art results on several sentiment analysis tasks. Johnson and Zhang (2015) [48] trained a parallel Convolutional Neural Network (CNN) [49] without using pre-trained word vectors (word2vec, doc2vec and GloVe [50]). Instead, convolutions are directly applied to one-hot encoding vectors to leave the network solely with information about the word order. The proposed approach achieved an accuracy rate of 92.33% on the Large Movie Review Dataset, outperforming both SVM [51] and NB-LM [52]. Baktha and Tripathy (2017) [53] investigated the performance of Long Short-Term Memory (LSTM) [54], vanilla RNNs and Gated Recurrent Units (GRU) on the Amazon health product reviews dataset and the sentiment analysis benchmark datasets SST-1 and SST-2. The results show that GRU achieved the highest sentiment classification accuracy. In [55], the authors combined unsupervised data augmentation (UDA) with Bidirectional Encoder Representations from Transformers (BERT) [ [59] proposed ALBERT: ''A Lite BERT for Self-supervised Learning of Language Representations''. The paper describes parameter reduction techniques to lower memory consumption and increase the training speed and accuracy of BERT models. In [60], the authors introduced a novel ''Text-to-Text Transfer Transformer'' (T5) neural network model pre-trained on a large text corpus, which can convert any language problem into a text-to-text format. The T5 model achieved state-of-the-art results on the SST-2 binary classification dataset with an accuracy of 97.4%. Recently, Clark et al. (2020) [61] presented ELECTRA, which uses a new pre-training task called replaced token detection (RTD). The experimental results showed that RTD is more efficient than masked language modeling (MLM) pre-training as used in models such as BERT.

III. PRELIMINARIES
This section covers the necessary background for understanding the remainder of the paper, including the problem definition and the BERT model, which is fine-tuned to determine the sentiment orientation of customer and user reviews in our proposed system.
Large Movie Review Dataset: https://ai.stanford.edu/~amaas/data/sentiment/

A. PROBLEM DEFINITION
In this paper, we address the problem of generating reputation for movies, products, hotels, restaurants and services by aggregating review time, review helpfulness votes, review sentiment orientation and review attached rating. Given a set of reviews $R_j = \{r_{1j}, r_{2j}, \ldots, r_{nj}\}$ expressed for an entity $E_j$, their attached ratings $v_{1j}, v_{2j}, \ldots, v_{nj}$ where $v_{ij} \in [1, 5]$ or $v_{ij} \in [1, 10]$ depending on the rating system, the set of their attached helpfulness votes $RH_j = \{rh_{1j}, rh_{2j}, \ldots, rh_{nj}\}$ where $rh_{ij} \in \mathbb{N}^*$, the set of their posting times $RT_j = \{rt_{1j}, rt_{2j}, \ldots, rt_{nj}\}$ and the set of their sentiment orientation probabilities predicted by fine-tuned BERT Base, $BERT_j = \{bert(r_{1j}), bert(r_{2j}), \ldots, bert(r_{nj})\}$, the goal is to compute a review score for each review, $RS_j = \{rs_{1j}, rs_{2j}, \ldots, rs_{nj}\}$, based on its helpfulness votes, its posting time and its sentiment orientation, and finally to compute a reputation value $Rep$ for the entity $j$ by averaging the product of the review scores and the review attached ratings. Table 1 presents the descriptions of the notations used in the rest of this paper.

IV. PROPOSED APPROACH

A. SYSTEM OVERVIEW
Our approach consists mainly of four steps:
• Firstly, we collect real data from websites that specialize in gathering customer reviews, such as IMDb, TripAdvisor and Amazon, using web scraping tools; then, we preprocess them.
• Secondly, we assign three numerical scores to each review: helpfulness score, time score and sentiment orientation score.
• Thirdly, we compute a review score based on the precomputed scores (helpfulness score, time score and sentiment orientation score).
• Finally, we generate a numerical reputation value toward the target entity (product, movie, hotel, restaurant, service, etc.). Then, we propose a new form to visualize reputation by depicting the numerical reputation value, opinion categories, the positive review with the highest score and the negative review with the highest score.
Figure 3 describes the pipeline of our work.

B. DATA COLLECTION AND PREPROCESSING
Unlike previous studies on reputation generation, which mainly focus on extracting semantic and sentiment relations of reviews, our work incorporates other factors such as review helpfulness and review time. Fortunately, the majority of popular E-commerce websites such as TripAdvisor and Amazon gather online reviews with the following structure: textual review, review helpfulness votes and review posting time. Figure 4 describes the structure of online reviews.
With the use of a web scraping tool, we have been able to collect raw data from some real data suppliers like Amazon, TripAdvisor and IMDb.
After collecting all reviews, we applied some preprocessing techniques (lowercasing, tokenization, . . . ). Technical details of data collection and preprocessing phase are described in section V. EXPERIMENT RESULTS subsection A. EXPERIMENTAL DATA COLLECTION AND PRE-PROCESSING.

C. REVIEW HELPFULNESS
The number of helpfulness votes attached to a review indicates how informative it is, which implies that reviews that receive higher votes from other users typically provide more information. Thus, we design formula (1) to compute review helpfulness score.
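Consistent with the range and the logarithm identity given next, formula (1) reads:
$$H(r_{ij}) = \begin{cases} \dfrac{\log_{10}(rh_{ij})}{\log_{10}(N_j)} & \text{if } \dfrac{\log_{10}(rh_{ij})}{\log_{10}(N_j)} \geq 0.75 \\ 0.75 & \text{otherwise} \end{cases} \quad (1)$$
We denote: $H(r_{ij})$: helpfulness score of review $r_{ij}$. $rh_{ij}$: number of helpfulness votes of review $r_{ij}$. $N_j$: number of helpfulness votes of the most voted review of entity $j$. The helpfulness score for a review ranges between 0.75 and 1 because we do not want to assign a low score to reviews with a small number of helpfulness votes.
We mention that $\frac{\log_{10}(rh_{ij})}{\log_{10}(N_j)} \geq 0.75$ means that $\log_{N_j}(rh_{ij}) \geq 0.75$, due to the fact that:
$$\log_{N}(n) = \frac{\log_{base}(n)}{\log_{base}(N)} = \frac{\log_{10}(n)}{\log_{10}(N)}$$
By applying equation (1), the most voted review will receive a review helpfulness score of 1 since $\log_{N_j}(N_j) = 1$. Reviews with high helpfulness votes will receive a high review helpfulness score and reviews with low helpfulness votes will receive a low review helpfulness score, since for $x \in [1, N]$ and $y \in [1, N]$: $x \leq y$ implies that $\log_{N}(x) \leq \log_{N}(y)$.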
Algorithm 1 computes the helpfulness score for review r ij .
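The helpfulness score described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Algorithm 1: the function names are hypothetical, and the handling of reviews with zero votes or of entities whose most voted review has a single vote (where the logarithm is undefined or degenerate) is an assumption that simply falls back to the 0.75 floor.

```python
import math
from typing import List, Sequence

HELPFULNESS_FLOOR = 0.75  # minimum helpfulness score stated in the text


def helpfulness_score(votes: int, max_votes: int,
                      floor: float = HELPFULNESS_FLOOR) -> float:
    """Return log_{max_votes}(votes), never lower than `floor`.

    Assumption: when votes < 1 or max_votes <= 1 the logarithm ratio is
    undefined, so the floor is returned.
    """
    if votes < 1 or max_votes <= 1:
        return floor
    score = math.log10(votes) / math.log10(max_votes)
    return max(floor, min(1.0, score))


def helpfulness_scores(votes_per_review: Sequence[int]) -> List[float]:
    """Score every review of one entity against its most voted review."""
    max_votes = max(votes_per_review, default=0)
    return [helpfulness_score(v, max_votes) for v in votes_per_review]
```

For instance, with hypothetical vote counts of 1000, 500 and 3, the first review scores 1, the second roughly 0.9, and the third is lifted to the floor of 0.75.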

D. REVIEW TIME
What would happen if we took a very well-reviewed gaming laptop from 10 years ago and put it on an online store today? To answer this question, let us travel back 20 years, to a time when the gaming industry witnessed fierce competition between gaming consoles and when a hardcore gaming experience was limited to that kind of technology. In that era, a gamer needed a bulky, heavy TV connected by cables to a relatively large dedicated console and its controllers in order to play a video game, and all of this equipment stayed in one place in the house. Next, the industry shifted to computers, and then to mobile computers, also known as laptops, which satisfied gaming consumers all over the world. Although those laptops were much heavier than they should have been, it was very exciting to be able to enjoy your favorite games wherever you wanted just by packing a laptop in a backpack rather than being stuck in a room to play. Spatial freedom was a gift to the gaming community, and so going mobile was its first preference at that time.
Time goes forward, and so do consumer preferences and choices. By today's standards, just going mobile is not good enough: gamers want lighter laptops, more performance, high-end graphics cards, high-resolution and high-frame-rate screens, mechanical keyboards, and the list goes on. Today, gamers all over the planet have become more demanding; their preferences have changed drastically, and so has the industry as it tries to keep up with them.
Back to our question: it is obvious that nobody will care about a 20-year-old gaming laptop, even if it was a best seller with 1 million 5-star reviews at that time. Why is that? Simply because it has become obsolete by modern user criteria. Its 1 million reviews do not matter anymore. Like all things and beings, reviews have an expiration date, after which they become irrelevant to the buyer.
Although a product, laptop, movie or hotel may once have had very good reviews, time takes away their power, their importance and their effect on the judgement and decision of the consumer. In the end, ''you cannot beat time''. To conclude, we believe that more recent reviews generally provide users with more up-to-date information. Therefore, we design formula (2) to assign a time score to each review.
We denote: T (r ij ): Time score of review r ij . rt ij : Publication year of review r ij . y: Current year. The time score for a review ranges between 0.8 and 1, which implies that a higher time score is assigned to the most recent reviews.
Algorithm 2 computes the time score for review $r_{ij}$. With the help of a film critic, we determined suitable minimum values for the review helpfulness and review time scores. Using different minimum values for both scores as parameters, we performed multiple experiments on a wide range of movies, comparing in each one the generated reputation value with the film critic's own rating of the movie. This led us to choose 0.75 and 0.8, respectively, as the most suitable minimum values for the review helpfulness and review time scores. Next, given the high accuracy achieved by our reputation system, we repeated the same experiments on other domains such as products, restaurants and services, where we also observed very good results, particularly when using 0.75 and 0.8 as the minimum values.
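The exact form of formula (2) is not reproduced in this text, so the following Python sketch is only one shape that agrees with what is stated: the score lies between 0.8 and 1, the most recent reviews score highest, and the worked example in the reputation generation subsection lists time scores of 1, 0.98 and 0.96. The linear decay of 0.02 per year is therefore an assumption, as are the function and constant names.

```python
TIME_FLOOR = 0.8        # minimum time score stated in the text
DECAY_PER_YEAR = 0.02   # hypothetical: chosen to match the scores 1, 0.98, 0.96


def time_score(publication_year: int, current_year: int) -> float:
    """Linearly decaying time score, never lower than TIME_FLOOR.

    Assumption: a review dated in the future is treated as age 0.
    """
    age = max(0, current_year - publication_year)
    # Rounding only removes floating point noise such as 0.9800000000000001.
    return max(TIME_FLOOR, round(1.0 - DECAY_PER_YEAR * age, 10))
```

Under this assumption a review reaches the 0.8 floor once it is ten years old, and older reviews are not penalized further.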

E. REVIEW SENTIMENT ORIENTATION
We fine-tuned the BERT model to determine the sentiment orientation probability of a target review because it has achieved state-of-the-art results in a wide variety of natural language processing tasks by learning contextual relations between words or sub-words in a text. In this paper, we are interested in assigning a sentiment orientation score to each review. Since fine-tuned BERT returns an array of 2 values, the probability of being negative and the probability of being positive (Softmax activation function), we apply the max function to the fine-tuned BERT output vector. The highest probability is kept as the sentiment orientation score of the target review.
$$S(r_{ij}) = \max\left(P_{negative}(r_{ij}),\ P_{positive}(r_{ij})\right) \quad (3)$$
We denote: $S(r_{ij})$: Sentiment orientation score for review $r_{ij}$. $P_{negative}(r_{ij})$: BERT model output prediction for review $r_{ij}$ being negative. $P_{positive}(r_{ij})$: BERT model output prediction for review $r_{ij}$ being positive.

Algorithm 3 Review Sentiment Orientation Score
Define: $R_j = \{r_{1j}, r_{2j}, \ldots, r_{nj}\}$: The set of reviews expressed for the entity $j$. $BERT_j = \{bert(r_{1j}), bert(r_{2j}), \ldots, bert(r_{nj})\}$: The set of output vectors of fine-tuned BERT Base (the sentiment orientation probabilities of reviews expressed for the entity $j$).
return S
End Function
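As an illustration of the max-over-softmax step, the sketch below takes the two raw classifier outputs for one review, converts them to probabilities and keeps the larger one. It is a plain-Python stand-in rather than the authors' Algorithm 3: the function names and the input order (negative first, then positive) are assumptions, and no BERT library is called.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_orientation(logits: Sequence[float]) -> Tuple[str, float]:
    """Return (label, score) where score = max(P_negative, P_positive).

    Assumption: `logits` holds exactly two values ordered as
    [negative, positive].
    """
    p_negative, p_positive = softmax(logits)
    label = "positive" if p_positive >= p_negative else "negative"
    return label, max(p_negative, p_positive)
```

Because the two probabilities sum to 1, the score produced this way always lies between 0.5 and 1.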

F. REVIEW SCORE
Based on the above scores, we design formula (4) to compute a numerical score for each review. We denote: $RS(r_{ij})$: Review score for review $r_{ij}$. $H(r_{ij})$: Helpfulness score of review $r_{ij}$. $T(r_{ij})$: Time score of review $r_{ij}$. $S(r_{ij})$: Sentiment orientation score for review $r_{ij}$. Since the review helpfulness score, review time score and review sentiment orientation score range between 0 and 1, the generated review score is also between 0 and 1.
Algorithm 4 computes the review score for all reviews. Table 2 represents an example results of review score.
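Formula (4) itself is not reproduced in this text. One combination that stays between 0 and 1 and that reproduces the review scores of the worked example in the reputation generation subsection (0.999, 0.942, 0.902) is the arithmetic mean of the three component scores, so the Python sketch below uses it as an assumption. The sentiment scores of about 0.997 in the usage lines are hypothetical values picked so that the mean matches those numbers.

```python
def review_score(helpfulness: float, time: float, sentiment: float) -> float:
    """Assumed form of the review score: arithmetic mean of the three scores."""
    for name, value in (("helpfulness", helpfulness),
                        ("time", time),
                        ("sentiment", sentiment)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must lie in [0, 1], got {value}")
    return (helpfulness + time + sentiment) / 3.0


# Helpfulness and time scores from the worked example; sentiment scores
# are hypothetical.
example = [(1.0, 1.0, 0.997), (0.849, 0.98, 0.997), (0.75, 0.96, 0.996)]
example_scores = [round(review_score(h, t, s), 3) for h, t, s in example]
```

A product of the three scores would also stay in the unit interval, but it would give about 0.83 for the second review rather than 0.942, which is why the mean is used here.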

G. REPUTATION GENERATION
We propose formula (5) to compute a single reputation value toward the target entity using review score RS(r ij ) and review attached rating v ij :

Algorithm 4 Review Score
Define: $R_j = \{r_{1j}, r_{2j}, \ldots, r_{nj}\}$: The set of reviews expressed for the entity $j$. $RH_j = \{rh_{1j}, rh_{2j}, \ldots, rh_{nj}\}$: The set of review helpfulness votes expressed for the entity $j$. $RT_j = \{rt_{1j}, rt_{2j}, \ldots, rt_{nj}\}$: The set of review posting times expressed for the entity $j$. $RS_j = \{rs_{1j}, rs_{2j}, \ldots, rs_{nj}\}$: The set of review scores expressed for the entity $j$.

Algorithm 5 Reputation Generation

The reputation value varies from 1 to 5 or from 1 to 10, depending on the rating range of the target entity.
Algorithm 5 computes the reputation value toward a target item.
Assume that an entity $E_j$ contains three reviews. After applying formulas (1) and (2), we get the helpfulness and time scores: $H(r_{1j}) = 1$, $H(r_{2j}) = 0.849$, $H(r_{3j}) = 0.75$, $T(r_{1j}) = 1$, $T(r_{2j}) = 0.98$ and $T(r_{3j}) = 0.96$. After applying formula (4), we get the review scores: $RS_j = \{0.999, 0.942, 0.902\}$. In order to compute the reputation value toward $E_j$, we need to compute the product of $rs_{ij}$ and $v_{ij}$. We get $rs_{1j} \cdot v_{1j} = 9.99$, $rs_{2j} \cdot v_{2j} = 9.42$ and $rs_{3j} \cdot v_{3j} = 9.02$. Since $rs_{1j} \cdot v_{1j} > rs_{2j} \cdot v_{2j} > rs_{3j} \cdot v_{3j}$, we can conclude that the first review $r_{1j}$ has the highest impact (the highest product, 9.99) on the reputation value of the entity $E_j$, since it is very helpful and recent. On the contrary, the third review $r_{3j}$ has the lowest impact (the lowest product, 9.02): although it has the same attached rating as the first review, being both unhelpful and old makes it less influential.
To conclude, recent and helpful reviews have more impact on the reputation value than old and unhelpful ones.
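The aggregation step can be sketched from the problem definition, which describes the reputation value as the average of the products of review scores and attached ratings. The Python function below follows that literal reading; it is an assumption, since formula (5) is not reproduced in this text, and a normalized variant that divides by the sum of the review scores instead of the number of reviews would be another plausible reading. The ratings of 10 in the usage lines are inferred from the products in the worked example (9.99 / 0.999).

```python
from typing import Sequence


def reputation(review_scores: Sequence[float],
               ratings: Sequence[float]) -> float:
    """Average of review_score * rating over all reviews of one entity."""
    if len(review_scores) != len(ratings):
        raise ValueError("review_scores and ratings must have the same length")
    if not ratings:
        raise ValueError("at least one review is required")
    products = [rs * v for rs, v in zip(review_scores, ratings)]
    return sum(products) / len(products)


# Review scores from the worked example, each with an inferred rating of 10.
example_reputation = reputation([0.999, 0.942, 0.902], [10, 10, 10])
```

With these inputs the three products are 9.99, 9.42 and 9.02, so the recent and helpful first review contributes the most to the average.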

H. REPUTATION VISUALIZATION
It is important to provide a potential customer or user with sufficient information to support their decision. Thus, we propose a new way to visualize reputation by depicting the produced numerical reputation value toward the target entity, the opinion categories, the positive review with the highest review score (formula 4) and the negative review with the highest review score (Figure 6).

V. EXPERIMENT RESULTS

A. EXPERIMENTAL DATA COLLECTION AND PREPROCESSING
Five miscellaneous domains were addressed in our experiments: movie, TV show, product, hotel and restaurant. We collected user reviews from IMDb, TripAdvisor and Amazon using a web scraping tool called ScrapeStorm. We extracted 400 reviews for 4 movies, 400 reviews for 4 TV shows, 200 reviews for 2 products, 100 reviews for 1 hotel and 100 reviews for 1 restaurant. Each extracted review contains: raw text, review time, review helpfulness votes and review attached rating (Figure 4). The statistical information of the dataset is shown in Table 3.
After collecting the reviews, we:
1) lowercase the text, since we are using a BERT lowercase model
2) tokenize it
3) break words into WordPieces
4) map the words to indexes using the vocab file that BERT provides
5) add the special ''CLS'' and ''SEP'' tokens
6) append ''index'' and ''segment'' tokens to each input
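The preprocessing steps above can be illustrated with a small self-contained Python sketch. It is not the tokenizer shipped with BERT: the vocabulary is a hypothetical toy one, tokenization is simplified to whitespace splitting, and step 6 is interpreted here as producing a list of vocabulary indexes plus a list of segment ids, which is an assumption. The WordPiece split uses the usual greedy longest-match-first strategy with a ''##'' prefix for word-internal pieces.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real BERT checkpoint provides its own file.
TOY_VOCAB: Dict[str, int] = {tok: i for i, tok in enumerate(
    ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "the", "movie", "was",
     "un", "##believ", "##able", "!"]
)}


def wordpiece(word: str, vocab: Dict[str, int]) -> List[str]:
    """Greedy longest-match-first split of one word into WordPieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text: str, vocab: Dict[str, int], max_len: int = 128) -> dict:
    """Lowercase, split, WordPiece, add special tokens, map to indexes."""
    words = text.lower().split()  # simplification: whitespace tokenization
    pieces = [p for w in words for p in wordpiece(w, vocab)]
    tokens = ["[CLS]"] + pieces[: max_len - 2] + ["[SEP]"]
    return {
        "tokens": tokens,
        "input_ids": [vocab[t] for t in tokens],
        "segment_ids": [0] * len(tokens),  # single-sentence input
    }
```

Calling `encode("The movie was unbelievable !", TOY_VOCAB)` splits ''unbelievable'' into `un`, `##believ` and `##able` and wraps the sequence in the two special tokens; the default `max_len` of 128 mirrors the sequence length used for fine-tuning.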

B. SENTIMENT ANALYSIS
We fine-tune the BERT Base model to predict the sentiment orientation of the collected reviews. We build the model by creating a single new layer that is trained with the Large Movie Review Dataset v1.0 [64], which contains 25,000 positive and 25,000 negative processed movie reviews. We set the sequence length to 128, the batch size to 32, the learning rate to 0.00002 and the number of epochs to 3. Table 4 depicts the performance of the fine-tuned BERT-Base model on the Large Movie Review Dataset v1.0.
We compared the BERT-Base model with a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM). GloVe embeddings were used to train LSTM, BiLSTM and CNN. We can see from Table 5 that the BERT-Base model achieves the highest sentiment analysis accuracy compared to Vanilla CNN, Vanilla LSTM and Vanilla BiLSTM.
We have mentioned in the literature review section some successful pre-trained models that achieve state-of-the-art results on Large Movie Review Dataset v1.0 such as XLNet-Large that achieves an accuracy of 96.21 and BERT-Large that achieves an accuracy of 95.79. However, these models require the combination of GPUs with plenty of computing power and a massive amount of memory.
We test BERT-Base model on our collected dataset. Figure 5 represents the accuracy of the model in predicting sentiment orientation for the collected reviews.
We observe from Figure 5 that the model achieves good results in predicting the sentiment polarity of the extracted reviews. Even more impressively, the model performs well on datasets 9, 10, 11 and 12, which contain product, hotel and restaurant reviews, despite the fact that it is trained on movie reviews.

C. REPUTATION VISUALIZATION
We propose a new way to visualize reputation by depicting the produced numerical reputation value toward the target entity, opinion categories, top positive review and top negative review.
As illustrated in Figure 6, our system provides users and potential customers with a reputation visualization form that shows the numerical reputation value toward the target entity, opinion categories (very good, good, neutral, very bad and bad) in a pie chart, top positive review (positive review that holds the highest review score) and top negative review (negative review that holds the highest review score).
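Selecting the two reviews shown in the visualization form can be sketched as follows. The Python snippet assumes each review already carries a predicted sentiment label and a review score; the `ScoredReview` type and the `top_reviews` function are hypothetical names, and the derivation of the opinion categories is left out because the text does not specify how they are assigned.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class ScoredReview(NamedTuple):
    text: str
    label: str    # "positive" or "negative", as predicted by the classifier
    score: float  # review score of this review


def top_reviews(
    reviews: Sequence[ScoredReview],
) -> Tuple[Optional[ScoredReview], Optional[ScoredReview]]:
    """Return (top positive review, top negative review).

    Each is the review with the highest review score within its polarity,
    or None when the entity has no review of that polarity.
    """
    positives = [r for r in reviews if r.label == "positive"]
    negatives = [r for r in reviews if r.label == "negative"]
    top_positive = max(positives, key=lambda r: r.score, default=None)
    top_negative = max(negatives, key=lambda r: r.score, default=None)
    return top_positive, top_negative
```

Returning `None` for a missing polarity lets the display layer decide what to show when, for example, an entity has no negative reviews.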
Compared to previous studies on reputation generation [5], [9], [10], our proposed system is the only one that presents all of this helpful information in order to support users and customers during the decision making process on e-commerce websites. Table 6 shows comparison results between our system and previous reputation systems in terms of reputation visualization.

D. SYSTEM EVALUATION
Previous studies on reputation generation based on mining user and customer reviews expressed in natural language have mainly focused on exploiting semantic and sentiment relations between reviews to generate a single reputation value toward various entities. However, customer and user reviews contain a lot of other useful information that could be exploited during the reputation generation phase, such as review posting time and review helpfulness votes. To date, no work has incorporated review time, review helpfulness votes and review sentiment polarity to produce a single numerical reputation value. Therefore, we propose a new reputation system that combines review posting time, review helpfulness votes and review sentiment orientation in order to generate an accurate and reliable reputation value toward different entities. Table 7 depicts the difference between previous reputation systems [5], [9], [10] and our proposed reputation system.
96558 VOLUME 8, 2020 A. Benlahbib, E. H. Nfaoui: Aggregating Customer Review Attributes for Online Reputation Generation
Since there are no standard evaluation metrics to assess the effectiveness and robustness of reputation systems, we conduct a user and expert survey, as adopted in many research papers [65]. We invited 32 users and 3 experts to rate four reputation generation systems: system 1 (our reputation system), system 2 [10], system 3 [9] and system 4 [5]. Each user and expert assigns a satisfaction score to each reputation system. The score ranges between 1 and 10.
The 32 users are from different backgrounds: 6 computer science PhD students, 2 math PhD students, an electrical engineer, an undergraduate student in mathematics, 2 computer science engineers, a physics teacher, 4 mathematics teachers, a research engineer in computer science, an electronic engineering student, an information systems engineer, a third year student at the National School of Commerce and Management, a quality control technician, a sixth year medical student, a housewife, 7 second year medical students and a software engineer. Table 8 presents the average satisfaction scores for each reputation system given by the 32 users.
The formula of the average satisfaction score is:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

where $\{x_1, x_2, \ldots, x_N\}$ are the observed values of the sample items and $N$ is the number of observations in the sample. The standard deviation is a measure of the amount of variation or dispersion of a set of values [66]. The formula for the standard deviation is:

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

where $\{x_1, x_2, \ldots, x_N\}$ are the observed values of the sample items, $\mu$ is the mean value of these observations, and $N$ is the number of observations in the sample.
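As an illustration of the two statistics reported in Table 8 and Table 9, the following minimal Python sketch computes the average and the standard deviation of a list of satisfaction scores. It assumes the population form of the standard deviation (division by $N$). The function name `satisfaction_summary` and the example scores are hypothetical and are not taken from the survey data.

```python
import math

def satisfaction_summary(scores):
    """Return (mean, standard deviation) of a list of satisfaction scores.

    Assumption: the standard deviation divides by N (population form).
    """
    n = len(scores)
    if n == 0:
        raise ValueError("at least one score is required")
    mu = sum(scores) / n                                        # average score
    sigma = math.sqrt(sum((x - mu) ** 2 for x in scores) / n)   # dispersion
    return mu, sigma

# Hypothetical scores on the 1-10 scale, not taken from Table 8.
example_scores = [9, 8, 10, 9, 9]
mu, sigma = satisfaction_summary(example_scores)
print(f"mean = {mu:.2f}, std = {sigma:.2f}")  # mean = 9.00, std = 0.63
```

A lower standard deviation indicates that the respondents agree more closely on the score of a given system.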
We can see from Table 8 that 31 users favor our reputation system over the three other systems in terms of helpfulness and effectiveness in generating and visualizing reputation, since it achieves the highest average satisfaction score and the lowest standard deviation of satisfaction scores. Only one user (user 19) favors system 2 [5]. System 2 takes second place with an average satisfaction score of 7.83. System 4 [10] comes next with an average satisfaction score of 7.01, which is reasonable since the main goal of system 2 was to improve system 4 by exploiting both sentiment and semantic analysis techniques. System 3 [9] takes last place with an average satisfaction score of 5.625. System 3 does not provide users and customers with sufficient information to support their decision: a reputation value alone is not enough to help them judge a target item. Customers need more helpful information during their decision-making process, such as opinion categories, top positive review and top negative review.
We enrich our experimental results by inviting 3 experts to rate each reputation system with a satisfaction score. Expert 1 is a former owner of an e-commerce website whose main field of interest is natural language processing and machine learning, while expert 2 is an active e-commerce buyer and seller with more than 8 years of experience. Expert 3 is a second year PhD student in economic sciences. Table 9 presents the average satisfaction scores for each reputation system given by the three experts.
Based on the average satisfaction scores given by the three experts (Table 9), reputation system 1 takes first place with an average satisfaction score of 9.17, followed by system 2 and system 4, while system 3 comes in last place with an average satisfaction score of 5.67. Figure 7 combines the results of Table 8 and Table 9. It shows that both users and experts choose system 1 as the best in terms of reputation generation and visualization. System 2 holds second place, followed by system 4. System 3 comes in fourth place.
We asked the three experts to share their opinions about the strengths and weaknesses of system 1. Table 10 contains the expert reviews of system 1.

E. FURTHER DISCUSSION
In summary, our reputation system exhibits the following advantages:
• Accuracy: The system incorporates review helpfulness, review time, review sentiment orientation probability and review attached rating in order to generate an accurate reputation value.
• Holistic: The system proposes a new form of reputation visualization that depicts numerical reputation value, opinion categories, top positive review and top negative review. The system also has the ability to output the top-k positive reviews and the top-k negative reviews. This new form of reputation visualization provides customers with sufficient information toward the target item in order to make a decision (buying, renting, booking) toward it.
• Generality: The system can be applied to any website that allows web users to: (1) post their reviews expressed in natural language, (2) share their numerical or star ratings and (3) vote for helpful reviews. Furthermore, the system can be applied to various domains (products, movies, services, hotels).
• Usefulness: The system supports web customers during their decision-making process in e-commerce by instantly providing them with sufficient information about the target item, saving them the time and effort of reading thousands of online reviews.
However, our reputation system suffers from:
• Safety: Due to the openness of the Internet, many malicious users post fake reviews (false positive/false negative) aiming to impact the popularity and credibility of online products. Therefore, our system should incorporate a filtering phase in order to detect and remove fake and irrelevant reviews.

VI. CONCLUSION
In this paper, we have proposed a reputation system that generates reputation toward various items (products, movies, TV shows, hotels, restaurants, services) by mining customer and user reviews expressed in natural language. The system incorporates four review attributes: review helpfulness, review time, review sentiment polarity and review rating. The system also provides a holistic reputation visualization form by depicting the numerical reputation value, opinion group categories, top positive review and top negative review.
To better evaluate the effectiveness of our reputation system, 32 users and 3 experts were invited to assign a score from one (least satisfaction) to ten (highest satisfaction) to four reputation generation systems. Our reputation system achieved the highest average satisfaction scores given by both users and experts. The three experts were also invited to share their point of view on the proposed system in terms of reputation generation and visualization. We believe that the proposed system represents an interesting online reputation system that offers insights into customers' decision-making process on e-commerce websites.
Future studies will focus on:
• exploiting further features such as user credibility (prolific reviewers) and user's online behavior as suggested by expert 1 (Table 10).
• detecting and removing fake and irrelevant reviews by applying a filtering phase, thereby reducing the processing time and increasing the efficiency of the system, since only relevant and useful reviews will be taken into account.
• incorporating aspect based opinion mining during the phase of reputation generation and visualization. As a result, the reputation visualization will be enhanced. Indeed, the system will depict more useful information toward the target entity $E$ such as its features ($E_{\text{featureX}}$, $E_{\text{featureY}}$, $E_{\text{featureZ}}$, ...), the number of positive reviews toward feature $E_{\text{featureX}}$, and the number of negative reviews toward feature $E_{\text{featureY}}$, ...
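The per-feature counts mentioned in the last item could be tallied as in the following minimal Python sketch. It assumes that an upstream aspect-based opinion mining step, which is not part of the present system, has already produced (feature, polarity) pairs. The function name `count_aspect_polarity` and the example features are hypothetical.

```python
from collections import Counter

def count_aspect_polarity(labeled_mentions):
    """Count positive and negative mentions per feature.

    labeled_mentions: iterable of (feature, polarity) pairs, where polarity
    is "positive" or "negative". Both labels are assumed to come from a
    separate aspect extraction and sentiment classification step.
    """
    counts = {}
    for feature, polarity in labeled_mentions:
        counts.setdefault(feature, Counter())[polarity] += 1
    return counts

# Hypothetical mentions for a target entity E (illustrative only).
mentions = [
    ("battery", "positive"),
    ("battery", "negative"),
    ("screen", "positive"),
    ("battery", "positive"),
]
for feature, tally in count_aspect_polarity(mentions).items():
    print(feature, tally["positive"], tally["negative"])
```

Such per-feature tallies could then be displayed next to the overall reputation value in the visualization form.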

ACKNOWLEDGEMENT
A sincere thank you to Mohammed El Moutaouakkil (mohammed.elmoutaouakkil@usmba.ac.ma) for his diligent proofreading of this paper.