Research on Product Reviews Hot Spot Discovery Algorithm Based on MapReduce

In recent years, with the development of e-commerce, the scale of comment data has shown an exponential growth trend. This paper proposes PR-HD, a product review hot spot discovery algorithm based on MapReduce. The algorithm uses the Vector Space Model to vectorize the text of the reviews, uses the TF-IDF algorithm to calculate the position weight of the feature words, and then combines the Canopy algorithm with the K-Means algorithm to discover the hot spots of product reviews. At the same time, the MapReduce framework gives the algorithm the ability to process massive data. Experiments demonstrate that the PR-HD algorithm has high accuracy and parallel efficiency, which allows product developers to obtain more direct and effective suggestions and feedback.


I. INTRODUCTION
In an age of rapid development in information technology and the popularization of Internet technology, e-commerce has gradually developed and matured. Nowadays, the convenience of online shopping and the busy lifestyle of modern people keep raising the demand for e-commerce. This situation brings not only opportunities for the development of e-commerce platforms but also fierce competition. It is commonly understood that online reviews can reduce consumer uncertainty about product characteristics and, therefore, have the potential to increase product demand and firm profits [1]. As a result, the major e-commerce platforms should pay much more attention to the mining of product reviews.
A problem shared by many companies and scholars is how to extract as much hidden information as possible by analyzing large-scale data quickly and accurately [2]. On any popular shopping site, most products have a large number of comments. This volume of data makes it difficult for producers to get product feedback in a timely manner. Besides, information with significant commercial value is often hidden in these large-scale data [3]. A product review hot spot discovery algorithm can filter this information effectively and handle a comment analysis task that is impossible by manpower alone. Such an algorithm first preprocesses the comments, then carries out text clustering, and finally finds the key points of the comments in each category. Through these steps, commodity producers can quickly understand the needs of users.
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
The proposed MapReduce-based product review hot spot discovery algorithm, PR-HD, is shown to be effective in comparison with other algorithms. It combines the hotspot detection algorithm with the MapReduce distributed computing framework and mines the product review dataset across multiple computers, which finally realizes the hot spot discovery of commodity reviews. This work is therefore of research value in the field of product review mining.

II. RELATED WORK
Driven by the fifth wave of the information technology revolution and global informatization, theoretical research on and practical operation of e-commerce have arisen [4]. Nowadays, the mining of product reviews has become a research hotspot that attracts much attention [5]. The mining of product reviews is the process of collecting user comments on the Internet as mining objects and discovering information about all aspects of products. In recent years, value mining algorithms for product reviews have been continuously proposed, especially in sentiment analysis [6], [7]. For example, a sentiment dictionary can be used to process and represent the product reviews, weights can be assigned to the different words in the dictionary, and a Bayesian classification model can then classify the product reviews [8]-[10]. This work achieves the emotional division of product reviews, but the Naive Bayes model obtains its best experimental results only when the attributes are independent of each other, and the uncertainty of actual product reviews greatly affects the accuracy of the algorithm. Manek et al. [11], Al-Smadi et al. [12] and Basari et al. [13] selected different SVM models, currently widely used in machine learning, as the basic classifier to extract the value of reviews. Compared with the Bayesian algorithm, these approaches have higher accuracy and avoid the ''curse of dimensionality'' in the field of product review mining [14]. However, commodity reviews often involve various aspects of products. The above literature only divides comments into two categories according to their sentiment, so the mining of comments is not comprehensive enough.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In order to fully exploit the value hidden in comment data, Babu et al. [15] used the K-means clustering algorithm to mine commodity reviews. Comment mining based on clustering has clear advantages in the diversity of commodity review mining. But as the size of the data increases, the accuracy of the algorithm faces new challenges. Subsequent research obtained better results by improving the initial center point selection of the K-means algorithm [16]-[18], which can further improve the accuracy in the data analysis of shopping malls. At the same time, various other clustering algorithms have been applied to comment mining. Shao-Hua et al. [19] used Latent Dirichlet Allocation (LDA) on restaurant reviews to obtain useful topics. Yu et al. [20] examined how the LDA model combined with natural language processing techniques can be used to identify hot topics from free-text customer reviews. All the above research has greatly improved the comprehensiveness of comment mining, but it is still limited by the size of the data in the actual environment [21]. Another problem caused by the increase in data volume is that it is difficult to determine the number of hot spots discussed in the review data set.
In the actual environment, clustering algorithms are computationally complex, and comment data are high-dimensional and sparse. This makes the algorithm run longer and can even cause cluster conflicts. Therefore, many classic clustering algorithms have been redesigned to run in parallel on Hadoop, and many optimized algorithms have been continuously proposed. For example, Liguang and Qicheng [22] parallelized the FCM algorithm using the MapReduce framework, which can discover hot topics on microblogs more efficiently. Yiming et al. [23] proposed an improved parallel K-means algorithm that also achieved good results. Sinha and Jana [24] combined a genetic algorithm with the k-means algorithm and proposed a novel clustering algorithm for distributed datasets. The above work shows that algorithms based on MapReduce can avoid the limitation of data size and make the mining of hyper-scale product review data possible. At present, in order to adapt to the real environment of product review data mining, the parallel design of algorithms has gradually become one of the research topics that deserve attention.
According to the characteristics of current reviews, this paper proposes a MapReduce-based product review hot spot discovery algorithm, the PR-HD algorithm, which aims to mine the value of commodity reviews in depth, across many aspects, and in parallel. The work of this paper mainly includes the following two points. First, the PR-HD algorithm comprehensively explores the hidden value of commodity reviews and extends their mining from simple sentiment analysis to multi-faceted hotspot discovery, thus solving the problem of insufficient comment mining. It uses the Canopy algorithm to determine the number of hotspots for commodity reviews, thus solving the problem that the number of hotspots cannot be determined for large-scale commodity review data. Second, the PR-HD algorithm is designed on the MapReduce framework, which performs well on very large-scale data and solves the problem that serial algorithms cannot handle large-scale data processing. Therefore, the PR-HD algorithm is suitable for product review data that is very large and whose discussion hotspots are difficult to determine.

III. RELATED KNOWLEDGE
A. TEXT VECTORIZATION
VSM (Vector Space Model) is a common method of vectorizing texts. It treats the contents of all documents as a collection of words, each of which is assigned a separate index value that points to the vector dimension of the word. The dimensions of all the words in an article constitute the vector of this article. For each word in the vector, we use the TF-IDF algorithm (Term Frequency-Inverse Document Frequency) to calculate the value of its corresponding vector dimension.
Let us assume that there are a total of $N$ documents. The words used in these documents are recorded as $w_1, w_2, w_3, \ldots, w_i$, and the frequencies of these words are $f_1, f_2, f_3, \ldots, f_i$. For each word $w_i$, the value $W_i$ in its corresponding dimension can be obtained by Equation 1:
$$W_i = TF_i \times IDF_i = f_i \times \log\frac{N}{DF_i} \tag{1}$$
In Equation 1, $TF_i$ is the word frequency of the word $w_i$, and the corresponding value is $f_i$. $DF_i$ is the document frequency of the word $w_i$, which refers to the number of documents containing the word $w_i$. $IDF_i$ is the inverse document frequency corresponding to the word $w_i$, and its value is $\log\frac{N}{DF_i}$.
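The weighting in Equation 1 can be illustrated with a minimal Python sketch. It assumes that each review has already been segmented into a list of words, that $TF_i$ is the raw count of the word inside one review, and that the logarithm is natural; the function name `tfidf_vectors` and the sample reviews are hypothetical and are not part of the PR-HD implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {word: weight} vector per tokenized document."""
    n_docs = len(docs)
    # DF: number of documents that contain the word at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)  # TF taken as the raw count (assumption)
        vectors.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return vectors


# Hypothetical, already segmented reviews.
reviews = [
    ["screen", "bright", "screen"],
    ["battery", "weak"],
    ["screen", "battery"],
]
vectors = tfidf_vectors(reviews)
```

A word that appears in every document receives weight zero under this formula, which is why very common words contribute little to the later distance computations.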

B. CANOPY ALGORITHM
The Canopy algorithm is a practical clustering algorithm because of its high speed and easy implementation. It aggregates data into different clusters using two preset thresholds $T_1$ and $T_2$ ($T_1 > T_2$). The Canopy algorithm can only roughly divide the data, and the results are not accurate enough. The Canopy algorithm is described as follows:
• Put the data set into the List and input the thresholds $T_1$ and $T_2$;
• Randomly pick a point as the center point, add it to the Canopy collection, and remove the point from the List collection;
• Compare the distance between the center point in the Canopy collection and each other point. If the distance is less than $T_2$, assign the two points to the same cluster and delete the point from the List; if the distance is greater than $T_1$, the point is added to the Canopy collection and then deleted from the List collection; if the distance is greater than $T_2$ and less than $T_1$, the point is still saved in the List;
• The algorithm ends when the List is empty.
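For orientation, the following Python sketch shows the common two-threshold formulation of Canopy clustering, which may differ in detail from the variant listed above. Its assumptions are labeled here and in the comments: the distance is taken as one minus the cosine similarity, vectors are non-zero tuples, and the first remaining point is used as the next center instead of a random one so that the run is reproducible. The names `cosine_distance` and `canopy` are illustrative.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def canopy(points, t1, t2, distance=cosine_distance):
    """Return a list of (center, members) pairs, with t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        # Assumption: take the first point rather than a random one.
        center = remaining.pop(0)
        members = [center]
        kept = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:       # loosely bound: joins this canopy
                members.append(p)
            if d >= t2:      # not tightly bound: stays a candidate
                kept.append(p)
        remaining = kept
        canopies.append((center, members))
    return canopies


points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
canopies = canopy(points, t1=0.5, t2=0.2)
```

The number of canopies returned is what a later stage would use as the k value.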

C. K-MEANS ALGORITHM
The K-means algorithm is a partition-based clustering algorithm that divides data into k clusters for a given input value of k. It has the advantages of simplicity and speed, but it is greatly affected by the value of k, and different initial points tend to produce large differences in the results. The algorithm is described as follows:
• Randomly select k points from the data set as the initial center points;
• For every other point, calculate its distance from each center point and assign it to the cluster of the nearest center point;
• Update the center point of each cluster by calculating a new center point;
• Repeat the iteration until the clustering error satisfies the threshold, and output the clustering result.
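The threshold of the K-means algorithm is given by Equation 2:
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \left\| x - \tilde{x}_i \right\|^2 \tag{2}$$
$E$ is the sum of the squared errors of all points, $x$ is a point in cluster $C_i$, and $\tilde{x}_i$ is the mean of the points within cluster $C_i$.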
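The iteration and its stopping criterion can be sketched serially as follows. This is a minimal illustration under stated assumptions, not the PR-HD code: the initial centers are passed in explicitly instead of being drawn at random, the squared Euclidean distance is used, and the loop stops when the sum of squared errors $E$ no longer decreases by more than a tolerance `tol`. All names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return tuple(sum(column) / len(points) for column in zip(*points))


def kmeans(points, centers, tol=1e-9, max_iter=100):
    """Return (centers, clusters, sse) after convergence or max_iter."""
    prev_sse = float("inf")
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: an empty cluster keeps its previous center.
        centers = [mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        # Sum of squared errors E over all clusters.
        sse = sum(squared_distance(p, centers[i])
                  for i, c in enumerate(clusters) for p in c)
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return centers, clusters, sse


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, clusters, sse = kmeans(points, centers=[(0, 0), (10, 0)])
```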

D. COSINE DISTANCE
In order to compare the similarity between two vectors, we need to determine the distance between them. Given the characteristics of text vectors, we use the cosine distance to calculate the similarity of two vectors. We denote the cosine distance between two n-dimensional vectors $d(d_1, d_2, \ldots, d_n)$ and $c(c_1, c_2, \ldots, c_n)$ as $sim$. The value of $sim$ is obtained by Equation 3:
$$sim(d, c) = \frac{\sum_{j=1}^{n} d_j c_j}{\sqrt{\sum_{j=1}^{n} d_j^2}\,\sqrt{\sum_{j=1}^{n} c_j^2}} \tag{3}$$
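Because review vectors are sparse, the cosine measure is conveniently computed over only the dimensions that two vectors share. The sketch below assumes that each vector is stored as a dictionary from word id to weight and that a zero vector has similarity 0 with everything; both choices, and the name `cosine_sim`, are assumptions made for illustration.

```python
import math


def cosine_sim(d, c):
    """Cosine similarity of two sparse vectors stored as {word_id: weight}."""
    shared = d.keys() & c.keys()
    dot = sum(d[k] * c[k] for k in shared)
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_c = math.sqrt(sum(v * v for v in c.values()))
    if norm_d == 0 or norm_c == 0:
        return 0.0  # assumption: an empty review is similar to nothing
    return dot / (norm_d * norm_c)


same_direction = cosine_sim({1: 1.0}, {1: 2.0})
no_overlap = cosine_sim({1: 1.0}, {2: 1.0})
partial = cosine_sim({1: 3.0, 2: 4.0}, {1: 3.0})
```

With non-negative TF-IDF weights the value lies between 0 and 1, and larger values indicate more similar reviews.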

IV. PR-HD PARALLEL ALGORITHM DESIGN
The PR-HD algorithm is mainly divided into data preprocessing, text vectorization, determination of the cluster center vectors, and cluster analysis. Through the PR-HD algorithm, the original large and unordered comment data can be normalized, and comment hotspots covering different aspects of the product can be extracted from it. These comment hotspots provide valuable insights for producers, sellers and consumers.

A. PREPROCESSING AND TEXT VECTORIZATION
The preprocessing and text vectorization phase of the PR-HD algorithm collects product reviews in various ways, eliminates useless text and turns each comment into a collection of words. In order to convert the text into a vector format that can be computed on, the TF-IDF algorithm is used to calculate the weight of each word, so that every comment is converted into a vector of the form $< w_1 : tfidf ;\ w_2 : tfidf >$. Here $w_i$ is the word id corresponding to a word, and it is followed by the weight of that word. All the articles are integrated into this vector format and passed to the next stage.

B. DETERMINING THE CLUSTER CENTER VECTOR
In order to find the approximate number of clusters in the product review data, the PR-HD algorithm first performs ''rough clustering'' on the data set passed from the previous stage, using the Canopy algorithm to determine the number of center points (the k value) needed in the next stage. The focus of this phase is to select appropriate thresholds $T_1$ and $T_2$. We stipulate that $T_1$ is greater than $T_2$, and the thresholds should be adjusted according to the actual situation to obtain more satisfactory results. When $T_1$ is set too high, more vectors belong to multiple Canopies, which brings the center points close together so that the clusters differ little; when $T_1$ is set too low, the number of clusters is too large and the clustering effect is poor. When $T_2$ is set too high, more vectors receive strong marks, which reduces the number of clusters; when $T_2$ is set too low, the number of clusters increases, and so does the running time of the algorithm. The parallelization mechanism of the Canopy phase is that each node generates a number of Canopies from its local comment data set $D_i$. We then merge these Canopies and finally obtain k clusters.
The MapReduce framework consists of two parts: the Map task and the Reduce task. In the Map phase of the Canopy stage, each node randomly extracts a vector $v_i$ from its local vector set $D_i$ as a Canopy center vector and then generates a set of Canopy center vectors. We calculate the distance between $v_i$ and the other vectors, using the cosine distance $sim$ to represent the similarity between two vectors, and output the local center vectors as < centerid, vector >. The Map task is described in Algorithm 1.
The Reduce phase is mainly responsible for collecting the local Canopy center vectors output by each node in the Map phase and executing the Canopy algorithm again to obtain the global center vectors. The thresholds $T_3$ and $T_4$ are equal to $T_1$ and $T_2$ by default, and the output is < key1, value1 >, where key1 is the id of the final Canopy and value1 is the global center vector. The Reduce phase is described in Algorithm 2. This stage solves the problem that the number of topics discussed in the commodity reviews cannot be determined in advance: it yields the number of categories in the commodity review data set and provides the cluster centers required for the next stage.
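The two-level structure of this stage (local centers per node, then a second Canopy pass over the collected centers) can be imitated in plain Python without Hadoop. The sketch below is a simplified simulation under several assumptions: each partition stands for one node's local set $D_i$, only the tighter threshold is modeled because only the centers are emitted, the first remaining vector is used as the next center instead of a random one, and the distance is one minus the cosine similarity. The names `canopy_centers`, `map_task` and `reduce_task` are hypothetical and do not reproduce Algorithms 1 and 2.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def canopy_centers(vectors, tight):
    """Pick centers; vectors closer than `tight` to a center are dropped."""
    remaining, centers = list(vectors), []
    while remaining:
        center = remaining.pop(0)  # assumption: first, not random
        centers.append(center)
        remaining = [v for v in remaining
                     if cosine_distance(center, v) >= tight]
    return centers


def map_task(local_vectors, t2):
    """Emit (center_id, vector) pairs for one node's partition."""
    return list(enumerate(canopy_centers(local_vectors, t2)))


def reduce_task(map_outputs, t4):
    """Merge all local centers and run the Canopy pass again."""
    local_centers = [vec for output in map_outputs for _, vec in output]
    return list(enumerate(canopy_centers(local_centers, t4)))


partitions = [
    [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)],  # hypothetical node 1
    [(0.1, 0.9), (1.0, 0.1)],              # hypothetical node 2
]
global_centers = reduce_task([map_task(p, 0.2) for p in partitions], 0.2)
k = len(global_centers)
```

In this toy run each node reports two local centers, and the second pass merges the four candidates into two global centers, so the following stage would start with k equal to 2.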

C. CLUSTERING ANALYSIS
The cluster analysis phase of the PR-HD algorithm first obtains the cluster centers output by the previous stage and then runs the K-means algorithm to obtain the final clusters. Finally, we analyze the vocabulary with higher weight in each cluster to get the hot information of the comments.
The K-means process first takes the central vectors obtained in the previous stage as the k initial centers, and then traverses the comment vectors, assigning each vector to the cluster whose center is closest to it. The cosine distance $sim$ is used to calculate the similarity between the vectors.
In the Map phase of the K-means algorithm, the main task is to read the comment vectors in the local set $D_i$ one by one and to determine which center point center[i] (the initial center point set is List < Canopy >) is closest to each of them. Each vector is assigned to the cluster corresponding to its nearest center vector. In the output < key1, value1 >, key1 is the id of the cluster and value1 is the corresponding vector. Each Map task is described in Algorithm 3. The Reduce phase receives and aggregates the output of every Map task and recalculates the new center vector of each cluster from the vectors with the same cluster id; these centers are the input of the next Map phase. The output of the Reduce phase is < key1, value1 >, where key1 is the id of the cluster and value1 is the new center vector. The Reduce phase is described in Algorithm 4.
The output of the Reduce phase is fed back as input to the Map phase, and the algorithm iterates until a predetermined number of iterations is reached or the distance between the new center vector and the original center vector is less than a certain threshold.
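The Map, Reduce and iteration logic described in this subsection can be simulated in a single Python process as follows. The sketch assumes dense, non-zero vectors, takes the new center as the component-wise mean of a cluster, and measures center movement as one minus the cosine similarity; the function names and the tolerance `tol` are hypothetical, and a real job would run the two functions as Hadoop tasks.

```python
import math
from collections import defaultdict


def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def map_task(local_vectors, centers):
    """Emit (cluster_id, vector): each vector goes to its most similar center."""
    for vec in local_vectors:
        cluster_id = max(range(len(centers)),
                         key=lambda i: cosine_sim(vec, centers[i]))
        yield cluster_id, vec


def reduce_task(pairs):
    """Group vectors by cluster id and emit {cluster_id: new center}."""
    groups = defaultdict(list)
    for cluster_id, vec in pairs:
        groups[cluster_id].append(vec)
    return {cid: tuple(sum(col) / len(vecs) for col in zip(*vecs))
            for cid, vecs in groups.items()}


def run_kmeans(partitions, centers, max_iter=20, tol=1e-6):
    for _ in range(max_iter):
        pairs = [pair for part in partitions
                 for pair in map_task(part, centers)]
        updated = reduce_task(pairs)
        # A cluster that received no vectors keeps its previous center.
        new_centers = [updated.get(i, centers[i])
                       for i in range(len(centers))]
        shift = max(1.0 - cosine_sim(old, new)
                    for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centers stopped moving
            break
    return centers


partitions = [[(1.0, 0.0), (0.8, 0.2)], [(0.0, 1.0), (0.2, 0.8)]]
final_centers = run_kmeans(partitions, centers=[(1.0, 0.0), (0.0, 1.0)])
```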

Algorithm 4 Reduce Phase of Clustering Analysis
Input: cluster_id, List<vector> C
Output: < Cluster_id, New center comment vector >
We take the result of the last iteration, which contains the cluster ids and the vectors in each cluster, and extract the word with the highest weight in each cluster to obtain the key information of that cluster. By analyzing the key vocabulary of all clusters, we obtain the hotspots of the product's comments, thus achieving the purpose of obtaining commodity evaluation hotspots.
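One possible way to read the key vocabulary off the final clusters is sketched below. It assumes that each review vector is a dictionary from word to TF-IDF weight and that the weight of a word within a cluster is the sum of its weights over the cluster's reviews; the text above does not fix this aggregation, so it is an assumption, as are the name `hot_words` and the sample data.

```python
from collections import defaultdict


def hot_words(clusters, top_n=3):
    """clusters: {cluster_id: [{word: weight}, ...]} -> {cluster_id: [words]}."""
    result = {}
    for cluster_id, vectors in clusters.items():
        totals = defaultdict(float)
        for vec in vectors:
            for word, weight in vec.items():
                totals[word] += weight  # assumption: sum weights per cluster
        # Highest total weight first; ties broken alphabetically.
        ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
        result[cluster_id] = [word for word, _ in ranked[:top_n]]
    return result


# Hypothetical clustering result with two clusters.
clusters = {
    0: [{"screen": 0.9, "bright": 0.4}, {"screen": 0.7, "clear": 0.5}],
    1: [{"battery": 0.8, "life": 0.6}, {"battery": 0.5}],
}
hotspots = hot_words(clusters, top_n=2)
```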

V. EXPERIMENTAL RESULTS AND ANALYSIS
In order to test the effect of the PR-HD algorithm on commodity hotspot discovery, we evaluate the algorithm through experiments in this section. The experiments use a Hadoop cluster composed of 5 computer nodes. Based on a mobile phone review data set crawled from the web, the accuracy and scalability of the algorithm are evaluated through corresponding indicators.

A. THE DESIGN OF EXPERIMENTS
1) EXPERIMENTAL ENVIRONMENT
The experiment uses a Hadoop cluster consisting of 5 identically configured nodes. Each node has an Intel(R) Core(TM) i7-7700 CPU with 4 cores at 3.6 GHz and 16 GB of memory, and runs Ubuntu 16.04, Hadoop 2.2.0 and JDK 1.8.0.

2) EXPERIMENTAL DATA SET
The experimental data comes from real comment data published on the Internet: the reviews of a mobile phone on the Jingdong e-commerce platform (www.jd.com), collected by a crawler. The scale of the experimental data is about 96,000, and the reviews mainly describe users' real impressions of all aspects of the mobile phone. The comment hotspots mainly cover 9 aspects: screen, appearance, configuration, battery life, camera, system, logistics, call quality and after-sales. The reviews are written in Chinese and are not manually annotated. The specific details of the experimental data set are shown in Table 1.
In order to test the feasibility of the algorithm, 60 reviews were randomly selected for each of the 9 aspects, and the resulting 540 reviews were manually labeled for accuracy evaluation. Each review in this new data set consists of 3 parts: id, content and the category to which the review belongs.

B. EVALUATION CRITERION
Whether an algorithm can effectively solve a problem is usually judged by its accuracy. In order to check whether the PR-HD algorithm can successfully extract the hotspots of different aspects, we use the accuracy rate as the evaluation indicator of the algorithm. We verify the accuracy of the PR-HD algorithm by judging whether the algorithm places two vectors of the same class into the same cluster. First, we construct the confusion matrix shown in Table 2 according to the relationship between pairs of vectors:
• TP is the number of same-class vector pairs that are correctly clustered into the same cluster;
• FP is the number of different-class vector pairs that are incorrectly clustered into the same cluster;
• FN is the number of same-class vector pairs that are incorrectly clustered into different clusters;
• TN is the number of different-class vector pairs that are correctly clustered into different clusters.
According to the four values of the above confusion matrix, the accuracy of the algorithm can be calculated. We assume that there are a total of $n$ comments, so there are a total of $C_n^2$ comment vector pairs, and the elements of the confusion matrix satisfy the relationship shown in Equation 4:
$$TP + FP + FN + TN = C_n^2 = \frac{n(n-1)}{2} \tag{4}$$
Among all same-class comment vector pairs, the proportion of correctly clustered pairs is called the positive accuracy rate (PA). It can be obtained by Equation 5:
$$PA = \frac{TP}{TP + FN} \tag{5}$$
Among all different-class comment vector pairs, the proportion of correctly clustered pairs is called the negative accuracy rate (NA). It can be obtained by Equation 6:
$$NA = \frac{TN}{TN + FP} \tag{6}$$
Combining the positive accuracy rate and the negative accuracy rate, we use the average accuracy rate (AA) as the final evaluation of the algorithm. Its value can be obtained by Equation 7:
$$AA = \frac{PA + NA}{2} \tag{7}$$
The average accuracy rate (AA) represents the overall accuracy of the vector clustering. The higher the value of AA, the better the clustering effect. In general, we want the average accuracy rate (AA) of the algorithm to be as high as possible.
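The pairwise evaluation can be made concrete with a short Python sketch. It follows the verbal definitions above: PA is the share of same-class pairs placed in the same cluster, NA is the share of different-class pairs placed in different clusters, and AA is taken as their arithmetic mean. The function names and the four-review toy example are hypothetical.

```python
from itertools import combinations


def pair_confusion(labels, clusters):
    """Count TP, FP, FN, TN over all unordered pairs of reviews."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_class = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            tp += 1
        elif same_cluster:   # different class, same cluster
            fp += 1
        elif same_class:     # same class, different cluster
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy_rates(tp, fp, fn, tn):
    """Return (PA, NA, AA) from the four pair counts."""
    pa = tp / (tp + fn)
    na = tn / (tn + fp)
    return pa, na, (pa + na) / 2


# Toy example: four reviews, two true classes, one review misplaced.
counts = pair_confusion(["a", "a", "b", "b"], [0, 0, 0, 1])
pa, na, aa = accuracy_rates(*counts)
```

In the toy example the four counts add up to 6, the number of pairs that can be formed from four reviews.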
Speedup is usually used to measure the parallel performance of an algorithm. Let $T_s$ be the time required for one processor to complete the algorithm and $T_p$ the time required for $p$ processors to complete it; then the speedup $S$ is obtained by Equation 8:
$$S = \frac{T_s}{T_p} \tag{8}$$
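As a small worked illustration of the speedup definition, the following sketch divides the single-node running time by the running time on each node count. The timings are made-up values chosen only to show the arithmetic; they are not measurements from this paper, and the name `speedups` is hypothetical.

```python
def speedups(run_times):
    """run_times: {node_count: seconds}, including the 1-node baseline."""
    t_s = run_times[1]
    return {p: t_s / t_p for p, t_p in sorted(run_times.items())}


# Made-up timings in seconds, for illustration only.
illustrative_times = {1: 600.0, 2: 330.0, 4: 180.0}
result = speedups(illustrative_times)
```

A speedup close to the node count indicates near-linear scaling, while communication and scheduling overhead usually keep the value below it.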

C. RESULTS ANALYSIS
When we use the PR-HD algorithm for hot spot discovery on product reviews, we need to choose the thresholds $T_1$ and $T_2$ mentioned in Section IV-B. Their values depend on the actual product data, so they need to be determined at the beginning of the experiment. Because the cosine distance is used to calculate text similarity, the thresholds are selected between 0 and 1 ($T_1 > T_2$). Multiple sets of tests with commonly used values were performed on the experimental data set described in Section V-A2, and the results are shown in Table 3. It can be seen that setting $T_1$ to 0.7 and $T_2$ to 0.5 is more in line with the actual situation of the review data set. Therefore, these thresholds are used as the parameters for the subsequent experiments in this article.

1) ACCURACY ANALYSIS
First, we use the data set described in Section V-A2 to test the accuracy of the PR-HD algorithm. Following the threshold adjustment result above, we set $T_1$ to 0.7 and $T_2$ to 0.5, which gives experimental results in line with the actual situation and with the expected outcome of finding 9 hotspots. The confusion matrix constructed from the clustering results is shown in Table 4. The 540 comments can be combined into 145530 pairs of vectors. Among them, the value of TP is 13843, the value of FP is 2138, the value of FN is 2087, and the value of TN is 127462. We substitute these values into Equations 5, 6, and 7 to get the accuracy of the algorithm; the results are shown in Table 5. The positive accuracy rate is 86.8%, and the negative accuracy rate is 98.3%. Combining the two, the average accuracy rate is 92.6%. It can be concluded that the PR-HD algorithm can accurately find the key information in product reviews.
In order to further verify the feasibility of the PR-HD algorithm, we choose three MapReduce-based algorithms from references [22], [23] and [24] for comparison. Reference [22] improved the VSM model and designed a parallel fuzzy c-means algorithm for hot microblogging topic discovery (HTD-PFCM). Reference [23] proposed a novel K-means algorithm (DPMCSKM) for text clustering based on selecting the initial clustering centroids at density peaks. In reference [24], Sinha and Jana proposed a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using the Mahalanobis distance with the k-means clustering algorithm. Unlike the PR-HD algorithm, which relies on Canopy to quickly find the hot spots of comments, these three algorithms cannot determine the number of hot spots in the review data set by themselves. Reference [22] clusters the review text given the number of review centers and the membership. References [23] and [24] improve the quality of comment mining by optimizing the selection of comment centers during text clustering. The confusion matrices and accuracy of these algorithms are shown in Tables 6 and 7. For a more intuitive comparison, the same data is shown in Figure 1.
Figure 1 shows the performance comparison of all algorithms. Overall, the average accuracy of the PR-HD algorithm is 92.6%, which is the highest among the algorithms compared. The average accuracy of the HTD-PFCM algorithm is about 4.4% lower than that of the PR-HD algorithm. This is because the PR-HD algorithm uses the Canopy algorithm for pre-clustering and selects suitable review hot spot centers from the entire review data set. In contrast, the HTD-PFCM algorithm is greatly affected by the initial clustering centers and easily falls into a local optimum, and selecting the center points in product review data is extremely difficult. The algorithm of Sinha and Jana [24] combines a genetic algorithm (GA) with the Mahalanobis distance and considers the covariance between the data points, thus providing a better representation of the initial data. The DPMCSKM algorithm uses density peaks to avoid the blind selection of the center points of the review data set. Therefore, the average accuracy of the DPMCSKM algorithm and of Sinha and Jana's algorithm is about 1.4% higher than that of the HTD-PFCM algorithm, but both are about 3% lower than that of the PR-HD algorithm.
In summary, the PR-HD algorithm is superior to other algorithms in accuracy, so it is more suitable for hot spot discovery of product review data.

2) SPEEDUP ANALYSIS
In order to measure the parallel performance of the PR-HD algorithm, we selected comment data sets of 3 different scales and completed operations such as filtering and word segmentation in advance. The resulting formatted data sizes are 362.8 MB, 603.4 MB, and 1.38 GB. We measured the running time of these data sets on 1, 2, 3, 4 and 5 computers. The results are shown in Table 8. From these running times, the speedup for each data scale and node count can be calculated by Equation 8, as shown in Table 9. We can draw the following conclusions from Table 9. When the data size remains unchanged, increasing the number of nodes in the cluster increases the overall performance of the cluster. When the number of nodes is unchanged, the speedup increases as the size of the comment data set increases. The speedup curves are shown in Figure 2. We can draw the following conclusions from Figure 2. As the number of nodes increases, the speedup of large-scale data sets grows faster than that of small-scale data sets. As the data size increases, the speedup curve becomes more linear. The experimental results show that the PR-HD algorithm can effectively improve execution efficiency when dealing with large-scale data. Therefore, it can be concluded that the PR-HD algorithm can meet the higher demand brought by massive data sets.

VI. CONCLUSION
This paper proposes a MapReduce-based product review hot spot discovery algorithm, the PR-HD algorithm. The algorithm combines text clustering with the MapReduce distributed computing framework and aims to mine the value of commodity reviews in depth, across many aspects, and in parallel. We tested the PR-HD algorithm on a multi-node cluster. The experimental results show that the PR-HD algorithm has higher accuracy and can better extract the hotspots of commodity reviews to obtain feedback on different aspects of products. At the same time, the PR-HD algorithm is able to process large-scale data, and its speedup improves significantly as the number of nodes in the cluster increases, which makes it suitable for mining large-scale data.
In addition, the PR-HD algorithm avoids the need to set the number of hotspots manually, but it introduces thresholds. In some cases, the threshold values become a new factor influencing the outcome. The next step is to design an algorithm that selects appropriate thresholds automatically, which can further reduce the interference of human factors and improve the accuracy of hot spot discovery.