Product Pre-Launch Prediction From Resilient Distributed e-WOM Data

Pre-launch success prediction of a product is a challenge in today’s electronic world. Based on this prediction, industries can avoid huge losses by deciding whether or not to launch a product into the market. We have implemented a Multithreaded Hash join Resilient Distributed Dataset (MHRDD) with a prediction classifier for pre-launch prediction. MHRDD helps to remove the redundancy in the input dataset and improves the performance of the prediction model. The large volume of e-Word of Mouth (e-WOM) data, such as product reviews, comments and ratings available on the internet, can be used for pre-launch product prediction. In MHRDD, a distance similarity score is used to identify features. To remove duplicates, a hash key and join operations are used to create a hash table of significant features. With in-memory computations and hashing on the join operations, this model reduces redundancy of data. This model is scalable and can handle large datasets with good prediction accuracy. This paper presents a novel big data processing method that predicts product success before its launch in the market. The proposed method helps to identify features that are significant for the product to be successful. Based on the pre-launch prediction, companies can reduce cost, effort and time with improved product success.


I. INTRODUCTION
In social networking sites and other e-commerce applications, product reviews are available in huge volumes. To a certain extent, product sales depend on customer reviews. The life of a product depends on its quality, whereas online product reviews help to improve the product quality [1]. Online product reviews help in identifying the key changes needed for the product. Based on customer suggestions, the manufacturer can design a good quality product. Word of mouth (WOM) is an information transfer method that gives customers direct opinions about the product. Similarly, online customer reviews from users are another source of information about a product [7]. Customer feedback, criticism, comments, suggestions, reviews, ratings, opinions, etc., which are available online as huge data, are treated as Electronic Word of Mouth (e-WOM). Usually, word of mouth refers to customer feedback on products or services in direct marketing. After purchasing a product online, customers share its pros and cons to help others. There is no one-to-one physical communication with customers. E-WOM is a rich source of information for manufacturers to launch new products or improved versions of their products. However, techniques to extract useful decision-making information from this kind of data are rare. This is challenging as the data is unstructured, redundant and huge in volume [37], [44].
The associate editor coordinating the review of this manuscript and approving it for publication was Nilanjan Dey.
Success of a product depends on product reviews. Online shopping sites give several benefits to customers [7]. Online marketing methods can acquire a large number of customer suggestions in the form of product reviews and descriptive information about the product without any marginal cost [25]. Companies believe that almost all sites should provide effective content about the products to build loyalty [2], [5]. Poor quality products potentially affect the goodwill of industries. In the design stage itself, we can include quality features to improve the success of the product. Users should provide true reviews of the products they use. Customers may give duplicate product reviews, which are redundant [3], [4]. These redundant reviews are handled in our proposed system. Usually, unauthorized or biased users can give duplicate reviews, which can weaken the product sale [9]. Thus, to promote manufacturing quality, the proposed model for successful product launch prediction can be adopted. In online marketing strategies, mainly three attributes of product reviews are examined: volume, strength, and disbandment [9]. A large volume of product reviews on shopping sites, blogs and forums creates awareness among users about the product. The rating of the products and the ratio of positive to negative opinions about the product are considered as the strength. Higher strength implies better quality [34], [36]. Studies now exist on e-WOM product ratings as a revenue-forecasting tool for products such as television shows, movies, books and other products [8], [9]. The disbandment of communication measures how fast these customer suggestions extend over communities [10]. Thus, scalable infrastructural models are used to handle this type of big data [16], [19]. Compared to traditional database management systems, the challenges of big data [28] are more complex [20], [21], [22].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Online media have changed the way people express themselves and interact with others. Anyone can post reviews of products on online shopping sites (e.g., Flipkart, Amazon, Club-Factory, Snapdeal) and express their views about the pros and cons of a product. Such user-generated content on online sites provides useful information that can be exploited for different applications. Several works exist on extracting positive and negative reviews using natural language processing techniques and on spam review recognition [11]-[13], [15]. These works do not provide any method for feature extraction from these reviews, which would help to build quality products. Since there is no quality control over reviews, anyone can write on the web, which results in many duplicate and spam reviews [23], [30].
In this paper, we focus on e-WOM customer suggestions about the product, which contain users' views on the product and are useful to both prospective users and product manufacturers. The problems of data redundancy and significant feature extraction in the customer review dataset are handled to enhance the quality of dataset formation and cleaning. This can be helpful for a successful new product launch. The raw input datasets used for analysis contain duplicate reviews for products that are variants of the same product. This duplication arises from varied sizes, colors and materials for clothing items, Blu-ray and DVD versions for movies, the color of the casing for mobile phones, etc. A customer chooses a specific variant of the product, e.g., "red case of iphone7S", but the review and rating also appear on the web pages of all the other variants of the product. Since the e-WOM data are acquired by crawling, all variants of the same product carry the same reviews and ratings, leading to redundant reviews [45].
In this paper, problems such as data redundancy, scalability and prediction accuracy for huge datasets are handled. With this, a scalable model to predict the success or failure of a new product prior to its launch in the market is implemented. Thus, it contributes to manufacturing the product with the desired quality. The model is implemented using the Multithreaded Hash-join Resilient Distributed Dataset (MHRDD) method with machine learning prediction methods. This enhances the performance of the prediction model by removing the redundancy in the input dataset. The result analysis shows that the proposed model is more effective. The proposed product pre-launch prediction helps to design a good quality product, which can be helpful for manufacturers as well as consumers.
This paper is organized as follows. Section 2 reviews the literature. Section 3 describes our methodology. The result analysis is discussed in Section 4 and the conclusion is provided in Section 5.

II. RELATED WORKS
The state-of-the-art techniques for duplicate data removal and data cleaning are mainly dependency-based [51], [52], [54] and adaptive window-based [53]. The data cleaning method implemented by Bertossi et al. [51] describes matching dependencies. A single matching dependency on tuples produces a collection of clean instances for a particular pair of bad tuples. However, this method raises the computational complexity of query answering and also increases the space requirement.
Another approach is the Sorted Neighborhood Method [50], which works with respect to a key. This method sorts the data by the key and then passes a window across the data, matching the records that fall within the same window. The disadvantage of this method is the fixed window size. With a small window, redundant data is missed, and with a large window, unnecessary comparisons occur.
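The window trade-off of the Sorted Neighborhood Method can be illustrated with a minimal Python sketch. It is not the implementation from [50]; the function name, the sort key, the matching rule and the sample reviews are all assumptions made for illustration.

```python
def sorted_neighborhood(records, key, window, is_match):
    """Sort records by `key`, then compare each record only with the
    next (window - 1) records in sorted order."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if is_match(rec, other):
                pairs.append((rec, other))
    return pairs


# Hypothetical review snippets; the last one has a trailing space.
reviews = ["Good battery", "good battery", "good camera", "Good Battery "]

def sort_key(r):
    return r.lower()

def same_text(a, b):
    return a.strip().lower() == b.strip().lower()

# A window of 2 compares only adjacent records and misses one duplicate pair;
# a window of 3 finds it at the cost of extra comparisons.
pairs_small = sorted_neighborhood(reviews, sort_key, 2, same_text)
pairs_large = sorted_neighborhood(reviews, sort_key, 3, same_text)
```

With these four records, the smaller window reports fewer duplicate pairs than the larger one, which is the fixed-window weakness noted above.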
Raymond et al. [18] proposed a technique for detecting untrusted (spam) reviews using a text mining model combined with a semantic language model. Non-review spam detection is done by applying different SVM classifiers [29] for different analysis methods. The results obtained from this semantic language modelling and batch-style text processing computational model are effective for the detection of untrusted customer reviews. The approach is also effective if spammers exercise confusion strategies in customer reviews [18], [32]. Gutierrez et al. [54] suggested an automated interpretation of complex text data for a crisis event. An ensemble learning model, the random forest, is trained for data pre-processing and feature extraction. Like other ensemble approaches, the random forest is weak in interpretation and prediction. The random forest method also uses more memory and has a considerably longer execution time. Gutierrez et al. [31] illustrate a linear predictor model that preserves a subset of the essential features based on their association with the required output values. The model modifies the initial dataset, exposing its significant variable features. The authors considered nearly 10,000 features in the experiment. The limitation of the model is that it is sensitive to data redundancy and outliers.
The proposed MHRDD approach makes use of in-memory computation with distributed computing, which makes application execution faster. As data are added or updated, the proposed multithreaded hash join can optimize for multiple queries over that massive data. Data redundancy can be eliminated to improve the prediction accuracy of the application. Significant features are identified, which helps to improve the reliability of the predictive model.
Recently, Wang et al. [33] proposed the Online Group Feature Selection algorithm, in which data instances are added to the application sequentially. Another online feature selection method assumes that the total number of data elements is fixed while the number of features changes over time.
Perkins et al. [42] proposed different stages of gradient descent approach with the grafting algorithm (Graft-GD). This grafting technique treats the feature selection methods as essential for predictor learning in a framework. The model works with the iterative programming method. The gradient descent model trains the predictor model and builds up the feature set.
Leung et al. [27] discussed an Alpha investing method, which appends features to a prediction model. The system builds on a dynamically generated candidate feature set. This Alpha investing technique requires prior knowledge of the original input feature set, and the model never evaluates the duplicates among the selected features. The major disadvantages of the above feature selection methods are high computational cost and large dimensionality. Liu et al. [43] suggested a model for movie review summarization and rating. A Latent Semantic Analysis (LSA-based) method is used to extract variable features of the product and summarize the product based on various features. The limitation of the LSA-based approach is that it cannot be realized efficiently; hence, it is hard to index based on specific dimensions. This decreases the prediction accuracy on unstructured, massive datasets.
Based on grouping semantics, Balasundaram and Vengadeswaran [17] suggested an optimal data placement approach, which can reduce query execution time and query latency. This approach is implemented using the Hadoop framework with map-reduce processing. Hadoop's data placement approach assigns the data chunks arbitrarily over the group of nodes without examining execution parameters. In the proposed MHRDD, multiple map tasks with in-memory computation are achieved by the resilient distributed dataset (RDD). For massive iterative processing applications, in-memory computation makes the proposed approach faster than map-reduce data processing.
Pre-processing of the massive unstructured dataset plays a crucial role in predictive analytics applications. Manohara et al. [58] implemented feature extraction using dimensionality reduction for processing a large financial dataset. The proposed MHRDD method with proper data pre-processing improves the prediction accuracy of the model. The proposed method eliminates unwanted data, and significant features are extracted in a distributed and reliable manner with fast in-memory computation. Chen et al. [57] described a tight center loss function approach for iris image feature extraction from a large dataset using TensorFlow. The time taken for processing the application is large compared to the proposed distributed in-memory computation. The proposed method overcomes the limitation of variable-length sequence processing in the symbolic looping approach.
In this paper, data pre-processing problems such as data redundancy and significant feature identification are handled properly. Scalability and prediction accuracy for huge datasets are also handled in a better manner. A scalable model to predict the success or failure of a new product before its launch in the industry is developed. The related works show that the proposed model is more effective compared to the state-of-the-art techniques.

III. PROPOSED METHODOLOGY
In this proposed work, the success or failure of a product is predicted before its launch with a distance similarity score and the Multithreaded Hash-join Resilient Distributed Dataset (MHRDD) method using appropriate prediction classifiers. Fig. 1 shows the functional block diagram for the proposed product pre-launch prediction. The system consists of dataset aggregation, data pre-processing (i.e., construction of the duplicate data removal method and feature identification), building the prediction classifier, and testing. In this system, the dataset collection phase learns different product features from customer reviews, product ratings and product sales details.
The intended purpose is to acquire better knowledge of the input dataset for good prediction accuracy. One of the important steps is pre-processing of the dataset. In this phase, duplicate customer reviews are eliminated, the best features are identified, and missing and irrelevant data are handled. The resilient distributed dataset of the Spark framework is adopted to handle this large dataset in a scalable and fault-tolerant manner. In the final phase, prediction accuracy is tested and compared using different classifiers.
A. DATASET AGGREGATION
E-commerce sites (like Amazon, Flipkart, Club-Factory, Snapdeal, etc.) provide customer reviews and ratings. Several datasets are available as public datasets [5], [6]; the proposed methodology can be used for different datasets. The products considered in this work are seven brands of mobile phones. The dataset we have considered contains reviews for periods of 3, 6, 12, 18 and 24 months, along with ratings and product sales details [5], [6]. Table 1 shows a sample customer review and ratings. Fig. 2 shows the aggregate revenue generated in one year by the sales of the respective mobile brands.
The e-WOM dataset consists of the pros and cons of the product. The software and hardware descriptions of the product are taken into consideration for the design of the product. Hence, the hardware and software components of the product are considered as product features. In the raw input dataset, fifty-five variable features exist. Identifying significant features is essential for this model to improve system performance. A distance similarity score approach has been implemented to identify the significant customer review features. Features having a distance similarity score greater than the threshold (0.3 in this work) are considered for further processing.
Previous product sales, in crore, of the different mobile brands over 12 months are shown in Fig. 2. The product sales feature, along with the customer reviews, plays an important role in better prediction.

B. DATASET PRE-PROCESSING
Data pre-processing uses the resilient distributed dataset of the Spark framework. Significant features of the input dataset are identified using the distance similarity score in the feature identification phase. Using the significant features, duplicates and irrelevant data are removed by applying the multithreaded hash join resilient distribution method.

1) DATASET PROCESSED IN RESILIENT DISTRIBUTION
A resilient distributed dataset (RDD) is a partitioned collection of data. RDDs are built and processed using functions called transformations and actions. An RDD is written only through deterministic transformations. Hence, the resilient distribution restricts the volume of writes, but it permits fault tolerance. The lineage property of an RDD helps to recover lost information, which avoids checkpoint overhead. If a failure occurs in any partition of an RDD, that partition can be recomputed in parallel on different nodes without having to restart the complete program [26], [35].
The main transformations of RDD are:
• map(x=>y): returns an RDD in which each element is processed independently.
• filter(x=> TRUE): returns an RDD containing only the elements for which the function returns true.
• flatMap(x=>Range(x,y)): each element is mapped to zero or more output elements.
• join(x[]): an equi-join operation is performed on the key elements.
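The semantics of these four transformations can be sketched on ordinary Python lists. This is an illustrative stand-in, not Spark code: the helper names (`rdd_map`, `rdd_filter`, `rdd_flat_map`, `rdd_join`) and the sample data are assumptions, and a real RDD would evaluate the same operations lazily and in parallel across partitions.

```python
from collections import defaultdict


def rdd_map(data, f):
    # One output element per input element.
    return [f(x) for x in data]


def rdd_filter(data, pred):
    # Keep only the elements for which the predicate is true.
    return [x for x in data if pred(x)]


def rdd_flat_map(data, f):
    # Each input element yields zero or more output elements.
    return [y for x in data for y in f(x)]


def rdd_join(left, right):
    # Equi-join of (key, value) pairs on the key.
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [(k, (v, w)) for k, v in left for w in index.get(k, [])]


# Hypothetical data: (customer id, rating) and (customer id, review text).
ratings = [("c1", 5), ("c2", 2)]
reviews = [("c1", "battery good"), ("c3", "slow")]

doubled = rdd_map([1, 2, 3], lambda x: x * 2)
kept = rdd_filter([1, 2, 3], lambda x: x > 1)
expanded = rdd_flat_map([1, 2, 3], lambda x: range(x))
joined = rdd_join(ratings, reviews)
```

Only customer `c1` appears in both inputs, so the join keeps a single pair, which is the behaviour the hash join phase relies on.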
An RDD framework consists of a master program, a cluster manager and several slave nodes as workers. The workers consist of memory nodes and data nodes. The applications run as independent processes on each worker node, coordinated by the collect function in the main program. To run on a cluster, the collect function can connect to several types of partitions, which allocate resources across applications. Once connected, the framework acquires memory nodes in the cluster, which are processes that run computations and store data for the application. Next, the partition sends the application code to the data nodes. Finally, the collect program sends data for the memory nodes to run [35].

2) FEATURE IDENTIFICATION
In data cleaning, feature identification plays a major role. In the customer review dataset of mobile phones, a large number of features exist. Identifying significant features is important for this model to improve the prediction accuracy. This stage is subdivided into product feature identification, opinion identification of the product, and feature weighting based on opinion rank. In product feature identification, significant features are identified using a distance similarity measure. In opinion identification of the product, the polarity of the customer review is measured. The feature weight is calculated based on the opinion rank. A distance similarity score ratio has been implemented to identify significant features. A feature whose distance similarity score is greater than 0.3 is considered significant; in this work the threshold is kept at 0.3.
Let $R_i$ and $R_j$ be the reviews of customer $C_i$. For each customer $C_i$, the feature count of review $R_i$ is denoted as $f_{c_i}(R_i)$, the feature count of review $R_j$ is denoted as $f_{c_i}(R_j)$, and $N$ is the total number of features. The feature ratio $\delta$ for each customer $C_i$ is computed from these feature counts, with one expression for the case where the feature count of $R_i$ is less than that of $R_j$ and another for the case where the feature count of $R_j$ is less than that of $R_i$. If the feature count of $R_i$ equals that of $R_j$, the ratio is taken as 1. Using the feature ratio for each customer $C_i$, the distance similarity score is computed to identify the customer features. The customer review opinion has to be identified in order to find the polarity of the review. Hence, to identify the review opinion, we parse the sentence using MINIPAR [41]. SentiWordNet [14] is used to classify the identified opinions from the polarity of individual reviews. For each review, the opinion sentences are examined and mapped into the positive or negative class based on the polarity value of the associated opinions. The overall weight of a feature is calculated as the difference between the positive and negative weights of the opinion word. Let $W_p$ be the weight of the positive opinion of the feature, i.e., the positive polarity value of the opinion word multiplied by the number of sentences in which the opinion word occurs, and let $W_m$ be the weight of the negative opinion of the feature, i.e., the negative polarity value of the word multiplied by the number of sentences in which the opinion word occurs. Let $z$ be the number of features in a review comment.
As an example, for the feature battery life, the $W_p$ value is 0.871 and the $W_m$ value is 0.214 from the review content. Then the overall weight is $O_w = 0.871 - 0.214 = 0.657$. The significance of the feature is identified using the distance similarity score, which is computed from the overall weight of the feature and the feature ratio. The distance similarity score is denoted $\partial_f$. Take the review content shown in Table 1, "Battery is too good", as an example. Consider 90 reviews, of which 55 include the feature 'Battery'. The distance similarity score with respect to the identified feature battery is $(0.657 \times 55)/90 \approx 0.401$. The similarity score lies in $[0, 1]$. Features with a score less than 0.3 are neglected as less significant. Table 2 shows the significant features identified from the e-WOM dataset, which are used for further prediction.
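The arithmetic of the battery example can be reproduced with a short Python sketch. It only mirrors the worked numbers in the text (overall weight as the difference of the two opinion weights, then scaling by the share of reviews that mention the feature); the function names are hypothetical and the general formula of the paper may contain terms that are not visible in this example.

```python
THRESHOLD = 0.3  # significance threshold stated in the text


def overall_weight(w_pos, w_neg):
    # Overall weight O_w: positive opinion weight minus negative opinion weight.
    return w_pos - w_neg


def distance_similarity(o_w, reviews_with_feature, total_reviews):
    # Score used in the worked example: O_w scaled by the fraction of
    # reviews that mention the feature (an assumption drawn from the example).
    return o_w * reviews_with_feature / total_reviews


o_w = overall_weight(0.871, 0.214)          # battery life example
score = distance_similarity(o_w, 55, 90)    # 55 of 90 reviews mention 'Battery'
is_significant = score > THRESHOLD
```

Here `score` evaluates to roughly 0.4015, which the text reports as 0.401, so the battery feature passes the 0.3 threshold.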

3) HASH JOIN RESILIENT DISTRIBUTION
Hash join works with in-memory computation using RDD. Duplicates in the customer reviews are removed using a hash phase followed by a join phase. We hash the two relations into partitions on disk using a hash function, and later join them to get the final result. Customers are partitioned into two divisions, $K_i$ and $P_i$, based on the review features and opinions. Then, we match the $K_i$ partition with the $P_i$ partition. In the join phase, we build a hash table to perform the joining. The hash table is implemented using RDD. The hashing function works on different worker nodes in parallel, and a duplicate entry is recognized without being written to the same bucket.
Hash key generation: Let i = 1, 2, 3, ..., k represent the customer indices, j = 1, 2, 3, ..., n the review indices and m = 1, 2, 3, ..., t the feature indices.
• N denotes the total number of customer reviews, and X denotes the significant features.
• The j-th value of a particular subset selected by the customer is denoted by C j x δ m .
• The hash value, denoted by $h_v$, is taken as the L1 norm of the feature vector.
• As shown in Table 3, the hash function key defined on this value maps the features from each review with unique customer to the integers corresponding to the index of hash
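The hash phase and join phase can be sketched in plain Python, with threads standing in for Spark worker nodes. This is a minimal illustration under stated assumptions, not the MHRDD implementation: the key layout (customer id paired with the rounded L1 norm of the feature vector), the function names and the sample partitions are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def hash_key(customer_id, feature_vector):
    # Assumed key: customer id plus the L1 norm of the feature vector.
    # Different vectors with the same L1 norm would collide under this key.
    l1_norm = sum(abs(v) for v in feature_vector)
    return (customer_id, round(l1_norm, 6))


def hash_partition(partition):
    # Hash phase: build a per-partition hash table, keeping the first
    # review seen for each key.
    table = {}
    for customer_id, vector, text in partition:
        table.setdefault(hash_key(customer_id, vector), (customer_id, vector, text))
    return table


def remove_duplicates(partitions, workers=4):
    # Partitions are hashed in parallel, then joined on the hash key.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tables = list(pool.map(hash_partition, partitions))
    merged = {}
    for table in tables:
        for key, row in table.items():
            merged.setdefault(key, row)  # duplicate keys are dropped here
    return list(merged.values())


# Hypothetical crawl output: the same review appears on two variant pages.
partitions = [
    [("c1", [0.4, 0.2], "Battery is too good"), ("c2", [0.1], "Slow")],
    [("c1", [0.4, 0.2], "Battery is too good"), ("c3", [0.3], "Nice camera")],
]
unique_reviews = remove_duplicates(partitions)
```

The repeated review from customer `c1` maps to the same key in both partitions, so only one copy survives the join.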

C. PRE-LAUNCH PREDICTION WITH DIFFERENT CLASSIFIERS
The next step in our approach is to build prediction classifiers. The classifiers used are Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT) and XGBoost.

1) SUPPORT VECTOR MACHINE (SVM)
SVM is a supervised machine learning technique. The aim of SVM is to find the best separating hyperplane between the training data of two classes. This hyperplane should be as far as possible from the nearest data elements of either class. Support vectors are the data elements that lie closest to the classifier margin. To find the optimal hyperplane, the distance between the two decision boundaries (the margin) is maximized. For the pre-launch prediction of the product, the hyperplane that divides these classes has to be determined. For each product vector $r_i$, the hyperplane constraints in the standard SVM formulation, with weight vector $w$ and bias $b$, are
$$w \cdot r_i + b \geq 1 \quad \text{for } r_i \text{ to be in class } 1, \tag{7}$$
$$w \cdot r_i + b \leq -1 \quad \text{for } r_i \text{ to be in class } -1. \tag{8}$$
A product in the positive class (class 1) is considered a successful product [from equation (7)], and a product in the negative class (class −1) [from equation (8)] is considered a failure.
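As an illustration of how a trained linear SVM assigns a product to the success or failure class, the following Python sketch applies the usual sign rule on the decision score. The weight vector `w`, bias `b` and the two product vectors are made-up values, not parameters learned in this work.

```python
def svm_predict(w, b, r):
    # Decision score of a linear SVM: dot product of weights and features plus bias.
    score = sum(wi * ri for wi, ri in zip(w, r)) + b
    # Non-negative score -> class 1 (success); negative score -> class -1 (failure).
    return 1 if score >= 0 else -1


# Hypothetical parameters and product feature vectors.
w = [1.0, -1.0]
b = 0.0
likely_success = svm_predict(w, b, [2.0, 0.5])   # score  1.5
likely_failure = svm_predict(w, b, [0.5, 2.0])   # score -1.5
```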

2) NAÏVE BAYESIAN
Naïve Bayes [27] is a classification technique based on a probabilistic model. This model assumes that the features are independent of one another. The Naïve Bayes method for classification involves modeling the conditional probability distribution $P(S \mid R)$, where $S$ ranges over classes and $R$ over product reviews. In pre-launch product prediction, the success class is denoted by the value '1' and the failure class by the value '0'. By Bayes' theorem and the independence assumption, in the standard form,
$$P(s \mid R) = \frac{P(s)\prod_{i=1}^{k} P(r_i \mid s)}{P(R)},$$
where $s$ is the class instance and $R$ is the product feature vector of size $k$, $R = (r_1, r_2, r_3, \ldots, r_k)$. The classifier model is
$$\hat{s} = \arg\max_{s \in \{0, 1\}} P(s)\prod_{i=1}^{k} P(r_i \mid s).$$
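A minimal categorical Naïve Bayes classifier can be written in a few lines of Python to show how the class prior and the per-feature likelihoods combine. The sketch uses Laplace smoothing (`alpha`), which the text does not mention, and the training rows and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, alpha=1.0):
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)   # (class, position) -> value counts
    values = defaultdict(set)            # position -> distinct values seen
    for row, s in zip(rows, labels):
        for i, r in enumerate(row):
            feat_counts[(s, i)][r] += 1
            values[i].add(r)
    return class_counts, feat_counts, values, alpha


def predict_nb(model, row):
    class_counts, feat_counts, values, alpha = model
    total = sum(class_counts.values())
    best_class, best_logp = None, -math.inf
    for s, n_s in class_counts.items():
        # log P(s) + sum_i log P(r_i | s), with Laplace smoothing.
        logp = math.log(n_s / total)
        for i, r in enumerate(row):
            num = feat_counts[(s, i)][r] + alpha
            den = n_s + alpha * len(values[i])
            logp += math.log(num / den)
        if logp > best_logp:
            best_class, best_logp = s, logp
    return best_class


# Hypothetical rows: (review opinion, sales level); label 1 = success, 0 = failure.
rows = [("good", "high"), ("good", "low"), ("bad", "low"), ("bad", "high")]
labels = [1, 1, 0, 0]
model = train_nb(rows, labels)
pred_success = predict_nb(model, ("good", "high"))
pred_failure = predict_nb(model, ("bad", "low"))
```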

3) DECISION TREE
Decision Tree [39] is a machine learning algorithm in which the target class is predicted based on decision rules. These rules are generated from past data. We built a decision tree for predicting a product's status in the market before its launch. At each stage, the decision tree selects the node attribute with the highest information gain among all the product feature attributes. Here the decision tree is built on the review feature dataset using the Iterative Dichotomiser 3 (ID3) [39] method. Table 4 shows the decision tree algorithm for the pre-launch prediction. Figure 3 shows a sample decision tree. The root attribute element is taken as the product. The tree is constructed based on the random input dataset. The product attribute is categorized into features of the customer reviews and the customer ratings. These feature and rating nodes are the descendants. Depending on the features and ratings, the success or failure of the product is predicted. The product features can be grouped as good, bad or average, and the success or failure prediction depends on this grouping. The rating scale is taken as 0-5: a rating greater than 3 indicates a successful product, and a rating less than or equal to 3 indicates a failure.
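The attribute selection step of ID3 can be sketched as follows. The code computes entropy and information gain and picks the attribute with the highest gain; it is an illustration rather than the algorithm of Table 4, and the attribute names and the four sample rows are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    # Entropy of the labels minus the weighted entropy after splitting on attr.
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attrs):
    # ID3 splits on the attribute with the highest information gain.
    return max(attrs, key=lambda a: information_gain(rows, labels, a))


# Hypothetical rows: feature quality (good/bad/average) and rating band
# ("high" for ratings above 3, "low" otherwise).
rows = [
    {"quality": "good", "rating": "high"},
    {"quality": "bad", "rating": "high"},
    {"quality": "good", "rating": "low"},
    {"quality": "average", "rating": "low"},
]
labels = ["success", "success", "failure", "failure"]

gain_rating = information_gain(rows, labels, "rating")
gain_quality = information_gain(rows, labels, "quality")
root = best_attribute(rows, labels, ["quality", "rating"])
```

In this toy dataset the rating band separates the two classes completely, so it has the larger gain and would be chosen for the first split.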

4) XGBoost ALGORITHM
The XGBoost algorithm [48] is a machine learning algorithm that uses a gradient boosting framework. The training and testing dataset is used for the evaluation and also for the k-fold cross-validation model. In cross-validation, rather than a single training and testing set, 't' sets, called "folds", are built; t−1 folds are used for training and the remaining t-th fold is used for testing. This process is repeated until each fold has served as the test fold. The mean over the individual folds is taken as the final result. We used the L2 regularization of the XGBoost algorithm to obtain a generalized model [46], [47]. Each level of the progression towards the XGBoost algorithm can be viewed as a version of the pre-launch product prediction process.
• Decision Tree: Every product has a set of features such as battery life, cost, etc. A decision tree is comparable to a product success prediction based on the features of the product.
• Bagging: There is more than one product, each with a large number of review comments. Bagging involves combining reviews from different customers about all products for the final decision through significant feature extraction, data pre-processing and predictive analytics.
• Random Forest: This is a bagging algorithm in which a subset of features is picked at random.
• Boosting: This is an alternative method in which each feature contributing to the success of the prediction modifies the evaluation measures based on the features selected and the prediction classifier. This 'boosts' the performance of the predictive analysis process.
• Gradient Boosting: This is a special case of boosting in which the error rate is reduced using the gradient descent algorithm.
• XGBoost: The XGBoost algorithm is an extreme variant of the gradient boosting method. It is an aggregation of software and hardware optimization methods that generates better results using fewer computing resources in minimum time.
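The k-fold cross-validation procedure described for XGBoost above can be sketched in plain Python. The fold assignment (every t-th index) and the scoring callback are assumptions for illustration; in practice the callback would train a classifier on the training indices and return its accuracy on the test indices.

```python
def k_fold_indices(n, t):
    """Yield (train, test) index lists for t folds over n samples."""
    folds = [list(range(i, n, t)) for i in range(t)]
    for k in range(t):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def cross_validate(n, t, score_fn):
    # Train on t-1 folds, test on the remaining fold, and average the scores.
    scores = [score_fn(train, test) for train, test in k_fold_indices(n, t)]
    return sum(scores) / len(scores)


# Hypothetical run: 10 samples, 5 folds, and a dummy score function that
# simply reports the test-fold share of the data.
splits = list(k_fold_indices(10, 5))
mean_score = cross_validate(10, 5, lambda train, test: len(test) / 10)
```

Each sample lands in exactly one test fold, so every review contributes to the evaluation once.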

IV. EXPERIMENTAL SETUP
The intended system was realized using the Apache Spark framework with PySpark version 2.1.2. Amazon Web Services is used to run some components of the software system, with four Intel Xeon E5-2699V4 2.2 GHz processors with four cores and 16 GB of RAM in the Spark cluster configuration. According to the scalability requirements, the software components can be configured to run on separate servers. This model helps to predict the failure or success of a unique product in the market by analysing significant features from product customer reviews. A case study is conducted using customer reviews of 7 brands of mobile phones. Success or failure is the class variable used for training and testing. For training, 75% of the dataset is used, and the remaining 25% is used for testing the model.

V. RESULT ANALYSIS AND DISCUSSION
The prediction classifier is built and tested using the SVM, Naïve Bayes, Decision Tree and XGBoost algorithms. When the dataset is pre-processed with the MHRDD method, we get good prediction results. Further, the prediction percentage does not vary much across classifiers. This shows that data pre-processing plays an important role in prediction. Table 5 shows the prediction for seven brands of products with different classifiers using the proposed method.
Pre-launch prediction is tested for 7 brands of mobile phones, as shown in Table 5. Figure 4 shows the failure prediction of the product with Support Vector Machine, Naïve Bayes and Decision Tree. Figure 5 shows the success prediction of the product with different classifiers and different numbers of customer reviews. Compared with the other classifiers, XGBoost has better prediction accuracy. Figures 4 and 5 show that the predictions do not vary much across classifiers in the product pre-launch prediction using the e-WOM dataset. Brand 2 has the highest success prediction percentage and Brand 5 has the highest failure prediction percentage.
The reliability of the MHRDD method depends on the precision, recall and performance accuracy [35], [38] measurements. Table 6 shows a comparison of the precision, recall and accuracy measures of the MHRDD, Grafting Gradient Descent and LSA-based methods with the Support Vector Machine, Naïve Bayes, Decision Tree and XGBoost classifiers. The best results in Table 6 are obtained using MHRDD with XGBoost classification, with an accuracy of 97.9%. MHRDD outperforms the Grafting Gradient Descent and LSA-based methods in the P@R, R@R and P-Accuracy measures. Using the proposed method, false negatives (FN), true positives (TP), false positives (FP) and true negatives (TN) are found. The performance parameters prediction accuracy (P-Accuracy), precision (P@R) and recall (R@R) are computed using equations (11), (12), and (13) respectively.
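For reference, the three measures can be computed from the confusion-matrix counts with the standard definitions sketched below. The counts in the example are invented for illustration and are not results from this work; it is assumed here that equations (11)-(13) follow these standard forms.

```python
def confusion_metrics(tp, tn, fp, fn):
    # Standard definitions of accuracy, precision and recall.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Hypothetical counts for 100 test products.
p_accuracy, precision, recall = confusion_metrics(tp=45, tn=40, fp=5, fn=10)
```

With these made-up counts, accuracy is 0.85, precision is 0.9 and recall is 45/55.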
As shown in Figure 6 (a), (b), (c), (d), (e), (f), (g) and (h) a scalability comparison of MHRDD with LSA-based and Graf_GD methods has conducted. As shown in figure 6, as the dataset size of the customer reviews increases precision and recall rate decreases for the Graf_GD and LSA-based methods. As the number of months increases the dataset size also increases, proposed MHRDD shows almost constant performance for large and small dataset. The result analysis shows that MHRDD approach outperforms the other two methods in big data analysis. MHRDD method is more scalable for different sizes of datasets compared to other methods. Figure 7 shows the comparison of the time taken for execution of the MHRDD model with the state-of-the-art techniques. MHRDD method executes the application in lesser time when compared to Grafting gradient descent and latent  semantic analysis method. Result analysis shows that the proposed model is scalable and fast.
Its performance remains high when processing large datasets. This shows the applicability of MHRDD in big data analytics, whereas the processing time of the Graf_GD and LSA-based methods is larger for large volumes of data. Multithreaded programming using distributed as well as in-memory computation increases the model performance by reducing the execution time of the application. For big data analytics applications, the proposed system improves execution performance.
Figure 8 illustrates the important features that are required for the product to be successful. Customer reviews, ratings and sales details of 7 brands of mobile phones are identified and evaluated with MHRDD using the SVM, Naïve Bayes, Decision tree and XGBoost classifiers. The graph shows the significant features identified by the model against the percentage of customers whose reviews are analyzed. As shown in Figure 8, a larger number of customers identified product price as one of the significant features. The storage requirement was identified by 85% of customers as a significant feature, and so on. With this evaluation, customer requirements for a product can be analyzed more effectively, which can improve the design of the product for better quality and for sustainability in the market. Identification of the significant features of a product plays an important role in predictive analytics.
Figure 9 shows data size versus the percentage of redundant data removed during pre-processing in three datasets by the proposed method. In the first dataset (3 GB), 14% of redundant data was removed. In the 10 GB and 20 GB datasets, 20% and 27% of redundant data, respectively, was identified and removed. Removing more redundant data increases the accuracy of the prediction model.
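The percentage of redundant data removed can be illustrated with a small hash-based de-duplication sketch. This is a single-process stand-in and not the MHRDD implementation: it assumes that two reviews are redundant only when they are identical after lower-casing and whitespace normalization, whereas the proposed method uses a distance similarity score with multithreaded hash joins over a resilient distributed dataset. The function names and the sample reviews are hypothetical.

```python
import hashlib


def normalize(review):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(review.lower().split())


def remove_duplicates(reviews):
    """Keep the first review seen for each hash key.

    Returns the de-duplicated list and the percentage of the
    input reviews that were removed as redundant.
    """
    table = {}  # hash key -> first review seen with that key
    for review in reviews:
        key = hashlib.sha1(normalize(review).encode("utf-8")).hexdigest()
        table.setdefault(key, review)
    unique = list(table.values())
    removed = len(reviews) - len(unique)
    removed_pct = 100.0 * removed / len(reviews) if reviews else 0.0
    return unique, removed_pct


# Hypothetical reviews, used only to show the computation.
sample = [
    "Great battery life",
    "great  battery life",
    "Storage is too small",
    "Price is fair",
    "Price is fair",
]
unique_reviews, pct = remove_duplicates(sample)
print(len(unique_reviews), pct)  # prints: 3 40.0
```

A dictionary keyed by the hash gives constant-time lookup per review on average, which is the property that makes hash-based joins attractive for large review sets.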
Table 7 details a comparison of the proposed work with state-of-the-art techniques and shows the novelty of the proposed method relative to other approaches. It compares the proposed MHRDD with the state-of-the-art techniques in terms of dataset cleaning, method, type of dataset, distributed data processing and the application used for the implementation of the model.

VI. CONCLUSION
With fast technological developments, new products with innovative features are launched into the market. In this work, a novel big data processing method has been implemented that predicts product success before its launch in the market. This helps industrialists to launch and sustain a successful product in the market and helps consumers to get a good quality product. Customer product reviews, ratings and product sales details are used as the training and testing datasets. Distance similarity scores, along with a multithreaded hash-join method on a resilient distributed dataset, are used for unwanted data removal and significant feature selection. Along with this model, classification algorithms are implemented for prediction. We have given a priority weightage to product-based features based on the opinions in the customer reviews. The model is fault-tolerant as it uses a resilient distributed dataset. The MHRDD method outperforms the Grafting Gradient Descent and LSA-based methods in prediction accuracy, precision and recall.
Compared to the state-of-the-art techniques, the prediction accuracy of the proposed method increases by 11% through significant feature identification and the elimination of redundancy from the dataset. Results show that the proposed MHRDD model performs well and takes less processing time than the state-of-the-art techniques. Up to 27% of redundant, unwanted customer reviews and ratings were removed from the original raw dataset, which increases the model's prediction accuracy. The resilient distributed dataset used by the Multithreaded Hash Join method keeps a long lineage, through which the model achieves fault tolerance. The MHRDD model is fast because of its distributed in-memory computation approach. The proposed approach can be extended to the identification of other product features in big data predictive analytics. As future work, the model may be improved to make real-time streaming forecasts through a centralized API that explores customer suggestions, credentials, ratings and surveys from different reliable online sites.
PHILIP SAMUEL received the M.Tech. degree in computer and information science from the Cochin University of Science and Technology (CUSAT) and the Ph.D. degree in computer science and engineering from the IIT Kharagpur. He has more than 20 years of experience in teaching and research as a Faculty Member of the CUSAT, where he is currently a Professor with the Department of Computer Science. He has published more than 60 research papers in international conferences and journals. His research interests include big data analytics, distributed computing, automated software engineering, and artificial intelligence.
MARIAMMA CHACKO was born in Changanacherry, India, in 1961. She received the bachelor's degree in electrical engineering from the University of Kerala, in 1985, and the master's degree in electronics and the Ph.D. degree in computer engineering from the Cochin University of Science and Technology (CUSAT), in 1987 and 2012, respectively. She has been working as a Faculty Member of the Department of Ship Technology, CUSAT, since 1990, where she is currently a Professor. She has more than 25 research publications to her credit. Her research interests include the validation and optimization of embedded software, power quality in ship electrical systems, and the sensorless control of BLDC motors.
VOLUME 8, 2020