Distribution-Adapted Model for Helpful Vote Prediction

The number of helpful votes on a review is an essential indicator of how much impact the review has on other customers in electronic commerce. Therefore, predicting the number of helpful votes is an important task. Regression analysis and Tobit modeling are typical methods of prediction. Those methods come from the same initial assumption that the number of helpful votes follows a normal distribution on any dataset. However, the assumption is not usually confirmed, and the distribution of the helpful votes often follows other distributions. This paper proposes a framework for investigating the feasibility of building a model that predicts the number of helpful votes according to the distribution of the number of helpful votes. On top of that, considering the review age, we propose an adaptive window size sampling method to evaluate the model on review datasets sorted chronologically. The experimental results validated that the model adapting to the best approximate distribution gives a significant improvement compared to the baseline models. In addition, model evaluation using the adaptive window size sampling method has significant impacts on the performance on large datasets.


I. INTRODUCTION
Customers often write reviews to describe their opinions about the quality of a product. Such reviews might help other customers with their purchase decisions. The number of helpful votes on a product review indicates the impact the review has on other customers. Hence, it is crucial to estimate the number of helpful votes.
Previous studies examine the distribution of helpful votes¹ with some simple indicators when selecting a suitable model. Negative binomial regression [1], [2], [3] is chosen instead of Poisson regression because these studies consider helpful votes to follow a count distribution with an over-dispersion problem [1]. Over-dispersion is a phenomenon in which the equality of mean and variance is not fulfilled in a count distribution. For the same reason, some studies employ regression [4], [5] and the Tobit model [6], [7] after first normalizing or transforming the helpful votes. Normalization and transformation, such as the helpful ratio [1], [8], [9] and log-transformation [5], are used to bring the helpful votes into a continuous distribution form. However, it has never been confirmed that the distribution initially assumed by the model conforms to, or even approximates, the distribution of helpful votes. If the regression models above are applied to an unsuitable distribution, they may not achieve optimal results and may even give unacceptable ones.
¹ We use the expression 'the distribution of helpful votes' with the same meaning as 'the distribution of the number of helpful votes'.
The associate editor coordinating the review of this manuscript and approving it for publication was Taous Meriem Laleg-Kirati.
The importance of confirming the target distribution has been introduced to optimize the result on a normal distribution with Gaussian process [10]. The generalized linear model (GLM) provides a solution when the target is not in a normal distribution. The main idea of GLM is to build a model by generalizing regression analysis to other distributions that fit the target [11], [12]. The next problem is to provide the target distribution before developing a GLM.
A goodness-of-fit test is usually performed to find the best-fit data distribution. However, a goodness-of-fit test often cannot identify a best-fit distribution. Therefore, we use Mean Squared Error (MSE) and Akaike Information Criterion (AIC) [13], [14] to approximate the distribution by comparing the scores of several distributions. Since the normal distribution is in the Exponential Dispersion Model (EDM) family, we also consider four other distributions from that family: Gamma, Inverse Gaussian (InvGauss), Exponential (Expon), and Wald.
The critical step in approximating the distribution by AIC and MSE is to create a histogram with a certain number of bins. An arbitrary constant is often used as the number of bins without considering dataset characteristics, which can lead to wrong identification of the distribution. In this study, we apply Scott's rule [15] to calculate the number of bins and use the Kolmogorov-Smirnov (KS) score [16], [17], [18] for calibration. Later, we also investigate whether the model performance follows the rank of the identified distributions.
Subsequently, we generate and evaluate the model on the dataset by using a sampling method. Cross-validation with 10-fold sampling is popularly employed to evaluate the helpful-review models [3], [9]. However, a new review under actual conditions does not have any votes yet when posted. Saptono and Mine [19] proposed time-based sampling (TBS) methods with Cochran's formula, which assumes a binomial distribution for classification tasks. Their formula uses the binomial variance of the helpfulness rating calculated from the whole dataset. Besides, only data in the training set are assumed labeled, and the others are unlabeled. Here, the variance formula for the binomial distribution is changed to that for the other distributions so that the TBS method can be used correctly and more effectively.
To address the problems described above, we propose a framework to correctly implement a model adapting to the distribution of helpful votes. Our framework combines three main modules: distribution identification, model generation, and sampling methods. Each module employs a particular technique and contributes as follows:
1) We propose a method for identifying the distribution of the helpful votes. The proposed method approximates the distribution in more detail by computing MSE and AIC scores from a histogram whose number of bins is computed by Scott's rule. We apply the KS score for calibration.
2) For model generation, we employ a model adapting to the distribution of helpful votes to predict the number of helpful votes. We call this model the distribution-adapted model. We build the model with three machine learning methods: linear model, extreme gradient boosting [20], and convolutional neural network [9], [21].
3) For the sampling methods used to evaluate the models, we adjust the window size of the TBS method [19] so that it can be applied to a dataset even in a continuous distribution.
Next, we conduct extensive experiments on Amazon.com datasets [22] and IMDb datasets [23].
In this paper, we answer the following research questions:
Q1 Does distribution identification by the MSE or AIC score yield the same results as the KS score?
Q2 Does the performance of the distribution-adapted model follow the rank of the distribution identification results?
Q3 Does the adjustment of the window size of the TBS method improve the model performance?
Q4 How do the implementation and evaluation of the distribution-adapted model with the adaptive window size (AWS) sampling method affect time consumption compared to the baseline models?
The rest of the paper is structured as follows: we present an overview of existing prediction models, factors, and sampling methods to estimate the helpful votes in Section II and the typical structure of the EDM family density function in Section III. Subsequently, we elaborate on our proposed framework in Section IV. In Section V, we describe our experimental setup and report the results. Finally, we summarize our contributions and discuss further tasks in Section VI.

II. RELATED WORK
In this section, we briefly describe some previous studies related to ours. We first discuss some metrics to measure helpfulness and then describe some models employed in helpful vote prediction. Subsequently, we discuss some factors used in previous research projects and trends to use the text factor. We next elaborate on previously implemented sampling methods. Finally, we summarize related studies and compare them with this study, as shown in Table 1.

A. HELPFULNESS METRICS
Previous papers used several metrics to measure how helpful a review is for the customer. A helpfulness rating, also called a helpful ratio, is applied if two types of feedback are captured by the system: helpful and not helpful [8], [25], [26], [35], [36]. In this case, the helpfulness rating is the ratio of the number of helpful votes to the total votes. A higher helpfulness rating means the review has helped other customers to make a purchasing decision. Amazon.com also used this metric in its dataset [37]. The helpfulness rating is also binarized with a threshold [4].
Recently, most commerce systems, including Amazon.com and Yelp.com, have eliminated the unhelpful button as customer feedback for product review. Consequently, the current Amazon.com 2018 dataset [22], the updated version of the previous Amazon.com 2014 dataset [37], has dropped the total votes information and provides the number of helpful votes as the only feature indicating helpfulness. Previous research used the helpful votes to represent the number of helpful votes [4], [6], [7]. Moreover, the categorized helpful vote form is also used as a target variable [3], [19], [33], [34].

B. HELPFUL VOTE PREDICTION MODELS
Regression analysis is a representative model for predicting review helpfulness [4], [9], [24], [26], especially in terms of helpfulness rating, which comes from the ratio of the number of helpful votes to the total votes. Tobit modeling, a zero-censored regression, revises the regression model for massive zero-value problems and has become a popular model for predicting the helpfulness rating [7], [8], [25], [27], [28], [29]. When implemented in machine learning, regression and Tobit modeling employ the same objective functions, Sum Squared Error (SSE) or MSE. Those objective functions come from the initial assumption that the dependent variable in the model is normally distributed [12].
VOLUME 10, 2022
Motivated by the results on the helpfulness rating, researchers have continued to employ both models to estimate the number of helpful votes [4], [6], [7], [32], although its distribution is not normal. This practice is also supported by the central limit theorem, under which a large dataset tends toward a normal distribution in many situations, even if the original variables themselves are not normally distributed [38].
Recently, considering the discrete form of helpful votes, negative binomial regression has been on the rise as a popular model for predicting the number of helpful votes [1], [2], [3]. This model assumes that the number of helpful votes is in a discrete distribution with an over-dispersion problem [1], [2], [3], where the variance is far from the mean value [39].
However, recent studies focus on the review contents represented by word embedding [9], [19], [40] or bag-of-words vector [19], [24]. Moreover, a model with mixed numerical and text factors gives no significant improvement compared to models using either text or numerical factors [19].

D. SAMPLING METHODS
Helpfulness prediction studies generally use random sampling methods on the Amazon dataset. This method randomly chooses elements of the training and testing data. The 10-fold cross-validation sampling method is one of the most popular random sampling methods [3], [9]. However, the random sampling-based models do not consider review age, which neglects the obsolescence of the product functions or features in the reviews [19]. Considering review age, Saptono and Mine [19] proposed TBS methods. Their methods use Cochran's formula and time range to calculate the adequate training set size in classification tasks.

III. PROBABILITY DISTRIBUTION FUNCTION
This study assumes that the helpful vote y follows a distribution in the continuous EDM family. The native members of the EDM family are the Normal, Gamma, and InvGauss distributions [11]. According to the central limit theorem [38], distributional approximations for a wide range of data tend to approach the normal distribution among these. Therefore, we add Expon and Wald as particular cases of Gamma and InvGauss, respectively. Both are also EDM family members.
Each distribution in the EDM family has a different probability density function (PDF). However, we can write a common structure of the PDF f(y, θ, φ) for the response variable y, with parameters θ and φ, as follows:

$$f(y, \theta, \phi) = a(y, \phi)\,\exp\left\{\frac{y\theta - \kappa(\theta)}{\phi}\right\} \tag{1}$$

where θ is called the canonical function, κ(θ) is called the cumulant function, φ is the dispersion parameter, and a(y, φ) is a normalizing function ensuring that (1) is a probability function [11]. We employ (1) to identify the distribution of helpful votes and develop the models.
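As a quick check of the common EDM structure, the normal distribution can be rewritten in that form. The identification below is the standard textbook one and is included only as an illustration; it is not copied from the paper's tables.

```latex
\begin{align*}
f(y;\mu,\sigma^2)
  &= \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\} \\
  &= \underbrace{\frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\left\{-\frac{y^2}{2\sigma^2}\right\}}_{a(y,\phi)}
     \exp\left\{\frac{y\mu - \mu^2/2}{\sigma^2}\right\}
\end{align*}
% Reading off the terms: \theta = \mu, \kappa(\theta) = \theta^2/2, \phi = \sigma^2.
```

With these choices the exponent is yθ − κ(θ) divided by φ, which is why the squared error appears as the deviance in the normal case.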
The distribution-adapted model is generalized from linear regression analysis, the normal distribution-adapted model. Therefore, we use the mean symbol µ of the normal distribution to represent the estimator E[y] for the variable y in all distributions.

IV. PROPOSED FRAMEWORK
In this section, we elaborate on our proposed framework. Fig. 1 shows the overall steps in our framework. We first select the review dataset in Step A and preprocess it in Step B. From Step B, we choose the helpful votes as the dependent variable and the text part of reviews as the independent variable. Subsequently, we identify the distribution of helpful votes in Step C. In Step D, a generalized linear model is formulated based on the distribution of helpful votes. Next, in Step E, we implement the distribution-adapted models in machine learning. In Step F, text factors extracted from a review dataset take two forms, bag-of-words and word embedding, depending on the machine learning method. Finally, in Step G, the models are evaluated on the dataset using the adaptive window size sampling method, and the performance is measured by the mean absolute error (MAE) metric.

A. DATASET SELECTION AND PREPROCESSING
This study uses three categories of the Amazon dataset [22] in Step A: 1) Automotive (AD1), 2) Cell Phones and Accessories (AD2), and 3) Industrial and Scientific (AD3). Each dataset combines many products that belong to the same category.
We also use movie reviews from the IMDb dataset [23]. We select three movie datasets as follows: 1) La La Land 2016 (ID1), 2) X-Men Apocalypse 2016 (ID2), and 3) 3 Idiots 2009 (ID3).
The Amazon datasets in Table 2 contain massive numbers of inapplicable votes, so in Step B we apply three rules to select the data involved in the experiments. First, we only use non-zero/applicable vote reviews in the experiments because it is unclear whether an inapplicable vote review is new or unhelpful. Even when a zero-vote review is positioned among voted reviews, it could be a 'never seen' review because the system design gives priority to popular reviews. Second, we drop duplicate reviews and keep the original one in the dataset. The duplicates come from the system sharing one review among item variations, such as color and size. Each variation has a unique identity number but shares the same reviews, making the duplicate review not directly related to the item. That is why duplicate reviews appear more frequently in Amazon datasets than in IMDb datasets, as shown in Table 2. Third, we apply L2-normalization [41] to the number of helpful votes. The big difference between mean and variance in Table 2 shows that the over-dispersion problem occurs in the helpful votes of all datasets.
From Step B, we select the helpful votes as the dependent variable and the text reviews as the independent variable. We feed the helpful votes to Step C and the text reviews to Step F.

In Step C, we employ the MSE and AIC scores as goodness-of-fit measures [13], [14] to identify the distribution of helpful votes. We compare those scores among the five distributions in the set C of EDM family members: Normal, Gamma, InvGauss, Expon, and Wald. We initially generate a histogram based on all the helpful votes of each dataset to obtain MSE and AIC. We employ Scott's rule [15] to find the number of histogram bins. This rule considers the data characteristics and the size of the data when formulating the number of bins, as shown in (2).
The steps of obtaining the MSE and AIC scores for each distribution in C are described as follows:
1) We first select a distribution c in C and fit it to the helpful votes of the dataset to get the parameters of c.
2) We generate a histogram of the helpful votes of the dataset. The number of bins n_b in the histogram is calculated using Scott's rule [15] in (2):

$$n_b = \frac{\max - \min}{3.49\,\sigma N^{-1/3}} \tag{2}$$

where σ, N, max, and min are the standard deviation, the dataset size, and the maximum and minimum values of the helpful votes, respectively. In this step, we also get n_b pairs (x_i, y_i), one for each bar in the histogram, where y_i represents the actual value of helpful votes at position x_i on the axis.
3) Based on n_b calculated in step 2 and the parameters obtained in step 1, we generate n_b values of ŷ_i by using the density function of c.
4) MSE and AIC are calculated using y_i from step 2 and the generated data ŷ_i from step 3:

$$\mathrm{MSE} = \frac{1}{n_b}\sum_{i=1}^{n_b}\left(y_i - \hat{y}_i\right)^2 \tag{3}$$

$$\mathrm{AIC} = n_b \ln(\mathrm{MSE}) + 2k \tag{4}$$

where k is the number of parameters in the distribution c.
5) The steps above are repeated for the other distributions in C.
The distribution with the least MSE and AIC scores is the best approximate distribution. We calibrate those results with a KS score obtained from the KS test output, as in (5):

$$D_{m,n} = \sup_x \left| F_{1,m}(x) - F_{2,n}(x) \right| \tag{5}$$

where D_{m,n} is the KS score for two samples of sizes m and n, sup is the supremum function, and F_{1,m} and F_{2,n} are the empirical cumulative distribution functions of sample 1 and sample 2, respectively. We also use this calibration to answer Q1 in Section I.
TABLE 3. EDM distributions [11] with canonical function (θ), dispersion parameter (φ), cumulant function (κ(θ)), estimator of y (µ), and variance of y (σ²).
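The identification procedure described in this subsection can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not stated in the paper: the candidate set is mapped to `scipy.stats` distributions, the location of the positive-support distributions is fixed at zero, the bin count is rounded up, AIC uses the least-squares form n_b ln(MSE) + 2k, and the KS score is the one-sample statistic against the fitted CDF. The names `scott_bins` and `identify_distribution` are hypothetical.

```python
import math
import numpy as np
from scipy import stats

# Candidate set C mapped to scipy.stats (an assumption of this sketch).
# Positive-support distributions are fitted with the location fixed at 0.
CANDIDATES = {
    "Normal": (stats.norm, {}),
    "Gamma": (stats.gamma, {"floc": 0.0}),
    "InvGauss": (stats.invgauss, {"floc": 0.0}),
    "Expon": (stats.expon, {"floc": 0.0}),
    "Wald": (stats.wald, {"floc": 0.0}),
}

def scott_bins(votes):
    """Number of bins from Scott's bin width 3.49 * sigma * N^(-1/3)."""
    votes = np.asarray(votes, dtype=float)
    width = 3.49 * votes.std() * len(votes) ** (-1.0 / 3.0)
    return max(1, math.ceil((votes.max() - votes.min()) / width))

def identify_distribution(votes):
    """Return the bin count and the MSE, AIC, and KS score per candidate."""
    votes = np.asarray(votes, dtype=float)
    n_b = scott_bins(votes)
    # Density histogram: y_i is the bar height at the bin centre x_i.
    y, edges = np.histogram(votes, bins=n_b, density=True)
    x = 0.5 * (edges[:-1] + edges[1:])
    scores = {}
    for name, (dist, fixed) in CANDIDATES.items():
        params = dist.fit(votes, **fixed)       # fit the parameters of c
        y_hat = dist.pdf(x, *params)            # fitted density at each bar
        mse = float(np.mean((y - y_hat) ** 2))
        k = len(params) - len(fixed)            # free parameters only
        aic = float(n_b * np.log(mse) + 2 * k)  # least-squares AIC (assumed)
        ks = float(stats.kstest(votes, dist.cdf, args=params).statistic)
        scores[name] = {"mse": mse, "aic": aic, "ks": ks}
    return n_b, scores
```

Sorting `scores` by `"mse"`, `"aic"`, or `"ks"` then gives one ranking of the candidates per criterion, which is the comparison needed for Q1.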

C. MODEL GENERATION
The main task of this study is to generate a model that adapts to the suitable distribution of helpful votes in Step D. The critical process is to derive the unit deviance for the objective function. The deviance is a generalization of the SSE used in regression analysis; it also plays the role of a cost function and has to be minimized [12]. Because the estimator µ = E[y] and θ in (1) are related by a one-to-one function [11], the unit deviance d for the response y and the estimator µ is as follows:

$$d(y, \mu) = 2\left[\,t(y, y) - t(y, \mu)\,\right] \tag{6}$$

where t(y, µ) is the exponent term in (1), which is defined as follows:

$$t(y, \mu) = y\theta - \kappa(\theta) \tag{7}$$

where θ is a function of µ.
Generalizing linear regression, we get the deviance as the summation of unit deviance in (6). The unit deviance of the distribution used in this paper is shown in Table 3.
Expon is a particular form of the Gamma distribution with shape parameter equal to one and scale parameter θ, so the PDF of the Expon distribution f(y, θ) is shown in (8) [11].

$$f(y, \theta) = \frac{1}{\theta}\exp\left(-\frac{y}{\theta}\right) \tag{8}$$

Based on (1), (6), (7) and E(y) = µ = θ, we get

$$t(y, \mu) = -\frac{y}{\mu} - \ln\mu \tag{9}$$

and the unit deviance of the Expon distribution is as follows:

$$d(y, \mu) = 2\left[\frac{y - \mu}{\mu} - \ln\frac{y}{\mu}\right] \tag{10}$$

The Expon unit deviance in (10) is equivalent to that of the parent distribution, Gamma, as shown in Table 3.
On the other hand, Wald is a particular case of InvGauss in which µ, the estimator of y, is equal to one. Following the same derivation as for Expon, we derive the unit deviance of Wald, which is equivalent to that of the parent distribution, InvGauss, as shown in Table 3. Unlike the normal deviance, these unit deviance functions have µ in the denominator. This affects the result if the prediction is close or equal to zero. Therefore, we need to apply a translation to the deviance.
For the normal deviance d(y, µ), the deviance translation is applied as in (11), where µ is an estimator for the response variable y and the translation coefficient sets the amount of translation. A typical example of the deviance translation is the squared log error (SLE), in which the translation coefficient is equal to one. Based on the deviance translation for the normal distribution and SLE, we generalize the deviance translation to the other deviances to prevent division-by-zero errors or anomalous results caused by a number close to zero.
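To make the role of the translation concrete, the sketch below evaluates the unit deviances numerically. The Gamma/Expon and InvGauss/Wald forms are the standard EDM unit deviances. The `translated` helper, which shifts both y and µ by the same coefficient, is only an assumed form of the translation chosen for illustration, since the paper's exact translated expressions are not reproduced in this text.

```python
import math

def normal_deviance(y, mu):
    """Unit deviance of the normal distribution (squared error)."""
    return (y - mu) ** 2

def gamma_deviance(y, mu):
    """Unit deviance shared by Gamma and Expon; requires y > 0 and mu > 0."""
    return 2.0 * ((y - mu) / mu - math.log(y / mu))

def invgauss_deviance(y, mu):
    """Unit deviance shared by InvGauss and Wald; requires y > 0 and mu > 0."""
    return (y - mu) ** 2 / (y * mu ** 2)

def translated(deviance, y, mu, coeff=1.0):
    """Hypothetical translation: shift both arguments by the same coefficient."""
    return deviance(y + coeff, mu + coeff)

def squared_log_error(y, mu):
    """SLE, the log-scale squared error with a translation coefficient of one."""
    return (math.log(y + 1.0) - math.log(mu + 1.0)) ** 2

# A prediction close to zero makes the untranslated Gamma deviance explode,
# while the translated version stays small.
near_zero = 1e-9
raw = gamma_deviance(1.0, near_zero)
shifted = translated(gamma_deviance, 1.0, near_zero)
```

In this example `raw` is on the order of 10^9 while `shifted` is below one, which is the kind of anomaly the translation is meant to prevent.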

D. MACHINE LEARNING AND FEATURE EXTRACTION
We employ three types of machine learning in Step E. First, considering the widespread use of regression and the Tobit model in previous studies, we employ the linear model (LM). Second, we employ XGBoost (XGB) [20] since it has shown extraordinary results on classification tasks, as mentioned in [19]. Finally, regarding a state-of-the-art helpfulness rating prediction model [9], we employ CNN to implement models adapting to the distribution of helpful votes. Furthermore, we use the unit deviance, the output of Step D, as the objective function to be minimized in machine learning. For the Gamma and Expon distributions, we develop only the better one according to the distribution identified. The same applies to the InvGauss and Wald distributions.
XGB employs gradient boosting and Taylor expansion to support the development of a custom objective function. Therefore, we need to provide the first and second derivatives of each deviance [20] in Table 3. We employ the CNN based on the architecture proposed in [9], which is a state-of-the-art helpfulness rating prediction model, with modifications of the loss function and output layer, as shown in Fig. 2. We build the loss function based on the deviance of the distribution, which is the summation of the unit deviance shown in Table 3. When implementing the Tobit model and the models adapting to the Gamma/Expon and InvGauss/Wald distributions in CNN, we employ an output layer that prevents negative outputs. Two activation functions are possible: ReLU [42] and LeakyReLU [43] with negative coefficient a, as in (12).

$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x \ge 0 \\ a\,x, & x < 0 \end{cases} \tag{12}$$

Since ReLU has a problem called the 'dying' phenomenon, where the prediction is always zero for every value of the dependent variable, we use LeakyReLU. Later, we demonstrate the dying phenomenon when using ReLU as the activation function at the output layer.
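The difference between the two output activations can be illustrated with a few lines of NumPy. This is only a sketch of the activation functions themselves, not of the CNN; the coefficient value -1e-5 is used as an example of a small negative coefficient.

```python
import numpy as np

def relu(x):
    """ReLU: negative pre-activations are clamped to exactly zero."""
    return np.maximum(np.asarray(x, dtype=float), 0.0)

def leaky_relu(x, a):
    """LeakyReLU with coefficient a applied to negative pre-activations."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, a * x)

pre_activation = np.array([-2.0, -0.5, 0.0, 1.5])

# With ReLU, every negative pre-activation yields 0, so both the output and
# its gradient vanish there.
relu_out = relu(pre_activation)

# With a small negative coefficient, negative pre-activations map to small
# positive outputs, so the prediction stays non-negative without being stuck at 0.
leaky_out = leaky_relu(pre_activation, a=-1e-5)
```

PyTorch exposes the same coefficient as the `negative_slope` argument of `torch.nn.LeakyReLU`.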
In LM, we employ the same objective function as in CNN. We use the Limited Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) approach [44] to minimize the objective function.
As independent variables, we use the results of feature extraction from the review text in Step F, depending on the machine learning method. In the LM- and XGB-based models, the text parts of the reviews are transformed into a bag-of-words format of unigrams and bigrams weighted by term frequency-inverse document frequency (TF-IDF) [45]. Meanwhile, in CNN, the text parts are transformed into word embeddings by applying GloVe [46], [47] trained on six billion tokens with 100-dimensional vectors (Glove.6B.100d).
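The bag-of-words features for the LM- and XGB-based models could be produced along the following lines. This is a minimal sketch assuming scikit-learn's `TfidfVectorizer` with default tokenization; the paper does not state which vectorizer or settings were used, and the three example reviews are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up review texts, used only to show the shape of the features.
reviews = [
    "Great battery life and a sturdy case",
    "Battery life is poor but the case is fine",
    "The case broke after a week",
]

# ngram_range=(1, 2) extracts unigrams and bigrams; TF-IDF weighting is applied.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(reviews)  # sparse matrix, one row per review
```

Each row of `features` is then the independent-variable vector for one review.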

Finally, in Step G, to evaluate the model, we propose an adaptive window size (AWS) sampling method. AWS is inspired by the TBS method [19], as shown in Fig. 3(b). The basic idea of the TBS method is to get the training set as close as possible to the testing set, under the assumption that, as the training set gets closer to the testing set, it shares more similar characteristics with the testing set and the model performance improves [19].
AWS uses a variable-length training set instead of a fixed-length training set. Since the testing set is the same, we can select a training set suitable to the testing set. However, Cochran's formula uses the variance of the binomial distribution to determine the sample size [48]. According to the central limit theorem, we need to deal with continuous distributions. Therefore, we adjust the formula as in (13) to get the sample size n from the dataset size N.

$$n_0 = \frac{Z^2 \sigma^2}{w^2}, \qquad n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}} \tag{13}$$

where Z is a standard score for the desired confidence level, σ² is the variance of helpful votes, w is a unit margin of error, and n_0 is the number of samples if the dataset size is unknown.
In this paper, we calculate σ from the helpful votes of reviews in the training set because we assume that reviews after the training set have not been voted on yet. If the dataset size is N and the fold size is f, then AWS proceeds as follows:
1) We first sort the dataset of size N in chronological order and divide it by the fold size f. We use the first N_t = N/f data as the candidate training set and the next N_t data as the testing set.
2) We calculate the variance σ_t² of the helpful votes in the candidate training set from step 1. We then feed σ_t² and the size of the candidate training set to Cochran's formula in (13), replacing σ² and N, to get the sample size n.
3) If n from step 2 is greater than or equal to the size of the candidate training set, we use the whole candidate training set as the training set. Otherwise, we use the n elements of the candidate training set closest to the testing set.
4) We add the testing set to the candidate training set and select the following N_t data as the new testing set. We feed the new candidate training set and testing set to step 2.
5) We repeat the above for f − 1 samples.
For comparison, we run the model with 10-fold cross-validation (CV). We need to adjust the 10-fold CV in Fig. 3(a) when applying the model to the dataset in chronological order, as shown in Fig. 3(b). The white cells are used as a training set in random sampling. Since the review dataset is sorted in chronological order, the white cells are unlabeled. The testing set in the current row is added to the training set in the next row, and we call this the adjusted cross-validation (ACV) sampling method.
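The AWS steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the finite-population correction uses the common Cochran form n = n_0 / (1 + (n_0 − 1)/N), the sample size is rounded up, Z = 2.576 stands for a 99% confidence level, and the function names `cochran_sample_size` and `aws_splits` are hypothetical. The input is assumed to be already sorted chronologically.

```python
import math
from statistics import pvariance

def cochran_sample_size(variance, population, z=2.576, w=2.0):
    """Sample size for a continuous target with finite-population correction."""
    if population <= 1:
        return population
    n0 = (z ** 2) * variance / (w ** 2)
    n = n0 / (1.0 + (n0 - 1.0) / population)
    return max(1, math.ceil(n))

def aws_splits(votes, folds=10, z=2.576, w=2.0):
    """Return (train, test) index ranges for chronologically sorted votes."""
    n_t = len(votes) // folds
    splits = []
    for i in range(1, folds):
        cand_end = i * n_t  # the candidate training set is votes[:cand_end]
        # Variance is computed from the candidate training set only, because
        # later reviews are treated as not yet voted on.
        variance = pvariance(votes[:cand_end])
        n = cochran_sample_size(variance, cand_end, z, w)
        # Keep only the n candidates closest in time to the testing set.
        train_start = 0 if n >= cand_end else cand_end - n
        train = range(train_start, cand_end)
        test = range(cand_end, cand_end + n_t)
        splits.append((train, test))
    return splits
```

Each returned pair keeps the training window immediately before its testing fold, and the window length changes from fold to fold with the variance of the votes seen so far.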
We employ the mean absolute error (MAE) in (14) to evaluate the performance of each model and compare the models to find the best one. A smaller MAE means a better model.

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{14}$$

where y and ŷ are the actual and predicted numbers of helpful votes, respectively, and n is the testing set size. On the other hand, we use the mean absolute percentage error (MAPE), as shown in (15), to prove the dying phenomenon of CNN with ReLU on the output layer. Using MAPE makes it easier to detect zero predictions for any values of normalized helpful votes. If the average MAPE is 1 with a standard deviation of 0, we can conclude that the predicted value is always zero.

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{15}$$

We also use the MAE of CTR with ReLU on the output layer as a threshold for determining whether a model's performance is acceptable. If a model has a smaller MAE than the threshold, its result is acceptable; otherwise, it is unacceptable.
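A short sketch shows why an always-zero predictor gives a MAPE of exactly 1 and how its MAE can serve as an acceptance threshold. The vote values and the helper name `is_acceptable` are made up for illustration.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; actual values must be non-zero."""
    return sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)

def is_acceptable(model_mae, threshold):
    """A model is acceptable only if its MAE is strictly below the threshold."""
    return model_mae < threshold

# Made-up normalized helpful votes and a 'dead' model that always predicts zero.
votes = [0.2, 0.5, 0.1]
dead_predictions = [0.0, 0.0, 0.0]

dead_mape = mape(votes, dead_predictions)  # every term is |y / y| = 1
threshold = mae(votes, dead_predictions)   # equals the mean of |y|
```

Any model whose MAE on the same testing set is not below `threshold` performs no better than predicting zero everywhere.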

V. PERFORMANCE EVALUATION
We conduct experiments to validate our proposed models. We first identify the distribution using MSE and AIC scores. Subsequently, we implement the distribution-adapted models using three machine learning methods: LM, XGB, and CNN, and evaluate them using the AWS and ACV sampling methods. We employ a statistical analysis of variance to check the significance of the mean difference. Finally, we investigate the effect of machine learning, sampling methods, and distribution on model performance.

A. EXPERIMENT SETUP
We build two baseline models adapted to the normal distribution: linear regression [4] and Tobit regression [6], [7] when the best approximation distribution is not the normal distribution. In addition, we also involve a model adapting to the second-best approximate distribution to investigate the possibility of model performance following the distribution of helpful votes. We develop each model in three machine learning contexts, where the details are shown in Table 4.
Here, we implement our proposed framework in Python. We use the linear model library sklearn.linear_model to implement linear regression, Tobit regression, and the Gamma/Expon distribution-adapted model. For the InvGauss/Wald distribution-adapted model, we use Tweedie regression and set the power to three. Since we use linear models from sklearn.linear_model, we also use as baseline models all the machine learning methods in that library and in the ensemble library shown in Table 4 that employ SSE/MSE or a modification of it as the objective function [49], [50], [51], [52], [53], [54], [55], [56].
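For orientation, the two distribution-adapted linear models can be set up with scikit-learn roughly as below. This is a sketch on synthetic data: the use of `GammaRegressor`, the log link, and the default regularization are assumptions of this example, since the paper only states that the Gamma/Expon model comes from sklearn.linear_model and that Tweedie regression with power three is used for InvGauss/Wald.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor, TweedieRegressor

# Synthetic positive targets standing in for normalized helpful votes.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = np.exp(X @ np.array([0.5, -0.3, 0.2])) * rng.gamma(5.0, 0.2, size=200)

# Gamma/Expon distribution-adapted linear model (Gamma deviance).
gamma_model = GammaRegressor().fit(X, y)

# InvGauss/Wald distribution-adapted linear model: Tweedie power 3
# corresponds to the inverse Gaussian deviance.
invgauss_model = TweedieRegressor(power=3, link="log").fit(X, y)

gamma_pred = gamma_model.predict(X)
invgauss_pred = invgauss_model.predict(X)
```

With a log link, both models return strictly positive predictions, which avoids the division-by-zero issue of these deviances at prediction time.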
We also use the xgboost library to implement XGB. Two problems arise when we use XGB: 1) the native objective function of Gamma/Expon cannot handle normalized values of the dependent variable, and 2) the objective function of InvGauss/Wald has not been implemented in XGB yet, while Tweedie regression in XGB cannot accept a power equal to or close to three. Therefore, we need to provide a custom objective function for the models adapting to the Gamma/Expon and InvGauss/Wald distributions. We first use a translation of the unit deviance in Table 3, as a generalization of (11). Equations (16) and (17) show the translation of the Gamma/Expon and InvGauss/Wald unit deviance, which we call GAMMA+ and WALD+, respectively. Subsequently, we feed the first and second derivatives of the translated unit deviance to develop the custom objective functions of XGB.
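A custom objective of this kind returns the first and second derivatives of the loss with respect to the prediction. The sketch below does this for one possible translated Gamma deviance, obtained by shifting both y and µ by the same coefficient. That form is an assumption of this example and may differ from the paper's GAMMA+ definition; the function names are hypothetical, and the Hessian floor is an extra safeguard added here.

```python
import numpy as np

def gamma_plus_deviance(y, mu, coeff=1.0):
    """Assumed translated Gamma unit deviance: both arguments shifted by coeff."""
    v, m = y + coeff, mu + coeff
    return 2.0 * ((v - m) / m - np.log(v / m))

def gamma_plus_grad_hess(y, mu, coeff=1.0, min_hess=1e-6):
    """First and second derivatives of the deviance above with respect to mu."""
    v, m = y + coeff, mu + coeff
    grad = 2.0 * (m - v) / m ** 2
    hess = 2.0 * (2.0 * v - m) / m ** 3
    # Floor the Hessian so the boosting step stays well defined when m > 2v.
    return grad, np.maximum(hess, min_hess)

def gamma_plus_objective(predt, dtrain):
    """XGBoost-style custom objective: (predictions, DMatrix) -> (grad, hess)."""
    y = dtrain.get_label()
    return gamma_plus_grad_hess(y, predt)
```

Such a function can be passed through the `obj` argument of `xgboost.train`. Predictions must stay above the negative of the coefficient for the logarithm to be defined, which this sketch does not enforce.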
Figs. 12 and 13 in Appendix A show that GAMMA+ handles normalized helpful votes better than the original Gamma objective function in XGB. GAMMA+ performs better when the translation coefficient is ≥ 1. Likewise, the WALD+ implementation in XGB handles the normalized helpful votes better than the original Wald objective function, as shown in Figs. 14 and 15 in Appendix A. WALD+ performs better when the translation coefficient is ≥ 1 with AWS on ID3 and ≥ 2 for the rest. We summarize the translation coefficient values in Table 5.
Moreover, we use PyTorch to develop CNN. We use MSE Loss for models adapted to a normal distribution. However, we need to provide a custom loss function for models adapting to Gamma/Expon and InvGauss/Wald distributions based on the mean of summation from unit deviance, as shown in Table 3.
We can use two activation functions: ReLU and negative coefficient LeakyReLU, as mentioned in Subsection IV-D, for the output layer of CTR, CEX, and CWA. Since CTR is one of the baseline models, we use the CTR model to prove that CTR with ReLU for the output layer will give the 'dying' phenomenon on normalized helpful votes. On the other hand, CNN with a negative coefficient LeakyReLU will solve the ReLU problem.
To prove the 'dying' phenomenon, we combine CTR with ReLU for the output layer. The CTR model with ReLU gives an average MAPE equal to 1 with a standard deviation of 0 for all datasets with both sampling methods, ACV and AWS. This result shows that the combination of CTR and ReLU always outputs zero for any value of the normalized helpful votes. Since the output is always zero, the average MAE of CTR-ReLU is the average of the absolute actual numbers of helpful votes. A model is therefore unacceptable if its MAE is greater than or equal to the average MAE of CTR-ReLU, and we use the average MAE of CTR-ReLU as a threshold to determine the acceptance of the model performance.
Furthermore, considering the value of the helpful votes after L2-normalization, we use negative values for the LeakyReLU coefficient in the range [−1e−3, −1e−9] for the output layer of CTR to solve the ReLU problem. The negative coefficient of LeakyReLU, as in (12), ensures the output is above the axis line (y = 0), except at 0. In addition, within that range, we also get a gentle slope of LeakyReLU. Fig. 16 in the Appendix shows that CTR with a negative-coefficient LeakyReLU improves on CTR with ReLU. The LeakyReLU coefficient for each dataset is shown in Table 5.
We need to provide some parameters for machine learning and the AWS sampling method. We use a learning rate of 0.1 for XGB and 0.01 for CNN. We also set a negative parameter of 1e-5 to get a positive gradient function on LeakyReLU. In the AWS sampling method, we set the unit margin of error w to two and use the Z-score for a confidence level of 99%.

B. APPROXIMATE DISTRIBUTION
Since there is no best fit distribution based on the p-value of the KS test with a value less than 0.01 for all distributions, we use MSE and AIC scores to establish the approximate distribution to identify the distribution of helpful votes. Table 6 shows the result of approximate distribution identification using the MSE and AIC scores compared to the KS score as calibrator.
We get Wald as the best approximate distribution for Amazon datasets, either with MSE or AIC. This result is the same as the outcome of the KS score best approximate distribution, as shown in Table 6. Table 6 shows the different results for IMDb datasets, although the score difference is small between Expon and Wald with AIC.
The above results answer Q1: our approach gives the same best approximate distribution as the KS score on Amazon datasets. However, we get varying results on the small and homogeneous IMDb datasets. MSE gives Wald on the ID1 and ID2 datasets and Expon on ID3. Meanwhile, AIC gives Expon on the ID1 and ID3 datasets and Wald on ID2. Those results differ from the results of the KS score, which gives InvGauss on ID1 and ID2, and Wald on ID3. Furthermore, we feed the results of distribution identification to the model generation step.
TABLE 7. Performance of models on Amazon datasets with the ACV and AWS sampling methods. The performance is measured by the average of MAE followed by the standard deviation in the parentheses. We use the result of CTR-ReLU as the acceptance threshold of models.
According to the results in Table 6, the best approximate distribution on Amazon datasets is Wald, whose effect may depend on the machine learning method. We also implement Expon, the second-best approximate distribution, instead of the parent Gamma to investigate the effect of the distribution on model performance. Meanwhile, we implement Wald/InvGauss and Expon, which provide the best approximate distribution on IMDb datasets.

C. MODEL PERFORMANCE
Here, we show that a model adapted to an unsuitable distribution tends to give an unacceptable and suboptimal result. We first develop the model that adapts to the best approximate distribution, as in Table 6. We then compare the results of this model and those of the baseline models with the average MAE of CTR-ReLU, the acceptance threshold mentioned in Subsection V-A. Subsequently, we check the impact of the sampling methods.
Table 7 shows that the models adapted to the best approximate distribution always give an acceptable result on Amazon datasets, particularly when implemented in LM (LWA) and CNN (CWA). All miscellaneous and LM-based models give acceptable results when evaluated with AWS on the AD1 dataset, where the average MAE is under the threshold, that is, the average MAE of CTR-ReLU. Meanwhile, with ACV, we find that only the SGD, RFR, GBR, ETR, and LWA models provide acceptable results. Models built in XGB and CNN, except CSE, also provide acceptable results on AD1. However, fewer models (LWA, CTR, CEX, and CWA) give acceptable results on the AD2 and AD3 datasets. Overall, CWA gives the best results on Amazon datasets, as shown in Table 7 and Fig. 4. These results follow the pattern of the distribution identification with all approaches, as shown in Table 6: the MSE, AIC, and KS approaches give the same best approximate distribution, Wald, on Amazon datasets.
TABLE 8. Performance of models on IMDb datasets with the ACV and AWS sampling methods. The performance is measured by the average of MAE followed by the standard deviation in the parentheses. We use the result of CTR-ReLU as the acceptance threshold of models.
We also find that the LWA, CTR, CEX, and CWA models give acceptable results on IMDb datasets, as shown in Table 8 and Fig. 5. On ID1 and ID2, CWA gives the best result when evaluated with ACV. On the smallest dataset, ID3, CEX achieves the best result with AWS. These results follow the same pattern as the best approximate distribution by the MSE score, as shown in Table 6.
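The acceptance rule used in this subsection can be sketched as follows. This is a minimal illustration, assuming that a model counts as acceptable when its average MAE over the evaluation rounds is strictly below the average MAE of CTR-ReLU; the helper names and the per-round MAE values are hypothetical, and the use of the sample standard deviation is also an assumption.

```python
from statistics import mean, stdev


def mae(y_true, y_pred) -> float:
    """Mean absolute error between true and predicted helpful votes."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def summarize(round_maes):
    """Average MAE and (sample) standard deviation over evaluation rounds."""
    return mean(round_maes), stdev(round_maes)


def is_acceptable(model_maes, threshold_maes) -> bool:
    """Acceptable if the model's average MAE is under the threshold's average MAE."""
    return mean(model_maes) < mean(threshold_maes)


# Hypothetical per-round MAE values, not taken from Tables 7 or 8.
ctr_relu_maes = [0.21, 0.19, 0.20]
candidate_maes = [0.15, 0.17, 0.16]

print(summarize(candidate_maes), is_acceptable(candidate_maes, ctr_relu_maes))
```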
On Amazon datasets, the best model, CWA, achieves its best result when evaluated with AWS. In addition, on the two largest datasets, AD1 and AD2, model evaluation with AWS has a significant impact, as shown in Figs. 4(a) and 4(b). However, on the AD3 and IMDb datasets, model evaluation with ACV shows no significant difference compared to AWS, as shown in Figs. 4(c) and 5.
Furthermore, we analyze the effects of the distributions to which the model is adapted, the sampling methods used, and the time consumed by the model.

D. EFFECT OF DISTRIBUTION
We show the effect of distribution on the model performance in Figs. 6 to 8 and answer Q2 in Section I. We also show the improvement in the model performance when models follow the identified approximate distribution in Table 9.
LWA, the model adapted to the best approximate distribution, gives acceptable results on all datasets. In addition, the performance of the models built in LM follows the rank of the identified distributions, as shown in Table 9. The results on Amazon datasets shown in Figs. 6(a) to 6(c) follow the MSE, AIC and KS scores as in Table 6. On the other hand, the results on IMDb datasets, as shown in Figs. 6(d) to 6(f), follow KS on all datasets, MSE (ID1 and ID2), and AIC (ID2).
The effect of the identified distribution used in XGB is clearly shown in Fig. 7. However, distribution-adapted models have only a slight effect on AD1. Still, overall, the XEX and XWA models give a consistent effect according to the identification rank of the distributions, as shown in Table 6, especially with the KS score. Those results also follow the rank by MSE (except on ID3) and AIC (except on ID1 and ID3). Table 9 confirms those results.
While the model performance in LM and XGB fully follows the rank by the KS score, the model performance in CNN entirely follows the rank by MSE. Tables 7 and 8 show that the distribution-adapted models give a significant drop in MAE compared to CSE. Fig. 8 shows that CEX and CWA also perform better than CTR, as in Table 9, except on AD1 and AD2 when evaluated with AWS. On AD1 and AD2, CTR performs better than CEX when evaluated with AWS. Following the distribution identification with the MSE scores, as in Table 6, CEX performs the best on ID3, while CWA performs the best on the rest.
The implementation of distribution-adapted models affects LM more than the other machine learning methods, as shown in Table 9. LWA gives more than 15% improvement compared to the best model adapting to a normal distribution. The same pattern with a smaller improvement appears in XGB, as in Table 9. Implementation with AWS on ID1 gets the greatest effect. Meanwhile, the implementation with ACV on AD2 gets the least impact. The smallest effect is in CNN, with less than 1% improvement.
Overall, the implementation of the model that adapts to the distribution gives a positive improvement across all machine learning methods.

E. EFFECT OF SAMPLING METHODS
Next, we answer Q3 in Section I. We calculate the improvement of each model's performance with the AWS sampling method compared to ACV, as shown in Table 10. Table 10 shows that evaluation with AWS affects the model performance on the large Amazon datasets. AWS gives significant positive results for all models except for CSE on AD1 to AD3. On IMDb datasets, the results of ACV and AWS vary, with no significant difference. The best model on ID1 and ID2 is CWA, achieved with ACV, and on ID3 is CEX when evaluated with AWS.
Based on the above results, we can answer Q3 that the model evaluation with AWS improves the performance, especially on large datasets. Moreover, it gives an almost constant improvement in LM and XGB models.
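The improvement of one evaluation setting over another can be expressed as a relative reduction in MAE. The sketch below assumes this common definition, (baseline − candidate) / baseline × 100; the exact formula behind Tables 9 and 10 may differ, and the MAE values are hypothetical.

```python
def improvement_pct(baseline_mae: float, candidate_mae: float) -> float:
    """Relative MAE reduction in percent; positive means the candidate is better."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return (baseline_mae - candidate_mae) / baseline_mae * 100


# Hypothetical average MAE of one model under the two sampling methods.
mae_acv = 0.20
mae_aws = 0.15
print(improvement_pct(mae_acv, mae_aws))
```

A negative value under this definition would indicate that the candidate setting performs worse than the baseline.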

F. TIME CONSUMPTION
Here, we answer Q3 in Section I by comparing the time consumption of each model for each machine learning method and dataset. We calculate the time consumed from the start of training the model to obtaining the test results.
We find that all models run far faster when evaluated with the AWS sampling method on the two largest datasets, AD1 and AD2. We also find that the run time of distribution-adapted models varies depending on the machine learning method, the dataset characteristics, and the sampling method. The details are shown in Appendix C.
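The timing measure described above, from the start of training to obtaining the test results, can be sketched with a small wrapper. The names `timed_run`, `train_fn`, and `test_fn` are hypothetical placeholders for whatever training and testing routines a given model uses.

```python
import time


def timed_run(train_fn, test_fn):
    """Run training then testing; return the test result and elapsed seconds."""
    start = time.perf_counter()
    model = train_fn()            # training phase
    result = test_fn(model)       # testing phase
    elapsed = time.perf_counter() - start
    return result, elapsed


# Toy stand-ins for a real model: "training" returns a constant predictor.
result, seconds = timed_run(lambda: 2, lambda model: model * 3)
print(result, seconds)
```

Using a monotonic clock such as `time.perf_counter` avoids distortions from system clock adjustments during long runs.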
In LM, the best approximate distribution model, LWA, takes double the time of the fastest model to get a result, as shown in Fig. 9. LAR is the quickest model on the two largest datasets, AD1 and AD2, while SGD is the fastest on the rest. We also find that the model evaluation with AWS reduces the run time by more than 60% compared to ACV on AD1 and AD2. On the rest, using AWS has no significant effect on the time consumed to run the model.
In XGB, the best approximate distribution-adapted model, XWA, spends less time than the other models on IMDb datasets, as shown in Figs. 10(d) to 10(f). On Amazon datasets, XWA spends more time than XSE, as shown in Figs. 10(b) and 10(c), except on AD1 with AWS. Moreover, using AWS reduces the time consumption by up to 75% on AD1 and 30% on AD2.
CNN is the most time-consuming machine learning method, as shown in Figs. 11(a) to 11(c), taking almost 10 times as long as XGB and 20 times as long as LM on Amazon datasets. However, on IMDb datasets, CNN models run faster than XGB, as shown in Figs. 11(d) to 11(f). Among CNN models, CWA and CEX are level with the fastest models on all datasets, except on AD2 and ID1, as shown in Figs. 11(a), 11(c), 11(e) and 11(f). On AD2, CWA and CEX take slightly longer than CSE, but are level with CTR, as shown in Fig. 11(b). Meanwhile, Fig. 11(d) shows that CEX reaches the top, while CWA consumes slightly more time than CSE and CTR. Consistent with LM and XGB, AWS reduces the time consumption by more than 60% on AD1 and AD2 for all models.

G. DISCUSSION
Previous studies commonly use the helpfulness rating as a dependent variable since their datasets, such as Amazon.com 2014 [37], have helpful and total votes to measure helpfulness. Moreover, their focus is on model factors' contribution to helpfulness, and many factors appear as independent variables in Table 1. So, we cannot make a direct comparison with previous research.
In this research, we use Amazon.com 2018 [22], an updated version of Amazon.com 2014 [37] that has dropped total votes. As a result, we use the helpful votes as the dependent variable, also on the IMDb dataset [23]. To make a comparison with the state-of-the-art helpfulness rating prediction models [9], [30], we rerun them on the helpful votes of the Amazon.com [22] and IMDb [23] datasets. Our proposed framework, especially with the LM and CNN approaches, does not perform poorly compared to previous studies, as shown in Table 11.

VI. CONCLUSION AND FUTURE WORK
This study discussed the benefits of checking the distribution of helpful votes of reviews in a dataset. The experiments showed that the distribution of helpful votes significantly affects the model performance. The performances consistently follow the rank of the distribution identification results, especially when implementing LM and XGB.
The experimental results illustrated that the helpful votes do not statistically follow a continuous distribution. Meanwhile, MSE and AIC consistently show that Wald is the best approximate distribution of the helpful votes on Amazon datasets. This result follows the calibrator of the KS score. On the other hand, the best approximate distribution varies among Expon, InvGauss, and Wald on IMDb datasets. MSE and AIC give different results on ID1, where MSE gives Wald while AIC gives Expon. Both approaches give Wald and Expon for ID2 and ID3, respectively. These results do not follow KS, which gives InvGauss for ID1 and ID2, and Wald for ID3.
Models adapting to the Wald distribution are significantly improved compared to the other models, following the best approximate distribution on Amazon datasets. On IMDb datasets, the Wald distribution-adapted model, CWA, gives the best result on ID1 and ID2, while the Expon distribution-adapted model, CEX, gives the best result on ID3. Those results follow the same pattern as the distribution identified by the MSE score.
When predicting the number of helpful votes, it is important to take into account the time elapsed since the review was posted. The experiments showed that model evaluation with AWS, an adjusted window size in the TBS method, has a greater effect with much less training data than using ACV, especially on large and medium-size datasets. Moreover, AWS provides slightly better results on small-size datasets. In addition, AWS also positively affects model performance on average when implementing LM, XGB, and CNN.
Evaluation with the AWS sampling method on the two large datasets, AD1 and AD2, decreases the time consumption significantly for all models. On the other hand, the time consumed by a model that follows the best approximate distribution to produce results is variable. It depends on the machine learning method, the characteristics of the dataset, and the sampling method. In CNN, normal distribution-adapted models run faster: CSE on the two large datasets, AD1 and AD2, and CTR on the two small datasets, ID2 and ID3. However, models adapting to the Expon and Wald distributions take nearly as long as the fastest model. In XGB, Wald distribution-adapted models take a similar time on Amazon datasets and are even faster than models adapting to other distributions on IMDb datasets. On the other hand, LWA takes double the time of the quickest model among the LM and miscellaneous models. However, it still takes under a minute on the largest dataset and under a second on IMDb datasets.
The best approximate distribution is identified by measuring the distribution of the helpful votes of all reviews in each dataset. Adaptively changing the distribution identification on the training set will be challenging. Moreover, there is a minor difference in the order of the AIC, MSE, and KS scores on InvGauss and Gamma. Investigating the effect of the metric on other datasets is left as a further task. Considering the advantage of RNN-based models on sequential data, developing an RNN-based model for helpful vote prediction is also a remaining challenge.

APPENDIX C THE TIME CONSUMPTION
See Tables 12-13.