Identifying Consumer Preferences From User-Generated Content on Amazon.Com by Leveraging Machine Learning

Inexperienced consumers may have high uncertainty about experience goods that require technical knowledge and skills to operate effectively; therefore, experienced consumers’ prior reviews can be useful for inexperienced consumers. However, one-sided review systems (e.g., Amazon) only provide the opportunity for consumers to write a review as a buyer and contain no feedback from the seller’s side, so the information displayed about individual buyers is limited. Therefore, this study analyzes consumers’ digital footprints (DFs) for programmable thermostats to identify and predict unobserved consumer preferences, using a dataset of 141 million Amazon reviews. This paper proposes novel approaches (1) to identify unobserved consumer characteristics and preferences by analyzing the target consumers’ and other prior reviewers’ DFs; (2) to extract product-specific product content dimensions (PCDs) from review text data; (3) to predict individual consumers’ sentiment before they make a purchase or write a review; and (4) to classify consumers’ sentiment toward a specific PCD by using context-based word embedding and deep learning models. Overall, the approach developed in this paper is applicable, scalable, and interpretable for distinguishing important drivers of consumer reviews for different goods in a specific industry and can be used by industry to design customer-oriented marketing strategies.


I. INTRODUCTION
In recent years, big data analysis has experienced remarkable growth. This growth has been fostered by innovations in computation performance and remarkable successes with artificial intelligence (AI) algorithms. Additionally, these advances have benefitted from increasing volume, diversity, and value of the data.
There are two types of big data: structured data (which have a well-defined data type) and unstructured data (which lack a well-defined data type, such as image, voice, video, and text). Online product reviews generated by consumers contain both structured and unstructured data. For example, while consumers' product star ratings fall into the category of structured data, their written reviews are unstructured data. User-generated online product review data are free, easy to access, and can provide useful information for inexperienced consumers because they contain feedback from actual consumers who reveal their preferences for products. Such data are quite different from the feedback provided by focus groups or experts.
The associate editor coordinating the review of this manuscript and approving it for publication was Liangxiu Han.
When a consumer purchases a product through the online retail market, there is uncertainty about the quality of the product because the consumer is not in physical contact with it. By leveraging the information from prior review data, inexperienced consumers can reduce their search cost and uncertainty about product quality. Firms can also employ user-generated review content to estimate individual consumer preferences, needs, satisfaction, and complaints and to design, develop, and promote new products. For example, Timoshenko and Hauser [1] demonstrated how to identify consumer needs from user-generated review text on Amazon.
Liu et al. [2] suggest that review data are more likely to be influential for consumers when the product group has more competition, a shorter product history, and weaker brand power. Accordingly, inexperienced consumers may have high uncertainty about experience goods when new innovative firms enter the market. Consumers' uncertainty about product quality is relatively higher when they purchase an experience good than when they purchase a search good, because they may not know the product quality before they make a purchase. This study investigates Amazon's reviews for a specific experience good (programmable thermostats) that requires technical knowledge and skills to install, set up, program, and use. Products that require technical knowledge and skills to operate (e.g., thermostats) can be difficult for consumers to evaluate before purchase or even after adequately installing the product. Thermostats require time for purchasing consumers to assess the suitability for their needs because people usually do not know their real-time energy consumption, the cost, and the amount of energy saving that a new thermostat can provide in the early stages of thermostat usage. This means that thermostat consumers typically face high uncertainty, and ease of usage and consumer support services are essential for inexperienced consumers to mitigate their concerns and difficulties.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In addition, programmable thermostats (PTs) are not frequently purchased, and malfunctioning could cause flaws in other connected devices, additional repair costs, and physical discomfort. Further, the frequency of and exposure to thermostat advertisements are relatively lower than in other popular research subject products (e.g., movies, music, and books), so the sources of information on thermostats' product quality are less diverse than those on books, music, and movies.
Consumer uncertainty may be higher than normal when disruptive innovation happens because innovative new firms (e.g., Nest) enter the market, introduce innovative products (e.g., Wi-Fi thermostats that can provide remote access and control), and compete with the incumbent firms (e.g., Honeywell). Nest entered the market by releasing the first generation of its learning smart thermostat on October 25, 2011, and it has been available to purchase from Amazon since December 15, 2011. Nest released the second generation on October 2, 2012, and it was available from Amazon on the day of release. Nest's first learning thermostat is an example of disruptive innovation and the Internet of Things (IoT) for smart homes [3], [4]. In this regard, inexperienced consumers may have high uncertainty not only due to the required technological knowledge and skills but also due to changes in the market structure and competition. The combination of these factors makes thermostats an ideal subject for studying the utility of online reviews to consumers: thermostats can be technically challenging, are of high importance to a home, and offer potential buyers few avenues for gaining experience or information prior to purchase.
There are a number of ways to include preferences in models of consumer choice. Revealed-preference methods reflect the actual consumer choices in a real-life situation, while stated-preference methods reflect respondents' hypothetical choices in a well-designed survey or field experiment [5]. Prior studies have widely applied both revealed- and stated-preference methods to estimate consumer preferences. However, these methods may not be applicable for studying online reviews' effects on consumer preferences for technical products such as thermostats.
One-sided review systems, like the one used by Amazon, only give buyers the opportunity to write a review, which they can do without any fee [6], [7]. However, the information displayed about reviewers is limited. Consequently, the conventional revealed- and stated-preference methods cannot be used to directly identify unobserved consumer characteristics and preferences from reviews.
This study identifies unobserved consumer characteristics and preferences by extracting: (1) users' and prior other reviewers' digital footprints (DFs) from user-generated content (UGC) and (2) consumers' sentiment toward product content dimensions (PCDs) from review text data. This study defines this approach as the user-generated-preference (UGP) method.
Consumer review and product-specific review data (142.8 million reviews) from He and McAuley [8], gathered between May 1996 and July 2014, are used to generate DFs. In addition, this study identifies consumers' sentiment toward product content dimensions (PCDs) extracted from review text by applying topic modeling and domain expert annotations, while excluding questionable reviews (posted by ''suspicious one-time reviewers'' and ''always-the-same rating reviewers'').
After the data preprocessing is discussed, the following three questions are investigated:
1. Can consumers' preferences be identified through the analysis of digital footprints?
2. Can consumers' sentiment be predicted before they make a purchase or write a review?
3. Can consumers' sentiment toward a specific PCD in the review text be classified?
This paper obtains three main results. First, the author finds that the factors that affect consumer ratings are: (a) users' DFs (e.g., average rating across all categories), (b) reviewers' attitudes toward eight product content dimensions (smart connectivity, easiness, energy saving, functionality, support, price value, privacy, and Amazon's service quality effect), and (c) other prior reviewers' DFs (e.g., length of the review summary). Second, extreme gradient boosting (XGBoost) is found to obtain the highest performance for predicting the sentiment of potential consumers before they make a purchase or write a review. Third, a convolutional neural network (CNN) on top of Bidirectional Encoder Representations from Transformers (BERT) embedding shows the highest performance for classifying consumers' sentiment toward a specific PCD.
These findings will potentially be helpful for firms to identify consumer preferences, predict potential consumer sentiment, extract product content dimensions for a specific product group from review text, and classify consumers' sentiment toward a specific product content dimension. Firms often want to know potential individual consumers' preferences concerning target product groups in a specific industry (e.g., thermostats) instead of a general product category level (e.g., books). Better short-term predictions of potential consumers' preferences for industry-specific product groups may also help firms to improve their business decisions.
Section 2 describes the prior literature. Section 3 presents the data pre-processing for cleaning noisy reviews and extracting target reviewers' sentiment toward the product content dimensions. Section 4 describes the discrete choice analysis. Section 5 demonstrates the ex ante prediction of potential consumers' sentiment. Section 6 shows the sentiment classification of a specific product content dimension. Finally, Section 7 offers conclusions.

II. LITERATURE REVIEW
Many previous studies have focused on the impact of reviews on sales [2], [7], [9]-[15]. Most studies have used summary statistics of aggregated review data at the product level (e.g., the average rating for a product, the volume of reviews for a product, and the average review length for a product).
On an individual level, Liu et al. [2] extracted product content dimensions from individual review text by using topic modeling. The authors demonstrated the classification of each product content dimension by using deep learning and measured the effect of each product content dimension on sales. Further, Timoshenko and Hauser [1] identified consumer needs from individual review text by using deep learning.
One possible challenge of using online review data is potential noise, bias, or promotional reviews [16]. As shown in Table 1, some previous studies have investigated the impact of ownership, reputation, and market competition on firms' incentives to write a promotional review by analyzing aggregated product level summary data [14], [17].
In contrast to previous research, which has used Amazon's online reviews for general experience goods (e.g., books, DVDs, and music), this study investigates Amazon's online reviews for a specific experience good (programmable thermostats).
To the best of the author's knowledge, there is little current research that addresses how to: (1) identify potential suspicious one-time or always-the-same rating reviewers; (2) estimate unobserved individual reviewers' characteristics from user DFs; (3) evaluate the effect of prior other reviewers' DFs on the target reviewers' ratings; (4) extract latent product content dimensions from review text; (5) predict potential consumers' sentiment before they make a purchase or write a review; and (6) classify reviewers' sentiment toward a product content dimension in the review.

III. DATA PRE-PROCESSING
This study aims to estimate consumer preferences for the group of Amazon users who write a review by using the review data written by this group while excluding biased reviews (Appendix). Therefore, this paper implements the following data pre-processing steps (Appendix): Step 1: selecting reviews with no missing values; Step 2: removing ''suspicious one-time reviewers'' and ''always-the-same-rating reviewers''; Step 3: deleting reviewers and reviews for products with no digital footprints (DFs); Step 4: selecting the top 6 of 26 brands; Step 5: identifying five product content dimensions (PCDs) in the review text using LDA; and Step 6: modifying the PCDs by leveraging a domain expert's knowledge.
Zhao et al. [18] indicated that fake reviews increase consumers' uncertainty about products and that more believable online reviews of experience goods have a larger effect on consumer choice. Some firms may write positive reviews about their products and negative ones about their rivals' products [14], [17], [19]. Accordingly, deleting potential suspicious reviews during pre-processing is essential to improve the credibility of reviews and reduce consumer uncertainty.
Mayzlin et al. [14] defined the ''suspicious reviewer'' as one who writes a review for a hotel for the first time only during the sample period (October 2011) and showed that their rating distribution is more polarized than that of the entire sample. This study takes suspicious reviewers into account by accessing individual reviewers' prior reviews in different categories over the entire sample period. A ''suspicious one-time reviewer'' is defined as one whose first review is for a programmable thermostat (PT) and who does not write reviews for any other products over the entire sample period.
Some reviewers always give a star rating at the same level for all reviewed products in all categories, so their reviews may contain self-selection bias. However, it is possible that the reviewers give the same rating level because the number of reviews is simply small. In this study, an ''always-the-same-rating reviewer'' (ASR) is a reviewer who writes more than 8 reviews, all with the same rating level. Only 69 reviewers write more than 8 reviews at the same star rating level (5 stars), and these reviewers' 69 reviews for PTs are removed.
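The two reviewer filters described above can be sketched as follows. This is a minimal illustration rather than the author's pre-processing code: the function name `flag_reviewers`, the tuple layout of `reviews`, and the category label `"thermostat"` are hypothetical, and a one-time reviewer is read here as a reviewer with no reviews outside the target product group.

```python
from collections import defaultdict

def flag_reviewers(reviews, target_category="thermostat", asr_threshold=8):
    """Return (one_time, always_same) sets of reviewer ids.

    reviews: iterable of (reviewer_id, category, star_rating) tuples
    covering all categories over the whole sample period (assumed layout).
    """
    by_reviewer = defaultdict(list)
    for reviewer_id, category, rating in reviews:
        by_reviewer[reviewer_id].append((category, rating))

    one_time, always_same = set(), set()
    for reviewer_id, items in by_reviewer.items():
        categories = {category for category, _ in items}
        ratings = {rating for _, rating in items}
        # Suspicious one-time reviewer: nothing reviewed outside the target group.
        if categories == {target_category}:
            one_time.add(reviewer_id)
        # ASR: more than `asr_threshold` reviews, all at one star level.
        if len(items) > asr_threshold and len(ratings) == 1:
            always_same.add(reviewer_id)
    return one_time, always_same
```

Reviews written by reviewers in either set would then be dropped before the digital footprints are computed.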
The purpose of this study is to identify latent consumers' characteristics and preferences by analyzing DFs, so the sample group disregards reviewers and programmable thermostats containing no prior DFs. DFs from earlier reviewers (crowd) may have the greatest effect on subsequent reviewers when the reviewer posts his or her first review. This study therefore focuses on the target reviewers' first review of a programmable thermostat. After only selecting the first review of each reviewer for the thermostat group, the total number of reviewers and their first-time reviews is 5,307, and the total number of reviews for all products (including programmable thermostats) written by these reviewers in all categories over the entire sample period is 169,809.
In contrast to previous studies using aggregated review summary statistics at the product level, this study extracts individual reviewers' digital footprints for a specific product group from a dataset of 141 million Amazon reviews. In detail, digital footprints of individual target reviewers and other prior reviewers (the crowd) are extracted from all the reviews in all categories over the entire sample period and this information is used to identify and predict latent consumer preferences and sentiment.
The review text often contains information that is useful for identifying the latent PCDs [2], each reviewer's sentiment, and the direct or indirect reasons for the star rating given. Latent Dirichlet allocation (LDA) [22] is an unsupervised learning model used to identify latent topics and the distribution of these topics in each review. Therefore, the author determines five PCDs in the review text by applying LDA.
Passonneau et al. [23] suggested that annotation by experts transfers domain knowledge to machines for better prediction performance. Accordingly, the author (the domain expert) modifies the five PCDs into nine PCDs based on domain knowledge and the purpose of the research design, and manually annotates 47,763 labeling tasks for the reviewers' sentiment toward each PCD to transfer domain knowledge to the models (Appendix).

IV. ECONOMETRIC ANALYSIS
Amazon uses five-star ratings from one to five. Reviewers' observable ratings indicate the range of their unobservable continuous preference [24] as follows:
$$R_{ipt} = \begin{cases} 1 & \text{if } U^{*}_{ipt} \le c_1 \\ 2 & \text{if } c_1 < U^{*}_{ipt} \le c_2 \\ 3 & \text{if } c_2 < U^{*}_{ipt} \le c_3 \\ 4 & \text{if } c_3 < U^{*}_{ipt} \le c_4 \\ 5 & \text{if } c_4 < U^{*}_{ipt} \end{cases}$$
where $R_{ipt} \in [1, 5]$ is reviewer $i$'s first star rating for a PT on day $t$. $U^{*}_{ipt}$ denotes the unobservable continuous utility of reviewer $i$ for product $p$ on day $t$. The unknown cutting points (thresholds) are denoted as $c_k$ with the assumption that $c_1 < c_2 < c_3 < c_4$. $U^{*}_{ipt}$ can be represented as follows:
$$U^{*}_{ipt} = x_{it}\beta + \rho\,\varepsilon_{it}$$
where $x_{it}$ indicates a vector of independent variables, $\beta$ is the corresponding coefficient vector, $\rho > 0$ is a scale function to adjust the variance, and $\varepsilon_{it}$ is a homoskedastic error term following a standard normal distribution [25], [26]. Hu et al. [21] showed that the star rating distribution of some experience goods (books, DVDs, and videos) follows a bi-modal distribution on Amazon.
The frequency of observed star ratings (from 1 to 5 stars) in this study follows a bi-modal distribution, which is a non-normal distribution. However, the cutting points adjust each rating probability (following a normal distribution) to match the observed rating distribution [27].
The ordered probit (OP) model assumes that $\rho = 1$, so there is no scaling effect on the underlying preferences. Some researchers have studied or applied heteroskedasticity to ordered response models [25], [28]-[33].
In contrast to linear regression models, the existence of latent heteroskedasticity will cause inconsistency in the maximum likelihood estimators of OP models [27]. The heteroskedastic ordered probit (HETOP) model assumes its scaling function to be $\rho_i = \exp(Z_{it}\gamma)$, where $Z_{it}$ denotes the regressors for the scaling function and $\gamma$ are unknown coefficients for $Z_{it}$. In addition, the variables in $x_{it}$ can overlap with those in $Z_{it}$; therefore, $x^{a}_{it}$ denotes the variables involved in both $x_{it}$ and $Z_{it}$ while $x^{b}_{it}$ denotes the variables that only belong to $x_{it}$. Unknown parameters are estimated through the maximum likelihood estimation (Appendix).
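To make the role of the cutting points and the scaling function concrete, the following sketch computes the five rating probabilities for one reviewer. It assumes the common specification in which the latent utility is a linear index plus a normal error scaled by exp(Z·γ); the function names and the numeric inputs are illustrative and are not taken from the paper's estimates.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rating_probabilities(xb, cutpoints, log_scale=0.0):
    """P(R = 1), ..., P(R = K) for an ordered probit with scale exp(log_scale).

    xb        : linear index x*beta for one reviewer
    cutpoints : increasing thresholds c_1 < ... < c_{K-1}
    log_scale : Z*gamma; 0.0 gives the plain ordered probit (scale = 1)
    """
    rho = math.exp(log_scale)
    # Cumulative probabilities P(R <= k), padded with 0 and 1 at the ends.
    cdf = [0.0] + [normal_cdf((c - xb) / rho) for c in cutpoints] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]

# Illustrative values only: four thresholds give five star-rating probabilities.
op_probs = rating_probabilities(xb=0.8, cutpoints=[-1.5, -0.5, 0.5, 1.5])
hetop_probs = rating_probabilities(xb=0.8, cutpoints=[-1.5, -0.5, 0.5, 1.5], log_scale=0.5)
```

With a larger scale, more probability mass moves toward the tails, which is one way a heteroskedastic specification can accommodate polarized ratings.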
This study assumes that the reviewers' different prior review experiences and patterns reflect their unobserved characteristics and preferences. The variables are divided into ''at time'' variables extracted from DFs at $t_i$; ''user DF'' variables extracted from reviewer $i$'s prior reviews across all categories by $t^{b}_{i}$ or at $t^{b}_{i}$; and ''crowd DF'' variables extracted from the reviews written by other prior reviewers on the PT by $t^{b}_{j \neq i}$ or at $t^{b}_{j \neq i}$. The number of prior reviews written by $i$ in each subcategory by $t^{b}_{i}$ is denoted as ''sum_+ subcategory name'', and 32 subcategories are defined by merging similar subcategories during the pre-processing. The category diversity is the Shannon index, for which higher values mean that reviewer $i$ writes reviews in subcategories with greater diversity by $t^{b}_{i}$ (Appendix). The digital footprints (DFs) and sentiment variables in this study are defined in Table 3.
As can be seen in Table 4, each model in this section contains a different combination of variables to identify the effects of DFs, sentiments, prices, and the volume of prior reviews on the consumers' star ratings. In particular, the review text data are divided into ''review summary (headline)'' and ''review body''. ''Review'' in this study denotes both the review summary and the review body text. In addition, other reviewers' helpfulness votes for reviewer $i$'s review after $t_i$ are an ex post variable that does not affect the reviewers' star rating at $t_i$; therefore, this study disregards helpfulness votes for reviews after $t_i$.
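The category diversity variable can be illustrated with a short sketch of the Shannon index computed from a reviewer's prior review counts per subcategory. The paper defers the exact definition to its Appendix, so the use of the natural logarithm and the function name `shannon_index` are assumptions made here.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_k * ln p_k) over subcategory shares p_k.

    counts: number of prior reviews in each subcategory (zeros are allowed).
    """
    total = sum(counts)
    if total == 0:
        return 0.0  # no prior reviews, so no diversity to measure
    h = 0.0
    for n in counts:
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    return h

# A reviewer who spreads reviews evenly scores higher than one who concentrates them.
even = shannon_index([5, 5, 5, 5])
concentrated = shannon_index([17, 1, 1, 1])
```

The index is zero when all prior reviews fall in one subcategory and reaches its maximum, ln(K), when reviews are spread evenly across K subcategories.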
Omitted variables and the existence of heteroskedasticity may cause inconsistency of parameters in OP models [27]. The models in this section contain the variables extracted from DFs and review text data to reduce the omitted variable problem.
The misspecification of the variation function in HETOP models leads to biased parameters [30]. The author compares the empirical results between the HETOP and the OP models with different sets of regressors to check the variation function's misspecification in the HETOP models. The notation ''model_o'' indicates an OP model and ''model_h'' indicates a HETOP model. Model_o1 is the base model, which contains only observable variables at $t_i$.
The Akaike information criterion (AIC) and Bayesian information criterion (BIC) penalize the number of parameters (and, for the BIC, also account for the sample size), so that the more parsimonious model is favored at a given likelihood [27]. A smaller AIC or BIC value means a better model fit. All the HETOP models show better model fits than the OP models with the same set of regressors (Appendix). All the HETOP models also show the existence of heteroskedasticity in the likelihood ratio test.
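As a brief illustration of how such comparisons are computed, the sketch below uses the standard textbook definitions of the AIC, the BIC, and the likelihood ratio statistic. The log-likelihoods and parameter counts are made-up placeholders, not values from the paper's tables.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

def lr_statistic(ll_restricted, ll_unrestricted):
    """Likelihood ratio statistic for nested models (e.g., OP nested in HETOP)."""
    return 2 * (ll_unrestricted - ll_restricted)

# Hypothetical numbers: an OP model and a HETOP model that adds scale parameters.
n_obs = 5307
op_ll, op_k = -5200.0, 30
hetop_ll, hetop_k = -5100.0, 55

prefer_hetop_by_aic = aic(hetop_ll, hetop_k) < aic(op_ll, op_k)
prefer_hetop_by_bic = bic(hetop_ll, hetop_k, n_obs) < bic(op_ll, op_k, n_obs)
lr = lr_statistic(op_ll, hetop_ll)  # compared against a chi-square with hetop_k - op_k d.f.
```

Because the BIC penalty per parameter is ln(n) rather than 2, it is the stricter criterion for any sample with more than about seven observations.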
Surprisingly, the models with price (at the time of web scraping) variables show a lower model fit than the models without price variables. Product prices on Amazon frequently change due to promotions, memberships, and other factors, so the actual price of reviewed products may often differ from the price at the time of web scraping. Further, the actual price at the time of purchasing could be different from the price at the time of writing a review. This price gap between the actual price and the price at the time of web scraping might be a source of inherent bias in the price variables. This study uses the reviewers' sentiment toward the perceived price value dimension as a sentiment variable.
In detail, the sign of the coefficients for variables in OP models reflects the sign of the marginal effect for the extreme star ratings ($R_{ipt} = 5$ and $R_{ipt} = 1$). In the HETOP models, the sign of the coefficients for $x^{a}_{it}$ variables (those involved in both $x_{it}$ and $Z_{it}$) reflects the sign of the marginal effects for the $x^{a}_{it}$ variables for the extreme ratings. However, the sign of the coefficients for $x^{b}_{it}$ variables (those that only belong to $x_{it}$) does not directly reflect the sign of the marginal effects for any star rating. In this study, all the variables in the HETOP models are $x^{a}_{it}$ variables, excluding six $x^{b}_{it}$ variables consisting of the reviewer's average star rating by $t^{b}_{i}$ and five brand dummies.
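For the plain OP case, the link between coefficient signs and marginal effects at the extreme ratings can be written out explicitly. The derivation below is the standard ordered probit result and assumes $\rho = 1$, a linear index $x_{it}\beta$, the conventions $c_0 = -\infty$ and $c_5 = +\infty$, and $\Phi$ and $\phi$ for the standard normal CDF and density; these symbols are introduced here for illustration.

```latex
\begin{align}
P(R_{ipt} = k \mid x_{it}) &= \Phi(c_k - x_{it}\beta) - \Phi(c_{k-1} - x_{it}\beta), \\
\frac{\partial P(R_{ipt} = k \mid x_{it})}{\partial x_{it,j}}
  &= \left[\phi(c_{k-1} - x_{it}\beta) - \phi(c_k - x_{it}\beta)\right]\beta_j, \\
\frac{\partial P(R_{ipt} = 1 \mid x_{it})}{\partial x_{it,j}}
  &= -\phi(c_1 - x_{it}\beta)\,\beta_j, \qquad
\frac{\partial P(R_{ipt} = 5 \mid x_{it})}{\partial x_{it,j}}
  = \phi(c_4 - x_{it}\beta)\,\beta_j .
\end{align}
```

Since $\phi > 0$, the marginal effect on a five-star rating has the same sign as $\beta_j$ and the effect on a one-star rating has the opposite sign, whereas for the interior ratings the sign depends on the difference of the two densities.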
The interpretations for the most satisfied consumers (five-star reviewers) are based on statistically significant variables in model_h2 (the main model for interpretation) and model_h4 (the model for interpretation of the volume of prior reviews in each subcategory).
Based on the user DF variables in model_h2, the probability that a reviewer will give a five-star rating to the reviewed PT will decrease if the reviewer writes a longer review summary or body and has a greater volume of prior reviews in all categories.
In contrast, the probability of a reviewer giving a five-star rating will increase if the reviewer has a higher variance of review summary length in prior reviews. In addition, the reviewer's average star rating in prior reviews has a positive influence on the probability of the reviewer giving a five-star rating. Even though the direct economic interpretation is limited, the coefficient of the reviewer's average star rating is the largest among the statistically significant variables in model_h2.
With other prior reviewers' DF variables in model_h2, the probability of a reviewer giving a five-star rating for a PT increases with increased variability in length of prior review summaries for the PT.
In contrast, the probability that a reviewer will give a five-star rating for a PT decreases as the average length of the prior review summary increases. Chevalier and Mayzlin [11] suggested that the statistical significance of the review length variable indicates that consumers read the text in the reviews. Here, this point suggests that a reviewer who gives an extreme rating (a 1-star or 5-star rating) may respond to prior reviewers' review summaries.
Based on the reviewers' sentiment toward product content dimensions (PCDs) extracted from the review text, the probability of a reviewer giving a five-star rating increases if the reviewer has a positive attitude toward the ''smart connectivity,'' ''easiness,'' ''energy saving,'' ''functionality,'' ''support,'' ''price value,'' ''privacy,'' and ''Amazon effect'' dimensions. The results of the sentiment variables indicate that consumers prefer ''smarter'' and ''easier-to-use'' PTs. In addition, these consumers prefer PTs made by firms that provide better support for consumers. Therefore, firms need to consider not only developing smarter products but also making them easier for consumers to use with better consumer support programs.
This same group of consumers also considers a PT's energy saving capacity, functionality, and perceived price value. Interestingly, privacy also affects these consumers' preferences, as they may be concerned about the information stored and transmitted by wireless smart thermostats. Firms may need to mitigate consumers' concerns about their privacy with respect to energy consumption and life pattern data.
To the best of the author's knowledge, this is the first study to investigate the effect of online retail market service quality on consumers' sentiments. Amazon's better service quality (such as faster delivery, better consumer service, and flexible refund policy) may increase the probability of a reviewer giving a five-star rating. This service-related result supports the idea that online retail market service quality may influence consumers' preferences as well. Therefore, without considering the effect of online market service quality on the reviewers, the estimation of consumer preferences may lead to upward or downward bias. In contrast with the service dimension, the ''environmental friendliness'' dimension proved to be statistically insignificant.
Model_h4 contains thirty-two variables for the volume of prior reviews in each subcategory instead of the single volume of prior reviews in all categories used in model_h2. The results of model_h4 indicate that the probability of a reviewer giving a five-star rating increases if the reviewer has written a larger volume of prior reviews for products in the ''appliance'' and ''health care and personal care'' categories by $t^{b}_{i}$. For example, reviewers who have a high volume of prior reviews for products in the ''appliance'' category might have more technical knowledge and experience with hardware devices. In addition, thermostats are home energy control devices designed to keep the ideal temperature for consumers' comfort within their homes, so consumers who have a greater volume of prior reviews for products in the ''health care and personal care'' category may have better knowledge related to thermostats.
In contrast, the probability of a reviewer giving a five-star rating decreases if the reviewer writes a higher volume of reviews for products in the ''Amazon instant video,'' ''apps,'' ''cell phones,'' ''clothes,'' ''groceries,'' ''magazine subscriptions,'' and ''pet supplies'' categories. While these data-driven interpretations are subjective, they do show how to use DFs to identify latent consumer characteristics.

A. MARGINAL ANALYSIS
Generally, marginal effect analysis is an appropriate way to interpret each parameter in OP models due to non-linearity. Table 5 shows the marginal effect of key variables (model_h2) at the average value of one company's reviewers (Nest, during June 2014).
The sign of the marginal effect of the $x^{a}_{it}$ variables for the extreme ratings is the same as the sign of the coefficient of those variables in model_h2. Likewise, the average star rating of the reviewers by $t^{b}_{i}$ (the only continuous $x^{b}_{it}$ variable) shows the same sign as the coefficient of this variable for the extreme ratings in model_h2. In contrast, the marginal effect of binary dummy variables for each brand (dummy type of $x^{b}_{it}$) shows different signs from the coefficient for these dummies over the star ratings.
In terms of other prior reviewers' (crowd) DF variables, the brand dummy variables show different patterns of marginal effects for each star rating. The marginal effect of the Nest brand dummy shows a negative influence on the probability of a reviewer giving a five-star rating; otherwise, it shows a positive influence on the probability of the reviewer's other star ratings. Increasing the crowd's average length of review summary for the PT will decrease the probability of the reviewer giving a five-star rating. In contrast, increasing the crowd's variance of the review summary length for the PT will increase the probability of a five-star rating.
In terms of the reviewers' sentiment toward the nine PCDs, eight sentiment variables are statistically significant, while the environmental friendliness dimension is not. The sentiment variables show a positive relationship with the probability of a five-star rating; however, the sentiment variables have a negative relationship with the other star ratings. If a reviewer has more positive sentiment toward smart connectivity, easiness, energy saving, functionality, support, price value, and privacy for programmable thermostats and toward Amazon's service quality, the probability of giving a five-star rating will increase while the probabilities of the other star ratings will decrease.

B. ROBUSTNESS
All the models containing digital footprints (DFs) and sentiment variables show a much better model fit than the base model_o1 (which contains only observable variables at $t_i$). Nonetheless, latent omitted variable bias is still a concern because a one-sided review system cannot provide actual socio-demographic information about the reviewers.
To account for potential omitted variable bias, the robustness test in this study follows the approach of Mayzlin et al. [14].
The first step is to compare the coefficients of the key variables between the model without control variables (the base model) and the model with control variables (the control model). If the signs of the coefficients for the key variables are the same and the magnitudes of the coefficients for the key variables are similar between the base and the control model, the effect of omitted variables on the coefficients of the key variables may be relatively small. In this case, the omitted variable problem might be negligible for estimating the coefficients of the key variables.
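The sign-and-magnitude comparison described above can be sketched as a small helper. The paper does not state a numeric criterion for ''similar'' magnitudes, so the 25% relative tolerance, the function name, and the coefficient values below are purely illustrative assumptions.

```python
def compare_coefficients(base, control, rel_tol=0.25):
    """Compare key-variable coefficients between a base and a control model.

    base, control : dicts mapping variable name -> estimated coefficient
    rel_tol       : hypothetical cutoff for calling two magnitudes "similar"
    """
    report = {}
    for name in sorted(base.keys() & control.keys()):
        b, c = base[name], control[name]
        same_sign = b * c > 0
        rel_change = abs(c - b) / abs(b) if b != 0 else float("inf")
        report[name] = {
            "same_sign": same_sign,
            "rel_change": rel_change,
            "stable": same_sign and rel_change <= rel_tol,
        }
    return report

# Made-up coefficients for two key variables.
report = compare_coefficients(
    base={"avg_rating": 0.50, "summary_length": -0.20},
    control={"avg_rating": 0.55, "summary_length": -0.18},
)
```

A key variable flagged as unstable under such a check would be a candidate for closer inspection rather than proof of omitted variable bias.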
As shown in Table 6, the sign of the coefficients for the statistically significant key variables is the same in the control and the base models. The magnitudes of the coefficients for the key variables are also similar in the control and the base models. These empirical results indicate that the omitted variable problem might be lessened by adding digital footprints (DFs) and sentiment variables for each product content dimension.
Even though there is still the possibility of selection on unobservable factors, the models using DF and sentiment variables show a much better model fit than model_o1 and the same sign and similar coefficient magnitudes for key variables across the HETOP models. This similarity indicates the importance of digital footprint mining and sentiment analysis in estimating consumer preference.

V. EX ANTE PREDICTION USING MACHINE LEARNING
Increased ability to predict potential customers' level of satisfaction with a product would enable firms to better target potential positive consumers. Therefore, six different machine learning models (Appendix) are applied here to predict potential consumers' sentiment. Classification is a prediction task for a discrete dependent variable (i.e., label). For example, predicting a five-star rating from online product reviews involves multiclass classification, which is often a more difficult task than binary classification. Bouazizi and Ohtsuki [34] showed that the accuracy of sentiment classification of a balanced dataset from Twitter decreased from 81.3% in a binary classification to 60.2% in a multiclass classification with seven different sentiment classes.
As shown in Table 7, the rating distribution in this study is skewed to the positive class, so it is an imbalanced dataset. Classification of imbalanced data is a challenge in machine learning because classification results tend to be biased toward the majority class.
Class weighting is a popular approach to mitigate the imbalanced class problem [36]. In detail, class weighting puts more weight on the minority class (three-star ratings) than on the majority classes in a machine learning model's loss function, making the loss function more sensitive to the minority class and less sensitive to the majority classes. In this study, class weighting is applied to each machine learning model in this section as a hyperparameter.
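As a minimal sketch of class weighting, the snippet below computes inverse-frequency weights and plugs them into a cross-entropy loss. The inverse-frequency formula is one common heuristic and is an assumption here; the paper treats the weights as a tuned hyperparameter and does not report this formula. The function names and the toy labels are illustrative only.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * n_samples_in_class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def weighted_log_loss(labels, probs, weights):
    """Class-weighted cross-entropy, normalised by the total weight.

    probs[i] maps each class to the predicted probability for sample i.
    """
    total = sum(weights[y] for y in labels)
    loss = -sum(weights[y] * math.log(p[y]) for y, p in zip(labels, probs))
    return loss / total


# Toy labels (not the paper's data): class 2 is the minority class.
labels = [3, 3, 3, 3, 3, 3, 1, 1, 1, 2]
weights = balanced_class_weights(labels)

# A classifier that is equally unsure about every review.
uniform = [{1: 1 / 3, 2: 1 / 3, 3: 1 / 3} for _ in labels]
loss = weighted_log_loss(labels, uniform, weights)
```

With these weights, a single misclassified minority-class review contributes as much to the loss as several majority-class reviews, which is the sensitivity shift described above.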
The data used in these machine learning models is sampled from October 12, 2005 to July 17, 2014, and the total sample size is 5,307 reviews (and reviewers). This study defines the validation and test datasets with similar sample sizes (301 and 303 reviews, respectively) and time intervals (about a month). This study further assumes that the weather and seasonality are similar in the validation and test datasets.
The ex ante classification of potential reviewers' sentiment is divided into ex ante and partial ex ante classification. First, the ex ante classification is the prediction of potential consumers' sentiment before they make a purchase. In this case, firms do not know reviewers' ratings, reviews, or reviewed or purchased thermostats, so these ex post variables are excluded.
Second, the partial ex ante classification is a prediction of potential consumers' sentiment before they write a review for a purchased thermostat. In this case, firms know the types of thermostats that consumers have purchased. However, they do not know the consumers' rating and reviews for the purchased thermostats because the consumers have not posted a review yet. Therefore, reviewers' ratings and reviews are excluded from the partial ex ante model, but the programmable thermostat dummy variables are included in the partial ex ante model.
If the machine learning model is too closely fitted to the training data, the fitted model's prediction performance for new data points in the validation set will decrease. This modeling error is usually called overfitting in machine learning [37]. The optimal hyperparameter values for each prediction machine are selected when the optimal values mitigate the overfitting problems during the hyperparameter tuning process.
To avoid overfitting, the original dataset is split in the training step into a total training set and a test set, and the total training set is also divided into a training set and a validation set for hyperparameter tuning. Each machine learning model is trained on the training set and predicts new data points in the validation set. The optimal hyperparameter values are selected when the validation loss stops decreasing while the training loss keeps decreasing.
In the test set prediction step, each prediction model is also trained on the total training data with the optimal hyperparameters selected during the training step. The model trained on the total training data predicts the label in the test set. Reviewers' sentiment classification in the test set can be interpreted as predicting the strength of potential consumers' preferences.
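The splitting scheme can be sketched as below, assuming the validation and test sets are the two most recent time windows (the text describes them as covering about a month each, but does not state the exact cutoffs). The function `chronological_split`, the record layout, and the cutoff dates are hypothetical.

```python
from datetime import date


def chronological_split(records, val_start, test_start):
    """Split (review_date, payload) records into train / validation / test by date.

    Reviews before val_start go to training, reviews from val_start up to
    (but excluding) test_start go to validation, and the rest go to test.
    """
    train = [r for r in records if r[0] < val_start]
    val = [r for r in records if val_start <= r[0] < test_start]
    test = [r for r in records if r[0] >= test_start]
    return train, val, test


# Toy records; the cutoff dates are hypothetical, not the paper's actual cutoffs.
records = [
    (date(2014, 3, 2), "r1"), (date(2014, 4, 20), "r2"),
    (date(2014, 5, 25), "r3"), (date(2014, 6, 1), "r4"),
    (date(2014, 6, 30), "r5"), (date(2014, 7, 10), "r6"),
]
train, val, test = chronological_split(
    records, date(2014, 5, 15), date(2014, 6, 17)
)

# After tuning on (train, val), the model is refit on the total training set.
total_train = train + val
```

Hyperparameters would be tuned by fitting on `train` and scoring on `val`; the final model is then refit on `total_train` and evaluated once on `test`.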

A. MACHINE LEARNING MODELS FOR EX ANTE PREDICTION
The support vector machine (SVM) [38] and decision tree (DT) [39] models are base models (single classifiers) used to compare their prediction performance with more complex models.
Ensemble methods use a set of base classifiers. Dietterich [40] suggested that ensemble models often perform better than single classifiers because: (1) averaging classifiers may reduce the probability of using the wrong classifier; (2) different starting points for each classifier's optimization may reduce the possible local optima; and (3) combining classifiers may represent the correct function for mapping features to labels. Random forest (RF) [41] and extreme gradient boosting (XGB) [42] are tree ensemble models.
Recently, deep learning (DL) has shown dramatic progress in diverse areas. DL automatically learns a representation of data for required tasks [43]. The artificial neural network (ANN) [44] and long short-term memory (LSTM) [45] models are DL models.

B. EX ANTE PREDICTION PERFORMANCE IN SENTIMENT CLASSIFICATION
The prediction performance criteria for sentiment classification are:
1. Accuracy: the ratio of the number of correctly classified reviews to the total number of reviews;
2. Precision: the fraction of reviews correctly classified for a given star rating over the total number of reviews classified as that star rating;
3. Recall: the fraction of reviews correctly classified for a given star rating over the true number of reviews belonging to that star rating; and
4. F-score: the weighted average of precision and recall. For the F1 score, this is the harmonic mean, $F_1 = 2 \cdot \mathrm{precision} \cdot \mathrm{recall} / (\mathrm{precision} + \mathrm{recall})$.

According to the studies conducted by Ibrahim et al. [46] and Jeni et al. [47], the F1 score may be a better evaluation criterion for this imbalanced dataset because accuracy can give a misleading picture of a classifier's prediction performance on an imbalanced dataset. For example, if a machine learning model classifies all the instances in the test set (Table 7) as a positive class, the accuracy will be 0.7855 (the minimum reasonable accuracy of a classifier). Accordingly, the weighted average macro F1 score (WA F1) is the evaluation criterion for each model's prediction performance in this study.

The predictive performance of six popular prediction machines with six different feature sets can be seen in Table 8. Model 1 (''at time model'') is the base model that contains only 37 observable variables. This model is the base feature set against which the prediction performance of the six machine learning algorithms with the other models (i.e., different feature sets) is compared. Without digital footprints and sentiment variables, as in the case of model 1, only the prediction performance of SVM in the WA F1 score is slightly better than that of the econometric model (HETOP). In this case, there is no strong incentive to apply other complex machine learning models to predict potential consumers' sentiment instead of the base machine learning model (SVM) or the conventional econometric model (HETOP). In addition, the predictive performance of machine learning and econometric models with this feature set is very low.

Models 2, 3, and 4 (Table 8) are ex ante models used to predict consumers' potential sentiment for PTs before they make a purchase. Model 3 (the ''ex ante sub-model'') yields the best-performing classifiers among all six models (including the three ex ante models). RF and XGB in model 3 are not only the best prediction machines among the six classifiers in all six models, with a WA F1 score of 0.74, but also show the highest accuracy among the six classifiers in all six models, with a score of 0.802 (Table 8).
Surprisingly, adding more price variables to model 3 does not improve the prediction performance of the best classifiers (RF and XGB) in model 4. This result indicates that adding a potentially biased variable (price at the time of web scraping) to prediction machines may not improve the prediction performance.
Models 5 and 6 (Table 8) are ''partial ex ante'' models used to predict consumers' potential sentiment for the PTs purchased before they write a review. These models contain the product dummies for 71 PTs; therefore, firms know the type of PTs purchased by the consumers.
Surprisingly, adding these product dummies to the feature set in model 3 does not improve the WA F1 score of most of the classifiers (Table 8). Therefore, information about purchased PTs may not be very useful for improving classifiers' prediction performance in model 3.
Table 9 provides the detailed model structure in model 3, the optimal hyperparameters for each model, and the confusion matrix for each classifier's prediction. Notably, all the classifiers in model 3 show a zero F1 score for the minority class (2; three-star rating). This result shows the biased prediction problem in the imbalanced data. If the three-star-rating reviewer group corresponds to a minority group in society, this bias may cause unfairness and inequality issues.
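The evaluation criteria in this section can be illustrated with a small sketch. It assumes that WA F1 averages the per-class F1 scores weighted by the number of true reviews in each class, which is one common reading of a "weighted average" F1; the paper's exact formula is not reproduced here. The toy labels are illustrative and are not taken from Table 7.

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f1, support)} for every true class."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        support = sum(1 for t in y_true if t == c)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1, support)
    return scores


def accuracy(y_true, y_pred):
    """Share of reviews whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 * n for _, _, f1, n in scores.values()) / len(y_true)


# Toy example: 8 positive (3), 1 neutral (2), and 1 negative (1) review.
y_true = [3] * 8 + [2, 1]
y_pred = [3] * 10  # a classifier that always predicts the majority class

acc = accuracy(y_true, y_pred)
wa_f1 = weighted_f1(y_true, y_pred)
minority_f1 = per_class_scores(y_true, y_pred)[2][2]
```

The always-majority classifier reaches a high accuracy while its F1 score for the minority class is zero, which mirrors the pattern reported for model 3.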

VI. SENTIMENT CLASSIFICATION USING NLP
Labeling text data for sentiment analysis often requires high-cost, time-consuming, and labor-intensive work. If the volume of review data is larger, the required time, labor, and financial cost for annotation will increase as well. In this case, firms can reduce these labeling costs by leveraging natural language processing (NLP).
Firms can apply deep learning methods to identify semantic meanings from review text. After training NLP models on an expert-annotated training dataset, the trained NLP models could classify the reviewers' sentiment toward a specific product content dimension (PCD) in a new review text dataset. Firms can apply these sentiment analyses to heuristic, fast, data-driven business decision making for better consumer support and feedback.
As a digital experiment for examining NLP's potential for sentiment analysis, diverse NLP methods are applied to classify reviewers' sentiment toward a specific product content dimension (functionality) because the functionality dimension contains the least imbalanced data among the nine PCDs for programmable thermostats (PTs). As shown in Table 10, the reviewers' sentiment regarding the functionality is distributed as follows: positive (3) with 41.70%, neutral (2) with 32.77%, and negative (1) with 25.53%. This dataset is relatively balanced compared with the previous datasets.
Word embedding is a way to map words to a real-valued vector space. Word embedding assumes that numerical vectors generated from review text contain the semantic information in the review text. High-quality word embedding vectors are essential for sentiment classification performance. Three different word-embedding approaches are applied in this study to convert review text into numerical input vectors: (1) word frequency-based embedding, (2) word distribution-based embedding, and (3) context-based embedding.
In particular, transfer learning has shown success in different NLP tasks and has become an important approach in NLP [48]-[50]. Transfer learning assumes that, when the training dataset is relatively small, using parameters in pre-trained models trained with big data could improve NLP models' performance in a new task.
Two popular transfer learning approaches are fine-tuning [48] and further pre-training [51]. The fine-tuning approach simply reuses a pre-trained model for new target tasks. A further pre-training approach involves training a pre-trained model with domain data to update the weights in the pre-trained model to reflect contextual domain information. The fine-tuning and further pre-training methods are applied to the W2V and BERT models in this study.
On top of each word-embedding vector generated from the review text, tree-based ensemble models (RF, XGB) and a deep learning model (CNN) are applied to classify reviewers' sentiment toward the functionality dimension. Each classification model is combined with a word-embedding method suited to that classifier's characteristics.

A. WORD EMBEDDING: MAPPING TEXT TO NUMERICAL VECTORS
Frequency-based embedding is a simple way to map each review text to numerical vectors. Term frequency-inverse document frequency (TF-IDF) is a frequency-based type of word embedding that penalizes words occurring frequently across the entire set of reviews [35]. On top of the TF-IDF embedding vectors from the review text data, RF and XGB are applied for sentiment analysis. TF-IDF produces a high-dimensional sparse matrix and cannot represent similarity, ambiguity, and contextual meaning in a text (Appendix).
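A minimal TF-IDF sketch is given below to make the sparsity and the frequency penalty concrete. It uses the plain textbook weighting, raw term count times log(N / df), which is an assumption: the paper does not specify its TF-IDF variant, and library implementations usually add smoothing and normalisation. The three toy reviews are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) with rows[i][j] = tf(i, j) * idf(j).

    tf is the raw count of a term in a review; idf = log(N / df), where df is
    the number of reviews containing the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    df = Counter(tok for doc in tokenized for tok in set(doc))
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


# Invented toy reviews, for illustration only.
reviews = [
    "the thermostat is easy to program",
    "the thermostat stopped working",
    "the unit is easy to install and easy to use",
]
vocab, rows = tfidf_vectors(reviews)
zero_share = sum(v == 0.0 for row in rows for v in row) / (len(rows) * len(vocab))
```

The word "the" appears in every review and therefore receives a weight of zero, while most other entries are zero simply because the word is absent from that review. Each review becomes one flat vector over the vocabulary, so word order and context are lost.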
The Word2Vec (W2V) [52] model is a word distribution-based embedding method and generates dense embedding vectors representing each word's semantic meaning. For example, the W2V model may generate similar embedding vectors for ''pen'' and ''pencil'' because the two words contain similar semantic meanings. In this study, the W2V model is trained with all the reviews (N = 1,926,047) in the ''tool and home improvement'' category and the number of unique words is 73,856. The hyperparameters are the W2V embedding dimension, window size, and training dataset. After hyperparameter tuning, the optimal W2V embedding dimension is 100 and the optimal window size is 5 (Appendix).
Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art context-based embedding method. BERT can represent the same word in a sentence with different embedding vectors by reflecting the contextual meaning of each word in the sentence. For example, in the sentences ''I did not like this thermostat in the past. Now, I love this thermostat,'' the word ''thermostat'' occurs twice, in the first and in the second sentence. BERT generates different embedding vectors for ''thermostat'' in the first and second sentences based on the contextual information in them. Meanwhile, context-free embedding models (e.g., TF-IDF and W2V) generate the same embedding vectors for ''thermostat'' in both sentences.
In particular, the domain expert in this study reads and annotates all 5,307 reviews for PTs and finds that the review text often contains a comparison between the previously owned PT and the newly purchased PT, so the same word in the review often represents different contexts based on its position in the review. For example, ''I disliked the previous thermostat. However, I love this new thermostat.'' In this text, even though the word ''thermostat'' occurs both in the first and in the second sentence, the first one may contain a negative sentiment and the second one may contain a positive sentiment. However, context-free embedding models cannot capture different semantic meanings of the same word in different positions in the review sentences. In contrast to the context-free embedding models, BERT (context-based embedding) can find the contextual difference between occurrences of the same word in different positions in the review sentences.
This study uses the BERT-base model, which contains 30,522 unique tokens with 768 embedding dimensions, for fine-tuning and further pre-training. With the fine-tuned BERT, the convolutional neural network (CNN) is applied on top of the pre-trained embedding from the original BERT model. With the further pre-trained BERT, the BERT embedding is updated by training on the review text data and is then used as input vectors for the CNN classifier. Recently, Gururangan et al. [51] and Sun et al. [53] showed that further pre-training with domain data could improve machine learning models' performance.

B. CONVOLUTIONAL NEURAL NETWORK (CNN) FOR SENTIMENT CLASSIFICATION
Many studies have applied a CNN for text classification and shown good performance [54]-[56]. Liu et al. [1] and Timoshenko and Hauser [2] applied CNN text classification on top of W2V embedding trained on review data. In this study, the CNN classifier on top of BERT or W2V embedding is applied for sentiment analysis (Appendix).
According to Zhang and Wallace [56], the filter size and the number of filters are key hyperparameters for a CNN model, 1-max pooling performs better than other pooling methods, and regularization has little influence on the performance of the CNN classification. This study applies multiple filter sizes and different numbers of filters to find the optimal parameters. Input embedding vectors are generated from multiple versions of the W2V and BERT models. For structured data, 161 variables are selected from the partial ex ante sub-model as input variables for the full model (text and structured data model).
Table 11 shows the distribution of the three classes in the functionality dimension in the review text. This dataset is relatively less imbalanced than the previous datasets, so the prediction performance of the minority class (1) may be better than in previous cases. In Table 11, the baseline (minimum reasonable) test set accuracy is 0.3795.

C. SENTIMENT CLASSIFICATION EXPERIMENT DESIGN
This study defines the partial and full models based on the type of features in the model. The partial model simplifies the feature engineering by excluding digital footprint (DF) mining from user-generated content (UGC) to generate numerical input variables. In general, DF mining requires intensive manual coding and adequate computing resources (e.g., mass storage space and big-memory computers). Generating input variables from DFs also requires a large online product review dataset that contains individual user IDs, product IDs, and time stamps. Firms often want to reduce feature engineering by focusing only on review text data (the partial-model approach). However, the full-model approach shows how to combine unstructured review text data with structured data to improve a classifier's performance.
In this section, tree ensemble models (RF and XGB) are selected as baseline models to compare their prediction performance with more complex models. The TF-IDF embedding method is applied to the RF and XGB models because these models are incompatible with the two-dimensional word-embedding vectors generated by the W2V and BERT models.
The CNN model is a popular deep learning model for text classification. In particular, various CNN models on top of BERT or W2V embedding vectors are the main classifiers in this section. In this study, the CNN model's hyperparameters are the length of the review text, training epochs, number of filters, filter sizes, dropout rate, and learning rate.
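To make the roles of the filter size and the number of filters concrete, the sketch below applies convolution filters of two sizes to a short sequence of word vectors and keeps the maximum activation of each filter (1-max pooling). It is a simplified illustration with no bias term, nonlinearity, dropout, or training, and it is not the architecture reported in the Appendix; all values and names are invented.

```python
def conv1d_max_pool(embeddings, filt):
    """Slide one filter over a sequence of word vectors and keep the maximum.

    embeddings: list of word vectors (one per token), each of length d.
    filt: list of h weight vectors of length d, i.e. a filter of size h.
    Assumes the review has at least h tokens (shorter reviews would be padded).
    """
    h = len(filt)
    activations = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        activations.append(sum(
            w * x
            for w_row, x_row in zip(filt, window)
            for w, x in zip(w_row, x_row)
        ))
    return max(activations)


# Toy 2-dimensional "embeddings" for a 4-token review (illustrative values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = {
    2: [[1.0, 0.0], [0.0, 1.0]],              # one filter of size 2
    3: [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],  # one filter of size 3
}

# One pooled feature per filter; these features would feed the final classifier.
features = [conv1d_max_pool(tokens, f) for _, f in sorted(filters.items())]
```

Because each filter contributes exactly one pooled value, the length of the feature vector equals the total number of filters regardless of the review length.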
The W2V embedding models are trained on different review datasets with different window sizes and embedding dimensions. The CNN classifier on top of Google's pretrained W2V embedding (trained on three million words and phrases from Google News) shows lower prediction performance than the CNN classifier on top of W2V embedding generated in this study (trained on online product review data from Amazon). In particular, two different online product review datasets are used for training the W2V models: (1) W2V_S (N = 169,809 reviews), containing all reviews of the target reviewers across all categories over the entire sample period; and (2) W2V_L (N = 1,926,047 reviews), consisting of all reviews in the ''tool and home improvement category'' over the entire sample period. The W2V model trained on W2V_L shows better performance for sentiment analysis in this section than the W2V model trained on W2V_S and on Google's pre-trained model.
The BERT models are used as word-embedding methods with two different approaches: fine-tuning and further pre-training. The fine-tuning approach simply reuses the pre-trained embedding vectors from the original model as input embedding vectors for a classifier. This approach relies on transfer learning and has recently been shown to perform well in NLP tasks.
A further pre-training approach updates the pre-trained embedding vectors by training the pre-trained model on domain data to adapt domain context information to embedding vectors. However, there is no ground truth or theoretical proof supporting the assumption that further pre-training ensures better performance with noisy online product review data. Two different online product review datasets are used for further pre-training: (1) BERT_S (N = 169,809 reviews), containing all reviews of the target reviewers across all categories over the entire sample period; and (2) BERT_L (N = 1,926,047 reviews), consisting of all reviews in the ''tool and home improvement'' category.
For further pre-training of the BERT model on domain-specific review data, the hyperparameters are the learning rate, batch size, and further training steps. In this study, the optimal hyperparameters for further training BERT are learning rate 0.00001, batch size 32, and 1,926,047 training steps. In the BERT model, the maximum length of tokens is fixed at 512 (510 without special tokens); therefore, 512 is the maximum length of review tokens for the BERT model in this study.
Table 12 presents the results of the sentiment classification of reviews about a specific product content dimension. The classification models are divided into the partial model (using text only) and the full model (using text and structured data).
In the partial model, the CNN models on top of fine-tuned BERT or further pre-trained BERT_L embedding show the highest WA F1 score and accuracy. Accuracy is an important evaluation metric for measuring the prediction performance because the dataset in this section is relatively more balanced than the datasets in the previous sections.

D. SENTIMENT CLASSIFICATION RESULTS
All the CNN models on top of BERT embedding show better prediction performance than the tree ensemble models and the CNN models on top of context-free embedding (TF-IDF and W2V embedding). This result indicates that BERT is a better embedding method for sentiment classification.
It demonstrates that the identification of contextual information from review text is a critical factor for the sentiment classification of online product reviews (Table 12).
In the full model, the CNN model on top of the fine-tuned BERT embedding shows the highest WA F1 score and accuracy (Table 12). This result indicates that firms can easily implement sentiment analysis without intensive training steps for word-embedding models and accomplish high prediction performance by reusing pre-trained BERT embedding as input embedding vectors. The CNN models with further trained BERT embedding show lower prediction performance than the CNN model with pre-trained BERT embedding. Therefore, further pre-training of BERT may not be a suitable embedding method in this case.
Surprisingly, the class-weighted XGB on top of TF-IDF embedding shows the same WA F1 score as the CNN on top of pre-trained BERT embedding (Table 13). The prediction performance of XGB with text and structured data is higher than that of XGB with text data only. This result may be due to the weighted XGB's good prediction performance with structured numerical variables.
In contrast to the previous sections, the dataset in this section is relatively balanced, so the imbalanced classification problem is not a critical issue in this section and the classification performance for the minority class is not low. Overall, the CNN on top of fine-tuned BERT is the best option in all cases, with high prediction performance and low computational costs for training the embedding model. In addition, the full-model cases are mostly superior to the partial-model cases.

VII. CONCLUSION
This study finds that all HETOP models containing DFs and sentiment variables show a higher model fit than the base model containing no DFs or sentiment variables. Furthermore, machine learning models containing DFs and sentiment variables show better prediction performance than the base model. These points indicate the importance of DF mining and sentiment analysis for estimation and prediction tasks.
The HETOP models' results show that a consumer is less likely to give a five-star rating for a reviewed programmable thermostat (PT) if he or she: (1) writes a longer review summary and body, (2) has a lower variance of review summary length in prior reviews, a larger volume of prior reviews across all categories, and a higher average rating in prior reviews across all categories, (3) writes a review for the PT that has a higher average length of review summary and/or lower variance of review summary length in prior reviews, (4) writes a larger volume of prior reviews in specific product categories.
The eight sentiment variables positively affect the probability of a 5-star rating. The sentiment variables represent the target consumers' sentiment toward product content dimensions (PCDs). The dimensions are (1) smart connectivity, (2) easiness, (3) energy and money saving, (4) functionality, (5) support, (6) perceived price value, (7) privacy, and (8) the Amazon effect. The results suggest that consumers consider not only the smartness of programmable thermostats but also the easiness of using the device. Surprisingly, consumers also consider the value of privacy. Without extracting the latent product content dimension from the online product reviews, firms may not be able to discover these latent factors that affect consumer preferences. To the best of the author's knowledge, this is the first study to address the effect of the online retail market platform's service quality on the consumers' star ratings. Without consideration of the online platform service quality effect, empirical results will be biased. This approach can be applied to design the promotion of products, measure the effects of policies (such as energy star certification) on consumers' preferences in the online retail market platform, and identify the factors that affect consumer satisfaction or dissatisfaction.
This study also finds that extreme gradient boosting (XGB) is the best prediction machine among six popular machine learning algorithms for predicting individual consumers' sentiment before they make a purchase or write a review. In addition, this study shows how to combine variables generated from text and other numerical variables to make predictions. This study also shows each machine learning algorithm's performance in sentiment classification with the imbalanced dataset, finding that all the machine learning algorithms show low prediction performance for the minority class. The imbalanced classification problem can cause social inequality or unfairness issues if the minority class group corresponds to a minority group in society. Above all, this approach can be implemented in an online review platform to design better target marketing strategies and recommendation systems.
This study applies natural language processing (NLP) to classify the target consumers' sentiments toward a specific product content dimension from the review text. Firms can apply this approach to reduce expensive domain expert annotation costs and implement data-driven business decisions. This approach provides empirical evidence that the context-based embedding (BERT) approach outperforms context-free embedding models (TF-IDF and Word2Vec). In particular, this study applies transfer learning concepts by applying pre-trained BERT embedding as input embedding for the CNN classifier. It also suggests that the further pre-training of BERT with domain review text data may not guarantee the improvement of prediction performance.
In sum, the approaches in this study are interpretable, applicable, and scalable to a wide range of goods, allowing for the identification and prediction of unobserved consumer preferences and sentiments associated with product content dimensions for a specific target product group.
Applying the approaches in this study to specific search goods (e.g., organic or non-organic milk) or credence goods (e.g., wine) will be a good extension of this study. The effects of expensive domain expert annotation and relatively inexpensive crowdsourcing annotation (e.g., Amazon Mechanical Turk) on sentiment classification performance will also be a valuable topic for future research. In addition, a study that examines true and fake reviews on different online platforms will be useful for identifying the differences between true and fake reviews.

APPENDIX A CONCEPTUAL FRAMEWORK
The conceptual consumer space shows the segmentation of consumers (Figure 2). The purpose of this consumer space concept is to derive the group of consumers who become reviewers on Amazon.
The total consumer group is denoted as $S_t$. This total group is divided into two groups, those who are users of Amazon, $S_a$, and those who are not, $S_{na}$. This study assumes that members of the non-Amazon user group $S_{na}$ do not write or read reviews on Amazon.
The Amazon user group $S_a$ is split into two subgroups, those who write reviews, $S_{aw}$, and those who do not, $S_{anw}$. It should be noted that even though consumers in $S_{aw}$ write reviews, it is possible that their review data contain bias. Accordingly, this study assumes these biased reviews reduce the credibility of the information found in the reviews.
Above all, if the review data written by the consumer group $S_{aw}$ are analyzed and used to estimate and predict individual consumer preferences of the entire Amazon user group $S_a$, this will cause sample selection bias because there is no information about $S_{anw}$. Therefore, this study aims to estimate and predict consumer preferences for the group of Amazon users who write a review, i.e., $S_{aw}$, by using the review data written by this group while excluding biased reviews (from the subgroup $S_{awb}$). Consequently, this paper implements specific pre-processing steps to remove the reviews written by $S_{awb}$.
In addition, this study extracts individual reviewers' DFs for a specific product group from a dataset of 141 million Amazon reviews. The DFs are divided into two groups:
1. User DFs: reviewer i's DFs before writing a review of thermostat p on day $t_i$, given by a DF function for reviewer i who writes a review of p before $t_i$.
2. Crowd DFs: the crowd's (other prior reviewers') DFs for thermostat p before i writes a review of thermostat p on day $t_i$.

The Amazon review data used in this study are secondary [8]. The dataset has 142.8 million reviews that were generated from May 1996 to July 2014. This dataset does not have duplicate reviews for the same products. Detailed descriptions for each data pre-processing step are shown below:

Step 1: Selecting reviews with no missing values
The programmable thermostats (PTs) belong to the ''tools and home improvement'' category. Identifying a specific product group (programmable thermostats) based only on the category may lead to noisy or missing samples. Therefore, the set of programmable thermostats is carefully defined through the following processes:
1. Selecting the category to which the product belongs from the following list.
2. Removing the products that contain ''non-programmable'' in the title.
3. Selecting the products that contain ''programmable'' in the product description.
4. Removing the products that contain ''non-programmable'', ''non programmable'', or ''programmable no'' in the product description.
5. Removing the products that have a missing value in the brand or price variables.
6. Evaluating the image of each product to verify the robustness of the product set.

The PT set without missing values in either brand or price variables will henceforth be called ''programmable thermostats.'' There are 110 thermostats in this set. Although the total number of initial reviews of the 110 PTs was 8,817, the total number of reviewers was 8,694, because some reviewers wrote multiple reviews.
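The text-based rules in Step 1 (processes 2 to 5) can be sketched as a simple filter. The field names `title`, `description`, `brand`, and `price` are hypothetical, since the layout of the product metadata is not described here, and the category selection and manual image check (processes 1 and 6) are omitted. The example products are invented.

```python
def is_programmable_thermostat(product):
    """Apply the title, description, and missing-value rules to one product dict."""
    title = (product.get("title") or "").lower()
    desc = (product.get("description") or "").lower()
    if "non-programmable" in title:
        return False  # process 2
    if "programmable" not in desc:
        return False  # process 3
    excluded = ("non-programmable", "non programmable", "programmable no")
    if any(phrase in desc for phrase in excluded):
        return False  # process 4
    if not product.get("brand") or product.get("price") is None:
        return False  # process 5
    return True


# Invented example products (hypothetical field names and values).
products = [
    {"title": "7-Day Programmable Thermostat",
     "description": "Programmable touchscreen thermostat",
     "brand": "BrandA", "price": 49.99},
    {"title": "Non-Programmable Thermostat",
     "description": "Simple manual thermostat",
     "brand": "BrandB", "price": 19.99},
    {"title": "Digital Thermostat",
     "description": "Programmable thermostat with backlight",
     "brand": None, "price": 29.99},
    {"title": "Basic Thermostat",
     "description": "This non programmable unit is easy to use",
     "brand": "BrandC", "price": 15.00},
]
kept = [p["title"] for p in products if is_programmable_thermostat(p)]
```

Substring matching of this kind is deliberately conservative, which is consistent with the manual image check being kept as a final verification step.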
This study considers only inexperienced consumers' first review of the PTs, because inexperienced consumers may become experienced consumers after they have written their first review. Second and third reviews of PTs from the same reviewer are deleted. Therefore, the total number of reviews of PTs used in this research is 8,694, the same as the number of reviewers.
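Keeping only each reviewer's first PT review can be sketched as follows. The keys `reviewer_id` and `time` are hypothetical names for the reviewer identifier and the review time stamp, and the toy records are invented.

```python
def first_review_per_reviewer(reviews):
    """Keep each reviewer's earliest review; later reviews are dropped.

    Each review is a dict with hypothetical keys "reviewer_id" and "time"
    (any sortable time stamp).
    """
    first = {}
    for review in sorted(reviews, key=lambda r: r["time"]):
        # setdefault stores a review only the first time a reviewer is seen.
        first.setdefault(review["reviewer_id"], review)
    return list(first.values())


# Invented toy records: reviewer "u1" reviewed two thermostats.
reviews = [
    {"reviewer_id": "u1", "time": 20130105, "product": "pt_b"},
    {"reviewer_id": "u2", "time": 20120310, "product": "pt_a"},
    {"reviewer_id": "u1", "time": 20110720, "product": "pt_a"},
]
kept = first_review_per_reviewer(reviews)
```

After this step, the number of retained reviews equals the number of distinct reviewers, as in the reduction from 8,817 reviews to 8,694.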
Step 2: Cleaning ''suspicious one-time reviewers'' and ''always-the-same-rating reviewers''
Step 2.1: Cleaning ''suspicious one-time reviewers''
Zhao et al. [18] indicated that fake reviews increase consumers' uncertainty about products and that more believable online reviews of experience goods have a larger effect on consumer choice. Some firms may write positive reviews about their own products and negative ones about their rivals' products [14], [17], [19]. Accordingly, deleting potential fake reviews is essential to improve the credibility of the reviews and reduce consumer uncertainty.
Mayzlin, Dover, and Chevalier (2014) defined the ''suspicious reviewer'' as one who writes a review for a hotel for the first time only during the sample period (October 2011) and showed that their rating distribution is more polarized than that of the entire sample [14]. This study takes this into account by accessing individual reviewers' prior reviews in different categories over the entire sample period, defining a ''suspicious one-time reviewer'' as one who writes only a review for a PT as a first review and does not write reviews for any other products over the entire sample period.
This cleaning process assumes that, to minimize costs, suspicious one-time reviewers are unlikely to write reviews of products in other categories outside their specific target product group (their own products or competitors' products in the same product group). In other words, suspicious one-time reviewers may be unlikely to post reviews outside of their product area. It is possible that they are actual reviewers. However, it is still reasonable to delete potential suspicious one-time reviewers to remove possible bias. In addition, suspicious one-time reviewers do not have any digital footprints (DFs); therefore, these reviewers would in any case be deleted in Step 3 (deleting reviewers and reviews for products with no DFs). A total of 1,165 reviews for 80 PTs, written by 1,165 suspicious one-time reviewers, are detected.
Step 2.2: Cleaning ''always-the-same-rating reviewers (ASRs)''
Some reviewers always give the same star rating to all reviewed products in all categories, regardless of the product quality. Such reviewers may not respond to product quality or to previous reviews written by the crowd. Consequently, their reviews do not reflect the product quality. It is also possible that reviewers give the same rating level simply because they have written only a small number of reviews. Over the sample period, 1,970 reviewers rated products in all categories at the same level; however, 1,165 of these reviewers wrote only 1 review and 316 wrote 2 reviews.
In this study, an ''always-the-same-rating reviewer (ASR)'' is a reviewer who writes more than 8 reviews with the same rating level. In detail, ''programmable thermostats'' belong to the ''tools and home improvement'' category in the Amazon review system. The majority rating in this category is a 5-star rating, with a probability of 0.595. If the probability of the majority star rating in the five-point star-rating system is 0.595 (an extreme and subjective assumption), the probability that a reviewer independently gives the majority star rating in nine consecutive reviews is 0.00934 (less than 1%). Only 69 reviewers write more than 8 reviews at the same star-rating level; surprisingly, that level is always 5 stars, so they are designated ''always happy reviewers (AHRs)''. These 69 reviews for 25 PTs are removed.
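The probability quoted above can be checked with a short calculation. The sketch below assumes, as the text does, that each of the nine ratings is an independent draw with a 0.595 chance of being the majority (5-star) rating; the variable names are illustrative.

```python
# Probability that nine independent reviews all carry the majority rating,
# given a per-review probability of 0.595 (the assumption stated in the text).
p_majority = 0.595
n_reviews = 9  # "more than 8 reviews" means at least nine

p_all_same = p_majority ** n_reviews
print(f"{p_all_same:.5f}")
```

Under this independence assumption the value is below 1%, which is the basis for the threshold of more than 8 reviews.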
There is no overlap between the 1,165 suspicious one-time reviewers and the 69 ASRs. The number of reviewers becomes 7,460 after removing these 1,234 reviewers. As can be seen in Figure 3, the share of 1-star ratings among suspicious reviewers (18.9%) is about twice as large as that among the reviewers who remain after cleaning the suspicious one-time reviewers and ASRs (9.69%). Therefore, the suspicious one-time reviewers' reviews may contain negative promotional reviews.
Step 3: Deleting reviewers and reviews for products with no digital footprints (DFs)
Without DFs, it is impossible to measure the effect of DFs on a reviewer's rating for a PT when the reviewer writes a review for a PT for the first time. Accordingly, two groups are identified: (1) 1,965 reviewers who do not have any previous reviews of other products (excluding PTs) in any category before the first day of writing a review for PTs; and (2) 91 reviewers who write a review for a PT that does not have any previous reviews written by other prior reviewers. The overlap between the 1,965 reviewers and the 91 reviewers is 28 reviewers; therefore, 1,234 reviewers are removed.
Step 5: Identifying five latent product content dimensions in the reviews using LDA
Step 5.1: What is LDA (latent Dirichlet allocation)?
LDA is a Bayesian unsupervised learning model used to identify latent topics in each review and the distribution of these topics in each review. The terminology for LDA in this study is defined as follows:
· $w_{i,n}$ is the $n$th word in the $i$th review and follows a multinomial distribution.
· $V$ (vocabulary) is the total number of unique words in the set of all review data.
· $K$ is the total number of topics in each review and is a hyperparameter.
· The $i$th review is a sequence of $N$ words, $r_i = (w_{i,1}, \dots, w_{i,N})$.
· A corpus is a set of $M$ reviews, $R = (r_1, \dots, r_M)$.
As a generative probabilistic model, LDA assumes that each review is represented as a distribution over $K$ topics, $\theta_i$. $\theta_i$ is a vector in $\mathbb{R}^K$ that represents the proportion of each topic in the $i$th review. $\theta_i$ follows a Dirichlet distribution with Dirichlet parameter $\alpha$. In addition, $\phi_k$ is the $k$th topic vector in $\mathbb{R}^V$ that represents the proportion of each word in $V$ in the $k$th topic. $\phi_k$ follows a Dirichlet distribution with topic hyperparameter $\beta$. $z_{i,n}$ is a vector in $\mathbb{R}^K$ that maps the $n$th word in the $i$th review to topic $k$. $z_{i,n}$ and $w_{i,n}$ follow multinomial distributions. Overall, $\theta_i$, $\phi_k$, and $z_{i,n}$ are latent variables and $w_{i,n}$ is an observable variable.
In addition, LDA assumes that $w_R$ (the words in the reviews) is generated from the joint distribution of $\theta_R$ (the reviews' topic distributions) and $\phi_K$ (the topics' word distributions). The joint distribution describes the word generation process in reviews as follows:
$$p(w_R, z_R, \theta_R, \phi_K \mid \alpha, \beta) = \prod_{k=1}^{K} p(\phi_k \mid \beta) \prod_{i=1}^{M} p(\theta_i \mid \alpha) \prod_{n=1}^{N} p(z_{i,n} \mid \theta_i)\, p(w_{i,n} \mid \phi_K, z_{i,n})$$
Excluding $w_{i,n}$, the other variables are latent variables. During the training process of LDA, the optimal values of the latent variables maximize the posterior probability. The posterior probability is denoted as follows:
$$p(\theta_R, \phi_K, z_R \mid w_R, \alpha, \beta) = \frac{p(w_R, z_R, \theta_R, \phi_K \mid \alpha, \beta)}{p(w_R \mid \alpha, \beta)}$$
However, the denominator of the posterior probability is intractable for exact inference because $\phi_K$, $\theta_R$, and $z_R$ are unobserved variables. In practice, various approximate inference methods, such as variational inference and Gibbs sampling, are applicable for estimating the posterior probability.
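The generative process that LDA assumes can be illustrated with a small simulation. This is a minimal sketch of the generative story only, not of the inference procedure used in the study; the toy sizes (3 topics, 6 vocabulary entries, 8 words) and the symmetric hyperparameter values are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index k with probability probs[k]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_review(phi, alpha, n_words, rng):
    """Generate one review: theta ~ Dir(alpha), then (z, w) for each word."""
    theta = sample_dirichlet([alpha] * len(phi), rng)  # topic proportions
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)   # topic assignment z_{i,n}
        w = sample_categorical(phi[z], rng)  # word w_{i,n} drawn from topic z
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(0)
K, V = 3, 6              # toy numbers of topics and vocabulary entries
ALPHA, BETA = 0.5, 0.5   # illustrative symmetric Dirichlet parameters
phi = [sample_dirichlet([BETA] * V, rng) for _ in range(K)]  # topic-word distributions
theta, topics, words = generate_review(phi, ALPHA, n_words=8, rng=rng)
print(theta, topics, words)
```

Training reverses this process: only the word indices are observed, and the topic proportions, topic-word distributions, and topic assignments are inferred.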
Step 5.2: LDA Application in This Study
LDA is often called topic modeling. Topics in online product reviews indicate the product content dimensions of the products. The product review text for a specific product group contains a finite number of product content dimensions (topics of product reviews) for that product group. Based on the empirical results of the LDA model and the theory [57], Liu et al. [2] divided the product content of online product review text into six dimensions: (1) esthetics, (2) conformance, (3) durability, (4) feature, (5) brand, and (6) price.
Though the theoretical framework is useful in general, this paper uses the LDA model to define the product content dimensions in online product reviews for a specific target product group (programmable thermostats) instead of the general category of goods.
After pre-processing, the number of unique words in the 5,307 reviews (the review summary and the body of the review) used for LDA is 4,554. The LDA model in this study contains 5 topic dimensions (Table 14). The optimal number of topics is determined by the coherence score (Figure 5) [58]. As can be seen in Table 14, the author, who is a domain expert in the power industry, interprets the 5 topics as subjective product content dimensions.
Step 6: Modifying the PCDs by leveraging the domain expert's knowledge
The expert extends the five product content dimensions from the LDA model to nine dimensions based on domain knowledge and the purpose of the research design. The dimensions are: (1) smart connectivity, (2) easiness, (3) energy saving, (4) functionality, (5) support, (6) price value, (7) privacy, (8) the Amazon effect, and (9) environmental friendliness.
Passonneau et al. [23] suggested that annotation by experts transfers domain knowledge to machines for better prediction performance. Accordingly, the author manually annotates 47,763 labeling tasks for the reviewers' sentiment toward each product content dimension to transfer domain knowledge to the models (Table 15).
Dimension 1. Smart Connectivity
This dimension indicates the reviewers' sentiment toward programmable thermostats' (PTs') remote control of other home appliances through a Wi-Fi connection using apps and software. Wireless connectivity is a key component of thermostats' smartness as an Internet of Things (IoT) device because it enables consumers to control their home appliances with smartphones, tablets, and computers wherever and whenever they want.
Features related to remote control, Wi-Fi accessibility, and software quality for wireless control belong to this dimension. Firmware for Wi-Fi thermostats can update itself periodically and display customized pictures on the touch screen. For example, reviewers present positive sentiments like the following: ''It is nice to monitor & adjust home temperature remotely on iPhone.'' and ''I love the automatic updates that I have been receiving.''
Dimension 2. Easiness
This dimension indicates the reviewers' sentiment toward PTs' simplicity and convenience of installation, set up, programming, and usage. Unlike other experience goods, PTs require technical knowledge and skills. A lack of the required knowledge and skills may become a source of difficulty and failure of usage. The easiness of understanding the instruction manual, making the wiring connections, and controlling the device (including programming with a better user interface) belong to this dimension. Some reviewers posted ''Easy to Install and Use'' and ''so easy to use and so easy to see in the dark.''
Dimension 3. Energy Saving
This dimension indicates the reviewers' sentiment toward programmable thermostats' actual or expected energy saving and/or money saving due to better energy efficiency and cost-effectiveness than other thermostats or their previous one. The reviewers' comments about features related to better energy saving belong to this dimension along with the reduction of utility bills for electricity or gas. For example, reviews in this dimension include ''A much lower price in your electric bill.'' and ''My gas bill dropped by 30% the first month.''
Dimension 4. Functionality
The purpose of thermostats is to control energy usage for heating and cooling. Accurate and precise control of temperature and time is therefore essential for a better programmable thermostat. This dimension presents the quality of controlling and performance of features.
The discomfort caused by thermostats' quality of functionality belongs to this dimension. For example, a clicking noise from thermostats during setting or programming indicates reviewers' negative sentiment toward this dimension. The reviews in this dimension include ''Temperature not accurate but does the job.'' and ''Makes a clicking noise.''
Dimension 5. Support
This dimension is related to consumer and technical support service, replacement and return service, warranty, packing quality, additional support service on the website, and other helpful materials for consumers. Consumer support services are vital for consumer satisfaction because thermostats require technical knowledge and skill during installation, setting up, and programming.
Consumer support services may also mitigate inexperienced consumers' concerns, technical difficulties, and dissatisfaction during the pre- and post-purchase periods. Some reviews in this dimension are ''customer service is amazing! Tweet them for help even!'' and ''They sent mine in 2 days in perfect condition, plus they appear to have a fair return policy.'' However, the expert disregards the reviewers' sentiment toward Amazon's quality of consumer support service. Without separately considering the online market platform's service quality, the reviewers' sentiment toward this dimension for the PTs will be biased.
Dimension 6. Price Value
This dimension is a reviewer's subjective evaluation of the price level compared with the quality, future benefits, and other factors. Written comments related to the price value, all positive or negative events affecting the price, and repair costs belong to this dimension.
The prices on Amazon.com change very often and differ for consumers due to different promotions and memberships. The true price of reviewed products in the past may be different from the price at the time of web scraping. In this case, the observed price variables at the time of web scraping could be biased. Therefore, this study extracts the reviewers' sentiment toward this dimension from review text data. Some example reviews for this dimension are ''this is money well spent.'', ''Gold box deal makes it worth'', ''Too expensive to justify the benefit'', and ''running a promo to give you a $40 gift card with your purchase.''
Dimension 7. Privacy
This dimension is about privacy concerns related to thermostats. Wi-Fi thermostats provide remote control through the Internet, which may cause consumers to have concerns about privacy and data security. Wi-Fi thermostats can store and transform user information and consumption data.
Most of the negative privacy concerns occurred for the Nest when Google purchased it on January 13, 2014. Some reviews are ''Since Google's Nest buyout raises privacy concerns'' and ''Unless and until clear, unequivocal, irrevocable legal guarantees are in place that Google doesn't get Nest data, I would say that any Nest user must expect that, ultimately, Google will have all that data.''
Dimension 8. The Amazon Effect
This dimension is the reviewers' sentiment caused by Amazon's service quality, such as Amazon's delivery, consumer support, and refund and replacement policy. Reviews on Amazon.com describe not only the product quality but also Amazon's service quality. If researchers do not account for the effect of Amazon's service quality on the reviewers' ratings, it may cause a bias. To the best of the author's knowledge, this is the first paper to measure the effect of Amazon's service quality on reviewers' star ratings.
Some reviews for this dimension are ''Amazon's return policy is great!'', ''I am very pleased with this purchase and with Amazon customer service.'', ''Amazon is really good about their customer service'', and ''super fast Amazon delivery for free (overnight).''
Dimension 9. Environmental Friendliness
Since programmable thermostats are a home energy control device requiring energy consumption for heating and cooling, some researchers may be interested in the issues related to carbon emissions and climate change.
This dimension is a binary variable indicating whether reviews contain comments about the environmental friendliness of thermostats. Only nine reviews contain comments related to this dimension, including ''it helps save the environment!'', ''I feel all environmentally friendly for wasting less energy, too.'', and ''thanks to this environmentally friendly thermostat. I am also helping to save the world.''

APPENDIX C HETEROSKEDASTICITY ORDERED PROBIT MODEL
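Reviewers' observable ratings indicate the range of their unobservable continuous preference as follows:
$$R_{ipt} = \begin{cases} 1 & \text{if } U^*_{ipt} \le c_1 \\ 2 & \text{if } c_1 < U^*_{ipt} \le c_2 \\ 3 & \text{if } c_2 < U^*_{ipt} \le c_3 \\ 4 & \text{if } c_3 < U^*_{ipt} \le c_4 \\ 5 & \text{if } U^*_{ipt} > c_4 \end{cases}$$
The ordered dependent variable, $R_{ipt} \in \{1, \dots, 5\}$, is reviewer $i$'s first star rating for a PT on day $t$. $U^*_{ipt}$ denotes the unobservable continuous utility of reviewer $i$ for product $p$ on day $t$. The unknown cutting points (thresholds) are denoted as $c_k$ with the assumption that $c_1 < c_2 < c_3 < c_4$. $U^*_{ipt}$ can be represented as follows:
$$U^*_{ipt} = x_{it}\beta + \rho_{it}\,\varepsilon_{it}$$
where $x_{it}$ indicates a vector of independent variables, $\varepsilon_{it}$ is a homoskedastic error term following a standard normal distribution, and $\rho_{it} > 0$ is a scale function to adjust the variance. The heteroskedastic ordered probit (HETOP) model assumes its scaling function to be $\rho_{it} = \exp(Z_{it}\gamma)$, where $Z_{it}$ denotes the regressors for the scaling function and $\gamma$ are the unknown coefficients for $Z_{it}$. The probability of a reviewer's rating for a PT can be derived as follows:
$$\Pr(R_{ipt} = j) = \Phi\!\left(\frac{c_j - x_{it}\beta}{\exp(Z_{it}\gamma)}\right) - \Phi\!\left(\frac{c_{j-1} - x_{it}\beta}{\exp(Z_{it}\gamma)}\right), \quad j = 1, \dots, 5,$$
with $c_0 = -\infty$ and $c_5 = +\infty$, where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution. The log-likelihood (LL) function for $N$ reviewers and reviews is:
$$LL(\theta) = \sum_{i=1}^{N} \sum_{j=1}^{5} I(R_{ipt} = j)\, \log\!\left[\Phi\!\left(\frac{c_j - x_{it}\beta}{\exp(Z_{it}\gamma)}\right) - \Phi\!\left(\frac{c_{j-1} - x_{it}\beta}{\exp(Z_{it}\gamma)}\right)\right]$$
This LL function is maximized with respect to the unknown parameters $\theta = \{\beta, \gamma, c_1, c_2, c_3, c_4\}$. $I(\cdot)$ denotes an indicator function, and $\theta$ can be estimated through maximum likelihood estimation.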
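The rating probabilities and the log-likelihood can be sketched numerically. The sketch assumes the standard HETOP form, in which the probability of rating $j$ is $\Phi((c_j - x\beta)/\exp(Z\gamma)) - \Phi((c_{j-1} - x\beta)/\exp(Z\gamma))$ with $c_0 = -\infty$ and $c_5 = +\infty$; the cutpoints, linear indices, and ratings below are made-up illustration values, not estimates from the study.

```python
import math

def std_normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def hetop_probs(xb, zg, cuts):
    """Probabilities of ratings 1..5 given x*beta, Z*gamma, and four cutpoints."""
    scale = math.exp(zg)  # heteroskedastic scale rho = exp(Z * gamma)
    bounds = [-math.inf] + list(cuts) + [math.inf]
    cdf = [std_normal_cdf((b - xb) / scale) for b in bounds]
    return [cdf[j + 1] - cdf[j] for j in range(len(bounds) - 1)]

def log_likelihood(ratings, xbs, zgs, cuts):
    """Sum of log-probabilities of the observed ratings."""
    return sum(
        math.log(hetop_probs(xb, zg, cuts)[r - 1])
        for r, xb, zg in zip(ratings, xbs, zgs)
    )

cuts = [-1.5, -0.5, 0.5, 1.5]  # hypothetical c_1 < c_2 < c_3 < c_4
probs = hetop_probs(xb=0.8, zg=0.2, cuts=cuts)
ll = log_likelihood([5, 4, 1], [0.8, 0.3, -1.0], [0.2, 0.0, -0.1], cuts)
print([round(p, 3) for p in probs], round(ll, 3))
```

In estimation, `xb` and `zg` would be replaced by the linear indices implied by candidate parameter values, and a numerical optimizer would search for the values that maximize `log_likelihood`.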
Marginal effect analysis is an appropriate way to interpret each parameter in ordered probit (OP) models. The variables in $x_{it}$ can overlap with those in $Z_{it}$; therefore, $x^a_{it}$ denotes the variables involved in both $x_{it}$ and $Z_{it}$, while $x^b_{it}$ denotes the variables that belong only to $x_{it}$. In the case of continuous variables, Table 16 shows the marginal effects of both $x^a_{it}$ and $x^b_{it}$. The sign of a coefficient reflects the sign of the marginal effect only in the marginal effect of $x^a_{it}$ at $R_{ipt} = 5$ and inversely reflects the sign of the marginal effect only in the marginal effect of $x^a_{it}$ at $R_{ipt} = 1$. In all other cases, the sign of the coefficient does not necessarily determine the sign of the marginal effect for the parameter. The marginal effect of a binary dummy at each level $R_{ipt} = j \in \{1, \dots, 5\}$ can be derived as follows [59]:
$$ME_j(d_{it}) = \Pr(R_{ipt} = j \mid d_{it} = 1) - \Pr(R_{ipt} = j \mid d_{it} = 0)$$
where $d_{it}$ is a binary dummy variable and $d_{it} = 0$ indicates the base group.

APPENDIX D VARIABLES DESCRIPTIONS
See Table 17.
The category diversity is the Shannon index as follows:
$$H_i = -\sum_{c} p_c \log p_c, \qquad p_c = \frac{N_c}{\sum_{c'} N_{c'}},$$
and $N_c$ is the number of prior reviews in subcategory $c$ by $t^b_i$.
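As an illustration, the Shannon index can be computed from a reviewer's counts of prior reviews per subcategory. This sketch assumes the usual definition $-\sum_c p_c \log p_c$ with the natural logarithm and $p_c$ equal to the share of prior reviews in subcategory $c$; the counts are invented.

```python
import math

def shannon_index(counts):
    """Shannon diversity of a list of per-subcategory review counts."""
    total = sum(counts)
    shares = [n / total for n in counts if n > 0]  # empty subcategories contribute 0
    return -sum(p * math.log(p) for p in shares)

concentrated = shannon_index([10, 0, 0, 0])  # all prior reviews in one subcategory
uniform = shannon_index([5, 5, 5, 5])        # evenly spread over four subcategories
print(concentrated, uniform)
```

A reviewer whose prior reviews are spread evenly over more subcategories receives a higher index, and a reviewer with a single subcategory receives zero.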

APPENDIX E MARGINAL EFFECT
Tables in this section show the marginal effect of key variables (model_h2) at the average value of one company's reviewers (Nest, during June 2014).

APPENDIX F MACHINE LEARNING MODELS
Six popular machine learning models are applied to the ex ante prediction tasks. The support vector machine and decision tree models are base models whose prediction performance is compared with that of more complex models. Random forest and extreme gradient boosting are tree ensemble models. The artificial neural network and long short-term memory models are deep learning models. A high-level overview of each model is presented below.

A. KERNEL SUPPORT VECTOR MACHINE (KERNEL SVM)
The support vector machine (SVM) model finds a linear separating hyperplane in the feature space to classify labels [38]. To deal with non-linearly separable, noisy, and outlier data, Cortes and Vapnik [60] introduced slack variables $\xi_i \ge 0, \forall i$, and a parameter $C$. $\xi_i$ is the distance between the linear hyperplane and the misclassified $x_i$, while $C$ is a weight for the sum of $\xi_i$ in the sample, $\sum_{i=1}^{N} \xi_i$ [61]. In particular, kernel SVM is applied in this study to account for the non-linearity of the data. A kernel function $K$ implicitly maps the original data to a high-dimensional feature space, $x \to \phi(x)$, such that $K(x, x') = \langle \phi(x), \phi(x') \rangle$ for two samples $x$ and $x'$. The Gaussian radial basis function (RBF) is the kernel function, as follows:
$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$
where $\gamma > 0$ and $\lVert x - x' \rVert^2$ is the squared Euclidean distance between $x$ and $x'$. The RBF is a similarity measure ranging between zero and one, and $\phi(x)$ has an infinite number of dimensions [62].
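The behavior of the RBF kernel as a similarity measure can be seen in a few lines. The sketch assumes the standard Gaussian RBF form $\exp(-\gamma \lVert x - x' \rVert^2)$; the sample points and the value of `gamma` are arbitrary.

```python
import math

def rbf_kernel(x, x_prime, gamma):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)  # identical points
near = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)  # squared distance 2
far = rbf_kernel([0.0, 0.0], [5.0, 5.0], gamma=0.5)   # squared distance 50
print(same, near, far)
```

Identical points give a similarity of one, and the value decays toward zero as the points move apart; a larger `gamma` makes the decay faster.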
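Overall, the dual problem of kernel SVM can be expressed as follows:
$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
where $C \ge \alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$. $\alpha_i$ denotes the Lagrange multipliers, and $\{x_i \mid C > \alpha_i > 0, \forall i\}$ are the support vectors that determine the decision boundary. $C$ is an upper bound of $\alpha_i$ in this kernel SVM optimization setting. In addition, $C$ and $\gamma$ are the two hyperparameters of the SVM.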
One-vs-rest (OvR) is a popular method for multiclass classification [63]. In the OvR approach to three-class classification, three binary SVMs classify each class in an online product review against the rest of the classes as {1, the others}, {2, the others}, and {3, the others}. The SVM that has the largest margins among the three SVMs determines the class of new data in the test set.
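The OvR decision rule can be sketched as follows. Each binary kernel SVM is assumed to score a new sample with the usual decision function $\sum_i \alpha_i y_i K(x_i, x) + b$, and the class whose classifier returns the largest value is selected. The support vectors, multipliers, and biases below are invented for illustration and are not fitted values.

```python
import math

def rbf_kernel(x, x_prime, gamma):
    """Gaussian RBF kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x_prime)))

def decision_value(x, support_vectors, gamma, bias):
    """Signed score of one binary SVM; support_vectors holds (alpha, y, x_i)."""
    return sum(a * y * rbf_kernel(sv, x, gamma) for a, y, sv in support_vectors) + bias

def ovr_predict(x, classifiers, gamma):
    """Pick the class whose 'class vs. the others' SVM gives the largest score."""
    scores = {
        label: decision_value(x, svs, gamma, bias)
        for label, (svs, bias) in classifiers.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical binary classifiers for classes 1, 2, and 3.
classifiers = {
    1: ([(1.0, +1, [0.0, 0.0]), (1.0, -1, [2.0, 2.0])], 0.0),
    2: ([(1.0, +1, [2.0, 2.0]), (1.0, -1, [0.0, 0.0])], 0.0),
    3: ([(1.0, +1, [4.0, 0.0]), (1.0, -1, [1.0, 1.0])], 0.0),
}
label, scores = ovr_predict([1.9, 2.1], classifiers, gamma=0.5)
print(label, scores)
```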

B. DECISION TREE (DT)
The decision tree (DT) model recursively partitions the feature space into a set of disjoint rectangular regions such that each region contains the same classes (Figure 7). For multiclass classification, the DT model has $K$ classes ($K > 2$). The feature space at each node $n$ is divided into two sub-regions based on $\theta_n = (x_j, t_j)$, where $x_j$ denotes the splitting variable $j$ and $t_j$ denotes the splitting value for $x_j$ at node $n$. $\theta_n$ splits the data at node $n$ into $D_{left}(\theta_n) = \{x \mid x_j \le t_j\}$ and $D_{right}(\theta_n) = \{x \mid x_j > t_j\}$. $R_n$ represents the region corresponding to node $n$ in the feature space, and $N_n = \sum_{i=1}^{N} I(x_i \in R_n)$ is the total number of instances in $R_n$. Node $m$ denotes the terminal node. The hyperparameter of the DT in this study is the maximum tree depth.
In DT, impurity means the heterogeneity of classes in a node and $H(\cdot)$ denotes the impurity function. The optimal value $\theta^*_n$ minimizes the impurity at the given node $n$ as follows:
$$\theta^*_n = \arg\min_{\theta_n} \left[ \frac{N_{left}}{N_n} H\big(D_{left}(\theta_n)\big) + \frac{N_{right}}{N_n} H\big(D_{right}(\theta_n)\big) \right]$$
where $N_{left}$ and $N_{right}$ are the numbers of instances in $D_{left}(\theta_n)$ and $D_{right}(\theta_n)$. The decision tree is simple, interpretable, applicable for regression and classification with continuous and/or categorical variables, and acceptable for a dataset containing missing values. However, the decision tree has high variance due to its hierarchical structure, which means that a small change in features can cause different split results. Further, the classification of the DT on imbalanced data could be biased toward the majority class. Therefore, tree ensemble models are applied to mitigate these problems.
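The split search at a single node can be sketched with an exhaustive scan. The sketch assumes Gini impurity as the impurity function $H(\cdot)$ and scores each candidate split by the size-weighted impurity of its two child nodes; the study does not state which impurity function it uses, and the toy data are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted impurity, feature index j, threshold t) of the best split."""
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            if not left or not right:
                continue  # a split must send data to both children
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
split = best_split(rows, labels)
print(gini(labels), split)
```

A full tree repeats this search recursively on each child until a stopping rule, such as the maximum depth, is reached.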

C. RANDOM FOREST (RF)
Ensemble methods use a set of base classifiers. The random forest (RF) is a tree ensemble model based on bootstrap aggregating (bagging). Dietterich (2000) suggested that ensemble models often perform better than single classifiers because (1) averaging classifiers may reduce the probability of using the wrong classifier; (2) different starting points for each classifier's optimization may reduce the risk of ending in a poor local optimum; and (3) combining classifiers may better represent the correct function for mapping features to labels [40].
In particular, the RF is able not only to improve the prediction performance by reducing variation but also to maintain robust prediction performance with an increasing number of noisy variables [41].
The RF procedure is: (1) generating an independent training set $s_i$ by selecting a subset of the sample from training set $S$ with replacement; (2) creating a de-correlated tree $rf_i$ by selecting a subset of features; (3) training $rf_i$ with $s_i$ and using the fitted $rf_i$ to classify new data $x$; and (4) repeating the above steps $B$ times and classifying new data by using majority voting as follows:
$$\hat{y}(x) = \text{majority vote}\,\{rf_i(x; \theta_i)\}_{i=1}^{B}$$
where $\theta_i$ indicates the parameters determining the structure of $rf_i$, including the subset of features, the splitting variables and points at each node, and the values at each terminal node. The hyperparameters are the number of trees and the depth of the trees.
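Two building blocks of this procedure, bootstrap resampling and majority voting, can be sketched as follows. The individual trees are omitted: the list `tree_predictions` stands in for the hypothetical outputs of five fitted trees on one new sample.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
training_set = list(range(10))  # stand-in for ten training instances
sample = bootstrap_sample(training_set, rng)

tree_predictions = [1, 3, 3, 2, 3]  # hypothetical votes of B = 5 trees
prediction = majority_vote(tree_predictions)
print(sample, prediction)
```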
Breiman [64] argued that the RF's prediction performance depends on the performance of individual DTs and the correlation between DTs. However, the minority classes in imbalanced data could be less represented in the sub-samples due to resampling, and this may cause lower prediction performance for the minority classes in RF. Chen et al. [36] suggested using the weighted RF to correct the problem of imbalance.

D. EXTREME GRADIENT BOOSTING (XGB)
Boosting combines multiple weak classifiers to build a strong classifier. However, boosting does not involve bootstrap resampling [39]. Extreme gradient boosting (XGB) [42] implements gradient boosting [65] by regularizing the complexity of the tree structure. The prediction of a tree ensemble model is the sum of $K$ DTs:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$
where $\mathcal{F} = \{f(x) = w_{q(x)}\}$ with $q(x) \in \{1, \dots, T\}$ and $w \in \mathbb{R}^T$. $\mathcal{F}$ is the functional space of possible DTs, $q$ is a leaf index function that represents the structure of the tree, $T$ is the number of leaves in the tree, and $w$ is the weight of each leaf.
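Each DT has an objective function (OF). A smaller OF value means a better tree structure. The optimization of each tree structure minimizes the OF:
$$Obj = \sum_{i=1}^{N} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $\Omega(f_k)$ is the regularization term for the complexity of tree $f_k$. The OF contains additive tree functions; therefore, it cannot be optimized by conventional methods. Instead, additive training is applied to the optimization by adding a new function $f_t(x_i)$ in each iteration $t$ and using a second-order Taylor approximation:
$$Obj^{(t)} \approx \sum_{i=1}^{N} \left[ L\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
where $g_i$ and $h_i$ are the first and second derivatives of $L$ with respect to $\hat{y}_i^{(t-1)}$. For the multiclass classification, the softmax loss (cross-entropy loss) is applied:
$$L(y_i, \hat{y}_i) = -\sum_{k=1}^{K} \alpha_k\, I(y_i = k) \log \Pr(\hat{y}_i = k \mid x_i)$$
For imbalanced data, $\alpha_k$ becomes $\frac{N}{K \times N_k}$ to put more weight on the minority class and less on the majority class in the loss function [66]. The hyperparameters of XGB in this study are the number of trees, tree depth, learning rate, and class weight.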
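The class weight for imbalanced data and its effect on the cross-entropy loss can be sketched with toy numbers. The sketch assumes the weight $\alpha_k = N / (K \times N_k)$, where $N$ is the sample size, $K$ the number of classes, and $N_k$ the number of instances in class $k$; the labels and predicted probabilities are invented.

```python
import math
from collections import Counter

def class_weights(labels):
    """alpha_k = N / (K * N_k) for every class k present in labels."""
    n = len(labels)
    counts = Counter(labels)
    k = len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

def weighted_cross_entropy(label, probs, weights):
    """Loss of one instance: -alpha_y * log Pr(predicted class = y)."""
    return -weights[label] * math.log(probs[label])

labels = [1] * 6 + [2] * 3 + [3] * 1  # imbalanced toy sample
weights = class_weights(labels)

probs = {1: 0.5, 2: 0.3, 3: 0.2}      # hypothetical predicted probabilities
loss_minority = weighted_cross_entropy(3, probs, weights)
loss_majority = weighted_cross_entropy(1, probs, weights)
print(weights, loss_minority, loss_majority)
```

The rarest class receives the largest weight, so misclassifying a minority-class instance costs more than misclassifying a majority-class instance.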

E. ARTIFICIAL NEURAL NETWORK (ANN)
An ANN is a deep learning (DL) model. DL automatically learns a representation of data for required tasks [43]. Recently, deep learning has shown dramatic progress in diverse areas including natural language processing (NLP). Deep learning also has the potential to improve business analytics [67].
Deep learning relies on the universal approximation theorem [68], [69]. In this theorem, an ANN represented by $\hat{F}(x, w)$ can approximate any Borel measurable function $f(x)$ (any continuous function on a compact subset of finite Euclidean space is Borel measurable) with any desired degree of accuracy [43], [70] as follows:
$$\lvert \hat{F}(x, w) - f(x) \rvert < \epsilon \quad \text{for any } \epsilon > 0$$
The ANN will also be useful for approximating $E(Y \mid X)$ by mitigating functional form misspecification [44], [71].
The ANN has a multilayer structure with input, hidden, and output layers. Figure 8 shows the basic structure of the ANN for binary classification. The ANN example has an input layer with two input variables, one hidden layer with three neurons, and one output layer. Each neuron in the hidden layer receives a weighted input value from the input layer, and the received input values enter the activation function (a continuous nonlinear function) in each neuron. In this example, the activation function is the rectified linear unit (ReLU), $f(x) = \max(0, x)$. The weighted sum of the output values from the hidden layer enters the output layer. The softmax function, $f(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$, turns the output values from the previous hidden layer into the probability of class one. If $P(\text{class} = 1) > 0.5$, the label will be one; otherwise, it will be zero. The ANN learns optimal weights by backpropagation [72].
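The forward pass of the example network (two inputs, three ReLU hidden neurons, softmax output) can be sketched in plain Python. The weights and biases are arbitrary illustration values, and the output layer is assumed to have two units (class zero and class one) so that softmax yields the probability of class one.

```python
import math

def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Weighted sums of the inputs, one per neuron (row of weights)."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, w1, b1, w2, b2):
    hidden = relu(dense(x, w1, b1))      # hidden layer: 3 neurons
    return softmax(dense(hidden, w2, b2))  # output: [P(class 0), P(class 1)]

# Hypothetical parameters for illustration only.
w1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
w2 = [[0.3, -0.6, 0.2], [-0.4, 0.5, 0.7]]
b2 = [0.05, -0.05]

probs = forward([1.0, 2.0], w1, b1, w2, b2)
label = 1 if probs[1] > 0.5 else 0
print(probs, label)
```

Training would adjust `w1`, `b1`, `w2`, and `b2` by backpropagation; only the prediction step is shown here.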
In this study, the ANN structure contains two hidden layers. The activation functions are ReLU. The optimization method for minimizing cross-entropy loss is Adam [73]. Dropout is a regularization method used to prevent overfitting during the training steps.
The hyperparameters are the optimal training iteration, dropout rate, learning rate, and number of neurons in the two hidden layers. The class weight is also a hyperparameter; however, the class-weighted ANN shows lower prediction performance than the unweighted one.

F. LONG SHORT-TERM MEMORY (LSTM)
The recurrent neural network (RNN) is a DL model for sequence data. However, the RNN may suffer from the vanishing gradient problem during the training of long sequence data [74]. LSTM mitigates the vanishing gradient problem by introducing the memory cell structure [45], [75]. LSTM has a multilayer structure with input, hidden, and output layers. In particular, the hidden layer(s) contains memory cells. Each memory cell is controlled by three gates (the input gate $i_t$, forget gate $f_t$, and output gate $o_t$). The memory cell at time $t$ receives the input value $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $C_{t-1}$.
The input gate $i_t$ decides whether the information in $x_t$ and $h_{t-1}$ is useful for $C_t$. The forget gate $f_t$ decides how much of the information in the previous cell state $C_{t-1}$ is retained in $C_t$. The output gate $o_t$ decides which information in $C_t$ will be preserved in $h_t$. Figure 9 shows the structure of the memory cell. The hyperparameters of the LSTM model in this study are the learning rate, training epochs, and number of neurons.
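One step of a memory cell can be sketched with scalar states. The sketch follows the standard LSTM update (sigmoid gates, a tanh candidate, cell state $C_t = f_t C_{t-1} + i_t \tilde{C}_t$, and hidden state $h_t = o_t \tanh(C_t)$); the parameter values are arbitrary and the cell has a single unit for readability.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell with parameters in dict p."""
    i_t = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])  # input gate
    f_t = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])  # forget gate
    o_t = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])  # output gate
    c_tilde = math.tanh(p["wc"] * x_t + p["uc"] * h_prev + p["bc"])  # candidate
    c_t = f_t * c_prev + i_t * c_tilde  # keep part of the old state, add new info
    h_t = o_t * math.tanh(c_t)          # expose part of the cell state
    return h_t, c_t

# Hypothetical parameters for illustration only.
params = {"wi": 0.5, "ui": 0.1, "bi": 0.0,
          "wf": 0.4, "uf": 0.2, "bf": 1.0,
          "wo": 0.3, "uo": 0.1, "bo": 0.0,
          "wc": 0.8, "uc": 0.2, "bc": 0.0}

h, c = 0.0, 0.0
hidden_states = []
for x in [0.5, -1.0, 0.25]:  # a short toy input sequence
    h, c = lstm_step(x, h, c, params)
    hidden_states.append(h)
print(hidden_states)
```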

APPENDIX G EX ANTE PREDICTION RESULTS
Model 1 (''at time model'') is the base model that contains only 37 observable variables. Models 2, 3, and 4 are ex ante models used to predict consumers' potential sentiment for PTs before they make a purchase. Models 5 and 6 are ''partial ex ante'' models used to predict consumers' potential sentiment for the PTs purchased before they write a review.

A. TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)
Frequency-based embedding is a simple way to map each review text to numerical vectors. Term frequency-inverse document frequency (TF-IDF) is a frequency-based type of word embedding and penalizes the high-frequency words in the entire review [35]. For example, ''the'' may have a low TF-IDF value because many reviews contain ''the''.
The pre-processing for TF-IDF in this study is conducted as follows: Step 1. Putting all words into lower case; Step 2. Splitting the review text into words; Step 3. Removing stopwords, punctuation, numbers, and single characters; Step 4. Lemmatizing words (converting words into the base form, e.g., writing → write).
After the above steps, the number of unique words in the 5,307 review texts (the vocabulary) is 15,843. The resulting review-by-word matrix is sparse and high-dimensional, containing many zero values. The TF-IDF score weights how frequently a word appears in a review by how rare the word is across all reviews, as follows:
$$\text{TF-IDF score}(\text{word}_{n,i}) = tf_{n,i} \times \log\frac{N}{df_n}$$
$tf_{n,i}$: the frequency of word $n$ in review $i$ (term frequency)
$df_n$: the number of reviews containing word $n$ (document frequency)
$N$: the total number of reviews ($N$ = 5,307)
In this equation, low-frequency words in review $i$ will have a low TF-IDF score due to low term frequency; common words that occur in many reviews will also have a low TF-IDF score due to high document frequency [78]. On top of the TF-IDF embedding vectors from the review text data, tree ensemble models (RF and XGB) are applied for sentiment analysis. TF-IDF yields a high-dimensional sparse matrix and cannot represent similarity, ambiguity, or contextual meaning in a text.
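The scoring rule can be illustrated on three tiny tokenized reviews. The sketch assumes the plain form $tf \times \log(N / df)$ with the natural logarithm and no smoothing; library implementations often use smoothed variants, so their numbers can differ. The toy reviews are invented.

```python
import math
from collections import Counter

def tf_idf(reviews):
    """Return one {word: score} dict per tokenized review."""
    n_reviews = len(reviews)
    # Document frequency: number of reviews that contain each word.
    df = Counter(word for review in reviews for word in set(review))
    return [
        {word: tf * math.log(n_reviews / df[word])
         for word, tf in Counter(review).items()}
        for review in reviews
    ]

reviews = [
    ["easy", "install", "thermostat"],
    ["thermostat", "save", "energy", "energy"],
    ["thermostat", "wifi"],
]
scores = tf_idf(reviews)
print(scores[1])
```

The word ''thermostat'' occurs in every review, so its score is zero, while ''energy'' occurs twice in one review only and receives the highest score in that review.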

B. Word2Vec (W2V)
The Word2Vec (W2V) model is a word distribution-based embedding method and generates dense embedding vectors representing each word's semantic meaning. For example, the W2V model may generate similar embedding vectors for ''pen'' and ''pencil'' because the two words contain similar semantic meanings.
As a pre-process, the following steps are applied: Step 1. Converting emoticons and $ symbols into related words; Step 2. Splitting the review text into words (tokenization); Step 3. Removing stopwords, punctuation, numbers, and single characters; Step 4. Lemmatizing words (converting words into the base form, e.g., writing → write). After the above steps, the W2V model generates embedding vectors from each review text. The skip-gram W2V model [52] generates a $k$-dimensional real-vector word embedding $v_n$ for the $n$th word in all reviews by maximizing the following objective function:
$$\frac{1}{N} \sum_{n=1}^{N} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{n+j} \mid w_n), \qquad p(w_{n+j} \mid w_n) = \frac{\exp\big(u_{n+j}^{\top} v_n\big)}{\sum_{t=1}^{T} \exp\big(u_t^{\top} v_n\big)}$$
where $N$ is the number of words in all the reviews (the entire corpus); $c$ is the window size for selecting neighboring words around the center word $n$; $T$ is the number of unique words (vocabulary) in all the reviews; and $u_t$ denotes the output (context) vector of word $t$. In this study, the W2V model is trained with all the reviews (1,926,047 reviews) in the ''tools and home improvement'' category, and the number of unique words is 73,856. The hyperparameters are the W2V embedding dimension, window size, and training dataset. After hyperparameter tuning, the optimal W2V embedding dimension is 100 and the optimal window size is 5.
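The role of the window size can be seen by listing the (center word, context word) pairs over which the skip-gram objective sums. This is a sketch of the pair construction only, not of the embedding training; the token list is invented and a window of 1 is used to keep the output short.

```python
def skipgram_pairs(tokens, window):
    """List (center, context) pairs for every word within the window."""
    pairs = []
    for n, center in enumerate(tokens):
        start = max(0, n - window)
        stop = min(len(tokens), n + window + 1)
        for m in range(start, stop):
            if m != n:  # a word is not its own context
                pairs.append((center, tokens[m]))
    return pairs

tokens = ["easy", "install", "love", "thermostat"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

With a window of 5, as selected in the study, each center word is paired with up to ten neighbors.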

C. BIDIRECTIONAL ENCODER REPRESENTATIONS FROM TRANSFORMERS (BERT)
Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art context-based embedding method. BERT can represent the same word with different embedding vectors by reflecting the contextual meaning of each occurrence. For example, in the sentences "I did not like this thermostat in the past. Now, I love this thermostat," the word "thermostat" occurs once in each sentence. BERT generates a different embedding vector for each occurrence based on the contextual information around it, whereas context-free embedding models (e.g., TF-IDF and W2V) generate the same embedding vector for both occurrences. This distinction matters for the present data: the domain expert in this study read and annotated all 5,307 reviews for PTs and found that the review text often compares the previously owned PT with the newly purchased PT; therefore, the same word often appears in different contexts depending on its position in the review. For example, in "I disliked the previous thermostat. However, I love this new thermostat," the first occurrence of "thermostat" may carry a negative sentiment and the second a positive sentiment.
However, context-free embedding models (e.g., TF-IDF and W2V) cannot capture different semantic meanings of the same word in different positions in the review sentences. In contrast to the context-free embedding models, BERT (context-based embedding) can find the contextual difference between occurrences of the same word in different positions in the review sentences.

FIGURE 11. The structure of the CNN [55], [56].
The original BERT model is pre-trained on 800 million words from a book corpus [79] and 2,500 million words from Wikipedia. BERT uses the WordPiece tokenizer [80], which splits words that are not in its vocabulary into sub-word units so that out-of-vocabulary words can still be represented.
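The sub-word splitting can be illustrated with a minimal sketch of greedy longest-match-first tokenization, the matching strategy used by WordPiece at tokenization time. The function name `wordpiece_tokenize` and the six-entry vocabulary are hypothetical; the actual BERT vocabulary is far larger and may split these words differently.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (not the real BERT vocabulary).
vocab = {"thermo", "##stat", "##s", "install", "##ing", "the"}
for word in ["thermostats", "installing", "xyz"]:
    print(word, wordpiece_tokenize(word, vocab))
```

Because matching is longest-first, "thermostats" starts with the piece "thermo" rather than the shorter vocabulary entry "the", and a word with no matching pieces falls back to a single unknown token.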
BERT's structure is based on multilayered transformer encoders [81]. BERT is trained on two objectives: masked language modeling (MLM) and next sentence prediction (NSP). MLM predicts randomly masked tokens in a sentence, which teaches the model the contextual information in the text. NSP is a binary classification of whether the second sentence actually follows the first one, which teaches the model the relationship between sentences.
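How training examples for the two objectives can be constructed is sketched below. This is a simplified illustration, not the study's code: the function names are hypothetical, the sentences are toy data, and every selected token is replaced by [MASK], whereas the original BERT recipe selects about 15% of tokens and sometimes substitutes a random token or keeps the original instead.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Return (masked_tokens, labels). labels[i] holds the original token
    at masked positions and None elsewhere (only masked positions are predicted)."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

def nsp_example(sentences, i, rng):
    """Return (first, second, is_next) for sentence i, where i is not the last
    index. Half of the time the true next sentence is used; otherwise a random
    sentence that is neither sentence i nor its true successor."""
    first = sentences[i]
    if rng.random() < 0.5:
        return first, sentences[i + 1], True
    candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
    return first, rng.choice(candidates), False

rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = ["i", "love", "this", "new", "thermostat", "it", "works", "well"]
masked, labels = mask_tokens(tokens, mask_prob=0.15, rng=rng)

# Hypothetical toy sentences (at least three are needed for a negative pair).
sentences = [
    "i disliked the previous thermostat",
    "however i love this new thermostat",
    "the installation took ten minutes",
    "the display is bright",
]
first, second, is_next = nsp_example(sentences, 0, rng)
print(masked, labels)
print(first, "|", second, "|", is_next)
```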
This study uses the BERT-base model, which contains 30,522 unique tokens with 768 embedding dimensions, for fine-tuning and further pre-training. With fine-tuned BERT, the CNN is applied on top of the pre-trained embedding from the original BERT model. With further pre-trained BERT, the BERT embedding is first updated by training on the review text data and is then used as the input vectors for the CNN classifier. Recently, Gururangan et al. [51] and Sun et al. [53] showed that further pre-training with domain data could improve machine learning models' performance.

APPENDIX I CONVOLUTIONAL NEURAL NETWORK (CNN) FOR SENTIMENT CLASSIFICATION
Figure 11 provides an example of a simplified CNN for binary classification. The structure of the CNN in this example has four layers. The first layer is the input word embedding generated from the review text. Each review text is split into tokens (e.g., words in a W2V model and sub-words in a BERT model) and becomes a sequence of tokens with length n. The tokenized review is denoted as $x_{1:n}$. Each token $x_i$ is mapped to a word-embedding vector in $\mathbb{R}^d$. The embedded sequence of tokens $x_{1:n}$ is expressed as follows:

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n, \qquad x_i \in \mathbb{R}^d,\ i \in \{1, \ldots, n\},$$

where $\oplus$ is the concatenation operator. Each class has a predicted probability, and the class with the highest predicted probability is the predicted class.
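The forward pass of a simplified CNN text classifier can be sketched as follows. Only the concatenation of token embeddings and the final rule (choose the class with the highest predicted probability) come from the text above. The convolution over windows of h consecutive tokens, the ReLU activation, max-over-time pooling, and the dense softmax layer are assumptions based on the commonly used CNN design for sentence classification, and all embeddings and weights are made-up toy values with d = 2.

```python
import math

def concat(vectors):
    """x_{1:n} = x_1 (+) ... (+) x_n: join token vectors end to end."""
    return [value for vec in vectors for value in vec]

def conv_feature_map(embedded, filt, bias):
    """Slide one filter over windows of h consecutive tokens, then apply ReLU.
    The window height h is inferred from the filter length and d."""
    d = len(embedded[0])
    h = len(filt) // d
    feats = []
    for i in range(len(embedded) - h + 1):
        window = concat(embedded[i:i + h])
        z = sum(w * x for w, x in zip(filt, window)) + bias
        feats.append(max(0.0, z))
    return feats

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(embedded, filters, dense_w, dense_b):
    """Return (class probabilities, index of the most probable class)."""
    # Max-over-time pooling keeps one feature per filter.
    pooled = [max(conv_feature_map(embedded, f, b)) for f, b in filters]
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(dense_w, dense_b)]
    probs = softmax(logits)
    return probs, probs.index(max(probs))

# Toy input: n = 4 tokens, each with a d = 2 embedding (hypothetical values).
embedded = [[0.1, 0.3], [0.5, -0.2], [0.4, 0.9], [-0.6, 0.2]]
filters = [
    ([1.0, 0.0, 0.0, 1.0], 0.0),            # window of h = 2 tokens
    ([0.5, 0.5, 0.5, 0.5, 0.5, 0.5], 0.1),  # window of h = 3 tokens
]
dense_w = [[1.0, -1.0], [-1.0, 1.0]]  # two output classes
dense_b = [0.0, 0.0]
probs, label = predict(embedded, filters, dense_w, dense_b)
print(probs, label)
```

In a trained model, the filters and dense weights are learned from labeled reviews, and the input vectors come from the chosen embedding (W2V or BERT) rather than being fixed by hand.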