Do Reviewers’ Words and Behaviors Help Detect Fake Online Reviews and Spammers? Evidence From a Hierarchical Model

Although numerous studies have investigated spam detection and spammer detection on online platforms, they have ignored the fact that reviews written by the same reviewer may be correlated because each reviewer has their own distinct style. The traditional logistic regression model cannot handle this type of data because such data violate the assumption of independent residuals. Furthermore, relatively few studies on fake review detection have considered linguistic and behavioral aspects simultaneously. Thus, we propose a hierarchical logistic regression (HLR)-based model for detecting fake reviews that considers both linguistic and behavioral characteristics. The resulting model also has multiple applications, including the detection of review spammers as a quality pre-module for machine learning. The experimental results demonstrate that HLR can classify fake reviews and review spammers more accurately than standard machine-learning algorithms.


I. INTRODUCTION
The rapid development of the Internet has led to the rising availability of review services on online platforms, such as shopping websites (e.g., amazon.com) and opinion-sharing websites (e.g., epinions.com). People now typically do not purchase products or services without first reading the reviews [1]; consumer-generated reviews have become an indispensable part of the online shopping experience. However, approximately 30% of online reviews are fake [2] and can mislead consumers into making poor decisions. They may even undermine the credibility and usefulness of reviews in general. Positive fake reviews can boost sales, conferring prestige and financial benefits at both the corporate and individual levels. By contrast, negative fake reviews can exert severe negative effects on the sales of a product or service and may even threaten the reputation of the relevant firm [3]. Therefore, the detection and elimination of review spam is essential for protecting the interests of consumers and sellers alike.
Unlike other kinds of spam (e.g., Web spam or email spam), review spam (i.e., fake reviews) is considerably more challenging to detect. Specifically, human users experience difficulty recognizing review spammers because spammers can easily pretend to be legitimate reviewers. Furthermore, given the openness of product review sites, spammers can pose as numerous users, thereby complicating their eradication. Some paid professionals fabricate reviews without having used the product or service in question. Their sole goal is to promote the reputation of their employer or undermine that of their employer's competitors [4], [5]. Such behaviors undercut the credibility of review platforms. In sum, distinguishing fake online reviewers from real ones is a formidable challenge because review spammers can outsmart genuine users by mimicking their behavior [6].
Fake review detection and review spammer detection have been investigated for many years. [7] conducted the first study on spam detection, constructing a classifier that employed certain types of duplicate reviews as positive training data; the remaining types were used as negative training data. They continued this line of inquiry in [8], which identified three types of spam reviews, namely untruthful opinions, reviews on brands only, and non-reviews, by representing each review through a combination of review-, reviewer-, and product-level features. [9] developed a scoring method for measuring the degree of spam produced by each reviewer by examining several behavioral characteristics of spammers. Subsequently, they modeled these behaviors for spammer detection.
To detect deceptive opinions, [10] explored psycholinguistic features by combining the Linguistic Inquiry and Word Count (LIWC) text analysis program with analysis of standard words and part-of-speech (POS) n-gram features. Many other studies on spam detection and spammer detection have been conducted in recent years, for example [11]-[14]. Although numerous studies on identifying fake reviews and review spammers within online platforms have been performed, the nested relationship between reviews and reviewers has not been investigated. Existing standard single-level models treat all electronic word-of-mouth reviews as independent observations. For example, [15] employed logistic regression to determine whether reviews were manipulative or authentic by considering linguistic cues, such as the readability, genre, and writing style of negative reviews. [7] applied logistic regression to a data set obtained by crawling amazon.com, extracting review content and reviewer-specific features and achieving 78% accuracy. Notably, reviews written by the same reviewer may be correlated with each other because each reviewer expresses their knowledge and ideas through their own distinct style [16], [17]. This correlation severely contravenes one of the most pivotal assumptions of conventional regression analysis, namely the independence of the residuals [18]. This violation can lead to underestimation of the standard error, which in turn can contribute to incorrect findings of significance. Furthermore, variability among reviewers and among reviews nested within reviewer clusters cannot be resolved by single-level logistic models.
Another shortcoming of the literature on fake review and review spammer detection is that linguistic features (e.g., LIWC, POS tagging) and behavioral features (e.g., life tenure, rating deviation) have rarely been examined simultaneously. For example, [19] incorporated linguistic features of reviews involving POS tagging, unigram, and LIWC analyses into their forecasting model and reported a detection accuracy of 65%. [20] examined reviewer features (e.g., the average proportion of unhelpful reviews and the ratio of the number of first reviews of a product to the total number of reviews written by that reviewer), achieving a precision of 64%. These findings provide compelling evidence to support the premise that both linguistic and behavioral characteristics can play integral roles in the detection of fake reviews and review spammers. However, few studies have probed both types of features in an integrated manner, despite the fact that combining them typically yields more favorable detection performance than considering each type of characteristic separately. For example, by using integrated features from reviews and reviewers, [21] attained 90% accuracy in detecting fake reviewers, [1] achieved an F1-score of 95% in spammer detection, and [22] identified fake reviews with 95% accuracy. Therefore, there is a need to construct a fake-review-prediction model in which features from both categories are used simultaneously.
To detect fake reviews, we developed a hierarchical logistic regression (HLR) model that takes into account the characteristics of both reviews and reviewers. Hierarchical models carefully consider variability at each level of the hierarchy. Specifically, they enable cluster effects at different levels to be analyzed by providing estimates of how much of the variance is attributable to the reviewer and to each individual review. Hierarchical modeling is a highly recommended statistical method for handling this type of data structure, in which individual reviews are grouped within reviewer clusters. Because standard machine-learning algorithms cannot handle nesting, our model has various applications, including the detection of review spammers as a quality pre-module for machine learning. For model validation, we conducted an experiment comprising two parts: a recency analysis and a duration analysis. In both parts, we first conducted HLR while considering both linguistic and behavioral features. We then used the results as inputs for detecting review spammers. The experimental results demonstrated that the HLR model was more effective in differentiating between fake and real reviews and reviewers than were the logistic regression (LR) algorithm and other machine-learning algorithms, namely support vector machine (SVM), random forest (RF), naive Bayes (NB), and k-nearest neighbor (KNN). The highest accuracy rates of fake review detection and review spammer detection were 86% and 94%, respectively. The remainder of this paper is organized as follows. Section II provides the theoretical foundation. Section III introduces the methods and research process. Sections IV and V detail the experiments and results. In Section VI, the conclusions and future directions are presented.

II. THEORETICAL FOUNDATION
In this section, we present the theoretical foundation of fake review detection employed in this study, namely linguistic analysis, the behavioral features of reviewers, and HLR.

A. LINGUISTIC ANALYSIS
Studies have established that writing is a stable, reliable, and personalized trait. Text analysis programs can be used to link natural language characteristics to personality characteristics [10], [17], [23]. Moreover, individuals express their knowledge and ideas through a unique linguistic style [16], [24]. In the analysis of fake reviews, the most widely used linguistic analysis features include LIWC, readability, and POS. Herein, we further applied two features that had not previously been employed in relevant research: evidentiality and credibility.

1) LIWC FACTORS
LIWC is a text analysis tool that links natural word use to personality traits [25]. Through a psychological approach, it counts words within a given text sample irrespective of the context in which the words occur. The LIWC dictionary contains 80 psychological categories into which 4500 words are classified. LIWC enables researchers to explore the linguistic features of textual data, such as the number of pronouns and numbers of positive and negative emotion words. Using the LIWC engine, users can create their own internal dictionary with which to analyze text files and dimensions of interest [26]. [10] achieved more favorable results when they considered LIWC attributes alongside the bag-of-words model than with the bag-of-words model alone. In the context of online communities in which opinions and experiences are shared, the LIWC tool facilitates analysis of reviewers' personal traits and fraudulent behavior [10], [23].
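As an illustration of this context-free counting, the sketch below tallies words per category against a hypothetical mini-dictionary; the real LIWC lexicon is proprietary, so the category names and word lists here are invented for demonstration only.

```python
import re

# Hypothetical mini-dictionary standing in for the proprietary LIWC lexicon;
# categories and word lists are illustrative, not the real LIWC categories.
MINI_DICT = {
    "negations": {"no", "not", "never"},
    "pronouns": {"i", "we", "you", "they"},
    "positive_emotion": {"great", "love", "nice"},
}

def category_counts(text):
    """Count words per category, ignoring context (LIWC-style)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {cat: sum(t in words for t in tokens)
            for cat, words in MINI_DICT.items()}

counts = category_counts("We love this place, but they never answer. Not great.")
```

In a real pipeline the counts would be normalized by the review's total word count, as done for the other linguistic features below.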

2) READABILITY
Although the literature offers various definitions, readability generally refers to the quality that makes some texts easier to read than others [27]. One study stated that readability is the speed at which a text can be read as well as the ease with which the content can be understood and retained [28]. Focusing on the issue of writing style, [29] defined readability as ''the ease of understanding or comprehension due to the style of writing,'' not relating readability to the content, coherence, or organization of a text. In the present context, readability represents the effort and expertise a person requires to comprehend a review [30], [31]. [15] indicated that fake reviews have higher readability than do genuine reviews because review spammers use words that are easier to understand such that the review can be read rapidly. In line with relevant studies, we employed readability as a linguistic feature through which fake reviews and review spammers were identified.

3) CREDIBILITY
Credibility refers to the quality and professionalism of a review [32]. Various studies have considered credibility in text analysis. For example, [33] took credibility into account to compile a list of indicators for assessing the credibility of blogs. The indicators comprised several main categories: information quality, appeals and triggers of a personal nature, the blogger's expertise and disclosure of their offline identity, and the blogger's profile and value system. [34] asserted that customers tend to vote for reviews that direct them to credible sources; as a result, their helpfulness ratings are increased. The work of [35] also used credibility to discover the major elements contributing to review helpfulness, among which the average user helpfulness, the number of user reviews, the average business helpfulness, and the review length were of the utmost significance. [17] used credibility to predict the usefulness of reviews. The indicators examined were correct capitalization, emoticons, all capitals (which suggests a ''shouting'' tone), and misspelling. The researchers determined that credible reviews employ correct capitalization (including by minimizing the use of all capitals) and contain fewer emoticons and misspelled words. Given its demonstrated ability to predict review helpfulness, we believe that credibility also plays a vital role in identifying fake reviews and review spammers. Therefore, we incorporated it as a linguistic feature in our detection model.

4) EVIDENTIALITY
[36] defined evidentiality as the linguistic representation of evidence for a statement and its use as an explicit linguistic system to indicate the quality of information; evidentiality offers evidence to aid direct determination of the trustworthiness of a text [37], [38]. The linguistic definition of evidentiality has two dimensions: 1) as a label to indicate the source of information on narrated events [38] and 2) the evidence with which information is obtained [39]. [37] devised a linguistic model in which the concept of evidentiality was incorporated into a machine learning-based text classification framework. Evidential information provides clues instrumental to predicting the value of a text. Because the objective of review spammers is to convince other users to agree with their opinions, their reviews tend to be clearer and more straightforward than genuine reviews. Accordingly, evidentiality should be adaptable to the evaluation of fake reviews and review spammers.

5) POS
POS tagging, which provides syntactic (or grammatical) information on a sentence, has been used in natural language processing to measure text informativeness [40]. This method relies on the assumption that spammers cannot replicate all aspects of natural language when repurposing content [41]. A study by [10] reported that fake reviews contain more verbs, adjectives, and superlatives than do genuine reviews because review spammers tend to exaggerate. [42] employed supervised algorithms to classify a data set of fake reviews. The researchers also performed POS tagging, focusing on writing style, and detected fake reviews with outstanding accuracy (91.51%). Following these studies, we applied POS tags as linguistic features for identifying fake reviews and review spammers.

B. BEHAVIORAL FEATURES

1) RATING ENTROPY
Introduced by Claude Shannon in 1948, the concept of entropy holds that the entropy of a random variable is the average level of uncertainty inherent in its possible outcomes [43]. On this basis, we defined rating entropy as the average level of disorder in a reviewer's rating scores. Genuine reviewers are more likely to base their reviews on merit, resulting in balanced reviews, that is, reviews that are equally critical and noncritical. Spammers, by contrast, are likely to present extreme opinions given that their objective is either to artificially increase the ranking of a product or service or to lower the ranking of its competitors. According to [2], suspicious reviews are more extreme than genuine evaluations. [44] noted that review spammers tend to leave extreme ratings. Thus, we considered entropy a behavioral feature useful for recognizing fake reviews and review spammers.

2) RATING DEVIATION
Rating deviation refers to the amount by which a reviewer's rating of a product or service differs from the average rating of that product or service. In general, genuine reviewers rate a product comparably to other reviewers. By contrast, spammers are more likely to give low-quality products high ratings and give high-quality products low ratings to promote and undermine these products, respectively. [9] observed that spammers tend to deviate from the general rating consensus. The more a reviewer's rating of a product differs from the average rating, the greater that reviewer's rating deviation. [45] indicated that rating deviation is among the most critical features for identifying fake reviews. Notably, rating deviation provides clues pivotal to determining the quality of a review or reviewer.

3) REVIEW COUNT
Review count is the number of reviews written by a particular reviewer. It is a valuable factor for distinguishing between review spammers and genuine reviewers. Specifically, spammers tend to post more reviews than genuine reviewers do. In some circumstances, spammers evade detection or blacklisting by posting a small number of reviews from one account and then creating a new account from which to post more reviews. Studies on review manipulation have demonstrated that the behavioral distributions of opinion spammers differ from those of non-spammers and that publishing numerous reviews signals deviant behavior [44], [46]. Therefore, we considered review count a useful behavioral feature for recognizing fake reviews and review spammers.

4) LIFE TENURE
Life tenure is defined as the duration for which a reviewer has been active in an online forum [47], [48]. Specifically, studies have indicated that users' confidence regarding the authenticity of a reviewer increases with the amount of time that reviewer has been active in the forum or on the review website [49], [50]. According to [22], real consumers use their accounts to post reviews occasionally, whereas review spammers remain members of a platform for only a short period and post a relatively high number of reviews in that period. In sum, an account that is active for only a short time yet posts numerous reviews is indicative of suspicious behavior. By contrast, a reviewer who is visible for a relatively long period and posts reviews periodically exhibits normal behavior [46]. We therefore used life tenure as a feature for recognizing review spammers.

5) REVIEW GAP
The average time between one reviewer's successive comments is known as the review gap. This is a useful metric for identifying potential review spammers, who are likely to try to copy the average person's review-posting frequency. Research has demonstrated that some people send messages in bursts of activity, whereas others send messages at more consistent intervals. For example, [22] observed that opinion spammers are rarely long-term users of any website, whereas genuine reviewers typically are. Posting reviews over a long period indicates regular activity, whereas posting them within a short burst indicates suspicious activity [44], [46]. Therefore, this study employed the review gap as a behavioral feature for identifying fake reviews and review spammers.

C. HLR MODEL
Hierarchical modeling can account for the hierarchical structure of data sets consequent to unobserved heterogeneity when individual observations are nested in some factors at a higher level of the data structure, which can lead to dependency across observations [51], [52]. Consider the relationship between reviews and reviewers. Reviews written by a specific reviewer form a group with high homogeneity; in other words, the reviews in the group are not completely independent of each other. This is due to the inherent differences between the writing styles of that reviewer and other reviewers; reviews written by a specific reviewer are linked to that reviewer [17]. This nesting structure renders a single-level logistic regression model inadequate for prediction because it violates the assumption of independent residuals and cannot capture the variability among reviewers [18]. By contrast, HLR models can determine how a covariate measured at different levels of the hierarchy influences the response variable. This is accomplished by permitting group characteristics at higher levels of the data structure to be involved in modeling individual outcomes [53].
We divided the HLR model into two levels, in which reviews are at the lower level (level 1) and are each nested within a certain reviewer (level 2). The first level is expressed as

$$\log\left(\frac{P(Y_{ij}=1)}{P(Y_{ij}=0)}\right)=\beta_{0j}+\beta_{1j}X_{1ij}+\beta_{2j}X_{2ij}+\cdots+\beta_{qj}X_{qij}, \quad (1)$$

where the left-hand side is the log of the odds: the probability that individual review i written by reviewer j is fake (denoted Y_ij = 1) divided by the probability that individual review i written by reviewer j is genuine (denoted Y_ij = 0). X_{1ij}, X_{2ij}, ..., X_{qij}, representing the linguistic features of reviews, are the predictors; β_{0j} is the intercept; and β_{1j}, β_{2j}, ..., β_{qj} are the coefficients corresponding to those predictors. Each coefficient captures the average effect of a level-1 predictor on log(odds) and becomes the odds ratio (OR) when exponentiated, as in OR_1 = exp(β_{1j}). The OR is the multiplicative factor by which the predicted odds of an event occurring rather than not occurring change for a one-unit increase in the predictor. At level 2, we assume that β_{0j}, ..., β_{qj} depend on unobserved factors specific to the jth reviewer:

$$\beta_{pj}=\gamma_{p0}+\gamma_{p1}Z_{1j}+\cdots+\gamma_{pk}Z_{kj}+u_{pj}, \quad p=0,\ldots,q, \quad (2)$$

where u_{pj} denotes the macro error, assumed to have a normal distribution N(0, δ²); γ_{00}, ..., γ_{q0} are the fixed effects; and γ_{01}, ..., γ_{qk} are the coefficients associated with the behavioral features Z_{1j}, ..., Z_{kj} of each reviewer. These coefficients represent the average effect of a level-2 variable on log(odds) and likewise become ORs when exponentiated, as is the case with the level-1 coefficients.
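To make the two-level specification concrete, the following sketch computes the predicted probability that a review is fake under a random-intercept simplification of (1) and (2): the intercept is built from reviewer-level features while the slopes are treated as fixed. All coefficient values are hypothetical.

```python
import math

def hlr_prob(x, z, gamma00, gamma0k, beta, u0j=0.0):
    """P(Y_ij = 1) under a random-intercept HLR (sketch).

    Level 1: log-odds = beta0j + sum_q beta[q] * x[q]
    Level 2: beta0j   = gamma00 + sum_k gamma0k[k] * z[k] + u0j
    Slopes are fixed here for simplicity; all values are hypothetical.
    """
    beta0j = gamma00 + sum(g * zk for g, zk in zip(gamma0k, z)) + u0j
    log_odds = beta0j + sum(b * xq for b, xq in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

# One behavioral feature z, two linguistic features x (hypothetical values)
p = hlr_prob(x=[1.0, 0.5], z=[2.0], gamma00=-1.0, gamma0k=[0.5], beta=[0.8, -0.4])
```

In practice the γ and variance parameters would be estimated by maximum likelihood with mixed-model software, not set by hand.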
Equations (1) and (2) define a multilevel model that can equivalently be written as a single equation by substituting (2) into (1):

$$\log\left(\frac{P(Y_{ij}=1)}{P(Y_{ij}=0)}\right)=\gamma_{00}+\sum_{p=1}^{q}\gamma_{p0}X_{pij}+\sum_{k}\gamma_{0k}Z_{kj}+\sum_{p=1}^{q}\sum_{k}\gamma_{pk}X_{pij}Z_{kj}+u_{0j}+\sum_{p=1}^{q}u_{pj}X_{pij}. \quad (3)$$

The presence of macro error terms in (2) makes (3) a mixed model. If the macro errors are suppressed, (3) becomes a fixed-effect specification, and its estimation poses no particular problem. In this instance, eliminating the macro errors would be undesirable because we are unable to specify (even in principle) all the determinants of the within-reviewer coefficients. Our purpose was thus to incorporate this fundamental aspect of the substantive formulation into an appropriate estimation procedure, following the recommendations of [54].

III. METHODS AND RESEARCH PROCESS
Fig. 1 presents the system framework of the proposed HLR model. The model construction procedure was functionally divided into three major tasks. The first step involved retrieving and preprocessing a data set of reviews, including the content of the reviews. In the second step, the linguistic features of reviews and the behavioral features of reviewers were extracted to obtain model inputs. The third step entailed processing the two-level HLR model to classify fake reviews, in which level 1 represented review characteristics and level 2 represented behavioral features. This was followed by setting a threshold for review spammer detection. Recency and duration analyses were conducted to investigate the effectiveness of the proposed framework. Each step is explained in detail as follows.

A. DATA COLLECTION
We selected a labeled data set retrieved from the Yelp website. The data were originally collected by [46]. This data set comprised reviews written by 16 935 reviewers of 121 restaurants from 2004 to 2012 as well as corresponding information on the reviewers. From this initial data set, linguistic and behavioral features were extracted and then stored in the Review database.

B. DATA PREPROCESSING
Data preprocessing involved formalizing and structuring the review content in preparation for analysis. We followed the preprocessing procedure suggested in [55]. First, irrelevant characters, words, and other elements, such as HTML, tags, URLs, and punctuation marks, were eliminated. Next, the pronouns were replaced by corresponding nouns. Subsequently, the reviews were split into sentences in accordance with punctuation marks such as commas, semicolons, exclamation marks, and question marks. Finally, we applied POS functions to the tokenized words and then removed the stop words.
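The cleaning steps above can be sketched with standard regular expressions; pronoun replacement and POS tagging are omitted because they require an NLP toolkit, and the stop-word list is an illustrative stub.

```python
import re

def preprocess(review):
    """Sketch of the cleaning pipeline described above (pronoun resolution
    and POS tagging omitted; they need an NLP library such as spaCy)."""
    text = re.sub(r"<[^>]+>", " ", review)        # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)     # strip URLs
    # split on sentence-delimiting punctuation
    sentences = [s.strip() for s in re.split(r"[,;.!?]+", text) if s.strip()]
    stop_words = {"the", "a", "an", "is", "was"}  # illustrative stop list
    tokens = [w for s in sentences for w in s.lower().split()
              if w not in stop_words]
    return sentences, tokens

sents, toks = preprocess(
    "<b>Great food!</b> The staff was friendly; see https://example.com.")
```

The two return values correspond to the sentence-split and tokenized forms used by the feature extractors in the next subsection.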

C. FEATURE EXTRACTION

1) EXTRACTION OF LINGUISTIC FEATURES
As mentioned, each individual's linguistic style is unique and persists across multiple pieces of text. Herein, five linguistic aspects widely applied in the literature (LIWC, POS, readability, credibility, and evidentiality) were employed to determine review quality. The average reviewer tends to write more than one review in an online community; we thus had to aggregate the reviews posted by each reviewer. The extraction of linguistic features is explained as follows.

a: LIWC FACTORS
To analyze language use, we adopted the LIWC approach developed by [24]. Four reliable LIWC factors (Immediacy, Making Distinctions, The Social Past, and Rationalization) were used to aid fake review detection. These factors comprised 11 subcategories in the linguistic dimension, namely affective processes, cognitive processes, negations, pronouns, quantifiers, social words, tentative words, word count, family-related words, leisure-related words, and words longer than six letters. Given a review, each LIWC factor was estimated for the kth LIWC category of the jth review written by the ith reviewer as

$$\mathrm{LIWC}_{i,j,k}=\frac{wc_{i,j,k}}{w_{i,j}}, \quad (4)$$

where wc_{i,j,k} is the number of words belonging to the kth category and w_{i,j} is the total word count of the review.

b: READABILITY
Various formulas for calculating the readability of a text have been developed over the past 80 years [27]. We conducted the Flesch reading ease test [56], a reliable, widely used measure [27], to calculate the readability score of each review, with higher scores indicating higher readability. The formula for the test [56] is

$$\mathrm{Readability}_{i,j}=206.835-1.015\left(\frac{w_{i,j}}{s_{i,j}}\right)-84.6\left(\frac{sb_{i,j}}{w_{i,j}}\right), \quad (5)$$

where w_{i,j} is the total number of words, sb_{i,j} is the total number of syllables, and s_{i,j} is the total number of sentences of the jth review written by the ith reviewer.
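A minimal implementation of the Flesch reading ease test, using a naive vowel-run syllable counter in place of a dictionary-based one (so scores are approximate):

```python
import re

def count_syllables(word):
    """Naive syllable estimate: runs of vowels. A crude stand-in for a
    dictionary-based syllable counter."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

score = flesch_reading_ease("The food was good. We will come back.")
```

Short sentences of common one-syllable words score near the top of the scale, consistent with the claim that easy-to-read reviews score higher.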

c: CREDIBILITY
Credibility was determined on the basis of four indicators from the framework proposed by [57], namely capitalization, emoticons, shouting, and misspelling. As mentioned, appropriate use of capitalization represents a proper linguistic style, contributing to a sense of credibility. Regarding the emoticon indicator, the overuse of Western emoticons [e.g., :-) and :-D] reflects a less credible linguistic style [58], [59]. Writing in all capitals conveys a ''shouting'' tone, which is indicative of low credibility. Given that credible reviewers should be able to write with proper spelling, the more spelling errors were present in a text, the less credible we considered the text. Given a review, each kth indicator of credibility of the jth review written by the ith reviewer was estimated using the following pattern:

$$\mathrm{Credibility}_{i,j,k}=\frac{x_{i,j,k}}{z_{i,j}}, \quad (6)$$

where, for the capitalization indicator, x_{i,j,k} is the number of sentences beginning with a capitalized word and z_{i,j} is the total number of sentences. For the other indicators, x_{i,j,k} refers to the number of words belonging to the kth category and z_{i,j} represents the total word count of the jth review.
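A sketch of three of the count-based indicators (the misspelling indicator is omitted because it requires a spelling dictionary); the emoticon list is illustrative:

```python
import re

EMOTICONS = {":-)", ":-D", ":)", ":("}  # illustrative Western emoticons

def credibility_indicators(review):
    """Capitalization, emoticon, and shouting ratios for one review."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", review) if s.strip()]
    words = review.split()
    return {
        # share of sentences that begin with a capitalized word
        "capitalization": sum(s[0].isupper() for s in sentences) / len(sentences),
        # emoticons per word
        "emoticons": sum(w in EMOTICONS for w in words) / len(words),
        # all-caps ("shouting") words per word
        "shouting": sum(w.isupper() and len(w) > 1 for w in words) / len(words),
    }

ind = credibility_indicators("GREAT place :-) the food is fine.")
```

Under this scheme a credible review would show a high capitalization ratio and low emoticon and shouting ratios.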

d: POS TAGGING
The literature on computational linguistics demonstrates that the frequency distribution of POS tags in a text is often dependent on the genre of the text [60]. Thus, we computed a POS distribution feature, which was used in [10], to aid fake review detection. Four POS tagging categories were considered: verbs, adverbs, adjectives, and superlatives. The kth POS category of the jth review written by the ith reviewer was calculated as follows:

$$\mathrm{POS}_{i,j,k}=\frac{wc_{i,j,k}}{w_{i,j}}, \quad (7)$$

where wc_{i,j,k} is the number of words tagged with the kth POS category and w_{i,j} is the total word count of the review.

e: EVIDENTIALITY
Evidentiality, which is based on a hierarchy, forms a continuum from high to low. Various hierarchical schemes have been proposed. We employed the evidentiality categories proposed in [37] and classified them as representing high or low evidentiality (Table 1). The evidentiality score corresponding to the kth category of the jth review written by the ith reviewer is defined as

$$\mathrm{Evidentiality}_{i,j,k}=\frac{wc_{i,j,k}}{w_{i,j}}, \quad (8)$$

where wc_{i,j,k} is the number of words belonging to the kth category corresponding to each feature of the jth review written by the ith reviewer and w_{i,j} is the total word count of that review.
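Both the POS and evidentiality scores reduce to the same words-in-category over total-words ratio, which can be sketched as follows; the cue-word set is hypothetical, not the category list of [37]:

```python
def word_ratio(tokens, category_words):
    """Share of a review's tokens that fall in a feature word list --
    the wc/w pattern used for the POS and evidentiality scores."""
    if not tokens:
        return 0.0
    return sum(t in category_words for t in tokens) / len(tokens)

# Hypothetical high-evidentiality cue words, for illustration only
high_evidentiality = {"saw", "heard", "apparently", "reportedly"}
score = word_ratio(["we", "saw", "the", "chef", "cook"], high_evidentiality)
```

For POS, `category_words` would instead be the set of tokens carrying a given tag, obtained from a POS tagger.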

2) REVIEWER FEATURE EXTRACTION
Reviewer behavior analysis is essential to the detection of fake reviews and review spammers. We used the review gap, life tenure, review count, rating entropy, and rating deviation to determine review quality. Reviewer features were extracted as follows.

a: REVIEW GAP
The review gap was calculated as the time difference between the posting of two consecutive reviews written by a given reviewer. If a reviewer posts frequently, the review gap is extremely low. The equation for calculating the review gap (in days), presented in [50], is

$$\mathrm{Gap}_i=\frac{1}{n_i-1}\sum_{j=1}^{n_i-1}\left(t_{i,j+1}-t_{i,j}\right), \quad (9)$$

where Gap_i corresponds to the review gap of the ith user, n_i is the number of reviews written by the ith user, and t_{i,j} corresponds to the time stamp of the jth review posted by user i.
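A sketch consistent with the definition above, assuming time stamps expressed as day numbers:

```python
def review_gap(timestamps):
    """Mean time (in days) between consecutive reviews of one reviewer.
    timestamps: day numbers of the reviewer's posts, in any order."""
    ts = sorted(timestamps)
    if len(ts) < 2:
        return 0.0  # a single review has no gap
    return sum(b - a for a, b in zip(ts, ts[1:])) / (len(ts) - 1)
```

A burst of posts yields a gap near zero, the pattern flagged above as suspicious.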

b: LIFE TENURE
Life tenure was calculated on the basis of an equation proposed in [50]:

$$\mathrm{Tenure}_i=t_{i,n_i}-t_{i,0}, \quad (10)$$

where t_{i,0} is the time stamp of the first review written by the ith user and t_{i,n_i} is the time stamp of the last (n_i th) review written by that user.

c: RATING ENTROPY
Rating entropy was calculated on the basis of the entropy theory advanced by [43], in which the rating entropy of the ith reviewer is expressed as

$$\mathrm{Entropy}_i=-\sum_{g=1}^{k}p_{i,g}\log p_{i,g}, \quad (11)$$

where p_{i,g} is the probability of the ith reviewer giving a review score of g and k is the number of discrete rating scores that a reviewer can give.
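A direct implementation of this definition (a base-2 logarithm is used here; the choice of base only rescales the score):

```python
import math
from collections import Counter

def rating_entropy(ratings):
    """Shannon entropy of one reviewer's empirical rating distribution."""
    n = len(ratings)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(ratings).values())
```

A reviewer who always gives the same score has entropy 0, matching the extreme one-sided behavior attributed to spammers above.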

d: RATING DEVIATION
Following [61] and [50], we computed the mean absolute deviation of each reviewer from the average rating of all restaurants reviewed by that reviewer to aid fake reviewer detection:

$$\mathrm{Deviation}_i=\frac{1}{n_i}\sum_{j=1}^{n_i}\left|r_{i,j}-\mu_{h(j)}\right|, \quad (12)$$

where Deviation_i corresponds to the rating deviation of the ith reviewer, n_i is the number of reviews written by the ith reviewer, r_{i,j} is the rating score given by the ith user to restaurant h_j in their jth review, and μ_{h(j)} is the mean rating of that restaurant.
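A sketch of this computation, assuming a lookup table of each restaurant's mean rating:

```python
def rating_deviation(reviews, mean_rating):
    """Mean absolute deviation of a reviewer's scores from each
    restaurant's average rating.
    reviews: list of (restaurant_id, score) pairs for one reviewer
    mean_rating: dict mapping restaurant_id -> average rating"""
    return sum(abs(score - mean_rating[rid])
               for rid, score in reviews) / len(reviews)

dev = rating_deviation([("a", 5), ("b", 1)], {"a": 4.0, "b": 3.0})
```

High values flag reviewers who consistently rate against the consensus, the pattern [9] associates with spammers.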

3) HLR PROCEDURE
The linguistic and behavioral features were subjected to HLR as presented in [54]. This involved three crucial steps: 1) running the model without predictors (i.e., constructing an empty model), 2) running the model with level-1 and level-2 predictors (i.e., constructing an intermediate model), and 3) constructing a final model by adding cross-level interactions. The first step aims to confirm whether the data set has a nested structure. To calculate the intra-class correlation coefficient (ICC), an empty model (i.e., a model with no predictors) must be constructed. The ICC can be used to decompose the outcome variation into within-cluster and between-cluster variation [52], [54]. Furthermore, the ICC is a value between 0 and 1 that quantifies the proportion of between-cluster variation in the total outcome variation [62]. Depending on whether the ICC value falls within the [0, 0.059), [0.059, 0.138), or [0.138, 1] interval, the degree of between-group heterogeneity is categorized as low, moderate, or high, respectively [52]. Moderate or high heterogeneity indicates that a data set has a nested structure and is thus suitable for HLR application. The ICC was calculated using (13), where var(u_{0j}) is the random intercept variance and (π²/3) ≈ 3.29 is the level-1 variance component of the standard logistic distribution [54]:

$$\mathrm{ICC}=\frac{\mathrm{var}(u_{0j})}{\mathrm{var}(u_{0j})+\pi^{2}/3}. \quad (13)$$

After confirming that the data had a hierarchical structure, we calculated the correlation coefficients between the independent variables at levels 1 and 2. If the correlation between any two variables had a large coefficient, we calculated the variance inflation factor (VIF) to determine whether multicollinearity was present between these variables. The VIF is a measure of how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity [63].
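The ICC computation and the heterogeneity cut-offs from [52] can be expressed as:

```python
import math

def icc_logistic(var_u0):
    """ICC for a two-level logistic model: between-reviewer variance over
    total variance, with pi^2/3 as the level-1 residual variance."""
    level1 = math.pi ** 2 / 3  # approx. 3.29
    return var_u0 / (var_u0 + level1)

def heterogeneity(icc):
    """Cut-offs from [52]: low / moderate / high between-group variation."""
    return "low" if icc < 0.059 else "moderate" if icc < 0.138 else "high"
```

For example, a random-intercept variance of 1.0 yields an ICC of about 0.23, which falls in the high-heterogeneity band and would justify the HLR specification.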
Given that the effects of linguistic features are expected to depend on reviewer characteristics, we constructed the model with level-1 and level-2 predictors (corresponding to the reviews and reviewers, respectively) to estimate the variation in the effect of linguistic features on the odds of a fake review from one reviewer to another, as presented in (1) and (2). Finally, as shown in (3), a synthesized model was run with level-1 predictors, level-2 predictors, and cross-level interactions to obtain the final model.
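The structure of such a final model can be illustrated with a small sketch of the predicted probability; the coefficient names (g00, g10, g01, g11) are our own generic placeholders for the fixed effects, not the paper's notation:

```python
import math

def predicted_prob(x_ij, w_i, u_i, g00, g10, g01, g11):
    """Probability that review j of reviewer i is fake under a two-level
    logistic model: level-1 predictor x_ij (linguistic), level-2 predictor
    w_i (behavioral), a cross-level interaction x_ij * w_i, and the
    reviewer-specific random intercept u_i."""
    logit = g00 + g10 * x_ij + g01 * w_i + g11 * x_ij * w_i + u_i
    return 1 / (1 + math.exp(-logit))
```

The interaction term g11 * x_ij * w_i is what lets the slope of a linguistic feature vary with a reviewer-level characteristic, which is exactly the dependence the paragraph above describes.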

IV. RECENCY ANALYSIS
For experimentation, we used the data set collected by [46]. Yelp uses a proprietary algorithm to filter out fake and suspicious reviews, which it displays in a separate list; reviews it considers genuine are featured as recommended reviews. Yelp's filter was reported to be highly accurate in [64]. For these reasons, we believe that the labeling of the Yelp data set is reliable and suitable for our purposes.
In this study, two experiments were conducted to evaluate the proposed model, namely recency and duration analyses. This section presents the former.

A. DESCRIPTIVE ANALYSIS
In the recency analysis, the original Yelp data set was divided into three sub-data sets, designated sub-data sets A, B, and C, containing each reviewer's 5, 30, and 50 most recent reviews, respectively. This process enabled accurate estimation of the regression coefficients [65], [66]. Table 2 displays the descriptive statistics of the sub-data sets. The first quartile of the mean number of words per reviewer (designated Q1) was defined as the middle value between the minimum and the median, whereas the third quartile (designated Q3) was the middle value between the median and the maximum. The obtained Q1 and Q3 values indicate slightly right-skewed distributions of the average number of words per review. The discrepancy between Q1 and Q3 widened as the number of reviews increased, suggesting that the more reviews a reviewer writes, the more information they wish to convey to other customers.
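The quartile definition used here (Q1 as the middle value between the minimum and the median, Q3 as the middle value between the median and the maximum) can be sketched as follows; this is our own illustrative code:

```python
def median(xs):
    """Middle value of a list (mean of the two middle values if even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def quartiles(xs):
    """Q1 = median of the lower half, Q3 = median of the upper half,
    excluding the overall median itself when the count is odd."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half + 1:] if n % 2 else s[half:]
    return median(lower), median(s), median(upper)
```

With this definition, Q3 lying much farther from the median than Q1 is the right-skew signal the descriptive statistics in Table 2 point to.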

B. FAKE REVIEWS DETECTION
As mentioned in Section III, HLR was suitable for application to the three sub-data sets because of their high heterogeneity: sub-data sets A, B, and C had ICC values of 0.16, 0.28, and 0.31, respectively. An examination of the correlation coefficients and VIFs revealed no collinearity between variables in any sub-data set; as shown in Table 8 of the Appendix, all VIFs were smaller than 5. Next, the ORs of all predictors were calculated. When the OR of a feature was greater than 1, the greater the value, the more likely the review was to be classified as fake; conversely, when the OR of a feature was less than 1, the smaller the value, the less likely the review was to be classified as fake. Due to space limitations, the ORs and p values for all features and sub-data sets are presented in Table 9 of the Appendix. The influence of each variable on the dependent variable differed between sub-data sets; however, these differences were not large.
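The OR interpretation above follows directly from exponentiating a logistic regression coefficient; a brief illustrative sketch (ours, not the authors' code):

```python
import math

def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor whose logistic
    regression coefficient is beta."""
    return math.exp(beta)

def effect_on_fake_odds(or_value):
    """Direction of a feature's effect on the odds that a review is fake."""
    if or_value > 1:
        return "increase"
    if or_value < 1:
        return "decrease"
    return "none"
```

An OR of exactly 1 (coefficient 0) means the feature leaves the odds unchanged; deviations in either direction scale the odds multiplicatively per unit of the predictor.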
For demonstration, Fig. 2 and Fig. 3 present the ORs of features with positive and negative effects for sub-data set A; significant features are marked with an asterisk. Notably, when the rating entropy increased by one unit, the odds of the review being classified as fake increased by a factor of 6.132. The rating deviation exhibited a similar trend, with an OR of 2.055. These results are consistent with those of [50], which also indicated that higher rating entropy and rating deviation values are more likely to signal fake reviews. [2] empirically discovered that suspicious reviews tended to be more extreme than normal ones. To avoid being detected and blocked by an online platform, a review spammer may maintain multiple online accounts, which corresponds to shorter life tenure, a longer review gap, and a lower review count; these inferences are in line with the results shown in both figures. According to the findings of [23], a fake review may involve more cognitive processes and be more affective than a genuine review. Review spammers usually use simple vocabulary and shorter words to enhance the readability of their reviews. This finding is also consistent with [15], which credited the adoption of straightforward expressions for the higher readability of fake reviews, which attracts more review readers. Thus, fake reviews typically contain fewer words longer than six letters. Furthermore, they may contain improperly capitalized words and more misspellings, which indicate lower credibility. A less credible review may have a lower useful count (i.e., the total useful feedback a review receives from readers).
To evaluate the detection performance of the proposed HLR model, well-recognized machine learning algorithms, namely NB, SVM, RF, and KNN, were chosen as benchmarks. They were implemented with the Scikit-learn machine learning library, adopting the default hyperparameter values for simplicity, as listed in Table 3. The performance evaluation was conducted with five-fold cross-validation in terms of accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC), as shown in Fig. 4. The HLR model had the most favorable performance, which is attributable to its consideration of the nested structure of the sub-data sets. As the amount of data per group increased, detection performance declined; this can be explained by the greater variation in the features extracted from reviews as the number of reviews rises.
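A minimal sketch of how five-fold cross-validation partitions the data (our own stand-in for the Scikit-learn utilities the study relied on):

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint folds.
    Each fold serves once as the test set while the remaining k-1
    folds form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

In each of the five rounds, one fold is held out for evaluation and the metrics (accuracy, precision, recall, F1, AUC) are averaged across rounds.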

C. REVIEW SPAMMERS DETECTION
Following [67], [68], we used the percentage of fake reviews to identify review spammers, setting a threshold in our experiment. When the percentage of fake reviews exceeded the threshold, a given reviewer was considered a review spammer.
We set the threshold to range from 10% to 90%. Through the detection of fake reviews, we calculated the fake review rate of each reviewer; if this rate exceeded the corresponding threshold, the reviewer was classified as a review spammer. Table 4 displays the performance indicators of the HLR and LR models. On most performance measures, HLR yielded more favorable detection results; performance improved as the threshold rose from 10% to 50% and declined beyond 50%. The optimal results were observed at a threshold of 50%; thus, this threshold was used in the performance comparison.
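The thresholding rule described above can be sketched as follows (an illustrative snippet; the function name and label encoding are our own):

```python
def is_spammer(review_labels, threshold=0.5):
    """Flag a reviewer as a review spammer when the share of their
    reviews predicted to be fake (label 1) exceeds the threshold."""
    fake_rate = sum(review_labels) / len(review_labels)
    return fake_rate > threshold
```

With the optimal 50% threshold, a reviewer with three of four reviews predicted fake is flagged, while one with two of four is not.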
As shown in Fig. 5, the HLR model obtained the most favorable detection outcomes. In all models, detection accuracy increased with the number of reviews, likely because the more of a reviewer's reviews are included, the more accurately fake reviews can be distinguished from genuine ones. In most of the detection results, the numerical differences between the HLR model and the other models exceeded 20%. Overall, the results of both fake review detection and review spammer detection demonstrate that the HLR model is more suitable when the review data are hierarchical.

V. DURATION ANALYSIS
The purpose of this experiment was to explore the impact of different time intervals on fake review prediction. We performed duration analysis to identify notable observations occurring over 1, 3, 6, 12, 36, and 72 months.

A. DESCRIPTIVE STATISTICAL ANALYSIS
For the duration analysis, the Yelp data set was decomposed into six sub-data sets corresponding to the aforementioned six durations (designated sub-data sets 1-6). Table 5 displays the descriptive statistics of the sub-data sets. The difference between Q1 and Q3 indicates that the longer the duration, the more information reviewers wished to convey through their reviews.

B. FAKE REVIEWS DETECTION
We calculated ICC values to determine whether each sub-data set was suited for HLR application. The ICCs of sub-data sets 1-6 were 0.16, 0.22, 0.25, 0.29, 0.36, and 0.39, respectively. These values are considered sufficiently high for HLR application because they all exceed 0.138. An examination of the correlation coefficients and VIFs revealed no collinearity between variables in any sub-data set; as shown in Table 10 of the Appendix, all VIFs were smaller than 5. The ORs of all predictors were calculated and are presented in Table 6, with significant features marked with asterisks. Similar to the results of the recency analysis, fake reviews had larger rating entropy, greater rating deviation, and a longer review gap but a lower word count and useful count, and they exhibited improper use of capitalization. Furthermore, fake reviews were associated with shorter life tenure. The duration analysis revealed that affective processes and verbs positively affected fake review detection. Specifically, consistent with the findings of other studies, we determined that most spammers write imaginative reviews containing more pronouns, adverbs, and verbs, whereas genuine reviewers write informative reviews containing more adjectives and nouns [69], [70]. In line with the results presented in [71], our findings confirm that affective processes are a useful LIWC factor and contribute substantially to fake review detection. In addition, we validated our results against the work of [72], which also adopted the labeled data set collected by [46]; its finding that extreme rating entropy and rating deviation are signs of fake reviews is consistent with ours.
Finally, we evaluated the detection performance of the proposed model and compared it with that of other machine learning algorithms. Consistent with the recency analysis findings, the HLR model outperformed the LR, SVM, RF, NB, and KNN models. Fig. 6 shows that in most of the detection results, the numerical differences between the HLR model and the other models exceeded 10%. Moreover, for all models, detection accuracy decreased as the duration lengthened, owing to the increased variation within each sub-data set. Overall, the results of both fake review detection and review spammer detection demonstrate that the HLR model is more suitable when the review data are hierarchical.

C. REVIEW SPAMMERS DETECTION
As in the recency analysis, we set the review spammer detection threshold to range from 10% to 90%. Table 7 presents a comparison of the detection performance of the HLR and LR models. The optimal results were achieved under a threshold of 50% and a period of 72 months (6 years). All performance indexes exceeded %. Notably, the recall reached 99%, indicating that almost all actual review spammers were successfully identified.
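Because recall and precision are easily conflated, a small sketch of both metrics (our own illustration) may help: recall is the share of actual spammers that are detected, whereas precision is the share of flagged reviewers who are actual spammers:

```python
def precision_recall(y_true, y_pred):
    """Precision: fraction of predicted spammers who are real spammers.
    Recall: fraction of real spammers who were detected.
    Labels are 1 (spammer) / 0 (genuine)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A 99% recall therefore means only about 1% of actual spammers escaped detection.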
The threshold values used for the performance comparison are shown in Fig. 7, for which HLR achieved the most favorable detection outcomes. As in the recency analysis, detection performance improved as the duration lengthened. Overall, the results confirm the premise that the more of a given reviewer's reviews are included, the more accurately it can be determined whether that reviewer is a review spammer. In sum, both the recency and duration analyses revealed that HLR is the most suitable approach for handling the nesting present in most review data.

VI. CONCLUSION AND FUTURE DIRECTIONS
Because of the vast quantity of information available on online platforms, identifying credible reviews and reviewers, whose opinions consumers consider when making purchasing decisions, is essential. Studies on fake review detection have ignored the nested association between reviews and reviewers; this can undermine detection performance. To overcome this major shortcoming, we employed HLR to determine how much variance is ascribable to a reviewer and to each individual review on the basis of their nested connection. Recency and duration analyses were performed to investigate the role of linguistic and behavioral features in fake review detection and review spammer detection by considering the number of reviews and time stamps, respectively. Subsequently, a comparison of the detection performance of the HLR model and LR, SVM, RF, NB, and KNN models was undertaken.
The HLR model had the most favorable performance overall, demonstrating that the hierarchical review-reviewer relationship contributes crucially to detection accuracy and thus must not be disregarded. Fake reviews tended to have greater rating and scoring deviations and to contain more emoticons and leisure-related words. Moreover, fake reviews involved more cognitive processes and corresponded to a longer review gap, shorter life tenure, a lower word count, a lower review count, fewer words longer than six letters, and more adverbs and pronouns. In addition, they were less likely to feature correct capitalization and had lower sociality and usefulness. In the recency analysis, detection accuracy was maximized when the model was applied to sub-data set A (i.e., the sub-data set containing reviewers' five most recent reviews). In the duration analysis, detection accuracy was highest when the HLR model was applied to one month of each reviewer's reviews. The results suggest that as many of a given reviewer's reviews as possible should be examined to optimize review spammer detection. Detection accuracy was greater in the duration analysis than in the recency analysis, indicating that the duration of past reviews affects detection performance.

A. THEORETICAL AND MANAGERIAL CONTRIBUTIONS
The main theoretical contribution of this study is that it proposes a new approach for detecting fake reviews that considers both the linguistic and behavioral aspects of review data. The proposed model outperformed various machine learning techniques because of its ability to handle nested data. Notably, it can also be applied to the identification of review spammers as a quality pre-module. Furthermore, we determined which linguistic style and behavioral features are important in detecting fake reviews. The findings serve as a reference for scholars and stakeholders alike in understanding which features of review language and reviewer behavior most strongly affect review quality.
The present study has valuable managerial implications; the identification of fake reviews and reviewers from stylistic and behavioral perspectives enables potential consumers to avoid untrustworthy information and find genuine reviews on which to base their purchasing decisions. Specifically, our model can help manufacturers and retailers recognize review spammers who spread deceptive information and issue warnings in opinion-sharing communities accordingly. Moreover, companies can enlist the assistance of credible reviewers to support marketing campaigns. For example, during the life cycle of a product, manufacturers should encourage genuine reviewers to share their positive experiences with that product in advertisements. This type of campaign can persuade consumers to purchase the product, thus increasing sales. In addition, firms should pay attention to genuine negative reviews, such as suggestions for improving their products or services. Notably, our approach can be used on all types of websites.

B. LIMITATIONS AND FUTURE DIRECTIONS
This study has some limitations. First, the data must be hierarchical in nature, containing information on both reviews and reviewers. Second, to identify the stable characteristics of reviewers, we examined review count as a behavioral feature and thus did not consider sentiment (because content words may vary substantially between topics). Third, we did not take into account other features, such as the social networks and relationships (e.g., friendships) among platform users. The inclusion of such characteristics in future studies is expected to increase detection accuracy. Finally, we did not consider the emerging problem of spammer groups; this should be investigated in future studies.

APPENDIX
See Tables 8-10.

THI-KIM-HIEN LE received the M.S. degree in management information systems from the Ho Chi Minh City University of Technology, Vietnam National University, Ho Chi Minh City, in 2014. She is currently pursuing the Ph.D. degree with the Institute of Information Management, National Cheng Kung University, Taiwan. She is also an Instructor with the School of Information Management, University of Economics and Law, Vietnam National University. Her research interests include business intelligence, data mining, text mining, and human-computer interaction.

YI-ZHEN LI received the M.S. degree in information management from the National Cheng Kung University, Taiwan, in 2020. She specializes in artificial intelligence and text mining. She is currently a Software Engineer, and her work focuses on developing software for semiconductor manufacturing.

SHENG-TUN LI received the Ph.D. degree in computer science from the University of Houston, University Park, TX, USA, in 1995. He is currently a Distinguished Professor with the Department of Industrial and Information Management and the Institute of Information Management, National Cheng Kung University, Taiwan. He is the author/coauthor of ten IT-related textbooks, including two translated, over 80 journal articles, and numerous conference papers. His work has appeared in Information & Management, Omega-The International Journal of Management Science, IEEE TRANSACTIONS ON FUZZY SYSTEMS, IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS-PART B: CYBERNETICS, Fuzzy Sets and Systems, Information Sciences, Journal of Information Science, and Technovation. He is also the holder of one IT-related patent. His research interests include artificial intelligence, business intelligence, data mining, and text mining.