Context-Aware User and Item Representations Based on Unsupervised Context Extraction From Reviews

User reviews often supply valuable information to alleviate the rating sparsity problem that can occur in recommender systems. Recent work has employed deep learning techniques to learn user and item representations from reviews, which are then used to predict ratings. Such representations are usually learned by considering every word in previous reviews, including words that are irrelevant to user preferences or item features. Some approaches try to identify and extract significant words from reviews based on a predefined list of contexts, where contexts such as the season or weather could have strong influences on user decisions about items, and which are more relevant to their preferences or sought-after features. Specifying optimal values for contexts, however, is not a trivial task and the values are mostly restricted to a single word format. To overcome these limitations, we propose a novel unsupervised method for extracting contexts from reviews, which are then utilized to construct user and item representations. To this end, we adopt a region embedding technique to automatically extract a context as any word in a text region that influences patterns of rating distributions in reviews. Instead of considering every word in all previous reviews, our user and item representations are dynamically constructed based on different relevance levels among the extracted contexts from a particular review by applying our interaction and attention modules. Experiments demonstrated that utilizing our representations for rating prediction could outperform existing state-of-the-art context-aware and review-based recommendation techniques.


I. INTRODUCTION
User-generated reviews can supply valuable information to alleviate the rating sparsity problem that can occur in the standard collaborative filtering-based (CF-based) recommendation approach, which utilizes rating data alone [1]- [3]. In particular, recent work has employed deep learning techniques and attention mechanisms to learn representations of users and items from reviews and use them for rating prediction [4]- [7]. Although these models utilize different types of networks to learn such representations, they share two similar principles that could limit their potential.
The associate editor coordinating the review of this manuscript and approving it for publication was Vicente Alarcon-Aquino.
First, they consider every word in a review as an input when learning user and item representations. Given that some words are not relevant either to user preferences or item features, such words should not be given any weight when modeling their representations. For example, in hotel recommendations, words such as ''clean'' or ''breakfast'' are more relevant to a user's preferences toward a hotel's features than words such as ''he'' or ''run''. If we can identify and utilize only the words relevant to a specific recommendation domain, the user and item representations could be constructed in a more efficient and meaningful way.
Moreover, the user and item representations are constructed in a static manner by aggregating their relevant previous reviews. This means that each user or item has one fixed representation for all reviews. We believe that, to predict a rating for a particular review, with the aim of modeling a user's preferences and an item's features for application to the user's current situation, it is more important to concentrate on and leverage the more relevant information embedded in that review. For example, the review in Fig. 1 mentions that the room offers breathtaking views of a city at night. To generate user and item representations for predicting a rating for this review, it would be beneficial to know how much the user prefers, and how much the hotel's rooms are well known for, its city views at night. That is, our assumption is that the user and item representations should be dynamically constructed for each particular review, to capture the interactions between user preferences (or item features) and the relevant information in that review.

Whereas the traditional CF-based approach considers rating data alone, context-aware recommendations improve the effectiveness of the recommendations by taking into account additional contextual information [8]. A piece of contextual information (or simply a ''context'') such as location, time, or weather can have a major influence on users' decisions when they are choosing items. For example, the type of clothes users would buy depends strongly on the season (e.g. winter or summer).
Many context-aware methods have been able to achieve improved prediction accuracy, compared to standard CF-based approaches [8]- [11]. However, two significant challenges remain. First, obtaining contexts is not a trivial task because they are rarely provided directly by users. Many context-aware datasets collect contextual information by predefining a list of contextual variables and possible corresponding values, as shown by the examples in Table 1, and asking users to select appropriate values for the contextual variables at the time they rate the items. To deliver the best possible predictive performance, this approach requires the expensive process of carefully predefining optimal values for the contextual variables, each of which tends to be associated with relatively few recommendation domains. Moreover, incorporating too many contexts tends to increase the dimensionality of the data, thereby triggering a sparsity issue. After obtaining the contexts, the second challenge is to identify and utilize only the contexts that are relevant to a specific recommendation task. Several methods define a relevant context as one that has a significant influence on the distribution of ratings [9], [18], [19]. Figure 2, for example, visualizes the rating distributions for two contextual variables, Companion and Season. Each cell contains the frequency of each contextual value for each rating value, and its grayscale shading emphasizes the density of the frequency. Note that each contextual value for Companion influences the distributions of ratings differently, whereas there is no significant difference in the distributions for Season. We can therefore hypothesize that Companion is a relevant context and Season is an irrelevant context for this data. In practice, most context-aware methods [18], [19] identify the relevance of contexts by applying statistical tests such as the paired t test to each contextual variable. 
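The relevance-identification idea above can be illustrated with a small sketch. The paper mentions statistical tests such as the paired t-test; the snippet below instead uses a chi-square-style statistic over a contextual-value × rating frequency table like the one in Fig. 2 (the function name, the example tables, and the choice of statistic are our assumptions for illustration, not the authors' exact procedure):

```python
import numpy as np

def chi_square_stat(freq):
    """Chi-square statistic for a (context value x rating value) frequency table.

    freq[i][j] = number of ratings with value j given under context value i.
    A large statistic suggests the contextual variable influences the rating
    distribution (relevant); a near-zero one suggests it does not (irrelevant).
    """
    freq = np.asarray(freq, dtype=float)
    row = freq.sum(axis=1, keepdims=True)          # totals per context value
    col = freq.sum(axis=0, keepdims=True)          # totals per rating value
    expected = row * col / freq.sum()              # counts if independent
    return float(((freq - expected) ** 2 / expected).sum())

# Hypothetical "Companion": distributions differ sharply across values.
companion = [[30, 5, 5], [5, 5, 30]]
# Hypothetical "Season": nearly identical distributions across values.
season = [[20, 15, 15], [19, 16, 15]]

print(chi_square_stat(companion) > chi_square_stat(season))  # True
```

Ranking contextual variables by such a statistic reproduces the Companion-vs-Season conclusion drawn from Fig. 2.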
However, this approach is only applicable when using predefined types of context (such as those presented in Table 1).
When writing reviews, users can express opinions describing their experiences and their satisfaction with items, which can be a valuable source of contexts [1]. As shown in Fig. 1, for example, underlined words such as ''summer'', ''family'', or ''night'' can be considered as contexts embedded in a review. However, extracting such contexts from reviews is not a trivial task. Most methods rely on text mining techniques to extract a context as a word that matches one of the predefined contextual values [12]- [14], which involves the same challenge of predefining optimal values for each contextual variable. To avoid this, several approaches have utilized unsupervised techniques for extracting contexts from reviews automatically [20]- [22]. By not having to predefine values for contexts, they hence become applicable to a wide variety of recommendation domains. However, these unsupervised methods lack a procedure for identifying the relevant contexts for each specific domain. Moreover, contexts extracted from most such methods have been restricted to a single word format, e.g. ''family'' or ''breakfast''. In fact, the precise meaning of a context might require more than one word for its expression, such as ''family trip'' or ''continental breakfast.'' In this work, we argue that effectively extracting and utilizing contexts in reviews could help overcome the challenges of obtaining and identifying relevant contexts in context-aware methods, in addition to the limitations of modeling user and item representations from reviews via deep learning techniques.
In our previous work [23], we proposed a novel unsupervised method for defining and extracting contexts from reviews. A relevant context in our work is defined not by a single word alone but by the word plus those of its neighboring words that influence the distributions of ratings. We develop a region embedding technique [24] to emphasize the words in a small text region for consideration as a context, and represent it by region embedding to be used for a rating prediction. By not having to predefine contexts, our method can be used to extract relevant contexts from reviews in any recommendation domain.
In this work, we propose an extension to the model developed for our context extraction method [23], namely context-aware region embedding (CARE). CARE derives region embedding representations for the extracted contexts that are output from our previous context extraction procedure. These are then input to our proposed rating prediction procedure, which contains two neural network modules for interaction and attention. Our interaction module first models the relevance of each context in a particular review based on its past interaction with an individual user's preferences and item's features. Our attention module then generates the user and item representations based on different relevance levels among contexts in each review. These two modules enable our model to dynamically construct unique user and item representations for each specific review, rather than have one static representation for all reviews, as found in most deep-learning-based methods. Finally, our user and item representations are used to predict ratings by exploiting a latent factor model [25].
We have evaluated the predictive performance of our CARE model on three well-known review datasets. Our experiments demonstrate that CARE outperforms existing state-of-the-art rating prediction methods that include both review-based and context-aware recommendation techniques [4], [7], [21], [22], [26], [27]. In addition, we provide an extensive discussion of the context extraction and rating prediction performance of various aspects of CARE in a range of situations.
The main contributions of this work can be summarized as follows.
• We propose an unsupervised context extraction technique that is capable of extracting relevant contexts from review datasets in any recommendation domain.
• A relevant context in our work is defined not only as a single word but also by including its adjacent and/or nonadjacent words that influence the distributions of ratings.
• Finally, our user and item representations are dynamically constructed for each particular review, to capture the different relevance levels of its extracted contexts to the individual user's preferences and item's features.

The remainder of this paper is organized as follows. We first describe the related work in Section II. Our proposed CARE, including the context extraction and rating prediction methods, is presented in Section III. We provide details of our experiments in Section IV and continue the discussion in Section V. Finally, we conclude our work in Section VI.

II. RELATED WORK

A. COLLABORATIVE FILTERING
By utilizing previous rating data, a CF-based approach creates a predictive model for estimating the ratings of unseen items for users. One effective approach in CF-based methods is the matrix factorization technique [25], a latent factor model that models a rating as an interaction between the latent features of user and item. Specifically, each user u_i and each item v_j are respectively associated with user and item latent-feature vectors x_{u_i} and x_{v_j} ∈ R^k, where k is the number of latent dimensions. The rating of user u_i for item v_j is then estimated by

r̂_{i,j} = x_{u_i}^T x_{v_j} + b_{u_i} + b_{v_j} + μ, (1)

where b_{u_i}, b_{v_j}, and μ ∈ R respectively denote the bias for user u_i, the bias for item v_j, and the global bias. The model parameters are learned and optimized by minimizing the regularized square error through a loss function defined by

L = Σ_{(i,j)∈O} (r̂_{i,j} − r_{i,j})² + λ(‖x_{u_i}‖² + ‖x_{v_j}‖² + b_{u_i}² + b_{v_j}²), (2)

where O denotes the set of observed user-item rating pairs, r_{i,j} is the observed rating score of user u_i toward item v_j, and λ is a constant controlling the regularization rate. (The regularization term is added to avoid overfitting the observed rating data.) The parameters x_{u_i} and x_{v_j} denote the representations of user u_i and item v_j, respectively, in the latent space. Because they are learned and optimized from observed ratings, their representation quality depends on the quantity of available historical ratings. For most rating datasets, however, users typically rate only a few items among all the available items in the system, which leads to a rating sparsity problem. Such a problem will directly affect the quality of the user and item representations.
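The biased matrix-factorization prediction and regularized squared-error loss described above can be sketched as follows (a minimal NumPy illustration with our own variable names; gradient-based optimization is omitted):

```python
import numpy as np

def predict(x_u, x_v, b_u, b_v, mu):
    """Biased MF prediction: dot product of latent vectors plus biases."""
    return float(x_u @ x_v + b_u + b_v + mu)

def loss(observed, X_u, X_v, b_u, b_v, mu, lam):
    """Regularized squared error over observed (user, item, rating) triples.

    observed: iterable of (i, j, r) with observed rating r of user i on item j.
    lam:      regularization rate controlling overfitting on observed ratings.
    """
    total = 0.0
    for i, j, r in observed:
        err = predict(X_u[i], X_v[j], b_u[i], b_v[j], mu) - r
        total += err * err
        total += lam * (X_u[i] @ X_u[i] + X_v[j] @ X_v[j]
                        + b_u[i] ** 2 + b_v[j] ** 2)
    return total
```

In practice the parameters X_u, X_v, b_u, b_v would be learned by minimizing this loss with stochastic gradient descent or alternating least squares.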

B. REVIEW-BASED RECOMMENDATIONS
To alleviate the rating sparsity problem, review-based approaches aim to learn user and item representations by leveraging the content of review data, in addition to the simple rating data [1]. The idea is that reviews can offer richer and more meaningful information than numeric ratings for constructing user and item representations. For example, consider the review in Fig. 1. Even if this user only provided one review, we could still mine very useful information from that review, such as having traveled in summer with family members, liking city views at night, and being concerned about the cleanliness of the room. This means that, if we can effectively extract such useful information from this review, we might be able to successfully model the user's preferences and use them to construct a high-quality user representation.
Many deep-learning techniques have been adopted to model user and item representations from reviews [4]- [7]. For example, DeepCoNN [4] applied a convolutional neural network (CNN) [28] to learn such representations, which were then used to predict ratings based on the latent factor model. In an extension to the CNN, NARRE [7] applied an attention mechanism [29] to construct representations by considering different contributions from reviews based on their usefulness. Despite variations, these techniques share a common network framework for constructing representations, which is shown schematically in Fig. 3. To learn a representation for user u_i, this technique first creates a user document for u_i by concatenating all the user's previous reviews. Each of the M words in u_i's document is then looked up and mapped to its word embedding, which can be initialized randomly or by utilizing a pretrained word embedding such as Word2Vec [30], GloVe [31], or BERT [32]. These word embeddings are then fed into the neural network components to learn x_{u_i} as a representation of u_i. Note that an item representation can be constructed in the same way as a user representation. The output of such a framework is a static representation for every user and item in the training data. As mentioned in Section I, constructing a representation using this framework has two limitations. First, words that are not related to a user's preferences or an item's features are (but should not be) used in constructing the representations.
Second, to predict a rating for a review y_{i,j}, the representations of u_i and v_j are not (but should be) dynamically constructed by emphasizing the relevant information in the review, which explains the reason behind its rating, rather than relying on information from other reviews.
Extracting and exploiting contexts from reviews could be the key to overcoming these two limitations. Context words such as ''friend'' or ''summer'' are often related to a user's preferences and an item's features, and are therefore appropriate for use in constructing representations. Moreover, contexts help characterize the situation within which the rating was being given, which is unique and specific for each review. This means that user and item representations can be constructed dynamically by considering the contexts embedded in each particular review.

C. CONTEXT-AWARE RECOMMENDATIONS BASED ON USER REVIEWS
There are two main approaches to extracting contexts from reviews, namely supervised and unsupervised approaches. A supervised approach extracts contexts based on a predefined list of contextual variables and their corresponding values [12]- [17]. Using the predefined contexts in Table 1, words such as ''summer'', ''family'', or ''night'' could be extracted as contexts from the review in Fig. 1. However, non-predefined words in Fig. 1 such as ''clean'', ''free-wifi'', or ''breakfast'', which could potentially be considered as contexts, are overlooked. For a supervised approach to be robust, therefore, it will require the contextual variables and their corresponding values to be predefined optimally for each specific recommendation domain.
In contrast, an unsupervised approach aims to infer contexts from reviews without having to predefine them [20]- [22]. Some of these approaches [20], [21] classify reviews into context-rich and context-free reviews, based on features of each review such as the number of words, verbs, and verbs in the past tense. The contexts are then extracted as those words or topics that occur more often in the context-rich reviews. These two methods, however, require manual annotation of the review data (as context or noncontext) as part of the training process. Recently, CARL [22] has applied CNN and word-level attention to semantically infer contexts from reviews. Its user and item representations are constructed by modeling the attention weight of each word as its influence in each context on a user-item pair. This method was, however, presented using the framework shown in Fig. 3, which means that it suffers from the limitations of utilizing irrelevant words and constructing only static representations.
Most context extraction methods [12]- [17], [20] define and extract a context in the form of a single word such as those shown in Table 1. However, when users write reviews, they have flexibility in how their contexts are presented, including using phrases in addition to single words. For example, some contexts from the review in Fig. 1 might be best extracted as ''family trip'', ''night city view'', or ''friendly staff'', which are more meaningful than just ''family'', ''night'', or ''friendly.'' We believe that other words that often accompany (or are present in the same text region as) context words might help in capturing the appropriate meaning of contexts, and should therefore also be extracted to represent contexts accurately.
In our previous work [23], we proposed an unsupervised method for extracting contexts from reviews by utilizing a technique called ''region embedding'', which had been introduced previously [24]. Our model has the ability to identify and construct a relevant context not only in single word format but also by including adjacent or nonadjacent words that occupy the same text region and that influence the distributions of ratings. In this paper, we aim to extend the model to include an efficient method for utilizing the extracted contexts when constructing user and item representations and using them to make rating predictions.

III. PROPOSED METHOD
In this section, we provide a detailed explanation of CARE, including our proposed method for learning user and item representations, together with our previous context extraction method. Figure 4 presents an overview of the workflow of CARE, which comprises two main parts, namely context extraction and rating prediction. In context extraction, we first identify a list of candidates to be considered as context words, extract their associated text regions, and generate their rating distributions. The text regions and their rating distributions are used to learn the region embeddings, which are then utilized for rating prediction. In rating prediction, we first construct dynamically the user and item representations from the region embeddings via our proposed interaction and attention modules. The interaction module models the relevance of each context to the individual user and item by its past interaction with the user preferences and the item features. The attention module then generates user and item representations based on the different relevance levels among contexts extracted from a particular review. Finally, these representations are used in a latent factor model that predicts the rating.

A. CONTEXT EXTRACTION
We first present our procedure for extracting contexts from reviews, which comprises three main steps, namely identifying candidate context words, extracting contextual regions, and learning the region embeddings.

1) IDENTIFYING CANDIDATE CONTEXT WORDS
We adopt the definition used in Odić et al. [19], which defines relevant contexts as those that contribute to explaining the variance in ratings. By applying this definition to review data, a context can be considered as any word in the reviews that influences the distribution of ratings. For example, Fig. 5 presents a word-rating co-occurrence matrix, which gives the word frequency for each rating value. From this figure, words such as ''clean'' or ''good'' are mentioned frequently in reviews with more positive rating scores such as ''4'' and ''5'', whereas ''dirty'' or ''not'' were mentioned more frequently for more negative scores such as ''1'' or ''2''. The implication is that these words influence the distribution of ratings, and could therefore be considered as ''candidates'' for contexts. To measure the significance of the influence of a word on the distribution of ratings, we first compute the variance of its rating-frequency distribution, as also shown by the example in Fig. 5. After computing variances for every word in the review corpus, only those words having variances above our predefined minimum-variance threshold min_var are selected as candidate context words and stored in the candidate list Cand. Note that, in addition to direct context words (e.g. ''clean''), Cand also includes opinion or sentiment words such as ''good'' or ''not'' if they also have significant influence on the distributions of ratings.
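The candidate-selection step just described can be sketched as follows. This is a minimal hypothetical implementation: the function name, the (words, rating) input format, and the fixed 1-5 rating scale are our assumptions.

```python
from collections import defaultdict

def candidate_context_words(reviews, min_var):
    """Select words whose rating-frequency distribution varies above min_var.

    reviews: list of (list_of_words, rating) pairs; ratings in 1..5.
    Returns the candidate list Cand.
    """
    # Build the word-rating co-occurrence counts (rows of Fig. 5).
    freq = defaultdict(lambda: [0] * 5)
    for words, rating in reviews:
        for w in set(words):                  # count each word once per review
            freq[w][rating - 1] += 1
    cand = []
    for w, counts in freq.items():
        mean = sum(counts) / 5
        var = sum((c - mean) ** 2 for c in counts) / 5
        if var > min_var:                     # influences the rating distribution
            cand.append(w)
    return cand
```

Here a word appearing only at rating ''5'' (e.g. ''clean'') yields a high-variance row and is kept, while a word spread evenly across ratings is discarded.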

2) EXTRACTING CONTEXTUAL REGIONS
Depending solely on the candidate context words might not be sufficient to cover the variety of influences of contexts on the distributions of review ratings. This is because some words that often accompany candidate context words (neighboring words) might significantly alter the ways they influence the distributions of ratings. For example, as shown in Fig. 5, the word ''service'' has mixed frequencies for rating scores ''2'', ''3'', and ''4'', indicating a neutral distribution toward middle-rank ratings. However, if we consider its co-occurrence with the word ''good'' (i.e., ''good service''), the rating distribution could change from neutral to strongly positive, whereas ''worst service'' could result in a strongly negative distribution, as shown in Fig. 6. This means that neighboring words might be opinion, sentiment, or other words that could change the semantic meaning of a candidate context word, and therefore influence its rating distribution. This emphasizes the importance of considering neighboring words in addition to the candidate context words if the modeling of the influence of contexts is to be effective. Consider a candidate context word c_n ∈ Cand. We define the neighboring words of c_n as any word w_t that occupies the same ''text region'' as c_n. More specifically, we consider w_t ∈ region(c_n, d), where d is the window size for a region of length 2×d+1. We call region(c_n, d) a contextual region of c_n. Note that w_t can be in any position within region(c_n, d), not necessarily directly adjacent to c_n. This takes account of the different writing styles users may adopt to express the same meaning in reviews. For example, ''view of a city at night'' and ''night view of a city'' are simply alternative expressions of the same context.
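Extracting region(c_n, d) amounts to a sliding window centered on each candidate context word. The sketch below clips regions at review boundaries, which is our assumption rather than a detail stated in the paper:

```python
def contextual_regions(words, cand, d):
    """All text regions of length up to 2*d+1 centered on candidate words.

    words: tokenized review; cand: set of candidate context words.
    Returns (candidate_word, region_words) pairs, clipped at boundaries.
    """
    regions = []
    for t, w in enumerate(words):
        if w in cand:
            lo, hi = max(0, t - d), min(len(words), t + d + 1)
            regions.append((w, words[lo:hi]))
    return regions
```

For the review fragment ''the bed is clean and cozy'' with d = 2, the candidate ''clean'' yields the region [''bed'', ''is'', ''clean'', ''and'', ''cozy''].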
We therefore formally define a relevant context in our work as ''any word in a region of text that has an influence on the distributions of ratings.'' Such a context can take the form of a single word, adjacent words, or even nonadjacent words, provided they occupy the same text region. To identify the positions of these words, we first need to extract their associated contextual regions. Let Region denote the set of all contextual regions extracted from reviews and let region(c_n, d)_m ∈ Region be the contextual region at index m in Region. The positions of words to be constructed as a context can then be identified by the following steps.
1) Generate all possible combinations of words w_t ∈ region(c_n, d)_m of size θ (where θ ≤ 2×d+1) that include c_n, denoted by δ(c_n, w_t)_m.
2) Count the number of times each combination in δ(c_n, w_t)_m co-occurs in the same region with each rating value over the entire training data, and compute the variance of the resulting frequency distribution of ratings.
3) Choose the combination that contributes the highest variance in rating distribution as the context for region(c_n, d)_m. If no combination has a variance above min_var, c_n alone is considered as the context for that region. Store the rating distribution of this combination, dist(c_n, d)_m ∈ R^|Rating|, in the list of rating distributions Dist at index m.

Figure 7 illustrates the procedure for identifying the combination of size θ = 2 that contributes the highest variance for the region ''bed is clean and cozy''. Since (''clean'', ''cozy'') yields the highest variance, it is chosen to represent the context for this region.
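The three steps above can be sketched as follows, assuming the per-combination rating counts have already been accumulated over the training data (the co_counts structure, the 1-5 rating scale, and the function names are our assumptions):

```python
from itertools import combinations

def rating_variance(counts):
    """Variance of a per-rating frequency distribution."""
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)

def best_combination(c_n, region, theta, co_counts, min_var):
    """Pick the size-theta combination containing c_n with the highest
    rating-distribution variance; fall back to c_n alone below min_var.

    co_counts maps a sorted word tuple to its per-rating co-occurrence counts.
    """
    others = [w for w in region if w != c_n]
    best, best_var = (c_n,), min_var
    for combo in combinations(others, theta - 1):        # step 1
        key = tuple(sorted((c_n,) + combo))
        var = rating_variance(co_counts.get(key, [0] * 5))  # step 2
        if var > best_var:                               # step 3
            best, best_var = key, var
    return best
```

With counts resembling Fig. 7, the pair (''clean'', ''cozy'') wins over (''bed'', ''clean'') because its ratings concentrate at the positive end.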

3) LEARNING THE REGION EMBEDDINGS
From the previous step, we are able to extract the contextual regions Region and their associated rating distributions Dist from the review data. We now want to utilize them to train our predictive model so that, given a contextual region region(c_n, d)_m as an input, it predicts the rating distribution dist(c_n, d)_m as an output. To achieve this, a model with an ability to identify those words in region(c_n, d)_m that contribute to dist(c_n, d)_m is required. We therefore adopt the model used for region embedding with local context proposed by Qiao et al. [24] as our training model. This technique learns two representations for each word, namely a word embedding of itself and a local context unit, which is a weight matrix for its interaction with its neighboring words. Our aim is that the local context unit can be modified to emphasize the positions of the words that have influence on the rating distributions and can therefore be considered as contexts for each contextual region.
Our derived region embedding method for context extraction in CARE is shown in Fig. 8. This technique is a simple feedforward neural network model that takes a text region region(c_n, d)_m as an input and produces a rating distribution vector dist(c_n, d)_m as an output. Every word w_t ∈ Vocab is mapped to its word embedding, whereas only a candidate context word c_n ∈ Cand is mapped to its local context unit, which is used to produce the projected word embeddings. Finally, a region embedding, which is a representation of a text region, is generated from the projected word embeddings for use in computing a rating distribution.
Formally, every word w_t has an associated word embedding e_{w_t}, which is stored in a column of the embedding matrix E ∈ R^{h×|Vocab|}, where h is the embedding size and Vocab is the vocabulary of all words in the training data. In addition to its word embedding, a candidate context word c_n also has an associated local-context-unit matrix K_{c_n} ∈ R^{h×(2×d+1)}, which is stored in the tensor C ∈ R^{h×(2×d+1)×|Cand|}.
Given a contextual region region(c_n, d)_m as an input, the projected word embedding p_{w_t} of word w_t at index position l of region(c_n, d)_m is calculated by

p_{w_t} = e_{w_t} ⊙ K_{c_n}[:, l], (3)

where ⊙ denotes element-wise multiplication and K_{c_n}[:, l] is column l of K_{c_n}. That is, the word embedding e_{w_t} of word w_t at position l of region(c_n, d)_m is projected into the region of the candidate context word c_n by element-wise multiplication with the corresponding column l of K_{c_n}. This indicates that e_{w_t} can alter the semantic meaning of c_n. For example, w_t = ''very'' in the region of c_n = ''clean'' yields a positive meaning for the region, whereas w_t = ''not'' would result in a negative meaning.
After obtaining all projected word embeddings, the region embedding γ_{c_n,m} ∈ R^h of a contextual region region(c_n, d)_m is computed by

γ_{c_n,m} = max(p_{w_1}, p_{w_2}, ..., p_{w_{2×d+1}}), (4)

where max is a max pooling operation over all projected word embeddings, which is applied to extract the most predictive features in the region [24]. This indicates that the meaning of region(c_n, d)_m can be defined semantically by the meaning of neighboring words w_t with respect to the candidate context word c_n. For example, the two contextual regions ''very clean room'' and ''not clean room'' would give totally different region embeddings for the same c_n = ''clean.'' Finally, γ_{c_n,m} is fed into the fully connected layer to calculate the rating distribution dist(c_n, d)_m. Its objective is to predict a vector of rating distributions, and we adopt a multivariate linear-regression model for the prediction, as expressed by

dist(c_n, d)_m = W_f γ_{c_n,m} + b_f. (5)

Here, W_f ∈ R^{|Rating|×h} and b_f ∈ R^{|Rating|} are the weight matrix and bias vector in the fully connected layer, respectively, where |Rating| is the size of the categorical rating scores (e.g. |Rating| = 5 for a five-point rating score). We chose L2 as the loss function, following Qiao et al. [24], and Adam [33] as the optimizer. No regularization was applied.
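The forward pass just described (projection through the local context unit, max pooling, and the fully connected layer) can be sketched in NumPy as follows; the matrix layout and function name are our assumptions, and training with the L2 loss and Adam is omitted:

```python
import numpy as np

def region_embedding_forward(E_region, K, W_f, b_f):
    """Forward pass of the region embedding model for one contextual region.

    E_region: (h, 2d+1) word embeddings of the region's words, column per position.
    K:        (h, 2d+1) local context unit of the candidate context word c_n.
    W_f, b_f: fully connected layer mapping R^h -> R^|Rating|.
    Returns the region embedding gamma and the predicted rating distribution.
    """
    projected = E_region * K        # element-wise projection, per position
    gamma = projected.max(axis=1)   # max pooling over positions -> (h,)
    dist = W_f @ gamma + b_f        # linear map to rating distribution
    return gamma, dist
```

Swapping a region word (e.g. ''very'' for ''not'') changes the corresponding column of E_region and hence the pooled embedding, matching the intuition given above.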
After all model parameters are learned, each contextual region region(c_n, d)_m can now be mapped to its region embedding representation γ_{c_n,m}. Because the region embeddings are trained to represent the rating distributions of their associated contextual regions, we will discuss the quality of this representation further in Section V-D.
In the next section, we show how the extracted contextual regions and their associated region embeddings can be utilized for rating predictions in CARE.

B. RATING PREDICTION
The previous section described the extraction of contexts from reviews as contextual regions and their representation in terms of their associated region embeddings. In this section, we present an efficient rating prediction method for CARE that utilizes the extracted contexts to dynamically construct user and item representations. First, we give an overview and brief explanation of the architecture of our model. This is followed by a more detailed explanation of the proposed interaction and attention modules that are used in constructing the representations for users and items. We then describe the final step of the procedure, in which the rating of a particular review can be predicted using the derived representations.

1) MODEL OVERVIEW
To utilize the region embeddings for making personalized rating predictions, it is important to model the influence of each contextual region on the user's preferences and the item's features, which are used in determining a particular review's rating. To achieve this, we consider two important aspects of each contextual region, namely its relevance to each user's preferences or item's features and its contribution to a particular review's rating. First, the relevance of each contextual region to a user's personal preferences depends on how it has been expressed by that user in previous reviews. For example, many of the user's hotel reviews might have contained words such as ''cheap'', ''expensive'', or ''worth'', whereas ''small'', ''large'', or ''spacious'' might have been used less often. The implication is that this user is highly interested in the price when choosing a hotel, but is less concerned about the size of the room. Therefore, those contextual regions containing words related to the price of a room should be more relevant to this user's preferences than those containing words related to the size of a room. The same assumption can also be applied to the relevance of contextual regions to the item's unique features. Depending on what has been frequently described in their reviews, some hotels, for example, might be famous for their service, whereas others are better known for their convenient location.
Moreover, a review usually contains more than one contextual region, and different contextual regions might have unequal influences on the user's decision about the item, which would consequently affect the rating. We believe that the more relevant a contextual region is to an individual user's preferences or item's features, the more it should contribute to the rating of a particular review compared with the other regions within the same review.
By properly analyzing the relevance and the contribution of the contexts in a particular review, we can dynamically construct user and item representations that are specific and unique to that review. To implement this, we propose a rating prediction model, called the attentional interaction model for CARE (CARE-AttnIntr), whose architecture is shown schematically in Fig. 9. Our model is composed of two parallel neural networks, one each for user-context and item-context modeling. The model takes as input a review of user u_i on item v_j, denoted by y_i,j, that contains M contextual regions, denoted by Region(y_i,j). By looking up its corresponding word embeddings and local context unit (learned from the context extraction step), a region embedding is generated for each contextual region. The region embeddings are then fed into the user-context and item-context modeling networks. Each of these networks comprises two modules, namely an interaction module that models the relevance of contextual regions and an attention module that learns the contributions of those regions to a review's rating. Finally, the outputs of the user-context and item-context modeling networks are fed into the prediction layer, which generates the final prediction of a review's rating by using a latent factor model.

2) INTERACTION MODULE
To model the relevance of contextual regions to user preferences and item features, we introduce a user-context interaction matrix T_u ∈ R^(|User|×h) and an item-context interaction matrix T_v ∈ R^(|Item|×h). Each row t_ui ∈ T_u and t_vj ∈ T_v contains a vector representing the interaction with the contextual regions for user u_i and item v_j, respectively. To fully capture these interactions, the dimensionalities of t_ui and t_vj are set to h, the dimensionality of the region embedding. We then model the interaction of contextual region region(c_n, d)_m ∈ Region(y_i,j) with user u_i and item v_j by element-wise multiplication (⊙) between its region embedding γ_(cn,m) and t_ui or t_vj, respectively, as expressed by

γ_(cn,m),i = t_ui ⊙ γ_(cn,m),  γ_(cn,m),j = t_vj ⊙ γ_(cn,m). (6)

The vectors t_ui and t_vj can be considered projection vectors that convert the region embedding γ_(cn,m) into the user-relevance region embedding γ_(cn,m),i and the item-relevance region embedding γ_(cn,m),j, respectively. They are learned with the main objective of capturing the previous interactions of a contextual region with each individual user's preferences and each item's features. If a contextual region region(c_n, d)_m was mentioned a significant number of times in user u_i's reviews, its interaction with t_ui will result in high values in γ_(cn,m),i, indicating that it is highly relevant to u_i's preferences.

FIGURE 9. Illustration of the CARE-AttnIntr model for rating prediction.
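As an illustrative sketch (in NumPy, with hypothetical names and toy dimensions rather than the authors' implementation), the interaction module's projection of a region embedding into its user-relevance and item-relevance counterparts is simply an element-wise product with the corresponding interaction vectors:

```python
import numpy as np

# Hypothetical sketch of the interaction module; T_u, T_v, and all sizes
# are illustrative assumptions, not the paper's trained parameters.
h = 300                                   # region-embedding dimensionality
n_users, n_items = 1000, 500
rng = np.random.default_rng(0)
T_u = rng.uniform(-1, 1, (n_users, h))    # user-context interaction matrix
T_v = rng.uniform(-1, 1, (n_items, h))    # item-context interaction matrix

def project_region(gamma, i, j):
    """Element-wise multiplication of a region embedding with the user's
    and item's interaction vectors, yielding the relevance embeddings."""
    gamma_i = T_u[i] * gamma              # user-relevance region embedding
    gamma_j = T_v[j] * gamma              # item-relevance region embedding
    return gamma_i, gamma_j

gamma = rng.uniform(-1, 1, h)             # one region embedding
g_i, g_j = project_region(gamma, i=42, j=7)
```

During training, regions that a user mentions often would push the corresponding entries of t_ui toward larger magnitudes, which is what makes γ_(cn,m),i large for relevant regions.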
After all region embeddings γ_(cn,1), · · · , γ_(cn,M) of region(c_n, d)_1, · · · , region(c_n, d)_M ∈ Region(y_i,j) are converted into user-relevance and item-relevance region embeddings, they are fed into the attention module to compute their contributions to the rating of review y_i,j.

3) ATTENTION MODULE
The contribution of a contextual region region(c_n, d)_m to a particular review's rating depends on its degree of relevance to the user's preferences and the item's features compared with the other regions in that review. Because the user-relevance region embedding γ_(cn,m),i and the item-relevance region embedding γ_(cn,m),j indicate the relevance of region(c_n, d)_m to u_i's preferences and v_j's features, we can utilize them to model these contributions. We adopt an attention mechanism that has been successfully utilized in many deep-learning-based recommendation methods [6], [7]. Specifically, the attention network for modeling the contribution of region(c_n, d)_m in the user-context modeling network is defined by

a_(cn,m),i = W_attn^T g(W_attn_u γ_(cn,m),i + b_attn_u) + b_attn, (7)

where W_attn_u ∈ R^(k1×h), W_attn ∈ R^(k1), b_attn_u ∈ R^(k1), and b_attn ∈ R are model parameters, k1 denotes the size of the hidden layer in the attention network, and g is a nonlinear activation function. We then apply a softmax function to compute a normalized attention score for the contextual region region(c_n, d)_m with respect to the other M−1 contextual regions in Region(y_i,j):

α_(cn,m),i = exp(a_(cn,m),i) / Σ_{m'=1}^{M} exp(a_(cn,m'),i). (8)

The attention scores are then used to compute the weighted sum of the user-relevance region embeddings, which is fed into a fully connected layer to create a representation of user u_i that is specific to review y_i,j, denoted x_ui,y_i,j:

x_ui,y_i,j = W_u (Σ_{m=1}^{M} α_(cn,m),i γ_(cn,m),i) + b_u, (9)

where W_u ∈ R^(k2×h) and b_u ∈ R^(k2) are the weight matrix and bias vector, respectively, of a fully connected layer with a hidden layer of size k2. This user representation is dynamically generated to reflect the different levels of relevance among the contexts in y_i,j. A user who has written N reviews will therefore have N review-based representations, rather than the single static representation across all reviews used in most deep-learning-based methods.
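The attention computation above can be sketched as follows. This is a minimal NumPy illustration with randomly initialized parameters; the parameter names mirror the text, but the dimensions and initialization are assumptions:

```python
import numpy as np

# Minimal sketch of the user-side attention module; all parameters are
# randomly initialized stand-ins for learned weights.
h, k1, k2, M = 300, 32, 32, 4             # dims and number of regions (toy)
rng = np.random.default_rng(0)

W_attn_u = rng.normal(0, 0.1, (k1, h))
W_attn   = rng.normal(0, 0.1, k1)
b_attn_u = np.zeros(k1)
b_attn   = 0.0
W_u      = rng.normal(0, 0.1, (k2, h))
b_u      = np.zeros(k2)

def user_representation(user_region_embs):
    """Attention-weighted user representation for one review."""
    # Unnormalized attention score for each user-relevance region embedding
    scores = np.array([W_attn @ np.tanh(W_attn_u @ g + b_attn_u) + b_attn
                       for g in user_region_embs])
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                # softmax over the M regions
    weighted = (alphas[:, None] * user_region_embs).sum(axis=0)
    return W_u @ weighted + b_u           # fully connected layer

regions = rng.uniform(-1, 1, (M, h))      # user-relevance region embeddings
x_u = user_representation(regions)
```

Subtracting the maximum score before exponentiating is a standard numerical-stability trick for softmax and does not change the normalized weights.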
(We omit the computational details for the item representation x v j ,y i,j because they are very similar to those for the user representation.)

4) PREDICTION LAYER
We now describe how our user and item representations are used for the final rating prediction task. We adopt a latent factor model [25], which has been shown to be effective in many deep-learning-based prediction approaches [4], [5], [7], to model the interaction between x_ui,y_i,j and x_vj,y_i,j. Specifically, the rating of user u_i toward item v_j is estimated by

r̂_i,j = x_ui,y_i,j^T x_vj,y_i,j + b_ui + b_vj + μ, (10)

where b_ui, b_vj, and μ ∈ R respectively denote the bias for user u_i, the bias for item v_j, and the global bias. If a contextual region in y_i,j is highly relevant to both user u_i's preferences and item v_j's features, it should result in a high rating score. For example, (10) implies that a user who likes a city view at night will be recommended a hotel famous for its nighttime city view. We use Adam [33] as our optimizer and, to prevent overfitting, apply dropout on the hidden layer. We train with an L2 (squared-error) loss with regularization, expressed as

L = Σ_{(i,j)∈O} (r̂_i,j − r_i,j)² + λ‖Θ‖²,

where O denotes the set of observed user-item rating pairs, r_i,j is the observed rating score of user u_i toward item v_j, and Θ denotes our model parameters.
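A minimal sketch of the prediction and loss, assuming the review-specific representations have already been computed (all values below are toy numbers, not the paper's results):

```python
import numpy as np

# Sketch of the latent-factor prediction layer and the regularized
# squared-error objective; names and values are illustrative.
def predict(x_u, x_v, b_ui, b_vj, mu):
    """r_hat = x_u . x_v + user bias + item bias + global bias."""
    return float(x_u @ x_v) + b_ui + b_vj + mu

def loss(preds, targets, params, lam=0.1):
    """Squared error over observed pairs plus L2 regularization."""
    err = sum((p - r) ** 2 for p, r in zip(preds, targets))
    reg = lam * sum(float(np.sum(w ** 2)) for w in params)
    return err + reg

x_u = np.array([0.1, 0.2])
x_v = np.array([0.3, 0.4])
r_hat = predict(x_u, x_v, b_ui=0.2, b_vj=-0.1, mu=3.8)
print(round(r_hat, 2))  # → 4.01
total = loss([r_hat], [5.0], [x_u, x_v])
```

The bias terms let the model absorb systematic effects (a generous user, a popular item, a high global average) so that the dot product only has to explain the review-specific interaction.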

IV. EXPERIMENTAL EVALUATION
A. DATA PREPARATION
For our experiments, we used three publicly available datasets. The first was from TripAdvisor, 1 which contains hotel review data. The two other datasets, Amazon Software and Amazon Movies & TV, are members of the Amazon 5-core datasets 2 [34]. All datasets use a five-point rating system (users select an integer from 1 through 5). We performed some preprocessing on the review text for these datasets as follows: 1) tokenize the review text and convert all words to lower case; 2) remove all punctuation marks and infrequently used words (i.e., those with an appearance frequency below 0.01% across all reviews); 3) remove all stopwords listed by NLTK, 3 except for those carrying sentiment meaning, such as ''very'' or ''not''. After the preprocessing, one-word reviews were discarded as uninformative. The statistics of the datasets after preprocessing are summarized in Table 2. These datasets differ in some respects. For example, Amazon Software is the smallest and densest, whereas Amazon Movies & TV is the largest and also the sparsest of the three. Most of the reviews in all datasets were given very high scores, with Amazon Movies & TV having the highest average score. Despite being the smallest dataset, Amazon Software contains the highest number of unique words (i.e., the largest vocabulary) in its reviews and also has the longest reviews, on average. This contrasts with the Amazon Movies & TV dataset, which is the largest dataset but contains the smallest vocabulary and the shortest reviews. After applying our context extraction method, the number of candidate context words and their frequencies per review were found to correspond with the size of the vocabulary and the average number of words per review for each dataset.
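The three preprocessing steps can be sketched as follows. The stopword set here is a tiny stand-in for NLTK's list (kept self-contained to avoid a corpus download), and the frequency threshold and function name are illustrative:

```python
import re
from collections import Counter

# Sketch of the preprocessing pipeline; STOPWORDS is a toy stand-in for
# NLTK's list, explicitly keeping sentiment words such as "very"/"not".
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of"} - {"very", "not"}
MIN_FREQ_RATIO = 0.0001                   # drop words below 0.01% frequency

def preprocess(reviews):
    # 1) tokenize and lowercase (also strips punctuation via the regex)
    tokenized = [re.findall(r"[a-z']+", r.lower()) for r in reviews]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    cleaned = []
    for toks in tokenized:
        # 2) drop infrequent words; 3) drop stopwords
        kept = [w for w in toks
                if counts[w] / total >= MIN_FREQ_RATIO and w not in STOPWORDS]
        if len(kept) > 1:                 # discard one-word reviews
            cleaned.append(kept)
    return cleaned

docs = preprocess(["The room was VERY clean!", "Nice.", "Not worth the price."])
```

On this toy input, "Nice." is discarded as a one-word review after cleaning, while "very" and "not" survive stopword removal.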

B. BASELINES
To establish baselines for evaluation, we compared the performance of our proposed CARE with seven existing state-of-the-art rating prediction models: PMF, NMF, RC-Topic, RC-Word, DeepCoNN, NARRE, and CARL. The comparative characteristics of the baselines and our proposed CARE are listed in Table 3. The first two methods are latent factor models that utilize ratings alone to learn user and item representations. The remaining methods incorporate review data to learn such representations, except for RC-Topic and RC-Word, which learn a representation of each review itself. DeepCoNN and NARRE are review-based methods that do not consider contexts in reviews. RC-Topic, RC-Word, CARL, and our CARE consider the influence of contexts in reviews on rating predictions. More details about each method are summarized as follows.
• PMF [26]. Probabilistic matrix factorization (PMF) is a standard matrix factorization approach that models user and item latent factors as Gaussian distributions.
• NMF [27]. Nonnegative matrix factorization (NMF) is a matrix factorization technique for which each element in the latent factor is nonnegative.
• RC-Topic [21]. Rich-context topic (RC-Topic) learns a review representation based on a distribution of contextual topics defined by Bauman and Tuzhilin [20] and uses the factorization machine (FM) [35] for rating prediction.
• RC-Word. We modified RC-Topic [21] by identifying a set of context words, following Bauman and Tuzhilin [20], and used them to represent a review with a term frequency-inverse document frequency (TF-IDF) vector.
• DeepCoNN [4]. Deep cooperative neural network (DeepCoNN) employs two parallel CNNs to independently construct user and item representations from reviews, which are then used by FM for rating prediction.
• NARRE [7]. Neural attentional regression model with review-level explanation (NARRE) is an extension of DeepCoNN that applies review-level attention to model the contribution of each review to the rating.
• CARL [22]. Context-aware user-item representation learning model (CARL) applies a CNN and word-level attention to represent a context as the influence of each word in reviews on the ratings.

C. EXPERIMENTAL SETTINGS
For our evaluations, we randomly selected 80% of each dataset as the training set, 10% as the validation set, and the remaining 10% as the test set. Because our model includes two main parts (context extraction and rating prediction), each part required different experimental settings.

1) SETTINGS FOR CONTEXT EXTRACTION
To extract a list of candidate context words, we first created a word-rating co-occurrence matrix for each dataset. Here, the main problem is that many datasets contain biases in the proportion of ratings provided by users. For example, in the TripAdvisor dataset, more than 80% of all reviews were rated ''4'' or ''5'', meaning that most users preferred to give high rating scores to most hotels. This causes almost every word in the corpus to be distributed toward high rating scores, as shown by the example in Fig. 10 (a). This is in contrast to the fact that reviews containing words such as ''rude'' should be rated with a low rating score rather than a high one. To properly analyze the actual influence of a word on the rating distribution, we therefore applied a data standardization technique, expressed by

z_t,r = (x_t,r − μ_r) / σ_r,

where x_t,r is the original frequency of word w_t given for rating r, μ_r is the average of the frequencies of all words given for rating r, and σ_r is the standard deviation of those frequencies. The rating distribution after applying this standardization is shown in Fig. 10 (b). The frequencies of ratings for the word ''rude'' are now distributed toward low rating scores, which is appropriate to its negative meaning. After standardization, we computed the variance of the rating distribution of each word and selected the words with variances exceeding min_var = 1 as the set of candidate context words for each dataset. The numbers of such words extracted from each dataset and the average number per review are given in Table 2. To extract the contextual regions for each candidate context word, we set a region size of five and applied a padding of length d = 2 to the head and tail of every review (because a candidate context word might be the first or last word in a review).
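The standardization and variance-based selection can be sketched as follows; the toy co-occurrence matrix and word labels are invented for illustration:

```python
import numpy as np

# Sketch of the per-rating z-score standardization of the word-rating
# co-occurrence matrix and the variance-based candidate selection.
# Rows of X are words, columns are ratings 1-5; all values are invented.
def standardize(X):
    mu = X.mean(axis=0)         # mean frequency per rating column (mu_r)
    sigma = X.std(axis=0)       # std of frequencies per rating column (sigma_r)
    return (X - mu) / sigma     # z_{t,r} = (x_{t,r} - mu_r) / sigma_r

def candidate_words(X, vocab, min_var=1.0):
    Z = standardize(X)
    variances = Z.var(axis=1)   # variance of each word's rating distribution
    return [w for w, v in zip(vocab, variances) if v > min_var]

X = np.array([[90., 40., 10.,  5.,  2.],    # "rude": skewed to low ratings
              [ 2.,  5., 10., 40., 90.],    # "great": skewed to high ratings
              [30., 30., 30., 30., 30.]])   # "room": no clear pattern
Z = standardize(X)
cands = candidate_words(X, ["rude", "great", "room"])
```

After standardization, each rating column has zero mean, so words with flat distributions end up with low variance and are filtered out, while skewed words like "rude" survive as candidates.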
Because some candidate context words might be associated with millions of contextual regions, using all of them for training could cause scalability problems. In fact, this is unnecessary because a sampled portion alone can cover all unique patterns in the rating distributions. We therefore defined a criterion for sampling a subset of the contextual regions for a candidate context word c_n. Specifically, let Region_cn denote the set of all contextual regions of c_n. If |Region_cn| > 100k, only a 10% subset was used for training; if 10k ≤ |Region_cn| ≤ 100k, 10k regions were used; and if |Region_cn| < 10k, all were used. To assign a rating distribution to each contextual region, we set the size of the word combination θ = 2 and selected those with variances exceeding min_var = 1.
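The sampling rule follows directly from the three thresholds; the function name and RNG handling below are illustrative:

```python
import random

# Sketch of the region-sampling criterion for one candidate context word;
# thresholds (10k, 100k, 10%) follow the text, the rest is illustrative.
def sample_regions(regions, rng=None):
    rng = rng or random.Random(0)
    n = len(regions)
    if n > 100_000:
        return rng.sample(regions, n // 10)   # 10% subset
    if n >= 10_000:
        return rng.sample(regions, 10_000)    # cap at 10k regions
    return regions                            # fewer than 10k: use all

assert len(sample_regions(list(range(5_000)))) == 5_000
assert len(sample_regions(list(range(50_000)))) == 10_000
assert len(sample_regions(list(range(200_000)))) == 20_000
```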
In the training process, our word embeddings E and local context units K were initialized randomly in terms of a uniform distribution with values between −1 and +1. The embedding size h for all datasets was set to 300. The learning rate was optimized from {0.0001, 0.001}, and the batch size was selected from {128, 256, 512, 1024, 2048}, using a validation set.

2) SETTINGS FOR RATING PREDICTION
Because the predictions of all comparative methods are based on the latent factor model, they require a significant number of ratings per user or per item to learn high-quality representations. This significant number was set to 5 for the Amazon Software and TripAdvisor datasets and 20 for the Amazon Movies & TV dataset. We therefore eliminated the reviews that did not meet this significance criterion. Furthermore, we assumed that the review texts would be available during both the training and testing stages, because they are exploited by RC-Topic, RC-Word, and CARE when extracting contexts and making rating predictions.
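A sketch of this significance filter, under the assumption that a review is kept only when both its user and its item meet the threshold (one possible reading of the criterion; the function name and data layout are illustrative):

```python
from collections import Counter

# Hypothetical sketch of the significance filter; min_count would be 5 or 20
# depending on the dataset, and reviews are (user_id, item_id, rating) tuples.
def filter_reviews(reviews, min_count=5):
    users = Counter(u for u, _, _ in reviews)
    items = Counter(v for _, v, _ in reviews)
    return [r for r in reviews
            if users[r[0]] >= min_count and items[r[1]] >= min_count]

toy = [("u1", "i1", 5), ("u1", "i2", 4),
       ("u2", "i1", 3), ("u2", "i2", 5),
       ("u3", "i1", 1)]                   # u3 has only one rating
kept = filter_reviews(toy, min_count=2)   # u3's review is dropped
```

A stricter variant would re-apply the filter until a fixed point (a k-core), since removing a review lowers the counts of its item and user; the single pass above matches the text's one-shot description.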
For PMF and NMF, the number of latent dimensions was set to 15, the learning rate to 0.005, and the regularization parameter to 0.001.
For RC-Topic and RC-Word, we separated the reviews by applying K-means clustering for the set of review features defined by Bauman and Tuzhilin [20], including the number of words, number of verbs, and number of verbs in the past tense. The number of top contextual topics was selected from {20, 30, 50, 100, 150, 200}, and the number of top contextual words was selected from {1000, 2000, 5000, 10000}. We used LibFM [35] to implement the FM, following Peña [21]. All FM parameters were set to default values.
For DeepCoNN, NARRE, and CARL, the number of convolutional kernels was selected from {50, 100}, the window size for CNN was set to 3, the learning rate was selected from {0.0001, 0.0005, 0.001}, the regularization parameter was selected from {0.001, 0.01, 0.1}, and the dropout rate was optimized between 0.1 and 0.5. The number of latent dimensions was set to 32 and the embedding size was set to 300 for all these models.
For the prediction model in CARE, the learning rate was selected from {0.0001, 0.0005, 0.001, 0.005, 0.01}, the regularization parameter λ was selected from {0.01, 0.1, 1}, and the dropout rate was set to 0.2. The numbers of latent dimensions k 1 and k 2 were both set to 32. We selected the hyperbolic tangent (tanh) as the activation function for (7). Because different reviews contain different numbers of contextual regions, we set 128 as the maximum number of contextual regions that would be extracted from each review. The batch sizes for both the proposed model and the baseline models were optimized from among {16, 32, 64, 128, 256}.

3) EVALUATION METRICS
We evaluated the performance of CARE against the baseline systems in terms of prediction accuracy using three ranking evaluation metrics. The first was the normalized discounted cumulative gain (NDCG), which evaluates the ranking accuracy of the Top-K recommendation list for each user, as expressed by

NDCG@K = DCG@K / IDCG@K,  DCG@K = Σ_{j=1}^{K} (2^{rel_j} − 1) / log_2(j + 1),

where IDCG@K is the DCG@K of the ideally ordered list. Here, we set rel_j ∈ {1, 2, 3, 4, 5} as the actual rating score of the item at rank position j. Because most users in Amazon Software and TripAdvisor had rated items fewer than seven times, we chose NDCG@3, NDCG@5, and NDCG@7 for their evaluation, whereas for Amazon Movies & TV, with more than 10 ratings per user, we chose NDCG@5, NDCG@10, and NDCG@15. In addition to NDCG, we also evaluated performance by calculating the hit ratio (HR) and the mean reciprocal rank (MRR). We first classified the reviews in our test data with ratings above ''3'' as positive reviews and the remainder as negative reviews. HR@K is calculated as the number of positive reviews appearing in the Top-K recommendation list for each user, whereas MRR is computed from the rank of the first positive review in each user's recommendation list, as given by (14) and (15), respectively:

HR@K = Σ_{j=1}^{K} rel_j,  (14)
MRR = (1/|U|) Σ_{u_i ∈ U} 1 / rank_ui.  (15)
Here, we set rel_j = 1 for a positive review and 0 otherwise. The value rank_ui is the rank position of the first positive review in u_i's recommendation list. We chose HR@5 for evaluating performance on all datasets.
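The three metrics can be sketched as follows. The NDCG gain function shown is the common (2^rel − 1)/log2(j + 1) formulation, assumed here since the paper's exact equation is not reproduced above; `positives` is a user's ranked list of 0/1 labels:

```python
import math

# Sketch implementations of the three ranking metrics (names illustrative).
def dcg(gains, k):
    # rank positions j = 1..k correspond to enumerate indices 0..k-1
    return sum((2 ** g - 1) / math.log2(j + 2) for j, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

def hr_at_k(positives, k):
    """Number of positive reviews (rating > 3) in the Top-K list."""
    return sum(positives[:k])

def mrr(positives):
    """Reciprocal rank of the first positive review in one user's list."""
    for j, p in enumerate(positives, start=1):
        if p:
            return 1.0 / j
    return 0.0
```

For example, a list already ordered by true rating achieves NDCG of 1.0, and a user whose first positive review appears at rank 3 contributes a reciprocal rank of 1/3; the dataset-level MRR averages this over all users.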

D. EXPERIMENTAL RESULTS
The NDCG, HR, and MRR values for all baseline systems and CARE on each of the three datasets are presented in Table 4 and Table 5. From these tables, note that CARE achieves the highest accuracy for almost every MRR value and every rank of NDCG and HR across all datasets. Next come the deep-learning-based methods that utilize review data (DeepCoNN, NARRE, and CARL). Although CARL, which is a context-aware method, performs quite well on Amazon Software, DeepCoNN and NARRE perform better on the TripAdvisor and Amazon Movies & TV datasets. DeepCoNN and NARRE obtain very similar results across all datasets, although DeepCoNN seems to perform slightly better on TripAdvisor and Amazon Movies & TV. Furthermore, the other two context-aware baseline systems (RC-Topic and RC-Word) achieved quite good NDCG values on Amazon Software but, like CARL, were less effective on the other two datasets. Finally, PMF and NMF, which do not consider review information, yielded the lowest accuracies on all evaluation metrics.

V. DISCUSSION
In this section, we provide a detailed analysis of the performance of the proposed method and the baseline systems. We consider first the predictive performance, followed by a detailed analysis of our model in various aspects. We then review the list of contexts extracted from various recommendation domains. We analyze the quality of the region embeddings learned by our extraction method to investigate its suitability for the rating prediction task. Finally, we analyze the merits of defining and extracting contexts as contextual regions in terms of an illustrative example and a sentiment classification task.

A. PREDICTIVE PERFORMANCE
Here, we discuss three main aspects of predictive performances. First, we analyze the effectiveness of leveraging review content in making predictions. We then analyze how identifying and incorporating contexts from reviews affect the prediction accuracy. Finally, we discuss how the dynamic modeling of user and item representations differs from and improves on the use of the static-representation approach.

1) UTILIZING REVIEW DATA
We first discuss the merit of utilizing review data for making recommendations. As shown by Tables 4 and 5, all methods that leverage review content to learn user and item representations (DeepCoNN, NARRE, CARL, and our CARE) returned better values for NDCG, HR, and MRR than the standard CF-based methods that ignore reviews (PMF and NMF). This demonstrated that the rich and useful information embedded in reviews helps the learning of more appropriate representations, which more accurately capture the personal preferences of a user or the unique features of an item. Utilizing these representations consequently resulted in more accurate rating predictions.
Despite leveraging review data, the RC-Topic and RC-Word methods did not always provide better prediction accuracy than the standard CF-based methods that do not consider reviews. For example, they gave lower HR and MRR values on the TripAdvisor dataset than PMF and NMF. This might be because both RC-Topic and RC-Word utilize review content to learn a representation of the review itself and use it directly for rating prediction, rather than learning representations for users and items based on a latent factor model, as the other methods do. We believe that their review representations do not capture the personalized information relating to user preferences or item features, which makes their predictions less accurate than those of latent factor-based methods.

2) INCORPORATING CONTEXTUAL INFORMATION
We now analyze the effect on rating prediction of incorporating the contextual information hidden in reviews. Comparing the results of the review-based context-aware methods (RC-Topic, RC-Word, CARL, and CARE), we find that only CARE surpassed the accuracy of the review-based methods that do not consider contexts (DeepCoNN and NARRE).
We begin with an analysis of RC-Topic and RC-Word, which are review-based context-aware methods based on topic modeling and TF-IDF representations, respectively. Although they mostly performed better than the standard CF-based methods (PMF and NMF), they were the least accurate of the review-based methods. This might be because the performance of their context extraction depends on the quality of the reviews. Under the definition used in RC-Topic and RC-Word, contexts can be inferred only from high-quality reviews, which often contain a significant number of words. For review data containing many low-quality reviews, the context extraction would be less effective. In contrast, CARE is capable of extracting contexts from any kind of review, provided there is at least one candidate context word embedded in it. This makes CARE more robust to review quality than RC-Topic and RC-Word. We analyze the impact of review quality on CARE further in Section V-B3.
We now analyze the results from CARL, a context-aware method based on deep learning. With the advantage of utilizing both deep learning and an attention mechanism, CARL outperformed both RC-Topic and RC-Word on almost every dataset. However, although it outperformed the non-context deep-learning methods (DeepCoNN and NARRE) on Amazon Software, it obtained less accurate results on the other two datasets. We believe this occurred for two possible reasons. First, CARL treats the contribution of every word in the reviews to the rating as the influence of contexts. Some of these words, however, are irrelevant to the user's preferences or the item's features and could degrade the quality of the representations. Second, CARL computes its attention score based on words from all previous reviews, rather than focusing on those contained in a recent review. In context-aware recommendation, contexts are relevant at the time the rating is created and are therefore applicable only to a particular review and not to others. By considering words from all previous reviews, CARL incorporates irrelevant contexts that are not associated with the current rating situation, thereby constructing less effective representations for users and items. On the other hand, CARE constructs the representations based only on those words (and their neighbors) in a particular review that influence the rating distribution, and achieved better results on all datasets. This supports our assumption that considering only the words in a single review that are relevant to the user's preferences and the item's features better captures the contextual information and results in more effective and meaningful representations.

3) STATIC AND DYNAMIC REPRESENTATIONS
Finally, we analyze the effectiveness of constructing user and item representations dynamically, rather than relying on a static set of representations.
First, we consider the predictive performance of the methods that learn static user and item representations (DeepCoNN, NARRE, and CARL). Although the attention-based baseline systems (NARRE and CARL) outperform DeepCoNN on the Amazon Software dataset, they deliver lower overall NDCG, HR, and MRR values on the other two datasets. This implies that applying an attention mechanism does not always improve the prediction accuracy if it is not properly integrated into the model. One concern about the attention mechanisms of both NARRE and CARL is that their attention scores are computed from the contents of previous user or item reviews, rather than utilizing the content of the target review (the review for which we want to predict a rating). The representations of NARRE, CARL, and DeepCoNN consider a user's past preferences or an item's past features but barely capture more relevant information, such as contexts, which can be extracted only from the target review.
In contrast, our dynamic approach to constructing user and item representations achieves a higher overall prediction accuracy than the static-representation approach. It focuses on utilizing the text of the target review as the main source from which to extract the relevant contexts for a user-item pair. We first apply our interaction module to model the relevance of each extracted context in the review to the individual user's preferences and the item's features. We then compute the attention score for each context based on its relevance level compared with the other contexts embedded in the same review. This helps to construct fine-grained user and item representations that dynamically capture the relevance of the contexts in a particular review to the user's preferences and the item's features.
In the next section, we analyze further the performance of CARE from various angles and in various situations.

B. MODEL ANALYSIS
In this subsection, we first study the impact of the attention and the interaction modules in CARE. Next, we analyze the performance of our model for various parameter settings. We then consider the impact of the review quality on the performance of context-aware methods. Finally, we study the robustness of CARE in situations of rating sparsity.
All evaluations in this section were conducted on the TripAdvisor dataset.

1) ATTENTION AND INTERACTION MODULES
To evaluate the effect of incorporating attention and interaction modules, we created four variants of CARE based on the four possible combinations of these two modules, as listed below.
• CARE-AttnIntr: a model that incorporates both attention and interaction modules, that is, our main model.
• CARE-Intr: a model that considers only the interaction module, with the attention module being ignored. The user and item representations are constructed only from their previous interactions with contextual regions in a review. Specifically, instead of computing an attention score and using it as a weight for each region embedding in (9), we directly average all projected region embeddings obtained from (6) to form the user and item representations.
• CARE-Attn: a model that considers only the attention module, with the interaction module being ignored. This model aims to find the contribution of each contextual region to a review's rating, when compared with the other regions in the same review. Technically, we use the region embeddings generated from context extraction directly as an input for (7), without applying the projection operation of (6).
• CARE-Base: a model without attention or interaction modules. The user and item representations are obtained by directly averaging all input region embeddings for a review. In the fully connected layer, we use one shared weight matrix for all users and another for all items.

The hyperparameters of each variant were tuned with the same settings as the main model, as described in Section IV-C2. Their predictive performances are presented in Fig. 11, which shows that the CARE-Base model (no attention or interaction module) gave the lowest overall NDCG values of all the variants. The implication is that applying our attention and/or interaction modules improves the performance of our proposed model.
By modeling the varying influences of the contexts in a review on the rating through the standard attention mechanism, the CARE-Attn model made more accurate predictions than the base model. This demonstrates that different contexts can have different impacts on the rating behaviors of all users and items. However, because CARE-Attn does not consider the relevance of each context to an individual user's preferences or item's features, two reviews containing the same set of contexts would produce the same user and item representations. That is, the representations generated by this model depend only on the different contributions of the contexts embedded in the review, regardless of any personalized interaction with the user or item. This makes CARE-Attn suitable for a sparse dataset in which most users participate only rarely or most items are rated only infrequently. However, if a review contains too few or too many candidate context words, CARE-Attn might not be able to exploit the information effectively, thereby degrading the recommendation quality. The impact of the number of candidates is analyzed further in Section V-B2.
Although applying an attention mechanism improved accuracy over the base method, CARE-Attn still did not match the performance of CARE-Intr, which models the relevance of contexts to an individual user's preferences and item's features. The representations of this model are unique to each user or item, even for reviews containing the same set of contexts, which is more appropriate for making personalized recommendations. However, because CARE-Intr relies on previous interactions with contexts, it requires a significant number of reviews to precisely capture the relevance of each context. Moreover, unlike CARE-Attn, this model does not consider the varying influence of different contexts on a review's rating, which should be a factor in improving the recommendation quality.
So far, we have explained that CARE-Attn and CARE-Intr each have their own advantages, which contribute to improved predictive performance. Recognizing the trade-off between these two methods, we found that the combined model, CARE-AttnIntr, achieved the best predictive performance among all the variants. This is convincing evidence that the influence of each context can be adequately modeled based not only on its relevance to the target user's preferences and item's features but also on its contribution to the rating of a particular review compared with the other contexts in the same review. This model, however, also inherits the characteristics of both CARE-Attn and CARE-Intr, in that it requires an appropriate number of candidate context words per review and enough reviews per user and per item to learn high-quality user and item representations.

2) PARAMETER SENSITIVITY
We now study the impact of our model parameters on the performance of CARE. These parameters include the number of candidate context words, the region size, and the embedding size. Figure 12 shows the impact of the number of candidate context words per review on the prediction accuracy. Because different reviews contain different numbers of candidate context words, we should address how these numbers affect the performance of our method. To investigate this, we set a fixed maximum number of candidate context words to be extracted from each review. As shown in Fig. 12 (a), increasing the number of candidates across {1, 16, 32, 64, 128} yields higher accuracy. However, at a maximum of 256, the NDCG@5 and NDCG@7 values start to decrease. The main reason might be that reviews containing very many candidates are unusually long reviews. As shown in Fig. 12 (b), most reviews in the TripAdvisor dataset contain only about 30 candidates, with fewer than 2,500 words per review on average. Some of the very long reviews (those with more than 100 candidates) could be spam or otherwise less useful reviews, which could degrade the prediction accuracy. In addition, considering too many candidate context words might weaken the effectiveness of our attention module, because uniformly distributed attention scores would become more likely. Because using 128 candidates yields the highest predictive accuracy, we used this value in our training process.
We now investigate the impact of differently sized text regions. Increasing the region size means having more neighboring words to be identified and incorporated as contexts together with the candidate context words. Figure 13 shows the predictive performance and the validation loss in context extraction for each region size in {1, 3, 5, 7}. (Region size = 1 means that a context is constructed only from the candidate context word itself, without any consideration of its neighboring words.) Despite having a very high validation loss at the beginning, using this region size completed its training with the lowest loss when compared with other region sizes. This is because it only learns one word embedding to represent one rating distribution for each candidate, which has minimal complexity. However, its prediction accuracy is also the lowest for the various region sizes. This implies that considering only a single word as context is insufficient to accurately capture the actual influence of relevant contexts on the distributions of ratings. Because increasing the region size increases the possible number of combinations between neighboring words and the candidates to be constructed as contexts, it significantly increases the number of corresponding rating distributions, raises the model complexity, and increases the validation loss. The prediction accuracy, however, improves significantly as the region size increases from 1 to 5. This supports our assumption that incorporating neighboring words benefits the construction of more relevant contexts for capturing rating distributions, resulting in more accurate recommendations. However, an excessively large region size may also degrade the performance, as shown by the NDCG@7 value for region size = 7. This is because the model might mistakenly incorporate words from different phrases or sentences to construct incorrect contexts. 
For example, suppose that we extracted a context from the text region ''is really worst services, the only good''. If we set the region size = 3 or 5, the context could be extracted as ''worst services'', but if we set the region size = 7, the context might be constructed as ''services good''. Exploiting such unintended contexts could consequently affect the rating prediction score. We therefore selected region size = 5 for our parameter setting because of its optimal performance.
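The windowing behavior described above can be sketched as follows. This is a simplified illustration of clipping a token window of `region_size` around each candidate context word; the names (`contextual_regions`) and the dictionary return type are our own, and the actual method additionally learns local context units over these windows rather than just collecting tokens.

```python
def contextual_regions(tokens, candidates, region_size):
    """For each candidate context word, return the window of
    `region_size` tokens centred on it (clipped at the text ends)."""
    assert region_size % 2 == 1, "region size is an odd window width"
    half = region_size // 2
    regions = {}
    for i, tok in enumerate(tokens):
        if tok in candidates:
            regions[tok] = tokens[max(0, i - half): i + half + 1]
    return regions

tokens = "is really worst services , the only good".split()
regions = contextual_regions(tokens, {"services"}, 5)
# With region size 5, the window around "services" spans
# ['really', 'worst', 'services', ',', 'the'].
```

Widening the window (region size = 7) pulls in tokens from the neighboring clause, which is exactly how unintended combinations such as ''services good'' can arise.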
Finally, we analyze the impact of different embedding sizes, which involves the dimensions of our word embeddings, the local context units, and the corresponding region embeddings. Figure 14 gives the predictive performance and the validation loss of our model when trained with different embedding sizes chosen from {50, 100, 150, 300, 450}. The loss values in Fig. 14 (b) show that the larger embedding sizes are more effective for representing the rating distributions, although the improvement is small between 300 and 450. Accordingly, the larger embedding sizes also produce more accurate recommendations. A size of 300 was chosen for our model setting because it gives a near-optimal overall prediction accuracy and requires less computation time than using a size of 450.

3) IMPACT OF REVIEW QUALITY
We now consider how the quality of reviews affects our context extraction and rating prediction. First, we need a method for dividing reviews with respect to their quality. We follow Bauman and Tuzhilin [20] in classifying reviews into context-rich and context-free reviews based on the richness of their contexts. The criteria used for classification include review features such as the number of words, the number of verbs, and the number of verbs in the past tense. After the reviews are classified, we apply our context extraction method to extract the candidate context words and their associated contextual regions from each type of review. Table 6 compares the statistics for each type of review on all three datasets. These statistics include the average number of words and the average number of extracted contexts (candidate context words) per review. As shown in Table 6, the number of words in context-rich reviews is significantly higher than in context-free reviews. Applying our context extraction method consequently resulted in more candidate context words being extracted from each context-rich review for all datasets. This indicates that the quality of reviews significantly affects the number of contexts extracted by our method.
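As a rough illustration of the feature-based split, the sketch below computes two of the surface features mentioned above and applies hand-picked thresholds. These thresholds and the naive ''-ed''-suffix past-tense test are our own stand-ins for exposition only; Bauman and Tuzhilin's actual criteria use proper linguistic features (e.g., POS-tagged verb counts) and learned decision rules.

```python
import re

def review_features(text):
    """Surface features in the spirit of the context-rich/context-free
    split: word count and a crude past-tense verb count (regular '-ed'
    forms only; a real system would use a POS tagger)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    past_tense = [w for w in words if w.endswith("ed") and len(w) > 3]
    return {"n_words": len(words), "n_past": len(past_tense)}

def is_context_rich(text, min_words=50, min_past=2):
    """Illustrative thresholds only; the actual criteria are learned."""
    f = review_features(text)
    return f["n_words"] >= min_words and f["n_past"] >= min_past
```

Under this heuristic, a long narrative review describing what the user did ("visited", "enjoyed", ...) is classified as context-rich, while a terse one-liner is context-free, mirroring the word-count gap reported in Table 6.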
In addition, we analyze how this difference in the number of contexts affects the prediction capability of CARE. To do so, we evaluated the predictive performance of CARE on each type of review. We divided the reviews in the test data of all three datasets into context-rich and context-free reviews and evaluated the performance for each set of reviews separately. Figure 15 compares the NDCG@5 of CARE on each type of review from the three datasets. As the figure shows, the prediction accuracies of CARE on context-rich reviews were significantly higher than those on context-free reviews. The differences in accuracy were even more pronounced on the Amazon Software dataset, in which the numbers of contexts extracted from the two types of reviews differed most. We can infer from this result that the review quality also has a significant impact on our predictive performance. With high-quality reviews, we can extract more contexts, and having more information with which to model the rating resulted in more accurate prediction.

4) PERFORMANCE ON SPARSE DATA
Finally, we analyze the performance of CARE under conditions of rating sparsity. We modified the training data from the TripAdvisor dataset for two types of rating sparsity, namely user-rating sparsity (each user provided only one rating) and item-rating sparsity (each item is rated only once). The sparsity ratios for the user-rating sparsity data and the item-rating sparsity data were 0.99937 and 0.99974, respectively. The NDCG@5 values achieved by CARE and all the baseline systems when using the sparsity-modified data are given in Fig. 16. As Fig. 16 shows, all methods that utilize review data for constructing user and item representations (DeepCoNN, NARRE, CARL, and CARE) produced more accurate predictions than those constructing such representations using only rating data (PMF and NMF). This implies that, in rating-sparsity situations, review content can be used to construct more effective user and item representations than utilizing only the ratings, leading to more accurate rating prediction.
However, utilizing review data does not always solve the sparsity problem. Figure 16 shows that both RC-Topic (in particular) and RC-Word produced significantly lower prediction accuracies than the other review-based methods and even lower prediction accuracies than the rating-based methods (PMF and NMF). This may be because both RC-Topic and RC-Word depend heavily on word frequency when building their review representations. Although the review data might be sparse, its vocabulary size could still be very large and have very low word frequencies. These two methods, therefore, do not have sufficient data to effectively model their topic or word distributions from which to build high-quality review representations. Utilizing these poor representations will then result in low-accuracy rating predictions.
Finally, we discuss the predictive performance of CARE, compared with the other deep-learning-based representations (DeepCoNN, NARRE and CARL). Figure 16 shows that CARE is able to achieve the highest NDCG@5 values for both the user-rating and item-rating sparsity datasets. Note that in DeepCoNN, NARRE, and CARL, these representations are generated statically from corresponding previous reviews. Such representations, however, tend to overfit with sparse data, for which each user or item provides only one or very few reviews. Utilizing these representations to predict ratings for reviews of unseen items will then be less effective. In contrast, CARE utilizes previous reviews only to model the interactions of users and items with contexts, whereas the representations themselves are generated dynamically based on any context being extracted from each particular review. This makes CARE's representations less affected by the sparsity in previous reviews, thereby achieving more accurate predictions.

C. CONTEXT ANALYSIS
In this subsection, we aim to show that by defining contexts in reviews as words that influence the distribution of ratings, our proposed method has the flexibility to extract contexts from review data across a variety of recommendation domains. To achieve this, we analyzed the list of candidate context words extracted from multiple review datasets across different domains. In addition to the TripAdvisor, Amazon Software, and Amazon Movies & TV datasets, we also incorporated the Yelp dataset, which contains hotel and restaurant reviews, and four further categories from the Amazon Product dataset [34], namely Fashion, Grocery & Gourmet Food, Toys & Games, and Digital Music. Examples of candidate context words extracted for each dataset are given in Tables 7 and 8. We first analyze the list of candidate context words extracted from TripAdvisor and Yelp, which involve similar recommendation domains. As highlighted in Table 7, our extraction method was able to discover mutual words across these two datasets, such as ''area'', ''clean'', and ''friendly'', which indicate user preferences toward hotel features. This demonstrates that our proposed extraction method has a generalized ability to extract a similar set of contexts from different datasets in similar domains. Table 8 gives a list of candidate context words extracted across six categories of the Amazon Product dataset. These results indicate that our extraction method is capable of extracting exclusive words that are strongly related to each domain. Examples include ''fit'' for fashion, ''delicious'' for food, ''catchy'' for music, and ''acting'' for movies. This demonstrates that our method has the flexibility to extract contexts from many kinds of review-rating datasets, independent of the dataset's domain. Moreover, in addition to these exclusive words from particular domains, our method was able to extract sentiment and polarity words such as ''great'' or ''not'' across all the domains.
This supports our assumption that these words also influence the distribution of ratings and should therefore accompany the context words used in making rating predictions.

D. EMBEDDING ANALYSIS
In the previous subsection, we showed that our extraction method is able to extract words representing contexts from multiple recommendation domains. In this subsection, we aim to show that the region embeddings, which are generated from context words and their neighboring words, accurately capture the rating distributions of their corresponding contextual regions and are therefore useful for rating prediction. Our assumption is that contextual regions that contribute similar rating distributions should generate region embeddings that are close to each other in the embedding space.
To investigate this assumption, we first define a method for categorizing the distributions of ratings into classes. Our approach assigns a class to each rating distribution based on its direction (positive or negative). For example, the frequencies of ratings in the distribution dist(c_n, d)_1 = [8, 25, 34, 56, 95] are positively distributed toward high rating scores, whereas those of dist(c_n, d)_2 = [103, 75, 41, 18, 3] are negatively distributed toward low rating scores. We would then categorize dist(c_n, d)_1 as belonging to a positive class, whereas dist(c_n, d)_2 should belong to a negative class. To implement this categorization, we chose the Pearson correlation coefficient to compute a correlation score between the rating distribution and an ordinal rating vector:

corr(dist(c_n, d)_m, score_R) = cov(dist(c_n, d)_m, score_R) / (σ_dist(c_n, d)_m · σ_score_R),

where cov denotes the covariance function, σ is a standard deviation, and score_R ∈ Z^|Rating| is an ordinal rating score vector (sorted in ascending order) for which |score_R| = |dist(c_n, d)_m|. For example, score_R = [1, 2, 3, 4, 5] could be used for rating data on a five-point rating scale. After computing the correlation score for each dist(c_n, d)_m using score_R, we can assign it to a class by using the categorization criteria given in Table 9, where we categorize the rating distributions into five classes: Strong Positive, Positive, Neutral, Negative, and Strong Negative. To visualize the subtle differences between the region embeddings, we sampled contextual regions from the TripAdvisor and Amazon Movies & TV datasets, categorized them based on their corresponding rating distributions, and generated their region embeddings. Figures 17 and 18 were obtained by applying t-distributed stochastic neighbor embedding (t-SNE) [36] to the sampled region embeddings from each dataset, where the color of each point denotes the class of its associated rating distribution. In Fig. 17, for each dataset, we sampled 50 contextual regions from each class (250 in total) and plotted their corresponding region embeddings. Note that the groups of region embeddings representing the positive and negative classes are fairly distinguishable. This supports our assumption that contextual regions with similar rating distributions are mapped close to each other in the embedding space.
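The categorization step described above can be sketched directly. The Pearson correlation and the worked examples dist(c_n, d)_1 and dist(c_n, d)_2 are from the text; the class thresholds (0.8 / 0.3) are illustrative stand-ins, since the exact criteria of Table 9 are not reproduced here.

```python
import numpy as np

def rating_class(dist, score=None, strong=0.8, weak=0.3):
    """Categorize a rating distribution by its Pearson correlation with
    an ascending ordinal score vector (thresholds are illustrative)."""
    dist = np.asarray(dist, dtype=float)
    if score is None:
        score = np.arange(1, len(dist) + 1)   # e.g. score_R = [1, 2, 3, 4, 5]
    r = np.corrcoef(dist, score)[0, 1]        # Pearson correlation coefficient
    if r >= strong:
        return "Strong Positive"
    if r >= weak:
        return "Positive"
    if r <= -strong:
        return "Strong Negative"
    if r <= -weak:
        return "Negative"
    return "Neutral"

cls1 = rating_class([8, 25, 34, 56, 95])    # skewed toward high ratings
cls2 = rating_class([103, 75, 41, 18, 3])   # skewed toward low ratings
```

Both example distributions correlate almost perfectly (|r| > 0.96) with the ordinal score vector, so under these thresholds they land in the Strong Positive and Strong Negative classes, respectively.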
We can analyze the region embeddings in more detail by visualizing those that are associated with the contextual regions of each candidate context word. As shown in Fig. 18, we selected two candidate context words, ''location'' from TripAdvisor and ''acting'' from Amazon Movies & TV; we sampled 10 contextual regions that contained them from each class (50 in total) and plotted their corresponding region embeddings. Note that words that contribute positive distributions such as ''great'', ''good'', or ''excellent'' are grouped close to each other and are visually separated from negatively distributed words such as ''not'', ''bad'', or ''but''. This again supports our assumption that neighboring words in the same text region as a candidate context word influence the distribution of ratings, and should be considered when extracting contextual information from reviews.
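The visualization pipeline for Figs. 17 and 18 reduces to projecting the sampled region embeddings to two dimensions with t-SNE and coloring points by class. The sketch below substitutes two synthetic clusters for the real learned embeddings, purely to show the scikit-learn call; the cluster construction and dimensions are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy stand-ins for sampled region embeddings: two clusters meant to
# mimic positively and negatively distributed contextual regions.
pos = rng.normal(loc=+1.0, scale=0.3, size=(15, 64))
neg = rng.normal(loc=-1.0, scale=0.3, size=(15, 64))
embs = np.vstack([pos, neg])

# Project the 64-dim embeddings to 2-D for plotting, as in Figs. 17
# and 18 (perplexity must stay below the number of samples).
points = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embs)
```

Each row of `points` would then be scatter-plotted with a color given by its rating-distribution class, e.g. via `rating_class` applied to the region's distribution.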

E. CONTEXTS AS REGIONS
In the previous subsection, we demonstrated that the region embedding representations of contextual regions were effective in capturing their associated rating distributions. In this subsection, we further analyze the merits of defining and extracting a context as a contextual region. First, we show that the rating distributions captured by contextual regions explain the polarity of a review more effectively than those captured by single words. We investigate this assumption by utilizing the embeddings of single candidate context words and the corresponding contextual regions for review sentiment classification. Figure 13 (a) in Section V-B2 showed that utilizing a contextual region of size = 1 (i.e., considering only candidate context words, uninfluenced by neighboring words) produced the least accurate rating prediction among the range of region sizes. This indicates that defining a context as a single word is less effective in modeling a review's rating than defining it as a text region. To support this finding, we aim here to visualize the difference in influence on a review's polarity between considering a context as a single word and considering it as a text region. To demonstrate this, we applied our context extraction method to an example review from the TripAdvisor dataset for the two types of context, as shown in Fig. 19. Figure 19 (a) shows the contexts extracted as single candidate context words, whereas Fig. 19 (b) shows them extracted as contextual regions. The highlight colors reflect the class of the associated rating distribution for each context, as defined in Section V-D.

1) REVIEW POLARITY
Consider first the contexts extracted as single words in Fig. 19 (a). Note that many extracted words in this review, such as ''not'', ''disappointed'', ''crowd'', or ''noise'', were classified as contexts with negative rating distributions. Utilizing those words individually could negatively affect the polarity of the review and consequently result in a low rating prediction score, which does not accord with the actual rating. This implies that extracting contexts as single words might not adequately explain the polarity of a review. This can happen because some single words fail to capture the actual rating distribution patterns of contexts. In fact, their combination with some neighboring words could radically alter their rating distribution pattern and lead to a totally different interpretation for a rating prediction score. For example, ''not disappointed'' is classified as a context with a positive rating distribution even though both ''not'' and ''disappointed'' are individually associated with negative rating distributions.
By extracting contexts as contextual regions, we are able to assign a more appropriate rating distribution class to each context. As illustrated in Fig. 19 (b), some contextual regions such as ''not disappointed'', ''close enough for an easy walk'', and ''but far from crowd and noise'' are no longer assigned with negative rating distributions. Many positive rating distributions could positively affect the polarity of this review and produce a high rating score that is closer to the actual rating score. From this example, we can hypothesize that extracting contexts as contextual regions is more effective in explaining the polarity of a review than extracting them as single words. We can investigate this hypothesis by evaluating the comparative performance using single context words against that using contextual regions for the review sentiment classification task.

2) REVIEW SENTIMENT CLASSIFICATION
Here, we aim to demonstrate that, in addition to their use in making rating predictions, the use of contextual regions is more effective in modeling the polarity of a review than the use of single context words. This demonstration involves the sentiment classification of reviews.
For comparison, we set up two models for review classification, namely a ''single'' and a ''region'' model. For the single model, we utilized the word embeddings of all candidate context words learned by our context extraction method for region size = 1. We then averaged the word embeddings of all candidate context words in each review to create an embedding representation for that review. For the region model, we computed the region embeddings for all corresponding contextual regions, using word embeddings and local context units learned by our context extraction method for region size = 5. We then created a representation of each review by averaging all region embeddings of all contextual regions in that review.
To evaluate the classification accuracy of these two models, we chose logistic regression as a binary classifier of the review representations for both the single and region models. We labeled all reviews from all three datasets (Amazon Software, TripAdvisor, and Amazon Movies & TV) having rating scores of more than 3 as ''positive'' reviews, and those with scores less than or equal to 3 as ''negative'' reviews. The classification results for the single and region models on all datasets are given in Table 10. These results show that the region model achieves a higher classification accuracy and F1 score for all three datasets compared with the single model. This supports our hypothesis that considering contexts as contextual regions is helpful in explaining the polarity of a review.
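The single/region comparison above boils down to mean-pooling context embeddings into a review representation and fitting a logistic regression. The sketch below uses synthetic embeddings in place of the learned word/region embeddings, so the separability (and hence the accuracy) is assumed for illustration; only the pooling-plus-classifier pipeline mirrors the experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def review_representation(context_embs):
    """Average the embeddings of all contexts in a review (the same
    pooling is used for both the 'single' and 'region' variants; only
    the source of the embeddings differs)."""
    return np.mean(context_embs, axis=0)

rng = np.random.default_rng(0)
# Toy reviews: positive reviews get context embeddings centred at +1,
# negative ones at -1, standing in for learned embeddings.
reps, labels = [], []
for _ in range(20):
    reps.append(review_representation(rng.normal(+1.0, 0.5, size=(8, 16))))
    labels.append(1)   # rating > 3  -> ''positive'' review
    reps.append(review_representation(rng.normal(-1.0, 0.5, size=(8, 16))))
    labels.append(0)   # rating <= 3 -> ''negative'' review

clf = LogisticRegression().fit(reps, labels)
train_acc = clf.score(reps, labels)
```

In the actual experiment, the ''single'' model feeds word embeddings (region size = 1) through this pipeline and the ''region'' model feeds region embeddings (region size = 5); Table 10 compares the resulting accuracies and F1 scores.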

VI. CONCLUSION
We have proposed a novel unsupervised method for extracting contexts from reviews, together with a predictive model that utilizes the extracted contexts for rating prediction. Unlike previous context-aware rating mechanisms, a context in our work can be extracted automatically not only in single-word format but also in combination with those neighboring words from the same text region that influence the distributions of ratings. This makes our approach applicable to a wide variety of recommendation domains and suitable for extracting contexts from sets of reviews that may involve a variety of styles. In making rating predictions, our user and item representations are generated dynamically for each specific review through our proposed interaction and attention modules, based on the relevance of the contexts extracted from that review to a user's preferences and an item's features. Utilizing our representations for rating prediction is more accurate than using state-of-the-art deep-learning-based representation techniques that do not properly consider the relevance of a context. This point is strongly supported by our experimental results, in which our approach exhibited the best or near-best performance among many state-of-the-art competitors.
In future work, we aim to develop a unified model that combines context extraction and rating prediction into a single step, which we believe will be more efficient for learning the user and item representations and thereby producing more accurate rating predictions.