Injecting User Identity Into Pretrained Language Models for Document-Level Sentiment Classification

This paper studies the combination of pre-trained language models and user identity information for document-level sentiment classification. In recent years, pre-trained language models (PLMs) such as BERT have achieved state-of-the-art results on many NLP applications, including document-level sentiment classification. On the other hand, a line of work introduces additional information such as user identity for better text modeling. However, most of these works inject user identity into traditional models, while few studies have examined the combination of pre-trained language models and user identity for even better performance. To address this issue, we propose to unite user identity and PLMs and formulate User-enhanced Pre-trained Language Models (U-PLMs). Specifically, we demonstrate two simple yet effective attempts, i.e. embedding-based and attention-based personalization, which inject user identity into different parts of a pre-trained language model and provide personalization from different perspectives. Experiments on three datasets with two backbone PLMs show that our proposed methods outperform the best state-of-the-art baseline method with an absolute improvement of up to 3%, 2.8%, and 2.2% in accuracy. In addition, our methods encode user identity with plugin modules, which are fully compatible with most auto-encoding pre-trained language models.


I. INTRODUCTION
Document-level sentiment classification, as a common subtask of sentiment analysis and text classification, aims to extract the sentiment polarity in a piece of text document (e.g. reviews and tweets) [1]. Representative works of document-level sentiment classification are mainly based on neural networks [2]- [4]. Based on the bag-of-words assumption, typical models usually consist of the following steps. Firstly, they learn word embeddings to represent words as vectors. Secondly, they use CNNs or RNNs to capture dependencies between words. Then they apply attention mechanisms or simply take the vector sum of the word representations to represent the document, which is finally passed to a classifier. More recently, pre-trained language models (PLMs) such as BERT [8] have achieved state-of-the-art results on many NLP tasks, such as question answering, language inference, and text classification.
However, few studies have applied PLMs to the task of personalized sentiment classification, where, apart from the text itself, context information such as the identities of the user (who is writing it) and the item (what it is describing) is also available and can help generate a better representation of the text. Most methods for this task are traditional CNN/RNN-based methods [11]- [16], whereas CNNs and RNNs both have their weaknesses. Therefore, it is a natural thought to ''upgrade'' the backbone models from CNN/RNN to the stronger PLMs. A recent work [17] tried to incorporate context information into PLMs by using historical reviews of the user/item to help the prediction, but the target review text is still modeled without considering the user or item in their work.
To this end, we propose User-enhanced Pre-trained Language Models (U-PLMs) to take advantage of both PLMs and user data in the process of text modeling. To be specific, we propose two schemes of personalization, which inject user identities into different modules of a pre-trained language model.
The embedding-based personalization takes advantage of the design of PLMs [8]- [10] that they can accept the sum of multiple embeddings as input. We associate the user id of the text with an embedding vector, and then add it to all words at the input. With this document-level global bias, PLMs can encode correlations between words in a personalized way and output representations closer to users' actual intents.
The attention-based personalization is inspired by a common pattern of some traditional methods [14]- [16]. These methods are based on hierarchical RNN/CNNs, which construct a sentence with words in the lower level and then construct a document with sentences in the higher level. To incorporate context information, these methods add embedding vectors of users and items in attention functions of both levels to help select important words in a sentence and important sentences in a document. To transfer this idea into PLMs, we introduce user embeddings in self-attention modules in PLMs. For each text, we add the embedding of its user as a bias of the [CLS] token's query in self-attention, where [CLS] is a special token used to represent the whole text. With these enhanced self-attention modules, PLMs are capable of selecting user-specific important words in each layer and generating more accurate representations of the text.
Our main contributions are as follows:
1) We propose to apply PLMs in the task of personalized sentiment classification for better text representation, making use of both PLMs' modeling ability and user identity information in the process of text modeling.
2) We demonstrate two different attempts to inject user identities into different modules of a pre-trained language model, both of which can effectively improve the performance of sentiment classification.
3) Both of our attempts serve as plugins and are fully compatible with most auto-encoding PLMs such as BERT and RoBERTa.
4) Experiments on three datasets with two backbone PLMs show that our framework obtains remarkable improvements over vanilla PLMs and state-of-the-art models for personalized sentiment classification.

II. RELATED WORK
A. DOCUMENT-LEVEL SENTIMENT CLASSIFICATION
Document-level sentiment classification is a task aiming at inferring the overall sentiment polarity of a given document, which is usually a review that describes how the author thinks of a particular item (movie, goods, restaurant, etc.) or a tweet that expresses the author's mood or opinion at some moment. Specifically, given a document consisting of multiple sentences, the problem is to determine whether this document conveys a positive or negative opinion, or even how strong the positive or negative sentiment is. In the first case, the problem is defined as a binary classification task, where 0 represents negative and 1 stands for positive. In the latter case, taking scores from 1 to 5 stars as an example, 1 and 5 correspond to strongly negative and strongly positive opinions respectively, and 3 means a neutral sentiment. The problem is then usually defined as a 5-class classification task.
Based on the bag-of-words assumption, a common structure within typical methods for document-level sentiment classification usually consists of four parts: (1) a word embedding layer to represent words, (2) CNNs or RNNs to capture word dependencies, (3) attention or simple pooling functions to gather word-level information and represent the document, and (4) a classifier for sentiment classification.
Quite a few methods have been proposed to improve this structure. Xu et al. propose CLSTM [18], i.e. an LSTM [19] with a cache mechanism, to capture the overall semantic information in long texts. Reference [3] views a document at three levels: word, sentence, and document. It applies CNN/LSTM to word embeddings within a sentence to construct the representation of this specific sentence, and then similarly represents the document with multiple sentences. Based on this hierarchical structure, reference [14] further discovers the different importance of words (sentences) within a sentence (document) and introduces an attention mechanism into the procedure of sentence (document) composition with words (sentences).
However, these methods are all based on static word embeddings such as word2vec [5] and GloVe [6], which have limited expressive power.
One issue of these embedding approaches is that they only learn a single vector for a word and cannot handle cases where one word has different meanings in different contexts (such as the word ''chair'' with meanings of ''seat'' or ''leader''). In contrast, pre-trained language models, which are pre-trained on a large corpus of unlabeled text, can generate context-aware word representations in the phase of fine-tuning for specific tasks. Therefore, there is a trend towards the adoption of pre-trained language models in place of traditional methods for many NLP tasks including document-level sentiment classification.
Another problem of these static word embeddings is that they learn representations of words according to contexts but ignore their sentiment polarities, which is problematic when applied to sentiment classification. For example, words ''good'' and ''bad'' with similar contexts are mapped into close word vectors in the embedding space, but they have opposite sentiment polarities. To solve this problem, there are some works [20]- [22] focusing on injecting sentiment knowledge into word embeddings for better performance for the task of sentiment classification.

B. PERSONALIZED SENTIMENT CLASSIFICATION
The task of personalized sentiment classification is a subtask of document-level sentiment classification which assumes that user and item ids of the text are also available.
For better performance of document-level sentiment classification, some researchers have conducted studies to introduce user representations and item representations in the analysis of review texts [13]- [16]. The intuition is that these representations can provide global information such as rating and language preferences of users and overall ratings of items [23]. Taking this information into account can help provide better text representations.
Chen et al. [14] make improvements over the hierarchical LSTM structure and introduce user and item embeddings into the word-sentence and sentence-document attention functions. Wu et al. [15] propose to model the same piece of text from the user's and item's perspective, respectively, and then combine the document representations from these two views together for prediction. Based on that, Yuan et al. [16] introduce the memory network [24] for users and items to alleviate the cold-start problem by modeling the inherent correlation between users or items. Specifically, they store representations of representative users or items in memory slots and then use them to infer representations for cold-start users or items. Different from all these methods, Amplayo [13] attempts to represent users and items with the proposed ''chunk-wise'' weight matrices instead of bias vectors and injects them into four locations (i.e. embedding, encoding, attention, classifier) in a model.
Note that there are some works for rating prediction [25], [26] which also take user, item, and text as input and predict the rating scores. However, they assume that the target review text is unknown when testing, which is different from our task. Therefore, we are not considering them for comparison.

C. PRE-TRAINED LANGUAGE MODELS
In recent years, a large number of Pre-trained Language Models (PLMs) based on Transformers [7] have emerged. These models are firstly pre-trained on a large-scale text corpus which is unlabeled and therefore easy to acquire. In this pre-training step, the models can learn universal language representations [27]. After pre-training, a model can then be easily applied to a specific task and fine-tuned together with randomly initialized task-specific parameters using a small learning rate. Since the model has been pre-trained on lots of data before, it can avoid overfitting the small amount of labeled data for specific tasks. Instead, the model converges fast and usually outperforms traditional methods which are only trained on the task-specific labeled data.
Since the Transformer [7] consists of two parts: an encoder and a decoder, there are correspondingly two types of pre-trained language models.
The first type are the autoregressive models. These models predict a word based on the words which precede or succeed it, which is similar to the traditional statistical language models. Therefore, methods of this type are usually used for generative tasks. Representative works are the GPT series [28]- [30]. GPT [28] proposes the stages of generative pre-training and discriminative fine-tuning and is the first work to make use of transformer structure for text modeling. GPT-2 [29] formulates the supervised tasks as unsupervised problems and demonstrates their model's ability to perform a wide range of tasks in a zero-shot setting. GPT-3 [30] removes the stage of fine-tuning. It takes the idea similar to MAML [31], uses two nested structures called ''inner-loop'' and ''outer-loop'' in pre-training, and learns a good initialization point for the model. Starting from this initialization point, when faced with any specific task, the model quickly fits in with the task and converges within only a few samples.
The second type, the autoencoder models, encode token correlations in a bi-directional manner. This bi-directional encoding makes them unsuitable for generative tasks but well suited to discriminative tasks such as text classification, sequence labeling, etc. Representative works mainly consist of BERT [8] and its variants [9], [10], [32], [33]. BERT introduces two tasks in the pre-training stage, i.e. Masked LM (MLM) and Next Sentence Prediction (NSP). RoBERTa [9] improves BERT by performing dynamic masking of tokens. ALBERT [10] uses techniques of factorized embedding parameterization and cross-layer parameter sharing to reduce the parameters of BERT. SpanBERT [32] masks contiguous words instead of single tokens and introduces the span boundary objective to extend BERT's sense of text spans. Similarly, ERNIE [33] proposed by Baidu integrates knowledge into BERT with entity-level and phrase-level masking strategies.
Since we mainly focus on methods for the task of sentiment classification, we are not going to mention the decoder-based methods in later sections. Therefore, we refer to the encoder-based methods (BERT, RoBERTa, etc.) as pre-trained language models or PLMs in this paper for convenience.

III. METHODOLOGY
In this section, we first define the problem of personalized sentiment classification. Then we introduce our proposed U-PLMs, which are divided into three modules. Finally, we illustrate the two-stage training procedure of our framework. Fig. 1 shows an overview of our framework. As shown in the figure, user identity is used in two modules in our work. The embedding-based personalization is introduced in the embedding module (section III-C), and the attention-based personalization is introduced in the self-attention module in encoder blocks (section III-D). We use BERT as the backbone model for illustration in this section.

A. PROBLEM DEFINITION
Given a piece of text y written by a user u for an item v, the goal of personalized sentiment classification is to predict the sentiment category r (e.g. 1-5 stars) of the text, which represents the user's opinion on the item.
Note that we're using text y and user u, but not item v, in our framework. This is because we found that user information is much more important and effective for personalization than item information in our experiments, which is consistent with the observation of [11].

B. MODEL INPUT
At the very first step, a piece of natural language text is tokenized into a sequence of tokens/subwords by a subword algorithm (e.g. WordPiece [34] for BERT). The sequence is represented as:
$$X = \{x_0, x_1, \ldots, x_{N-1}\}, \quad x_0 = [\mathrm{CLS}] \tag{1}$$
where $N$ is the sequence length. Note that a special token [CLS] is prepended to the sequence. This token has no actual meaning itself but is designed to provide sentence-level information in BERT. Its output is often used for sentence-level tasks, including text classification. In our work, we apply BERT to the task of document-level sentiment classification by regarding a document as a long sentence.
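As an illustration only, the input preparation can be sketched as follows. The whitespace tokenizer, the helper name `build_input`, and the `[PAD]` handling are assumptions made for this sketch; they stand in for a real subword algorithm such as WordPiece and are not details of the actual system.

```python
def build_input(text, max_len=8, pad_token="[PAD]"):
    """Toy stand-in for subword tokenization (hypothetical helper).

    Splits on whitespace, puts [CLS] at position 0, then clips or pads
    the sequence so that it has exactly max_len tokens.
    """
    tokens = ["[CLS]"] + text.lower().split()
    tokens = tokens[:max_len]                          # clip long documents
    tokens += [pad_token] * (max_len - len(tokens))    # pad short documents
    return tokens


example = build_input("Great movie", max_len=4)
```

A real tokenizer would further split rare words into subword units; only the position of [CLS] and the fixed sequence length matter for the rest of the pipeline.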

C. EMBEDDING MODULE
In this module, BERT accepts the sequence in (1) as input and converts it to a sequence of hidden states:
$$H^0 = \{h^0_0, h^0_1, \ldots, h^0_{N-1}\} \tag{2}$$
Each hidden state is computed by summing token embeddings, position embeddings, and segment embeddings:
$$h^0_i = e^{\mathrm{token}}_{x_i} + e^{\mathrm{position}}_i + e^{\mathrm{segment}}_i \tag{3}$$
where $x_i$ denotes the token at position $i$. Segment embeddings are used to distinguish tokens in different sentences for sentence-pair tasks such as question answering, but they are not important in single-sentence tasks including ours. For this reason, and for lack of space, we omit them from Fig. 1.

1) USER INJECTION
To inject user information into the embedding module, we take advantage of the design of PLMs that the sum of multiple embeddings is used as the representation for a token. Specifically, we introduce a new parameter matrix of user embeddings and add the user embedding vector for each text to all tokens as a document-level global bias. That is, the equation in (3) becomes:
$$h^0_i = e^{\mathrm{token}}_{x_i} + e^{\mathrm{position}}_i + e^{\mathrm{segment}}_i + e^{\mathrm{user}}_u \tag{4}$$
where $e^{\mathrm{user}}_u$ represents the embedding vector for user $u$ who wrote the review text.
Note that although both the user embedding and the segment embedding are identically added to all tokens, their importance differs. In our task, the segment embeddings are all the same across all tokens in a document and also across all documents over the whole dataset, so they provide no useful information. User embeddings, however, differ from document to document, since the documents are written by different users (although some documents share the same user). This difference helps the PLMs encode token correlations in a personalized way and output more accurate representations.
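The embedding-based personalization can be sketched in a few lines. This is a minimal illustration under stated assumptions: the table sizes, the helper names `init_table` and `embed`, and the toy token ids are hypothetical, and the segment embedding is left out because it is constant in single-sentence tasks.

```python
import random


def init_table(rows, dim, std, rng):
    """Embedding table with entries drawn from N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, user_id, tok_emb, pos_emb, user_emb):
    """One hidden vector per position: token + position + user embedding."""
    u = user_emb[user_id]  # the same vector is added to every token
    return [
        [t + p + b for t, p, b in zip(tok_emb[tok], pos_emb[i], u)]
        for i, tok in enumerate(token_ids)
    ]


rng = random.Random(0)
d = 4                                     # toy hidden size
tok_emb = init_table(10, d, 0.02, rng)    # toy vocabulary of 10 tokens
pos_emb = init_table(6, d, 0.02, rng)     # up to 6 positions
user_emb = init_table(3, d, 0.005, rng)   # 3 users, smaller init range

# The same text written by two different users gets shifted representations.
h_user0 = embed([1, 5, 7], 0, tok_emb, pos_emb, user_emb)
h_user1 = embed([1, 5, 7], 1, tok_emb, pos_emb, user_emb)
```

Because the user vector is a global shift of all token inputs, two users writing identical text produce inputs that differ by exactly the difference of their user embeddings.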

D. ENCODER MODULE
In this module, BERT uses the structure of transformer encoder, which is composed of L identical layers of encoder blocks, as shown in Fig. 1. Each encoder block has two parts: a multi-head self-attention mechanism and a fully connected feed-forward network. Each part is surrounded by a residual connection [35] and followed by a layer normalization [36] component.
Each block accepts a sequence as input, updates the representations with these two parts, and outputs a sequence with the same shape, which is passed into the next block as input. The input of the first block is the output of the embedding module. That is:
$$H^l = \mathrm{EncoderBlock}^l(H^{l-1}), \quad l = 1, \ldots, L \tag{5}$$
where $H^l \in \mathbb{R}^{N \times d}$, $H^0$ is the output of the embedding module, $N$ is the sequence length and $d$ is the dimension of the hidden states.

1) MULTI-HEAD SELF-ATTENTION
The multi-head self-attention module is applied to pass context information between tokens and update token representations iteratively. Formally, considering the self-attention module in layer $l$, this process can be explained in the following steps:
(1) Apply three MLP layers to all tokens in $H^{l-1}$ and reshape them to get query $Q^{l,a}$, key $K^{l,a}$ and value $V^{l,a}$ for each attention head $a$. All of them are in the shape of $\mathbb{R}^{N \times A \times d_a}$, where $N$ is the sequence length, $A$ is the number of heads and $d_a$ is the dimension per head.
(2) Calculate the similarity between all token pairs by calculating the dot product of $Q$ and $K$:
$$S^{l,a} = \mathrm{softmax}\left(\frac{Q^{l,a} (K^{l,a})^{\top}}{\sqrt{d_a}}\right) \tag{6}$$
where $S^{l,a} \in \mathbb{R}^{N \times N}$ represents the similarity matrix, which is normalized along the dimension of key. Each element $S(i, j)$ represents the importance of the $j$th token (among all tokens) to the $i$th token.
(3) Using the similarity matrix as attention weights, calculate the weighted sum of the whole value sequence for each query token separately:
$$O^{l,a} = S^{l,a} V^{l,a} \tag{7}$$
which outputs $O^{l,a} \in \mathbb{R}^{N \times A \times d_a}$.
(4) Finally, for each token $x$, aggregate the updated representations of all heads together:
$$\mathrm{MultiHead}^l(x) = \mathrm{Concat}(o^{l,1}_x, \ldots, o^{l,A}_x)\, W^{l,O} \tag{8}$$
where $o^{l,a}_x$ is the output of head $a$ for token $x$ and $W^{l,O} \in \mathbb{R}^{d \times d}$ is the output projection of the standard Transformer [7].
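Steps (2) and (3) can be made concrete for a single head with a small, dependency-free sketch. The function names and the toy inputs are assumptions of this illustration; the scaling by the square root of the head dimension follows the standard Transformer [7].

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(Q, K, V):
    """Single-head attention. Q, K, V are lists of N vectors of size d_a."""
    scale = math.sqrt(len(Q[0]))
    # S[i][j]: importance of token j to token i, normalized over keys j.
    S = [
        softmax([sum(q * k for q, k in zip(qi, kj)) / scale for kj in K])
        for qi in Q
    ]
    # O[i]: weighted sum of all value vectors, using row i of S as weights.
    O = [
        [sum(S[i][j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        for i in range(len(Q))
    ]
    return S, O


eye = [[1.0, 0.0], [0.0, 1.0]]
S, O = attention_head(eye, eye, eye)
```

With identity-like queries and keys, each token attends most strongly to itself, and every row of the weight matrix sums to one.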

2) FEED FORWARD NETWORK
The feed-forward network is designed to further increase the model capacity. It is composed of two linear transformations with an activation function between them:
$$\mathrm{FFN}^l(x) = f(x W^{l,1})\, W^{l,2} \tag{9}$$
where $x \in \mathbb{R}^{d}$, $W^{l,1} \in \mathbb{R}^{d \times 4d}$ and $W^{l,2} \in \mathbb{R}^{4d \times d}$ (bias terms are omitted). $f$ is the activation function, i.e. GeLU [37] in BERT. This function is applied in layer $l$ to all tokens identically.
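A minimal sketch of this position-wise network is given below, assuming the exact (erf-based) form of GeLU and omitting bias terms for brevity. The names `gelu` and `feed_forward` and the toy weights are illustrative only.

```python
import math


def gelu(x):
    """Exact GeLU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def feed_forward(x, W1, W2):
    """x: vector of size d; W1: d x 4d matrix; W2: 4d x d matrix."""
    hidden = [
        gelu(sum(x[i] * W1[i][j] for i in range(len(x))))
        for j in range(len(W1[0]))
    ]
    return [
        sum(hidden[j] * W2[j][k] for j in range(len(hidden)))
        for k in range(len(W2[0]))
    ]


# Toy case with d = 1, so the inner dimension is 4.
W1 = [[1.0, 0.0, 0.0, 0.0]]
W2 = [[1.0], [0.0], [0.0], [0.0]]
out_zero = feed_forward([0.0], W1, W2)
out_large = feed_forward([10.0], W1, W2)
```

For large positive inputs GeLU behaves almost like the identity, which the toy case with input 10 illustrates.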

3) USER INJECTION
We now propose to inject user information into the encoder module, or its ''multi-head self-attention'' part to be more specific. This design is inspired by the traditional methods for personalized sentiment classification. These methods are mostly based on hierarchical RNN/CNNs with word-level and sentence-level attention functions. To inject additional context information, they incorporate user embeddings and item embeddings as global biases in these attention functions. Interestingly, we found that there is also a similar pattern in BERT: since the [CLS] token is designed to represent the whole document, the attention with [CLS] as query and all tokens as key aims at gathering information of important tokens in the document.
Based on this observation, we propose to add a user vector as a bias to the query vector of [CLS]. Then the similarity calculation in self-attention for head $a$ in layer $l$, i.e. the equation shown in (6), is now formulated as:
$$S^{l,a} = \mathrm{softmax}\left(\frac{(Q^{l,a} + U^{l,a}) (K^{l,a})^{\top}}{\sqrt{d_a}}\right) \tag{10}$$
where both $Q^{l,a}$ and $U^{l,a}$ are in the shape of $\mathbb{R}^{N \times A \times d_a}$. For $U^{l,a} \in \mathbb{R}^{N \times A \times d_a}$, we only initialize the part which is added to the query of [CLS] (i.e. $U^{l,a}[0, :, :]$) with random values, and fill all others with 0.
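The effect of the user bias on the attention weights can be illustrated for one head as follows. This is a sketch under assumptions: `personalized_scores` is a hypothetical helper, the toy vectors are arbitrary, and the layer normalization applied to the user bias in our experiments is left out for brevity.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def personalized_scores(Q, K, user_bias):
    """Attention weights for one head with a user bias on the [CLS] query.

    Q, K: lists of N vectors of size d_a; row 0 of Q belongs to [CLS].
    user_bias: vector of size d_a added to the [CLS] query only.
    """
    d_a = len(Q[0])
    # U has the same shape as Q; every row except row 0 ([CLS]) is zero.
    U = [user_bias if i == 0 else [0.0] * d_a for i in range(len(Q))]
    S = []
    for qi, ui in zip(Q, U):
        q = [a + b for a, b in zip(qi, ui)]
        S.append(softmax(
            [sum(x * y for x, y in zip(q, kj)) / math.sqrt(d_a) for kj in K]
        ))
    return S


Q = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base = personalized_scores(Q, K, [0.0, 0.0])   # no user information
pers = personalized_scores(Q, K, [2.0, 0.0])   # user prefers dimension 0
```

Only the [CLS] row of the attention matrix changes; the rows of all other tokens are identical to those of the vanilla model.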

E. OUTPUT MODULE
After the stacked encoder blocks, BERT outputs a sequence of hidden states $H^L$ which is then used for specific tasks. The task of sentiment classification is defined as a sentence-level classification problem, which uses the output of the [CLS] token for prediction. Specifically, it passes the output of [CLS] through an MLP layer to get $h \in \mathbb{R}^{d}$, which represents the whole text, and then uses a classifier with softmax for classification:
$$p = \mathrm{softmax}(W h + b) \tag{11}$$
where $p \in \mathbb{R}^{|C|}$ is the vector of class probabilities, in which $|C|$ is the number of classes, $W \in \mathbb{R}^{|C| \times d}$ and $b \in \mathbb{R}^{|C|}$.
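The final classification step amounts to a linear layer followed by softmax. A minimal sketch with hypothetical toy weights (three classes, hidden size two) is:

```python
import math


def classify(h, W, b):
    """Class probabilities softmax(W h + b); W has |C| rows of size d."""
    logits = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    m = max(logits)                        # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


p = classify([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], [0.0, 0.0, 0.0])
predicted_class = p.index(max(p))
```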

F. TWO-STAGE TRAINING PROCEDURE
After we load the pre-trained BERT, we train it in two stages successively:
1) In the first stage, we further pre-train the model with the Masked LM (MLM) task in the same way as [8], using our train set. This is intended for our framework to learn domain-specific knowledge in the dataset.
2) In the second stage, we introduce user-specific parameters by following one of the two personalization strategies and fine-tune the whole model for sentiment classification with cross-entropy loss.
Note that in our training procedure, user identity is only used in the second stage because we expect our framework to first learn general knowledge of the words in the domain in the first stage, and then ''fine-tune'' their representations with user-specific parameters for the task together.
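To make the first stage more concrete, the construction of MLM training examples can be sketched as below. This is a simplified illustration, not the full procedure of [8]: every selected token is replaced by [MASK], whereas BERT also keeps or randomly replaces a fraction of the selected tokens. The helper name `mask_tokens` and the default masking probability are assumptions of this sketch.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for a simplified masked-LM objective.

    labels[i] holds the original token where a prediction is required
    and None elsewhere. The [CLS] token is never masked.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if tok != "[CLS]" and rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


tokens = ["[CLS]", "the", "food", "was", "great"]
inputs, labels = mask_tokens(tokens, rng=random.Random(0))
```

No user information enters this stage; the user-specific parameters are only created when fine-tuning starts.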

IV. EXPERIMENTS
We conduct various experiments to evaluate our proposed U-PLMs in this section.

A. EXPERIMENTS SETTINGS
Following prior works, we conduct experiments on three sentiment classification datasets which contain user and item IDs in addition to review texts. These datasets are collected from the IMDB and Yelp websites and split into train set, dev set, and test set with a ratio of 8:1:1. Statistical details of the datasets are given in Table 1. We use Accuracy on the dev set to select the best model, and use Accuracy and RMSE on the test set for evaluation.
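For reference, the two evaluation metrics can be computed as in the sketch below, where ratings are treated as integers so that RMSE measures how far a wrong prediction is from the gold rating. The function names and the toy labels are illustrative.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class equals the gold class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between gold and predicted ratings."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


gold = [1, 2, 3, 5]
pred = [1, 2, 4, 3]
acc = accuracy(gold, pred)   # two of four predictions are correct
err = rmse(gold, pred)       # squared errors are 0, 0, 1, 4
```

Accuracy treats all mistakes equally, while RMSE penalizes a prediction of 3 for a gold rating of 5 more than a prediction of 4 for a gold rating of 3.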
For experiments of vanilla BERT and U-PLMs, we pad or clip the text to a maximum length of 500. Following [38], we load the BERT model from BERT-base which contains L = 12 encoder blocks. We run the in-domain pre-training with a learning rate of 5e-5 for 100,000 steps. As for fine-tuning, we set the learning rate to 2e-5 and the batch size to 30 to fully leverage the GPU memory. We empirically set the max epoch number to 3. We optimize our model with AdamW [39] and use slanted triangular learning rates [40] with a warmup ratio of 0.1 for both in-domain pre-training and fine-tuning.
As for the user-specific parameters in U-PLMs, the user embedding in embedding-based personalization is initialized with a normal distribution $\mathcal{N}(0, 0.005^2)$ for better performance. The ones in self-attention modules are initialized with $\mathcal{N}(0, 0.02^2)$ and we add a layer normalization [36] component in each module so that the user bias falls into a similar range with the query representation of [CLS].
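The shape of such a learning rate schedule can be sketched as follows. This is a simplified version that assumes a linear warmup to the peak rate followed by a linear decay to zero; the original slanted triangular schedule of [40] has additional shape parameters that are not modeled here, and the function name `lr_at` is hypothetical.

```python
def lr_at(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then linear decay."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Fine-tuning example: peak rate 2e-5, 100 steps, first 10 steps are warmup.
schedule = [lr_at(s, 100, 2e-5) for s in range(101)]
```

The rate peaks exactly at the end of the warmup phase and reaches zero at the final step.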
To show the robustness of our framework, we also use RoBERTa (loaded from roberta-base) as our backbone model, using the same hyper-parameters with BERT. We run our models 10 times and report the average results. All experiments are conducted on an RTX 3090 GPU.

B. BASELINES
We compare our methods with the following baseline methods:
• NSC [14] uses hierarchical LSTM and attention mechanism to encode the review text.
• BERT [8] corresponds to the vanilla BERT model. It is trained with the same setting as our method except that no user information is used.
• NSC + UPA [14] is the enhanced NSC model that incorporates user and item identities into the attention mechanism.
• HUAPA [15] uses two separate networks with the same structure to model the text from the view of user and item respectively and combines them for final prediction.
• CHIM [13] studies how and where to incorporate user and item information in a sentiment classification model.
• RRP-UPM [16] is based on HUAPA and considers the inherent correlation between users or items for better representation.
• IUPC [17] is the first method to apply BERT to personalized sentiment classification. It uses historical reviews to represent user and item and combines these two representations with the target review for better prediction.
Since the embedding-based and attention-based personalization are independent of each other, we employ them separately in our model when comparing with baselines to study their effects respectively. The combination of them is discussed in section IV-D.

C. MODEL COMPARISONS
We implement and train BERT, RoBERTa, and our framework, respectively, while using the results of other baselines reported in their papers.
The results are listed in Table 2. The methods are divided into three different groups according to the additional information they use apart from the text: (1) models considering no user or item identity; (2) models considering both user and item identities; and (3) models considering user identities only.
From these results, we can make the following observations: Firstly, vanilla BERT and RoBERTa perform much better than NSC. This proves the effectiveness of PLMs for document-level sentiment classification.
Secondly, traditional models are improved after using user and item identities. For example, NSC + UPA gains great improvements over NSC. RRP-UPM achieves competitive results with BERT on accuracy. These results show that additional context information truly brings help to the task.
Thirdly, IUPC outperforms other baselines on most metrics. This shows that it's feasible and worthwhile to combine PLMs and context information for better performance.
Finally, our methods outperform all baselines including IUPC on all metrics. This proves that our ways of injecting user identity help both text modeling and final prediction.
It's worth mentioning that we've also tried to incorporate item identity as additional information. Unfortunately, we didn't find an obvious improvement in performance, which is consistent with the observation of [11] that user information is much more effective than item information.

D. ABLATION STUDY
We conduct an ablation study to evaluate the effect of each component. Specifically, we use vanilla BERT and RoBERTa as base models and then add different combinations of three components, i.e. the in-domain pre-training stage, embedding-based personalization, and attention-based personalization, to construct different alternatives of our framework. Results are shown in Table 3.
From the results, we can observe the positive effects of the in-domain pre-training and both of our user-related components. Firstly, by training the model on our task-specific training data with MLM before fine-tuning, the model fits the dataset better. Building on this, the injection of user information brings a further remarkable improvement. Both the embedding-based and attention-based personalization improve the performance, and among them, the embedding-based one is slightly better. Finally, since the two strategies are independent of each other, we've also tried to combine them in a single model. However, no further improvement is achieved.

E. ANALYSIS FOR EMBEDDING-BASED PERSONALIZATION
As a part of the summation in the embedding module (as shown in (4)), the value range of the user embeddings matters a lot. If it's too large compared to other embedding values, it might be dominant in the token representation and therefore eliminate the difference between tokens in a text. On the other hand, if it's too small, the injected user information may fail to affect the token representation. Therefore, it's necessary to choose an appropriate initialization range for the user embeddings.

TABLE 3. Ablation study for three components. PT means the training stage of in-domain pre-training. Emb-U means embedding-based personalization. Att-U means attention-based personalization.
VOLUME 10, 2022

The results on IMDB and Yelp13 with RoBERTa as backbone are shown in Fig. 2. Although our model outperforms the vanilla PLM consistently under different initializations, it is observed that a std which is the same as or slightly smaller than that of the original embeddings (i.e. 0.02) is appropriate. When the std is larger than 0.02, the performance of our method drops significantly. We can also find a downward trend when the std decreases from 0.005 to 0.001.

F. ANALYSIS FOR ATTENTION-BASED PERSONALIZATION
In the proposed U-PLMs (Att), we inject user vectors into the self-attention modules of all encoder blocks, serving as biases to queries of [CLS] tokens. In this section, we make further analysis and explore the different choices of (1) encoder layers and (2) tokens to be biased.

1) IMPACT OF PERSONALIZATION LAYERS
Instead of injecting user vectors into all encoder layers, we only choose some of them for injection in this experiment. Specifically, we divide all 12 layers into four parts, each consisting of three consecutive layers. Then we inject user vectors into only one part to affect this part directly and later parts indirectly. For example, if user vectors are injected into layers 7-9, the text is first modeled the same as in vanilla PLMs in layers 1-6. Then the queries of [CLS] in self-attention modules of layers 7-9 are biased with the user. Finally, user information is passed to later layers implicitly in the biased [CLS] representations.
Results are shown in Table 4. Overall, comparisons on three datasets show that injecting user vectors into all encoder layers is better than all the alternatives. This is because the introduced user parameters differ from layer to layer. Injection into all layers enables the PLMs to encode text more flexibly in each layer. However, some of the alternatives such as ''10-12'' can achieve similar results to ''All''. This means that injection into three layers is enough for the exploitation of user information. There is also a trend that the model tends to perform better when user vectors are injected into later layers, especially for the IMDB dataset which is more complicated. These observations can help us with the improvement of attention-based personalization in future work.
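The choice of injection layers in this experiment can be expressed as a small configuration helper. The function name `injection_layers` and its arguments are hypothetical and only illustrate how the four parts map to layer indices.

```python
def injection_layers(part, num_layers=12, part_size=3):
    """1-based indices of the encoder layers that receive the user bias.

    part=None selects all layers; part=0..3 selects one block of three
    consecutive layers (0 -> layers 1-3, ..., 3 -> layers 10-12).
    """
    if part is None:
        return list(range(1, num_layers + 1))
    start = part * part_size + 1
    return list(range(start, start + part_size))


layers_7_to_9 = injection_layers(2)
all_layers = injection_layers(None)
```

During the forward pass, a layer would add the user bias to the [CLS] query only if its index is in the returned list; all other layers behave as in the vanilla PLM.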

2) IMPACT OF PERSONALIZATION TOKENS
In our proposed strategy, we add user vectors to the query of only the [CLS] token. The intuition behind this design is that the [CLS] token is used to represent the whole document in BERT or RoBERTa, and adding bias to its query in self-attention has a similar pattern with the personalized attention mechanism in traditional methods.
As an alternative, we explore this strategy further by adding user vectors to all tokens, instead of the [CLS] token only. Results in Table 5 show that the performance of our model drops when we add biases to not only [CLS] but also all other tokens. The reason might be the different roles of [CLS] and normal tokens as queries in self-attention. The [CLS] token has no actual meaning itself, and adding a bias to its query helps select important tokens within the document for the user. In contrast, relations between normal tokens do not differ much across users, so injection into these tokens is not necessary. Nevertheless, this alternative still outperforms the vanilla model, which shows the importance of user information.

G. ARE ALL USERS FULLY TRAINED?
To study whether users with different numbers of training samples are all well-trained, we compare our framework including U-PLMs (Emb) and U-PLMs (Att) with vanilla PLMs for different groups of users.
We first calculate the metrics for each user based on his/her corresponding test samples. Then we divide all users into several groups according to their numbers of training samples. The metric for a group is obtained by averaging over metrics of all users in this group, and we conduct the comparison between three models for each group separately.
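The grouping procedure can be sketched as follows. This is a hedged illustration only: it shows accuracy as the per-user metric, and the bin edges `(20, 40, 60)` as well as the function names are assumptions made for this example rather than the exact groups reported in Table 6.

```python
from collections import defaultdict


def per_user_accuracy(test_records):
    """test_records: iterable of (user_id, gold_label, predicted_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for user, gold, pred in test_records:
        total[user] += 1
        correct[user] += int(gold == pred)
    return {user: correct[user] / total[user] for user in total}


def group_of(n_train, edges=(20, 40, 60)):
    """Map a training-sample count to a group label (edges are hypothetical)."""
    lower = 0
    for edge in edges:
        if n_train < edge:
            return f"[{lower}, {edge})"
        lower = edge
    return f">= {lower}"


def grouped_accuracy(test_records, train_counts, edges=(20, 40, 60)):
    """Average the per-user accuracies within each training-size group."""
    buckets = defaultdict(list)
    for user, acc in per_user_accuracy(test_records).items():
        buckets[group_of(train_counts.get(user, 0), edges)].append(acc)
    return {group: sum(accs) / len(accs) for group, accs in buckets.items()}
```

Because the group metric is an average over users rather than over samples, a user with few test samples contributes as much to the group score as a user with many.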
To save space, we only report the results using RoBERTa as the backbone model in this experiment, as shown in Table 6. We can see that both of our methods obtain consistent improvements for users in all groups. This means that our framework, both embedding-based and attention-based, works well for all users, whether they have many or few training samples.

H. COMPARISON BETWEEN EMB-BASED AND ATT-BASED PERSONALIZATION
In this section, we revisit the experimental results and make further observations on the difference between our two methods, i.e. U-PLMs (Emb) and U-PLMs (Att).
It can be observed from Table 2 that U-PLMs (Emb) performs slightly better than U-PLMs (Att) for both backbone models, except for the IMDB dataset.
We think the reason lies in the interplay between the complexity of the user-specific parameters, the difficulty of the task, and the amount of introduced user information. Specifically, the only additional parameter introduced in U-PLMs (Emb) is the user embedding matrix in the embedding module. On the other hand, U-PLMs (Att) introduces a new embedding matrix for each encoder block, which is much more complicated. Since, for the sake of user privacy, we use only the user ID as additional context information, the embedding-based strategy might be enough to model this information.
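A rough parameter count makes this gap concrete. The sketch below assumes that each user vector has the same dimension as the hidden state (768 in base-size BERT or RoBERTa) and that the backbone has 12 encoder blocks. The function name and the number of users in the usage note are hypothetical.

```python
def user_parameter_counts(num_users, hidden_size=768, num_layers=12):
    """Return (emb_params, att_params) of user-specific parameters.

    emb_params : one user embedding matrix in the embedding module
    att_params : one user embedding matrix per encoder block
    Assumes user vectors have dimension hidden_size (an assumption of this sketch).
    """
    emb_params = num_users * hidden_size
    att_params = num_layers * num_users * hidden_size
    return emb_params, att_params
```

For a hypothetical 1,000 users, this gives 768,000 user-specific parameters for the embedding-based variant and 9,216,000 for the attention-based variant, i.e. a factor equal to the number of encoder blocks.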
However, on the IMDB dataset, which has more sentiment classes and much longer sentences than the Yelp datasets, U-PLMs (Att) achieves similar or even better performance than U-PLMs (Emb). This might be because the task is more difficult, so the more complex variant of U-PLMs is more appropriate for modeling user information.
The results in Table 6 also support this judgement. On all three datasets and both metrics, U-PLMs (Att) performs worse than U-PLMs (Emb) for users with fewer than 20 training samples. However, as the number of training samples per user increases, the gap gradually narrows, and U-PLMs (Att) even performs better in some of the results. Therefore, we conclude that our proposed embedding-based personalization is suitable for simple scenarios, while the attention-based one is better for complicated cases. Furthermore, attention-based personalization is more flexible, and many alternatives, such as those in Sections IV-F1 and IV-F2, can be explored in future work.

I. RESEARCH IMPLICATIONS
We focus on the combination of pre-trained language models and user identity information for document-level sentiment classification, which our experiments show to be effective and worth further exploration.
Firstly, both personalized data and PLMs are available in practice, especially for corporations. In addition, our experimental results show that both are quite effective for the task. However, few studies have been done to combine their advantages and further improve performance.
In our research, we study in depth the ways of combining PLMs and user data. Unlike existing works, we introduce user data into not only the prediction module but also the procedure of text modeling. Experimental results show that these two kinds of information can indeed be exploited jointly. We believe that there are many better ways in this field that deserve exploration.

V. CONCLUSION
In this paper, we propose two attempts to inject user identity into PLMs to build U-PLMs, i.e. user-enhanced pre-trained language models, for personalized sentiment analysis. Experimental results show that both of our methods substantially outperform vanilla pre-trained models and state-of-the-art models for personalized sentiment classification. These observations further indicate that pre-trained language models and personalized data can be exploited jointly for better performance in the task of document-level sentiment classification. Furthermore, we found that embedding-based personalization is enough to model the user ID, which carries little information, while the attention-based strategy is suitable for more complicated situations. In future work, we will try to improve on our two strategies and explore better ways of injecting user identity and other context information of the text into PLMs.

VI. ETHICAL CONSIDERATIONS
In this paper, we use a unique ID as the only information to represent a user. The ID is a meaningless string (e.g. ''U0001'') and contains no real information (gender, race, etc.) about users.
To ensure acceptable privacy practice, all the datasets we use in this paper are publicly available, and we use them in a purely observational and non-intrusive manner.
XINLEI CAO received the B.S. degree in software engineering from East China Normal University, China, in 2019, where he is currently pursuing the master's degree with the Department of Computer Science and Technology. His research interests include natural language processing, pre-trained language models, and information retrieval.
JINYANG YU is a Ph.D. in Medicine. He is engaged in post marketing drug monitoring and evaluation. He has presided over or been responsible for a number of post marketing drug research projects. Currently, he is mainly responsible for the medical devices monitoring and information construction.
YAN ZHUANG received the Ph.D. degree from Tsinghua University, China, in 2019. He is a Senior Engineer with the Medical Big Data Research Center, Chinese People's Liberation Army (PLA) General Hospital, engaged in hospital informatization construction and research for more than ten years, specializing in medical big data and artificial intelligence related technologies, and doing research in database and data warehouse, human-machine computing, knowledge graph, and graph neural networks. He has published more than 90 papers in the field of medical big data and won the Best Paper Award at the 2017 ACM International Conference on Information and Knowledge Management (CIKM2017). VOLUME 10, 2022