Collaborative Filtering Recommendation Algorithm Based on Attention GRU and Adversarial Learning

To address the problems that traditional collaborative filtering algorithms based on shallow models cannot learn the deep features of users and items, and that recommendation models are highly susceptible to adversarial perturbations of their parameters, this paper proposes a matrix-factorization recommendation model that combines adversarial learning and attention-based gated recurrent units (AGAMF). First, a gated recurrent unit with an attention mechanism extracts the user's latent vector from the user's auxiliary side information. Second, a convolutional neural network extracts the item's latent vector from the item's auxiliary side information. Finally, adversarial perturbations are introduced on the latent factors of users and items to quantify the loss of the model under parameter perturbations, and the latent vectors of users and items are integrated into probabilistic matrix factorization to predict the user's rating of the item. Experiments were performed on two real datasets, MovieLens-1M and MovieLens-10M, and evaluated with the RMSE, MAE and Recall metrics. The results show that the proposed model is robust and effectively alleviates the data sparsity problem. Compared with other related recommendation algorithms, our model achieves a significant improvement in recommendation performance.


I. INTRODUCTION
In recent years, with the rapid development of Internet technology, the ways for users to obtain data have become increasingly abundant. However, the explosive growth in the amount of information has brought about the problem of "information overload". Faced with noisy data, users may not be able to accurately select useful information, so the recommendation system is a necessary tool to help them obtain it. Traditional approaches include collaborative filtering methods [1]-[3], which use similar preferences among similar users to discover users' potential preferences for items and are vulnerable to the cold start and data sparsity problems. Content-based methods [4], [5], which mine other items with similar attributes for recommendation based on users' historical behaviors, often encounter difficulty in feature extraction. Hybrid recommendation methods [6], [7], considering that each single recommendation method has its own shortcomings, combine different recommendation algorithms.
The associate editor coordinating the review of this manuscript and approving it for publication was Bohui Wang.
Matrix factorization [8] is a widely used model-based CF method with good scalability and accuracy, and it has attracted more and more attention. The method expresses users' rating information on items in the form of a matrix, mines a low-dimensional latent space through factorization of the matrix, re-represents users and items in that low-dimensional space, and then expresses the correlation between users and items by the inner product of their latent feature vectors. To solve the sparsity problem of rating data, Mnih proposed probabilistic matrix factorization [9]. Since deep learning has a powerful ability to learn the essential characteristics of datasets from samples, more and more research has focused on combining traditional matrix factorization with deep learning models. The Additional Stacked Denoising Autoencoder (aSDAE) [10] is good at extracting effective latent features from auxiliary side information and obtaining the implicit relationship between users and items. It extends the Stacked Denoising Autoencoder [11], takes additional auxiliary side information as input and integrates it closely with matrix factorization. Kim used a Convolutional Neural Network (CNN) to capture the contextual information of item description documents to improve the accuracy of rating prediction [12]. Most current methods assume that there is a multi-linear interaction between latent factors, and small random perturbations of linear model parameters can lead to large backward errors. He [13] showed that adding adversarial perturbations to each embedding vector of users and items in matrix factorization can improve the robustness of the model.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In recent years, neural networks based on the attention mechanism have been widely used in natural language processing. Zhou [14] proposed using attention-based bidirectional long short-term memory networks to capture key semantic information.
Zhang [15] introduced the attention mechanism into normalized matrix decomposition to analyze users' different attention to item attributes and thereby obtain more accurate user preferences. Wang [16] proposed a knowledge graph attention network to mine higher-order relationships (connecting two items with one or more link attributes). Liu [17] proposed a probabilistic model that combines a stacked denoising autoencoder and a convolutional neural network, and showed good results. However, since the input document of the autoencoder contains much noisy data without keywords, it cannot automatically distinguish keywords or capture sequence information, and it is also extremely vulnerable to interference from model parameters during training. In response to these problems, and building on Liu [17], a recommendation model (Adversarial GRU-Attention Matrix Factorization, AGAMF) is proposed by combining adversarial learning and the GRU-Attention mechanism. Latent vectors are extracted by the attention-based gated recurrent unit and the convolutional neural network, and adversarial perturbations are enforced on the embedding factors of users and items to quantify the loss of the model under parameter perturbations. The latent vectors of users and items are integrated into probabilistic matrix factorization to predict user ratings. This work addresses the problem of data sparsity and enhances the robustness of the model by optimizing feature vectors. The main contributions of our method are as follows:
1. An attention-based GRU is used to enhance the ability to extract user features, obtain the contextual semantic relationship of documents, and highlight keyword information.
2. Adversarial perturbations are enforced on the embedding factors of users and items to quantify the loss of the model under parameter perturbations. This stabilizes the model fitting process and enhances the robustness of the model.
3. GRU-Attention and CNN are integrated into the PMF framework, and regularization parameters for users and items are applied to balance the rating information and the auxiliary side information, effectively alleviating the problem of data sparsity.
Figure 1 shows an overview of the probabilistic model for AGAMF, which integrates GRU-Attention and CNN into PMF, with perturbations enforced on each embedding vector of users and items.

II. ADVERSARIAL GRU-ATTENTION MATRIX FACTORIZATION MODEL
In Figure 1, R' represents the observed rating matrix and R represents the predicted rating matrix. X represents user auxiliary side information, such as user ID, gender, age, and occupation. Y represents item auxiliary side information, such as movie type and movie description. W and W+ represent the weights of the CNN and of GRU-Attention, respectively. u and v represent the adversarial perturbations enforced on the embedding vectors of users and items, respectively. K is the dimension of the latent vector and σ² is the variance of the Gaussian distribution.
From a probabilistic point of view, the conditional distribution over the predicted ratings is modeled with Gaussian densities, where N(x | µ, σ²) is the probability density function of the Gaussian distribution with mean µ and variance σ².
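For orientation, the standard PMF likelihood of [9], on which this kind of formulation builds, can be written as below. This is a sketch under assumptions rather than the exact expression of the model: N and M (the numbers of users and items) and the indicator I_ij are used as in the later sections, U_i and V_j denote the latent vectors of user i and item j, and the perturbation terms are omitted because the text here does not fix how they enter the mean.

```latex
p(R \mid U, V, \sigma^2)
  = \prod_{i=1}^{N} \prod_{j=1}^{M}
    \left[ \mathcal{N}\!\left( R_{ij} \mid U_i^{\top} V_j,\ \sigma^2 \right) \right]^{I_{ij}}
```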

A. MATRIX FACTORIZATION
Generally, the MF model learns latent factors of users and items from the user-item matrix, which are then used to predict new ratings between users and items. For clarity, we use the most common formulation of MF, in which λ_U and λ_V are regularization parameters that are usually set to alleviate model overfitting, and I_ij is an indicator function that is equal to 1 if R_ij > 0 and 0 otherwise. In addition, ||U||_F and ||V||_F denote the Frobenius norms of the matrices U and V.
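A minimal sketch of such an objective, assuming the usual regularized squared-error form (sum over observed entries of (R_ij − U_i^T V_j)², plus λ_U ||U||_F² and λ_V ||V||_F²). The function name `mf_loss`, the K×N and K×M layout of U and V, and the absence of any 1/2 scaling factors are assumptions made for illustration.

```python
import numpy as np

def mf_loss(R, U, V, lam_u, lam_v):
    """Regularized squared-error MF objective (assumed standard form).

    R: (N, M) rating matrix where 0 marks a missing rating.
    U: (K, N) user latent factors, V: (K, M) item latent factors.
    """
    I = (R > 0).astype(float)            # indicator I_ij = 1 if R_ij > 0
    err = I * (R - U.T @ V)              # errors on observed entries only
    reg_u = lam_u * np.linalg.norm(U, "fro") ** 2
    reg_v = lam_v * np.linalg.norm(V, "fro") ** 2
    return float((err ** 2).sum() + reg_u + reg_v)

# Tiny hypothetical example: 2 users, 2 items, K = 2
R = np.array([[5.0, 0.0], [0.0, 3.0]])
U = np.ones((2, 2))
V = np.ones((2, 2))
print(mf_loss(R, U, V, lam_u=0.5, lam_v=0.5))
```

Only the entries with I_ij = 1 contribute to the error term, which is what lets the model ignore the unrated cells of a sparse matrix.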

B. GRU-ATTENTION
GRU has strong memory capabilities for time series and can learn dependencies over longer context sequences without being limited to local features. Compared with LSTM, its network structure is simpler, it has fewer model parameters, and it trains faster. The GRU consists of a reset gate, an update gate and a memory unit. We use GRU-Attention to obtain the user latent vector U from the user's auxiliary side information, as shown in Figure 2.
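The gate structure described above can be sketched in NumPy as follows. This is an illustrative implementation of one common GRU convention, not the trained network of the paper: the parameter names (`Wz`, `Uz`, `bz`, and so on), the initialization, and the helper functions are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(w_s, h_prev, p):
    """One GRU update; p holds hypothetical weight matrices and biases."""
    z = sigmoid(p["Wz"] @ w_s + p["Uz"] @ h_prev + p["bz"])            # update gate
    r = sigmoid(p["Wr"] @ w_s + p["Ur"] @ h_prev + p["br"])            # reset gate
    h_cand = np.tanh(p["Wh"] @ w_s + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate memory
    return (1.0 - z) * h_prev + z * h_cand                             # new hidden state

def init_params(d_in, n_hid, seed=0):
    """Small random weights for the three gates (illustrative only)."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in "zrh":
        p["W" + g] = rng.normal(0.0, 0.1, (n_hid, d_in))
        p["U" + g] = rng.normal(0.0, 0.1, (n_hid, n_hid))
        p["b" + g] = np.zeros(n_hid)
    return p

def run_gru(words, p):
    """Feed word vectors w_1..w_T through the GRU; returns states of shape (T, n_hid)."""
    h = np.zeros(p["bz"].shape[0])
    states = []
    for w_s in words:
        h = gru_step(w_s, h, p)
        states.append(h)
    return np.stack(states)

H = run_gru(np.ones((5, 4)), init_params(d_in=4, n_hid=3))
print(H.shape)
```

The update gate decides how much of the previous state is kept, and the reset gate decides how much of it feeds the candidate memory, which is why the unit needs fewer parameters than an LSTM cell.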
Input Layer: The user's auxiliary side information is pre-trained using the Skip-Gram model in Word2Vec; a lookup then converts the words in the document into the corresponding pre-trained word vectors {w_1, w_2, ..., w_n}, which are used as the input of the next layer.
GRU Layer: The input sequence at time s is w_1, w_2, ..., w_s. The hidden state h_s is obtained by updating the state h_{s−1} of the GRU at time s − 1. For the inputs w_1, w_2, ..., w_s, the GRU outputs the hidden vectors h_1, h_2, ..., h_s ∈ R^{n_hid}, where n_hid is the number of neurons in the hidden layer of the GRU. h_s is passed to the next layer of the network as a sentence feature vector.
Attention Layer: The attention mechanism of the relation classification task is used to capture the key semantic information in the sentence, and the word-level features at each time step are combined into a sentence feature vector. The matrix of GRU outputs has size d_w × T, where d_w is the dimension of the word vector, w is a trained parameter vector, and w^T is its transpose. The representation τ of the sentence is formed by a weighted sum of the context vector and the word feature vectors.
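A small sketch of this attention pooling, assuming the formulation used in attention-based recurrent networks for relation classification such as [14]: M = tanh(H), α = softmax(w^T M), and a sentence vector equal to the α-weighted sum of the columns of H. The symbols H, M and α and the function name `attention_pool` are assumptions introduced for illustration.

```python
import numpy as np

def attention_pool(H, w):
    """Attention pooling over hidden states.

    H: (d, T) matrix whose columns are the hidden states h_1..h_T.
    w: (d,) trained parameter vector.
    Returns the sentence vector tau (d,) and the attention weights alpha (T,).
    """
    M = np.tanh(H)                         # squashed hidden states
    scores = w @ M                         # one score per time step, shape (T,)
    scores = scores - scores.max()         # numerical stability for the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    tau = H @ alpha                        # weighted sum of word-level features
    return tau, alpha

H = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
tau, alpha = attention_pool(H, np.zeros(2))   # zero w gives uniform attention
print(tau, alpha)
```

With w = 0 every time step receives the same weight, so the pooling reduces to a plain average; a trained w shifts the weight toward the time steps that carry keyword information.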
The GRU network structure based on the attention mechanism accepts the user's original document as input and outputs the latent vector of each user. This output is combined with Gaussian noise ε_i, which is used to further optimize the user's latent vector.
Conditional distributions are defined for each weight parameter W+_k in W+ and for the user latent vector U.

C. CONVOLUTIONAL NEURAL NETWORK
The objective of our CNN architecture is to obtain documents' latent vectors from the documents of items, which are used to compose the items' latent factors with epsilon variables. Figure 3 shows our CNN architecture, which contains five layers: 1) input layer, 2) embedding layer, 3) convolution layer, 4) pooling layer, and 5) output layer.
Input Layer: The input is the movie type and movie description.
Embedding Layer: The original document is converted into a numeric matrix according to the word length, giving the document matrix D ∈ R^{p×l}, whose columns are the word embeddings w_i. Here l represents the length of the document and p represents the embedding dimension of each word w_i.
Convolutional Layer: This layer extracts features of the item text information. The contextual feature c_i^j ∈ R is extracted by the jth shared weight W_c^j ∈ R^{p×ws}, whose window size ws determines the number of surrounding words.
A shared weight can only capture one type of context feature vector. Therefore, multiple shared weights are used to capture multiple types of context feature vectors, generating n_c context feature vectors with W_c (i.e., W_c^j where j = 1, 2, ..., n_c).
Pooling Layer: This layer extracts representative features from the convolutional layer and handles variable-length documents through a pooling operation that constructs fixed-length feature vectors. Here c^j is the context feature vector of length l − ws + 1 extracted by the jth shared weight W_c^j.
Output Layer: We project d_f onto the k-dimensional space of the item's latent factors and generate the latent vector of the document by using a conventional nonlinear projection.
The CNN network structure accepts the original document of the item as its input and outputs the latent vector of each item. This output is combined with Gaussian noise ε_j, which is used to further optimize the item's latent vector.
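The convolution, pooling and output layers can be sketched as below. This is a simplified NumPy illustration under stated assumptions: ReLU is assumed as the convolution activation and tanh as the projection nonlinearity (neither is specified in the text here), max pooling is assumed as the pooling operation, and the projection weights `W1`, `b1`, `W2`, `b2` are hypothetical names.

```python
import numpy as np

def conv_features(D, Wc, bc):
    """Contextual features from a document matrix.

    D:  (p, l) document matrix, one embedding column per word.
    Wc: (n_c, p, ws) shared weights, bc: (n_c,) biases.
    Returns C of shape (n_c, l - ws + 1); row j is the context feature vector c^j.
    """
    n_c, p, ws = Wc.shape
    l = D.shape[1]
    C = np.empty((n_c, l - ws + 1))
    for i in range(l - ws + 1):
        window = D[:, i:i + ws]                               # ws neighbouring words
        C[:, i] = np.maximum(0.0, (Wc * window).sum(axis=(1, 2)) + bc)  # ReLU (assumed)
    return C

def max_pool(C):
    """Fixed-length vector d_f: the maximum of each context feature vector."""
    return C.max(axis=1)

def project(d_f, W1, b1, W2, b2):
    """Nonlinear projection of d_f onto the k-dimensional latent space."""
    return np.tanh(W2 @ np.tanh(W1 @ d_f + b1) + b2)

D = np.arange(12.0).reshape(3, 4)              # p = 3, l = 4
C = conv_features(D, np.ones((2, 3, 2)), np.zeros(2))
d_f = max_pool(C)
print(C.shape, d_f)
```

Because the pooling keeps one value per shared weight, documents of any length l map to a vector of the same length n_c before the projection.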
Conditional distributions are likewise defined for each weight parameter W_k in W and for the item latent vector V, where σ_W² is the variance of the weight parameters and σ_V² is the variance of the item latent vectors.

D. ADVERSARIAL LEARNING
The concept of robustness usually refers to the degree to which an algorithm can resist profile injection attacks. However, few works have focused on the robustness of recommender systems, which may fail to capture fine-grained and stable results due to noisy data. Small random perturbations on the parameters of linear models can lead to large backward errors. The AGAMF model proposed here is inspired by recent developments in adversarial machine learning techniques [18]-[20]. Generally speaking, it was found that the normal supervised training process makes a classifier vulnerable to adversarial examples [21], which revealed the potential issue of an unstable model in generalization. Researchers then proposed adversarial training methods, which augment the training process by dynamically generating adversarial examples to address the issue.
Building upon adversarial learning techniques [22]-[25], our approach injects adversarial perturbations into the model parameters of neural-network-based matrix factorization recommendation. Intuitively, the adversarial perturbations tend to attack the model parameters, while the model parameters aim to defend against those perturbations for self-improvement. We formulate a unified objective function that takes both adversarial perturbations and model parameters into account. As such, our method reaps the benefits of neural networks and matrix factorization while enhancing the robustness of the recommender model, and thus improves its eventual performance.
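One way to picture the attack side of this game is a fast-gradient perturbation of the latent factors, as used in adversarial training for recommenders such as [13]. The sketch below is an assumption-laden illustration, not the paper's exact procedure: the loss is a plain squared error on observed ratings, the perturbation is the loss gradient rescaled to norm `eps`, and all function names are hypothetical.

```python
import numpy as np

def sq_loss(R, U, V):
    """Squared error on observed ratings; U is (K, N), V is (K, M)."""
    I = (R > 0)
    return float(((I * (R - U.T @ V)) ** 2).sum())

def adversarial_perturbation(R, U, V, eps):
    """Worst-case direction (first order) of size eps on the latent factors."""
    I = (R > 0)
    E = I * (R - U.T @ V)                  # errors on observed entries, shape (N, M)
    gU = -2.0 * V @ E.T                    # gradient of the loss w.r.t. U
    gV = -2.0 * U @ E                      # gradient of the loss w.r.t. V
    norm = np.sqrt((gU ** 2).sum() + (gV ** 2).sum()) + 1e-12
    return eps * gU / norm, eps * gV / norm

R = np.array([[5.0, 3.0], [4.0, 0.0]])
U = 0.1 * np.ones((2, 2))
V = 0.1 * np.ones((2, 2))
dU, dV = adversarial_perturbation(R, U, V, eps=0.05)
print(sq_loss(R, U, V), sq_loss(R, U + dU, V + dV))
```

Training against the perturbed loss in addition to the clean loss is what pushes the model toward parameters whose predictions change little under small parameter disturbances.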

III. AGAMF MODEL PARAMETERS OPTIMIZATION
Following the parameter optimization method of [17], this paper uses maximum a posteriori estimation to optimize the parameters, starting from the posterior probability of the parameters and taking the negative logarithm of (19) as the optimization function L. Coordinate descent is adopted to iteratively optimize one latent variable while the other variables are fixed. U and V are updated through the optimization function L until convergence. In these updates, λ_U and λ_V are regularization parameters that are usually set to alleviate model overfitting, and I_i and I_j are diagonal matrices built from I_ij (i = 1, 2, ..., N; j = 1, 2, ..., M): when user i has rated item j, I_ij = 1; otherwise I_ij = 0. W+ and W belong to the GRU-Attention and CNN models and cannot be optimized in the same way as U and V. When U and V are temporarily held constant, L can be interpreted as a squared error function with L2 regularization terms. The overall optimization process (U, V, W+ and W are alternately updated) is repeated until convergence. With the optimized U, V, W+ and W, we can finally predict the ratings of users on items.
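The update of a single user vector can be sketched as follows, assuming the closed-form rule used by related hybrid PMF models such as [12] and [17]: U_i = (V I_i V^T + λ_U I_K)^{-1} (V R_i + λ_U θ_i), where θ_i stands for the output of the user network for user i. The name `theta_i`, the function name, and the exact form of the rule are assumptions made for illustration.

```python
import numpy as np

def update_user(R_i, V, lam_u, theta_i):
    """Closed-form update of one user latent vector (assumed form).

    R_i:     (M,) ratings of user i, 0 marks a missing rating.
    V:       (K, M) item latent factors, held fixed during this step.
    theta_i: (K,) output of the user network for user i (hypothetical name).
    """
    I_i = (R_i > 0).astype(float)                       # diagonal of I_i
    A = (V * I_i) @ V.T + lam_u * np.eye(V.shape[0])    # V I_i V^T + lam_u I_K
    b = V @ R_i + lam_u * theta_i                       # missing ratings are 0 in R_i
    return np.linalg.solve(A, b)

V = np.array([[1.0, 1.0, 1.0]])                         # K = 1, M = 3
print(update_user(np.array([4.0, 0.0, 2.0]), V, lam_u=1.0, theta_i=np.array([0.0])))
```

The item update has the same shape with the roles of U and V swapped, and the term weighted by λ_U pulls each latent vector toward what the side-information network predicts for it.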

Optimization process (final steps):
Perform backpropagation and update W+ through (24)
End for
Step 7: until convergence
Step 8: until satisfying early stopping using a validation set

IV. EXPERIMENTS

A. EXPERIMENTAL SETTINGS
To verify the recommendation performance of the AGAMF model proposed in this paper, we use Keras as the deep learning framework with TensorFlow as the backend. Adam is selected as the optimizer to learn the model parameters. The comparative experiments were run under Windows 10 (x64) with PyCharm 2018, an Intel(R) Core(TM) i7-8700K CPU @ 3.70 GHz, 16 GB of memory, and Python 3.7.

1) DATASETS
The MovieLens-1M and MovieLens-10M public datasets are widely used in movie rating prediction. Each user in the datasets has at least 20 ratings, and each rating scores a movie on a scale from 1 (worst) to 5 (best). MovieLens-1M contains more than 1 million ratings on 3706 items from 6040 users. MovieLens-10M contains more than 9 million ratings on 10,073 items from 69,878 users. User auxiliary side information includes attributes such as ID, gender, age, and occupation, and item auxiliary side information includes information such as movie type and movie description. In this experiment, the entire dataset is divided into training data, validation data and test data at a ratio of 80%, 10% and 10%.
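A minimal sketch of such an 80/10/10 split over rating indices. It assumes a uniformly random split with a fixed seed; the text does not say whether the split is random or stratified per user, and the function name is hypothetical.

```python
import numpy as np

def split_ratings(n_ratings, seed=0):
    """Randomly split rating indices into train, validation and test parts (80/10/10)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_ratings)
    n_train = n_ratings * 8 // 10          # integer arithmetic avoids rounding surprises
    n_valid = n_ratings // 10
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_ratings(1000)
print(len(train), len(valid), len(test))
```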

2) EVALUATION METRICS
To evaluate the rating prediction performance of the compared algorithms, Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Recall are used to measure how accurately the recommender system predicts rating values. For RMSE and MAE, T represents the total number of ratings, R_ij represents the observed rating, and R̂_ij represents the predicted rating.
For Recall, u is the user set, R(u) is the list of items recommended to the user, T(u) is the list of items actually watched by the user, and K is the number of top-ranked items recommended to the users.
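The three metrics can be sketched as below, assuming the usual definitions: RMSE is the square root of the mean squared difference between R_ij and R̂_ij over the T ratings, MAE is the mean absolute difference, and Recall@K is the total number of hits in the top-K lists divided by the total number of watched items. The aggregation of Recall over users is an assumption, as are the function names and the toy data.

```python
import numpy as np

def rmse(r, r_hat):
    """Root mean squared error over T ratings."""
    return float(np.sqrt(np.mean((r - r_hat) ** 2)))

def mae(r, r_hat):
    """Mean absolute error over T ratings."""
    return float(np.mean(np.abs(r - r_hat)))

def recall_at_k(recommended, watched, k):
    """Recall@K: hits in the top-K lists R(u) divided by the size of the lists T(u)."""
    hits = sum(len(set(recommended[u][:k]) & set(watched[u])) for u in watched)
    total = sum(len(watched[u]) for u in watched)
    return hits / total if total else 0.0

r = np.array([5.0, 3.0, 4.0])
r_hat = np.array([4.0, 3.0, 2.0])
recommended = {"u1": [1, 2, 3], "u2": [4, 5, 6]}   # hypothetical ranked lists
watched = {"u1": [2, 9], "u2": [4]}                 # hypothetical ground truth
print(rmse(r, r_hat), mae(r, r_hat), recall_at_k(recommended, watched, k=2))
```

Lower values are better for RMSE and MAE, while higher values are better for Recall.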

3) BASELINES
To verify the performance of our model, we compare it with the following methods:
a. Probabilistic Matrix Factorization (PMF) [9]: it is a probabilistic method for matrix factorization, which assigns a D-dimensional latent feature vector (following Gaussian distributions) to each user and item. The ratings are derived from the inner product of the corresponding latent features.
b. Additional Stacked Denoising Autoencoder (aSDAE) [10]: it fuses aSDAE with the matrix factorization (MF) model to construct a hybrid collaborative filtering model, which can extract effective potential features from auxiliary information at the same time, and obtain the implicit relationship between users and items.
c. Convolutional Matrix Factorization (ConvMF) [12]: it captures subtle contextual differences of a word in a document and further enhances the rating prediction accuracy when the rating data is extremely sparse.
d. A Probabilistic Model of Hybrid Deep Collaborative Filtering (PHD) [17]: it proposes a probabilistic model that combines a stacked denoising autoencoder and a convolutional neural network together with auxiliary side information (from both users and items) to extract the latent factors of users and items.
e. GRU-Attention Matrix Factorization (GAMF): we integrate GRU-Attention and CNN into PMF with users and items' auxiliary side information.
f. Adversarial GRU Matrix Factorization (AGAMF-N): we integrate GRU and CNN into PMF with users and items' auxiliary side information, and the perturbations are enforced on each embedding vector of user and item.

4) PARAMETER SETTINGS
The parameters of the compared methods are shown in Table 1. Because the balancing parameters λ_U and λ_V [26] affect the performance of the AGAMF model, the experimental results on the two datasets under different settings are shown in Table 2. Table 2 shows the impact of λ_U and λ_V on the two datasets. We observe that on a dataset with sparse rating data, better results can be obtained by decreasing λ_U and increasing λ_V. Setting proper λ_U and λ_V values can map the auxiliary side information of users and items to an appropriate latent space, better balance the auxiliary side information of users and items, and improve the rating prediction accuracy of the AGAMF model.

B. EXPERIMENTAL ANALYSIS

1) RMSE COMPARISON
This subsection compares the performance of the different methods in the same environment. Table 3 shows the rating prediction performance on two datasets with different sparsity.
From Table 3 we can observe that the AGAMF model achieves better RMSE than the other methods on the two datasets. On ML-1M, compared with the PMF, aSDAE, ConvMF, and PHD models, the RMSE of the AGAMF model improved by 5.23%, 3.63%, 2.92%, and 2.16%, respectively. This shows that using the attention-based GRU and adversarial learning in the matrix factorization framework effectively improves the performance of the model. In addition, on ML-10M, the RMSE of the AGAMF model improved by 9.53%, 4.69%, 4.31%, and 3.44% compared with the PMF, aSDAE, ConvMF, and PHD models, respectively, indicating the effectiveness of combining the auxiliary side information of users and items. It also shows that the AGAMF model has a strong ability to extract auxiliary side information. AGAMF, AGAMF-N and GAMF all perform better than the PHD model; these three models all use the GRU network to extract the deep features of the contextual information and emphasize the long-term dependence between words in the document. Because we improve the recommender model by making it resistant to adversarial perturbations on its parameters, we obtain a more robust and stable predictive function, which in turn improves its generalization performance.
Both the AGAMF and AGAMF-N models perform better than the GAMF model. The former two models enforce adversarial perturbations on the latent vectors of users and items. Learning with adversarial perturbations is crucial for increasing a model's robustness, which in turn can increase its generalization performance. We believe that this insight is particularly useful for recommendation. On ML-1M, the performance of AGAMF and AGAMF-N improved compared with the GAMF model. On ML-10M, the RMSE of AGAMF and AGAMF-N improved by 1.46% and 0.8%, respectively, compared with the GAMF model, which shows that adversarial learning can effectively reduce the interference of model parameters with model training, thereby improving model performance.
The AGAMF model performs better than the AGAMF-N model. AGAMF uses the attention mechanism to express the characteristic information of important words, assigns corresponding weights to each word, and highlights the key information in the context. Across ML-1M and ML-10M, the RMSE of the AGAMF model is on average 0.67% better than that of the AGAMF-N model, which verifies the effectiveness of the attention mechanism.

2) RECALL COMPARISON
This subsection compares the top-K recall of the different methods. Experiments are performed on two datasets with different sparsity, as shown in Figure 4.
From Figure 4 we can observe that the recall of all algorithms rises as K increases on the two datasets. Among them, the traditional method PMF has the lowest performance because it ignores the auxiliary side information of users and items, which makes the recommendation results poor. The PHD model performs better than aSDAE and ConvMF, which shows that combining traditional matrix factorization with deep learning models can learn effective latent factors and better extract auxiliary side information. The AGAMF, AGAMF-N and GAMF models perform significantly better than the PHD model, which shows that the GRU's ability to establish long-term dependence between words makes up for the shortcomings of aSDAE in extracting context information and gives a better representation and model of the context, thereby improving recommendation performance. Models other than PMF perform better on ML-1M, indicating that neural network-based models are more suitable for sparse relational data. The AGAMF model is superior to the AGAMF-N model because the attention mechanism can adaptively combine context information to pay different levels of attention to it, effectively improving the accuracy of the model and thereby its performance. The AGAMF model is better than the GAMF model because the introduction of adversarial learning makes the model fitting process stable and enhances the robustness of the model. The AGAMF model remains robust and performs well when compared with traditional approaches and other deep learning models.

3) THE IMPACT OF ITERATIONS ON PERFORMANCE
This subsection compares the RMSE of the different methods on two datasets with different sparsity under different numbers of iterations, as shown in Figure 5.
From Figure 5 it can be seen that the RMSE of all models gradually decreases as the number of iterations increases and eventually stabilizes. However, too many iterations cause the model to overfit, resulting in lower performance. The AGAMF model is better than the PHD model, indicating that the GRU and attention layers can quickly extract the deep features of the context, highlight its key information, and converge faster than the encoding layer of the additional stacked denoising autoencoder. Adversarial learning makes the model fitting process stable, so the recommendation performance is better. The AGAMF, AGAMF-N, and GAMF models converge faster, and their RMSE is better than that of the other four models during initial training. They reach high performance with a small number of iterations, which makes training more cost-effective.
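Because too many iterations lead to overfitting, training is typically cut off with early stopping on the validation set. The sketch below is a generic patience-based rule, not the stopping criterion of the paper: the `patience` parameter, the function name, and the validation RMSE values are all hypothetical.

```python
def early_stopping_epoch(val_rmse, patience=3):
    """Return (best_epoch, best_rmse) under a patience-based stopping rule.

    Training stops once the validation RMSE has failed to improve
    for `patience` consecutive epochs.
    """
    best, best_epoch, wait = float("inf"), -1, 0
    for epoch, score in enumerate(val_rmse):
        if score < best:
            best, best_epoch, wait = score, epoch, 0   # new best: reset the counter
        else:
            wait += 1
            if wait >= patience:
                break                                  # stop before overfitting further
    return best_epoch, best

# Hypothetical validation curve: improves, then degrades
curve = [0.95, 0.90, 0.88, 0.89, 0.91, 0.92, 0.80]
print(early_stopping_epoch(curve, patience=3))
```

With a patience of 3 the run above stops after the sixth epoch and keeps the model from the third, so the late value 0.80 is never reached; a larger patience trades extra training time for the chance of finding such later improvements.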

4) MAE COMPARISON
This subsection compares the MAE of the different methods in the same environment. Experiments are performed on two datasets with different sparsity, as shown in Figure 6. Figure 6 shows that the MAE of the AGAMF model is better than that of the other models on the ML-1M and ML-10M datasets. On the sparse ML-1M dataset, PMF performs better than aSDAE and ConvMF, indicating that when the data matrix is too sparse, aSDAE and ConvMF cannot effectively extract the latent factors. Even when the data is very sparse, the AGAMF model still performs well compared with traditional methods and other deep learning models. This shows that a deep learning structure can produce higher-quality features from auxiliary information, and that combining GRU-Attention and CNN to extract latent factors is especially effective. In addition, after adversarial perturbations are added, the large backward error caused by interference with the linear model parameters is reduced, so the model is more robust.

V. CONCLUSION
Traditional recommendation systems suffer from the data sparsity problem, recommendation models are extremely vulnerable to interference with their parameters, and the additional stacked denoising autoencoder lacks the ability to extract deep features and key information from the context. This paper adopts a matrix factorization recommendation model that combines adversarial learning and GRU-Attention to improve recommendation performance. Compared with several methods that incorporate deep learning, the experimental results show that the AGAMF model performs well on two datasets. This shows that the AGAMF model proposed in this paper improves recommendation performance by modeling auxiliary side information, fully learning the contextual semantic relationship, and adding adversarial perturbations to stabilize the fitting process.
Although the accuracy of the AGAMF model has been improved to some extent, limitations remain because of the sparsity of the rating information and of the auxiliary side information of users and items: the model framework combined with deep learning is complex, the training time is long, and the improvement in the experimental results is not large. Shainoor J et al. [27] offer a good research idea, and following their framework we will consider how to deal with sparse data more effectively and build a simplified and reasonable recommendation framework in the future.