Self-Attention Based Sequential Recommendation With Graph Convolutional Networks

Learning embedding representations of users and items lies at the core of modern recommender systems. Existing methods based on Graph Convolutional Networks (GCN) and sequential recommendation typically obtain a user's or an item's embedding by mapping pre-existing features, such as IDs and attributes, into a latent space. GCN integrates user-item interactions as a bipartite graph structure into the embedding process, which better represents sparse data but cannot capture users' long-term interests. Sequential recommendation seeks to capture the "context" of users' activities from their historical actions, but requires dense data. The goal of our work is to combine the advantages of GCN and sequential recommendation models by proposing a novel Self-Attention based Sequential recommendation with Graph Convolutional Networks (SASGCN). It uses multiple lightweight GCN layers to capture high-order connectivity between users and items, and introduces ratings as auxiliary information into the user-item interaction matrix to provide richer information. By incorporating self-attention, the proposed model captures long-term semantics from relatively few actions. Extensive experiments on three benchmark datasets show that our model consistently outperforms various state-of-the-art models.


I. INTRODUCTION
Recommender systems are an important technology for alleviating information overload on the web, and have been widely deployed in fields such as e-commerce, social networks, and music and movie services.
Collaborative filtering (CF) is a common recommendation method that identifies users' interests by analyzing their historical behavior [1]. To improve the accuracy and effectiveness of CF, recent research has explored the integration of GCN. NGCF is one such method, which introduces multiple GCN layers to capture high-order connectivity between users and items [2]. Another recent method, LightGCN, further simplifies the model by removing the feature transformation and nonlinear activation parts inherited from GCN in NGCF, achieving a lightweight model while maintaining high accuracy and effectiveness [3].
Sequential recommender systems aim to make personalized recommendations for users based on their historical activities and the "context" of their behavior; the main challenge is how to capture the high-order dynamics of user behavior succinctly. Recurrent Neural Networks (RNN) are a common method for capturing dynamic user information, but RNN sequence models have not been able to attain state-of-the-art results in small-data regimes [4]. Recently, self-attention based Transformer models have achieved state-of-the-art performance and efficiency on Natural Language Processing (NLP) tasks [5], and the Transformer has also begun to be applied in sequential recommendation [41], [44]. SASRec replaces the RNN component of traditional sequential recommendation with the Transformer, resulting in significantly improved computational efficiency [6].
Inspired by the success of these models, we seek to combine self-attention and GCN for sequential recommendation. We propose a novel Self-Attention based Sequential recommendation with Graph Convolutional Networks, called SASGCN. SASGCN simultaneously achieves the efficiency of parallel computation and captures the high-order connectivity between users and items. We also note that the models above consider only the existence of user-item interactions when representing user and item embeddings, which cannot accurately express the degree of a user's preference for an item. To improve accuracy, we introduce ratings as auxiliary information when constructing the user-item interaction graph, allowing us to assign weights to different user-item interactions. Importantly, introducing this auxiliary information does not increase model complexity. Specifically, we associate users with items to generate the user-item interaction graph, and introduce ratings as auxiliary information to obtain weights for user-item interactions. We then combine the embeddings learned at different propagation layers with a weighted sum to obtain the graph embedding. Meanwhile, the self-attention part adaptively assigns weights to previous items at each time step. Finally, we map the obtained weights onto the graph embedding to obtain the final embedding [3], [6].
To summarize, our work makes the following main contributions:
1) We introduce ratings as auxiliary information into the user-item interaction graph, allowing us to obtain weight information for different users on different items.
2) We propose SASGCN, which combines self-attention and GCN to improve the accuracy and computing efficiency of the model without increasing its complexity.
3) We conduct extensive experimental studies on three real-world datasets. The results demonstrate that our model outperforms other state-of-the-art models in terms of both accuracy and efficiency.
The remainder of this article is organized as follows. Section I is the introduction; Section II reviews related work; Section III gives a detailed description of the SASGCN model; Section IV presents experimental results and analyses on recommendation datasets; and Section V concludes and discusses future work.

II. RELATED WORK
Several lines of work are closely related to ours. We discuss existing work on GCN-based CF methods and sequential recommendation methods.
CF is a classic recommendation algorithm that predicts users' interest in items by analyzing their historical behavior, and is one of the most common methods in the field of recommender systems. Matrix factorization (MF) maps the ID of each user and item to an embedding vector and predicts their interaction through the inner product of the two [7]. NeuMF predicts user preferences via a Multi-Layer Perceptron (MLP) [8].
Despite their great success, such methods capture CF signals only implicitly, making it difficult to obtain embeddings with the desired properties. In recent years, recommendation algorithms based on graph embeddings have addressed this issue by constructing user-item interaction graphs to explicitly encode CF signals [9], [10]. NGCF is a hybrid model that combines neural networks and graph embeddings; it stacks multiple GCN layers to capture high-order connectivity between user and item nodes, improving recommendation performance. LightGCN is a lightweight improvement over NGCF that removes the feature transformation and nonlinear activation parts inherited from GCN. Since each node in the user-item interaction graph of NGCF only contains ID information, the lightweight design of LightGCN improves the effectiveness and accuracy of the model [2], [3].
Sequential recommendation refers to predicting the next item of interest for a user based on their historical behavior sequence, and is an important research direction in recommender systems. The main challenge lies in how to model the evolution of long- and short-term interests in the user behavior sequence and use it to predict future behavior. Markov chain (MC) methods assume the user's next action can be predicted from their previous actions [11]. Factorizing Personalized Markov Chains (FPMC) combine MF and item-item transitions to capture long-term preferences and short-term transitions, respectively [12]. Compared with first-order MC, high-order MC can take longer-term user preferences into account [13], [14]. Caser treats the embedding matrix of the L previous items as an "image" and applies convolutional operations to extract transitions [15]. Besides MC-based methods, RNN is also commonly used to model user sequences. GRU4Rec uses Gated Recurrent Units (GRU) to model click sequences for session-based recommendation [4].
The Transformer is a purely attention-based sequence-to-sequence method that has achieved state-of-the-art performance and efficiency on machine translation tasks previously dominated by RNN-based approaches [5], and is gradually being applied to recommender systems [17], [18]. SASRec was the pioneering work adopting the Transformer for sequential recommendation; it employs self-attention modules to learn the weights of items at different positions in the sequence [6]. STOSA embeds each item as a stochastic Gaussian distribution and introduces Wasserstein distances as self-attention weights to measure the pairwise relationships between items in the sequence [16], [19]. BERT4Rec employs deep bidirectional self-attention to model user behavior sequences, predicting randomly masked items in the sequence by jointly conditioning on their left and right context [22].
Inspired by LightGCN and SASRec, we seek to build a new Self-Attention based Sequential recommendation model with Graph Convolutional Networks, and introduce ratings as auxiliary information to alleviate the problem of insufficient node information in the user-item interaction graph.

III. METHOD
In this section, we introduce the proposed SASGCN model, which is illustrated in Figure 1. The SASGCN model consists of four main components: (1) an embedding layer that initializes user and item embeddings; (2) multiple embedding propagation layers that refine the embeddings by incorporating high-order connectivity relations; (3) a self-attention layer with position encoding; and (4) a prediction layer that computes user-item relevance scores.

A. EMBEDDING LAYER
Given a user $u$, we denote the sequence of item IDs the user has interacted with as $\mathcal{S}^u = (S^u_1, S^u_2, \ldots, S^u_{|\mathcal{S}^u|})$. To capture the temporal dynamics of user behavior, we consider that recent behaviors better represent the user's current preferences, while distant actions have less impact. Therefore, we define a maximum length $n$ and transform the sequence $(S^u_1, S^u_2, \ldots, S^u_{|\mathcal{S}^u|-1})$ into a training sequence $s = (s_1, s_2, \ldots, s_n)$ consisting of the user's $n$ most recent actions. If the sequence length is less than $n$, we pad the left side of the sequence with zero vectors until the length is $n$ [6].
We set the user embedding matrix as $E_u \in \mathbb{R}^{M \times d}$ and the item embedding matrix as $E_i \in \mathbb{R}^{N \times d}$, where $d$ is the latent dimension, and $M$ and $N$ denote the numbers of users and items, respectively. We generate the user-item interaction matrix $R \in \mathbb{R}^{M \times N}$ from the user-item interaction graph. We introduce users' ratings of items: each entry $R_{ui}$ is the rating if user $u$ has interacted with item $i$, and 0 otherwise.
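As a concrete illustration (a minimal sketch under our own naming, not the implementation used in this paper), the fixed-length training sequence and the rating-weighted interaction matrix can be constructed as follows:

```python
import numpy as np

def build_training_sequence(item_ids, n):
    """Keep the user's n most recent item IDs; left-pad with 0 otherwise."""
    seq = np.zeros(n, dtype=np.int64)      # 0 is reserved as the padding ID
    recent = item_ids[-n:]                 # truncate to the n most recent actions
    seq[n - len(recent):] = recent         # left-pad shorter sequences
    return seq

def build_interaction_matrix(triples, num_users, num_items):
    """R[u, i] holds the rating if user u rated item i, and 0 otherwise."""
    R = np.zeros((num_users, num_items), dtype=np.float32)
    for u, i, rating in triples:           # (user, item, rating) triples
        R[u, i] = rating
    return R
```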

B. EMBEDDING PROPAGATION LAYER
We construct the relationship between users and items as a user-item interaction graph, aggregating the features of neighbors as the new representation of a target node through the GCN [20]. Considering that nodes only contain ID and rating information, we aim to minimize unnecessary computation and model burden by adopting a simple weighted sum aggregator instead of the nonlinear activation and feature transformation of traditional GCN. Thus, we adopt a lightweight graph convolution operation [2], [21], defined as follows:
$$e_u^{(k+1)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_i|}}\, e_i^{(k)}, \qquad e_i^{(k+1)} = \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i|}\sqrt{|\mathcal{N}_u|}}\, e_u^{(k)},$$
where $1/(\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_i|})$ is the symmetric normalization term of the graph adjacency matrix, and $\mathcal{N}_u$ and $\mathcal{N}_i$ denote the first-hop neighbors of user $u$ and item $i$ [20].
To further enhance the representation capability of our model, we stack $K$ layers of GCN and perform a weighted sum over the embeddings obtained at each layer:
$$e_u = \sum_{k=0}^{K} \alpha_k\, e_u^{(k)}, \qquad e_i = \sum_{k=0}^{K} \alpha_k\, e_i^{(k)},$$
where $\alpha_k \geq 0$ denotes the importance of the $k$-th layer embedding. To avoid unnecessary complexity, we set $\alpha_k$ uniformly to $1/(K+1)$.
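For concreteness, the sketch below implements the propagation and layer-combination rules with dense matrices for readability (sparse matrices would be used in practice); whether the normalization degrees count ratings or plain edges is an implementation choice we leave open, and all names are illustrative:

```python
import torch

def propagate(R, e_user, e_item, K):
    """Lightweight graph convolution: symmetric normalization, no feature
    transformation or nonlinearity, and a uniform weighted sum over the
    K + 1 layer embeddings (alpha_k = 1 / (K + 1)). R is the (possibly
    rating-weighted) user-item interaction matrix."""
    deg_u = R.sum(dim=1).clamp(min=1.0)   # (weighted) degree of each user
    deg_i = R.sum(dim=0).clamp(min=1.0)   # (weighted) degree of each item
    R_norm = R / deg_u.sqrt()[:, None] / deg_i.sqrt()[None, :]

    users, items = [e_user], [e_item]
    for _ in range(K):
        # one propagation step: aggregate first-hop neighbors on the bipartite graph
        e_user, e_item = R_norm @ e_item, R_norm.t() @ e_user
        users.append(e_user)
        items.append(e_item)
    alpha = 1.0 / (K + 1)
    return alpha * sum(users), alpha * sum(items)
```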

C. SELF-ATTENTION LAYER
The self-attention mechanism has been widely applied across machine learning since its inception. When processing sequences, self-attention with position encoding can handle fixed-length inputs better than plain attention mechanisms [5]. Given a maximum length $n$, we use a position embedding table $P \in \mathbb{R}^{n \times d}$, where $p_i$ denotes the position embedding for the $i$-th position in a sequence. The final input embedding of sequence $s$ is
$$\hat{E} = \begin{bmatrix} e_{s_1} + p_1 \\ e_{s_2} + p_2 \\ \vdots \\ e_{s_n} + p_n \end{bmatrix},$$
where $e_{s_i}$ is the item embedding of item $s_i$. The self-attention layer mainly consists of multi-head attention, a feed-forward network, residual connections, and layer normalization [5].
Multi-head attention splits the input embeddings into $h$ different subspaces and performs attention on each separately. The multi-head attention on the input embeddings is [22]:
$$\mathrm{MultiHead}(\hat{E}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W, \qquad \mathrm{head}_j = \mathrm{Attention}(\hat{E}W^Q_j, \hat{E}W^K_j, \hat{E}W^V_j),$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$
where $W$, $W^Q$, $W^K$, $W^V$ are the weight matrices, $\mathrm{Attention}$ is the scaled dot-product attention, $Q$ represents the queries, $K$ the keys, and $V$ the values, and $\sqrt{d}$ is a scaling term that helps prevent vanishing gradients and improves the generalization ability of the model. Multi-head attention enables the model to attend to information in different subspaces, thus capturing richer feature information.
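The following sketch assembles the position embedding and multi-head attention from standard PyTorch modules; it is an illustration under our own naming, not the released code, and the causal mask (each step attends only to earlier items) is a detail we assume from the next-item prediction setting:

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """Position-aware multi-head self-attention over a length-n item sequence."""
    def __init__(self, n, d, num_heads):
        super().__init__()
        self.pos_emb = nn.Embedding(n, d)   # position table P in R^{n x d}
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)

    def forward(self, seq_emb):
        # seq_emb: (batch, n, d) item embeddings from the propagation layers
        n = seq_emb.size(1)
        pos = torch.arange(n, device=seq_emb.device)
        x = seq_emb + self.pos_emb(pos)     # E_hat: e_{s_i} + p_i
        # causal mask so step t attends only to positions <= t
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool,
                                     device=seq_emb.device), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out
```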
The feed-forward network consists of two fully connected layers with a ReLU activation in between. This design enables the model to capture both linear and nonlinear relationships among items in a sequence; the nonlinear activation in particular improves the model's expressive power and overall performance. The feed-forward network is defined as
$$\mathrm{FFN}(x) = \mathrm{ReLU}(xW^{(1)} + b^{(1)})\,W^{(2)} + b^{(2)},$$
where $W^{(1)}$ and $W^{(2)}$ are $d \times d$ matrices and $b^{(1)}$ and $b^{(2)}$ are $d$-dimensional vectors.
Our model employs residual connections and layer normalization to enable training of deeper models while improving stability and convergence. The core idea of residual networks is to propagate low-layer features to higher layers via shortcut connections, which has been shown to ease the training of deep models [23], [24], [42]. Layer normalization normalizes the inputs across features, which further stabilizes and accelerates neural network training [25]. We do not use batch normalization because, when the batch size is small, the mean and variance of a batch may not adequately represent those of the whole population of samples [26]. Specifically, assuming the input is a vector $x$ containing all features of a sample, layer normalization is defined as
$$\mathrm{LayerNorm}(x) = \alpha \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$
where $\odot$ is the element-wise product, $\mu$ and $\sigma^2$ are the mean and variance of $x$, $\epsilon$ is a small constant for numerical stability, and $\alpha$ and $\beta$ are learned scale and bias terms.
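A minimal sketch of the feed-forward sub-block with residual connection and layer normalization is given below; the exact placement of dropout and normalization around the residual path is our assumption, not a detail specified above:

```python
import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    """Two d x d fully connected layers with ReLU in between, wrapped with a
    residual connection and layer normalization."""
    def __init__(self, d, dropout=0.5):
        super().__init__()
        self.w1 = nn.Linear(d, d)     # W^(1), b^(1)
        self.w2 = nn.Linear(d, d)     # W^(2), b^(2)
        self.norm = nn.LayerNorm(d)   # normalizes across features, not the batch
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        h = self.w2(self.drop(torch.relu(self.w1(x))))   # FFN(x)
        return self.norm(x + self.drop(h))               # residual + LayerNorm
```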

D. PREDICTION LAYER
After obtaining the outputs of the embedding propagation layer and the self-attention layer, given the first $t$ items, we predict the next item based on $F_t$. To reduce model size and prevent overfitting, we use a single shared item embedding $e_i$. Specifically, we employ an MF layer to predict the relevance of item $i$:
$$r_{i,t} = F_t\, e_i^{\top},$$
where $r_{i,t}$ is the relevance of item $i$ being the next item given the first $t$ items (i.e., $s_1, s_2, \ldots, s_t$). By ranking the relevance scores $r_{i,t}$, we generate top-N recommendations for the user.
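The prediction step thus reduces to an inner product between the sequence representation and the shared item embeddings, followed by ranking; a sketch (names illustrative):

```python
import torch

def relevance_scores(F_t, item_emb):
    """r_{i,t} = F_t . e_i for every candidate item i.
    F_t: (batch, d) sequence representations; item_emb: (num_items, d)."""
    return F_t @ item_emb.t()                    # (batch, num_items)

# Top-10 recommendations: rank candidates by relevance score.
# scores = relevance_scores(F_t, item_emb)
# top10 = scores.topk(10, dim=-1).indices
```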

E. MODEL TRAINING
To prevent overfitting during training, we employ dropout, randomly dropping messages passed into the self-attention layer with probability $p$. Note that dropout is only used during training and is disabled during testing.
Considering that we truncate or pad the user sequence to its last $n$ elements when constructing the fixed-length sequence $s = (s_1, s_2, \ldots, s_n)$, we define $o_t$ as the expected output at time step $t$:
$$o_t = \begin{cases} \langle \mathrm{pad} \rangle & \text{if } s_t \text{ is a padding item}, \\ s_{t+1} & 1 \leq t < n, \\ S^u_{|\mathcal{S}^u|} & t = n, \end{cases}$$
where $\langle \mathrm{pad} \rangle$ indicates a padding item; we ignore the terms where $o_t = \langle \mathrm{pad} \rangle$ since we use zeros for padding. When the input sequence is $s$, the corresponding sequence $o$ serves as the expected output, and we adopt the binary cross-entropy loss:
$$-\sum_{\mathcal{S}^u \in \mathcal{S}} \sum_{t=1}^{n} \left[ \log \sigma(r_{o_t,t}) + \sum_{j \notin \mathcal{S}^u} \log\big(1 - \sigma(r_{j,t})\big) \right].$$
We employ the Adam optimizer in a mini-batch manner [27], a variant of Stochastic Gradient Descent (SGD) with adaptive moment estimation. In each epoch, we randomly generate one negative item $j$ for each time step in each sequence [8], [35].
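With one sampled negative per time step, the loss can be computed as in the sketch below (a minimal illustration under our own naming, not the training code itself):

```python
import torch
import torch.nn.functional as F

def sequence_bce_loss(pos_scores, neg_scores, is_pad):
    """Binary cross-entropy over (ground-truth, sampled-negative) score pairs
    per time step, with padding steps masked out. All tensors: (batch, n).
    log(1 - sigmoid(x)) is computed stably as logsigmoid(-x)."""
    valid = (~is_pad).float()
    loss = -(F.logsigmoid(pos_scores) + F.logsigmoid(-neg_scores)) * valid
    return loss.sum() / valid.sum().clamp(min=1.0)

# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```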

F. COMPLEXITY ANALYSIS
Space Complexity: The learned parameters of our model come from the embeddings and the parameters of the multi-head attention, feed-forward network, and layer normalization. The total number of parameters is $O((|I| + |U|)d + nd + d^2)$, where $I$ and $U$ denote the sets of items and users, respectively.
Time Complexity: The computational cost of our model primarily arises from the embedding matrix, the multi-head attention, and the feed-forward network, whose complexities are $O(nd)$, $O(n^2 d)$, and $O(n d^2)$, respectively. Therefore, the time complexity of our model is $O(nd + n^2 d + n d^2)$.

IV. EXPERIMENTS
We perform experiments on three real-world datasets to assess the performance of our proposed method, aiming to answer the following research questions:
RQ1: Does SASGCN outperform state-of-the-art models, including sequential recommendation and GCN-based models?
RQ2: How do different hyper-parameter settings (e.g., dimensions and number of heads) affect SASGCN?
RQ3: What is the impact of the various components of the SASGCN architecture?

A. EXPERIMENTAL SETTINGS
1) DATASETS
To evaluate the effectiveness of SASGCN, we conduct experiments on three benchmark datasets: Amazon-games, Book-crossing, and Movielens, which are publicly accessible and vary in terms of domain, size, and sparsity.
Amazon-games: Amazon-review is a widely used dataset for product recommendation [29], and we select the Games category from the collection. This dataset is highly sparse.
Book-crossing: This dataset was collected from the Book-Crossing community using a web crawler and contains 1,149,780 ratings from 278,858 users on 271,379 books [30].
Movielens: A widely used benchmark dataset for evaluating collaborative filtering algorithms. We use the version (Movielens-ml-latest-small) that contains 100,836 ratings from 610 users on 9,724 movies [31], [32].
To ensure data quality, we retained users and items with at least ten interactions [33], [39]. We divided each user's historical sequence $\mathcal{S}^u$ into three parts: (1) the most recent action for testing, (2) the second most recent action for validation, and (3) all remaining actions for training. Table 1 shows the data statistics: Movielens is the densest dataset, while Amazon-games and Book-crossing are sparse.

2) EVALUATION METRICS
To evaluate the effectiveness of top-N recommendation, we adopt two widely used evaluation metrics [8], [34], [43]: Recall@N and Normalized Discounted Cumulative Gain (NDCG@N). Recall@N measures the fraction of relevant items retrieved in the top-N recommendations out of all relevant items, while NDCG@N evaluates top-N ranking quality, rewarding relevant items placed higher in the list. By default, we set N = 10.
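For reference, the two metrics can be computed per user as in the sketch below (a standard binary-relevance formulation, not the exact evaluation code used in this paper):

```python
import numpy as np

def recall_at_n(ranked, relevant, N=10):
    """Fraction of a user's relevant items that appear in the top-N list."""
    return len(set(ranked[:N]) & set(relevant)) / len(relevant)

def ndcg_at_n(ranked, relevant, N=10):
    """DCG of the top-N list divided by the ideal DCG (binary relevance)."""
    rel = set(relevant)
    dcg = sum(1.0 / np.log2(r + 2)
              for r, item in enumerate(ranked[:N]) if item in rel)
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(len(rel), N)))
    return dcg / idcg if idcg > 0 else 0.0
```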

3) BASELINES
To demonstrate effectiveness, we compare our proposed SASGCN with several state-of-the-art recommendation methods:
1) NGCF [2]: A GCN-based recommendation model that learns the embedding vectors of users and items. By stacking multiple GCN layers to capture the collaborative signal in high-order connectivities, it predicts users' degree of interest in items.
2) LightGCN [3]: A method that removes the feature transformation and nonlinear activation functions of NGCF, reducing unnecessary computation and improving the efficiency and accuracy of recommendation, while also giving the model better generalization capability.
3) Caser [15]: A CNN-based method that captures high-order Markov chains by applying convolutional operations to the embedding matrix of the L most recent behaviors, achieving strong sequential recommendation performance.
4) SASRec [6]: A method that uses self-attention to capture "context" in the sequence, instead of the MC and RNN used in traditional sequential recommendation models. Since the self-attention block is amenable to parallel acceleration, SASRec efficiently achieves state-of-the-art recommendation performance.
5) STOSA [16]: A stochastic self-attention sequential model for modeling dynamic uncertainty and capturing collaborative transitivity. It introduces a novel regularization term into the BPR loss to guarantee a large distance between the positive item and sampled negative items.
6) BERT4Rec [22]: A method that models user behavior sequences with a bidirectional self-attention network and introduces the Cloze task, which predicts masked items using both left and right context.

4) PARAMETER SETTINGS
For fair comparison, we implement all models in PyTorch with the Adam optimizer [27]; the learning rate is set to 0.001, the batch size to 32, and the dropout rate to 0.5. The maximum sequence length n is set to 200 for the Movielens dataset and 50 for the other two datasets. We consider the latent dimension d from {32, 50, 64, 128}.

B. PERFORMANCE COMPARISON (RQ1)
Table 2 presents the performance comparison results (with d = 50), and we have the following observations. NGCF and Caser perform relatively poorly across all datasets, while LightGCN improves recommendation performance through its lightweight modifications to GCN. SASRec, STOSA, and BERT4Rec perform distinctly better than Caser, suggesting that the self-attention mechanism is a more powerful tool for sequential recommendation. STOSA employs stochastic self-attention and BERT4Rec adopts bidirectional self-attention, and both outperform SASRec, which uses plain self-attention.
SASGCN not only combines the advantages of LightGCN and SASRec, introducing lightweight graph neural networks and self-attention into sequential recommendation, but also takes into account users' rating information on items. The results show that SASGCN performs best among all methods on all three datasets in terms of all evaluation metrics, achieving an average improvement of 5.1% in Recall@10 and 3.2% in NDCG@10 over the strongest baselines. We conduct t-tests, and p-values < 0.02 indicate that the improvements of SASGCN over the strongest baseline are statistically significant.
SASGCN performs worse on the Games and Books datasets than on the Movielens dataset. This is due to the different levels of sparsity across the datasets: Books and Games are sparser, while Movielens is denser. These results further demonstrate that integrating GCN and auxiliary information into sequential recommendation can effectively mitigate data sparsity.

C. PARAMETER IMPACT (RQ2)
In Figure 2, we analyze the effect of the latent dimension d by showing NDCG@10 of SASGCN with d from {32, 50, 64, 128}. Our model typically benefits from a larger latent dimension; on all datasets it achieves satisfactory performance with d = 128.
In Figure 3, we compare the impact of different numbers of attention heads on model performance with d = 128. Only the performance on the Movielens dataset improved when using two heads, while performance on the Games and Books datasets decreased. A likely explanation is that multi-head attention divides the embedding into h subspaces (each of dimension d/h) to capture more diverse feature information from different subspaces. Because the datasets have different sparsity levels, the information richness of the embeddings varies. For sparse datasets, only some subspaces contain effective information while the rest do not; in this case, multi-head attention cannot capture more diverse features and instead reduces the efficiency and accuracy of information capture.

D. ABLATION STUDY (RQ3)
To assess the impact of each component of our model on performance, we conducted an ablation study analyzing each component separately. Table 3 presents the performance of our default method and its variants on all three datasets (with d = 50). We introduce these variants and analyze their effects:
(1) SASGCN w/o auxiliary information (ratings): Without user ratings, only the existence of user-item interactions is considered, as in Caser and LightGCN; the weights of items a user has interacted with are set to 1, and 0 for the rest. This variant performs the best among all variants, indicating that auxiliary information has the smallest impact on the model architecture, yet it still performs worse than the full SASGCN.
(2) SASGCN w/o nonlinearity: We replace the ReLU activation in the feed-forward network with the identity function. This variant performs worse than (1), indicating that the nonlinear activation in the feed-forward network helps the model capture nonlinear relationships within sequences and improves modeling accuracy.
(3) SASGCN w/o SA: Removing self-attention has a significant impact on the dense Movielens dataset. When the information is richer, computing a different weight for each position in the sequence helps capture long-term dependencies.
(4) SASGCN w/o GCN: Only self-attention is retained. Without graph embedding propagation, higher-order collaborative signals cannot be captured. The results show that GCN captures richer embedding signals and achieves better recommendation performance, especially on sparse datasets.

V. CONCLUSION AND FUTURE WORK
In this work, we propose a Self-Attention based Sequential recommendation with Graph Convolutional Networks, SASGCN. It incorporates ratings as auxiliary information into the user-item interaction graph, and exploits high-order connectivities through multiple GCN layers to learn item embeddings. To capture long-term dependencies in the sequence, we introduce position encoding and a self-attention mechanism. SASGCN supports parallel computation, thus achieving high efficiency and accuracy. Extensive empirical results on two sparse datasets and one dense dataset show that our model outperforms state-of-the-art baselines. In the future, we plan to introduce richer, non-conflicting auxiliary information (e.g., behavior types, social relationships, and item tags) into the user-item interaction graph, or to express the relationships between users and items as a knowledge graph [36], to learn more informative embeddings. Furthermore, we are interested in exploring bidirectional self-attention blocks for pre-training to capture the dynamic changes in user interests more accurately [28], and even reversely pre-training on generated data to alleviate data sparsity [37], [38], [40].

FIGURE 1. The framework of the proposed SASGCN model. Item embedding vectors are obtained through the rating-augmented user-item interaction graph; after multiple layers of GCN propagation combined with the initial vectors, the representations pass through a self-attention mechanism to capture contextual information before the final prediction.

TABLE 2. Model performance. The best performing method in each row is boldfaced, and the second best is underlined.


TABLE 3. Ablation analysis on three datasets.