Introduction
In our daily lives, services from big tech companies, such as YouTube and Amazon, have become indispensable. It is hard to find a day on which services such as music and e-commerce go unused; YouTube, for example, is also a mobile app used by many users. A closer look at these services shows that they recommend the next video or item a user might like based on that user’s individual purchase and search patterns or the list of items searched. Accordingly, one commonality of these big tech companies is that they actively use “recommender systems” across their services. The scope of recommender systems has therefore expanded to include entertainment (movies, music, IPTV, etc.), content (personalized newspaper recommendations, e-mail filtering, web page recommendations, etc.), e-commerce, and services (travel service recommendations, expert recommendations, real estate, etc.) [1]. As recommender systems have become integrated with everyday life and applied in many ways, there have been numerous efforts to study them systematically. There are various views on recommender systems; however, the basic idea is to infer customers’ interests from various data sources and to suggest good future choices to users based on past interactions between users and items [2].
When did recommender systems begin to be used in earnest? A book-related recommender system was proposed as early as 1979, but recommender systems became widely known only in the 1990s. Early models, such as collaborative filtering, were proposed in the early 1990s, and recommender systems then began to be applied in earnest in e-commerce, the industry most closely tied to them. From 2003 to 2006, with the growth of available data and the combination with big data, the scope of recommendation increased considerably. Around 2010, deep learning-based models began to be actively applied to the recommender system field, and big tech companies such as Google and YouTube began to lead this development direction [3], [4]. When recommender systems were first developed, their role was elementary, so they could not keep up with users’ preferences and dynamics. As users’ diverse and complex preferences and dynamics came to be considered in recommendation, concepts such as personalized recommendation have become a trend and a prominent development direction in the recommender systems field.
Figure 1 shows the sequence-aware recommender system originally proposed by Quadrana et al., one of the initial theoretical attempts to organize recommendation around temporal flow, which is one of the principal factors determining recommendations in a general recommendation situation. The input shown in Figure 1 is defined as an interaction log: as time passes, the User ID, Action Type, and Purchased Item are captured, and together they form a sequence [5]. User-item interactions, one of the important factors in a general recommendation situation, are difficult to comprehend without explicit consideration of the temporal factors presented in Figure 1. This leads to the question of why sequential recommendations are important in the recommendation field.
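To make the interaction log concrete, the following minimal Python sketch groups a flat log into one time-ordered sequence per user. The field names (`user_id`, `action_type`, `item_id`, `timestamp`), the action labels, and the sample rows are illustrative assumptions and are not taken from Quadrana et al.

```python
from collections import defaultdict
from typing import NamedTuple


class Interaction(NamedTuple):
    """One row of a hypothetical interaction log (field names are assumptions)."""
    user_id: str
    action_type: str  # e.g. "view" or "purchase" (illustrative labels)
    item_id: str
    timestamp: int    # any value that increases with time


def build_sequences(log):
    """Group a flat log by user and order each user's actions by time."""
    per_user = defaultdict(list)
    for row in log:
        per_user[row.user_id].append(row)
    return {
        user: [(r.action_type, r.item_id)
               for r in sorted(rows, key=lambda r: r.timestamp)]
        for user, rows in per_user.items()
    }


# Made-up sample log, deliberately stored out of time order.
log = [
    Interaction("u1", "view", "shoes", 3),
    Interaction("u2", "view", "phone", 1),
    Interaction("u1", "view", "socks", 1),
    Interaction("u1", "purchase", "socks", 2),
]
sequences = build_sequences(log)
```

Each value in `sequences` is the kind of ordered input a sequential recommender would consume; without the timestamp ordering, the same rows would only form an unordered set of interactions.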
As an answer to this question, Wang et al. pointed to a deep relationship between temporal dependence and the interaction between users and items and explained that modeling temporal factors is the best approach for understanding users [6]. Regarding the temporal and sequential recommender models mentioned here, Kang and McAuley proposed a temporal recommender model that considers only temporal factors and a sequential recommender model that considers the user’s behavior for a more comprehensive recommendation [7]. Wu et al. presented collaborative filtering, content-based, and temporal/sequential recommendation as the model fields in their overall classification of neural network-based recommender systems [8]. Zhang et al. emphasized the sequential modeling capabilities of deep learning-based recommender models and suggested solutions to important problems such as user behavior identification and item prediction [9].
Out of these various recommendation models, this paper deals with deep learning-based models that take a sequential recommendation approach. Sequential recommendation models can be classified into three kinds: 1) traditional sequence models such as sequential pattern mining and Markov chains, 2) latent representation models such as factorization machines and embedding, and 3) deep neural network models, ranging from basic neural networks such as RNN, CNN, and GNN to advanced models such as attention, memory, and mixture models. Among these, this paper focuses on basic neural networks such as RNN, CNN, and GNN [6]. As one approach to user behavior modeling (UBM), these deep learning models have the advantage of capturing dependency patterns in behavioral sequences, which helps in understanding item dependencies and correlations. The captured dependencies become increasingly complex and practical as the models evolve from RNNs with simple one-way dependencies to CNNs that can skip behavioral dependencies and GNNs that model multiple relations, which can broaden the scope of user behavior modeling and significantly improve performance by uncovering implicit feedback.
Finally, this paper mainly presents a comparison of deep learning-based sequential recommender models by categorizing and describing models based on their characteristics, and it is organized as follows. The introduction covers the definition and a brief history of recommender systems and briefly introduces sequential recommender models. The background section defines data and user-item interactions as key factors of a recommender system and relates them to the dynamics of deep learning, natural language processing, and sequential recommendation. Sections III and IV examine the existing RNN- and CNN-based systems and present GAN-, GNN-, transformer-, and self-supervised learning (SSL)-based models as novel deep learning-based sequential recommender systems, respectively. Section V briefly deals with model training cost, and Section VI presents the conclusion and future directions of this study.
Background
A. Key Factors of a Recommender System: Data and User-Item Interaction
Data used in recommendation settings include user, item, and user–item interaction data, of which user–item interaction data is particularly important because it reveals user preferences and interests. User–item interactions in recommender systems began to be discussed in earnest in collaborative filtering, which includes concepts such as latent factor models. According to Koren et al., the latent factor method is a framework for finding patterns in users’ rating scores for items and recommending on that basis. Moreover, the latent factor method is applied through matrix factorization to address the problem of data sparsity [11]. Examples of implicit feedback used in matrix factorization include the user’s purchase history and search patterns. Matrix factorization is characterized by incorporating implicit feedback about the user and recommending based on the interrelationship between the user and the item [12]. These attempts initially played a certain role in modeling the user-item relationship but gradually faced limitations. He et al. indicated that matrix factorization, a methodology of collaborative filtering, treats user-item interrelationships as linear information, making it difficult to process the interrelationships flexibly and comprehensively [13]. This limitation illustrates where deep learning can be effective in recommendation. Zhang et al. emphasized that nonlinear transformation, representation learning, sequence modeling, and flexibility are the strengths of a deep learning-based recommendation system [9].
Aggarwal noted that the interaction between users and items is important in a recommender system because the basic idea behind a recommender system is to infer individual customers’ interests using various data sources. Moreover, users’ future choices are often predicted from specific conditions, such as past interests and preferences [2]. For example, Steck et al. described Netflix’s application of deep learning, which emphasizes personalized service: the recommender service collects users’ personal data and reflects it in personalized profiles [14]. Netflix’s case shows that data and user-item interactions are factors that must be considered in a recommendation situation and that require a deep understanding.
Users and items are the basic components of a recommendation situation, and data flow between the two. User–item interaction data can be classified into explicit and implicit feedback. Explicit feedback is information in which the user directly evaluates an item, whereas implicit feedback is information from which the system indirectly detects the user’s tendencies through the user’s behavior [15]. In recommender systems, it is important to predict the next behavior from user preferences; therefore, recommender systems are often trained using implicit feedback. For example, Liu et al. emphasized that in general recommender systems, user preferences are learned from stored user behavior data, and items are sequentially recommended using implicit user feedback to improve recommendation quality [16].
B. Deep Learning-Based Models in Sequential Recommender Systems
In this section, we examine how the sequential recommendation field has evolved and how deep learning-based recommendation models have emerged within it. Among the many recommender models, the rise of sequential recommender systems (SRS) is clear. An SRS takes a sequence of user-item interactions as input and tries to predict the user-item interactions that may happen soon by modeling the complex sequential dependencies embedded in that sequence. Because an SRS models the interaction between ‘user’ and ‘item’ precisely, it is a recommender model that reflects the true meaning of ‘recommendation’. Below, we discuss SRS from two perspectives: theoretical and practical.
First, we look at the general learning task of SRS from a theoretical point of view. The learning task of SRS is difficult because the sequence structure is complex. The input is a sequence of user-item interactions, each of which is a triple consisting of the user, the user’s action, and the corresponding item. In the equation below, R is a list of items ordered by rank score, S is the sequence of user-item interactions, and f is a utility function that outputs the rank score of a candidate item; f can take various forms, such as a conditional probability or an interaction score. The generated recommendation list consists of the top-ranked candidate items, that is, those that maximize the utility value given the sequence of user-item interactions. In this way, SRS takes a sequence of user-item interactions as input and models the complex sequential dependencies embedded in it to predict subsequent user-item interactions that may occur soon [6].\begin{equation*} R=\arg\max f(S)\end{equation*}
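The ranking step R = arg max f(S) can be sketched in a few lines of Python. The utility used here is only a stand-in chosen for illustration (first-order transition counts, in the spirit of a Markov chain); it is an assumption and not the utility function of any specific model in [6]. The function names and the toy sequences are likewise hypothetical.

```python
from collections import Counter


def fit_transition_counts(sequences):
    """Count how often item b directly follows item a in the training sequences."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts


def recommend(sequence, candidates, counts, k=2):
    """Return the k candidates with the highest utility given the sequence S.

    Stand-in utility (assumption): how often the candidate followed the
    last item of S in the training data.
    """
    last = sequence[-1]

    def utility(item):
        return counts[(last, item)]

    return sorted(candidates, key=utility, reverse=True)[:k]


# Toy training sequences of item IDs (made up for illustration).
train = [["a", "b", "c"], ["a", "b", "d"], ["b", "c"]]
counts = fit_transition_counts(train)
top_items = recommend(["a", "b"], ["c", "d", "e"], counts, k=2)
```

Replacing `utility` with a learned scoring function, such as the output of a neural network, leaves the ranking step unchanged; only the way f is computed differs between model families.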
Many models have been proposed to interpret sequential recommendation situations. Table 1 distinguishes the advantages and disadvantages of traditional sequential models and deep learning-based sequential models. Traditional sequential recommender models such as Markov chains (MC) and matrix factorization (MF) used to be powerful tools for interpreting sequential recommendation situations, and interpretability is one of their strong points. However, obstacles such as data sparsity and scalability limit traditional models such as MC and MF. Researchers have therefore turned their attention to a new wave of sequential recommender systems (SRS): deep learning-based recommender models. Flexibility is one of the distinctive strong points of deep learning-based recommender models. Such a model employs various forms of data for learning, including users’ implicit and explicit feedback; it uses data augmentation for data processing; and explicit user representations can be added during model training. A deep learning-based recommender model can thus be adapted through these methods to interpret sequential recommendation situations.
Next, we look at how a deep learning-based recommender model works in terms of its data pipeline. This makes the difference between existing sequential recommender models and deep learning-based recommender models clear. Figure 2 shows the data pipeline of a deep learning-based recommender model, which is divided into training and testing. In a nutshell, deep learning-based recommender models accept various data with and without labels and can improve learning efficiency through data preprocessing steps such as data augmentation and through training techniques such as incorporating attention mechanisms, combining with traditional models, and adding explicit user representations [17].
Additionally, Wang et al. noted that as user-item interactions become more diverse and complex in sequential recommendation situations, new considerations arise that must be addressed [6]. For this purpose, approaches have been proposed that model and predict from the user’s previous interaction history, similar to modeling sentences or word sequences (language models) in the field of natural language processing (NLP). The field of recommendation has a strong correlation with NLP in that both are based on human language and behavior. Moreover, as deep learning has recently been applied to both NLP and recommendation, many models derived from NLP are being applied to recommendation, and many studies on the academic convergence of NLP and recommendation are appearing. Moreira et al. showed that NLP has developed greatly with the development of deep learning [18], and Wu et al. mentioned that neural network-based approaches for unstructured data from other fields, such as NLP, could be developed further by applying them to the recommendation field [8]. In addition, the essence of sequential recommendation is to identify and predict the user’s long-term and short-term preferences from sequential information between the user and the item and to find sequential patterns in the user’s past behavior in order to recommend the next item. Aggarwal introduced dependency, a concept from NLP, and showed that future behavior can be predicted through correlations between individual items [2].
Survey of Traditional Systems
A. Recurrent Neural Network (RNN)
1) Concept, Architecture, Performance, and Application
An RNN-based sequential recommender system models the sequential dependencies in a series of past user-item interactions to predict the next interaction, as suggested by Wang et al. [6]. Related studies on capturing dependencies sequentially are underway in various fields. For example, in the field of automatic speech recognition, Baskar et al. [19] proposed a residual memory neural network (RMN) that models short-term dependencies using a deep feed-forward layer, and in the field of time series such as stock price prediction, Lai et al. [20] proposed a Long-term and Short-term Time-series network (LSTNet) that extracts short-term local dependency patterns between variables and discovers long-term patterns for time series trends. In addition, research on capturing sequential dependencies has been actively conducted in areas such as video summarization [21] and EEG emotion recognition [22]. Figure 3 illustrates the architecture of an RNN, in which data are cycled while previous information is accumulated onto the current information through the internal recurrent structure [23]. An RNN has inputs, outputs, and weights for each time step, and there are interrelationships between inputs and outputs.\begin{align*} S_{t}=&f_{W}(S_{t-1}, X_{t})\\&\qquad \downarrow \\ S_{t}=&\tanh (W_{hh} S_{t-1}+ W_{xh} X_{t})\\ O_{t}=&W_{hy} S_{t}\end{align*}
The RNN receives the input value, and at the same time as the result is output upward to other layers, the value from the hidden layer is fed back recursively as input for the next time step, so that the information is constantly updated. Therefore, it can be used to learn time-dependent or sequential data and is used in various fields such as language modeling and machine translation. An advantage of RNNs is that they can handle sequential information regardless of the input values, and a disadvantage is that they cannot process information flexibly.
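The recurrence S_t = tanh(W_hh S_{t-1} + W_xh X_t) with output O_t = W_hy S_t can be traced step by step with the minimal pure-Python sketch below. The weight values and the two-step input are made up for illustration, and bias terms are omitted to match the equations in this section; this is not a trained recommender.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, W_xh, W_hh, W_hy, s0):
    """Apply the recurrence to every input and return all outputs and the final state."""
    state = s0
    outputs = []
    for x_t in inputs:
        pre = [a + b for a, b in zip(matvec(W_hh, state), matvec(W_xh, x_t))]
        state = [math.tanh(p) for p in pre]   # S_t
        outputs.append(matvec(W_hy, state))   # O_t
    return outputs, state


# Illustrative weights: hidden unit 0 reads the input, hidden unit 1 copies
# the previous value of unit 0, so the earlier input survives one more step.
W_xh = [[1.0], [0.0]]
W_hh = [[0.0, 0.0], [1.0, 0.0]]
W_hy = [[1.0, 1.0]]
outputs, final_state = rnn_forward([[1.0], [0.0]], W_xh, W_hh, W_hy, [0.0, 0.0])
```

In the second step the input is zero, yet the output is nonzero because the hidden state still carries the first input. This is the accumulation of previous information that the recurrent structure provides.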
2) Comparison of Recurrent Neural Network Based Models
- The dynamic recurrent basket model (DREAM) [24] considers only one type of behavior, namely purchasing an item; it does not consider actions such as clicking on an item, nor the case in which buying and clicking occur together. DREAM’s main characteristic is that it uses sequential data to form an integrated understanding of the user’s purchase preferences or of frequently recommended items on the next visit.
- Context-aware recurrent neural networks (CA-RNN) [25] replace the constant input and transition matrices of the recurrent neural network with adaptive context-specific input and transition matrices to reflect external situations and the time intervals between behaviors in a sequence. Bayesian personalized ranking (BPR) was used to train the model, which rests on a theoretical foundation combining sequential information and contextual data. A CA-RNN is characterized by its flexibility in handling the time intervals between related behaviors across the entire sequence, moving beyond the conventional recurrent neural network approach.
- Hierarchical periodic memory network (HPMN) [26] is the first model to attempt long-term predictions by considering long- and short-term actions together. Rather than focusing only on the user’s recent actions, it is based on the user’s sequence of actions. Memory-augmented networks from the NLP field are extended to multiple scales as the core mechanism of the model, resembling the human memory process, for user prediction. The HPMN is characterized by being effective in sequential behavior modeling for latent factor models that can uncover the characteristics hidden in user and item data.
B. Convolutional Neural Network (CNN)
1) Concepts, Architecture, Performance, and Application
A CNN is a neural network specialized for processing data such as time series and image data; it uses convolution instead of general matrix multiplication. Because it effectively recognizes and emphasizes the characteristics of adjacent regions of an image, it can take images as input without losing spatial or regional information. Owing to these advantages, researchers are actively working on applying CNNs to image data in various fields. For example, Hsu and Lin [27] proposed a CNN that finds features for an input image set, starting from an initial model pre-trained on the ImageNet dataset, with k samples selected and extracted as initial cluster centroids, and Luo et al. [28] proposed HSI-CNN, which processes one-dimensional data such as hyperspectral image data in fields such as hyperspectral image classification. In addition, CNNs are being actively studied in areas such as speech emotion recognition [29], signal analysis [30], and object detection and recognition for intelligent systems [31].\begin{equation*} y^{j}=f_{acti}\left ({b^{j}+\sum _{i} \boldsymbol {w}^{ij}\ast x^{i}}\right)\end{equation*}
As in the equation above, a single layer, as opposed to a function integrated in the convolution layer, is regarded as
Figure 4 shows the architecture of a CNN, which is largely composed of a feature extraction part and a classification part. In the feature extraction part, each layer of the network receives the output of the previous layer as input and passes its output as input to the next layer. At the lower and middle levels of the network, even-numbered layers perform convolution operations and odd-numbered layers perform max pooling operations. Here, the convolution layer checks how strongly each part of the image data matches the pattern being compared, and the pooling layer reduces the amount of computation by compressing the input data [33]. In computer vision, there are many popular datasets such as ImageNet, CIFAR-10, and CIFAR-100, and performance is compared through tasks such as image classification. Well-known CNN models include ResNet, VGG, and DenseNet, and on the ImageNet image classification task, CNN models (e.g., RevCol-H) reach an accuracy of 90.0%, a small gap from the 91.1% of BASIC-L models. In terms of performance, Transformer-based models show good results in image classification, and among them are models that combine a ConvNet with a Transformer (e.g., BASIC-L, Model soups), which shows that CNNs remain adaptable in terms of performance [34].
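The convolution-layer equation and the pooling step described above can be illustrated with a one-dimensional, pure-Python sketch. Several choices here are assumptions made for illustration: ReLU stands in for the activation f_acti, the "convolution" is implemented as the sliding dot product (cross-correlation) that deep learning libraries typically use, and the kernel and input values are made up.

```python
def conv1d_layer(x, w, b):
    """Compute y[j] = f_acti(b[j] + sum_i w[j][i] * x[i]) for 1-D channels.

    x: list of input channels, w[j][i]: kernel from input channel i to
    output channel j, b[j]: bias of output channel j.
    """
    kernel_size = len(w[0][0])
    out_len = len(x[0]) - kernel_size + 1
    y = []
    for j, kernels in enumerate(w):
        channel = []
        for t in range(out_len):
            total = b[j]
            for i, kernel in enumerate(kernels):
                total += sum(k * x[i][t + d] for d, k in enumerate(kernel))
            channel.append(max(0.0, total))  # ReLU as f_acti (assumption)
        y.append(channel)
    return y


def max_pool1d(channel, size=2):
    """Keep the maximum of each non-overlapping window of the given size."""
    return [max(channel[t:t + size])
            for t in range(0, len(channel) - size + 1, size)]


# One input channel and one output channel; the kernel [-1, 1] responds to increases.
x = [[1.0, 3.0, 2.0, 5.0, 4.0]]
w = [[[-1.0, 1.0]]]
b = [0.0]
features = conv1d_layer(x, w, b)
pooled = max_pool1d(features[0], size=2)
```

The convolution output marks where the input matches the kernel's pattern (a rise between neighboring values), and pooling halves the length while keeping the strongest responses, which mirrors the division of labor between the two layer types.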
2) Comparison of Convolutional Neural Network Based Models
- Three-dimensional CNNs (3D-CNNs) [35] have attained higher accuracy in the e-commerce field for predicting items that users want to add to their shopping carts, as demonstrated in the reported experiments. In addition, the model can capture the correlation between the conceptual elements that constitute sequential recommendation situations, such as items, sessions, and clicks. A 3D-CNN is characterized by its flexibility in modeling session data through character-level encoding and through 3D CNNs that model spatial and visual information simultaneously.
- ConvolutionAl sequence embedding recommendation (CASER) [36] addresses top-N sequential recommendation and is broadly based on the user’s latest behavioral information. Its purpose is to learn the sequential patterns in a sequence by treating the embedded item sequence as an “image”. It is also significant in that it can identify sequential patterns between items at the level of sets of items. CASER is characterized by its ability to capture various sequential patterns.
- NextItNet [37] is a simple and efficient convolutional generative model for session-based top-N item recommendation, in which one-dimensional dilated convolution filters are used to model long-term dependencies within user-item interactions. For example, the history of a user’s past item orders is treated as one sequence and converted into a latent matrix. This matrix serves as the item embedding, to which the convolution and pooling structures of a convolutional neural network are applied. In addition to dilated convolution, NextItNet uses the concept of residual learning.
Extending Capabilities of Traditional Systems
A. Generative Adversarial Network (GAN)
1) Concept, Architecture, Performance, and Application
GAN is a deep learning approach introduced by Ian Goodfellow and colleagues in 2014 [38]. As the name “generative adversarial network” implies, it is an unsupervised deep learning model in which two different neural networks (a generator and a discriminator) compete against each other in a zero-sum game to generate data that resembles real-world data. These models are popular because they learn by producing images or speech directly without human intervention. Owing to these advantages, researchers in various fields have been focusing on the min-max game in GANs. For example, Zehni and Zhizhen [39] studied the recovery of image and projection angle distributions in tomographic reconstruction using an unsupervised adversarial learning approach, and Lee and Choi [40] proposed a GAN for image generation that captures the spatiotemporal dependence of a given data distribution in video images using natural language and its associated spatial properties. In addition, there are active research efforts focusing on the min-max game of GANs in fields such as speech recognition [41] and biotechnology [42].
Figure 5 shows the concept of the GAN model, and the equation below shows that the generator and the discriminator play a min-max game over the value function V(D, G). In the case of image generation, the generator starts with Gaussian noise and generates an image, the discriminator judges how realistic the image is, and the process is repeated until the generated output is close to the true input sample [43].\begin{align*}\min _{G} \max _{D} V(D, G)={}&\mathrm{E}_{x \sim P_{\text{data}}(x)}[\log D(x)] \\&+\mathrm{E}_{z \sim P_{z}(z)}[\log (1-D(G(z)))]\end{align*}
Looking at the equation above more specifically, [log D(x)] is the discriminator’s result on the real data and [log(1-D(G(z)))] is the discriminator’s result on the data produced by the generator. If we label real data 1 and fake data 0, then to maximize the objective function, the discriminator should push its output on real data toward 1 and its output on generated data toward 0. Conversely, to minimize the objective function, the generator is trained to fool the discriminator, that is, to produce data for which the discriminator’s output is close to 1 [44].
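The behavior of the two terms can be checked numerically with a small Python sketch that estimates V(D, G) from discriminator outputs. The probability values below are made up for illustration; no networks are trained, and the function name `value_function` is a hypothetical helper.

```python
import math


def value_function(d_real, d_fake):
    """Sample estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well (illustrative outputs).
v_sharp = value_function(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# The same discriminator after the generator has learned to fool it.
v_fooled = value_function(d_real=[0.9, 0.8], d_fake=[0.9, 0.8])
# A discriminator that cannot tell the two apart outputs 0.5 everywhere.
v_chance = value_function(d_real=[0.5], d_fake=[0.5])
```

The value is highest when the discriminator is right on both kinds of data and drops when generated samples receive outputs near 1, which is the direction the generator pushes in. When the discriminator outputs 0.5 everywhere, the estimate equals -log 4.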
2) Comparison of Generative Adversarial Network Based Models
- Recurrent generative adversarial network (RecGAN) [45] is a model proposed to improve recommendation performance by learning the temporal latent features of users and items, building on the recurrent recommender network (RRN) and information retrieval GAN (IRGAN) models. RecGAN is characterized by its ability to learn temporal latent features of users and items and to model users and items at a fine granularity, using both the time-series modeling capability of the recurrent neural network and the latent feature modeling capability of the generative adversarial network.
- Multifactor generative adversarial network (MFGAN) [46] is a model proposed to accurately model contextual information, which is a crucial factor for understanding user–item inter-relationships in sequential recommendations. MFGAN is characterized by being able to grasp which factors contribute to the overall recommendation decision over time, and flexibly recognize the multiple information of related factors.
- Adversarial organic sequential learning for sequential recommendation (AOS4Rec) [47] considers Seq2Seq auto-regressive learning in NLP. Moreover, it is a model with transition dependencies and behavioral continuity at the item and sequence levels. AOS4Rec is characterized by solving optimization issues using models, such as WGAN, for fast and stable learning.
B. Graph Neural Network (GNN)
1) Concept, Architecture, Performance, and Application
The concept of Graph Neural Networks (GNNs) was solidified in a 2009 paper by Gori, Scarselli, and others [48]. GNNs are deep learning-based neural networks that have recently gained much attention because of their ability to analyze graph data directly. Many research works in various fields focus on information propagation in GNNs. For example, Gao and Xu [49] proposed a framework for video image classification that considers more diverse relationships and aspects, and Zhou et al. [50] proposed an Inductive Graph Transformer (IGT) to predict package delivery time using raw feature information and structural graph data. In addition, there are active research works focusing on information propagation in GNNs in fields such as image processing [51] and video image classification [52]. A GNN can make predictions at the node, edge, and graph levels and has been actively used in areas such as social networks, molecular graphs, and Euclidean spaces. A graph, the basis of a GNN, is a data structure consisting of a set of vertices (nodes, drawn as points) and a set of edges (links) that represent the interactions between entities [53]. In the recommendation situation, the interrelationship between users and items is stated explicitly as a graph that connects users and items. Each node is a user or an item, and a user’s preference appears through the edges. Using graphs of core and peripheral interests makes it easier to distinguish the interaction history of each user [54]. Core interest nodes have a higher degree than peripheral interest nodes and connect more similar interests.
As the frequency of similar interests increases, the subgraph becomes denser and larger and persists; a prior framework is constructed from the core interests of the same user, in which neighboring nodes form dense subgraphs [55].\begin{equation*} h_{u}^{(k)}=\sigma \left ({W_{\text {self}}^{(k)} h_{u}^{(k-1)}+W_{\text {neigh}}^{(k)} \sum _{v\in N(u)}h_{v}^{(k-1)}+b^{(k)}}\right)\end{equation*}
As in the equation above, multiple layers of separation between parameters, embeddings, and dimensions are used to accomplish this with the UPDATE and AGGREGATE functions. Although the inclusion of a bias term,
The basic principle of GNNs can be understood through a series of information cycles called message passing. Figure 6 illustrates the concept of a Message Passing Neural Network (MPNN), the most basic GNN framework, which updates the state of a node using information from the node’s neighbors. It consists of a message passing phase, which aggregates and updates information from neighbors, and a readout phase, which derives results. To obtain the information of node A in Figure 6(a), the information of neighboring nodes B, C, and D must be aggregated as shown in Figure 6(b); in turn, the messages of B, C, and D are obtained by aggregating the information of their own neighbors, namely A and C for B; A, B, and E for C; and A for D. This process captures the structure of the graph. The advantage of MPNN is that both structural and feature information of the graph can be obtained through this process [57].
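One round of message passing followed by a readout can be sketched in pure Python on the five-node example described above. The adjacency lists follow the neighbor sets given in the text (E is taken to be connected only to C); the scalar node features, the unit weights, the zero bias, the ReLU nonlinearity, and the sum readout are assumptions chosen to keep the arithmetic easy to follow.

```python
# Adjacency taken from the neighbor lists in the text; E's only neighbor is C.
GRAPH = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B", "E"],
    "D": ["A"],
    "E": ["C"],
}


def message_passing_step(graph, h, w_self=1.0, w_neigh=1.0, bias=0.0):
    """Update every node from its own state and the sum of its neighbors' states."""
    new_h = {}
    for node, neighbours in graph.items():
        aggregated = sum(h[v] for v in neighbours)            # AGGREGATE
        pre = w_self * h[node] + w_neigh * aggregated + bias
        new_h[node] = max(0.0, pre)                           # UPDATE (ReLU assumed)
    return new_h


def readout(h):
    """Graph-level summary: here simply the sum of all node states (assumption)."""
    return sum(h.values())


# Scalar features made up for illustration.
h0 = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 5.0}
h1 = message_passing_step(GRAPH, h0)
graph_summary = readout(h1)
```

After one round, node A reflects B, C, and D; applying `message_passing_step` a second time would also bring E's feature into A through C, which is how repeated rounds spread information along the graph structure.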
2) Comparison of Graph Neural Network Based Models
We examine how GNN-based recommendation models deal with sequence modeling through the following three representative models.
- Relational temporal attentive graph neural network (RetaGNN) [58] is a deep learning-based model that predicts the next items and proposes holistic sequential recommendation. One thing the model learns is the mapping of user–item interactions in a local graph. Built on local graph patterns that include relationships between users, items, and other related attributes, it represents high-level user-item interactions through sequential modeling.
- SURGE [59] is a model constructed from a novel perspective. It uses graph neural networks to address existing problems of sequential recommendation, such as noise within the user behavior sequence and the difficulty of identifying user preferences. Graph pooling techniques are used to identify user preferences, and graph convolution serves as the underlying neural network of the model.
- TGSRec [60] proposes temporal embeddings of nodes through a temporal collaborative transformer layer that simultaneously considers collaborative signals from both users and items, as well as temporal dynamics within sequential patterns. This allows temporal elements of user–item interactions to be handled flexibly in sequential recommendation.
C. Transformer
1) Concept, Architecture, Performance, and Application
Transformers, and in particular their self-attention mechanism, have been the focus of recent research in various fields. For example, Berg et al. [61] proposed the Keyword Transformer (KWT), which achieves state-of-the-art performance on several keyword spotting tasks in speech recognition without pre-training or additional data, and Li et al. [62] proposed UniFormer, which integrates the advantages of convolution and self-attention in a transformer format to improve the efficiency of image and video recognition. Research focusing on self-attention in transformers has also been actively conducted in fields such as speech recognition [63] and object recognition [64], and the transformer has recently been applied widely in computer vision and natural language processing. Unlike RNN-like algorithms, which receive inputs sequentially and thereby reflect the position of each input, the transformer is a purely attention-based neural network model that can be parallelized because it takes the whole sequence as input at once. Although it does not use an RNN, it maintains the encoder-decoder structure of a conventional sequence-to-sequence model, receiving the input sequence in the encoder and producing the output sequence from the decoder. The differences are that the encoder and decoder are each stacks of L identical blocks and that the input itself carries no ordering information, which is why positional information must be added explicitly.
Figure 7 gives an overview of the architecture of the vanilla transformer model. In the encoder, the input data is first mapped to N-dimensional vectors in the embedding stage. To convey position information, the transformer then applies positional encoding: values generated by sine and cosine functions encode the position of each element in the sequence and are combined with the embedded data to form the input to the next layer. This input then passes through the encoder stage, which consists of Multi-head Self-Attention, Add & Normalize, and Position-wise FFNN modules. Multi-head Self-Attention, the core principle of the Transformer, calculates attention scores over all the vectors that enter the encoder.
\begin{align*} \mathrm {Attention}\left ({\mathbf {Q},\mathbf {K},\mathbf {V} }\right)=&\mathrm {softmax}\left ({\frac {\mathbf {Q}\mathbf {K}^{\mathrm {T}}}{\sqrt {D_{K}}} }\right)\mathbf {V}=\mathbf {A}\mathbf {V} \\ \mathrm {MultiHeadAttn}\left ({\mathbf {Q},\mathbf {K},\mathbf {V} }\right)=&\mathrm {Concat}\left ({{\mathrm {head}}_{1},\cdots ,{\mathrm {head}}_{H} }\right)\mathbf {W}^{O} \\ \mathrm {where}\ {\mathrm {head}}_{i}=&\mathrm {Attention}\left ({\mathbf {Q}\mathbf {W}_{i}^{Q}, \mathbf {K}\mathbf {W}_{i}^{K},\mathbf {V}\mathbf {W}_{i}^{V}}\right)\end{align*}
For this purpose, each input is multiplied by learnable weight matrices, as shown in the equation above, to create the Query (Q), Key (K), and Value (V) vectors of the attention function. The attention matrix is then computed with scaled dot-product attention: the current query vector is compared with all key vectors to obtain scores for how well they match, and these scores weight the value vectors. The Decoder differs from the Encoder in two ways: its first Multi-head Self-Attention is masked, and its second attention module uses encoder-decoder attention instead of self-attention to predict the output. Masking is needed because the Decoder must not see future data when predicting the current position, so the attention scores of future positions are set to a very large negative value, which drives their weights to almost zero after the softmax. The encoder-decoder attention takes the output of the Decoder’s masked attention and the matrix produced by the Encoder as inputs. The completed Decoder block is stacked in M layers, and finally a Dense layer and a Softmax layer are added to complete the Transformer [65].
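The positional encoding and masked attention steps described above can be sketched in a few lines. This is a minimal single-head illustration, not the implementation of any cited model: the base 10000 in the positional encoding follows the original Transformer formulation, while the sequence length, dimensions, and random projection matrices are hypothetical. Multi-head attention would run several such heads in parallel and concatenate their outputs.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """Sine on even dimensions, cosine on odd dimensions (d_model assumed even)."""
    positions = np.arange(seq_len)[:, None]
    rates = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(positions * rates)
    encoding[:, 1::2] = np.cos(positions * rates)
    return encoding


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def scaled_dot_product_attention(q, k, v, causal=False):
    """Return (A V, A) with A = softmax(Q K^T / sqrt(D_K))."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    if causal:
        # Decoder-style mask: position i may not attend to positions j > i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)
    weights = softmax(scores)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # hypothetical sizes
pe = sinusoidal_positional_encoding(seq_len, d_model)
x = rng.normal(size=(seq_len, d_model)) + pe  # embedded input plus position information
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v, causal=True)
```

With `causal=True`, the attention matrix is lower triangular, so each position is computed only from itself and earlier positions.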
2) Comparison of Transformer Based Models
The emergence of sequential recommendation has opened a new direction beyond the areas on which recommender systems had focused, such as collaborative filtering and ranking [66]. Although it is difficult to use time information to improve recommendation performance, deep learning technology from the NLP field is used to deal with sequential data [67]. SASRec, which is based on the transformer architecture, achieved state-of-the-art (SOTA) results in sequential recommendation and was inspired by the transformer model from the neural machine translation field [7].
- BERT4Rec [68] uses deep bidirectional self-attention to model user behavior sequences. It adopts the Cloze task for sequential recommendation, predicting the masked items in a sequence by conditioning on both their left and right contexts. A bidirectional representation model is learned in this manner, allowing each item in a user’s past behavior to fuse information from both sides, which yields a strong representation of the user’s behavioral sequence.
- Transformers4Rec [18] incorporates additional user and item context information to improve recommendation performance. Moreover, because it is built on the HuggingFace Transformers library, it closes the gap between NLP and sequential/session-based recommendation.
- SSE-PT [69] proposes a novel neural network architecture called the personalized transformer, which presents better results for individual users than the state-of-the-art models it was compared with. It tends to pay more attention to recent items in long sequences and utilizes a regularization technique called stochastic shared embeddings (SSE). Experiments provide evidence that SSE regularization is essential for the success of personalization, and that the method can focus on each user’s recent engagement patterns, process exceedingly long sequences, and improve performance and speed.
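The Cloze objective used by BERT4Rec can be illustrated by how training inputs are prepared. The sketch below is a simplified illustration rather than the authors' code: the mask id `MASK_TOKEN`, the masking probability, and the item ids are hypothetical. Items are replaced by a mask token at random during training, and the model is asked to recover them from both sides of the sequence; at inference, a mask is appended to the end of the sequence so that the model predicts the next item.

```python
import random

MASK_TOKEN = 0  # hypothetical id reserved for the [mask] item


def cloze_mask(sequence, mask_prob, rng):
    """Return (masked_sequence, labels).

    labels holds the original item at masked positions and None elsewhere,
    so the loss is computed only where an item was hidden.
    """
    masked, labels = [], []
    for item in sequence:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(item)
        else:
            masked.append(item)
            labels.append(None)
    return masked, labels


def mask_last(sequence):
    """Inference input: append the mask token so the model predicts the next item."""
    return list(sequence) + [MASK_TOKEN]


rng = random.Random(0)
history = [11, 12, 13, 14, 15]  # hypothetical item ids in interaction order
masked, labels = cloze_mask(history, mask_prob=0.3, rng=rng)
inference_input = mask_last(history)
```

A unidirectional model such as SASRec would instead predict each item only from the items to its left.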
D. Self-Supervised Learning
1) Concept, Architecture, Performance, and Application
While supervised learning can show high performance, it is limited by the need for labeled data. To compensate, semi-supervised learning, which uses only partially labeled data, and unsupervised learning, which does not require data to be labeled at all, have been developed. Most recently, self-supervised deep learning, a subset of unsupervised learning that can learn complex patterns from unlabeled data, has gained attention and is being used in downstream tasks such as image classification and object detection [70].
Self-supervised learning has been studied in various fields. For example, Zhang et al. [71] proposed DialogueBERT, a novel contextual dialogue encoder based on the pre-trained language model BERT, which is widely used for dialogue understanding, and Stojnic and Vladimir [72] extensively analyzed the applicability of self-supervised learning in remote sensing image classification. In addition, research on self-supervised learning has been actively conducted in fields such as biotechnology [73], music [74], and speech recognition [75].
Figure 8 shows how self-supervised learning works: it consists of creating a pre-trained model and then solving a downstream task. The pre-trained model learns general features from large amounts of unlabeled data through a pretext task, in which a new problem is defined on the data itself so that the model improves its understanding of the data. The pre-trained model can then be transferred to solve the downstream task better and show high performance. Methodologically, self-supervised learning can be categorized into self-prediction and contrastive learning. Self-prediction refers to predicting one part of a sample from another part, and contrastive learning refers to learning to adjust the distance between representations by forming positive and negative pairs according to feature similarity, for example between images. However, self-supervised learning requires substantial computational power and has the limitation that it is not as accurate as supervised learning models [76].
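The idea of adjusting distances between positive and negative pairs can be made concrete with an InfoNCE-style loss, one common contrastive objective. The sketch below is an assumption for illustration and is not drawn from [76]: the function name, the temperature value, and the random embeddings are hypothetical. Row i of the two views forms a positive pair, and all other rows in the batch act as negatives.

```python
import numpy as np


def info_nce_loss(view_a, view_b, temperature=0.5):
    """Contrastive loss: row i of view_a and row i of view_b are a positive pair."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                       # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # cross-entropy on the diagonal


rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 16))                         # hypothetical embeddings
augmented = anchor + 0.01 * rng.normal(size=(8, 16))      # slightly perturbed views
unrelated = rng.normal(size=(8, 16))                      # views with no relation

loss_aligned = info_nce_loss(anchor, augmented)
loss_unrelated = info_nce_loss(anchor, unrelated)
```

The loss is lower when the two views of each sample are close to each other and far from the other samples, which is the behavior the pretext task rewards.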
2) Comparison of Self-Supervised Learning Based Models
Inferring high-quality user representations from past interactions is an essential issue in the sequential recommendation task, which currently plays a key role in the recommender system field. With the recent developments in deep learning, DNNs, including RNNs and attention mechanisms, have been applied in various fields of recommender systems and have achieved SOTA results. This line of work has subsequently been extended to GNNs, which can analyze more complex user–item relationships in a larger framework. However, these models commonly face data scarcity problems. Methods within SSL, such as data augmentation, help to address this problem, and SSL has recently become one of the most actively studied topics in the field of recommendation [77].
- CL4SRec [78] infers accurate user representations from the user’s interaction behavior alone and proposes three data augmentation approaches: cropping, masking, and reordering. Through its contrastive learning framework, it extracts user patterns and encodes user representations more effectively than simple next-item prediction.
- ICL [79] optimizes the sequential recommendation model with self-supervised learning and learns the distribution of users’ intentions from unlabeled user behavior sequences.
- DuoRec [80] targets the distribution of item embeddings, to which regularization can be applied implicitly through the recommendation task. Because conventional contrastive learning methods rely on data augmentation, and semantically consistent augmented samples can hardly be provided for user–item interaction sequences, it proposes a model-level augmentation based on dropout that better preserves semantics. In addition, a newly developed sampling strategy selects sequences with the same target item as hard positive samples.
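The three sequence-level augmentations named for CL4SRec (cropping, masking, and reordering) can be sketched as follows. This is an illustrative simplification rather than the authors' code: the ratios, the mask id `MASK_TOKEN`, and the item ids are hypothetical. Two augmented views of the same sequence would then serve as a positive pair for the contrastive objective.

```python
import random

MASK_TOKEN = 0  # hypothetical id reserved for masked items


def crop(sequence, ratio, rng):
    """Keep a random contiguous sub-sequence covering `ratio` of the items."""
    length = max(1, int(len(sequence) * ratio))
    start = rng.randint(0, len(sequence) - length)
    return sequence[start:start + length]


def mask(sequence, ratio, rng):
    """Replace a random `ratio` of the items with the mask token."""
    count = int(len(sequence) * ratio)
    positions = set(rng.sample(range(len(sequence)), count))
    return [MASK_TOKEN if i in positions else item for i, item in enumerate(sequence)]


def reorder(sequence, ratio, rng):
    """Shuffle a random contiguous window covering `ratio` of the items."""
    length = int(len(sequence) * ratio)
    start = rng.randint(0, len(sequence) - length)
    window = sequence[start:start + length]
    rng.shuffle(window)
    return sequence[:start] + window + sequence[start + length:]


rng = random.Random(0)
history = list(range(1, 11))  # hypothetical item ids 1..10 in interaction order
cropped = crop(history, 0.6, rng)
masked = mask(history, 0.3, rng)
reordered = reorder(history, 0.5, rng)
```

DuoRec's argument is that such augmentations may change the meaning of a sequence, which is why it applies dropout inside the model instead of editing the items.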
Model Training Cost
Personalization is one of the important issues in deep learning-based recommender systems (DLRSs). In particular, the development of IT and the diversity of user behaviors and preferences have exponentially increased the amount of data that recommender systems need to handle for model training. Efforts to improve efficiency by optimizing model training time have therefore been made over the last two to three years. As recommendation models become more personalized, performance measures such as data throughput and data processing time are becoming important because they are tied to model training costs. In inference, the number of tasks processed per second is called inference throughput, and the time it takes to process one task is called inference latency [81]. The time needed to deliver recommendations is an important variable for users of DLRS-based applications such as YouTube and TikTok, where large numbers of short videos, articles, and images closely related to our daily lives must be provided to users within seconds or minutes. This performance is determined by the target hardware and by how well the deep learning model is trained.
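The two inference measures defined above can be illustrated with a small timing sketch. The code is a hypothetical example, not a benchmark from [81]: `dummy_predict` is a stand-in for a real recommendation model, and the request count is arbitrary. Latency is measured per request, and throughput is the number of requests completed per second of wall-clock time.

```python
import time


def measure_inference(predict, requests):
    """Return (mean latency in seconds per request, throughput in requests per second)."""
    latencies = []
    start = time.perf_counter()
    for request in requests:
        t0 = time.perf_counter()
        predict(request)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return sum(latencies) / len(latencies), len(requests) / elapsed


def dummy_predict(user_id):
    # Hypothetical stand-in for a model: rank 1,000 items and return the top 10.
    return sorted(range(1000), key=lambda item: (item * 31 + user_id) % 997)[:10]


mean_latency, throughput = measure_inference(dummy_predict, list(range(200)))
```

When requests are processed one at a time, as here, throughput is roughly the inverse of latency; with batching or parallel serving the two can diverge, so both are usually reported.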
Figure 9 illustrates data processing in a typical deep learning-based recommender system and shows how latency arises in the process. Training data (e.g., new content and user activity) is collected from users and reaches the training server in the data center. The training server uses optimizers to calculate the gradients that modify the model. The resulting checkpoints are validated first, and only those that can improve the Service-Level Objective (SLO) are deployed to the parameter server in the data center to update the model. The latency of a DLRS model update consists of the time it takes to compute the update and the time it takes to distribute it to the global data centers; measures such as the latency with which new content is served to users correspond to SLOs. When updating models, recent DLRSs (e.g., NVIDIA Merlin and Meta Check-N-Run) exhibit latencies on the order of minutes or hours. Therefore, to optimize the performance of deep learning models, it is necessary to improve the SLO, which means making the model update process efficient [82]. There are two main ways to update a model with the aim of improving latency or SLO performance. The first is checkpoint broadcasting, the SOTA technique, which updates the model through several long-latency steps: checkpointing and broadcasting, which take from seconds to minutes, and validation, which takes from minutes to hours. The second is localized machine learning, which aims to reduce SLO loss.
Conclusion and Future Research Directions
The sequential recommender system has been recognized as an essential field by academics and researchers in the last 3–5 years. As a result, many related papers have been published or are under way at top AI-related conferences such as SIGIR, CIKM, and RecSys [84], [85], [86], [87], [88], [89], [90], [91], [92], [93]. This paper introduced sequential recommendation, which can analyze user–item interactions more accurately and flexibly through temporal factors, and explored the concepts, architecture, applications, and detailed models of representative deep learning-based recommendation models. RNNs and CNNs were widely used as the initial models of deep learning-based recommender systems, and models based on GAN, GNN, Transformer, and SSL have attracted more attention recently because they compensate for the shortcomings of the initial models. RNN recommendation models are useful for learning time-dependent or sequential data because their recurrent structure cycles data to continuously update information, and CNNs can treat user–item interactions in time and latent space as image-like information to understand the full context of the recommendation process and learn sequential patterns from it. GAN is a representative unsupervised deep learning model that can optimize recommendation quality through a zero-sum game in which two different neural networks (generator and discriminator) compete so that the generated data resembles real-world data, while GNN can capture the similarity between users and items more clearly through graph elements such as nodes and edges, resulting in highly interpretable and expressive recommendation results.
In addition, the transformer, developed to solve the long-term dependency problem of RNN models, utilizes bidirectional learning and masking techniques to outperform other models in terms of learning performance and speed, and self-supervised learning can compensate for data sparsity by using data augmentation approaches such as cropping, masking, and reordering.
In the future, backbone technologies such as RNN, CNN, and GNN are not expected to change much in this field, but efforts to improve the efficiency and effectiveness of SRS are expected to receive continued attention from researchers. Network compression [83], [84] is expected to be an important research topic because it is closely related to the agility, efficiency, and accuracy needed to improve the performance of recommender systems. Network compression is a lightweight deep learning approach that uses model reduction techniques to reduce the number of parameters, and related learning techniques include data augmentation [85], knowledge distillation [86], and transfer learning [87]. Data augmentation is a technique for estimating parameters well and achieving high model performance when training data is scarce. Knowledge distillation utilizes the information transferred from a deep learning model trained on abundant data, and transfer learning enables effective learning based on what has been learned in other domains.