Next Basket Recommendation Model Based on Attribute-Aware Multi-Level Attention

Next basket recommendation is a challenging problem, mainly because current research rarely considers the relationships among the items in a basket. In this article, we address next basket recommendation with a novel deep learning architecture. In particular, we model both users' short-term interests and their long-term preferences, and we design a new attention mechanism that captures the relationships among the items in a basket. We extensively evaluated the proposed model on two benchmark datasets, Ta-Feng and JingDong. The experimental results show that the proposed model outperforms several state-of-the-art next basket recommendation models. In the experiments on the modules' effects, we also verify the effectiveness of each module: compared with the incomplete network variants, the model significantly improves the NDCG by 25.5 and 5 percentage points on the JingDong and Ta-Feng datasets, respectively, while in terms of the F1, the performance is improved by 14.6 and 23.4 percentage points, respectively.


I. INTRODUCTION
In recent years, an increasing number of e-commerce platforms have emerged, such as Taobao, JingDong and Amazon, and recommendation systems play important roles in them. Recommendation systems can identify users' needs and increase sales on these platforms. Although recommendation systems have been proven useful [1], [2], recommending items that are highly relevant to users remains a challenging problem [3]. In addition, recent studies have focused on recommending lists of items to users [4], [5], where each item is independent. However, in real scenarios, the items in a user's basket are often related. The task of recommending a basket of items to users is called ''next basket recommendation'' [6]. It uses a user's historical purchases to predict what the user will buy next. Next basket recommendation helps e-commerce websites present the right items to users, thus promoting the growth of their trading volumes and profits, and it can help users quickly select the right items from massive numbers of items.
The associate editor coordinating the review of this manuscript and approving it for publication was Pasquale De Meo.

The difference between next basket recommendation and general recommendation is that it must recommend a basket of items to users each time; therefore, the relationships between the items inside the basket should be considered, and when analyzing the historical baskets of users, the implicit information between items and users should be mined more fully. First, users' short-term purchase behavior and their long-term purchase behavior should both be considered, as both reflect users' preferences for items. Second, the preferences of users change over time, so we need to mine the implicit information in users' purchase history sequences. In addition, it is also important to find the dependencies between baskets and between the items in a basket. In combination with the basket recommendation application scenario, we summarize two points to consider in modeling:
• Mining users' long-term and short-term preferences from their historical purchase sequences, and
• Considering the relationships between the items in a basket and the relationships between baskets.
Using machine learning in recommendation systems has been a hot research area. Models based on collaborative filtering can find the overall interests of a user, but they have difficulty considering the sequential characteristics of historical transactions. Markov models can capture the order feature, but high-order Markov models are too complex. In recent years, deep learning has also been widely used in the recommendation field. A session-based recommendation system used an RNN to model users' behavioral sequences [8]; this work regards the clicks on items within a session as a sequence and applies a GRU to mine the sequence information. A hierarchical attention network was proposed to solve the sequential recommendation problem [9].
Through the hierarchy layer, this work combined the long-term and short-term preferences of users to generate an advanced hybrid representation of users. A hybrid model incorporating a learned gating mechanism was proposed, which can combine different models given different context information [10]. Regarding next basket recommendation, many models have also been proposed: the HRM aggregates the representations of items and users through nonlinear maximum pooling [11], the DREAM uses an RNN to model the basket sequence for the first time [12], and the ANAM integrates the attribute information of items into the item representation [13]. However, what the above works miss is that they do not explicitly focus on the relationships between the items inside the basket. To solve this problem, this article proposes a novel Next basket Recommendation Model Based on Attribute-aware Multi-level Attention (NbR-AM). Firstly, we attend to the item information inside the basket and construct the hybrid basket representation with a self-attention mechanism at the basket level; then we apply an attention mechanism at the level of the user's basket sequence to pay attention to different baskets; and finally we recommend the next basket.
The main contributions of this article are summarized as follows:
• The basket representation is constructed by applying self-attention to the items in the basket, and the basket representations are encoded by a bi-LSTM into latent representations.
• The second attention layer focuses on the user's different baskets to construct a representation of the user's purchase intention.
• When recommending an item for the next basket in the recommendation layer, both the context representation and the items recommended at previous time steps are fully considered.
We performed experiments on two common datasets and compared the proposed model with existing models. The experimental results show that the proposed model improves the effectiveness of basket recommendation.
The remainder of this article is organized as follows. Section II briefly reviews the key research works related to this area. The problem formulation is presented in Section III. In Section IV, a next basket recommendation model based on attribute-aware multi-level attention is proposed and its algorithm is described in detail. Extensive experimental results and discussions are reported in Section V. Finally, the conclusions of this research are drawn in Section VI.

II. RELATED WORK
In order to provide personalized recommendations to users, review-based models, social-based models and behavior-based models have been used. The Diffnet simulates how a user's latent embedding evolves as the social diffusion process continues [14]. The HRDR is a hybrid neural recommendation model that learns deep representations of users and items from ratings and reviews [15]. Expressing users' long-term and short-term interests effectively is the key to next basket recommendation; therefore, it is necessary to model users' historical basket sequences. Currently, there are four types of models that address the sequence problem: collaborative-filtering based models, frequent-pattern based models, Markov-chain based models and neural network models. Matrix factorization (MF) is a typical model based on collaborative filtering [16]. By factorizing the user-item matrix composed of all the historical transaction data, latent vectors can be used to represent the overall interests of users. Reference [17] proposed NCF, a hybrid model combining CF and a neural network. This work complements the mainstream shallow models for CF, opening up a new avenue of research for recommendation based on deep learning. However, models based on collaborative filtering cannot model the basket sequence. Frequent-pattern based models mainly mine frequent patterns of items from the basket data [18], [19]. The basic idea is to recommend a basket by mining the frequent patterns in the user's purchase history. Reference [19] applies an extended Apriori algorithm to mine user purchase patterns in two stages. Reference [18] considered four factors of users' purchase behavior, namely co-occurrence, sequence, periodicity and reproducibility, and proposed a time-series labeling cycle mode as the decision-making factor for basket recommendation.
The main deficiency of frequent-pattern based models is their limited expressive ability, and they incur high space and time costs when applied to large-scale data. Markov based models are a popular approach among traditional sequence recommendation models. They extract sequence characteristics from historical transactions and then predict the next purchase based on these sequential behaviors. Factorizing Personalized Markov Chains (FPMCs) can model the sequential behavior between two adjacent baskets [20]. However, FPMCs can only perform linear operations on data features and thus can hardly model the interaction between multiple features.
Deep learning based models have been a hot research direction for basket recommendation in recent years, and there is also much excellent work combining deep learning with traditional models. The DeepCF explored the possibility of fusing representation learning-based CF models and matching function learning-based CF models [21]. In addition, the Sli-Rec has done an excellent job of capturing a user's long-term and short-term preferences in the recommendation system [22]. Various deep neural network models have been applied to solve the basket modeling and sequence modeling problems in basket recommendation. The HRM first conducts a maximum pooling operation on the embedding vectors of the recently purchased items, and then it constructs the user behavior representation with a maximum pooling operation between the user's embedding vector and the first-layer representation to form the next recommendation [11]. However, both FPMCs and the HRM are based on Markov chain approaches, which model the sequential behavior of users only with adjacent transactions; this is not sufficient to capture the long-term trend of the basket sequence. To solve this problem, the DREAM uses an RNN to model the global sequential characteristics between baskets and uses the hidden states of the RNN to represent the dynamic interests of users over time [12]. The ANAM uses a hierarchical structure to apply an independent attention mechanism to items and their respective attributes, uses an RNN to model the user's sequential behavior over time, and passes the user's preferences for items and their attributes to the next basket by sharing the attention weights between two baskets at different hierarchies [13].

III. PROBLEM FORMULATION
In the next basket recommendation scenario, we have a set of users U = {u_1, u_2, . . . , u_{|U|}} and a set of items I = {i_1, i_2, . . . , i_{|I|}}, where |U| and |I| are the number of users and the number of items, respectively. Given a user u, this user's existing basket sequence is B^u = {B^u_1, B^u_2, . . . , B^u_t}, where each basket B^u_t ⊆ I is a set of items purchased together. Each item has some attribute information, such as the category information of the item. The attributes of item i are expressed as a_i ∈ A, where A is the set of attributes, and the attribute information of basket B^u_t is expressed as A^u_t = {a_i | i ∈ B^u_t}. The goal of next basket recommendation is to use a user's historical transaction records to predict the next basket B^u_{t+1} = {i_1, i_2, . . . , i_n}.
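To make the formulation concrete, the data can be laid out as nested collections. The following sketch uses our own illustrative names and toy values, not data from the paper: each user maps to a chronological list of baskets, each basket is a set of item ids, and item attributes (here, categories) live in a separate lookup table.

```python
# Illustrative data layout for next basket recommendation.
user_baskets = {
    "u1": [{"i3", "i7"}, {"i7", "i12", "i15"}, {"i3"}],  # B^u_1 .. B^u_t
    "u2": [{"i5"}, {"i5", "i9"}],
}
item_category = {"i3": "dairy", "i5": "snacks", "i7": "dairy",
                 "i9": "drinks", "i12": "produce", "i15": "bakery"}

def attributes_of(basket):
    """Attribute information A^u_t of a basket: the categories of its items."""
    return {item_category[i] for i in basket}
```

Given such a structure, the prediction target for user u1 is the basket B^u_{t+1} that follows the last recorded one.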

IV. THE PROPOSED APPROACH
In this article, the Next basket Recommendation Model Based on Attribute-aware Multi-level Attention is proposed. Figure 1 shows an overview of the architecture of the proposed model, which consists of four layers: the basket layer, the sequence layer, the recommendation layer and the output layer. In the basket layer, the basket embedding v^basket is calculated by the self-attention mechanism; the input of the self-attention mechanism is the embedding sequence of the items {i_1, i_2, . . . , i_n} and the output is the basket embedding v^basket. In the sequence layer, the user's basket representation sequence {v^basket} is input into the bi-LSTM so as to capture the long-term interests and short-term preferences in the user's sequential behavior; these basket representations are encoded by the bi-LSTM into latent representations h_i. In the recommendation layer, the second level of the attention mechanism pays different degrees of attention to these latent representations to model the relationships between baskets, yielding the current basket's context representation c_t. The next hidden state s_t of the decoder at time step t is computed from the context representation c_t, the previous hidden state s_{t−1} of the LSTM and the global embedding (GE). In the output layer, the hidden state of the LSTM at each time step is used to calculate the probability that each item will be purchased. At each time step, we select the item with the highest probability as the recommended item for the user. To prevent repeated predictions, we use the masked softmax (MS) to output the probability distribution.
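The four layers above can be sketched as a minimal PyTorch skeleton. All class names, method names and dimension choices below are our own illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class NbRAM(nn.Module):
    """Skeleton of the four-layer architecture (dimensions illustrative)."""
    def __init__(self, n_items, n_cats, d=100, heads=2):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d)
        self.cat_emb = nn.Embedding(n_cats, d)
        # Basket layer: self-attention over the items / categories of a basket.
        self.item_attn = nn.MultiheadAttention(d, heads)
        self.cat_attn = nn.MultiheadAttention(d, heads)
        # Sequence layer: bi-LSTM over basket representations (2d after concat).
        self.encoder = nn.LSTM(2 * d, d, bidirectional=True)
        # Recommendation layer: unidirectional decoder; input is [g_{t-1}; c_t]
        # with g of size d and c of size 2d (the bi-LSTM hidden size).
        self.decoder = nn.LSTMCell(d + 2 * d, 2 * d)
        # Output layer: score every item.
        self.out = nn.Linear(2 * d, n_items)

    def basket_repr(self, item_ids, cat_ids):
        # (L, 1, d) layout expected by nn.MultiheadAttention (seq, batch, dim).
        ie = self.item_emb(item_ids).unsqueeze(1)
        ce = self.cat_emb(cat_ids).unsqueeze(1)
        vi, _ = self.item_attn(ie, ie, ie)
        vc, _ = self.cat_attn(ce, ce, ce)
        # Pool over items, then concatenate the two attribute-aware views.
        return torch.cat([vi.mean(0), vc.mean(0)], dim=-1).squeeze(0)
```

The `basket_repr` method corresponds to the basket layer: one self-attention pass over item-id embeddings, one over category embeddings, concatenated into the attribute-aware basket representation.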

A. THE BASKET LAYER
The self-attention mechanism was first proposed by Google for machine translation and achieved state-of-the-art results at the time [23]. Self-attention uses an attention mechanism to calculate the association between each word and every other word. The basic structure is shown in Figure 2.
For self-attention, Q, K and V come from the same input. First, we calculate the dot product between Q and K; then, to prevent the result from becoming large, it is divided by a scaling factor √d_k, where d_k is the dimension of the query and key vectors. The result is normalized into a probability distribution with the softmax operation and then multiplied by the matrix V to produce a weighted sum, which can be formalized as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V (1)

Multi-head attention computes different representations head_i of Q, K and V for the same input and then combines the self-attention results. The structure of multi-head attention is shown in Figure 3 and the formula is as follows:

MultiHead(Q, K, V) = Concat(head_1, . . . , head_h) W^O, head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) (2)

where W_i^Q, W_i^K, W_i^V and W^O are projection parameter matrices. The information of the items and their attributes in a basket can reflect the purchase intentions or preferences of users from different aspects, and the two play complementary roles in the representation of the basket. In Equations (1) and (2), we take the category and the id of the items in the basket as the inputs of the self-attention mechanism to get two kinds of basket representations, v^basket_category and v^basket_item, respectively, and then concatenate the two vector representations to obtain the final attribute-aware basket representation v^basket. The lookup operation retrieves the embedding of each item id and each category from the item and category embedding matrices, respectively.
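The scaled dot-product attention described above can be sketched in PyTorch as follows (the function name is ours):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # pairwise similarities
    return F.softmax(scores, dim=-1) @ V            # weighted sum of values

# For self-attention over a basket, Q, K and V are all the (projected)
# item embeddings of that basket.
```

With a single-item basket the softmax weight is 1, so the output equals V, which is a quick way to sanity-check an implementation.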

B. THE SEQUENCE LAYER
The Sequence to Sequence model is a popular RNN model that is widely used in machine translation, automatic question answering systems and other fields, and it has achieved good results [24], [25]. LSTM is a popular model for sequence labeling because it alleviates the vanishing and exploding gradient problems of the traditional RNN when training on long sequences [26]. Through LSTM's three-gate mechanism, each hidden state can express the effective information of the current basket and the previous baskets. Further, bi-LSTM can perform long-term memory and short-term memory simultaneously from two directions [27], which is conducive to mining more latent preference relationships from the user's behavior sequence. We use bi-LSTM for basket sequence modeling. The hidden state representation h_i contains not only the basket information from before the current basket but also the basket information from after it, which helps the model learn the purchase preferences of users as a whole. For the basket representation sequence {v^basket_1, . . . , v^basket_m}, in order to determine the user's long-term and short-term purchasing preferences, we run the LSTM in two directions to get the hidden states → h_i and ← h_i of each basket and concatenate these two vectors into the current basket representation h_i, which reflects the sequence information centered on the current basket.
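A minimal sketch of the sequence layer, with illustrative dimensions of our choosing:

```python
import torch
import torch.nn as nn

# Bi-LSTM over a user's basket representation sequence: h[i] concatenates
# the forward and backward hidden states of basket i, so it carries
# information from baskets both before and after the current one.
d = 16                                # basket representation dimension
bilstm = nn.LSTM(input_size=d, hidden_size=d, bidirectional=True)
baskets = torch.randn(5, 1, d)        # (m baskets, batch=1, d)
h, _ = bilstm(baskets)                # h has shape (5, 1, 2 * d)
```

Each `h[i]` is the latent representation fed to the second attention layer.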

C. THE RECOMMENDATION LAYER
In the traditional RNN structure, when the input sequence is very long, the context vector is not sufficient to represent the whole input sequence, which makes it difficult to deal with long sequences. To solve this problem, the attention mechanism was introduced; it has been widely used in the fields of computer vision and natural language processing [28]-[31]. In the next basket recommendation problem, many users have very long basket histories, with dozens of baskets. Because different baskets in the sequence contribute differently to the next basket prediction, we use the attention mechanism to attend to different parts of the basket representation sequence and aggregate the m basket representations into the context representation c_t, which more effectively expresses the purchase intention of the user at a certain time and thus helps to recommend items to the user more accurately. In particular, the attention mechanism at time step t assigns a weight α_ti to the i-th basket representation:
α_ti = exp(e_ti) / Σ_{j=1}^{m} exp(e_tj)

e_ti = v^T tanh(W_a s_{t−1} + H_a h_i + U_a h̄)

where W_a, H_a, and U_a are the weight parameters, s_{t−1} is the hidden state of the LSTM at step t − 1, and e_ti is the weight influence factor. We believe that the weight of the current basket representation h_i is related not only to h_i itself but also to the states of the other time steps, which is why the summary h̄ of all hidden states also enters the score. The final context vector c_t associated with the recommendation at time t is calculated as follows:

c_t = Σ_{i=1}^{m} α_ti h_i (11)

In our model, we use the user's historical baskets at time steps {1, 2, . . . , t} to recommend the next basket at time step t + 1.
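The attention step above can be sketched as follows. The `h_mean` term is our reading of "related to the states of other time steps", so the exact scoring form may differ from the paper's; all names are ours:

```python
import torch
import torch.nn.functional as F

def context_vector(H, s_prev, W_a, H_a, U_a, v):
    """Compute c_t = sum_i alpha_ti * h_i, where the score of basket i is
    e_ti = v^T tanh(W_a s_{t-1} + H_a h_i + U_a h_mean) and alpha_t is the
    softmax over the m scores."""
    h_mean = H.mean(dim=0)  # summary of all hidden states
    e = torch.stack([v @ torch.tanh(W_a @ s_prev + H_a @ h + U_a @ h_mean)
                     for h in H])
    alpha = F.softmax(e, dim=0)
    return (alpha.unsqueeze(1) * H).sum(dim=0), alpha
```

The weights `alpha` always sum to 1, so `c_t` stays in the convex hull of the basket representations.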
The process of recommending a basket of several items is a loop, as shown in Algorithm 1. In the recommendation layer, each step of the loop takes the items recommended at previous steps into account through the global embedding. In this layer, a unidirectional LSTM is used. The context representation c_t, the previous hidden state s_{t−1} of the LSTM and the global embedding g_{t−1} from the output layer are used as the inputs to calculate the next hidden state representation s_t of the LSTM:

s_t = LSTM(s_{t−1}, [g_{t−1}; c_t]) (12)
where [g_{t−1}; c_t] denotes the concatenation of the vectors g_{t−1} and c_t. The global embedding g_{t−1} is the average of the embeddings of the items predicted at the previous t − 1 steps of the loop. At each step of the loop, we select the item with the highest probability as the recommended item and use the items' embeddings to calculate the global embedding g_{t−1}. When t = 1, g_{t−1} is a randomly initialized vector. When t ≥ 2, g_{t−1} is computed as follows:

g_{t−1} = (1 / (t − 1)) Σ_{i=1}^{t−1} Σ_{j=1}^{|I|} y_{i,j} e_j (13)

where y_{i,j} ∈ [0, 1] s.t. Σ_{j=1}^{|I|} y_{i,j} = 1, y_{i,j} is the probability of item j being purchased by the user in the i-th loop step, and e_j is the embedding of item j.
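The global embedding can be sketched as follows; this variant averages the embeddings of the hard-selected item ids rather than weighting by the full probability distribution, and the helper name is ours:

```python
import torch

def global_embedding(predicted_ids, item_emb):
    """g_{t-1}: the average embedding of the items recommended at the
    previous steps of the decoding loop."""
    return item_emb[torch.tensor(predicted_ids)].mean(dim=0)
```

With one-hot item embeddings, the result is simply the normalized count vector of the items recommended so far.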

D. THE OUTPUT LAYER AND MODEL LEARNING
In this layer, we recommend a basket with five items to the user. In the output layer, y_t is the probability distribution over the items likely to be bought at time step t + 1 and is calculated as follows:

d_t = f(W_d s_t + V_d c_t) (14)

y_t = MS(W_o d_t + I_t) (15)

where W_o, W_d, and V_d are weight parameters, and I_t is the mask vector that is used to avoid recommending repeated items in the output layer: if an item has been recommended in the previous t − 1 steps, the corresponding entry (I_t)_i is set to −∞. In addition, f is a nonlinear activation function and MS denotes the masked softmax.
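The masked softmax can be sketched as follows (the function name is ours):

```python
import torch
import torch.nn.functional as F

def masked_softmax(scores, already_recommended):
    """Set the scores of previously recommended items to -inf so the
    softmax assigns them zero probability, preventing repeats."""
    mask = torch.zeros_like(scores)
    mask[already_recommended] = float("-inf")
    return F.softmax(scores + mask, dim=-1)
```

Because exp(−∞) = 0, a masked item can never be re-selected, while the remaining probabilities still sum to 1.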
In order to learn effectively from the training data, we use the weighted cross-entropy as the loss function, which is defined as follows:

L = − Σ_{i=1}^{|I|} [ k · y_i log p_i + q · (1 − y_i) log(1 − p_i) ] (17)

where k and q are the weights of the positive and negative classes, respectively, p_i is the probability of item i being purchased in the next basket, and y_i indicates whether the user purchased item i: if the user purchased item i, y_i = 1; otherwise, y_i = 0. We further formalize the above two levels of the proposed approach in Algorithm 1:

Algorithm 1 Next Basket Recommendation Model Based on Attribute-Aware Multi-Level Attention
Input: the historical basket sequences of all users u ∈ U
Output: the recommended next basket R(u) for each user
Begin
1. for each u ∈ U do
2.   construct the basket representations v^basket_category and v^basket_item via self-attention over the categories and items, respectively
3.   encode the basket representation sequence with the bi-LSTM to obtain the hidden representations h_i
4.   for t = 1 to N do
5.     feed the hidden representations into the second attention mechanism to get the context vector c_t via Equation (11)
6.     obtain the decoder hidden state s_t via Equations (12) and (13)
7.     get the probability y_t of each item being purchased via Equations (14) and (15)
8.     select the item with the highest probability and put it into R(u)
9.   end for
10.  return R(u)
11. end for
End
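The weighted cross-entropy of Equation (17) used for model learning can be sketched in plain Python as follows; the function name and the numerical-stability epsilon are ours:

```python
import math

def weighted_cross_entropy(p, y, k, q):
    """L = -sum_i [ k * y_i * log p_i + q * (1 - y_i) * log(1 - p_i) ],
    where k weights the (rare) purchased items and q the unpurchased ones."""
    eps = 1e-12  # avoid log(0)
    return -sum(k * yi * math.log(pi + eps)
                + q * (1 - yi) * math.log(1 - pi + eps)
                for pi, yi in zip(p, y))
```

Setting k much larger than q (e.g. the 800:1 ratio used for Ta-Feng) makes a missed positive item far more costly than a false positive, which compensates for the class imbalance.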

V. EXPERIMENTS

A. DATASETS
In order to evaluate the performance in next basket recommendation, we use two real-world transaction data sets to conduct the experiments.
- The Ta-Feng and JingDong datasets.
We preprocessed the above two datasets to remove items that were seldom purchased and users who seldom purchased items: users who purchased fewer than 15 items were deleted, and items that were purchased fewer than 15 times were deleted. Table 1 summarizes the statistics of the two datasets after data preprocessing.
Finally, we divided each dataset into a training set and a test set. The test sets contain only the last transaction of each user, while all other transactions are used as the training sets.

B. BASELINES
We use the following representative state-of-the-art models as the baselines for the experiments:
- NMF: A recommendation model based on non-negative matrix factorization, which has proven to be the best among the traditional models [32].
- HRM: A hierarchical representation model that uses a shallow neural network and nonlinear operations to aggregate the embedding representation of the user and the embedding representations of the last purchased items [11].
- DREAM: This model uses an RNN to model the global sequential characteristics between baskets and uses the hidden states of the RNN to represent the dynamic interests of users over time [12].
- ANAM: This model uses a hierarchical structure to apply an independent attention mechanism to items and their attributes respectively, uses an RNN to model the user's sequential behavior over time, and passes the user's preferences for items and their attributes to the next basket by sharing the attention weights between two baskets at different hierarchies [13].
- IIAAN: This model applies an attention mechanism at the basket sequence level, acting on all historical user baskets to model the user's long-term preferences, and within the most recent basket, acting at the item level to model the user's dynamic interests and short-term preferences [33].
The traditional NMF model is based on the scikit-learn 0.20.3 machine learning library. The HRM, DREAM, ANAM and IIAAN models are based on deep learning and are implemented with PyTorch 1.0.
The model proposed in this article is implemented using PyTorch 1.0 [34]. In our experiments, we recommend 5 items to each user. The embedding dimension is set to 100, and the number of heads in the multi-head attention mechanism is set to 2. The loss function is optimized with the Adam algorithm. Because of the imbalance between the numbers of positive and negative instances in the loss function, in Equation (17) we set k to 800 times q for the Ta-Feng dataset and to 1000 times q for the JingDong dataset. The corresponding learning rates for the two datasets are 0.002 and 0.004. Dropout was not used in the experiments. The batch size is 64 in the training stage, and the number of epochs is 100.

C. EVALUATION METRICS
For each model, we recommend a list of N items (N = 5) to user u, denoted as R(u), where R_i(u) represents the i-th recommended item. The basket actually purchased by the user is recorded as T(u). The performance is evaluated on the test set using the following metrics:

Precision@K = |R(u) ∩ T(u)| / K, Recall@K = |R(u) ∩ T(u)| / |T(u)|

F1@K = 2 · Precision@K · Recall@K / (Precision@K + Recall@K)

Hit-rate@K = I(|R(u) ∩ T(u)| > 0)

NDCG@K = (1 / N_k) Σ_{i=1}^{K} I(R_i(u) ∈ T(u)) / log_2(i + 1)

In the above evaluation metrics, I(·) is an indicator function whose value is 1 when the condition is true and 0 otherwise. N_k is a constant that denotes the maximum value of NDCG@K. Table 2 and Table 3 illustrate the experimental results of all compared models on the two datasets. All models were run four times and we use the average values as the results. The following can be concluded from the experimental results:
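The standard definitions of F1@K and NDCG@K with binary relevance can be sketched as follows (function names are ours):

```python
import math

def f1_at_k(recommended, purchased, k):
    """F1@K from the precision and recall of the top-K recommendation list."""
    hits = sum(1 for item in recommended[:k] if item in purchased)
    if hits == 0:
        return 0.0
    prec = hits / k
    rec = hits / len(purchased)
    return 2 * prec * rec / (prec + rec)

def ndcg_at_k(recommended, purchased, k):
    """NDCG@K with binary relevance; N_k is the ideal DCG (all top
    positions hit)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in purchased)
    n_k = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(purchased))))
    return dcg / n_k
```

NDCG@K rewards placing hits near the top of the list: a hit at rank 1 counts more than the same hit at rank 5.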

D. THE EXPERIMENTAL RESULTS COMPARED WITH THE BASELINE
Firstly, it can be seen that neural networks obviously improve performance: the models based on neural networks are better than NMF because NMF does not model the basket and is far inferior to neural networks in learning complex patterns.
Secondly, among the models based on neural networks, several models are better than the HRM because the HRM network is too simple and only uses the sequence information of two adjacent baskets, which makes it weaker than the other deep learning models in feature representation. This indicates that a reasonable network design can indeed help the model learn more information and improve its effectiveness. Thirdly, our model achieved the best results on the two datasets. This shows that the attribute-aware multi-level attention design, which attends both to the item information inside each basket and to the order of the baskets in the sequence, helps to recommend the next basket.

E. THE INFLUENCE OF COMPONENTS
To evaluate the contribution of each component module to the network, we remove one module at a time from the proposed network and evaluate the recommendation performance of the remaining network. This study conducted a series of contrast experiments utilizing the following neural network variants:
-attention: The second attention module in sequence modeling is removed.
-selfAttention: Average pooling replaces the self-attention module in basket modeling.
-globalEmbed: The effect of the current prediction on subsequent predictions in the output layer is removed.
singleHead: The basket is modeled with a single-head self-attention module, that is, the number of heads is set to 1.
normalAttention: Normal attention is used to model the basket embedding vector. The equations are as follows:

α_i = exp(ω^T e_i) / Σ_{j=1}^{L} exp(ω^T e_j)

v^basket = Σ_{i=1}^{L} α_i e_i

where ω is the weight parameter, e_i is the embedding of the i-th item in the basket, and L is the number of items in the basket.
Table 4 and Table 5 respectively show the effects of each component model on the two datasets, and the following conclusions can be drawn. Firstly, the complete network obtains the best performance, which indicates that each module of the network we designed plays a role to a certain extent. In the self-attention ablation study, we compared self-attention with normal attention; in terms of effect, the latter is slightly better than the former.
Secondly, on the JingDong dataset, the complete network is about 11.4 percentage points better than the -globalEmbed network in terms of the F1, improves the HR by 0.8 percentage points, and increases the NDCG by 8.4 percentage points. This shows that it is effective to consider the effect of a recommended item on the subsequent recommended items. Regarding the F1 and NDCG metrics, self-attention showed the most obvious improvement, increasing these metrics by 14.6 and 25.5 percentage points, respectively. This shows that self-attention is effective in basket modeling. On the Ta-Feng dataset, the complete network improved the NDCG by up to 5 percentage points compared with the -attention network and by at least 0.8 percentage points compared with the singleHead network. Meanwhile, in terms of the Hit-rate and NDCG, the performance is improved by up to 21.9 and 5 percentage points, respectively, compared with the -attention network. Therefore, applying the second attention layer to the different baskets in the basket sequence also achieves good results.

F. THE INFLUENCE OF HYPER-PARAMETERS
In the loss function of Equation (17), we use k and q to balance the effects of the positive and negative samples, respectively. Figure 4 shows the influence of different k to q ratios on the F1. We can see that the performance of the model tends to be stable when the ratio of k to q is set to 500. This shows that the ratio of k to q helps to balance the sample imbalance.

G. THE STATISTICAL TESTING
In order to comprehensively compare the performances of different learning algorithms, it is not enough to rely on the measured performance of the models on the datasets; we also need hypothesis tests, which provide an important reference for comparing the learning algorithms. We choose the Friedman test, which is a non-parametric test, and draw the Friedman test graphs for the four metrics in Figure 5. In the graphs, if there is no overlapping area between two algorithms, the algorithms differ significantly. It can be seen that the results of the hypothesis tests are basically consistent with our experimental results.

VI. CONCLUSION
In this article, a novel Next basket Recommendation Model Based on Attribute-aware Multi-level Attention was proposed. This model uses a self-attention mechanism to encode the basket representation and incorporates the attribute information of items. To model the long-term and short-term preferences of users, the model uses a bi-LSTM. Then, the hidden state representations of the encoder are decoded by attention and an LSTM to make recommendations for users. Experiments on two public datasets show that the proposed model is superior to traditional and existing deep learning-based recommendation models.
In the next basket recommendation area, there is still much room for improvement. Although existing deep learning-based recommendation models have achieved better results than traditional machine learning models, much research remains to be done to improve the recommendation effect and promote the commercial application of the research results. At present, there are still some problems in the field of personalized recommendation, such as a lack of diversification. In addition, for commercial applications, more item attribute information and user attribute information must be integrated; thus, personalized recommendation based on knowledge graphs is an important research direction.