Item-Based Collaborative Memory Networks for Recommendation

Item-based Collaborative Filtering (ICF or Item-based CF) has been widely applied in industrial recommender systems, owing to its efficiency in user preference modeling and its flexibility in online personalization. It captures a user's preference from his or her historically interacted items, recommending new items that are most similar to that preference. Recently, several works have proposed advanced neural network architectures to learn item similarities from rating data, modeling items as latent vectors and learning model parameters by optimizing a recommendation-aware objective function. While much of the literature uses classical neural networks, such as the multi-layer perceptron (MLP), to learn item similarities, relatively little work has employed memory networks for ICF, even though they can record detailed information about entities more accurately than classical neural networks. Therefore, in this paper, we propose a powerful Item-based Collaborative Memory Network (ICMN) for ICF, built on the architecture of Memory Networks. In addition, a neural attention mechanism is adopted to focus on the most important historically interacted items. The core of our ICMN is the cooperation of external and internal memory, together with the contribution of the neural attention mechanism. Compared to state-of-the-art ICF methods, our ICMN possesses a more powerful representation capability. Extensive experiments on two datasets demonstrate the effectiveness of ICMN. To the best of our knowledge, this is the first attempt to apply memory networks to ICF.


I. INTRODUCTION
In the era of information explosion, the recommender system is an important tool for dealing with the problems caused by information overload. It can automatically search and filter the products we need among a massive number of products. On the one hand, it saves users from being overwhelmed by the enormous number of choices; on the other hand, it rapidly increases the traffic and profits of online services.
Recently, collaborative filtering (CF), a technique that predicts the personalized preferences of users from historical user-item interactions only, has played an essential role in modern recommender systems, especially in the candidate generation phase [18], [26], [37]. Popularized by the Netflix Prize, Matrix Factorization (MF) has become the most popular approach to model-based recommendation in academia and has been deeply studied in the literature [19], [47]. Although MF methods have shown superior accuracy over neighbor-based methods in terms of rating prediction, they have seldom been used in industrial applications due to their less flexible personalization scheme [17].
Item-based collaborative filtering, a well-known kind of nearest-neighbor method, models a user's preference by representing the user with his or her historically interacted items, and predicts the interaction between the user and new items based on item-item similarity. Compared to MF, which represents a user by a simple ID, ICF offers more detailed information about a user's profile, which also yields more interpretable predictions in various recommendation scenarios. It is also more suitable for real-time applications than MF, since the major process of computing item similarities can be performed offline, leaving the recommendation module only a series of lookups on items similar to those the user has historically interacted with. Accordingly, ICF has been widely used in industrial applications [7], [13], [26], [35]. Primitive ICF methods use statistical measures such as the Pearson coefficient and cosine similarity to quantify item similarities; such methods typically underperform machine-learning based methods on top-K ranking tasks [32]. To overcome this drawback, the authors of SLIM (Sparse Linear Method) [28] combined the merits of model-based methods and ICF by optimizing a recommendation-aware objective function. Although SLIM achieves better accuracy, learning the whole item-item similarity matrix directly is quite time-consuming, and the learned similarities cannot be transferred between items. In FISM (Factored Item Similarity Model) [20], each item is represented by a latent vector; to estimate the similarity between a pair of items, we only need to compute the inner product of the two latent vectors, so similarity can be well transferred between items.
NAIS (Neural Attentive Item Similarity Model for Recommendation) [17] addressed the deficiency of FISM, which weights all interacted items uniformly, by using an attention network to discriminate which interacted items are more influential. Deep-ICF (Deep Item-based Collaborative Filtering for Top-N Recommendation) [46] argued that previous works on ICF only model second-order relations between items while discarding higher-order relations; it illustrates the difference between high-order and second-order relations between pairs of items and proposes capturing the high-order relations with deep networks. However, these models lack the capacity to memorize detailed information about historically interacted items when modeling the complex preferences of users: as online shopping becomes more convenient with the development of e-commerce, some users purchase a great many items. In other words, traditional neural networks may not be competent for such tasks, as they lose much detailed information when modeling a user's preference from a massive number of historical items, making it infeasible for recommenders to practically track the myriad changes of users' interests.
In recent years, a surge of interest in applying deep learning to recommender systems has emerged. The literature [8], [29], [39] applied DNNs to model auxiliary information; [31] addressed collaborative filtering with Restricted Boltzmann Machines; and autoencoders are also popular choices for recommender systems [23], [33], [40], [45], [49]. These deep learning architectures achieve strong performance and demonstrate the advantages of nonlinear transformations over traditional linear models [48]. However, there is little work employing memory networks for recommender systems, compared with the large body of literature applying other deep learning architectures to recommenders.
In this work, we propose an approach that portrays the complicated personalized preferences of users through detailed information about numerous historical items; hence, a strong memory for items is indispensable to the model. Although previous neural networks such as Recurrent Neural Networks (RNNs) [44] possess some memory capability, the literature [36] argues that the memory components of these networks are too small to remember the details of a complex entity and are not compartmentalized enough to accurately remember facts from the past, since knowledge is compressed into dense vectors [43]. Therefore, we develop a model that capitalizes on the advantages of Memory Networks and neural attention mechanisms for item-based CF with implicit feedback. The complicated preferences of users are modeled through the cooperation of external and internal memory; high-order, complex item-item relations are captured by the multi-layer, non-linear architecture; and the neural attention mechanism places higher weights on the subsets of historically interacted items that are most similar to the target item. Finally, the interaction between the user and the target item is predicted by an output module of the memory network.
Our Item-based Collaborative Memory Network (ICMN) model inherits the advantage of FISM in terms of high efficiency in online services, while being more expressive and accurate than FISM due to its complex non-linear transformation architecture.
Our primary contributions are as follows: 1) We utilize Memory Networks for item-based CF to capture high-order item-item relations and encode detailed information about items into the memory module, combining them with the attention mechanism to predict user-item interactions. 2) We introduce a technique proposed in NAIS [17], called Smooth, to address the problem that the standard attention mechanism cannot learn well from users' historically interacted items due to the large variance in the lengths of user histories. 3) We employ layer normalization [1], which seldom appears in the item-based CF literature, to address the vanishing gradient problem. 4) We conduct a series of experiments on two popular real-world datasets to demonstrate the effectiveness of our model.

II. PRELIMINARIES
In this part, we first review traditional item-based CF, then recapitulate SLIM, the first learning-based ICF method. After that, we introduce FISM and NAIS, both state-of-the-art approaches to ICF. Finally, we analyze Deep-ICF [46], which extends NAIS.

A. ITEM-BASED COLLABORATIVE FILTERING
Item-based CF [32] calculates the similarity between historically interacted items and the target item with statistical measures, and uses the user's satisfaction with similar items to predict the user's rating score for the target item.
Formally, the predictive model of item-based CF is:
$$\hat{r}_{ui} = \sum_{j \in \mathcal{R}_u^+} s_{ij}\, r_{uj} \quad (1)$$
where $s_{ij}$ is the similarity score between interacted item $j$ and target item $i$; $r_{uj}$ is the preference of user $u$ for item $j$, which can be a real-valued rating score (explicit feedback) or a binary value of 1 or 0 (implicit feedback); and $\mathcal{R}_u^+$ is the set of items that the user has interacted with.
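For concreteness, the predictive model of Equation (1) can be sketched in a few lines of numpy; the matrices below are toy values for illustration only, not data from our experiments:

```python
import numpy as np

def predict_icf(S, R, u, i, history):
    """Predict user u's score for target item i as a similarity-weighted
    sum over the user's historically interacted items (Equation 1).
    S: item-item similarity matrix; R: user-item rating matrix;
    history: indices of the items in R_u^+."""
    return sum(S[i, j] * R[u, j] for j in history)

# Toy example: 2 users, 3 items, implicit (binary) feedback.
R = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=float)
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
score = predict_icf(S, R, u=0, i=2, history=[0, 1])  # 0.1*1 + 0.3*1 = 0.4
```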

B. LEARNING-BASED METHOD FOR ITEM-BASED CF
The authors of SLIM [28] came up with a learning-based method for item-based CF, which learns item-item similarities by building an objective function and then optimizing it. The primary idea is to minimize the loss between the ground-truth interactions and the ones reconstructed by the model. Formally, the objective function is as follows:
$$\min_{W} \; \frac{1}{2}\|A - AW\|_F^2 + \frac{\beta}{2}\|W\|_F^2 + \lambda\|W\|_1 \quad \text{s.t.} \; W \ge 0, \; \mathrm{diag}(W) = 0 \quad (2)$$
where $W$ is the item-item similarity matrix and $A$ is the original rating matrix. $\beta$ is the $L_2$ regularization coefficient, which prevents the model from overfitting, and $\lambda$ is the $L_1$ regularization coefficient, which enforces sparsity on $W$, since only a few items are particularly similar to a given item in real-world datasets. The non-negativity constraint on each entry of $W$ ensures that it is a meaningful similarity metric, and the zero constraint on the diagonal of $W$ avoids the impact of the target item itself when the model performs prediction. Although recommendation accuracy is improved, SLIM has two inherent drawbacks. First, refreshing the similarity matrix costs $O(I^2)$ time. Second, it can only learn similarities for item pairs that have been co-rated by the same user before, so SLIM fails to capture transitive relations between items. To address these drawbacks, FISM [20] projects all items into the same latent space, representing each item with a low-dimensional embedding vector. The similarity score between a pair of items can then be easily measured, parameterized as the inner product between the embedding vectors of item $i$ and item $j$. Formally, the predictive model of FISM is:
$$\hat{r}_{ui} = \mathbf{p}_i^T \left( |\mathcal{R}_u^+ \setminus \{i\}|^{-\alpha} \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} \mathbf{q}_j \right) \quad (3)$$
where $\mathbf{p}_i$ and $\mathbf{q}_j$ are the embedding vectors of item $i$ and item $j$, respectively. The exclusion $\setminus \{i\}$ plays the same role as SLIM's constraint $\mathrm{diag}(W) = 0$, and $\alpha$ is a hyper-parameter that controls the normalization effect. From the perspective of matrix factorization, the formulation in the bracket can be seen as an embedding vector of user $u$.
Therefore, to some extent, FISM is quite similar to MF but provides more information about users for the model to capture their preferences. While FISM achieves advanced performance among item-based CF methods and inherits the suitability of item-based CF for online recommendation, its equal treatment of all historical items of a user limits its representation ability.
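The FISM predictive model of Equation (3) can be sketched as follows; the embedding tables here are random placeholders for illustration:

```python
import numpy as np

def fism_score(P, Q, u_history, i, alpha=0.1):
    """FISM-style prediction: the item-item similarity is the inner
    product p_i^T q_j, summed over the history R_u^+ \\ {i} and
    normalized by |R_u^+ \\ {i}|^{-alpha}."""
    hist = [j for j in u_history if j != i]        # exclude the target item
    if not hist:
        return 0.0
    norm = len(hist) ** (-alpha)                   # |R_u^+ \ {i}|^{-alpha}
    return float(norm * sum(P[i] @ Q[j] for j in hist))

rng = np.random.default_rng(0)
P = rng.normal(size=(5, 8))   # "target" item embeddings p_i
Q = rng.normal(size=(5, 8))   # "history" item embeddings q_j
score = fism_score(P, Q, u_history=[0, 2, 3], i=3)
```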

C. ATTENTION-AWARE METHOD
NAIS [17] employs a neural attention network on the foundation of FISM to learn varying weights for item-item relations, which is formulated as:
$$\hat{r}_{ui} = \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} a_{ij}\, \mathbf{p}_i^T \mathbf{q}_j$$
where $a_{ij}$ represents the attentive weight of the similarity, i.e., the importance of interacted item $j$ when the model predicts the interaction between user $u$ and item $i$. The computation of $a_{ij}$ differs from the standard attention solution, and is specifically formulated as:
$$a_{ij} = \mathrm{softmax}'\!\left( f(\mathbf{p}_i, \mathbf{q}_j) \right), \qquad f(\mathbf{p}_i, \mathbf{q}_j) = \mathbf{h}^T \mathrm{ReLU}\!\left( \mathbf{W} (\mathbf{p}_i \odot \mathbf{q}_j) + \mathbf{b} \right)$$
where $\mathrm{softmax}'$ is a variant of the standard softmax function that takes the user history length into account:
$$a_{ij} = \frac{\exp\!\left( f(\mathbf{p}_i, \mathbf{q}_j) \right)}{\left[ \sum_{j' \in \mathcal{R}_u^+ \setminus \{i\}} \exp\!\left( f(\mathbf{p}_i, \mathbf{q}_{j'}) \right) \right]^{\beta}}$$
where $\beta$ is the smoothing exponent, a hyper-parameter ranging from 0 to 1, introduced to alleviate the problem that the large variance of user history lengths restricts the learning ability of the attention layer. The intuition is as follows: the softmax function performs $L_1$ normalization on the attention weights, which may overly punish the weights of active users with long histories; smoothing the denominator of the softmax reduces this punishment while decreasing the variance of the attention weights. Besides, the original normalization term $|\mathcal{R}_u^+|^{-\alpha}$ of FISM is absorbed into the attention weight $a_{ij}$ without losing representation ability in NAIS. $\mathbf{W}$ and $\mathbf{b}$ are the weight matrix and bias vector of the attention network, respectively, and $\mathbf{h}$ is a weight vector that projects the hidden layer to a scalar output.
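The smoothed softmax can be illustrated in isolation; note that with $\beta < 1$ the weights no longer sum to 1, which is exactly the intended relaxation of the $L_1$ normalization (this minimal sketch omits the overflow guard a production version would need, since the smoothed variant is not shift-invariant):

```python
import numpy as np

def smoothed_softmax(f, beta=0.5):
    """NAIS-style smoothed softmax: the denominator is raised to the
    power beta in (0, 1], damping the penalty on long user histories.
    beta = 1 recovers the standard softmax."""
    e = np.exp(f)
    return e / (e.sum() ** beta)

f = np.array([1.0, 2.0, 0.5])        # attention scores f(p_i, q_j)
a_std = smoothed_softmax(f, beta=1.0)     # ordinary softmax, sums to 1
a_smooth = smoothed_softmax(f, beta=0.5)  # smoothed, sums to more than 1
```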

D. DEEP-BASED METHOD
Deep-ICF [46] replaces the inner product with a multi-layer perceptron on the foundation of NAIS, capturing high-order features through non-linear hidden layers, which NAIS cannot achieve. The model can be formulated as:
$$\mathbf{v}_{ui} = \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} a_{ij} \left( \mathbf{p}_i \odot \mathbf{q}_j \right), \qquad \mathbf{v}_l = \sigma\!\left( \mathbf{W}_l \mathbf{v}_{l-1} + \mathbf{b}_l \right), \qquad \hat{r}_{ui} = \mathbf{z}^T \mathbf{v}_L + b_u + b_i$$
where $\mathbf{v}_L$ denotes the output of the $L$-th layer, each layer takes the output of the last layer as input, and $\mathbf{v}_{ui}$ denotes the second-order item-item relation. $\mathbf{z}$ denotes the weight vector of the prediction layer. Besides, $b_u$ and $b_i$ are introduced to capture the bias of user $u$'s preference and item $i$'s features, respectively. Meanwhile, Deep-ICF proves that NAIS is a special case of Deep-ICF.
Although these works attempt to model the volatile preferences of users through interacted items, detailed information about the historical items is lost because the features are compressed into a dense vector.

E. MEMORY AUGMENTED NEURAL NETWORKS
Before we elaborate on our proposed model, a brief overview of memory-based architectures is necessary. Memory augmented neural networks generally consist of a memory module and a controller. The memory module typically contains an external memory encoding long-term information about entities, while the controller performs a series of operations on the memory components, such as read, write, and erase. The memory module improves the model's representation capacity independently of the controller, while offering a persistent internal representation of knowledge for modeling long-term dependencies.
The initial framework proposed in [43] shows promising performance in tracking long-term dependencies in synthetic question answering tasks. However, its demand for strong supervision limits its flexibility. End-To-End Memory Networks relax this requirement, allowing the system to be trained end-to-end. Moreover, the attention mechanism of these networks enables the architecture to focus on the specific subsets of information most essential to a given task, rather than processing all information uniformly.

III. METHODS
In this section, we elaborate on our proposed model, the Item-based Collaborative Memory Network (ICMN); Figure 1a shows a visual depiction of the architecture. As can be seen, ICMN maintains two memories, a historically-interacted-item memory (long-term memory) and a user-preference internal memory (short-term memory), together with an embedding vector of the target item. The architecture can remember detailed high-order item-item similarities and depict the user's preference thanks to the memory module, which is capable of encoding a large volume of detailed information. Besides, the neural attention mechanism learns a nonlinear weighting function over the historically interacted items, where more similar items contribute higher weights at the output module. Figure 1b shows the architecture with multiple hops (layers), which adaptively refines the nonlinear weighting function based on the memory.

A. ITEM EMBEDDING
The memory module contains two item-memory matrices: an internal memory matrix $M \in \mathbb{R}^{Q \times d}$ and an external memory matrix $C \in \mathbb{R}^{Q \times d}$, where $Q$ represents the total number of items and $d$ denotes the size of each memory cell (the embedding size). Each historically interacted item $j$ of a user is embedded in two memory slots separately: $\mathbf{m}_j \in M$ stores the short-term memory for features of the item, and $\mathbf{c}_j \in C$ stores the long-term memory for attributes of the item. Each user input is represented by the set of items that the user has interacted with, $\{x_1, x_2, x_3, \ldots, x_k\}$, where each element denotes a historically interacted item; the user input is then transformed into a multi-hot encoding. For each multi-hot encoding of a user, we query the internal memory $\mathbf{m}_j$ of each historical item. Then we project the target item input, denoted by a one-hot encoding, into an embedding vector $\mathbf{e}_i$, which contains the feature information of the target item. Finally, we measure the similarity vector $\mathbf{p}_{ij}$, which holds the matching information between the items, as the element-wise product of the short-term memory slot $\mathbf{m}_j$ and the item latent vector $\mathbf{e}_i$:
$$\mathbf{p}_{ij} = \mathbf{m}_j \odot \mathbf{e}_i$$
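The memory lookups and the element-wise similarity above can be sketched as follows; the memory matrices are random placeholders, and the history and target indices are hypothetical:

```python
import numpy as np

d, n_items = 8, 10                  # embedding size d and item count Q
rng = np.random.default_rng(1)
M = rng.normal(size=(n_items, d))   # internal memory matrix M (short-term)
C = rng.normal(size=(n_items, d))   # external memory matrix C (long-term)
E = rng.normal(size=(n_items, d))   # target-item embedding table

history = [0, 3, 7]                 # user's interacted items {x_1, ..., x_k}
i = 5                               # target item
e_i = E[i]                          # target item embedding e_i
P_sim = M[history] * e_i            # similarity vectors p_ij = m_j ⊙ e_i, row per j
```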

B. LAYER NORMALIZATION
Due to the large variance in the lengths of user histories [17], the attention mechanism struggles to learn from user histories even when we utilize the Smooth technique from NAIS. Specifically, gradients easily vanish when we stack hops or enlarge the embedding size. To resolve the problem, we employ a technique that has been demonstrated to be effective in the NLP area, namely layer normalization [1]. Although the vanishing gradient problem then disappears, the model loses a bit of representation capability, which we will examine in the Experiment section, where we also show that employing layer normalization is nevertheless a reasonable choice. Layer normalization forces the data to follow a normal distribution and is more suitable than batch normalization for online learning tasks and for learning tasks whose sequence lengths vary. We transform the similarity vector by layer normalization as follows:
$$\mathrm{LN}(\mathbf{p}_{ij}) = \boldsymbol{\gamma} \odot \frac{\mathbf{p}_{ij} - \mu_{ij}}{\sigma_{ij}} + \mathbf{b}, \qquad \mu_{ij} = \frac{1}{d} \sum_{v=1}^{d} p_{ijv}, \qquad \sigma_{ij} = \sqrt{\frac{1}{d} \sum_{v=1}^{d} \left( p_{ijv} - \mu_{ij} \right)^2}$$
where $\mu_{ij}$ and $\sigma_{ij}$ are the mean and standard deviation of each similarity vector $\mathbf{p}_{ij}$, $p_{ijv}$ denotes the $v$-th element of $\mathbf{p}_{ij}$, and $\boldsymbol{\gamma}$ and $\mathbf{b}$ are the gain and bias parameters with the same dimension as $\mathbf{p}_{ij}$.
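A minimal numpy sketch of this transformation (the small `eps` term, a standard numerical safeguard, is an implementation detail not part of the equation above):

```python
import numpy as np

def layer_norm(p, gamma, b, eps=1e-6):
    """Layer normalization over the feature dimension of each similarity
    vector: subtract the mean, divide by the standard deviation, then
    apply the learnable gain (gamma) and bias (b)."""
    mu = p.mean(axis=-1, keepdims=True)
    sigma = p.std(axis=-1, keepdims=True)
    return gamma * (p - mu) / (sigma + eps) + b

p = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(p, gamma=np.ones(4), b=np.zeros(4))
# After normalization the vector has (approximately) zero mean and unit variance.
```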

C. INTERACTED ITEM ATTENTION
The attention mechanism learns an adaptive weighting function that focuses on the subset of most influential items within the user's history when predicting an interaction. Previous works predefine a uniform weighting function, such as the inner product, and treat all historical items equally, which may decrease model fidelity. NAIS therefore employs an attention mechanism to identify the most similar neighbor items. However, directly applying the standard attention mechanism when input sequence lengths vary widely causes the problem stated clearly in the Preliminaries; NAIS alleviates it with the Smooth technique, which has shown its effectiveness, and we adopt the same technique here. The attention weights are computed with the following formulation:
$$a_{ij} = \frac{\exp\!\left( \mathbf{h}^T \mathrm{ReLU}(\mathbf{W} \mathbf{p}_{ij} + \mathbf{b}) \right)}{\left[ \sum_{j' \in \mathcal{R}_u^+} \exp\!\left( \mathbf{h}^T \mathrm{ReLU}(\mathbf{W} \mathbf{p}_{ij'} + \mathbf{b}) \right) \right]^{\beta}}$$
The attention mechanism allows the model to focus on specific historical items while paying less attention to those that are less similar. We then construct the final representation of the user's preference by interpolating the external item memory with the attention weights activated by the short-term item memory:
$$\mathbf{o}_{ui} = \sum_{j \in \mathcal{R}_u^+} a_{ij}\, \mathbf{c}_j$$
where $\mathbf{c}_j$ is another embedding vector for item $j$, called the external memory, which stores long-term information about the features of each historical item of the user. Intuitively, the internal memory acts as a short-term memory activated by the target item, driving the model to search the long-term memory for information about the most similar historical items and then respond to the target item. ICMN thus captures the similarity of items and dynamically assigns a degree of importance to each historical item, instead of the uniform weights that may limit the representation capacity of the model.
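The attended memory read can be sketched as follows, combining the one-layer ReLU attention network with the smoothed softmax and the weighted sum over external memory slots; all parameter values here are random placeholders:

```python
import numpy as np

def attend_and_read(P_sim, C_hist, W, b, h, beta=0.5):
    """Score each similarity vector p_ij with a one-layer ReLU network,
    normalize the scores with the smoothed softmax, then read the
    external memory as the weighted sum o_ui = sum_j a_ij * c_j."""
    hidden = np.maximum(P_sim @ W + b, 0.0)   # ReLU hidden layer
    f = hidden @ h                            # scalar score per history item j
    e = np.exp(f)
    a = e / (e.sum() ** beta)                 # smoothed attention weights a_ij
    return a @ C_hist                         # user preference vector o_ui

rng = np.random.default_rng(2)
d, k = 8, 3                                   # embedding size, history length
P_sim = rng.normal(size=(k, d))               # similarity vectors p_ij
C_hist = rng.normal(size=(k, d))              # external memory slots c_j
o_ui = attend_and_read(P_sim, C_hist,
                       W=rng.normal(size=(d, d)),
                       b=np.zeros(d),
                       h=rng.normal(size=d))
```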

D. OUTPUT MODULE
On the foundation of the components constructed above, we predict the interaction between user $u$ and target item $i$. Intuitively, we must integrate the information of the target item with the preference of the user to predict the user's response; in other words, we combine the embedding vector $\mathbf{e}_i$, which encodes the features of the target item, with the user preference $\mathbf{o}_{ui}$ derived above:
$$\hat{r}_{ui} = \sigma\!\left( \mathbf{e}_i^T \mathbf{H} \mathbf{o}_{ui} \right)$$
where $\mathbf{e}_i$ denotes the embedding vector of target item $i$, and $\mathbf{H}$ is a linear transformation matrix learned jointly with the other parameters [43]. Our proposed model possesses the following advantages. First, the neural attention mechanism provides adaptive weights for each historical item in the final predicted interaction, dependent on the specific target item. Second, the memory module integrates the long-term external memory and the short-term internal memory for each historical item, and the combined contribution of both memory components provides more complete historical information for the model to achieve accurate prediction.
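The output module reduces to a bilinear scoring function squashed by a sigmoid; a sketch with placeholder vectors:

```python
import numpy as np

def predict(e_i, o_ui, H):
    """Output module: fuse the target-item embedding e_i with the
    attended user preference o_ui through the learned linear map H,
    then squash with a sigmoid so the prediction lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(e_i @ (H @ o_ui))))

rng = np.random.default_rng(4)
d = 8
r_hat = predict(rng.normal(size=d), rng.normal(size=d), np.eye(d))
```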

E. MULTIPLE HOPS
We now extend the model to an arbitrary number of memory hops; Figure 1b shows the ICMN architecture with multiple hops. The initial hop introduces the demand to acquire additional information about items. Starting from the second hop, the model begins to consider high-order item similarity; each additional hop processes the information newly obtained in the last hop. In other words, the model has the chance to retrospect and reconsider the most similar historical items. Specifically, multiple memory modules are stacked together by taking the output of the $l$-th hop as input to the $(l+1)$-th hop:
$$\mathbf{e}_i^{l+1} = \mathbf{H}^l \mathbf{e}_i^l + \mathbf{o}_{ui}^l$$
where $\mathbf{H}^l \in \mathbb{R}^{d \times d}$ is a linear projection matrix mapping the target item's features into a latent space coupled with the existing information from the last hop [10]. The newly formed target item vector then activates the short-term item memory, and the adaptive attention weights are recomputed. Repeating this process at each hop yields an iterative refinement, and the output module receives the final weighted neighborhood representation from the last hop to predict the interaction. We employ the layer-wise type of weight tying within the model [36]; that is, the input and output embeddings are shared across the different hops, which greatly reduces space and time complexity:
$$M^1 = M^2 = \cdots = M^L, \qquad C^1 = C^2 = \cdots = C^L$$
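The multi-hop refinement loop can be sketched as below; small-scale random parameters stand in for trained weights, and layer-wise tying is reflected by reusing the same memory and attention parameters in every hop:

```python
import numpy as np

def multi_hop(e_i, M_hist, C_hist, W, b, h, H, hops=3, beta=0.5):
    """Iterative refinement: each hop reads the memory with the current
    target representation, then updates it as e^{l+1} = H e^l + o^l.
    With layer-wise weight tying, M, C, W, b, h, H are shared across hops."""
    o = np.zeros_like(e_i)
    for _ in range(hops):
        P_sim = M_hist * e_i                        # p_ij = m_j ⊙ e_i
        hidden = np.maximum(P_sim @ W + b, 0.0)     # ReLU attention network
        f = hidden @ h
        a = np.exp(f) / (np.exp(f).sum() ** beta)   # smoothed attention
        o = a @ C_hist                              # memory read o_ui
        e_i = H @ e_i + o                           # hop update
    return e_i, o

rng = np.random.default_rng(3)
d, k = 8, 4
e_final, o_final = multi_hop(
    rng.normal(scale=0.1, size=d),
    rng.normal(scale=0.1, size=(k, d)), rng.normal(scale=0.1, size=(k, d)),
    W=rng.normal(scale=0.1, size=(d, d)), b=np.zeros(d),
    h=rng.normal(scale=0.1, size=d), H=np.eye(d), hops=3)
```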

F. OPTIMIZATION
To optimize our model, we build an objective function that minimizes the error between the ground-truth interactions and those predicted by the model. With implicit feedback, where each entry is a binary value 0 or 1, the learning task can be viewed as binary classification. Following [18], we treat each observed user-item interaction as a positive instance and sample negative instances from the remaining unobserved interactions. Letting $\mathcal{R}^+$ and $\mathcal{R}^-$ denote the sets of positive and negative instances respectively, the loss function is defined as follows:
$$L = -\frac{1}{N} \sum_{(u,i) \in \mathcal{R}^+ \cup \mathcal{R}^-} \left[ r_{ui} \log \hat{r}_{ui} + (1 - r_{ui}) \log \left( 1 - \hat{r}_{ui} \right) \right] + \lambda \|\theta\|^2 \quad (17)$$
where $N$ denotes the total number of training instances, $\lambda$ is a hyper-parameter that prevents the model from overfitting, and $\theta$ denotes all trainable parameters. The sigmoid function $\sigma$ restricts the prediction $\hat{r}_{ui}$ to the range $[0, 1]$, so the prediction can be interpreted as the probability that the user will interact with the item. The objective function we constructed is clearly the binary cross-entropy loss. Other objective functions, such as pointwise regression [19], [41] and pairwise ranking losses [30], [50], are also suitable for learning ICMN with implicit feedback; since the focus of this work is employing memory networks for item-based CF, we leave the exploration of other loss functions as future work. For convenience, we adopt Adam [22] to optimize the objective function, which updates the learning rate automatically based on adaptive estimates of lower-order moments, alleviating the pain of tuning the learning rate.
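Equation (17) can be sketched as a few lines of numpy; the predictions and labels below are toy values, and the clipping `eps` is a standard numerical guard rather than part of the objective:

```python
import numpy as np

def bce_loss(r_hat, r, theta, lam=1e-5, eps=1e-12):
    """Binary cross-entropy over positive and sampled negative instances,
    plus an L2 penalty on the trainable parameters theta (Equation 17)."""
    r_hat = np.clip(r_hat, eps, 1.0 - eps)        # avoid log(0)
    ll = r * np.log(r_hat) + (1 - r) * np.log(1 - r_hat)
    return -ll.mean() + lam * sum(np.sum(p ** 2) for p in theta)

r_hat = np.array([0.9, 0.8, 0.2, 0.1, 0.3])  # model predictions sigma(...)
r = np.array([1.0, 1.0, 0.0, 0.0, 0.0])      # 2 positives, 3 sampled negatives
loss = bce_loss(r_hat, r, theta=[np.zeros(3)])
```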

IV. EXPERIMENT
In this section, we show a series of experimental results and analyze them.

A. EXPERIMENTAL SETTINGS 1) DATASETS
We experimented on two well-known datasets: MovieLens and Pinterest. The characteristics of the two datasets are summarized in Table 1; more details have been elaborated in [18] and we do not restate them. Since both datasets have already been preprocessed (e.g., sparse users removed and the train-test split performed), we directly use the preprocessed data to train and test our model. It is worth noting that each positive instance is paired with 4 negative instances, so the total number of training instances is five times the number of interactions.

2) EVALUATION PROTOCOLS
We adopt the leave-one-out evaluation protocol [18], [30], which is popular in implicit-feedback based collaborative filtering. It reserves each user's latest interaction as the held-out data and feeds the rest to training. We randomly sample 99 unobserved items (negative instances) for each user and pair them with the held-out positive item to form the test set; each method then produces prediction scores for these 100 instances. Finally, Hit Ratio (HR) [9] and Normalized Discounted Cumulative Gain (NDCG) [15] are chosen as the evaluation criteria. Briefly, HR can be deemed a recall-based measure that estimates the percentage of users who are successfully recommended, by observing whether the positive instance appears in the top-10 ranked list; NDCG measures the quality of the ranking, assigning higher scores to hits at top positions. Higher values of both metrics indicate better performance. Both metrics are widely used in ranking systems [16].
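For a single user, the protocol reduces to ranking 100 scores and checking the position of the held-out positive; a minimal sketch (array values are illustrative):

```python
import numpy as np

def hr_ndcg_at_k(scores, pos_index, k=10):
    """Leave-one-out metrics for one user: scores holds predictions for
    1 held-out positive and 99 sampled negatives; pos_index is the
    position of the positive item in that array."""
    top_k = np.argsort(-scores)[:k]               # indices of the top-k items
    if pos_index not in top_k:
        return 0.0, 0.0                           # miss: both metrics are 0
    rank = int(np.where(top_k == pos_index)[0][0])
    return 1.0, 1.0 / np.log2(rank + 2)           # hit, position-discounted gain

scores = np.linspace(1.0, 0.0, 100)   # positive ranked first when pos_index = 0
hr, ndcg = hr_ndcg_at_k(scores, pos_index=0)      # -> (1.0, 1.0)
```

Per-user results are then averaged over all users to obtain HR@10 and NDCG@10.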

3) BASELINES
We compare ICMN with the following recommendation methods: • Pop. This is a non-personalized method that benchmarks the performance of the top-K recommendation task. It ranks items by their popularity, measured by the number of interactions each item has received.
• ItemKNN [32]. This is the standard item-based CF method described in Equation 1. We used cosine similarity to measure $s_{ij}$, and followed the settings of [17] to adapt it.
• FISM [20]. This is a learning-based item-based CF model as formulated in Equation 3. We set α to 0.1, which leads to the best results on both datasets [17].
• YouTube Rec [6]. A deep neural network architecture for recommending YouTube videos. It maps video IDs to a sequence of embeddings and feeds them into a feed-forward neural network. The input layer is followed by several fully connected hidden layers whose neurons are activated by Rectified Linear Units (ReLU).
• MF-BPR [30]. MF-BPR optimizes MF with the pairwise Bayesian Personalized Ranking (BPR) loss. It is a highly popular baseline for item recommendation. We optimized it with a fixed learning rate and report the best performance.
• MF-eALS [19]. MF-eALS also learns an MF model, but optimizes a different pointwise regression loss with the element-wise Alternating Least Squares (eALS) algorithm, treating all unobserved interactions as negative feedback with a smaller weight.
• MLP [18]. This method learns the scoring function from user-item interactions with a multi-layer perceptron. We employed a 3-layer MLP and optimized the same cross-entropy loss, which was shown to perform well on the two datasets.
• Deep-ICF [46]. Since NAIS is a special case of Deep-ICF, to demonstrate the effectiveness of a large memory in collaborative filtering we chose to compare with Deep-ICF rather than NAIS. We tuned the parameters of Deep-ICF following the original literature.

B. PARAMETER SETTINGS
We implemented our proposed methods based on Keras and ran training on a GPU. For simplicity, unless otherwise mentioned, we set the hyper-parameter β to 0.5 and sampled four negative instances per positive instance. We split the training instances with a fixed mini-batch size of 4096 on the Pinterest dataset, but used user-based mini-batches on the MovieLens dataset: masking the large variance of history lengths on MovieLens costs considerable time, whereas on Pinterest the variance is relatively small but the number of users is massive, which makes transmitting data from CPU to GPU the bottleneck. Moreover, for ICMN models trained from scratch, we initialized the model parameters with a Xavier normal distribution. We tested the learning rate of

C. EFFECT OF PRE-TRAINING
Due to the non-linearity of the neural attention mechanism, the complexity of the memory network, and the non-convexity of the objective function, optimizing the model from scratch can easily become trapped in poor local minima. Therefore, the initialization of the model parameters plays a crucial role in the model's final performance. Intuitively, optimizing the memory module and the neural attention mechanism simultaneously is difficult; as such, we pre-trained ICMN/w, the variant of ICMN without the attention mechanism (and without layer normalization), and used the pre-trained weights to initialize ICMN. To demonstrate the effect of pre-training (i.e., using the parameters learned by the attention-free model as ICMN's initialization), Table 2 shows the performance of ICMN with and without pre-training at embedding size 64. The pre-trained parameters improve ICMN significantly; moreover, in our observation, ICMN with pre-training converges faster than with random initialization. To show the best performance of our proposed ICMN, unless otherwise mentioned, we use this pre-training trick to optimize the model.

D. EMBEDDING SIZE AND THE NUMBER OF HOPS
As mentioned above, stacking hops lets later hops utilize the information captured by previous hops, while the neural attention recomputes the adaptive item weights. Moreover, the embedding size is clearly vital for the model, since a larger embedding vector provides more features of an item. We now explore the effect of the embedding size and the number of hops. Tables 3 and 4 show the performance of ICMN with embedding sizes of [8, 16, 32, 64] and hop counts of [1, 2, 3, 4]. As can be observed, on the Pinterest dataset the performance of ICMN improves steadily as the embedding size and the number of hops increase; on the MovieLens dataset, performance is generally better with larger embedding sizes, but fluctuates as hops are stacked at large embedding sizes. The reason is that the MovieLens dataset lacks enough data to train ICMN, so the model overfits.

E. EFFECTIVENESS OF ATTENTION NETWORKS WITH LAYER NORMALIZATION
As the effectiveness of neural attention networks alone in collaborative filtering models has been illustrated in the literature [17], we do not restate it. We mentioned before that layer normalization may limit the representation of an entity, and we illustrate it here: layer normalization forces the data flowing through the model to follow a normal distribution, which accelerates training and prevents vanishing/exploding gradients. However, restricting the features of items to a standard normal distribution is not always wise. For example, one dimension of an item's embedding vector may indicate whether the item serves female or male customers, obviously a binary value, while another dimension may represent the age of the customers the item serves; forcing the two dimensions to follow the same normal distribution causes a loss of information. On the other hand, directly employing the attention mechanism leaves us facing the vanishing gradient problem (as our experiments show). Thus, we analyze whether we should employ the attention mechanism with layer normalization or not. Table 5 shows the performance of the model with attention (and layer normalization) and the model without them at embedding size 16. As can be seen, the model employing attention with layer normalization achieves better performance. On the Pinterest dataset with one hop, the model with attention performs slightly worse than the one without, because of the lost representation capacity, but performs better as hops are stacked, since the advantage of reconsidering the attention outweighs the limitation. Besides, as the number of hops increases, the model improves only slightly without the attention mechanism but improves significantly with its assistance.

1) SMOOTHING EXPONENT β
In this section, we study the performance of ICMN with varying smoothing exponent β, reporting HR@10 and NDCG@10. We collected the results with three hops. Figure 3 shows the performance of ICMN with the smoothing exponent β varying from 0 to 1 on both datasets. We can see that when β is smaller than 1, the performance of ICMN is acceptable. However, when β is set to 1, the performance of ICMN degrades rapidly. This result is consistent with the conclusion of NAIS [17]; the issue is caused by the large variance of the lengths of user histories.
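The smoothed attention controlled by β follows the formulation of NAIS [17]: the softmax denominator is raised to the power β, so β = 1 recovers the standard softmax. A simplified NumPy illustration (the actual logits come from the attention network):

```python
import numpy as np

def smoothed_softmax(logits, beta=0.5):
    """NAIS-style smoothed attention: raising the softmax denominator
    to the power beta < 1 keeps the weights from becoming overly flat
    for users with long histories; beta = 1 is the standard softmax."""
    e = np.exp(logits)  # logits assumed small; the usual max-shift is
                        # omitted because it is not invariant for beta != 1
    return e / (e.sum() ** beta)

# For uniform logits over 4 items: beta = 1 gives weights of 1/4 each,
# while beta = 0.5 gives larger weights of 1/2 each.
weights_std = smoothed_softmax(np.zeros(4), beta=1.0)
weights_smooth = smoothed_softmax(np.zeros(4), beta=0.5)
```

Because the denominator grows with the history length, a β below 1 dampens the penalty that long histories would otherwise impose on every individual weight.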

2) BASELINE COMPARISON
Now, we compare the performance of ICMN with other state-of-the-art collaborative filtering methods. We first make a comparison with other recommendation approaches at an embedding size of 16 to analyze the performance of all methods. Next, we alter the embedding size to observe the trends under varying embedding sizes. Table 6 shows the overall recommendation accuracy. From these observations we can draw the following conclusions: 1) ICMN achieves the best performance on both datasets (the highest HR and NDCG scores). We attribute the improvements to the adaptive neural attention weights for historical items and to the powerful combined action of both long-term and short-term memory capacity. 2) Learning-based CF methods unsurprisingly perform better than heuristic-based approaches such as Pop and ItemKNN. Comparing FISM with ItemKNN, both of which are item-based CF methods, we can readily see the advantages of learning-based methods over traditional statistical methods. 3) There is no overall winner between the user-based and item-based CF models. In particular, user-based models perform better than FISM on MovieLens but worse on Pinterest. We conclude that item-based CF methods are more advantageous on highly sparse datasets [20]. The trends at an embedding size of 64 are generally similar to those at 16, and our ICMN outperforms all the other approaches. Figure 2 shows the performance over the first 50 epochs of ICMN, ICMN/w, and FISM at embedding size 16 on the two datasets. First, we can see that each of the three ICF models shows a smooth curve on Pinterest but an uneven curve on MovieLens. We argue that the reason is the different variance of users' historical item lengths in the two datasets [17].
Second, among the three approaches, ICMN achieves the best result, followed by ICMN/w and then FISM. We attribute the improvement of ICMN/w over FISM to the cooperation of the two memory components, and the further improvement of ICMN over ICMN/w to the advanced neural attention mechanism.
The above findings provide empirical evidence for the rationality of memory networks for ICF. However, the massive number of parameters in the architecture makes the model prone to overfitting; we will take measures to tackle this issue in future work.
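For reference, the HR@K and NDCG@K metrics reported throughout can be computed as follows under the common leave-one-out protocol with a single held-out item per user (a generic sketch, not tied to our evaluation code):

```python
import numpy as np

def hit_ratio_at_k(rank, k=10):
    # rank: zero-based position of the held-out item in the ranked list
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k=10):
    # With a single relevant item, DCG = 1 / log2(rank + 2) and the
    # ideal DCG is 1, so NDCG equals the discounted gain itself.
    return 1.0 / np.log2(rank + 2) if rank < k else 0.0

# A held-out item ranked first gets full credit; outside the top-k, none.
print(hit_ratio_at_k(0), ndcg_at_k(0))    # 1.0 1.0
print(hit_ratio_at_k(12), ndcg_at_k(12))  # 0.0 0.0
```

The per-user scores are then averaged over all test users to obtain the reported HR@10 and NDCG@10.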

V. RELATED WORK
Early works on recommendation primarily focus on explicit feedback [31], [32], minimizing the loss between the observed ratings and the ratings predicted by the corresponding model. Typically, the well-known regression-based CF method MF, which associates each user and item with a latent vector and models the user's rating score as the inner product of the two vectors, achieved the best performance in the Netflix challenge. The literature [24], [25], [27], [34], [38], [42] extended MF with additional information such as review texts and social influence.
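The MF prediction described above is simply an inner product of latent vectors; a toy sketch (the array names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.normal(size=(100, 8))  # latent vectors for 100 users
Q = rng.normal(size=(50, 8))   # latent vectors for 50 items

def predict_rating(u, i):
    # MF models the rating of user u for item i as the inner
    # product of their latent vectors, p_u . q_i
    return float(P[u] @ Q[i])

score = predict_rating(3, 7)
```

Training fits `P` and `Q` by minimizing the squared error between these predictions and the observed ratings, usually with regularization.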
Later, a surge of interest in recommendation with implicit feedback emerged [2], [5], [11], [12], [15]. While implicit feedback is more challenging to utilize, since user satisfaction cannot be observed and negative feedback is naturally scarce, it is much easier for content providers to collect, as it indirectly reflects users' preferences through behaviors such as clicking items, watching videos, and reading news.
Recently, literature applying deep neural networks (DNN) to recommendation has sprung up. These DNN models construct strong representations of entity content. Neural Collaborative Filtering [18] deals with implicit feedback by learning a model that combines matrix factorization with a deep feedforward neural network, so that the prediction draws on both linear information and non-linear relations.
Cheng et al. [3] jointly trained wide linear models and DNNs to combine the benefits of memorization and generalization for recommender systems. Convolutional neural networks (CNN) are also popular in the recommendation literature, where they have been used to capture local feature representations of entities such as images [48], text [5], [21], music [8], and so on. Besides, attention mechanisms have also been applied to recommender systems. Gong and Zhang [14] perform hashtag recommendation with a CNN and an attention mechanism that focuses on the most informative words. NAIS [17] develops a smoothing technique to make the attention mechanism suitable for variable sequence lengths, which inspired our method.
The method most similar to ours is the Collaborative Memory Network (CMN) [36], which combines model-based CF and neighborhood-based CF with an End-to-End Memory Network. Our method differs from CMN and all previous works by employing a memory network and an attention mechanism with layer normalization for item-based CF. Besides, we introduce a technique that smooths the denominator of the softmax function, because the standard attention network does not perform well on user historical interactions.
To the best of our knowledge, no prior work has applied the End-to-End memory network architecture to item-based collaborative filtering, and our experimental results show that memory networks are a promising choice for tracking user tastes across different attributes of items.

VI. CONCLUSION
In this work, we explored memory network architectures for item-based collaborative filtering. Our key argument is that users' past behavior influences their present behavior, and that long-term memory of user behavior is vital for prediction but hard to record. Besides, the large variance in the lengths of user histories causes problems when learning the attention network. Therefore, we adopted smoothing and layer normalization in our method to enhance its robustness and achieved promising results. We showed that models with layer normalization and attention outperform models without them, demonstrated the salient effect of models with strong memory, and illustrated the effectiveness of the smoothing exponent β.
In the future, we will explore new techniques to resolve the vanishing gradient problem without losing representation capacity. Besides, although the results are promising, the model suffers from overfitting since it has a huge number of parameters. We will investigate more mathematical methods to handle this issue.