Learning Users’ Visual Preferences for Improving Recommendations

Sequential recommender systems (SRSs) aim to predict the next item of interest to a user by learning the user's dynamic preferences over items from sequential user-item interactions. Most existing SRSs make recommendations by modeling only a user's main preference towards the functions of items, while ignoring the user's auxiliary visual preference towards the appearances and styles of items. Although visual preference is less significant than the main preference, it can still play an important role in most users' choices of items. On the one hand, a user often prefers to choose, from multiple items with the same function, the item that matches her/his visual preference well. For example, a lady may choose the garment whose style suits her best from multiple garments with the same function. On the other hand, some particular users (e.g., young girls) are usually very concerned about the appearances of some special items (e.g., clothes, jewelry). Therefore, overlooking users' visual preferences may generate unsatisfactory recommendations that fail to match a user's various types of preferences and thus degrade the consumption experience. To address this gap, in this paper we propose modeling users' visual preferences to improve the performance of sequential recommendations. Specifically, we devise a coupled Double-chain Preference learning Network (DPN) to jointly learn a user's main preference and visual preference as well as the interactions between them. In DPN, one chain models a user's main preference by taking the IDs of items as the input, and the other chain models the user's visual preference by taking the appearance images of items as the input. Finally, the two types of preferences are carefully integrated with an attention module for next-item prediction. Extensive experiments on two real-world transaction datasets show the superiority of our proposed DPN over representative and state-of-the-art SRSs.


I. INTRODUCTION
In the era of the digital economy, recommender systems (RSs) are becoming increasingly popular and have been embedded in almost every part of our daily life [1]. They can help us make efficient and effective decisions when facing a large number of choices over content, products and services [2,3,4]. However, most traditional RSs are built on static user-item interaction data generated over a long period, and thus usually ignore the intrinsic dynamics of user preferences, i.e., the fact that a user's preference over items changes over time [5]. To bridge this gap, sequential recommender systems (SRSs) have been proposed to effectively model the sequential behaviors of users and thus capture their timely preferences for more accurate recommendations. By modeling the sequential user-item interactions generated in a continuous time period as a sequence, an SRS is able to capture a user's dynamic and most recent preference over items and accordingly recommend items that well match that preference.
In practice, in a sequence of user-item interactions, a user's choice of items depends not only on her main preference over items' functions, but also on her auxiliary visual preference [6,7]. In this paper, a user's visual preference refers to her/his preference regarding the appearances and styles of items; it can be revealed by the appearance images of the items chosen by the user, such as the appearance and style of a garment [8]. Although auxiliary and less significant compared with the main preference, a visual preference can still play an important role when the user chooses items. The reasons are twofold. On the one hand, a user often prefers the item whose appearance best matches her/his visual preference from multiple candidate items with the same function. For example, a lady may choose the garment whose style suits her best from multiple garments with the same function.

FIGURE 1. Images of items sampled from two anonymous users' purchase histories in the Amazon product dataset. Each row contains a given context followed by three potential items for the next choice; the first one, in a square, was chosen by the user. A similar style of appearance is observed within each user's contextual items, and the item conforming to that style (the one in the square) tends to be chosen from the multiple potential items.

On the other hand, some visual-sensitive users (e.g., young girls) are usually very concerned about the appearances of some visual-sensitive items (e.g., clothes, jewelry). For example, when a girl purchases a necklace, she focuses not only on the function of the necklace but also on its appearance and style. Moreover, the appearance of an item, especially a visual-sensitive one, determines a user's first impression of it and thus can influence whether she purchases it or not. Given the significance of visual preferences in shaping users' choices of items, in this paper we particularly focus on this auxiliary preference and explore its influence on recommendations. It should be noted that the visual preference usually matters for visual-sensitive types of items rather than all types of items. Accordingly, in sequential recommendations, at each time step the user-item interaction is driven by both the user's main preference and her visual preference. More importantly, a user's visual preference over items' appearance evolves over time, and visual preferences at different time steps are usually sequentially dependent. In addition, at each time step, the user's current visual preference usually affects her choice of items. Taking Figure 1 as an example, each row is a sequence of items purchased sequentially by an anonymous user on amazon.com. For each sequence, if we take the first three items as the context to predict the following item, two observations can be drawn: (1) although the sequentially purchased items differ from each other, the visual style revealed by their appearances is quite similar; (2) when facing a variety of similar items with different appearances for the next choice, the user tends to choose the item whose appearance is similar to that of the recently chosen items in the context.
Both observations indicate that a user's visual preference over item appearance is not only sequentially dependent, but also plays an important role in her choice of items.
However, most existing studies on SRSs mainly model the user's main preference over items' main aspects revealed by item IDs, e.g., function, while overlooking her visual preference towards items' appearance, which greatly reduces recommendation accuracy. Accordingly, a majority of SRSs only take the ID of each item within a sequence as the input to model the sequential dependencies among items for recommendations. For example, Markov chain based SRSs simply factorize the known first-order transition matrix over items in a sequence into latent factors and then use the obtained latent factors to estimate unknown transitions for next-item prediction [9]. Recurrent neural network (RNN) based SRSs usually take the embedding of each item in a sequence as the input of each time step and then model the sequential dependencies among items for next-item prediction and recommendation [10]. Some other works introduce attention mechanisms to highlight the items that are more important and relevant to the next-item prediction [11]. A minority of works on SRSs incorporate additional information, such as item attributes, for better next-item recommendation. For instance, both the item ID and each of the item attributes are mapped into a latent vector, which is then taken as the input of a shallow network for next-item recommendation [12]. Although much progress has been achieved, all of these works ignore the impact of items' appearance on users' choices and fail to model users' visual preferences effectively. This may limit the performance of sequential recommendations to some degree.
In this paper, we aim to address this gap by jointly modeling users' main preferences over items and visual preferences over items' appearance. Accordingly, a coupled Double-chain Preference learning Network (DPN) is designed to model both preferences for recommendations. DPN mainly contains two connected chains: an item ID chain for modeling users' main preference revealed by the IDs of items, and an item image chain for modeling users' visual preference embedded in item appearance. Each chain is equipped with an RNN to model the corresponding dynamic preference, which may change over time. In addition, a bridge is built between the two chains at each time step to incorporate the hidden state of the item image chain into the item ID chain, modeling the visual impact of items on users' choices. The final hidden states of the item ID chain and the item image chain embed the user's current main preference and visual preference, respectively. Finally, the hidden states of both chains are attentively integrated to form a compound representation of the user's main preference and visual preference for the subsequent recommendations. The candidate items showing a high probability of matching the user's compound preference are put into the recommendation list.
Thanks to the particularly designed coupled double-chain network structure of DPN, both a user's main preference over items' main aspects, e.g., function, and her visual preference over items' appearance are modelled. Moreover, the impact of a user's visual preference on her choice of items (revealed by item IDs) is well modelled through the bridge between the two chains. The contributions of our work can be summarized as follows:
• We propose visual preference learning to particularly model a user's visual preference towards items' appearance, which may greatly affect the user's choices of items in sequential recommendations.
• We design the coupled Double-chain Preference learning Network (DPN) to jointly learn a user's main preference and visual preference as well as the impact of the visual preference on the user's choice of items.
• An item transition unit (ITU) is particularly designed to equip DPN to model the transitions between item IDs while considering the influence of item appearance.
Extensive experimental results on two real-world datasets show the superiority of our approach over representative and state-of-the-art approaches and verify the significance of learning a user's visual preference.
The rest of this paper is organized as follows: we review related work on sequential recommendations in Section 2 and formalize the problem in Section 3. We present our proposed approach in Section 4, followed by the experiments in Section 5. Finally, we conclude this work in Section 6.

II. RELATED WORK
In this section, we review the existing works in the area of sequential recommendations. Generally speaking, according to the utilized techniques and approaches, sequential recommendations can be divided into conventional sequential recommendations and deep learning based sequential recommendations. In conventional sequential recommendations, conventional sequence modeling approaches, e.g., sequential pattern mining and Markov chain models, are mainly utilized to model the sequential dependencies among a sequence of user-item interactions for recommendations. In deep learning based sequential recommendations, deep neural networks, including Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), are mainly employed to learn the sequential dependencies for recommendations. In addition to these two types of sequential recommendations, we also review additional information-augmented sequential recommendations, which incorporate item attribute information and are thus closely related to our work.

A. CONVENTIONAL SEQUENTIAL RECOMMENDATIONS
Early studies on sequential recommendations mainly focus on the development of conventional SRSs. There are mainly four classes of conventional SRSs: sequential pattern based SRSs, Markov chain based SRSs, latent factor model based SRSs and neighborhood model based SRSs. Specifically, Yap et al. [13] introduced a personalized sequential pattern mining based recommendation framework, which first mines personalized sequential patterns from users' transaction behavior data and then utilizes the mined patterns to guide the subsequent recommendations. Although simple and sometimes effective, this class of approaches easily loses infrequent but still important items and patterns due to the frequency constraint commonly used in the pattern mining process, which reduces recommendation performance. Markov chain (MC) models are another intuitive solution to model the transitions within sequence data and are thus naturally employed for sequential recommendations. Since Markov chain based SRSs do not involve the aforementioned frequency constraint, they can effectively alleviate the above drawback of sequential pattern based SRSs. A typical Markov chain based SRS is the Personalized Ranking Metric Embedding (PRME) model proposed by Feng et al. [14], which precisely models personalized check-in sequences for next-POI recommendation within a Markov chain framework. However, Markov chain based SRSs usually model only the first-order dependencies in a sequence of interactions while ignoring higher-order ones, which often decreases recommendation performance. Rendle et al. [15] proposed a classic latent factor model called Factorized Personalized Markov Chains (FPMC), which learns the latent factors of items by factorizing the transition matrix of the underlying Markov chains over items from adjacent baskets.
The learned latent factors are then utilized to represent items for next-basket recommendation. Similar to Markov chain based approaches, latent factor models can model only the first-order dependencies while overlooking higher-order ones. Moreover, they easily suffer from the data sparsity issue. Recently, session-KNN (SKNN) was proposed for next-item recommendation; it utilizes the similarity between sessions to score the candidate items as the next item [16,17]. Based on SKNN, Garg et al. [18] introduced a sequence- and time-aware neighborhood model for session-based recommendation, which additionally takes into account the readily available position information of items within sessions/sequences for more accurate recommendations.
In summary, most conventional SRSs only model users' main preferences for sequential recommendations. They often overlook the users' visual preferences towards items, which may greatly reduce the recommendation performance.

B. DEEP LEARNING BASED SEQUENTIAL RECOMMENDATIONS
In recent years, deep learning models, including RNNs and CNNs, have shown great potential to capture complex relations in a variety of areas, including Computer Vision (CV) and Natural Language Processing (NLP). Inspired by this, researchers have introduced deep learning models into sequential recommendations. Consequently, a series of deep learning based SRSs have been proposed and achieved great success. Among various deep neural network architectures, the RNN is the prominent one for sequential recommendations because of its powerful capability to model sequence data; therefore, many RNN-based SRSs have been developed. Hidasi et al. [10] proposed an RNN-based model called GRU4Rec for predicting the next item by equipping the RNN with Gated Recurrent Units (GRUs) to model the long-term sequential dependencies embedded in a sequence of interactions. This model was further improved by introducing novel ranking loss functions tailored to RNNs [19]. Some other similar works replaced the GRU with Long Short-Term Memory (LSTM) units to build LSTM-based SRSs [20]. To better model users' preferences across multiple sequences of interactions, a hierarchical RNN was proposed in [21] to model both the short-term intra-sequence dependencies and the long-term inter-sequence dependencies for next-item recommendations. However, RNNs easily generate false sequential dependencies due to their rigid order assumption over any two adjacent interactions, which may not hold in real-world cases since the order of some items may not make much sense [22].
In addition to RNNs, CNNs are also applied in sequential recommendations to build CNN-based SRSs. Benefiting from the strong capability of CNNs to capture the dependencies between different local areas, CNN-based SRSs are strong at capturing the collective or local dependencies among interactions in a sequence [23,24]. Specifically, CNN-based SRSs first learn the embedding of each interaction and then utilize the learned embeddings to generate recommendations. Tang et al. [23] proposed a convolutional sequence embedding recommendation model called Caser, which utilizes both horizontal and vertical convolutional filters to learn the item-level and feature-level dependencies, respectively, for better recommendations. Further, a 3D CNN model was proposed to jointly model the sequential patterns in sequence data and the item characteristics from item content features for next-item recommendation [25]. However, one obvious drawback of CNN-based SRSs is that they may not effectively capture long-range dependencies in long sequences because of the limited receptive field of CNNs.
Beyond the aforementioned two basic deep neural network architectures, i.e., RNN and CNN, some advanced mechanisms are also applied to sequential recommendations to address specific issues, among which attention is one of the most representative. To be specific, an attention mechanism is often incorporated into basic models, e.g., an embedding model or an RNN, to emphasize the more important and relevant interactions in a sequence for making more accurate recommendations. Wang et al. [22] incorporated attention into an embedding model to learn attentive item and session embeddings for next-item recommendations. Later, the self-attention mechanism was introduced into SRSs to emphasize the more relevant actions among users' historical click or purchase actions for accurate sequential recommendations [11,26,27].
Cui et al. [28] combined attention and a hierarchical RNN model to emphasize both the important items within each session and the important sessions for more accurate sequential recommendations. Liu et al. [29] used an attention model to attentively read out the relevant and important interaction information from a memory module for better next-item prediction. Benefiting from the powerful representation capacity of a natural language processing model called Bidirectional Encoder Representations from Transformers (BERT), a multi-head self-attention model was introduced into SRSs to further improve the performance of sequential recommendation [30]. Further, Cho et al. [31] proposed a novel multiple-self-attention-head model to simultaneously extract various patterns from users' sequential behaviors for accurate sequential recommendation. Although effective, the attention mechanism tends to assign larger weights to a few significant interactions while downplaying others, which easily leads to bias.
To summarize, all these different types of deep learning based sequential recommendations have achieved great success. However, most of them model only users' main preferences by taking the item IDs in a sequence as the input (some also consider user IDs), while ignoring other aspects, especially item appearance, which may also greatly influence users' choices of items.

C. ADDITIONAL INFORMATION-AUGMENTED SEQUENTIAL RECOMMENDATIONS
In addition to the generally utilized item ID information, some works on SRSs take additional information, such as item attributes, item position information within a sequence, and time information, to provide more context for improving sequential recommendation. For example, Wang et al. [12] took both the item ID and its attributes as the input of a shallow neural network to first learn a compound item representation and then fed the representation into the prediction layer for next-item prediction. Tuan et al. [25] proposed a 3D convolutional neural network to learn item representations from both ID information and content feature information for next-item prediction. Garg et al. [18] introduced a neighborhood model that incorporates the readily available item position information within sequences for more accurate sequential recommendations. Li et al. [32] exploited the timestamps of interactions within a sequence to explore the influence of different time intervals between interactions on next-item recommendation. Ye et al. [33] utilized both the "absolute time patterns" and "relative time patterns" embedded in a user interaction sequence to improve sequential recommendation. Although these works have taken a step forward by incorporating more contextual information to enrich the input for sequential recommendations, they still fail to consider users' visual preferences towards item appearance.
Only a few works have considered the appearance of items when making sequential recommendations. Hidasi et al. [34] devised a parallel RNN architecture to incorporate both the visual and textual features of items to build informative item representations for next-item recommendations. In that work, each RNN serves as one channel to independently model one type of item feature or the item ID, and finally the outputs of all RNNs are combined to predict the next item. Although such approaches have improved recommendation performance to some degree, they simply treat item appearance as one kind of feature without particularly learning users' visual preferences. Moreover, they have not fully utilized the item appearance. The reasons are threefold: first, they often utilize image features pre-trained on item appearance for other tasks unrelated to recommendation; consequently, the pre-trained image features may lose important information that is sensitive to sequential recommendations [35]. Second, they treat all parts of an item's appearance as equally important, without any discrimination; consequently, the important information may be overwhelmed by the irrelevant and noisy information embedded in the item appearance. Third, they model each type of feature and the item ID independently, without interaction at each time step, which contradicts the reality that an item's attribute information influences a user's choice at every step rather than just at the final step. In summary, existing works on SRSs either totally ignore users' visual preferences and item appearance, or simply treat item appearance as a kind of attribute while neglecting the users' particular visual preference towards item appearance. To the best of our knowledge, there is no existing work on SRSs that particularly models users' visual preferences for better recommendations.
In practice, in some visual-sensitive domains, e.g., clothing, in addition to users' main preference, e.g., the preference towards items' functions, the visual preference also plays an important role in users' choices of items. Inspired by this fact, we propose learning the visual preference towards item appearance to generate more reliable and accurate sequential recommendations.

D. ITEM VISUAL APPEARANCE LEARNING IN RECOMMENDATION
Some researchers have investigated the effect of items' visual information in recommendations, especially in the fashion recommendation area, but most of them focus on non-sequential recommendations. For instance, Liu et al. [36] proposed a DeepStyle model to learn the style features of items and sense users' preferences from non-sequential data for personalized recommendation. Lee et al. [37] proposed Style2Vec to learn and encode items' visual styles into vector representations for downstream fashion recommendation, i.e., recommending an outfit of items with good compatibility. Similarly, Yin et al. [7] and Li et al. [38] learn and utilize the visual compatibility relationships between items to enhance fashion recommendation or complementary item recommendation. Hidayati et al. [8] tried to match clothing style and personal body shape for effective fashion recommendation.
In summary, learning items' visual appearance has been widely researched in the fashion recommendation area and has achieved great success. However, fashion recommendation aims at recommending an outfit of items (usually clothes) with good compatibility among them, without modeling the sequential relations between users' historical actions [39]. This is quite different from sequential recommendation, which mainly relies on the sequential dependencies over users' sequential behaviors to predict and recommend the next item [40]. Therefore, the aforementioned fashion recommendation methods cannot be directly utilized for sequential recommendation due to their different mechanisms and aims.

III. PROBLEM STATEMENT
Given a sequence set S = {s_1, ..., s_|S|}, each sequence s = {v_1, ..., v_i, ..., v_|s|} consists of a list of items sequentially interacted with by a certain user in one transaction event. All the items together form the item universe set V. |S| and |s| denote the number of sequences in S and the number of items in s, respectively. v_i denotes the i-th item in sequence s. Each item is identified by a unique item ID d and is associated with one image m showing its appearance; for example, the ID and image of item v_i are denoted d_i and m_i, respectively. Given one item v_t in s, all the items occurring before v_t form its sequence context c_t, and each item in c_t is called a contextual item of v_t. Specifically, c_t contains two sequences: the sequence of IDs of all contextual items, {d_1, ..., d_{t-1}}, and the sequence of images of all contextual items, {m_1, ..., m_{t-1}}. Generally, given a sequence context c consisting of the (t − 1) precedent items, including the ID and image of each item, an SRS is built to predict the subsequent t-th item from the candidate item set V. Accordingly, our model can be trained as a classifier to learn the conditional probability p(v_t | c) over each candidate item. Once the model is trained, it can be used to recommend the next item by ranking all the candidate items in terms of their conditional probability given the context. The items with the top-k conditional probabilities are taken to constitute the recommendation list.
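The problem setup above can be sketched in code. This is a minimal, hypothetical illustration (the item IDs, image placeholders, and probability values are made up for the example) of how contexts c_t are formed from a sequence and how the top-k recommendation list is derived from the learned conditional probabilities p(v_t | c):

```python
import numpy as np

def contexts(seq):
    """Yield (context, target) pairs from one interaction sequence.

    Each item is a (item_id, image) pair; the context c_t holds the IDs
    and images of all items preceding the target v_t.
    """
    for t in range(1, len(seq)):
        ids = [d for d, _ in seq[:t]]
        images = [m for _, m in seq[:t]]
        yield (ids, images), seq[t][0]

def recommend_top_k(probs, k):
    """Rank candidate items by p(v_t | c) and return the top-k item indices."""
    return list(np.argsort(probs)[::-1][:k])

# Toy sequence of (item ID, image placeholder) pairs.
s = [(3, "img3"), (7, "img7"), (1, "img1"), (5, "img5")]
pairs = list(contexts(s))

# Toy conditional probabilities over a universe of 8 candidate items.
p = np.array([0.05, 0.15, 0.02, 0.4, 0.03, 0.25, 0.08, 0.02])
top3 = recommend_top_k(p, 3)
```

Training then amounts to pushing p(v_t | c) towards 1 for the observed target of each (context, target) pair.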

IV. COUPLED DOUBLE-CHAIN PREFERENCE LEARNING NETWORK
The architecture of our proposed coupled Double-chain Preference learning Network (DPN) is presented in Figure 2. DPN consists of two main modules: (1) the preference learning module, which collaboratively learns a user's main preference and visual preference towards items with an item ID chain and an item image chain, respectively, and (2) the prediction module, which predicts the next item by taking the learned preferences as the input. To be specific, in the preference learning module, for each item in a sequence context, its ID d and image m are taken as the initial input in the input layer. d and m are then projected into a latent embedding vector d and matrix m, respectively, which are fed into the corresponding item transition unit (ITU) in the item ID chain and gated recurrent unit (GRU) in the item image chain. The item ID chain and the item image chain model a user's main preference and visual preference, respectively, while the influence of the visual preference on the main preference is considered at each time step through a bridge between the two chains; namely, a user's choice of one item depends not only on the IDs of the previously chosen items but also on their appearances. Consequently, the last hidden states at time step (t − 1) of the item ID chain and the item image chain encode the user's main preference and visual preference, respectively, and are combined with an element-wise attention layer to predict the next item at step t. We present the technical details of our proposed DPN in the following subsections.

A. ITEM REPRESENTATION LEARNING
Given an item from a sequence, we build its ID embedding and visual embedding, respectively.
a: Item ID embedding.
To be specific, for the i-th item in a sequence context c, its ID embedding d_i ∈ R^K is obtained by projecting its ID d_i into a latent space with an embedding layer. Formally,

d_i = W_e · onehot(d_i),   (1)

where W_e is the embedding matrix and each column of W_e corresponds to the embedding of one item.
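Since each column of the embedding matrix is one item's embedding, the projection reduces to a column lookup. A minimal numpy sketch (the dimensions and random initialization are illustrative, not the paper's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_items = 8, 100                       # embedding size K, item universe size |V|
W_e = rng.normal(size=(K, n_items))       # embedding matrix; column j = embedding of item j

def id_embedding(item_id):
    # Multiplying W_e by a one-hot vector is just a column lookup.
    return W_e[:, item_id]

d = id_embedding(42)
```

In a real model W_e would be a trainable parameter updated jointly with the rest of the network.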
b: Item visual embedding.
For the visual embedding of item v_i, we first introduce a pre-trained convolutional neural network (CNN) to extract annotation vectors from the item appearance image m_i, and then utilize an attention model to integrate these vectors into the final visual embedding m_i. Note that, in order to keep more of the initial information in the image, which may be more sensitive to the final task, we use the low-level latent features extracted by the CNN rather than the high-level abstract ones from its final layer. More specifically, a 16-layer CNN named VGGNet (shortened as VGG-16) [41], pre-trained on ImageNet [35] without fine-tuning, is applied to extract the initial features from item appearance images. The VGG-16 network contains 13 convolutional layers and 3 fully-connected layers, and it finally outputs a 1000-dimensional feature vector for each image from the final layer. Instead of using this final feature vector directly, we first extract the output of the fifth convolutional layer (conv5-3) of VGG-16, i.e., a feature map of size 14 × 14 × 512, and then split it into a set of l annotation vectors

{a_1, ..., a_l}, a_j ∈ R^512, l = 14 × 14 = 196.   (2)

Each annotation vector a_j can be seen as the representation of a local, small and uniform-sized region of the image of an item. All the annotation vectors are then attentively aggregated into the visual representation m of the item. Note that other CNN variants are also able to achieve results similar to those of VGG-16 here.
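Turning the conv5-3 feature map into annotation vectors amounts to a reshape. The sketch below uses a random array as a stand-in for the real VGG-16 output (no pre-trained network is loaded here), so only the bookkeeping is demonstrated:

```python
import numpy as np

# Stand-in for the conv5-3 feature map of VGG-16 for one image: 14 x 14 x 512.
feature_map = np.random.default_rng(1).normal(size=(14, 14, 512))

# Split into l = 14*14 = 196 annotation vectors, each of dimension 512;
# annotation j corresponds to one local, uniform-sized region of the image.
annotations = feature_map.reshape(-1, 512)
```

With row-major reshaping, annotation j maps back to spatial position (j // 14, j % 14) of the feature map.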
Once the annotation vectors are ready, a feature-level attention model is devised to build an attentive visual representation for each item. Specifically, by assigning larger weights to the more important and relevant annotation vectors and smaller weights to the others, the feature-level attention model emphasizes the local regions of an image that are more sensitive and relevant to a user's preference and her choice of items, while downgrading the noisy information from irrelevant regions. Note that the weights of the attention model are jointly trained with the final prediction in an end-to-end manner. In addition, to incorporate the influence of the last item, the last hidden state h^m_{i-1} of the item image chain is used to guide the calculation of the attention weights. Accordingly, the attention weight α_ij of the j-th annotation vector of the i-th item is calculated as:

e_ij = W_α σ_t(W_a a_j + W_h h^m_{i-1}),   (3)
α_ij = exp(e_ij) / Σ_{j'=1}^{l} exp(e_{ij'}),   (4)

where W_α, W_a and W_h are the corresponding weighting matrices for e, a and h, respectively. Once the attention weights are ready, the attentive visual embedding of the i-th item is calculated as:

m_i = Σ_{j=1}^{l} α_ij a_j.   (5)
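The feature-level attention can be sketched as follows. The exact scoring form and matrix shapes here are illustrative assumptions, but the structure follows the description above: score each annotation vector against the last image-chain hidden state, normalize with a softmax, and take the weighted sum:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_visual_embedding(A, h_prev, W_a, W_h, w_alpha):
    """Aggregate annotation vectors A (l x 512) into one visual embedding.

    h_prev is the last hidden state of the item image chain, which guides
    the attention weights; biases are omitted for brevity.
    """
    # e_j: score one annotation vector against the last image-chain state
    scores = np.array([w_alpha @ np.tanh(W_a @ a + W_h @ h_prev) for a in A])
    alpha = softmax(scores)                   # attention weights over regions
    return (alpha[:, None] * A).sum(axis=0)   # weighted sum of annotation vectors

rng = np.random.default_rng(2)
l, D, H = 196, 512, 64                        # regions, feature dim, hidden dim
A = rng.normal(size=(l, D)) * 0.1
m_i = attentive_visual_embedding(
    A, rng.normal(size=H),
    W_a=rng.normal(size=(H, D)) * 0.01,
    W_h=rng.normal(size=(H, H)) * 0.01,
    w_alpha=rng.normal(size=H),
)
```

Because the weights sum to one, the result is a convex combination of the annotation vectors: each dimension of m_i stays within the range spanned by the corresponding dimension of A.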

B. PREFERENCE LEARNING MODULE
After the ID embedding and visual embedding of each item are obtained, they are input into the ITU cell of the item ID chain and the GRU cell of the item image chain for main preference learning and visual preference learning, respectively.
a: Visual preference learning via item image chain.
An item image chain built on a recurrent structure is particularly designed to model a user's visual preference over items. At each time step, the visual embedding is input into the corresponding GRU cell to model the sequential dependencies among the appearances of a sequence of items. Specifically, at time step i, the hidden state h^m_i is calculated based on the current input m_i and the last hidden state h^m_{i-1}. First, the reset gate r^m_i and forget gate z^m_i are calculated:

r^m_i = σ_s(W^m_r [m_i, h^m_{i-1}]),   (6)
z^m_i = σ_s(W^m_z [m_i, h^m_{i-1}]),   (7)

where W^m_r and W^m_z are the corresponding weight matrices to be learned during the training process, [·, ·] denotes concatenation, and σ_s and σ_t are activation functions specified as sigmoid and tanh, respectively.
Then the candidate state h̃^m_i and the hidden state h^m_i of the current step can be computed as:

h̃^m_i = σ_t(W^m_h [m_i; r^m_i ⊙ h^m_{i-1}]),   (8)
h^m_i = (1 − z^m_i) ⊙ h^m_{i-1} + z^m_i ⊙ h̃^m_i,   (9)

where W^m_h is the corresponding weight matrix and ⊙ denotes the element-wise product.
Consequently, a user's visual preference at step i is encoded into the hidden state h^m_i.
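Eqs. (6)-(9) are the standard GRU update, which can be sketched as follows. For readability the full weight matrices W^m_r, W^m_z, W^m_h are simplified to hypothetical per-dimension (diagonal) weights on the input and on the previous state; this is an illustrative sketch, not the trained model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(m, h_prev, W):
    """One step of the item image chain: reset gate r, update gate z,
    candidate state, and the gated interpolation that yields h^m_i."""
    K = len(m)
    r = [sigmoid(W['rm'][k] * m[k] + W['rh'][k] * h_prev[k]) for k in range(K)]
    z = [sigmoid(W['zm'][k] * m[k] + W['zh'][k] * h_prev[k]) for k in range(K)]
    cand = [math.tanh(W['hm'][k] * m[k] + W['hh'][k] * (r[k] * h_prev[k]))
            for k in range(K)]
    # new state: (1 - z) keeps the old state, z admits the candidate
    return [(1.0 - z[k]) * h_prev[k] + z[k] * cand[k] for k in range(K)]
```

Running this cell over the sequence of visual embeddings m_1, ..., m_{t-1} leaves the user's visual preference encoded in the final hidden state.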
b: Main preference learning via item ID chain.
We learn a user's main preference towards items with a particularly designed item ID chain. In the item ID chain, the cell at each time step is equipped with a carefully designed item transition unit (ITU) to model the transitions between item IDs while considering the influence of item appearance. Specifically, the hidden state of the last time step of the item image chain is taken as an input of the ITU at the current time step. That is, at time step i, the hidden state h^d_i is calculated based on the current item ID embedding d_i, the last hidden state h^m_{i-1} of the item image chain, and the last hidden state h^d_{i-1} of the item ID chain. First, for the i-th ITU, the reset gate r^d_i and update gate z^d_i are inherited from a standard GRU:

r^d_i = σ_s(W^d_r [d_i; h^d_{i-1}]),   (10)
z^d_i = σ_s(W^d_z [d_i; h^d_{i-1}]),   (11)

where W^d_r, W^d_z and σ_s are analogous to the weight matrices and activation function in Eqs. (6)-(7).
To incorporate the influence of a user's visual preference on her main preference, an extra gate g^d_i, called the preference coupling gate, is devised:

g^d_i = σ_s(W^d_g [d_i; h^m_{i-1}]).   (12)

g^d_i then determines how much information from the visual preference modelled by the item image chain should be incorporated into the learning of the main preference modelled by the item ID chain. Specifically, in the item ID chain, the candidate state and hidden state at the current time step can be computed as:

h̃^d_i = σ_t(W^d_h [d_i; r^d_i ⊙ h^d_{i-1}; g^d_i ⊙ h^m_{i-1}]),   (13)
h^d_i = (1 − z^d_i) ⊙ h^d_{i-1} + z^d_i ⊙ h̃^d_i,   (14)

where W^d_g, W^d_h, σ_s and σ_t in Eqs. (12)-(13) are the weight matrices and activation functions respectively. As a result, the hidden state h^d_i of the item ID chain encodes the main preference of the user at the current moment.

Finally, to learn a user's final compound preference, an element-wise attention model is devised to attentively integrate the user's main preference and visual preference. Specifically, a user's compound preference h_{t-1} at the last time step (t − 1) is computed as the element-wise product between the attentive weights β and the last hidden state h^d_{t-1} of the item ID chain:

h_{t-1} = β ⊙ h^d_{t-1},   (15)

where, for each element (dimension) h^d_{t-1,k} of h^d_{t-1} (k = 1, ..., K, with K the dimension of h^d_{t-1}), the attentive weight β_k is computed from the corresponding dimension of the last hidden state h^m_{t-1} of the item image chain with a softmax layer:

β_k = exp((W_β h^m_{t-1})_k) / Σ_{k'=1}^{K} exp((W_β h^m_{t-1})_{k'}),   (16)

where W_β is a weight matrix for h^m_{t-1}. Consequently, the user's main preference h^d_{t-1} is fine-tuned under the guidance of the attentive weights calculated from the visual preference h^m_{t-1}. In this way, those dimensions of h^d_{t-1} which are more relevant to the next choice as well as to the user's visual preference are emphasized.
This compound preference h_{t-1} is then taken as the representation e_c of the sequence context c, i.e., e_c = h_{t-1}, which is input into the prediction module for the subsequent next-item prediction.
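The ITU step and the element-wise fusion can be sketched together in plain Python. As with the GRU sketch, the weight matrices W^d_r, W^d_z, W^d_g, W^d_h and W_β are simplified to hypothetical per-dimension scalars (W_β is taken as the identity), so this illustrates the information flow, not the paper's exact parameterization.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def itu_step(d, h_d_prev, h_m_prev, W):
    """One step of the item ID chain. The preference coupling gate g
    controls how much of the image-chain state h_m_prev enters the
    candidate state alongside the reset-gated h_d_prev."""
    K = len(d)
    r = [sigmoid(W['r'][k] * (d[k] + h_d_prev[k])) for k in range(K)]
    z = [sigmoid(W['z'][k] * (d[k] + h_d_prev[k])) for k in range(K)]
    g = [sigmoid(W['g'][k] * (d[k] + h_m_prev[k])) for k in range(K)]
    cand = [math.tanh(W['h'][k] * (d[k] + r[k] * h_d_prev[k] + g[k] * h_m_prev[k]))
            for k in range(K)]
    return [(1.0 - z[k]) * h_d_prev[k] + z[k] * cand[k] for k in range(K)]

def compound_preference(h_d, h_m):
    """Element-wise attention: a softmax over the dimensions of the
    visual state h_m rescales the main-preference state h_d."""
    mx = max(h_m)
    exps = [math.exp(x - mx) for x in h_m]
    total = sum(exps)
    beta = [e / total for e in exps]
    return [b * hd for b, hd in zip(beta, h_d)]
```

Note the design contrast with a plain GRU: the extra gate g gives the visual chain a per-step entry point into the ID chain, rather than a single fusion at the end.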

C. PREDICTION MODULE
Once the sequence context embedding, i.e., the user's compound preference, is learned, a fully connected layer is employed to connect it to the output layer. Specifically, a softmax layer is employed as the prediction module to predict the conditional probability distribution p(v|c) ∈ R^{|V|} over the candidate items, with c as the input context. In particular, the conditional probability of the true target item v_t is computed as:

p(v_t|c; θ) = exp(W_o^{v_t} e_c) / Σ_{v∈V} exp(W_o^v e_c),   (17)

where W_o is the weight matrix of the aforementioned fully connected layer (W_o^v denotes its row for item v) and θ denotes all the parameters of our model to be learned. Given the conditional probability calculated by Eq. (17), our goal is to find the optimal parameters θ maximizing the conditional log-likelihood. Consequently, given a training dataset D of pairs ⟨c, v_t⟩, the regularized conditional log-likelihood objective is:

max_θ Σ_{⟨c,v_t⟩∈D} log p(v_t|c; θ) − λ‖θ‖²,   (18)

where the L2 regularization parameter λ is set to 0.001. DPN is trained with the Adagrad optimizer [42] w.r.t. all the learnable parameters, using mini-batch gradient descent with a batch size of 8 and an initial learning rate of 0.1; a learning rate decay mechanism adjusts the learning rate along with the training process. Once the model parameters θ have been learned, DPN is ready to generate predictions and, further, next-item recommendations. Given a user's transactional context, which contains both item IDs and the corresponding appearance images, the probabilities of choosing the candidate next items are calculated according to Eq. (17), and a recommendation list ranking the candidate items is obtained.
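Eqs. (17)-(18) amount to a softmax over item scores followed by a negative log-likelihood loss, which can be sketched as below. The toy-sized `W_o` rows stand in for the output weights of the fully connected layer; this is a sketch of the objective, not the trained model.

```python
import math

def predict_probs(e_c, W_o):
    """Softmax over candidate items (Eq. 17); each row of W_o holds
    one item's output weights."""
    scores = [sum(w * e for w, e in zip(row, e_c)) for row in W_o]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def neg_log_likelihood(probs, target):
    """Per-instance loss: -log p(v_t | c), the quantity minimized
    when maximizing the conditional log-likelihood of Eq. (18)."""
    return -math.log(probs[target])
```

Ranking candidate items by `predict_probs` directly yields the top-K recommendation list used at serving time.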

A. DATA PREPARATION
Two real-world product review datasets are used for the experiments, namely two categories of the Amazon Product Data 1 [43]: (1) "Clothing, Shoes and Jewelry" (shortened to "Amazon Clothing" below); and (2) "Cell Phones and Accessories" (shortened to "Amazon Phones" below). Users' visual preferences are expected to play an important role in their shopping choices for these two categories of products, since such products are more likely to be visually sensitive. Both datasets record the reviewed items of each user on amazon.com from May 1996 to July 2014. Since the reviews before 2012 are very sparse, they are excluded from the experimental datasets to avoid data sparsity issues.
We follow the common practice in SRS studies to prepare the training and test instances for our proposed model. First, a set of sequences is extracted from each original dataset by putting all items in one user's review history together to form a sequence, as most studies on SRSs do [21]. In practice, a user's review of an item is regarded as a kind of implicit feedback on it [34,44]. Second, sequences containing fewer than five items, as well as items reviewed fewer than five times, are removed, since such sparse users and items heavily reduce recommendation accuracy. Finally, to build training and test instances of the format ⟨c_t; v_t⟩, for each sequence s, the last item v_t in s is picked as the target item and all the items occurring before v_t in s constitute its corresponding context c_t. For each item v_i, in addition to its ID, an appearance image m_i is obtained from its metadata. Therefore, the context c_t = ⟨c^d_t, c^m_t⟩ is built on both the sequence c^d_t = {v_1, v_2, ..., v_{t-1}} of item IDs and the sequence c^m_t = {m_1, m_2, ..., m_{t-1}} of item images. The statistics of the experimental datasets are shown in Table 1. Once all the instances are built, we randomly select 70%, 20% and 10% of them to form the training set, test set and validation set respectively. We conducted 10-fold cross-validation to achieve stable results and report the average over the folds.
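The instance-construction step described above can be sketched as a small helper. The function name and the representation of items as plain IDs are assumptions for illustration; in the paper each context also carries the corresponding image sequence.

```python
def build_instances(review_sequences, min_items=5):
    """Split each user's chronological review sequence into a
    (context, target) pair: the last item is the prediction target,
    everything before it is the context. Sequences shorter than
    min_items are dropped, mirroring the filtering described above."""
    instances = []
    for seq in review_sequences:
        if len(seq) < min_items:
            continue
        context, target = seq[:-1], seq[-1]
        instances.append((context, target))
    return instances
```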

B. EXPERIMENTAL SETTINGS a: Evaluation Metrics.
Four commonly used metrics, namely Recall, Precision, Mean Reciprocal Rank (MRR) and normalized Discounted Cumulative Gain (nDCG), are employed to evaluate the recommendation accuracy [22] of all the compared methods. All four metrics are computed on a top-K recommendation list (K ∈ {10, 20}) to evaluate the ranking efficacy over the candidate items.
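With a single ground-truth next item per test instance, three of these metrics take a particularly simple form, sketched below (Precision@K is just Recall@K divided by K in this single-target setting).

```python
import math

def recall_at_k(ranked, target, k):
    """1 if the true next item appears in the top-K list, else 0."""
    return 1.0 if target in ranked[:k] else 0.0

def mrr_at_k(ranked, target, k):
    """Reciprocal rank of the true next item within the top-K list."""
    top = ranked[:k]
    return 1.0 / (top.index(target) + 1) if target in top else 0.0

def ndcg_at_k(ranked, target, k):
    """With one relevant item, nDCG reduces to 1 / log2(rank + 1)."""
    top = ranked[:k]
    return 1.0 / math.log2(top.index(target) + 2) if target in top else 0.0
```

Averaging each function over all test instances gives the reported numbers.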
b: Comparison Methods.
The following representative and state-of-the-art SRSs, built on various frameworks including Markov chains, RNNs, CNNs and attention models, are selected as the baselines; they are commonly used as baselines in most SRS studies. In addition, two simplified versions of our proposed model are implemented for the ablation analysis. Next, we briefly introduce the baselines one by one.
• RAND. Generates a random recommendation list over the item set for each user.
• POP. Recommends the top-K most popular items (by their frequencies of occurrence in the whole review history) to each user.
• FPMC. A Markov chain-based model which constructs a personalized interaction matrix computed from item transition probabilities for next-basket recommendation [15]. The number of latent factors of FPMC is set to 50.
• LSTM. An RNN-based model which utilizes recurrent units to model sequential dependencies for next-basket recommendation [45]. Rather than a basic RNN, we adopt LSTM cells for better performance. The dimension of the hidden state in LSTM is set to 30.
• GRU. An RNN-based model which adopts GRUs to capture sequential dependencies [10]. The dimension of the hidden state of GRU is set to 30.
• Caser. A convolutional sequence embedding model which embeds a sequence of items into a matrix and then learns sequential preferences as local features using convolutional filters [23]. The embedding size is set to 50.
• ATEM. An attentive embedding model with a soft attention mechanism to emphasize those items in a sequence context which are more important and relevant to the user's next choice [22]. The dimension of the embedding is set to 50.
• SASRec. A sequential recommendation model with a self-attention mechanism [11]. The dimension of the item embedding is set to 50; the other parameters are set as suggested by the authors of the method.
• p-RNN. A feature-rich SRS model which incorporates textual and visual factors as additional input [34]. It adopts a parallel GRU-based network architecture in which item IDs, textual features and visual features are modelled in three independent sub-nets whose hidden states are fused at the last step for the next-item prediction.
Since our work does not involve textual features of items, we adapted p-RNN to our setting by implementing it as a parallel GRU-based network containing two independent sub-nets, one modeling item IDs and the other modeling item visual features. The dimension of the hidden state is set to 30.
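The adapted p-RNN's fusion step can be sketched in one line; the concatenation below is an assumed fusion operator, since the fusion method is not further specified here.

```python
def prnn_late_fusion(h_id_last, h_vis_last):
    """p-RNN-style late fusion: the ID sub-net and the image sub-net
    run independently over the sequence, and only their final hidden
    states are combined (here by concatenation). Because fusion
    happens once at the end, the per-step influence of visual
    preference on main preference -- which DPN's coupling gate
    models -- is absent by construction."""
    return list(h_id_last) + list(h_vis_last)
```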

C. RECOMMENDATION ACCURACY COMPARISON
We compare the recommendation accuracy of our proposed model DPN with that of the compared SRS methods. Tables 2 and 3 and Figure 3 report the Recall, MRR, nDCG and Precision respectively. RAND cannot model any user preferences over items and thus performs worst among all the methods. POP always recommends identical items to every user and thus cannot model users' personalized preferences, let alone the preferences embedded in their sequences of interactions with items.
FPMC only captures first-order dependencies when modeling sequential behaviors while ignoring higher-order ones, resulting in worse performance compared with sequential models like LSTM. Although LSTM and GRU can model high-order and long-term sequential dependencies, they can easily generate false dependencies due to the over-strong order assumption they impose between any two adjacent items in a sequence; in addition, they may easily be biased towards the most recent items in a sequence. Caser does not perform very well either, as it lacks the ability to model long-range dependencies in a sequence due to the limited receptive fields of the pooling layers in a CNN. ATEM dismisses the order of items in a sequence and tends to bias towards part of the main preference while overlooking the rest, and hence does not perform very well. Furthermore, all of the above SRS models only take item IDs as input to model users' main preferences towards items' main aspects, e.g., function; they are not able to model users' additional preferences, e.g., visual preference, over items' other aspects, e.g., appearance. By utilizing the powerful self-attention mechanism to better capture the complex dependencies embedded in the sequential context, SASRec clearly improves the performance compared with ATEM. However, it likewise focuses only on users' main preferences towards items while ignoring their visual preferences.
By incorporating additional visual features of items into SRSs, p-RNN achieves higher recommendation accuracy than the above methods. However, in p-RNN, item IDs and item features are modelled independently in two separate sub-nets. In this way, the interactions between these two factors are lost; in particular, at each step, the impact of the user's visual preference, reflected by the item's visual features, on her main preference and choice of items is ignored.
In our proposed DPN, the dimension of the item ID embedding is set to 30 on both datasets, while the dimension of the hidden state is empirically set to 20 and 30 on Amazon Clothing and Amazon Phones respectively for the best performance. By modeling users' main preferences embedded in item IDs and their visual preferences embedded in item visual features, as well as the interactions between them, with a particularly designed coupled double-chain network, our DPN model is able to comprehensively model users' preferences towards both the main and visual aspects of items. As a result, DPN achieves the best performance on both datasets. In particular, DPN demonstrates a 5% improvement over the best-performing compared method in terms of Recall@20 and nDCG@20 on both datasets (c.f. Tables 2 and 3).
The precision results (c.f. Figure 3) on the two datasets also show that DPN leads the baselines by a clear margin.

D. ABLATION AND DIMENSION ANALYSIS
To demonstrate the effectiveness of our proposed DPN model for user preference modeling, we implemented the full model of DPN plus three simplified versions of it to show the contribution of each key module. In addition, to test the sensitivity to the key hyperparameter, i.e., the dimension of the hidden state, we tested the recommendation performance of these four models under different dimension settings. To be specific, the four models are: (1) the full model of DPN; (2) a single-chain network, called DPN(Unified), in which only one chain is adopted to model users' main and visual preferences together; specifically, at each time step, the ID embedding and visual embedding of each item are concatenated into one unified embedding vector, which is then input into the single chain; (3) a separate-chains network, called DPN(Separate), in which the bridge at each time step of DPN is removed, so that users' main preference and visual preference are modelled independently in two separate chains whose hidden states are fused at the last time step for the next-item prediction; and (4) the GRU-based RNN, denoted GRU, one of the baseline models, which can also be seen as the most simplified version of DPN: a single-chain RNN with GRU cells which takes only item IDs as input. The dimension of the hidden state in each of these models was set to d ∈ {20, 30, 50, 100}. Figure 4 presents the recommendation performance in terms of Recall@20, MRR@20 and nDCG@20 of the above four models under these dimension settings. It is clear that in most cases the full model DPN clearly outperforms all the simplified versions, which demonstrates the rationality of the design of our model. To be specific, the superiority of DPN over the single-chain network, i.e., DPN(Unified), justifies the importance of modeling users' main and visual preferences in separate chains.
The superiority of DPN over the separate-chains network, i.e., DPN(Separate), demonstrates the significance of considering the interaction between users' main preference and visual preference.
It is also worth noting that on the Amazon Phones dataset, the GRU model, which only takes item IDs as input, achieves similar or sometimes better performance than the other two simplified versions of DPN, i.e., DPN(Unified) and DPN(Separate), which take both item IDs and visual features into account. One possible reason is that users are less visually sensitive to phones than to clothing, so the visual features of phones may easily introduce noisy information into the preference modeling. To address this issue, we devised an attention module in the full model of DPN to carefully extract the more important and relevant information from item appearance images while reducing the interference of noisy information. The superiority of DPN over the other three models verifies the efficacy of this attention module. As for the dimensions, all the models achieve their best performance when the dimension is set to 20 on the Amazon Clothing dataset and 30 on the Amazon Phones dataset. As the dimension increases further, the performance decreases, which may be caused by overfitting.

VI. CONCLUSIONS AND FUTURE WORK
In this paper, in Section 1, we introduced the research problem of how to learn users' visual preference in sequential recommendations and justified the significance of this problem. In Section 2, we extensively reviewed the related work in the sequential recommendation area. Then in Section 3, we formalized the problem, followed by our approach proposed in Section 4. Then, we introduced our experiments in Section 5.
To be specific, we have proposed DPN to better learn users' visual preferences towards items' appearances and thus generate more accurate sequential recommendations, which existing studies on SRSs cannot do well. In DPN, an item image chain and an item ID chain are devised to learn users' visual preference from item images and main preference from item IDs respectively. In addition, the interaction between these two kinds of preferences at each time step is carefully modelled. Finally, the two kinds of preferences are integrated to form the compound preference for the next-item recommendation. Extensive experiments conducted on two real-world datasets have shown the superiority of our proposed DPN over representative and state-of-the-art SRSs. In addition, the ablation analysis has justified the rationality of our design and the significance of incorporating item visual features when making recommendations.
Compared with the state-of-the-art sequential recommendation approaches, our proposed DPN can well capture users' visual preferences towards items. As a result, it is able to recommend items which can not only well satisfy users' demands w.r.t. item functions but also well match users' personalized preference on items' appearance and styles. DPN is particularly useful in the real-world scenarios where the items are visually-sensitive, such as clothes and jewellery.
In the future, on the one hand, we will explore the utilization of more advanced attention mechanisms such as cross-attention [46] to better model the influence of item images on users' choices. On the other hand, we will explore the application of DPN to explainable recommendation, i.e., better explaining the recommendation results by modeling users' visual preferences.