Contextual Hybrid Session-based News Recommendation with Recurrent Neural Networks

Recommender systems help users deal with information overload by providing tailored item suggestions to them. The recommendation of news is often considered to be challenging, since the relevance of an article for a user can depend on a variety of factors, including the user's short-term reading interests, the reader's context, or the recency or popularity of an article. Previous work has shown that the use of Recurrent Neural Networks is promising for the next-in-session prediction task, but has certain limitations when only recorded item click sequences are used as input. In this work, we present a contextual hybrid, deep learning based approach for session-based news recommendation that is able to leverage a variety of information types. We evaluated our approach on two public datasets, using a temporal evaluation protocol that simulates the dynamics of a news portal in a realistic way. Our results confirm the benefits of considering additional types of information, including article popularity and recency, in the proposed way, resulting in significantly higher recommendation accuracy and catalog coverage than other session-based algorithms. Additional experiments show that the proposed parameterizable loss function used in our method also allows us to balance two usually conflicting quality factors, accuracy and novelty.
Keywords: Artificial Neural Networks, Context-Aware Recommender Systems, Hybrid Recommender Systems, News Recommender Systems, Session-based Recommendation


I. INTRODUCTION
Recommender Systems (RS) are nowadays widely used on modern online services, where they help users find relevant content. Today, the application fields of recommenders range from the suggestion of items on e-commerce sites, over music recommendations on streaming platforms, to friend recommendations on social networks, where they can generate substantial business value [1], [2].
One of the earliest application domains is the recommendation of online news [3]. News recommendation is sometimes considered as being particularly difficult, as it has a number of distinctive characteristics [4]. Among other challenges, news recommenders have to deal with a constant stream of news articles being published, which at the same time can become outdated very quickly. Another challenge is that the system often cannot rely on long-term user preference profiles. Typically, most users are not logged in and their short-term reading interests must be estimated from only a few logged interactions, leading to a session-based recommendation problem [5]. Finally, as in certain other application domains, a news RS has to find the right balance between recommending only items with the highest assumed relevance and the diversity and novelty of the recommendations as a whole [6]-[10].
In recent years, we observed an increased interest in the problem of session-based recommendation, where the task is to recommend relevant items given an ongoing user session. Recurrent Neural Networks (RNN) represent a natural choice for sequence prediction tasks, as they can learn models from sequential data. GRU4Rec [11] was one of the first neural session-based recommendation techniques, and a number of other approaches were proposed in recent years that rely on deep learning architectures, as in [12], [13].
However, as shown in [14]-[16], neural approaches that only rely on logged item interactions have certain limitations and they can, depending on the experimental setting, be outperformed by much simpler approaches based, e.g., on nearest-neighbor techniques.
One typical way of improving the quality of the recommendations in sparse-data situations is to adopt a hybrid approach and consider additional information to assess the relevance of an item [17]-[19]. Previous approaches in the context of session-based recommendation, for example, used content [20] or context information [21] for improved recommendations. In our work, we adopt a similar approach.
Differently from existing works, however, we consider multiple types of side information in parallel and rely on a corresponding system architecture that allows us to combine different information types. Specifically, we adopt the general conceptual model for news recommendation that we initially proposed in [22], and base our implementation on the corresponding meta-architecture for news recommender systems called CHAMELEON [23]. This meta-architecture was designed to address specific challenges of the news domain, like the fast decay of item relevance and extreme user and item cold-start problems.
Going far beyond the initial analyses presented in these previous papers, we investigate, in this current work, the effects of using various information sources on different quality factors for recommendations, namely accuracy, coverage, novelty, and diversity. Furthermore, we propose a novel approach that allows us to balance potential trade-offs (e.g., accuracy vs. novelty) depending on the specific needs of a given application.
The Research Questions (RQ) of this work are as follows:
• RQ1 - How does our technical approach perform compared to existing approaches for session-based recommendation?
• RQ2 - What is the effect of leveraging different types of information on the quality of the recommendations?
• RQ3 - How can we balance competing quality factors in our neural-based recommender system?
We answer these questions through a series of experiments based on two public datasets from the news domain. One of these datasets is made publicly available in the context of this research. Our experiments will show that (a) considering a multitude of information sources is indeed helpful to improve the recommendations along all of the considered quality dimensions and (b) the proposed balancing approach is effective. To ensure the repeatability of our research, we publicly share the code that was used in our experiments, which not only includes the code for the proposed approach and the baselines, but also the code for data pre-processing, parameter optimization, and evaluation.
The rest of this paper is organized as follows. Next, in Section II, we review existing works and previous technical approaches. In Section III, we summarize the CHAMELEON meta-architecture and present details of our proposed method. In Section IV, the experimental design is described and in Section V we present and discuss our results. The paper ends with a summary and outlook on future work in Section VI.

II. BACKGROUND AND RELATED WORK
In this section, we will first review challenges of news recommendation in more detail and summarize the conceptual model for news recommendation presented in [22]. We will then discuss previous approaches of applying deep learning for certain recommendation tasks. Finally, we will briefly survey existing works on different quality factors for recommender systems.

A. NEWS RECOMMENDER SYSTEMS
The problem of filtering and recommending news items has been investigated for more than 20 years now; see [24] for an early work in this area. Technically, a variety of approaches have been put forward in these years, from collaborative filtering approaches [25], [26], to content-based methods [27]-[32], or hybrid systems [27], [33]-[39]; see also [3] and [40] for recent surveys.

1) Challenges of News Recommendation
The main goal of personalized news recommendation is to help readers find interesting stories that maximally match their reading interests [36]. The news domain has, however, a number of characteristics that make the recommendation task particularly difficult, among them the following [3], [40]:
• Extreme user cold-start - On many news sites, the users are anonymous or not logged in. News portals often have very little or no information about an individual user's past behavior [26], [27], [36];
• Accelerated decay of item relevance - The relevance of an article can decrease very quickly after publication and can also be immediately outdated when new information about an ongoing development is available. Considering the recency of items is therefore very important to achieve high recommendation quality, as each item is expected to have a short shelf life [25], [40];
• Fast growing number of items - Hundreds of new stories are added daily in news portals [41]. This intensifies the item cold-start problem. However, fresh items have to be considered for recommendation, even if not too many interactions are recorded for them [26]. Scalability problems may arise as well, in particular for news aggregators, due to the high volume of new articles being published [3], [31], [40];
• User preference shift - The preferences of individual users are often not as stable as in other domains like entertainment [26]. Moreover, short-term interests of users can also be highly determined by their contextual situation [26], [42]-[44] or by exceptional situations like breaking news [39].
The technical approach chosen in our work takes many of these challenges into account. In particular, it supports the consideration of short-term interests through the utilization of a neural session-based recommendation technique based on RNNs. Furthermore, factors like article recency [3], [45], [46] and general popularity [19] are taken into account along with the users' context. Finally, our next-article prediction approach supports online learning in a streaming scenario [47], and is able, due to its hybrid nature, to recommend items that were not seen in training data.
2) Factors Influencing the Relevance of News Items
Fig. 1 shows the conceptual background of our proposed solution. In this model, a number of factors can influence the relevance of a news article for an individual user, including article-related ones, user-related ones, and what we call global factors.
With respect to article-related factors, we distinguish between static and dynamic properties. Static properties refer to the article's content (text), its title, topic, mentioned entities (e.g., places and people) or other metadata [27], [48]. The reputation of the publisher can also add trust to an article [49], [50]. Some article-related aspects can also change dynamically, in particular an article's popularity [33], [51] and recency [38], [50]. On landing pages of news portals, those two properties are typically the most important ranking criteria and in comparative evaluations, recommending recently popular items often proves to be a comparably well-performing strategy [3].
When considering user-related factors, we distinguish between the users' (short-term and long-term) interests and contextual factors. Regarding the context, their location [52]-[54], their device [55], and the current time [31], [53] can influence the users' short-term interests, and thus the relevance of a news article [31], [48]. In addition, the referrer URL can contain helpful information about a user's navigation and reading context [38].
Considering the user's long-term interests can also be helpful, as some user preferences might be stable over extended periods of time [26]. Such interests may be specific personal preferences (e.g., chess playing) or influenced by popular global topics (e.g., on technology). In this work, we address only short-term user preferences, since we focus on scenarios where most users are anonymous. In general, however, as shown in [56], it is possible to merge long-term and short-term interests by combining different RNNs when modeling user preferences.
Finally, there are global factors that can affect the general popularity of an item, and thus, its relevance for a larger user community. Such global factors include, for example, breaking news regarding natural disasters or celebrity news. Some topics are generally popular for many users (e.g., sports events like Olympic Games); and some follow some seasonality (e.g., political elections), which also influences the relevance of individual articles at a given point in time [33], [50], [51].

B. DEEP LEARNING FOR RECOMMENDER SYSTEMS
Within the last few years, deep learning methods have begun to dominate the landscape of algorithmic research in RS, see [57] for a recent overview. In this specific instantiation of the CHAMELEON meta-architecture [23], we implement two major tasks using deep learning techniques: (a) learning article representations and (b) computing session-based recommendations.

1) Deep Feature Extraction from Textual Data for Recommendation
Traditional recommendation approaches that leverage textual data either use bag-of-words or TF-IDF encodings to represent item content or meta-data descriptions [58]-[60] or they rely on topic modeling [61], [62]. A potential drawback of these approaches is that they do not take word order and the surrounding words of a keyword into account [63].
Newer approaches therefore aim to extract more useful features directly from the text and use them for recommendation. Today's techniques in particular include word embeddings, paragraph vectors, Convolutional Neural Networks (CNNs), and RNNs [64]. Kim et al. [63], for example, proposed Convolutional Matrix Factorization (ConvMF), which combines a CNN with Probabilistic Matrix Factorization to leverage information from user reviews for rating prediction.
Similarly, Seo et al. [65] aim to jointly model user preferences and item properties using a CNN, using a local and global attention mechanism.
Using a quite different approach, Bansal et al. [64] used an RNN to learn representations from the textual content of scientific papers. Besides predicting ratings for a given article, they used multi-task learning to predict also item metadata such as genres or item tags from text.
Our work shares similarities with these previous works in that we extract features using deep learning, in our case with a CNN, based on pre-trained word embeddings. However, instead of predicting ratings, our approach learns a representation of an article's content by training a separate neural network for a side task: predicting article metadata attributes based on its text.
Differently from [64], we also do not rely on an end-to-end model to extract features and to recommend items. Instead, we rely on two different modules in order to ensure scalability, given the often huge amount of recorded user interactions and news articles published every day [3], [40]. The details of our approach will be discussed in Section III.

2) Deep Learning for Session-based Recommendation
RNNs are a natural choice for session-based recommendation scenarios as they are able to model sequences in datasets [66]. GRU4Rec, proposed by Hidasi et al. [11], represents one of the earliest approaches in that context. In their approach, the authors specifically use Gated Recurrent Units (GRU) to be better able to deal with longer sessions and the vanishing gradient problem of RNNs. Later on, a number of improvements to this approach were proposed [67].
One limitation of GRU4Rec in the news domain is that the method can only recommend items that appeared in the training set, because it is trained to predict scores for a fixed number of items. Another potential limitation is that RNN-based approaches that only use item IDs for learning with no side information might not be much better or even worse in terms of prediction accuracy than simpler approaches. Detailed analyses of this phenomenon can be found in [14], [15], [47].
A number of works, however, exist that propose RNN-based approaches that use additional side information about the user's context or the items. In [68], for example, the authors extended GRU4Rec to additionally use image and textual descriptions of the items. Like in our work, they did not apply an integrated end-to-end approach, but extracted image features independently by using transfer learning from a pre-trained network [69] and used simple TF-IDF vectors for textual representations.
Contextual information was used in combination with RNNs, for example, in [70] or [71]. In [70], the authors consider not only the sequence of events when making predictions but also the type of the event, the time gaps between events, or the time of the day of an event, leading to what they call Contextual Recurrent Neural Networks for Recommendation (CRNN). Similarly, Twardowski [71] considers time as a contextual factor that is combined with item information within a hybrid approach.
A work that has certain similarities with ours in terms of the recommendation approach is the Recurrent Attention DSSM (RA-DSSM) model by Kumar [72].
The RA-DSSM is an adaptation for the news domain of the Multi-View Deep Neural Network (MV-DNN), which extended the Deep Structured Semantic Model (DSSM) [73] information retrieval architecture to recommender systems. The MV-DNN maps users and items to a shared semantic space and recommends items that have the highest similarity with the users in the mapped space.
Technically, the authors use a bidirectional LSTM layer with an attention mechanism [74]. Similarly to our instantiation of the CHAMELEON framework, they rely on RNNs as a base building block, use embeddings to represent textual content and implement a similarity-based loss function derived from MV-DNN. The CHAMELEON meta-architecture however, as will be discussed in Section III-A, lives at a higher level of abstraction than the specific RA-DSSM model.
Our solution also differs from RA-DSSM in a number of other dimensions. RA-DSSM for example uses doc2vec embeddings [75] to represent content, while we propose a specific neural architecture to learn textual representations based on pre-trained word embeddings for improved accuracy.
Furthermore, the RA-DSSM does not use any contextual information about users or articles, which may limit its accuracy in cold-start scenarios that are common in news recommendation. Article recency and popularity were not considered in their model either. Additionally, we use a temporal evaluation protocol to emulate a more realistic scenario, described in Section IV-C, while their experiments do not mimic the dynamics of a news portal.

3) Deep Reinforcement Learning for News Recommendation
Reinforcement learning is an alternative technical approach for recommending online news, and multi-armed (contextual) bandit models were often applied for the task [76]. In [4], the authors propose a novel deep reinforcement learning technique for news recommendation. Differently from our problem setting, the authors focus on session-aware recommendations, where longer-term information about individual users is available. Similarly to our work, however, the approach proposed in [4] relies on a number of features that we also used in our models, e.g., article metadata, recent click counts, and context features. In their problem setting with longer-term models, the authors in addition included a number of user-related pieces of information, which are typically not available in session-based recommendation tasks, e.g., preferences regarding different content categories over longer periods of time.

C. BALANCING ACCURACY AND NOVELTY IN RECOMMENDER SYSTEMS
It has been known for many years that prediction accuracy is not the only factor that determines the success of a recommender. Other quality factors discussed in the literature are, e.g., novelty, catalog coverage, diversity [77], or reliability [78]. In the context of news recommendation, the aspect of novelty is particularly relevant to avoid a "rich-get-richer" phenomenon where a small set of already popular articles get further promoted through recommendations and less popular or more recent items rarely make it into a recommendation list.
The novelty of a recommended item can be defined in different ways, e.g., as the non-obviousness of the item suggestions [79], or in terms of how different an item is with respect to what has already been experienced by a user or the community [80]. Recommending solely novel or unpopular items can, however, be of limited value when they do not match the users' interests well. Therefore, the goal of a recommender is often to balance these competing factors, i.e., make somewhat more novel and thus risky recommendations, while at the same time ensuring high accuracy.
In the literature, a number of ways have been proposed to quantify the degree of novelty, including alternative ways of considering popularity information [81] or the distance of a candidate item to the user's profile [3], [82], [83]. In [80], the authors propose to measure novelty as the opposite of popularity of an item, under the assumption that less popular (long-tail) items are more likely to be unknown to users and their recommendation will, hopefully, lead to higher novelty levels. In our work, we will also consider the novelty of the recommendations and adopt existing novelty metrics from the literature.
Regarding the treatment of trade-off situations, different technical approaches are possible. One can, for example, try to re-rank an accuracy-optimized recommendation list, either to meet globally defined quality levels [84] or to achieve recommendation lists that match the preferences of individual users [85]. Another approach is to vary the weights of the different factors to find a configuration that leads to both high accuracy and good novelty [86].
Finally, one can try to embed the consideration of trade-offs within the learning phase, e.g., by using a corresponding regularization term. In [87], the authors propose a method called Novelty-aware Matrix Factorization (NMF), which tries to simultaneously recommend accurate and novel items. Their proposed regularization approach is pointwise, meaning that the novelty of each candidate item is considered individually.
In our recommendation approach, we consider trade-offs in the regularization term as well. Differently from [87], however, our approach is not focused on matrix factorization, but rather on neural models that are derived from the DSSM. Furthermore, the objective function in our work uses a listwise ranking approach to learn how to enhance the novelty level of the top-n recommendations.

III. TECHNICAL APPROACH
The work presented in this paper is based on an instantiation of the CHAMELEON meta-architecture, which we presented in an initial version in [23]. The meta-architecture is designed for building session-based news recommendation systems, which are context-aware and can leverage additional content information.
We will discuss this meta-architecture next in Section III-A. Afterwards, in Section III-B, we provide information about the specific instantiation used for our experiments. Finally, in Section III-C, we propose a novel technical approach to balance accuracy and novelty based on a parameterizable loss function.

A. THE CHAMELEON META-ARCHITECTURE
The CHAMELEON meta-architecture was designed to deal with some of the specific requirements of news recommendation, as outlined in Section II-A. Generally, when building a news recommender system, one has several design choices regarding the types of data that are used, the chosen algorithms, and the specific network architecture when relying on deep learning approaches. With CHAMELEON, we provide an architectural abstraction (a "meta-architecture"), which contains a number of general building blocks for news recommenders and which can be instantiated in various ways, depending on the particularities of the given problem setting.
Fig. 2 shows the main building blocks of the meta-architecture and also sketches how it was instantiated for the purpose of this research. At its core, CHAMELEON consists of two complementary modules, with independent life cycles for training and inference:
• The Article Content Representation (ACR) module, used to learn a distributed representation (an embedding) of the articles' content; and
• The Next-Article Recommendation (NAR) module, responsible for generating next-article recommendations for ongoing user sessions.
In a CHAMELEON-based architecture, the ACR module learns an Article Content Embedding for each article independently from the recorded user sessions. This is done for scalability reasons, because training user interactions and articles in a joint process would be computationally very expensive, given the typically large amount of recorded user interactions. Instead, the internal model is trained for a side classification task: predicting target metadata attributes (e.g., news category, topic, tags) of an article.
The NAR module, which provides recommendations for active sessions, is designed as a hybrid recommender system, considering both the recorded user interactions and the content of the news articles. It is also context-aware in that it leverages information about the usage context (e.g., location, device, previous clicks in the session) and the article's context (popularity and recency), which quickly decays over time. All these inputs are combined by feed-forward layers to produce what we call a User-Personalized Contextual Article Embedding. As a result, we obtain individualized article embeddings, whose representations depend on the user's context and other factors such as the article's current popularity and recency.
Generally, considering these additional factors can be crucial for the effectiveness of the recommendations, in particular as previous work has shown that RNNs without side information are often not much better than relatively simple algorithms [14], [15]. Additional details about the CHAMELEON meta-architecture can be found in [23].

B. SPECIFIC INSTANTIATION
For the experiments conducted in this work, we used an instantiation of the ACR module that is similar to the one from [23]. Specifically, we extract features from textual content with a CNN. The Article Content Embeddings were trained to predict target article metadata attributes. In order to support multiple target attributes, a new loss function was designed to compute a weighted sum of classification losses for single-label attributes (softmax cross-entropy) and multi-label attributes (sigmoid cross-entropy), e.g., tags and keywords. The architecture of the ACR module and the training protocol are described in more detail in [23]. The input and output features for each dataset used in the experiments will be presented in Section IV-A.
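The weighted multi-attribute loss described above can be illustrated with a minimal sketch. It is not the implementation from [23]: the function names, the dictionary-based interface, and the per-attribute weights are assumptions made for illustration, and only the Python standard library is used.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for a single-label attribute (e.g., news category)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def sigmoid_cross_entropy(logits, targets):
    """Mean binary cross-entropy for a multi-label attribute (e.g., tags)."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z * t + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def acr_multitask_loss(single_label, multi_label, weights):
    """Weighted sum of the per-attribute classification losses.

    single_label: {attribute: (logits, target_index)}
    multi_label:  {attribute: (logits, binary_targets)}
    weights:      {attribute: float}  (hypothetical per-attribute weights)
    """
    loss = 0.0
    for name, (logits, target) in single_label.items():
        loss += weights[name] * softmax_cross_entropy(logits, target)
    for name, (logits, targets) in multi_label.items():
        loss += weights[name] * sigmoid_cross_entropy(logits, targets)
    return loss
```

In such a setup, a larger weight makes the corresponding attribute dominate the learned content embedding; how the weights were chosen in the experiments is not covered by this sketch.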
Furthermore, the NAR module was instantiated with some improvements compared to [23]. Generally, the NAR module uses RNNs to model the sequence of user interactions. We empirically tested different RNN cells, like variations of LSTM [88] and GRU [89], whose results were very similar. In the end, we selected the Update Gate RNN (UGRNN) cell [90], as it led to slightly higher accuracy. The UGRNN architecture is a compromise between LSTM/GRU and a vanilla RNN. In the UGRNN architecture, there is only one additional gate, which determines whether the hidden state should be updated or carried over [90]. Adding a second (unidirectional) RNN layer on top of the first one also led to some accuracy improvement.
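To make the role of the single update gate concrete, the following sketch performs one UGRNN step on plain Python lists. It reflects a common formulation of the cell (a tanh candidate state blended with the previous state through a sigmoid gate); the parameter names and the concatenated-input weight layout are assumptions for illustration and are not taken from the instantiation used in the experiments.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def ugrnn_step(x, h, params):
    """One UGRNN step: the gate decides, per unit, whether the previous
    hidden state is carried over or replaced by the candidate state."""
    xh = list(x) + list(h)  # concatenate input and previous hidden state
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(params["W_c"], xh), params["b_c"])]
    gate = [sigmoid(a + b)
            for a, b in zip(matvec(params["W_g"], xh), params["b_g"])]
    return [g * h_prev + (1.0 - g) * c
            for g, h_prev, c in zip(gate, h, cand)]
```

With a strongly positive gate bias the state is carried over almost unchanged, while a strongly negative bias replaces it with the candidate, which is the behavior the paragraph above attributes to the update gate.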
In a first step, the NAR module derives what we call a User-Personalized Contextual Article Embedding as described above. Specifically, in our instantiation, we consider the recent popularity of an article (e.g., by considering the clicks within the last hour) and its recency in terms of hours since its publication. As the user's context, we consider the time, location, device, and referrer type in case this information is available. The overall training phase of the NAR module then consists in learning a model that relates these User-Personalized Contextual Article Embeddings of the recommendable articles with the Predicted Next Article Embeddings, based on representations learned by the RNN from past session information.
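As a small illustration of the two dynamic article features mentioned above, the sketch below derives a recent-popularity value and a recency value from a click log. The log format (pairs of timestamp in seconds and article ID), the one-hour default window, and the normalization by the total number of clicks in the window are assumptions for illustration only.

```python
from collections import Counter


def recent_popularity(click_log, now_ts, window_secs=3600):
    """Share of the clicks within the last window that went to each article."""
    counts = Counter(
        article for ts, article in click_log
        if now_ts - window_secs <= ts <= now_ts
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {article: c / total for article, c in counts.items()}


def recency_hours(publish_ts, now_ts):
    """Article age in hours at recommendation time (never negative)."""
    return max(0.0, (now_ts - publish_ts) / 3600.0)
```

Both values change with every new click or passing hour, which is why they have to be recomputed at recommendation time rather than stored with the static article content.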
Specifically, the optimization goal is to maximize the similarity between the Predicted Next-Article Embedding and the User-Personalized Contextual Article Embedding corresponding to the next article actually read by the user in his or her session (positive sample), whilst minimizing its similarity with negative samples (articles not read by the user during the session). Using this strategy, a newly published article can be immediately recommended, as soon as its Article Content Embedding is added to the repository. Details regarding the optimization problem are described next.

C. A PARAMETERIZABLE LOSS FUNCTION TO BALANCE ACCURACY AND NOVELTY
In this section, we describe the loss function of the NAR module, designed to optimize for accuracy (Section III-C1) and a newly proposed extension to balance accuracy and novelty (Section III-C2).

1) Optimizing for Recommendation Accuracy
Formally, we can describe the method for optimizing prediction accuracy as follows. The inputs for the NAR module, described later in Table 3, are represented by "i" as the article ID, "uc" as the user context, "ax" as the article context, and "ac" as the article textual content. Based on those inputs, we define "cae = Ψ(i, ac, ax, uc)" as the User-Personalized Contextual Article Embedding, where Ψ(·) represents a sequence of fully-connected layers with non-linear activation functions to combine the inputs for the RNN. The symbol s stands for the user session (sequence of articles previously read, represented by their cae vectors), and "nae = Γ(s)" denotes the Predicted Next-Article Embedding, where Γ(·) is the output embedding predicted by the RNN as the next article.
In (1), the function R describes the relevance of an item i for a given user session s as the similarity between the nae vector predicted as the next article for the session and the cae vector of the recommendable article:
$$R(s, i) = \mathrm{sim}(nae, cae) \tag{1}$$
In the NAR module instantiation presented in [23], the sim(·) function was simply the cosine similarity. For this study, it was instantiated as the element-wise product of the embeddings, followed by a number of feed-forward layers. This setting allows the network to flexibly learn an arbitrary matching function:
$$\mathrm{sim}(nae, cae) = \phi(nae \odot cae) \tag{2}$$
where ⊙ denotes the element-wise product, φ(·) represents a sequence of fully-connected layers with non-linear activation functions, and the last layer outputs a single scalar representing the relevance of an article as the predicted next article. In our study, φ(·) consisted of a sequence of 4 feed-forward layers with a Leaky ReLU activation function [93], with 128, 64, 32, and 1 output units.
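The matching function can be sketched as follows with randomly initialized weights. The layer sizes follow the text; the embedding size, the weight initialization, the Leaky ReLU slope, and the choice to leave the final scalar output without an activation are assumptions for illustration.

```python
import random


def leaky_relu(z, alpha=0.2):
    """Leaky ReLU; the slope alpha is an assumed value."""
    return z if z > 0 else alpha * z


def init_mlp(in_dim, sizes=(128, 64, 32, 1), seed=0):
    """Random weights and zero biases for the feed-forward layers of phi."""
    rng = random.Random(seed)
    layers = []
    for out_dim in sizes:
        scale = (1.0 / in_dim) ** 0.5
        W = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
             for _ in range(out_dim)]
        layers.append((W, [0.0] * out_dim))
        in_dim = out_dim
    return layers


def relevance(nae, cae, layers):
    """sim(nae, cae) = phi(nae * cae), with an element-wise product."""
    h = [n * c for n, c in zip(nae, cae)]
    for idx, (W, b) in enumerate(layers):
        h = [sum(w * x for w, x in zip(row, h)) + bias
             for row, bias in zip(W, b)]
        if idx < len(layers) - 1:  # assumption: the last layer stays linear
            h = [leaky_relu(z) for z in h]
    return h[0]
```

Unlike the cosine similarity, the score produced this way is unbounded and depends on learned weights, so the network can weight individual embedding dimensions differently when matching a session to an article.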
The ultimate task of the NAR module is to produce a ranked list of items (top-n recommendation) that we assume the user will read next. Using D to denote the set of all items that can be recommended, we can define a ranking-based loss function for a problem setting as follows. The goal of the learning task is to maximize the similarity between the predicted next-article embedding (nae) for the session and the cae vector of the next-read article (positive sample, denoted as i⁺), while minimizing the pairwise similarity between the nae and the cae vectors of the negative samples i⁻ ∈ D⁻, i.e., those that were not read by the user in this session. Since D can be large in the news domain, we approximate it through a set D′, which is the union of the unit set of the read article (positive sample) {i⁺} and a set of random negative samples from D⁻.
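The construction of the approximated candidate set can be sketched as below. Uniform sampling without replacement, the function name, and the convention of placing the positive sample first are assumptions for illustration; the sampling strategy actually used in the experiments may differ.

```python
import random


def build_candidate_set(positive, catalog, session_items, n_negatives, rng):
    """Return the positive sample followed by random negative samples.

    Negatives are drawn from articles that were not read in the session.
    """
    pool = sorted(set(catalog) - set(session_items) - {positive})
    negatives = rng.sample(pool, min(n_negatives, len(pool)))
    return [positive] + negatives
```

For example, `build_candidate_set(3, range(10), [1, 2], 4, random.Random(42))` returns a list of five article IDs that starts with the positive sample 3 and contains none of the articles already read in the session.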
As proposed in [73], we compute the posterior probability of an article being the next one given an active user session with a softmax function over the relevance scores:
$$P(i \mid s, D') = \frac{\exp(\gamma R(s, i))}{\sum_{i' \in D'} \exp(\gamma R(s, i'))} \tag{3}$$
where γ is a smoothing factor (usually referred to as temperature) for the softmax function, which can be trained on a held-out dataset or which can be empirically set.
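A minimal sketch of this softmax over the relevance scores of the sampled candidates is shown below. The function name and the default value of γ are assumptions for illustration; subtracting the maximum before exponentiating is a standard numerical-stability measure and does not change the result.

```python
import math


def next_article_probs(scores, gamma=1.0):
    """Softmax over relevance scores, scaled by the smoothing factor gamma."""
    scaled = [gamma * s for s in scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Larger values of γ sharpen the distribution around the highest-scoring candidate, while γ = 0 yields a uniform distribution over the candidate set.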
Using these definitions, the model parameters θ in the NAR module are estimated to maximize the accuracy of the recommendations, i.e, the likelihood of correctly predicting the next article given a user session. The corresponding loss function to be minimized, as proposed in [73]: where C is the set of user clicks available for training, whose elements are triples of the form (s, i + , D ).
Since accuracy_loss(θ) is differentiable w.r.t. θ (the model parameters to be learned), we can train the NAR module with back-propagation and gradient-based numerical optimization algorithms.
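The softmax with temperature and the resulting accuracy loss can be illustrated with a minimal sketch in plain Python. The function names, the convention that the positive sample comes first in each score list, and the averaging over training clicks are assumptions made for illustration only; they are not taken from the published implementation.

```python
import math

def next_article_probs(scores, gamma=1.0):
    """Softmax over relevance scores, scaled by the temperature factor gamma."""
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the resulting probabilities.
    scaled = [gamma * s for s in scores]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def accuracy_loss(batch, gamma=1.0):
    """Mean negative log-likelihood of the positive item.

    Each element of `batch` is a list of relevance scores for one click;
    by the convention of this sketch, index 0 is the positive sample i+
    and the remaining entries are the sampled negatives.
    """
    losses = [-math.log(next_article_probs(scores, gamma)[0]) for scores in batch]
    return sum(losses) / len(losses)

# Two training clicks, each with one positive and two negative scores.
batch = [[2.0, 0.5, -1.0], [0.1, 0.3, 0.2]]
loss = accuracy_loss(batch, gamma=1.0)
```

The loss shrinks as the positive item's score grows relative to the negatives, which is the behaviour the gradient-based optimization exploits.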

2) Balancing Recommendations Accuracy and Novelty
In order to incorporate the aspect of novelty of the recommendations directly in the learning process, we propose to include a novelty regularization term in the loss function of the NAR module. This regularization term has a hyper-parameter which can be tuned to achieve a balance between novelty and accuracy, according to the desired effect for the given application. Note that this approach is not limited to particular instantiations of the CHAMELEON meta-architecture, but can be applied to any other neural architecture which takes the article's recent popularity as one of the inputs and uses a softmax loss function for training [73].
In our approach, we adopt the novelty definition proposed in [80], [94], which is based on the inverse popularity of an item. The underlying assumption of this definition is that less popular (long-tail) items are more likely to be unknown to users and their recommendation will lead to higher novelty levels [3].
The proposed novelty component therefore aims to bias the recommendations of the neural network toward more novel items. The corresponding regularization term is based on listwise ranking, optimizing the novelty of a recommendation list in a single step. The positive items (actually clicked by the user) are not penalized based on their popularity, only the negative samples. The novelty of the negative items is weighted by their probabilities to be the next item in the sequence (computed according to (3)) in order to push those items to the top of the recommendation lists that are both novel and relevant.
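Formally, we define the novelty loss component as:

$$\mathrm{nov\_loss}(\theta) = \frac{1}{|C|} \sum_{(s,\, D'^{-}) \in C} \; \sum_{i \in D'^{-}} P(i \mid s, D'^{-}) \times \mathrm{novelty}(i) \tag{5}$$

where C is the set of recorded click events for training, novelty(i) is the novelty value of item i, and D′− is a random sample of the negative samples, which, unlike the sample set used in the accuracy loss function (4), does not include the positive sample. The novelty values of the items are weighted by their predicted relevance P(i | s, D′−) in order to push both novel and relevant items towards the top of the recommendation list.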
The novelty metric in (6) is defined based on the recent normalized popularity of the items. The negative logarithm in (6) increases the value of the novelty metric for long-tail items. The normalized popularity sums up to 1.0 over all recommendable items (set I), as shown in (7). Since we are interested in the recent popularity, we only consider the clicks an article has received within a time frame (e.g., in the last hour), as returned by the function recent_clicks(·):

$$\mathrm{rec\_norm\_pop}(i) = \frac{\mathrm{recent\_clicks}(i)}{\sum_{j \in I} \mathrm{recent\_clicks}(j)} \tag{7}$$

a: Complete Loss Function
The complete loss function proposed in this work combines the objectives of accuracy and novelty:

$$L(\theta) = \mathrm{accuracy\_loss}(\theta) - \beta \times \mathrm{nov\_loss}(\theta) \tag{8}$$

where nov_loss(θ) denotes the novelty loss component defined above and β is the tunable hyper-parameter for novelty. Note that the novelty loss term is subtracted from the accuracy loss, as this term is higher when more novel items are recommended. The values for β can either be set based on domain expertise or be tuned to achieve the desired effects.
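The interplay of the novelty term and the accuracy term can be sketched as follows. This is a minimal illustration and not the published implementation: it assumes that the novelty of an item is the negative base-2 logarithm of its normalized recent popularity (in line with the −log₂ p(i) term of Appendix A), that the novelty loss is averaged over the training clicks, and all function and argument names are hypothetical.

```python
import math

def rec_norm_pop(recent_clicks):
    """Normalize recent click counts so that they sum to 1.0 over all items."""
    total = sum(recent_clicks.values())
    return {item: count / total for item, count in recent_clicks.items()}

def novelty(item, norm_pop):
    """Assumed novelty definition: negative log2 of the normalized popularity."""
    return -math.log2(norm_pop[item])

def softmax(scores, gamma=1.0):
    scaled = [gamma * s for s in scores]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nov_loss(samples, norm_pop, gamma=1.0):
    """Relevance-weighted novelty of the negative samples, averaged over clicks.

    `samples` holds one dict per training click, mapping each sampled
    negative item to its relevance score (the positive item is excluded).
    """
    total = 0.0
    for neg_scores in samples:
        items = list(neg_scores)
        probs = softmax([neg_scores[i] for i in items], gamma)
        total += sum(p * novelty(i, norm_pop) for i, p in zip(items, probs))
    return total / len(samples)

def complete_loss(acc_loss, novelty_loss, beta):
    """Accuracy loss minus the beta-weighted novelty term."""
    return acc_loss - beta * novelty_loss

pop = rec_norm_pop({"a": 6, "b": 3, "c": 1})
loss = complete_loss(acc_loss=1.2, novelty_loss=nov_loss([{"b": 0.4, "c": 0.1}], pop), beta=0.1)
```

With β = 0 the novelty term vanishes and only accuracy is optimized; larger values of β reward placing probability mass on long-tail negatives.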

IV. EXPERIMENTAL EVALUATION
We conducted a series of experiments to answer the research questions described above. In the context of RQ1, our goal was to compare our method (CHAMELEON) with existing session-based recommenders in the news domain. For RQ2, we try to understand the effects of leveraging different types of information on the quality of the recommendations. Finally, RQ3 addresses the effectiveness of our approach in balancing the trade-off between accuracy and novelty.
In this section, we first discuss our experimental design, including the used datasets and the evaluation approach. The results of the evaluation will be discussed later in Section V.

A. DATASETS
We use two public news portal datasets for our evaluation. The datasets contain recorded user interactions and information about the published articles:
• Globo.com (G1) dataset - Globo.com is the most popular media company in Brazil. This dataset was originally shared by us in [23]. With this work, we publish a second version³, which also includes contextual information. The dataset was collected from the G1 news portal, which has more than 80 million unique users and publishes over 100,000 new articles per month;
• SmartMedia Adressa dataset - This dataset contains approximately 20 million page visits from a Norwegian news portal [95]. In our experiments we used the full dataset, which is available upon request⁴, and includes article text and click events of about 2 million users and 13,000 articles.
Both datasets include the textual content of the news articles, article metadata (such as publishing date, category, and author), and logged user interactions (page views) with contextual information.
Since we are focusing on session-based news recommendations and short-term user preferences, it is not necessary to train algorithms for long periods. Therefore, and because articles become outdated very quickly, we have selected for the experiments all available user sessions from the first 16 days for both datasets.
In a pre-processing step, like in [15], [39], [71], we organized the data into sessions using a 30-minute threshold of inactivity as an indicator of a new session. Sessions were then sorted by the timestamp of their first click. From each session, we removed repeated clicks on the same article, as we are not focusing on the capability of algorithms to act as reminders as in [96]. Sessions with only one interaction are not suitable for next-click prediction and were discarded. Sessions with more than 20 interactions (stemming from outlier users with an unusual behavior or from bots) were truncated.
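The pre-processing steps described above can be sketched as follows. This is an illustrative sketch only: the input format (tuples of user id, timestamp in seconds, and article id), the function name, and the choice to keep the first 20 interactions of an overly long session are assumptions and not details confirmed by the text.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60   # inactivity threshold that starts a new session
MAX_SESSION_LENGTH = 20         # longer sessions are truncated

def build_sessions(clicks):
    """Group (user_id, timestamp_seconds, article_id) clicks into sessions."""
    by_user = defaultdict(list)
    for user, ts, article in clicks:
        by_user[user].append((ts, article))

    # Split each user's click stream wherever the inactivity gap is exceeded.
    sessions = []
    for events in by_user.values():
        events.sort()
        current, last_ts = [], None
        for ts, article in events:
            if last_ts is not None and ts - last_ts > SESSION_GAP_SECONDS:
                sessions.append(current)
                current = []
            current.append((ts, article))
            last_ts = ts
        if current:
            sessions.append(current)

    cleaned = []
    for session in sessions:
        # Remove repeated clicks on the same article, keeping the first one.
        seen, deduped = set(), []
        for ts, article in session:
            if article not in seen:
                seen.add(article)
                deduped.append((ts, article))
        # Sessions with a single interaction cannot be used for next-click prediction.
        if len(deduped) < 2:
            continue
        cleaned.append(deduped[:MAX_SESSION_LENGTH])

    # Order sessions by the timestamp of their first click.
    cleaned.sort(key=lambda s: s[0][0])
    return cleaned
```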
The characteristics of the resulting pre-processed datasets are shown in Table 1. Coincidentally, the datasets are similar in many statistics, except for the number of articles. For the G1 dataset, the number of recommendable articles (clicked by at least one user) is much higher than for the Adressa dataset. The higher Gini index of the articles' popularity distribution also indicates that the clicks in the Adressa dataset are more biased to popular articles, leading to a higher inequality in clicks distribution than for the G1 dataset.

B. COMPARED RECOMMENDATION APPROACHES
This section describes the implementation of a specific instantiation of CHAMELEON and of a number of baseline techniques.

1) CHAMELEON-Implementation Specifics
This instantiation of the CHAMELEON meta-architecture, presented in Fig. 2, was implemented using TensorFlow [97], a popular Deep Learning framework. We publish the source code for our neural architecture and for the baseline methods to make our experiments reproducible 5 . The Article Content Embeddings were trained by the ACR module, whose input and target features for the classifier are described in Table 2. Within the Next Article Recommendation (NAR) module, rich features were extracted from the user interactions logs, as detailed in Table 3. The features were prepared to be used as input for both the ACR and NAR modules as follows.
Categorical features with low cardinality (i.e., with fewer than 10 distinct values) were one-hot encoded and features with high cardinality were represented as trainable embeddings. Numerical features were standardized with z-normalization. The dynamic features Novelty and Recency were normalized based on a sliding window of the recent clicks (within the last hour), so that they can accommodate both repeating changes in their distributions over time, e.g., within different periods of the day, and abrupt changes in global interest, e.g., due to breaking news.

2) Baseline Methods
In our experiments, we consider (a) different variants of our instantiation of the CHAMELEON meta-architecture to assess the value of considering additional types of information and (b) a number of session-based recommender algorithms, described in Table 4. While some of the chosen baselines appear conceptually simple, recent work has shown that some of them are able to outperform very recent neural approaches for session-based recommendation tasks [14], [15], [47]. Furthermore, the simple methods, unlike neural-based approaches, can be continuously updated over time and take newly published articles into account.
⁵ https://github.com/gabrielspmoreira/chameleon_recsys

C. EVALUATION METHODOLOGY
One main goal of our experimental analyses is to make our evaluations as realistic as possible. We therefore did not use the common evaluation approach of random train-test splits and cross-validation. Instead, we use the temporal offline evaluation method that we proposed in [23], which simulates a streaming flow of user interactions (clicks) and new articles being published, whose value quickly decays over time. Since in practical environments it is highly important to very quickly react to incoming events [99], [100], the baseline recommender methods were constantly updated over time.
CHAMELEON's NAR module supports online learning, as it is trained on mini-batches. In our training protocol, we decided to emulate a streaming scenario, in which each user session is used for training only once. Such a scalable approach is different from many model-based recommender systems, like GRU4Rec and SR-GNN, which require training for some epochs on a large set of recent user interactions to reach competitive accuracy results.

1) Evaluation Protocol
The evaluation process works as follows:
• The recommenders are continuously trained on the users' sessions ordered by time and grouped by hours. Every five hours, the recommenders are evaluated on sessions from the next hour, as exemplified in Fig. 3. With this interval of five hours (not a divisor of 24 hours), it was possible to sample different hours of the day across the dataset for evaluation. After the evaluation of the next hour is done, this hour is also considered for training, until the entire dataset is covered⁶. It is important to note that, while most of the baseline methods were continuously updated during the evaluation hour, the neural methods (CHAMELEON, SR-GNN, and GRU4Rec) were not trained as evaluation progressed⁷. This allows us to emulate a realistic scenario in production where the neural network is trained and deployed once an hour to serve recommendations for the next hour;
• For each session in the evaluation set, we incrementally "revealed" one click after the other to the recommender, as done, e.g., in [11] and [56];
• For each click to be predicted, we created a set containing 50 randomly sampled recommendable articles not viewed by the user in the session (negative samples), plus the true next article (positive sample), as done in [101] and [102]. The sampling strategy was popularity-biased (i.e., the item sampling probability is proportional to its support), so that strong (popular) negative samples are always present. We then evaluate the algorithms in the task of ranking those 51 items;
• Given these rankings, standard information retrieval metrics can be computed.
For a realistic evaluation, it is important that the chosen negative samples consist of articles which would be of some interest to readers and which were also available for recommendation in the news portal at a given point of time. For the purpose of this study, we therefore selected as recommendable articles the ones that received at least one click by any user in the preceding hour. To finally select the negative samples, we implemented a popularity-based sampling strategy similar to the one from [11].
⁶ Our datasets comprise 16 days. We used the first two days to learn an initial model for the session-based algorithms and report the averaged measures after that warm-up period.
⁷ Additionally, as the original implementations of SR-GNN and GRU4Rec do not support fine tuning of previously trained models with more data, those models were trained (for some epochs) considering only sessions from the last 5 hours before each evaluation. On the other hand, CHAMELEON's network was incrementally trained over time (except during evaluation).
VOLUME 7, 2019
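The popularity-biased selection of negative samples can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and arguments are hypothetical, the recommendable articles are represented as a dictionary of click counts from the preceding hour, and sampling is done without replacement, which the text does not specify.

```python
import random

def sample_negatives(recent_clicks, session_items, n_samples=50, rng=None):
    """Draw negative samples with probability proportional to recent click counts.

    `recent_clicks` maps each recommendable article (at least one click in the
    preceding hour) to its click count; `session_items` contains the articles
    of the current session, including the true next article.
    """
    rng = rng or random.Random()
    candidates = [a for a, c in recent_clicks.items()
                  if c > 0 and a not in session_items]
    weights = [recent_clicks[a] for a in candidates]

    negatives = []
    # Sampling without replacement: remove each drawn article from the pool.
    while candidates and len(negatives) < n_samples:
        idx = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        negatives.append(candidates.pop(idx))
        weights.pop(idx)
    return negatives

# The 51 items to be ranked are the sampled negatives plus the true next article.
recent = {"art%d" % i: i + 1 for i in range(100)}
session = {"art0", "art99"}
to_rank = sample_negatives(recent, session, rng=random.Random(7)) + ["art99"]
```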

2) Metrics
To measure quality factors such as accuracy, item coverage, novelty, and diversity, we have selected a set of top-N metrics from the literature. We chose the cut-off threshold at N=10, representing about 20% of the list containing the 51 sampled articles (1 positive sample and 50 negative samples).
The accuracy metrics used in our study were the Hit Rate (HR@n), which checks whether or not the true next item appears in the top-N ranked items, and the Mean Reciprocal Rank (MRR@n), a ranking metric that is sensitive to the position of the true next item in the list. Both metrics are common when evaluating session-based recommendation algorithms [11], [15], [47].
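Both accuracy metrics can be illustrated with a short sketch. The function names and the list-based input format are assumptions for illustration; each ranked list stands for the ordering an algorithm produced for the sampled candidate articles of one prediction.

```python
def hit_rate_at_n(ranked_lists, true_items, n=10):
    """Fraction of predictions whose true next item appears in the top-n."""
    hits = sum(1 for ranked, target in zip(ranked_lists, true_items)
               if target in ranked[:n])
    return hits / len(true_items)

def mrr_at_n(ranked_lists, true_items, n=10):
    """Mean reciprocal rank of the true next item, counting 0 outside the top-n."""
    total = 0.0
    for ranked, target in zip(ranked_lists, true_items):
        top = ranked[:n]
        if target in top:
            total += 1.0 / (top.index(target) + 1)
    return total / len(true_items)

ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "d"]]
truth = ["a", "a", "a"]
hr = hit_rate_at_n(ranked, truth, n=2)   # the target is in the top-2 for two of three lists
mrr = mrr_at_n(ranked, truth, n=2)       # reciprocal ranks 1, 1/2 and 0
```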
As an additional metric, we considered Item Coverage (COV@n), which is sometimes also called "aggregate diversity" [84]. The idea here is to measure to what extent an algorithm is able to diversify the recommendations and to make a larger fraction of the item catalog visible to the users. We compute coverage as the number of distinct articles that appeared in any top-N list divided by the number of recommendable articles [103], i.e., those that were clicked at least once in the last hour.

TABLE 4. Baseline methods.

Neural Methods
• GRU4Rec: A landmark neural architecture using RNNs for session-based recommendation [11]. For this experiment, we used the GRU4Rec v2 implementation, which includes the improvements reported in [67].¹ We furthermore improved the algorithm's negative sampling strategy for the scenario of news recommendation.²
• SR-GNN: A recently published state-of-the-art architecture for session-based recommendation based on Graph Neural Networks. In [98], the authors reported superior performance over other neural architectures such as GRU4Rec [11], NARM [13] and STAMP [12].

Association Rules-based Methods
• Co-Occurrence (CO): Recommends articles commonly viewed together with the last read article in previous user sessions. This algorithm is a simplified version of the association rules technique, having two as the maximum rule size (pairwise item co-occurrences) [15], [47].
• Sequential Rules (SR): The method also uses association rules of size two. It however considers the sequence of the items within a session. A rule is created when an item q appeared after an item p in a session, even when other items were viewed between p and q. The rules are weighted by the distance x (number of steps) between p and q in the session with a linear weighting function w_SR = 1/x [15].

Neighborhood-based Methods
• Item-kNN: Returns the most similar items to the last read article using the cosine similarity between their vectors of co-occurrence with other items within sessions. This method has been commonly used as a baseline when neural approaches for session-based recommendation were proposed, e.g., in [11].
• Vector Multiplication Session-Based kNN (V-SkNN): This method compares the entire active session with past (neighboring) sessions to determine items to be recommended. The similarity function emphasizes items that appear later within the session. The method proved to be highly competitive in the evaluations in [14], [15], [47].

Other Methods
• Recently Popular (RP): This method recommends the most viewed articles within a defined set of recently observed user interactions on the news portal (e.g., clicks during the last hour). Such a strategy proved to be very effective in the 2017 CLEF NewsREEL Challenge [99].
• Content-Based (CB): For each article read by the user, this method suggests recommendable articles with similar content to the last clicked article, based on the cosine similarity of their Article Content Embeddings.

¹ GRU4Rec v2 [67] was released on Jun 12, 2017 and is available at https://github.com/hidasib/GRU4Rec
² We exchanged the original negative sampling approaches used for training GRU4Rec by the sampling strategy described in Section IV-C1 (i.e., popularity-biased from recent clicks), and observed accuracy improvements for GRU4Rec in these experiments.
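The coverage computation described above can be sketched as follows; the function name and the input format are illustrative assumptions.

```python
def item_coverage_at_n(ranked_lists, recommendable_items, n=10):
    """Share of recommendable articles that appear in at least one top-n list."""
    recommendable = set(recommendable_items)
    recommended = set()
    for ranked in ranked_lists:
        recommended.update(ranked[:n])
    return len(recommended & recommendable) / len(recommendable)

lists = [["a", "b", "c"], ["a", "d", "e"]]
cov = item_coverage_at_n(lists, {"a", "b", "c", "d", "e"}, n=2)  # a, b and d out of 5 articles
```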
To measure novelty and diversity, we adapted the evaluation metrics that were proposed in [8], [80], [94]. We provide details of their implementation in Appendix A. The novelty metrics ESI-R@n and ESI-RR@n are based on item popularity, returning higher values for long-tail items. The ESI-R@n (Expected Self-Information with Rank-sensitivity) metric includes a rank discount, so that items in the top positions of the recommendation list have a higher effect on the metric. The ESI-RR@n (Expected Self-Information with Rank- and Relevance-sensitivity) metric not only considers a rank discount, but also combines novelty with accuracy, as the relevant (clicked) item will have a higher impact on the metric if it is among the top-n recommended items. Our diversity metrics are based on the Expected Intra-List Diversity (EILD) metric. Analogously to the novelty metrics, there are variations to account for rank-sensitivity (EILD-R@n) and for both rank- and relevance-sensitivity (EILD-RR@n).
For our experiments, all recommender algorithms were tuned towards higher accuracy (MRR@10) for each dataset using random search on a hold-out validation set. The resulting best hyper-parameters are reported in Appendix B.

V. RESULTS AND DISCUSSION
In this section, we present the main results and discuss our findings under the perspective of our research questions. For all tables presented in this section, best results for a metric are printed in bold face. If the best results are significantly different 8 from measures of all other algorithms, they are marked with *** when p < 0.001, with ** when p < 0.01, and with * symbol when p < 0.05.

A. EVALUATION OF RECOMMENDATION QUALITY (RQ1)
In this section, we first analyze the obtained accuracy results and then discuss the other quality factors. Table 5 shows the accuracy results obtained by the different algorithms in terms of the HR@10 and MRR@10 metrics. The reported values correspond to the average of the measures obtained for each evaluation hour, according to the evaluation protocol (Section IV-C).

1) Accuracy Analysis
In this comparison, our CHAMELEON instantiation outperforms the other baseline algorithms on both datasets and on both accuracy metrics by a large margin. The SR method performs second-best.
Generally, the observed difference between CHAMELEON and SR is higher for the G1 dataset. This can be explained by the facts that (a) the number of articles in the G1 dataset is more than 3 times higher than in the other dataset and (b) the G1 dataset has a lower popularity bias, see the Gini index in Table 1. As a result, algorithms that have a higher tendency to recommend popular items are less effective for datasets with a more balanced click distribution. Looking, for example, at the algorithm that simply recommends recently-popular articles (RP), we see that its performance is much higher for the Adressa dataset, even though the best obtained measures are almost similar for both datasets.
We can furthermore observe that other neural approaches (i.e., SR-GNN and GRU4Rec) were not able to provide better accuracy than non-neural baselines for session-based news recommendation. One of the reasons is that in a real-world scenario, as emulated in our evaluation protocol, those models cannot be updated as often as the baseline methods, due to challenges of asynchronous model training and frequent deployment. Furthermore, CHAMELEON's architecture was designed to be able to recommend fresh articles not seen during training. SR-GNN and GRU4Rec, in contrast, cannot make recommendations for items that were not encountered during training, which limits their accuracy in a realistic evaluation. In our datasets, for example, we found that about 3% (Adressa) to 4% (G1) of the item clicks in each evaluation hour were on fresh articles, i.e., on articles that were not seen in the preceding training hours.
From the two neural methods, the newer graph-based SR-GNN method was performing much better than GRU4Rec in our problem setting. However, as our detailed analysis in Section V-B will show, SR-GNN does not achieve the performance levels of CHAMELEON, even when CHAMELEON is not leveraging any additional side information other than the article ID (configuration IC1 in Table 8).
In Fig. 4 and 5, we plot the obtained accuracy values (MRR@10) of the different algorithms along the 16 days, with an evaluation after every 5 hours. We can note that, after some training hours, CHAMELEON clearly recommends with higher accuracy than all other algorithms.

2) Analysis of Additional Quality Factors
The results obtained for the other recommendation quality factors investigated in our research (item coverage, novelty, and diversity) are shown in Table 6. The observations can be summarized as follows:
• In terms of item coverage (COV), CHAMELEON has a much richer spectrum of articles that are included in its top-10 recommendations compared to other algorithms, suggesting a higher level of personalization. The only method with a higher coverage was the CB method, which however is not very accurate. This is expected for a method that is agnostic of an article's popularity;
• Looking at novelty, the CB method also recommends the least popular, and thus more novel articles, according to the ESI-R metric. This effect has been observed in other works such as [8], [104], which is expected as this is the only method that does not take item popularity into account in any form. CHAMELEON ranks third on this metric for the G1 dataset and is comparable to the other algorithms for Adressa⁹. Looking at novelty in isolation is, however, not sufficient, which is why we include the relevance-weighted ESI-RR metric as well. When novelty and relevance are combined in one metric, it turns out that CHAMELEON leads to the best values on both datasets;
• Considering diversity, we can observe that most algorithms are quite similar in terms of the EILD-R@10 metric. The CB method has the lowest diversity by design, as it always recommends articles with similar content. When article relevance is taken into account along with diversity with the EILD-RR@10 metric, we again see that CHAMELEON is more successful than others in balancing diversity and accuracy.

CHAMELEON leverages a number of input features to provide more accurate recommendations, as shown in Table 3. In order to understand the effects of including those features in our model, we performed a number of additional experiments with features combined in different Input Configurations (IC)¹⁰.
Table 7 shows five different configurations where we start only with the article IDs (IC1) and incrementally add more features until we have the model with all input features (IC5). Note that we have included two variations of IC3: (a) using the Article Content Embeddings (ACE) learned with the ACR module, trained to predict article metadata attributes from text (supervised learning), and (b) using article embeddings trained with doc2vec [75] (unsupervised learning). Table 8 shows the results of this study. We can generally see that both accuracy (HR@10 and MRR@10) and item example, the diversity of CHAMELEON's recommendations in terms of the EILD-R metric decreases with additional features, in particular when the Article Content features are included at IC3. This is expected, as recommendations become generally more similar when content features are used in a hybrid RS.
Looking at the two variations of configuration IC3, we can observe that for the G1 dataset the textual content representation of ACE leads to a much higher accuracy than doc2vec embeddings. This confirms the usefulness of our specific way of encoding the textual content with the ACR module, based on word embeddings pre-trained in a larger corpus (e.g. Wikipedia).
For the Adressa dataset, however, the results with ACE and doc2vec are very similar 11 . A possible explanation for the difference between the datasets can lie in the nature of the available metadata of the articles, which are used as target attributes during training. In the G1 dataset, for example, we have 461 article categories, which is much more than for the Adressa dataset, with 41 categories. Furthermore, the distribution of articles by category is more unbalanced for Adressa (Gini index = 0.883) than for G1 (Gini index = 0.820). In theory, fine-grained metadata can lead to content embeddings clustered around distinctive topics, which may be useful to recommend related content.

C. BALANCING ACCURACY AND NOVELTY WITH CHAMELEON (RQ3)
In this section, we analyze the effectiveness of our novel technical approach to balance accuracy and novelty within CHAMELEON, as described in Section III-C2. Specifically, we conducted a sensitivity analysis for the novelty regularization factor (β) in the proposed loss function. Table 9 shows the detailed outcomes of this analysis. As expected, increasing the value of β increases the novelty of the recommendations and also leads to higher item coverage. Correspondingly, the accuracy values decrease with higher levels of novelty. Fig. 6 shows a scatter plot that illustrates some effects and contrasts of the obtained results in our evaluation. The trade-off between accuracy (MRR@10) and novelty (ESI-R@10) for CHAMELEON can be clearly identified. We also plot the results for the baseline methods here for reference. This comparison reveals that tuning β helps us to end up with recommendations that are both more accurate and more novel than the ones by the baselines. Fig. 6 also illustrates the differences between the two datasets. Due to the uneven distribution of the Adressa dataset, the performance improvements over the RP baseline, which recommends recently popular items, are smaller than for the G1 dataset.

VI. SUMMARY AND FUTURE WORKS
In this final section, we first summarize the major findings of our work and then give an outlook on future research directions in this area.

A. SUMMARY
We have proposed a novel approach for session-based news recommendation, which in particular addresses domain-specific problems such as a) the short lifetime of the recommendable items and b) the lack of longer-term preference profiles of the users. The main technical contribution of our work lies in the combination of content and context features and a sequence modeling technique based on Recurrent Neural Networks. Furthermore, we propose a novel way to balance potentially conflicting optimization goals like accuracy and novelty through a parameterizable loss function. The individual technical components that were developed in our work were integrated into a configurable open-source news recommendation framework for session-based recommendations. Experimental evaluations on two public news datasets revealed that a) the proposed hybrid approach leads to higher prediction accuracy and b) our approach to balance conflicting optimization goals is effective.

B. FUTURE WORKS
With respect to future works, our plan is to further investigate differences between existing algorithms in terms of their capability of dealing with the constant item cold-start problem, which is omnipresent in news portals.
Another specific challenge that we have not addressed so far and which was not investigated to a large extent in the literature as well is that of "outliers" in the user profiles. Specifically, there might be a certain level of noise in the user profiles. In the case of news recommendation, this could be random clicks by the user or user actions that result from a click-bait rather than from genuine user interest. As proposed in previous works [105]- [107], we plan to identify such outliers and noise in the context of session-based recommendation to end up with a better estimate of the true user intent within a session.
Furthermore, we will investigate the role of emotions as a further contextual factor, see, e.g. [108], [109], both in the form of trying to consider the sentiment of a given news article and the current emotional state of the user.
Finally, our next immediate goals include the exploration of mechanisms within CHAMELEON that allow us to balance more than two quality factors, with a particular look at enhancing the diversity of the recommendations while preserving accuracy.

ACKNOWLEDGMENTS
G. Moreira thanks CI&T for supporting this research in its R&D departments (D1 / Lab23), Globo.com for sharing a dataset and their technical challenges, and also Ecossistema Negocios Digitais Ltda for their support for this article.

APPENDIX A NOVELTY AND DIVERSITY METRICS
In our studies, we use novelty and diversity metrics adapted from [80] and [94], which we tailored to fit our specific problem of session-based news recommendation. Generally, for the purpose of this investigation, novelty is evaluated in terms of Long-Tail Novelty. Items with high novelty correspond to long-tail items, i.e., items that were clicked on by few users, whilst low novelty items correspond to more popular items.

A. ESI-R@N
The Expected Self-Information with Rank-sensitivity metric, presented in (9), was adapted from the MSI metric proposed by [8] with the addition of a rank discount. The term −log₂ p(i) represents the core of this metric, which comes from the self-information (also known as surprisal) metric of Information Theory, which quantifies the amount of information conveyed by the observation of an event [8].
Applying the log(·) function emphasizes the effect of highly novel items. We define L = (i_1, ..., i_N) as a recommendation list of size N = |L|.
In this setting, the probability p(i) of an item being part of a random user interaction under free discovery is the normalized recent popularity, i.e., p(i) = rec_norm_pop(i), previously presented in (7). In (9), disc(·) is a logarithmic rank discount, defined in (10), that maximizes the impact of novelty for top ranked items, under the assumption that their characteristics will be more visible to users compared to the rest of the top-n recommendation list:
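A rank-discounted self-information metric of this kind can be sketched as follows. The sketch rests on two assumptions that are not confirmed by the text: the logarithmic discount is taken to be disc(k) = 1/log₂(k + 1) for 1-based ranks (the common DCG-style choice), and the discounted values are normalized by the sum of the discounts. The function names are hypothetical.

```python
import math

def disc(k):
    """Assumed logarithmic rank discount for a 1-based rank k."""
    return 1.0 / math.log2(k + 1)

def esi_r(rec_list, norm_pop):
    """Discount-weighted average of the self-information -log2 p(i) of a list.

    `norm_pop` maps each article to its normalized recent popularity p(i).
    """
    weights = [disc(k) for k in range(1, len(rec_list) + 1)]
    info = [-math.log2(norm_pop[item]) for item in rec_list]
    return sum(w * s for w, s in zip(weights, info)) / sum(weights)

popularity = {"head": 0.5, "mid": 0.25, "tail": 0.125}
novel_first = esi_r(["tail", "head"], popularity)
popular_first = esi_r(["head", "tail"], popularity)
```

Under these assumptions, placing the long-tail article at the top of the list yields the higher value, which reflects the intended rank sensitivity.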

B. ESI-RR@N
Analyzing quality factors like accuracy, novelty, and diversity in isolation can be misleading. Some Information Retrieval (IR) metrics, such as α-nDCG, therefore consider novelty contributions only for relevant items for a given query [8]. As proposed by [80], a relevance-sensitive novelty metric should likewise assess the novelty level based on the recommended items that are actually relevant to the user. Thus, we used a variation of a novelty metric to account for relevance: Expected Self-Information with Rank- and Relevance-sensitivity (ESI-RR@n). It weights the novelty contribution by the relevance of an item for a user p(rel|i, u) [8]. We adapt the proposal from [94]:

$$\mathrm{relevance}(i, u) = \begin{cases} 1.0, & \text{if } i \in I_u \\ b, & \text{otherwise} \end{cases} \tag{11}$$

where I_u is the set of items the user interacted with in the ongoing session, and b is a background probability of an unobserved interaction (negative sample) being also somewhat relevant for a user. The lower the value of b (e.g., b = 0), the higher the influence of relevant items (accuracy) in this metric. The author of [94] used an empirically determined value of b = 0.2, based on his experiments on balancing diversity and novelty. In our study, we arbitrarily set b = 0.02, so that all the 50 negative samples would sum up to the same relevance (1.0) as a positive (clicked) item.
Equation (12) shows how we compute the ESI-RR@n metric:

$$\text{ESI-RR}(L) = C_k \sum_{k=1}^{N} -\log_2 p(i_k) \times \mathrm{disc}(k) \times \mathrm{relevance}(i_k, u) \tag{12}$$

Equation (13) defines the term C_k, which computes the weighted average based on ranking discount.
Like in [94], the relevance is not normalized, so that more relevant items among the top-n recommendations lead to a global higher novelty.
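The relevance weighting can be illustrated with a short sketch that follows the structure of (12). Two ingredients are assumptions rather than details taken from the text: the rank discount disc(k) = 1/log₂(k + 1) and the normalization term C_k = 1/Σ disc(k). Function and variable names are hypothetical.

```python
import math

BACKGROUND_RELEVANCE = 0.02  # value of b used in this study

def relevance(item, session_items, b=BACKGROUND_RELEVANCE):
    """1.0 for items the user interacted with in the session, b otherwise."""
    return 1.0 if item in session_items else b

def disc(k):
    """Assumed logarithmic rank discount for a 1-based rank k."""
    return 1.0 / math.log2(k + 1)

def esi_rr(rec_list, norm_pop, session_items, b=BACKGROUND_RELEVANCE):
    """Rank- and relevance-weighted self-information of a recommendation list."""
    ranks = range(1, len(rec_list) + 1)
    c_k = 1.0 / sum(disc(k) for k in ranks)  # assumed normalization term
    total = sum(-math.log2(norm_pop[item]) * disc(k) * relevance(item, session_items, b)
                for k, item in zip(ranks, rec_list))
    return c_k * total

# With b = 0.02, the 50 negative samples together weigh as much as one clicked item.
negatives_weight = 50 * BACKGROUND_RELEVANCE

pop = {"x": 0.25, "y": 0.25}
clicked_on_top = esi_rr(["x", "y"], pop, session_items={"x"})
clicked_at_bottom = esi_rr(["y", "x"], pop, session_items={"x"})
```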

C. EILD-R@N
Diversity was measured based on the Expected Intra-List Diversity metric proposed by [80], with variations to account for rank-sensitivity (EILD-R@n) and for both rank-and relevance-sensitivity (EILD-RR@n).
Intra-List Diversity measures the dissimilarity of the recommended items with respect to the other items in the recommended list. In our case, the distance metric d(·) defined in (14) is the cosine distance.

$$d(a, b) = \frac{1 - \mathrm{sim}(a, b)}{2} \tag{14}$$

Here, a and b are the Article Content Embeddings of two articles and sim(a, b) is their cosine similarity. As the cosine similarity ranges from -1 to +1, the cosine distance is scaled to the range [0,1]. The Expected Intra-List Diversity with Rank-sensitivity (EILD-R@n) metric, defined in (15), is the average intra-list distance between item pairs weighted by a logarithmic rank discount disc(·), defined in (10). Given a recommendation list L = (i_1, ..., i_N) of size N = |L|, we compute the EILD-R@n metric as follows.
The term rdisc(l, k), defined in (16), represents a relative ranking discount, considering that an item l that is ranked before the target item k has already been discovered. In this case, items ranked after k are assumed to lead to a decreased diversity perception as the relative rank between k and l increases.

$$\mathrm{rdisc}(l, k) = \mathrm{disc}(\max(0, l - k)) \tag{16}$$
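The scaled cosine distance and the relative rank discount can be sketched in a few lines of Python. This is only an illustration under assumptions: embeddings are plain lists of floats with non-zero norm, and, because disc(0) must be finite for the relative discount to be defined, the sketch assumes a discount of the form 1/log2(offset + 2). The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance rescaled from [-1, +1] similarity to the range [0, 1]."""
    return (1.0 - cosine_similarity(a, b)) / 2.0


def log_rank_discount(offset):
    """Assumed logarithmic discount, defined for offsets >= 0 (equals 1 at offset 0)."""
    return 1.0 / math.log2(offset + 2)


def rdisc(l, k):
    """Relative discount of item l with respect to target item k.

    Items ranked before k (l < k) are not discounted; items ranked after k
    are discounted more strongly the further they are from k.
    """
    return log_rank_discount(max(0, l - k))


print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal embeddings
print(rdisc(1, 3), rdisc(5, 3))                 # item before vs. after the target
```

Identical embeddings yield a distance of 0, opposite embeddings a distance of 1, and orthogonal embeddings a distance of 0.5.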

D. EILD-RR@N
The Expected Intra-List Diversity with Rank- and Relevance-sensitivity finally measures the average diversity between item pairs, weighting items by rank discount and relevance, analogously to the ESI-RR@n metric:

$$\mathrm{EILD\text{-}RR}(L) = C_k \sum_{k=1}^{N} \mathrm{disc}(k) \times \mathrm{relevance}(i_k, u) \; C_l \sum_{\substack{l=1 \\ l \neq k}}^{N} d(i_k, i_l) \, \mathrm{rdisc}(k, l) \times \mathrm{relevance}(i_l, u) \tag{17}$$

Here, C_k (13) and C_l (18) are normalization terms representing a weighted average based on rank discounts.
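The double sum of the EILD-RR metric can be illustrated with the following Python sketch. It is not the implementation used in the experiments and rests on several assumptions: the discount has the form 1/log2(offset + 2) over 0-based positions, the relative discount follows the definition in (16), and both normalization terms are taken to be the inverse of the corresponding summed discounts (C_k over all positions, C_l over all positions other than k). The distance is passed in as a function, and all names are hypothetical.

```python
import math


def log_rank_discount(offset):
    """Assumed logarithmic discount, defined for offsets >= 0."""
    return 1.0 / math.log2(offset + 2)


def rdisc(l, k):
    """Relative discount of position l with respect to target position k, as in (16)."""
    return log_rank_discount(max(0, l - k))


def eild_rr(ranked_items, session_items, distance, b=0.02):
    """Sketch of EILD-RR for one recommendation list.

    ranked_items  : recommended items, best first
    session_items : set of items the user interacted with in the session
    distance      : function (item_a, item_b) -> distance in [0, 1]
    b             : background relevance of non-clicked items
    """
    def rel(item):
        return 1.0 if item in session_items else b

    n = len(ranked_items)
    if n < 2:
        return 0.0  # no item pairs, hence no intra-list diversity

    c_k = 1.0 / sum(log_rank_discount(k) for k in range(n))  # assumed C_k
    total = 0.0
    for k, item_k in enumerate(ranked_items):
        others = [l for l in range(n) if l != k]
        c_l = 1.0 / sum(rdisc(l, k) for l in others)  # assumed C_l
        inner = sum(
            distance(item_k, ranked_items[l]) * rdisc(l, k) * rel(ranked_items[l])
            for l in others
        )
        total += log_rank_discount(k) * rel(item_k) * c_l * inner
    return c_k * total


# Toy example: three items with a constant pairwise distance of 0.5.
print(eild_rr(["a", "b", "c"], {"a", "b"}, lambda x, y: 0.5))
```

Under these assumptions the metric reduces to the plain pairwise distance when all items are relevant and all distances are equal, which is a convenient sanity check for an implementation.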

APPENDIX B FINAL ALGORITHMS HYPER-PARAMETERS
In Table 10, we present the best hyper-parameters found for each algorithm and dataset. They were tuned for accuracy (MRR@10) on a hold-out validation set, by running random search within defined ranges for each hyper-parameter. The methods CO, RP, and CB do not have hyper-parameters. More information about the hyper-parameters can be found in the shared code and in the papers where the baseline methods were proposed.

Table 10 (excerpt): hyper-parameter, description, and best value per dataset.
Coefficient of the L2 regularization: 1e-5, 2e-5
momentum (if not zero, Nesterov momentum will be applied during training with the given strength): 0, 0
embedding (size of the embedding used; 0 means not to use embedding): 0, 0
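The tuning procedure described in this appendix, random search with MRR@10 as the target, can be illustrated with the following Python sketch. It is a simplified illustration and not the shared code: the search space is assumed to be a dictionary of candidate values per hyper-parameter, and `evaluate` is a hypothetical callback that trains a model with the given configuration and returns its MRR@10 on the validation set.

```python
import random


def mrr_at_n(ranked_lists, clicked_items, n=10):
    """Mean Reciprocal Rank at n over a set of next-click predictions.

    ranked_lists  : list of recommendation lists, each ordered best first
    clicked_items : the item actually clicked next, one per list
    """
    total = 0.0
    for ranked, clicked in zip(ranked_lists, clicked_items):
        top_n = ranked[:n]
        if clicked in top_n:
            total += 1.0 / (top_n.index(clicked) + 1)
    return total / len(ranked_lists)


def random_search(evaluate, search_space, n_trials=20, seed=42):
    """Sample configurations at random and keep the best-scoring one.

    evaluate     : function(params) -> validation score (higher is better)
    search_space : dict mapping hyper-parameter name -> list of candidate values
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy example: the clicked item is ranked second in one list and missing in the other.
print(mrr_at_n([["a", "b"], ["c", "d"]], ["b", "x"]))
```

In an actual experiment, `evaluate` would wrap model training and validation, and continuous ranges could be sampled with `rng.uniform` instead of choosing from fixed candidate lists.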