Recommendation Systems: An Insight Into Current Development and Future Research Challenges

Research on recommendation systems is swiftly producing an abundance of novel methods, constantly challenging the current state-of-the-art. Inspired by advancements in many related fields, like Natural Language Processing and Computer Vision, many hybrid approaches based on deep learning are being proposed, making solid improvements over traditional methods. On the downside, this flurry of research activity, often focused on improving over a small number of baselines, makes it hard to identify reference methods and standardized evaluation protocols. Furthermore, the traditional categorization of recommendation systems into content-based, collaborative filtering and hybrid systems lacks the informativeness it once had. With this work, we provide a gentle introduction to recommendation systems, describing the task they are designed to solve and the challenges faced in research. Building on previous work, an extension to the standard taxonomy is presented, to better reflect the latest research trends, including the diverse use of content and temporal information. To ease the approach toward the technical methodologies recently proposed in this field, we review several representative methods selected primarily from top conferences and systematically describe their goals and novelty. We formalize the main evaluation metrics adopted by researchers and identify the most commonly used benchmarks. Lastly, we discuss issues in current research practices by analyzing experimental results reported on three popular datasets.


I. INTRODUCTION
The volume of digital information has been increasing at an exponential rate within the last few decades. This has led to what is commonly defined as the information overload problem, which describes those situations in which users find themselves dealing with excessive amounts of information, and are actually hindered in their ability to navigate it and make decisions in its regard. Whenever content providers offer goods or services in numbers that are intractably large for individual customers, an automated method able to guide them towards a custom selection of content becomes a necessity. Recommendation Systems (RSs) [1] are such methods, functioning as an indispensable tool to users, as well as increasing sales and views for providers. RSs have an incredibly wide range of applications, such as e-commerce, social media, video hosting platforms, online news platforms, music libraries and much more. With this review, we aim to provide a strong foundational overview of this research area, describe its latest advancements and precisely frame the most important issues and challenges that should be addressed.
The associate editor coordinating the review of this manuscript and approving it for publication was Amir Masoud Rahmani.

A. RECOMMENDATION TASK
We begin by providing a brief overview of the generic task tackled by recommenders. A RS can be generally described as a framework that suggests items to users utilizing any type of data that regards either or both of them, as well as historical interactions between them. These are the three main actors of RSs (users, items and interactions) and are used as generic terms regardless of what they concretely represent in different scenarios. Interactions are considered to be a user's feedback, and are either explicit (user reviews of an item, e.g., a score in the range 1 to 5) or implicit (user acts on an item without indication of preference) [2]. Some approaches split interactions into additional subcategories based on the more concrete action type that describes them (e.g., click, buy, view, etc.). RSs based on implicit feedback face additional difficulties, as all interactions are weak signals: items selected in the past give a weak indication of what a user may want to see in the future, and there are no explicit negative interactions [3]. In either case, all of the remaining items (not interacted with) are weak negatives, in the sense that it is unknown how the user would react to them, a fact that poses its own challenges (most notably, how to handle the large number of negatives).
As far as learning objectives are concerned, in the case of explicit feedback the task is frequently framed as a prediction of how a user will rate an item. Instead, in the case of implicit feedback the task can be defined as ''maximization of the rate of consumption''. Because the signals are weak, the problem is not what the user will like or not, but what the user is likely to interact with. The meaning of ''consumption'' is domain-dependent. The watch time for a video and the dwell time on a website page can be considered consumption signals for a video sharing platform and a news agency, respectively. On the other hand, advertisement platforms are more likely to be interested in the maximization of Click-Through Rate (CTR), that is, the fraction of clicks on an item over the number of times it has been seen.
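As a small illustration of the CTR definition above, the following Python sketch computes the ratio of clicks to impressions. The function name, the example counts and the convention of returning 0 for an item that was never shown are assumptions made for illustration only.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Fraction of impressions of an item that resulted in a click."""
    if impressions == 0:
        return 0.0  # assumption: an item that was never shown has a CTR of 0
    return clicks / impressions


# Hypothetical counts: the item was shown 1000 times and clicked 30 times.
ctr = click_through_rate(clicks=30, impressions=1000)
print(f"CTR = {ctr:.3f}")
```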

B. PROBLEM DEFINITION
Item recommendation, also known as top-n recommendation, is the task of selecting the best items from a large catalog for a user in a given context. In this section, we give a short formal introduction to common notational conventions.
Formally, we define a user u and an item i as belonging to corresponding sets, i.e., u ∈ U and i ∈ I. Again, these are generic terms that abstract from what they are concretely, which is instead described by their representation. User and item representations are very flexible and depend on the data utilized by the system itself; Section III will explore these different representations. The simplest representation for these actors is based on user and item identifiers (ids), supplied with no further information, meaning the system works solely on user-item interactions. Some authors prefer to incorporate users in a ''context'' [3], which encapsulates both the user and additional contextual information such as time, location, and previous interactions of that user. We keep these concepts separate, though the resulting methods are the same. Interactions between users and items are most commonly organized in a matrix R, where r ∈ R can be an explicit rating (e.g., 1 to 5) or an implicit signal (1 if the interaction has occurred, 0 otherwise). Therefore, r_ui represents the interaction between user u and item i. In general, most recommender systems can be seen as having to design a scoring or utility function:
$$\hat{y} : U \times I \rightarrow \mathbb{R}$$
This utility indicates the degree of preference of the user towards the item. The choice and design of such a function are core aspects of the modeling process of a RS [3], [4].
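The notation above can be made concrete with a minimal Python sketch: implicit interactions are stored per user, and a top-n list is obtained by sorting unseen items by a scoring function. All names (`build_interactions`, `top_n`, `popularity_score`) and the toy log are hypothetical, and item popularity is used only as a stand-in for a learned utility function.

```python
from collections import Counter
from typing import Callable, Dict, List, Set, Tuple


def build_interactions(log: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Implicit feedback: r_ui = 1 if (u, i) appears in the log, 0 otherwise."""
    seen: Dict[str, Set[str]] = {}
    for user, item in log:
        seen.setdefault(user, set()).add(item)
    return seen


def top_n(user: str, items: List[str], seen: Dict[str, Set[str]],
          score: Callable[[str, str], float], n: int) -> List[str]:
    """Rank the items the user has not interacted with by decreasing score."""
    candidates = [i for i in items if i not in seen.get(user, set())]
    # Ties are broken alphabetically so that the output is deterministic.
    return sorted(candidates, key=lambda i: (-score(user, i), i))[:n]


# Toy interaction log (hypothetical data).
log = [("u1", "a"), ("u1", "b"), ("u2", "b"),
       ("u2", "c"), ("u3", "b"), ("u3", "c")]
items = ["a", "b", "c", "d"]
seen = build_interactions(log)
popularity = Counter(item for _, item in log)


def popularity_score(user: str, item: str) -> float:
    """Placeholder utility: ignores the user and returns item popularity."""
    return float(popularity[item])


recommended = top_n("u1", items, seen, popularity_score, n=2)
print(recommended)
```

Any model discussed in the following sections can be thought of as replacing `popularity_score` with a personalized function of both arguments.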

C. RELATED WORK
There are a number of recently published surveys and articles on the area of RSs, though the vast majority addresses a particular sub-field without attempting to capture it as a whole. Here we briefly mention the most relevant to our work, highlighting their merits and how they differ from this survey. The authors of [5] organize a survey from the perspective of modeling recommenders with the accuracy goal, and limited to neural approaches. Collaborative filtering approaches are reviewed in [6], which also showcases hybrid approaches that integrate information derived from social networks. In [7], neural recommenders are tackled, focusing on deep learning-based approaches and building a comprehensive summary of current research. The work by [8] provides an excellent categorization of recommendation tasks and goals for sequence-aware recommenders, which have to deal with sequentially-ordered interactions. In [9], a unified framework on session-based RSs is provided (often considered a subset of sequence-aware recommenders), describing in depth the unique characteristics and challenges posed by session data. The excellent article by [3] details item recommendation in implicit settings, with a large focus on challenges faced during training and various techniques (mainly sampling) utilized to solve them. In [10], a framework of recommendation from the point of view of explainable recommendations is described. The interesting formulation of RSs as systems trying to solve a Multi-Armed Bandit problem is surveyed in [11]. Both [12] and [13] cover the usage of knowledge graphs in RSs. Finally, [14] characterize and formalize graph learning-based RSs, their challenges, and main progress in the sub-field.
We found many recent surveys addressing the usage of RSs in specific domains. For instance, in [20] the authors discuss algorithms that make use of user-assigned tags to predict item relevance, often in social network platforms. Much research has been published on the recommendation of scientific texts, like in [16], [18], [21]. Applications of RSs in the tourism and travel industry, like accommodation and food recommendations, are explored in [19], while [17] showcases the importance of location-based services and social networks in this domain. Finally, RSs can be beneficial in education for recommendation of teaching resources, for instance on e-learning platforms [15], [22]. Table 1 provides an overview of the surveys analyzed.

D. CONTRIBUTIONS OF THIS SURVEY
In contrast, our survey is organized from a more generic point of view, attempting to collate much of this information into a single, foundational overview. We attempt to highlight how the field has evolved over the years, such as to give a realistic and up-to-date view of the recommendation landscape. By analyzing challenges and points of contention on recent progress, we aim to incorporate theoretical knowledge with an authentic representation of the current state of this research field. This will help researchers discover new ideas to design better solutions in the future, while also being conscious of possible disputes about recent progress in the recommendation area [23]-[25]. Relatedly, we discuss how several different evaluation protocols are currently adopted to test the performance of RSs, and how possible issues in such protocols affect the assessment of the state-of-the-art. In summary:
• We provide an overview of the recommendation task, its various facets, and possible design choices to be made when developing a recommender;
• We propose an updated taxonomy of RSs, based both on traditional categorizations and new emerging trends, clearly characterizing different approaches through popular representatives;
• We briefly describe a wide array of recently proposed methods, such as to provide an easily accessible overview of recent research in this area;
• We study the evaluation process of a RS and its critical issues, highlighting examples in recent literature.
How Papers Are Collected: As our survey aims to capture the latest advancements and proposed ideas in the field, we retrieved papers from the most relevant top conferences, such as NEURIPS, ICML, ICLR, RECSYS, SIGIR, KDD, WWW, WSDM, AAAI and IJCAI, the same that were surveyed in [5].
Due to the very large number of retrieved results, we only reviewed a selection of contributions from each conference, matching the keywords ''recommendation'', ''recommender'' and ''recommendation system'', and preferring the ones surrounded by a larger amount of academic discourse. Among our goals, we wished to perform an analysis of testing protocols in recent works, a procedure which in many cases requires access to the code implementation of the experiments. We found that conference papers tend to publish the code of the experiments more frequently than journal publications. As such, we decided to select mainly conference papers, as was done in [23], [26]. As this field of work is particularly dynamic, we limited our search to papers published after the year 2019, though we also consulted particularly influential and distinguished publications from years prior.
However, in order to provide a more thorough analysis, we also complemented our search with queries to Google Scholar and DBLP. We first searched for the most influential works with an unfiltered search sorted by relevance, and then applied a more fine-grained search of recent works in the period 2018-2022. While we still included works published in conferences with this procedure, we tried to put a particular emphasis on searching for works published in related journals rather than conference papers. These include journals such as Knowledge-Based Systems, Expert Systems with Applications and IEEE Access. The total number of works retrieved by the end of our research was roughly 200, of which about 150 were conference papers. Another large portion of our references is from cross-referencing particularly important works mentioned within the corpus of our analysis. Over 120 recent or influential works are briefly presented in the methods overview.

E. STRUCTURE OF THE SURVEY
This survey is organized with the following structure:
• Section II provides an introduction to the main design choices to be made towards the optimization of RSs;
• Building on such information, Section III provides an overview of recommendation models and explains a data-dependent taxonomy, tying it with standard taxonomies and clarifying these approaches by giving influential examples;
• Section IV goes in-depth into an exploration of the recently proposed methods and approaches, which are largely based on neural networks;
• Section V describes the main evaluation protocols adopted in research, and reports the most popular metrics and datasets, with considerations on various testing strategies found in the literature on three of them;
• The survey draws to an end in Section VI, analyzing possible new and long-standing challenges as well as future research directions of this field;
• Lastly, our conclusions are reported in Section VII.

II. DESIGN CHOICES
Before diving into a taxonomy of RSs, it is useful to introduce a few propaedeutic concepts related to the field, all of which relate to the general design of a recommendation framework.
In this section, we first provide some clarity on why a categorization of RSs is not simple, and what considerations should be taken when constructing such a system. In a related fashion, we then proceed to introduce some of the main challenges faced by RSs, which are fundamental to understanding the design choices that differentiate the methods within the taxonomy. We introduce some of the most popular choices of learning objectives used to frame the recommendation problem as a supervised Machine Learning (ML) problem. Lastly, we briefly touch on the ''retrieval and ranking'' approach for designing recommender frameworks, together with a short mention of sampling approaches.

A. CONSIDERATIONS TO BE MADE
RSs have been long studied with great interest, and are generally considered an important subclass of ML and information filtering. However, we find that, unlike many traditional fields of study, they lack a robust definition and classification. This is not without reason; while an intuitive notion of what a RS should do is easily identifiable, the process of developing a careful characterization is soon met with an abundance of questions. Here we attempt to identify some of the main reasons why a consistent categorization of RSs can be difficult to achieve.
Firstly, it is important to consider (1) what type of data is available to the system. It is often not trivial to decide what information should be used, and how to treat missing or not readily available data points. Secondly, one should also consider (2) how user interactions are treated. For example, an e-commerce website might want to consider the action of ''adding to cart'' differently from the ''buy'' action. In a similar vein, one might ask (3) what interaction is being sought. In video recommendation scenarios, one might want to decide between maximizing watch time and CTR; this latter objective may favor ''click-bait'' videos, resulting in many videos that were opened, but abandoned shortly after. Last but not least, considering (4) how the task is framed is also of utmost importance. Learning strategies can differ depending on whether the algorithm objective is to approximate a user-dependent function that describes the level of affinity with items (classification or regression problem), or to populate a list of items of probable interest (retrieval problem). Furthermore, the specific application might have additional requirements, such as having at least one relevant item (or, conversely, as many as possible in a less precise manner).
Clearly, these considerations only cover part of the large number of facets of this design process. The ones presented above were chosen as we found them to capture some of the most relevant and thoroughly studied issues within the field. Throughout this survey, we will introduce and explain the various concepts necessary to answer these questions.

B. MAIN CHALLENGES
Throughout the years, RSs have had to deal with a staple set of challenges that are important to consider whenever discussing both new and old approaches. This section provides a brief introduction to the most common: data sparsity, the cold start problem and scalability. While we will address other important challenges in Section VI, we briefly anticipate these core issues, as we deem it necessary to wholly understand the methods that will be illustrated.

1) DATA SPARSITY
One of the most severe complications associated with RSs is the sparsity problem [4], a natural consequence of the fact that it is very unlikely for users to have interacted with more than a small fraction of the available items. In turn, the representations of such systems (which are, one way or another, based on interactions) will contain a large number of missing entries, i.e., will be very sparse. This causes severe complications, most notably the difficulty of creating accurate representations for users and items, as most of the interactions will not have occurred [6]. Unobserved interactions are inherently weak negatives, as we have no information on whether the user has actively avoided them or has simply not come across them yet.
Moreover, not only are interactions sparse, but they are also commonly concentrated around popular items, meaning such sparsity is also highly localized [2], [27]. This property often satisfied by real-world recommendation datasets is referred to as the long-tail. Datasets with such property will have the vast majority of their interactions related to a restricted fraction of highly popular items. This creates a long-tail distribution when plotting the number of interactions against the items sorted by interaction frequency (Fig. 1), where the vast majority of items reside in such long-tail, yet have the least number of interactions overall.
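Both properties can be quantified on a given interaction log. The sketch below is an illustrative assumption rather than a measure taken from the surveyed works: it computes the density of the interaction matrix and the share of interactions captured by the most popular fraction of items. The function names and the toy log are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def density(log: List[Tuple[str, str]], n_users: int, n_items: int) -> float:
    """Fraction of the |U| x |I| interaction matrix that is observed."""
    return len(set(log)) / (n_users * n_items)


def head_share(log: List[Tuple[str, str]], head_fraction: float) -> float:
    """Share of all interactions that involve the most popular items."""
    counts = sorted(Counter(item for _, item in log).values(), reverse=True)
    head_size = max(1, int(len(counts) * head_fraction))
    return sum(counts[:head_size]) / sum(counts)


# Toy log (hypothetical): 40 users, 10 items, two of which are very popular.
log = ([(f"u{u}", "i0") for u in range(40)]
       + [(f"u{u}", "i1") for u in range(20)]
       + [(f"u{u}", f"i{k}") for k in range(2, 10) for u in range(5)])

print(f"density: {density(log, n_users=40, n_items=10):.2f}")
print(f"share of the top 20% of items: {head_share(log, 0.2):.2f}")
```

In this toy log, two items out of ten account for 60% of the interactions, while the remaining eight form the tail.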

2) COLD START
The cold start problem [4], [28] describes situations in which a recommender has to deal with either users or items that have few or no interaction histories, which is usually the case when they have just entered the system. Approaches based solely on interaction histories are inherently sensitive to this issue, since they have no other foundation to characterize users or items. While new users can be trivially suggested popular items, new items might end up never being recommended because of how they have never been part of any interaction. Utilizing side information (e.g., based on the item and user data) is usually an effective way to mitigate this problem [6].
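A minimal sketch of the popularity fallback for new users mentioned above is given below. The function names, the toy data and the `unseen_items` stand-in for a personalized model are all hypothetical. The example also shows why an item without interactions (here `"z"`) can never surface through popularity alone.

```python
from collections import Counter
from typing import Callable, Dict, List, Set, Tuple


def most_popular(log: List[Tuple[str, str]], n: int) -> List[str]:
    """The n items with the largest number of interactions."""
    counts = Counter(item for _, item in log)
    return [item for item, _ in counts.most_common(n)]


def recommend(user: str, histories: Dict[str, Set[str]],
              personalized: Callable[[str, int], List[str]],
              popular: List[str], n: int) -> List[str]:
    """Fall back to popular items when the user has no interaction history."""
    if not histories.get(user):
        return popular[:n]
    return personalized(user, n)


# Toy data (hypothetical); item "z" has just entered the catalog.
log = [("u1", "a"), ("u2", "a"), ("u3", "a"),
       ("u1", "b"), ("u2", "b"), ("u1", "c")]
catalog = ["a", "b", "c", "z"]
histories: Dict[str, Set[str]] = {}
for u, i in log:
    histories.setdefault(u, set()).add(i)


def unseen_items(user: str, n: int) -> List[str]:
    """Stand-in for a personalized model: items the user has not seen yet."""
    return [i for i in catalog if i not in histories[user]][:n]


popular = most_popular(log, 2)
cold = recommend("new_user", histories, unseen_items, popular, n=2)
warm = recommend("u3", histories, unseen_items, popular, n=2)
print(cold, warm)
```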

3) SCALABILITY
Practical properties such as scalability [29], [30] are fundamental in RSs, as recommendations should be generated quickly, usually as soon as the user enters the system or each time they interact with an item. A scalable system should be able to handle often massively large amounts of information, which will likely only grow in time. It is notable that this issue leads to many real-world applications relying on methods that are not very recent (though they have obviously been refined) [23], [24], [31], [32], yet perform undeniably well when we consider their scalability. New approaches should be mindful of this constraint; trading accuracy for performance through approximations is often a necessity.

C. LEARNING OBJECTIVES
Recommendations are rarely provided as single items and instead are usually presented as a ranked list, with items deemed more relevant placed on top and vice versa. An important point of divergence between different recommendation approaches, then, is the choice of optimization task [3]. Functions that optimize a single affinity score between a user and an item are defined as pointwise, while other methods, namely pairwise and listwise approaches, fall within the ''learning-to-rank'' category. In general, this popular class of algorithms in information retrieval (IR) contains methods that sort items according to their predicted degree of relevance, putting less focus on a predicted score and more emphasis on a well-ordered result. The last part of this section discusses the multiclass approach highlighted by some authors [3], [33].

1) POINTWISE OPTIMIZATION
A large number of traditional RSs rely on pointwise optimization functions. Methods based on a pointwise criterion can be generally described as seeking to predict affinity scores between individual pairs of users and items. For instance, a method within this category might aim to predict the expected rating a user might give to a previously unseen item; the optimization process would therefore aim to minimize the error of each predicted rating against the real rating value. While the underlying task might be classification or regression, the result can be easily adapted to a top-n recommendation scenario by ordering the items by their predicted rating. Considering every user-item pair, however, can very well incur infeasible computational costs, as the number of training examples is very large (O(|U||I|)), an open problem that has been tackled with various approaches, one of which is negative sampling [3], [33], [34].
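As an illustration of a pointwise criterion, the following sketch computes the mean squared error between predicted and observed ratings. The ratings and the global-mean predictor are hypothetical placeholders for a real dataset and a trained model, and squared error is only one possible choice of pointwise loss.

```python
from typing import Callable, Dict, Tuple

Ratings = Dict[Tuple[str, str], float]


def mean_squared_error(observed: Ratings,
                       predict: Callable[[str, str], float]) -> float:
    """Pointwise loss: each (user, item) pair contributes independently."""
    errors = [(predict(u, i) - r) ** 2 for (u, i), r in observed.items()]
    return sum(errors) / len(errors)


# Toy explicit ratings on a 1 to 5 scale (hypothetical data).
observed: Ratings = {("u1", "a"): 5.0, ("u1", "b"): 3.0, ("u2", "a"): 4.0}
global_mean = sum(observed.values()) / len(observed)


def predict_global_mean(user: str, item: str) -> float:
    """Baseline predictor that ignores both the user and the item."""
    return global_mean


mse = mean_squared_error(observed, predict_global_mean)
print(f"MSE of the global-mean baseline: {mse:.4f}")
```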

2) PAIRWISE OPTIMIZATION
Researchers have long argued against the discrepancy between the objective of the optimization process and the final output of a RS. For example, [35] showcased how influential a properly chosen optimization criterion can be towards the end result, proving the potential of pairwise methods by introducing the widely popular Bayesian Personalized Ranking (BPR) optimization criterion. Pairwise methods compare pairs of interactions at train time. The learning task becomes one that must determine which of the two items would be preferred by the user, ultimately creating an ordering between items and leading to a personalized ranking for each user. Other works, such as the one by [36], have also argued against pointwise approaches, claiming that the application of accuracy-based metrics (such as rating prediction error, see Section V-B) is a sub-optimal fit to the recommendation task. Furthermore, they argue that pointwise optimization is by nature wasteful, as a good approximation is sought for items that are not to be suggested. In general, pairwise approaches are at least as expensive as pointwise approaches, having to consider a number of pairs in the order of O(|R||I|) ⊇ O(|U||I|) [3], therefore also incurring complexity issues.
Learning-to-rank approaches attempt to model the fact that, ideally, the algorithm should learn to directly maximize a ranking utility. However, maximizing utilities of this kind is not trivial, as they are often non-differentiable or otherwise uninformative gradient-wise, and this challenge has been studied widely by the academic community [37], [38]. The examples mentioned above fall within the class of solutions that devise surrogate, differentiable ranking losses to minimize, such as to indirectly maximize ranking metrics. To clarify, whenever we use the term ''ranking loss'', we refer to one that only considers relative preferences between items for each user, and does not care about maximizing absolute utility scores on single items. We also point out that, as an alternative approach, the pairwise ranking task has been reformulated as a classification problem (as in [39]), where pairs are labeled as positive if correctly ordered, negative if not.
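To make the idea of a surrogate pairwise loss concrete, the sketch below evaluates the negative log-sigmoid of the score difference between an observed (positive) item and an unobserved (negative) item for the same user, which is the per-pair term commonly associated with BPR. The regularization term is omitted, and the function name and example scores are assumptions for illustration.

```python
import math


def bpr_loss(pos_score: float, neg_score: float) -> float:
    """-log(sigmoid(pos_score - neg_score)), written in a stable form."""
    diff = pos_score - neg_score
    if diff >= 0:
        # log(1 + exp(-diff)) is safe when diff is non-negative.
        return math.log1p(math.exp(-diff))
    # For negative diff, use the identity
    # log(1 + exp(-diff)) = -diff + log(1 + exp(diff)) to avoid overflow.
    return -diff + math.log1p(math.exp(diff))


# The loss depends only on the relative order of the two scores.
well_ordered = bpr_loss(pos_score=3.0, neg_score=0.0)
badly_ordered = bpr_loss(pos_score=0.0, neg_score=3.0)
print(f"{well_ordered:.4f} {badly_ordered:.4f}")
```

Adding the same constant to both scores leaves the loss unchanged, which reflects the definition of a ranking loss given above: only relative preferences matter.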

3) LISTWISE OPTIMIZATION
Listwise approaches can be seen as a generalization of pairwise approaches to multiple items. Authors in this field argue that listwise approaches are more suitable for the learning-to-rank paradigm, as they directly address the problem of creating a list of objects as a prediction [39], [40]. Evidently, due to the fact that such approaches work with permutations that grow factorially in number, the learning task can quickly become intractable. This issue is often addressed by utilizing what is called a score-and-sort approach [41]. The task is then to learn a scoring function that, given a query (e.g., a user) and a set of items, produces a vector of relevance values, which can then be used to produce a ranking.
Many learning-to-rank methods further reduce the problem to that of learning a univariate scoring function to produce a score between a single query and an item [36], [37], [42].
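The score-and-sort idea can be sketched in a few lines: a univariate scoring function is applied to each item independently and the ranking is obtained by sorting, so the number of permutations never needs to be enumerated. The lookup table below is a hypothetical stand-in for a learned scoring function.

```python
import math
from typing import Callable, List


def score_and_sort(query: str, items: List[str],
                   score: Callable[[str, str], float]) -> List[str]:
    """Score each (query, item) pair independently, then sort by relevance."""
    relevance = [score(query, item) for item in items]
    order = sorted(range(len(items)), key=lambda k: (-relevance[k], items[k]))
    return [items[k] for k in order]


# Hypothetical relevance values for a single user.
lookup = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.1}
items = list(lookup)

ranking = score_and_sort("u1", items, lambda query, item: lookup[item])
print(ranking)

# Scoring needs len(items) evaluations, while the number of candidate
# orderings grows factorially with the length of the list.
print(len(items), math.factorial(len(items)), math.factorial(20))
```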

4) MULTICLASS APPROACH
A related formulation frequently used in practice casts recommendation as an extreme multiclass classification problem where each item is a possible class. From a probabilistic point of view, this option models recommendation as a multinomial distribution over the items (conditional on the user, ŷ(i|u)). One of the most common functions utilized to translate the real valued score to a multinomial distribution is the softmax function, defined as:
$$\hat{y}(i \mid u) = \frac{\exp(\beta\, \hat{y}_{ui})}{\sum_{j \in I} \exp(\beta\, \hat{y}_{uj})}$$
In the above equation, ŷ_ui denotes the real-valued score of item i for user u, and β is a temperature parameter used to control how much the output distribution will be concentrated around large values [3]. This class of methods is often paired with cross-entropy as a measure of the distance (or error) between predicted and target distribution [33]. The seminal work that introduced ListNet [39] reaches a similar formulation, which the authors call ''top-one probability'', while devising an approach to the otherwise intractable listwise approach based on permutations. The authors prove that the top-k probabilities over the items form a probability distribution, and propose to utilize any loss metric that measures the distance between score lists, such as cross-entropy.
Treating recommendation as a multiclass scenario utilizing the softmax and cross-entropy pairing might appear loosely related to ranking metrics. In [37], however, the authors find analytical connections (under certain conditions) between this loss and popular ranking metrics. Many modern methods do, in fact, utilize similar approaches (i.e., cross-entropy paired with softmax), and it is debatable whether these methods should be considered listwise, as most of the time they do not model the permutations within elements in the list explicitly. For the sake of clarity, throughout this article we consider ranking methods only those which openly address the learning-to-rank task, usually by applying ranking losses (or their approximations) directly.
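The softmax and cross-entropy pairing can be sketched as follows. The code assumes that β multiplies the scores before exponentiation, so that larger values of β concentrate the distribution on the highest-scoring items; this convention, the function names and the example scores are assumptions made for illustration.

```python
import math
from typing import List


def softmax(scores: List[float], beta: float = 1.0) -> List[float]:
    """Turn real-valued scores into a distribution over the items."""
    scaled = [beta * s for s in scores]
    peak = max(scaled)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target_index: int) -> float:
    """Error against a target distribution concentrated on a single item."""
    return -math.log(probs[target_index])


scores = [1.0, 2.0, 3.0]  # hypothetical scores of three items for one user
flat = softmax(scores, beta=1.0)
sharp = softmax(scores, beta=5.0)
print([round(p, 3) for p in flat])
print([round(p, 3) for p in sharp])
print(round(cross_entropy(flat, target_index=2), 3))
```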

D. RETRIEVAL, RANKING AND SAMPLING
Before moving on to the taxonomy, we note that many recommendation approaches are not applied directly as ''out-of-the-box'' solutions in practical settings. Modern recommendation architectures are usually much more complex and include multiple steps. For example, [43] describes a two-step pipeline in which items are gradually filtered and re-ranked into smaller groups, making it possible to exploit different approaches with different complexity requirements even in large settings.
Generalizing beyond a concrete framework, many approaches implicitly assume that complex methods are preceded by a retrieval model, whose job is to return a short (compared to the whole database) list of items. Retrieval models have to make a compromise between returning good items and obtaining them within the required serving latency (commonly in the order of milliseconds). Complex models are then applied on the much smaller retrieved list, on which they act as a ranking model [44].
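A toy version of this retrieve-then-rank pattern is sketched below. The catalog, the popularity-based retrieval score and the precomputed "affinity" values standing in for an expensive ranking model are all hypothetical; the sketch is not the pipeline of [43] or [44].

```python
from typing import Callable, Dict, List, Tuple


def retrieve(candidates: List[str], cheap_score: Callable[[str], float],
             k: int) -> List[str]:
    """First stage: shortlist k items using an inexpensive score."""
    return sorted(candidates, key=lambda i: (-cheap_score(i), i))[:k]


def rank(shortlist: List[str],
         expensive_score: Callable[[str], float]) -> List[str]:
    """Second stage: reorder only the shortlist with a costlier model."""
    return sorted(shortlist, key=lambda i: (-expensive_score(i), i))


# Hypothetical catalog: item -> (popularity, affinity for the current user).
catalog: Dict[str, Tuple[int, float]] = {
    "a": (100, 0.10), "b": (90, 0.90), "c": (10, 0.99), "d": (80, 0.50),
}

shortlist = retrieve(list(catalog), lambda i: catalog[i][0], k=3)
final = rank(shortlist, lambda i: catalog[i][1])
print(shortlist, final)
```

Item `"c"` has the highest affinity but is dropped by the retrieval stage, which illustrates the compromise between quality and serving cost described above.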
Another possible way to approach large-scale systems is that of sampling. While going into detail about such approaches is beyond the scope of this survey, the basic idea is to tackle the sparsity problem by coupling positive examples (which are typically much fewer in number) with a restricted pool of negative ones. This is commonly referred to as negative sampling. The choice of sampling distribution for the negatives and how the sampler weighs different examples is key to the design of proper sampling strategies [3].
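A minimal sketch of negative sampling follows, assuming a uniform sampling distribution over the items the user has not interacted with. Uniform sampling is only the simplest option, and the function name and toy data are hypothetical.

```python
import random
from typing import List, Set


def sample_negatives(user_positives: Set[str], all_items: List[str],
                     n_negatives: int, rng: random.Random) -> List[str]:
    """Uniformly sample items that the user has not interacted with."""
    candidates = [i for i in all_items if i not in user_positives]
    n = min(n_negatives, len(candidates))
    return rng.sample(candidates, n)


all_items = ["a", "b", "c", "d", "e"]
positives = {"a", "b"}
rng = random.Random(0)  # fixed seed, used here only for reproducibility

negatives = sample_negatives(positives, all_items, n_negatives=2, rng=rng)
# Each positive is paired with the sampled negatives to form training examples.
training_pairs = [(pos, neg) for pos in sorted(positives) for neg in negatives]
print(negatives, training_pairs)
```

A different sampler (for example, one that favors popular items) would only change how `candidates` are drawn; the pairing step stays the same.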

III. TAXONOMY OF RECOMMENDATION SYSTEMS
In this section, we describe a data-oriented taxonomy for RSs. Based on our review of past and current works, we deem more appropriate an incremental taxonomy dependent on the amount and type of side information available, inspired by works such as [5] and [45]. Note that what is meant by incremental is not that each category must necessarily contain the ones before it, but rather that each category can be extended by combining it with the others, allowing for large areas of overlap between them. This better reflects how work in this field has progressed, rarely tying itself to a specific subset of data and instead attempting to utilize any bit of information as permitted by the situation. This section will introduce influential, explanatory examples for the categories, while Section IV will provide a more in-depth exploration of the current landscape of proposed methods.

A. TRADITIONAL VS DATA-ORIENTED CATEGORIES
In the following, we provide an overview of the classic taxonomy of RSs and then extend it to showcase a more accurate categorization of current approaches.

1) TRADITIONAL CATEGORIES
Traditional categorizations are helpful in providing a general perspective of the most prominent methods, even though we argue they no longer wholly capture how current methods approach the recommendation task. RSs have been traditionally divided in three main categories: collaborative filtering, content-based and hybrid recommenders:
• Collaborative filtering [46] methods predict user-item affinity by considering past interactions from other known users. This is commonly referred to as leveraging the ''wisdom of the crowd'', as suggestions made to a user will be based upon similar users. Similarity measures might differ, but are only ever based on past interactions and an expressed preference/feedback towards them;
• Content-based [47] methods are used to predict user-item affinity by considering only the user or item features (i.e., their ''content''); such approaches are most commonly user-centered, meaning that the system builds profiles for individual users to make predictions on unseen items. It is also possible to create item-centered systems, which model individual items and predict some sort of affinity score when provided with unseen users;
• Hybrid [48] methods utilize approaches that combine both of the above categories. Many different combination approaches have been proposed, as well as entirely new methods which fuse them into a single algorithm.
Though such labels still see use, many new categorizations have been adopted in the literature, each with different amounts of overlap with the others. Earlier works, such as the seminal work by [48], placed these groups beside Demographic-based (based on demographic attributes of users), Utility-based (based on a utility model of items with regards to users), and Knowledge-based (based on knowledge bases of items) systems. Hybrid systems would then be defined as combinations of two or more of these, with multiple possible combination strategies between them. While these categorizations are a good fit, they are not entirely balanced, and recent approaches have begun to incorporate them within larger categories. In our research, we find that most new approaches were, in fact, either hybrid or collaborative-based. Thus, though the distinction between collaborative and content-based filtering is still useful, it should be noted how the modern recommendation landscape is much more focused on the area of overlap between the two. We believe that by giving a more data-oriented depiction of this field we can paint a more realistic picture of the current state of RSs.

2) DATA-ORIENTED TAXONOMY
Differentiating models based on specific sources of data can potentially spawn too many subcategories. Recent approaches, from which we draw inspiration, maintain the dichotomy of interaction and content data and add a third category, which covers what is defined as contextual information. The latter aggregates features that are not specific to users or items but rather to the interactions themselves (i.e., that describe their context) [5]. Depending on the information utilized, methods may fall under one or multiple of these categories.
It has become common to define the agglomeration of content and context information as side information, which most commonly adds to the base information of interactions given by collaborative filtering approaches. In other words, most methods are influenced by collaborative filtering methods, to which they possibly add available side information. This is because interaction data (in particular, in its implicit form) is by and large the most common type of information available [35], [49], [50].
In the following sections, we describe a taxonomy based on three main categories:
• Collaborative filtering models, which are based solely on the interactions between items and users, ignoring all types of information that describe either or both;
• Content-enriched models, which integrate content data into a recommendation, including all descriptive attributes that may be associated directly with items or users;
• Context-aware models, which utilize those types of information associated with interactions but not exclusive to the user or item involved, such as time or location.
Fig. 2 depicts the differences in the type of information used in the three families of methods. Again, many parts of such categories overlap based on specific availability and considerations of a particular context.

B. COLLABORATIVE FILTERING METHODS
Collaborative filtering (CF) methods model users and items solely based on the interactions of a population of users. As mentioned, users' interests are usually presented as a numerical rating in a small range (explicit feedback) or as a binary value that simply indicates whether the interaction has occurred (implicit feedback). We reiterate that, in practical scenarios, implicit feedback is far more common.
Memory-Based vs. Model-Based: Collaborative filtering approaches have often been divided into two subcategories, namely memory-based (or heuristic-based) and model-based [4], [45]. Early approaches calculate the behavior similarity of users or items directly, operating over the collection of interactions in order to make a suggestion. They are termed memory-based because of how they store computed similarities between users or items as a sort of ''memory'' to produce new recommendations. Memory-based models may also be further sub-categorized based on whether they compare users or items. Similarity may be identified through metrics such as the Pearson correlation or the cosine similarity [45]. The most popular memory-based approaches fall into the neighborhood search category, which we discuss next.
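As a rough illustration of one of the similarity measures mentioned above, the following sketch computes the Pearson correlation between two users over the items both have rated. The dictionary layout (item identifier mapped to rating) and the function name are illustrative assumptions, and returning 0.0 when the correlation is undefined is a simplification rather than a prescribed convention.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation between two users, restricted to co-rated items.

    `a` and `b` map item identifiers to ratings (hypothetical layout).
    Returns 0.0 when the correlation is undefined (fewer than two co-rated
    items, or constant ratings on the co-rated items).
    """
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

Restricting the computation to co-rated items is one common choice; other treatments of missing ratings are possible and may change the resulting neighborhoods.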
Model-based approaches, on the other hand, train prediction models based on the user-item interaction matrix (in contrast to using ratings directly), hence transforming the task into one of estimating the model's parameters. Latent Factor Models (LFMs), discussed in Section III-B2, are the most popular representatives of this category, though the idea is not exclusive to them and includes approaches such as cluster models and Bayesian networks [4], [51].

1) NEIGHBORHOOD METHODS
A popular memory-based approach is the conceptually simple nearest-neighbor (k-NN) algorithm [4], [45]. A user-based approach of this kind finds the k users with the highest similarity in terms of ratings and bases its expected affinity between a given user u and an unseen item i on the ratings of the k neighboring users. For example, a simple formulation to produce a predicted rating $\hat{r}_{u,i}$ might be:

$$\hat{r}_{u,i} = C \sum_{v \in K_u} \mathrm{sim}(u, v)\, r_{v,i}$$

In the above equation, $C$ is a normalizing constant, $\mathrm{sim}$ is a chosen similarity measure, and $K_u$ and $r$ are the set of neighboring (similar) users to $u$ and the true ratings, respectively. The above is merely a simple example, and much more refined methods exist. A similar, mirrored approach can be taken towards item-based neighborhood methods, where an unknown rating is predicted by averaging the ratings of similar items rated by the same test user [52]. Nearest-neighbor methods (and memory-based approaches in general) can run into scalability issues. While the underlying implementation and what is being compared (items, users, a combination of both) obviously matters, there is an inescapable complexity in calculating similarities between all pairs of users and/or items. Most approaches, for $|U|$ users and $|I|$ items, have a worst-case time complexity of $O(|U||I|)$, though it is empirically usually closer to $O(|U| + |I|)$ thanks to the sparsity of most user vectors [53]. This might still be prohibitive for large datasets, but appropriate preprocessing paired with sampling techniques can make these systems more viable at large scale, though recommendation quality is likely to be reduced.
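The user-based prediction described above can be sketched as a similarity-weighted average. This is a minimal sketch under stated assumptions: the toy ratings, the choice of cosine similarity, the restriction of neighbors to users who rated the target item, and setting the normalizing constant to the inverse of the summed absolute similarities are all illustrative choices, not the formulation of any specific cited method.

```python
import math

# Toy explicit-feedback data (hypothetical values): ratings[user][item] = rating.
ratings = {
    "alice": {"i1": 5.0, "i2": 3.0, "i3": 4.0},
    "bob": {"i1": 4.0, "i2": 3.0, "i3": 5.0, "i4": 4.0},
    "carol": {"i1": 1.0, "i2": 5.0, "i4": 2.0},
}


def cosine_sim(a, b):
    """Cosine similarity between two users, computed on co-rated items only."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item, k=2):
    """Weighted average of the ratings given to `item` by the k most similar users."""
    scored = []
    for v, v_ratings in ratings.items():
        if v == user or item not in v_ratings:
            continue  # only users who actually rated the item can contribute
        scored.append((cosine_sim(ratings[user], v_ratings), v))
    neighbors = sorted(scored, reverse=True)[:k]
    norm = sum(abs(s) for s, _ in neighbors)  # plays the role of 1 / C
    if norm == 0:
        return None  # no usable neighbor: the prediction is undefined
    return sum(s * ratings[v][item] for s, v in neighbors) / norm
```

Calling `predict_rating(ratings, "alice", "i4")` blends the ratings of the two neighbors, weighting the more similar user more heavily; with `k=1` only the single closest neighbor is used.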

2) LATENT FACTOR MODELS
LFMs have risen to be the most prominent representatives of collaborative filtering-based RSs, attracting a great deal of attention ever since their impressive results in the Netflix contest [54], [55]. The idea behind such approaches is to find representations for users and items in a shared latent space, deriving them from the interaction matrix. The general objective is to learn two embedding matrices $P$ and $Q$ for users and items, where $p_u$ and $q_i$ are the embedding vectors in the corresponding matrices for user $u$ and item $i$, respectively. Latent models aim to find the underlying relationships between users and items by learning what are defined as their ''latent factors''.
The most well-known method in this category is Matrix Factorization (MF), and its basic idea is at the foundation of other LFMs. This model attempts to decompose the interaction matrix into the respective embedding matrices, whose combination is a good approximation of the original feedback matrix (Fig. 3). The most basic type of MF, proposed in [56], predicts a rating $\hat{r}_{u,i}$ by performing a dot product between user and item embeddings:

$$\hat{r}_{u,i} = p_u^{\top} q_i$$

Though already an effective formulation, many developments have been proposed for it since (such as the different variants of the SVD [57], [58] and iALS [32] methods). One of the advantages of MF is its compact representation; given $|U|$ users, $|I|$ items and $d$-dimensional embeddings, the theoretical space complexity of the embedding matrices is $O((|U| + |I|)d)$ in total. Given that $d$ is typically much smaller than both $|U|$ and $|I|$, the resulting complexity is much more affordable than that of methods that have to consider all user-item pairs. As a side note, the dot product has become a popular combination strategy for latent representations in many applications within the field of RSs (as we will showcase in Section IV). Indeed, it can be applied to any system that produces embedding representations for users and items. While we do not dive into details, it should suffice to know that dot product models are widely popular because they provide an efficient and effective way of combining embeddings, with many well-studied approximations that improve their practical applicability [3].
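To make the dot-product prediction concrete, the sketch below learns small user and item embeddings with plain stochastic gradient descent on a squared-error loss with L2 regularization. The training procedure, the hyperparameters (`d`, `lr`, `reg`, `epochs`) and the toy interactions are assumptions chosen for illustration; they are not the specific algorithm of [56] or of the SVD and iALS variants cited above.

```python
import random


def predict(P, Q, u, i):
    """Dot product between the embedding of user u and the embedding of item i."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def train_mf(interactions, n_users, n_items, d=2, lr=0.05, reg=0.01,
             epochs=500, seed=0):
    """Fit embeddings on (user, item, rating) triples with SGD (illustrative only)."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in interactions:
            err = r - predict(P, Q, u, i)
            for f in range(d):
                pu, qi = P[u][f], Q[i][f]  # keep old values for a symmetric update
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def squared_error(P, Q, interactions):
    """Sum of squared errors over the observed interactions."""
    return sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in interactions)


# Toy data: (user index, item index, rating); unobserved pairs are simply absent.
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
```

Only the two embedding matrices are stored, which is where the compact space requirement comes from: a score for any user-item pair, observed or not, is recovered on demand with `predict`.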

3) GRAPH-BASED ALGORITHMS
Graph-based algorithms, as the name suggests, opt out of the traditional data representation based on feedback matrices, instead adopting one based on graphs. Not only is this a natural fit to interaction data, but it also lends itself quite conveniently to the integration of additional side information.
Indeed, the interaction relationship between users and items can be easily translated to a graph representation. Formally, consider the interaction relationship S as composed of user and item pairs, i.e., (u, i) ∈ S if an interaction has occurred between the two. We can then define the graph as G = (U ∪ I, S) (Fig. 4). This representation is inherently bipartite, since user nodes can only be connected to item nodes and vice versa. Edges within the graph (i.e., interactions) might also be weighted, depending on available information. The weight might be based on feedback data such as explicit ratings, but might also be a straightforward opportunity to introduce more nuanced influences, such as the one derived from content information based on the features of the pair representing the edge.
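The construction of the bipartite graph from the interaction relationship can be sketched with plain adjacency sets. Tagging nodes as `("user", id)` or `("item", id)` and the helper names are illustrative assumptions; the second function shows that, since no user-user edges exist, reaching another user always takes two hops through a shared item.

```python
from collections import defaultdict


def build_bipartite_graph(S):
    """Adjacency sets for G = (U ∪ I, S); node tags keep the two partitions apart."""
    adj = defaultdict(set)
    for u, i in S:
        adj[("user", u)].add(("item", i))
        adj[("item", i)].add(("user", u))
    return adj


def users_two_hops_away(adj, u):
    """Users sharing at least one item with u (path user -> item -> user)."""
    reached = set()
    for item_node in adj[("user", u)]:
        for _, v in adj[item_node]:
            if v != u:
                reached.add(v)
    return reached


# Toy interaction relationship S (hypothetical identifiers).
S = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u3", "i3")]
graph = build_bipartite_graph(S)
```

A weighted variant could store a dictionary of neighbor to weight instead of a set, for instance to carry explicit ratings.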
In general, the objective of graph-based recommenders is to discover a ranking of item vertices in I for a user vertex u based on their respective similarities, as defined by the structure of the graph. If the interactions are implicit and the graph unweighted, the task can be framed as a more generic link prediction problem [14]. The bipartite graph representation grants several advantages; most notably, information can be propagated through nodes to mitigate sparsity and cold start issues. However, the true challenge resides in finding an effective way to enact such propagation. Moreover, this is particularly challenging in bipartite graphs, as user-user or item-item edges do not exist, requiring multiple hops from neighboring nodes for certain communications to happen.
Many graph-based approaches exist, and their recent resurgence in popularity has revitalized this category of methods. A popular example of such approaches consists of random walk-based algorithms [59], [60]. In short, these methods operate through a stochastic process that lets a random walker move between nodes based on a transition probability (established from known feedback). The probability that a walker lands on an item node after a certain number of steps is utilized as a means to rank the candidate nodes. Examples of methods that fall within this category include $P^3_\alpha$ [61] and $RP^3_\beta$ [62], which have obtained excellent results. An interesting consideration is that $P^3_\alpha$ can be framed as equivalent to a k-NN item-based approach, which highlights how similar certain approaches can be despite a different representation [23].
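The ranking idea behind random walk methods can be illustrated with an exact three-step walk (user to item, item to user, user to item) on an unweighted bipartite graph. This is a simplified sketch in the spirit of three-step walk methods, not a faithful reimplementation of [61] or [62]: uniform transition probabilities raised to a power `alpha`, the function names and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def three_step_walk_scores(interactions, user, alpha=1.0):
    """Score items by the (alpha-powered) probability of a user-item-user-item walk."""
    user_items = defaultdict(set)
    item_users = defaultdict(set)
    for u, i in interactions:
        user_items[u].add(i)
        item_users[i].add(u)

    scores = defaultdict(float)
    for i in user_items[user]:
        p_ui = (1.0 / len(user_items[user])) ** alpha
        for v in item_users[i]:
            p_iv = (1.0 / len(item_users[i])) ** alpha
            for j in user_items[v]:
                p_vj = (1.0 / len(user_items[v])) ** alpha
                scores[j] += p_ui * p_iv * p_vj
    return dict(scores)


def recommend(interactions, user, alpha=1.0):
    """Rank items the user has not interacted with by decreasing walk score."""
    seen = {i for u, i in interactions if u == user}
    scores = three_step_walk_scores(interactions, user, alpha)
    unseen = [(s, j) for j, s in scores.items() if j not in seen]
    return [j for s, j in sorted(unseen, key=lambda t: (-t[0], t[1]))]


# Toy implicit feedback (hypothetical identifiers).
toy_interactions = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u2", "i3"),
                    ("u3", "i1"), ("u3", "i3"), ("u3", "i4")]
```

With `alpha=1.0` the scores are proper three-step landing probabilities and sum to one over all items; other values of `alpha` reweight the transitions, so the scores should then be read only as ranking values.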
More recently, graph embeddings have been proposed as a way to exploit graph structures, mapping nodes into low-dimensional embedding vectors to capture the structural information of the graph. Such embeddings can then be used as representations for users and items. With the advent of neural approaches, Graph Neural Networks (GNNs) have also been proposed, which we further detail in Section IV-F.

C. CONTENT-ENRICHED METHODS
We define as content-enriched those methods that integrate information about the main agents of interactions within an RS (i.e., users and items). Differently from the traditional class of content-based RSs, we consider this group almost as a natural extension to collaborative filtering models. Empirically speaking, we find that many modern content-enriched methods utilize interaction data as a basis.
Therefore we find, similarly to [5], that describing them as complementary to collaborative filtering approaches is more suitable. Most content-enriched methods indeed ''enrich'' base interactions with auxiliary data related to users or items. This category can be further dissected into sub-classes that encapsulate the specific variety of data being incorporated, as will be described in the remainder of this section. As always, these categorizations can overlap and may be integrated as deemed appropriate.

1) PURELY CONTENT-BASED APPROACHES
We begin by providing a brief overview of purely content-based approaches. Empirically, we find these to be less common in modern systems, where hybridized models are by far the most prevalent. Still, in light of the previously mentioned fact that RSs are oftentimes complex, multi-step processes with multiple algorithms involved, they are worth mentioning, and can still be useful in certain contexts [63], [64]. A positive side of such algorithms is that they often produce predictions that are more accurately tailored to single users when compared to purely collaborative approaches. On the other hand, purely content-based systems suffer from overspecialization [4], [65]: recommendations are restricted to items similar to previous interactions, and are sometimes too similar to them (and hence, not interesting). Furthermore, while they fare better than CF in item cold start scenarios, they struggle with user cold start, as a sufficient number of interactions is necessary before a user profile can be built.
A standard example of a purely content-based approach would be an item-based k-NN approach [47]. This is similar to the one detailed in Section III-B1, except that item similarities are computed utilizing content attributes rather than ratings. To improve scalability, many content-based approaches resort to a projection of features into some type of low-dimensional space, which is then utilized to perform, for example, a search of nearby items. It is also possible to develop predictive models, inducing a dichotomy similar to the one between memory-based and model-based CF approaches (though it is less common to make this distinction in content-based approaches). These include various types of classifiers, decision trees, and clustering methods [4].

2) CATEGORICAL FEATURES AND ATTRIBUTES
The sources of side information regarding users and items are broad and varied. Categorical and other similarly quantifiable generic types of attribute information are among the most commonly found, describing users and items to some degree (e.g., the genre of a movie or the gender of a user).
Strong representatives of the models in this category are Factorization Machines (FMs) [66], which extend factorization models by integrating ideas and advantages of Support Vector Machines (SVMs) [67]. FMs are general predictors but, in contrast to SVMs, and thanks to their modeling of interactions between variables through factorized parameters, can estimate interactions in settings with high sparsity (which is practically always the case in RSs), all at an affordable computational cost [68]. FMs are not applied directly to the interaction matrix, instead requiring a data representation that more closely reflects their predictive nature. Concretely, FMs are provided a matrix in which each row is a feature vector that describes a specific interaction and its features. An interaction matrix can be easily transformed into such a form by creating a one-hot encoding of items and users (Fig. 5).
Additional features may then be concatenated to this feature vector. As each of these rows has a target value y ∈ Y (e.g., rating), this framework is easily understandable as a standard prediction task. Notably, an FM without any auxiliary data is identical to an MF model, and it has also been shown that FMs can mimic most factorization models with appropriate feature engineering [69].
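The data representation and prediction step of a second-order FM can be sketched as follows. The model form (a global bias, linear weights, and pairwise interactions weighted by dot products of factor vectors) is the standard second-order FM; the helper names, parameter values and layout of the feature vector (users first, then items, then optional extra features) are illustrative assumptions.

```python
def one_hot_row(u, i, n_users, n_items, extra=()):
    """Feature vector for one interaction: one-hot user, one-hot item, extra features."""
    x = [0.0] * (n_users + n_items)
    x[u] = 1.0
    x[n_users + i] = 1.0
    return x + list(extra)


def fm_predict(x, w0, w, V):
    """Second-order FM: w0 + sum_j w_j x_j + sum_{j<l} <V_j, V_l> x_j x_l.

    The pairwise term uses the usual linear-time identity
    0.5 * sum_f [(sum_j V_jf x_j)^2 - sum_j (V_jf x_j)^2].
    """
    linear = sum(wj * xj for wj, xj in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[j][f] * x[j] for j in range(len(x)))
        s_sq = sum((V[j][f] * x[j]) ** 2 for j in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Hypothetical parameters for 2 users and 2 items with 2-dimensional factors.
w0 = 0.5
w = [0.1, 0.2, 0.3, 0.4]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0], [2.0, 2.0]]
x = one_hot_row(0, 1, n_users=2, n_items=2)
```

With only the two one-hot blocks active, the pairwise term collapses to the dot product between the factor vector of the user and that of the item, which is the MF prediction plus bias terms; concatenating extra features simply adds further rows to `w` and `V`.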

3) MULTIMEDIA CONTENT
Not all types of features can be introduced straightforwardly. When it comes to multimedia content, a dedicated approach may be necessary to first extract a good representation. This is the case for textual and visual content, which have been extensively studied in their own right; the developments in the fields of Natural Language Processing (NLP) and Computer Vision (CV), respectively, can be combined with recommendation frameworks to obtain better user and item representations. The same approach can be taken with regard to audio and video content.
Textual content. The past decade has seen revolutionary advancements in the field of NLP. In particular, neural network-based approaches have enabled the automatic extraction of syntactically and semantically meaningful representations for text, most recently with the development of contextualized word embeddings based on Transformer architectures [70]-[72]. These embeddings can be combined with user and item embeddings produced by CF approaches to produce more accurate representations, or to produce more explainable recommendations [73]. As an example, content descriptions (such as abstracts for articles) can be utilized in this fashion [74].
Image content. RSs based on image content are suitable for scenarios that rely heavily on visual influence, such as clothing recommendation [75]. Image-based models may attempt to extract textual tags from images, which may then be processed as discussed previously. Another approach is to project both users and items into the same visual space; items are trivially projected through their pictorial representation, while users may be projected through the items they previously liked or by more advanced encoding procedures. Approaches based on Convolutional Neural Networks (CNNs) are, as of now, some of the most popular and prolific in terms of extracting features from images [76]-[79].
Audio and video. RSs based on rich visual and auditory information have also been proposed, both as purely content-based models and as components integrated into hybrid, content-enriched architectures [80], [81]. These can be particularly useful to recommend new audio and video content that has no historical behavior data by comparing its similarity to other well-known items, mitigating cold start issues. The more straightforward approach might be to utilize metadata related to such types of media (e.g., titles or descriptions), as it is more easily manageable. Nonetheless, some recent approaches have developed deep neural networks to extract image and audio features, projecting items into a low-dimensional feature space in which it is easier to operate (for example, by searching for similar videos in this space) [5]. It is worth noting that working with video media can be difficult because of the underlying computational cost, as well as the storage requirements [80].

4) SOCIAL NETWORKS
Recommenders based on social networks (sometimes more broadly termed as ''community'' based) leverage the preferences of a user's friends or otherwise closely tied users to make a recommendation. Such methods attempt to model the underlying social influence among ''neighbors'', which is seen as the driving force that correlates users' interest in the network. Social sciences have long studied principles such as homophily, a property suggesting that contact between similar people occurs at a higher rate than among dissimilar people. In turn, this provides reason to believe that capturing social interactions can lead to better recommendations, as friends are likely to show preference to similar things.
The integration of social networks has been often devised as a countermeasure towards cold start issues and to integrate information into particularly sparse environments. For instance, [82] integrates social influences into probabilistic LFMs as regularization terms. Some recent approaches have also introduced social influences in neural models. An example is the work by [83], which pairs latent model-inspired user embeddings with social embeddings learned from an unsupervised deep learning approach, applying regularization techniques based on social correlation theories.
Social connections are also a natural extension to graph-based approaches, where they can be seen as edges that relate users to other users. While other approaches that integrate social networks might limit themselves to local first-order social neighbors (i.e., the direct friends of a user), recent approaches (notably GNNs) have been proposed as more accurate models to describe the global social diffusion process for recommendation. These methods have been applied to user-user social graphs (i.e., no interactions), but perhaps more interestingly have also been applied to heterogeneous graphs where both social connections and interactions are present [84]. The leftmost side of Fig. 6 represents an example social interaction graph.

5) KNOWLEDGE GRAPHS
Knowledge Graphs (KGs) are another effective way to represent entities and the relationships between them. There has been some debate [85] on the exact definition of this term; in the context of RSs, the denomination knowledge graph refers to directed graphs containing nodes e ∈ E which represent entities, and edges s ∈ S to denote the relationships between them. A KG is then formally defined as G = {(h, s, t) | h, t ∈ E, s ∈ S}, where each triplet indicates the existence of a relationship s between head entity h and tail entity t [86]. Users and items will have relationships with their describing features (or otherwise connected data), which might also be related to other entities.
Earlier approaches utilized KGs to extract a representation (e.g., in the form of embeddings for users and items), but recent models have proposed to enrich such graphs by adding interaction relationships to the graph itself, arguing that it provides a more complete representation [87]. In other words, such methods add interactions to the set of edges of the graph. Fig. 6 showcases an illustrative example. As before, the user-item interaction data is most commonly presented as a bipartite graph. Note that, while the example shown is mixed with information from a social interaction graph, this need not be the case. The concept outlined above, where KGs are enriched with interaction information, can be formally expressed as G = {(h, s, t) | h ∈ E, t ∈ E ∪ I, s ∈ S ∪ R}, where we abuse notation and refer to R as the rating/interaction relationship.

FIGURE 6. A possible (simplified) graph representation for a RS, including elements of a KG as well as social relationships. Double dotted (yellow) edges between users represent social relationships between users, while straight directed edges (purple) indicate relationships between entities. As before, dotted directed (green) edges between users and items represent ratings.
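A KG and its enrichment with interactions can be sketched as a set of (head, relation, tail) triples. Everything concrete here is hypothetical: the entity and relation names, the `interacted_with` label standing in for the rating/interaction relationship, and the helper that looks for unseen items sharing an entity with an item the user interacted with, which only hints at how a path in the graph can double as an explanation.

```python
def enrich_with_interactions(kg_triples, interactions, relation="interacted_with"):
    """Add user-item interactions to the KG as extra (head, relation, tail) triples."""
    return set(kg_triples) | {(u, relation, i) for u, i in interactions}


def tails(graph, head, relation=None):
    """Tail entities reachable from `head`, optionally restricted to one relation."""
    return {t for h, s, t in graph if h == head and (relation is None or s == relation)}


def explainable_candidates(graph, user, relation="interacted_with"):
    """Unseen items sharing a (relation, entity) pair with an item the user interacted with."""
    seen = tails(graph, user, relation)
    found = set()
    for h, s, t in graph:
        if h in seen and s != relation:
            for h2, s2, t2 in graph:
                if s2 == s and t2 == t and h2 not in seen:
                    found.add((h2, s, t))
    return sorted(found)


# Hypothetical knowledge graph and interactions.
kg = {
    ("movie_a", "directed_by", "director_x"),
    ("movie_b", "directed_by", "director_x"),
    ("movie_c", "has_genre", "genre_y"),
}
enriched = enrich_with_interactions(kg, [("u1", "movie_a")])
```

Each returned tuple names a candidate item together with the relation and entity that connect it to the user's history, for example "directed by the same director as an item already seen".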
KG-based methods are attractive because they allow for a more interpretable system [12], as it is possible to verify the reasoning behind a recommendation and, thus, create an explanation for it. In a similar fashion to standard graph approaches, these approaches can be used as regularization terms, as input for predictive path-based methods, and, more recently, have been explored in the frame of GNNs to model higher-order connectivity representations.

D. CONTEXT-AWARE METHODS
The category of context-aware methods includes approaches that integrate information sources that can describe the environment where an interaction happens. Because of this, this context is sometimes called ''interaction-associated information'' [45]. Some authors make the distinction between representational context [88], which is defined by a predefined set of ''observable'' context variables (e.g., time, location, weather), and interactional context [89], [90], which instead is more dynamic and has to be derived from the user's most recent actions and is not directly observable (user mood, current shopping intent). The latter set of context data is considered particularly important in settings where users are anonymous or new, as there is no historical data in such scenarios. Some types of side-information might not fall clearly within one category or the other, such as textual reviews for an item (which are content for the item, but contextual to the interaction that prompted the review).
The most widely studied type of contextual information is by far the temporal kind [4], [5]; as such, the next sections examine in more detail approaches that take into consideration the temporal domain. Though we do not go into detail about other types of context, we point out that several approaches are possible in those cases, and are often similar to those described in the previous section. Methods that are not usually designed to work outside of the two-dimensional User × Item space have been generalized to multidimensional spaces, with approaches such as Tensor Factorization [91]. Tensor Factorization generalizes MF to arrays of higher orders; intuitively, it factorizes interactions in the generic form (user, item, interaction context, rating).
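One simple way to picture the generalization from matrices to tensors is a CP-style model, in which each user, item and context value receives an embedding and the prediction sums their element-wise product. This is only one of several tensor factorization models and is used here as an assumed example; the method of [91] may rely on a different decomposition, and the names and values below are hypothetical.

```python
def cp_predict(P, Q, C, u, i, c):
    """CP-style score for (user u, item i, context c): sum_f P[u][f] * Q[i][f] * C[c][f]."""
    return sum(pf * qf * cf for pf, qf, cf in zip(P[u], Q[i], C[c]))


# Hypothetical 2-dimensional embeddings: one user, one item, two context values.
P = [[1.0, 2.0]]
Q = [[3.0, 4.0]]
C = [[1.0, 1.0],  # context 0: all ones, the score reduces to the plain dot product
     [0.0, 1.0]]  # context 1: only the second latent factor contributes
```

A context embedding made of ones recovers the two-dimensional MF score, which shows how the context factor modulates the same user-item affinity depending on the situation.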
Categorization of Temporal Methods: Users' preferences change and evolve over time; due to this fact, static recommendations are likely to be less effective. Instead, it may be possible to discern patterns within the sequential behavior of users, which is the aim of methods that incorporate time in the recommendation process. Methods in this field usually differentiate between a sequence, which is considered a list of chronologically ordered interactions with no explicit time intervals, and a session, a list of interactions with a clear boundary, either ordered or unordered, which most commonly spans a relatively brief interval of time.
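The difference between a sequence and a session can be made concrete by splitting a chronologically ordered log whenever the user has been inactive for too long. Using an inactivity threshold as the session boundary is an assumption made for this sketch (boundaries may also come from explicit logins or application events), as are the 30-minute default and the toy timestamps.

```python
def to_sequence(events):
    """Chronologically ordered list of items, discarding the time intervals."""
    return [item for _, item in sorted(events, key=lambda e: e[0])]


def split_into_sessions(events, max_gap=1800):
    """Group (timestamp_in_seconds, item) events into sessions.

    A new session starts whenever the gap from the previous event exceeds
    `max_gap` seconds (hypothetical boundary rule).
    """
    sessions = []
    last_ts = None
    for ts, item in sorted(events, key=lambda e: e[0]):
        if last_ts is None or ts - last_ts > max_gap:
            sessions.append([])
        sessions[-1].append(item)
        last_ts = ts
    return sessions


# Toy interaction log for one user (hypothetical timestamps in seconds).
log = [(5000, "c"), (0, "a"), (600, "b"), (5100, "d")]
```

Here `to_sequence(log)` yields a single ordered list, while `split_into_sessions(log)` yields two short lists separated by the long pause between the second and third events.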
However, while research in this particular area is extensive, it is also surprisingly scattered, making it hard to find a categorization commonly agreed upon. To cite a few notable surveys, [8] classifies these methods based on the importance given to historical interactions, distinguishing between last-N interactions-based recommendation, which considers only the last few user interactions, session-based recommendation, in which only the last sequence of interactions (contained in a session) is available, and session-aware recommendation, which contains both knowledge about the current session as well as historical information. On the other hand, in [9] such categorizations are deemed to be more fitting of a ''sequence-based'' class. The authors argue that session-based methods are not only those that consider single, anonymous sessions, but instead include approaches that consider historical sessions as well.
Based on these ideas as well as other works [45], [92], [93], we differentiate between three loosely separated classes:
• Time-aware RSs utilize time information directly to recommend appropriate items, with a focus that is more largely tied to the exact point of time of past user interactions (e.g., time of day, day of the week). These methods are still related to a sequential environment as with the other two classes, as they might discover patterns within, for example, temporal cycles;
• Sequence-based RSs, sometimes called time-dependent or sequential recommenders, instead put a much larger focus on the sequential order of events. Such methods aim to predict the next items a user might interact with given a sequence of historical interactions;
• Session-based RSs instead group interactions within sessions and tackle tasks more closely related to the sessions in question (further detailed later).
Again, these categories are not separated by hard lines and should be taken as purely functional to a better understanding of this sub-field. Indeed, sequence- and session-based RSs are often considered as special cases of a broader category, that of sequence-aware methods [8], [94].
We do not discuss time-aware approaches directly, pointing out that many of them rely on matrix completion approaches (i.e., common CF approaches), of which [92] provides an excellent overview. Instead, in the next sections, we briefly introduce sequence- and session-based recommenders, which have seen rising popularity in recent research. Notably, both of these classes frequently differentiate between various types of interactions [93], i.e., different with regard to the concrete action logged at that specific time (e.g., view vs. buy). This is particularly relevant in a sequential context, as a specific sequence of actions might deliver further insights on the intent of a user.
As a side note, sequence- and session-based RSs will sometimes have to predict a utility value on a list of items, rather than for a specific item. This is different from Equation 1, where the utility was based on a single item, as the utility of the list is calculated on the entire sequence, and the sequence with maximum score should be found.

1) SEQUENCE-BASED RECOMMENDERS
Sequence-based systems try to explicitly discover the sequential dependencies among interactions, such as to discover behavioral patterns and other information that can only be understood when viewing the interactions as a succession of events. There are different types of patterns that might be sought; for example, [8] differentiates between sequential, co-occurrence and distance patterns. Sequential patterns relate interactions in a specific order, while co-occurrence patterns only care that two interactions have happened together. Distance patterns are less common and try to identify good lapses of time necessary before recommending something (e.g., a reminder).
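The distinction between sequential and co-occurrence patterns can be illustrated by counting item pairs in two ways: ordered pairs, where the first item precedes the second in a sequence, and unordered pairs, where the two items merely appear together. Counting pairs once per sequence and the toy sequences are assumptions of this sketch, not the mining procedure of any cited work.

```python
from collections import Counter
from itertools import combinations


def sequential_patterns(sequences):
    """Count ordered pairs (a, b) such that a occurs before b in a sequence."""
    counts = Counter()
    for seq in sequences:
        pairs = {(a, b) for idx, a in enumerate(seq) for b in seq[idx + 1:] if a != b}
        counts.update(pairs)  # each pair counted at most once per sequence
    return counts


def cooccurrence_patterns(sequences):
    """Count unordered pairs of items that appear in the same sequence."""
    counts = Counter()
    for seq in sequences:
        counts.update(frozenset(p) for p in combinations(sorted(set(seq)), 2))
    return counts


# Toy interaction sequences (hypothetical items).
toy_sequences = [
    ["phone", "case"],
    ["phone", "charger", "case"],
    ["case", "phone"],
]
```

On this data the pair phone and case co-occurs in all three sequences, but the order phone then case is observed only twice, which is the kind of asymmetry a sequential pattern captures and a co-occurrence pattern ignores.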

2) SESSION-BASED RECOMMENDERS
Session-based recommenders consider (usually short) sequences of interactions within clearly bounded periods of time. A user's sessions are usually separated by non-identical time intervals. Sessions themselves have been categorized in multiple ways depending on their internal characteristics (length, internal order, action type). While the most common type of session is totally ordered and contains interactions of a single type, heterogeneous and partially ordered (or unordered) sessions have also been researched. Further considerations have also been made with regard to the length of the sessions and the amount of content (user) information available [93]. We note that some of the most popular session-based RSs are limited to the interactions of the current user session (i.e., only the last one), which is the case in anonymous or new-user scenarios. Most of the time, researchers will use the term ''session-based'' to refer to this situation, where user attributes and histories are usually scarce or not present [95]. As there is no consensus, we do not make a clear-cut distinction, clarifying when sessions are deemed as anonymous (only the last session is present) and when the system is instead session-aware, i.e., has knowledge about historical sessions (a term we also borrow from [95]).
Common Approaches: There is a wide range of approaches that have been proposed for sequence- and session-based recommendation, and a detailed overview is provided by [9]. In general, the most popular approaches are, as expected, based on the exploitation of the sequential item transition patterns. These include conventional sequential approaches (e.g., Markov chains) [96], LFMs, and neural network-based approaches (e.g., RNNs, CNNs) [95]. Notably, graph-based approaches have also been successful, integrating sessions as chains (sequences of nodes) within the graph [97]-[99]. Methods also differ depending on the task being faced, which, especially in the case of session-based recommenders, can be of various types. The most common categorization separates them based on whether the system is trying to predict the next item in the session, all the remaining items (until the end of the session), or even the entirety of the next session [9].
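The simplest of the conventional sequential approaches mentioned above, a first-order Markov chain, can be sketched by counting item-to-item transitions in past sessions and scoring next-item candidates from the last item of the current session only. The helper names and the toy sessions are hypothetical, and the sketch covers just the next-item prediction task.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Count how often each item is immediately followed by each other item."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for prev_item, next_item in zip(session, session[1:]):
            transitions[prev_item][next_item] += 1
    return transitions


def next_item_scores(transitions, session):
    """Estimated probability of each candidate next item, given the last item only."""
    if not session:
        return {}
    counts = transitions.get(session[-1], {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {item: count / total for item, count in counts.items()}


# Toy historical sessions (hypothetical items).
past_sessions = [["a", "b", "c"], ["a", "b", "d"], ["b", "c"]]
model = fit_transitions(past_sessions)
```

Because only the last item matters, the model needs no user identity, which is why this family of methods suits anonymous sessions; the price is that longer-range dependencies within the session are ignored.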

E. SUMMARY
We described an extended taxonomy to classify RSs based on the amount and type of information they exploit to make recommendations. We consider this categorization as a reframing of the traditional classification schema, which we deem to have become less informative due to the abundance of hybrid methods that have been proposed. The taxonomy is inspired by the one devised in [5] and consists of three broad categories. The first makes exclusive use of user-item interaction histories, while the other two families of methods are characterized by the usage of additional information, namely user and item content and any environmental data describing the context in which the interaction took place. These categories are further specialized into more fine-grained sub-classes of methods, to exemplify how practical methods fit in this classification. It is easy to see how such a taxonomy puts much more emphasis on hybrid approaches; moreover, practical implementations usually fall in the large areas of overlap between the first category and one (or both) of the others.

IV. METHODS OVERVIEW
This section provides an overview of the most recent proposals for the improvement of RSs. When vital to the understanding of more recent methods, earlier influential approaches are introduced.
To reduce redundancy, methods are grouped by the generic model they build on; many methods span different categories, so the distinction is not clear-cut. Whenever describing methods, we will clarify where they lie in the data-oriented taxonomy: collaborative filtering (CF), content-enriched (CE), or context-aware (CA). In particular, a method will be marked as CF if it only acts in a purely collaborative setting, while methods that use side information can be tagged as either CE, CA, or both. In those cases, we do not write CF, although we find that the vast majority of the algorithms analyzed have a collaborative foundation (in most neural approaches, the collaborative signal is implicitly embedded in the learning process).
Unsurprisingly, most of the discussed methods will be neural methods; the application of neural network frameworks is undeniably the most popular new approach to ML tasks. The section is also loosely ordered in terms of optimization processes. The widest class, described first, is a mixture of pointwise and multiclass approaches. In Section IV-G, we will instead discuss approaches that directly tackle the learning-to-rank paradigm, while Section IV-H makes a few notable mentions of approaches not included elsewhere.

1) CONTROVERSY ON PROGRESS
Before going into detail on the most recent methods, we deem necessary a word of caution. The particular field of RSs has seen a vast amount of proposed improvements and proclaimed advancements, but these have been sometimes disputed by fellow researchers. It has been demonstrated that, at times, much simpler methods can compete with or surpass complex, deep learning-based methods, which were deemed superior because of poor testing practices or non-standardized metric evaluation [23]-[25]. This, in turn, causes a ripple effect throughout publications that use the latter methods as baselines, inadvertently basing improvements on a false belief. We tried, to the best of our knowledge, to factor this into the explanation of individual recent approaches, so as to provide an intellectually honest representation of the recommendation landscape. The ideas and research directions taken by different fellow researchers are still important to study, but a careful examination of the baselines is necessary before declaring new approaches as state-of-the-art. We further discuss some of the issues at the root of these controversies in Section V-E.

2) COMMON ABBREVIATIONS
For the sake of clarity, Table 2 summarizes common technical acronyms used in tables throughout this section and in the rest of the survey. Whenever appropriate, the abbreviations will be explained in the text.

A. MATRIX FACTORIZATION-BASED METHODS
Ever since Funk's MF [56] achieved third place in the Netflix Prize challenge [54], many LFMs based on Matrix Factorization (MF) principles have been proposed. SVD++ [57] is a notably popular example, extending the previous algorithm by creating an integrated model that allows for the benefits of neighborhood models (e.g., explainability), as well as allowing the use of implicit feedback in place of explicit item ratings. The reasoning behind this choice is that if a user rates an item, that is in itself an indication of preference. Nonnegative Matrix Factorization (NMF) [120] has also long been used as a powerful tool able to identify meaningful substructures underlying the data. In particular, variants have been successfully applied in diverse fields and extended to analyze multiple matrices jointly [121].
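To make the MF principle concrete, the sketch below learns user and item latent factors by stochastic gradient descent on observed ratings, with the predicted rating given by the dot product of the two factor vectors. It is a simplified illustration in the spirit of Funk-style MF, not a reproduction of [56]: bias terms are omitted, and the hyperparameters (k, lr, reg, epochs) and the toy ratings are assumptions made for this example.

```python
import math
import random


def train_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Learn latent factors P (users) and Q (items) with SGD on squared error."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating: dot product of user and item factors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def rmse(P, Q, ratings):
    """Root mean squared error over the given (user, item, rating) triples."""
    return math.sqrt(sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings))


# Toy explicit ratings as (user, item, rating) triples (hypothetical data).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P0, Q0 = train_mf(ratings, 3, 3, epochs=0)   # untrained factors
P, Q = train_mf(ratings, 3, 3)               # trained factors
rmse_before, rmse_after = rmse(P0, Q0, ratings), rmse(P, Q, ratings)
```

Unobserved user-item pairs can then be scored with `predict`, which is what makes the factorization usable for recommendation.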

1) RECENT DEVELOPMENTS IN MF
With the recent popularization of neural network-based approaches, some researchers have proposed MF methods that replace the original dot product between factorized matrices with learnable functions, most often feed-forward neural networks [104]. However, recent research concluded that these strategies are not trivial to fine-tune, and that the dot product should still be considered when developing MF methods, since it cannot be easily approximated using a feed-forward neural network [24]. Other approaches have proposed to consider feedback data as ordinal rather than binary; the Ordinal NMF (OrdNMF) [100] is a notable example, introducing an NMF approach that generalizes Poisson factorization and can be used with ordinal data, making it applicable to big sparse matrices of explicit ratings. On a different note, there is also great recent interest in creating extensions for MF techniques that focus on improving the interpretability and mathematical properties of the decomposed matrices [101]-[103].

2) HYBRIDIZED MF APPROACHES
As mentioned, purely CF methods do not natively support the incorporation of side information available from users and items. Content-enriched and context-aware methods, which the literature commonly regards as hybrid methods, have been studied extensively in recently proposed research. Within the spectrum of MF approaches, [122] proposes the usage of a Quantile Random Forest to model the effect of side information and combines it with MF in a Bayesian framework. In [123], Probabilistic Matrix Factorization (PMF) [124] is extended by integrating information derived from item descriptions, extracting a latent representation through a shallow CNN with max pooling.
Information derived from social structures has also been successfully applied to MF-based methods. For instance, [113] integrates information from social relations between users, with the underlying assumption that different kinds of relations should have different impact on the recommendation process. In a similar manner, the authors of [116] develop EnSocialMF, a model which derives social factors from social network data and uses them to influence recommendations. In particular, the algorithm attempts to fuse three factors, namely user trust relationships, user interest similarities, and item similarities, all within a PMF framework. Several works [117]-[119] propose to extend MF by considering temporal dynamics in user preference, as well as social factors and geo-spatial information. An example of a joint MF framework is proposed by the authors of [114] in the context of AIoT. They propose a hybridized RS to learn user similarity, API similarity and user-API relevance matrices using three MF models that are jointly trained. API invocation histories for each user are embedded through a Word2Vec model [125]. ER-MF [115] is a similar approach, using Doc2Vec [126] to obtain user and item representations and training two models to compute the final recommendation score based respectively on user-similarity and item-similarity. The final model is an ensemble of the previous two.

3) FACTORIZATION MACHINES
In Section III-C2, we discussed FMs as straightforward reformulations of the recommendation task through general-purpose linear predictors, able to work under huge sparsity. Recent studies have proposed FMs that are able to work with both categorical and arbitrary real-valued features [105], [106]. The authors of [107] propose the usage of product quantization to reduce memory usage (their model is based on [106]). In [110], Heterogeneous Information Networks and a hierarchical attention mechanism are explored to capture relationships between objects. Finally, while many existing FM-based methods adopt negative sampling for training efficiency, [111] proposes a non-sampling method that is relatively efficient compared to the selected baselines.
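For reference, the standard second-order FM scores a feature vector x as w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, where each feature i has a latent vector v_i. The sketch below computes this score both with the naive pairwise sum and with the well-known linear-time reformulation, and checks that they agree. The parameter values and the example feature vector (a one-hot user, a one-hot item, and one real-valued feature) are hypothetical.

```python
def fm_predict(x, w0, w, V):
    """Second-order FM score using the linear-time reformulation of the pairwise term."""
    n, k = len(x), len(V[0])
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))            # (sum_i v_if x_i)
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))  # sum_i (v_if x_i)^2
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score, computed with the explicit sum over feature pairs i < j."""
    n = len(x)
    out = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            out += dot * x[i] * x[j]
    return out


# Hypothetical parameters: 5 features, 2 latent dimensions.
w0 = 0.1
w = [0.2, -0.1, 0.05, 0.3, -0.2]
V = [[0.1, 0.4], [0.3, -0.2], [-0.5, 0.1], [0.2, 0.2], [0.6, -0.3]]
# Features: user one-hot (2 slots), item one-hot (2 slots), one real-valued feature.
x = [1.0, 0.0, 1.0, 0.0, 0.5]
score_fast = fm_predict(x, w0, w, V)
score_naive = fm_predict_naive(x, w0, w, V)
```

Because zero-valued features contribute nothing to either term, the score can be computed over the non-zero entries only, which is what allows FMs to work under high sparsity.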

B. FEED-FORWARD AND MULTILAYER PERCEPTRON-BASED METHODS
Feed-forward networks are some of the conceptually simplest and most widely explored types of neural networks. In particular, Multilayer Perceptrons (MLPs) have seen wide use throughout ML. Thanks to their flexibility, they have often been used as starting points or combined with other architectures, and have laid the foundation for some of the most influential earlier neural works on recommendation.
The term MLP is sometimes used ambiguously, usually referring to feed-forward networks with fully connected hidden layers. Since the vast majority of methods utilize fully connected structures, this section addresses them as MLPs directly. We thus review prominent methods based largely on feed-forward networks and MLPs, with various types of augmentations, most notably attention mechanisms (further detailed in Section IV-D and Appendix A-A).

1) POPULAR MLP APPROACHES
The influential work by [43] proposes a 2-step recommendation procedure to recommend YouTube videos. Side information about users (watch history, search keywords, demographic information) is embedded and passed through a feed-forward NN with ReLU activation to learn user and item representations. To train the classifier to discriminate between possibly huge numbers of items, a negative sampling strategy is used to generate multiclass probabilities over millions of candidate items, using a softmax to generate normalized probability scores. At serving time, the model is used to generate user and video embeddings, and an approximate nearest-neighbor algorithm can be used for low-latency constrained predictions. The Wide & Deep Learning framework [44], which has gained similar popularity, combines the advantages of memorization and generalization using two MLP-based branches. The ''wide'' component is a linear model that works with various combinations of manually-created features (responsible for memorization). The ''deep'' component, on the other hand, is a feed-forward neural network that converts sparse, high-dimensional categorical features into low-dimensional dense embeddings for all user and item features (responsible for generalization). The model attempts to learn nonlinear interactions through the combination of embeddings via neural networks rather than a dot product, a matter which still stands as a controversial topic. DeepFM [127] later expanded on this idea by introducing a framework that integrates FM and deep neural networks. Differently from Wide & Deep, the proposed model jointly learns both low- and high-order feature interactions without the need for handcrafting feature combinations. Both the wide and deep parts utilize the same input, enabling efficient training.

2) MLPs FOR HIGHER-ORDER FEATURES
Recent contributions mostly focus on strategies to embed side information so as to create more robust recommenders. The recent Deep Learning Recommendation Model (DLRM) [129] addresses the problem of using dense features in addition to categorical interaction features. While the latter are processed using an embedding table and projected in a dense feature space using an MLP, dense features are fed to a separate MLP to learn expressive and properly sized representations. The authors of AutoInt [133] propose a new model to learn expressive higher-order features through a self-attention mechanism. In the proposed method, both categorical and continuous features are first projected in a low-dimensional embedding space. Different nonlinear combinations of features are then extracted and their relevance is weighted using the self-attention mechanism.

3) MLPs AND COLD START
In order to address the cold start problem, researchers have experimented with meta-learning solutions, approaches that use previous learning experiences to train new models [134], [135]. The core idea of meta-learning algorithms is to learn a global representation (shared initialization parameters) for all users, which is then used to learn local, personalized parameters for individual users. The work by [134] improves this by utilizing memory matrices that can store task- and feature-specific memories. Also in the context of cold start solutions, [136] uses an MLP-based strategy to produce vector representations for warm (known) users using both interaction data and side information. To obtain approximate cold-user representations, they use averaged embeddings from a pool of warm users from the same geographical area and with similar age. Additionally, they leverage the interaction data from the same warm users during the registration day.

4) EFFICIENT MLPs FOR RECOMMENDATION
Several works adapt neural-based recommenders to satisfy specific resource constraints. The authors of [138] propose to jointly learn a tree-index and a deep neural model. The tree structure allows user representations to be retrieved efficiently, with logarithmic time complexity w.r.t. the corpus size. Jointly optimizing the tree-based retrieval problem with the deep recommendation model improves the overall recommendation accuracy. In [139] and [140], model compression methods are explored. The first proposes a unified framework to jointly optimize a network compression task as well as feature extraction from input interactions. The latter defines a new memory-efficient feature projection technique that relies on several smaller embedding tables to dynamically generate unique embeddings for every user. The Deep Hash Embedding (DHE) [141] framework replaces embedding tables with deep NNs that compute embeddings on the fly, utilizing multiple hash functions to generate unique identifiers for every feature value. This work aims to reduce memory requirements imposed by the usage of embedding tables.
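The idea of computing embeddings on the fly from multiple hash functions can be sketched as follows. This is a loose illustration of the general principle rather than the DHE architecture of [141]: the salted SHA-256 hashes, the scaling to [-1, 1], and the tiny fixed-weight two-layer network are all assumptions made for this example (in practice the network weights would be learned).

```python
import hashlib
import random


def hash_encode(value, n_hashes=8, buckets=10**6):
    """Encode a feature value as n_hashes real numbers in [-1, 1], one per salted hash."""
    codes = []
    for seed in range(n_hashes):
        digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big") % buckets
        codes.append(2.0 * h / (buckets - 1) - 1.0)
    return codes


def make_mlp(n_in, n_hidden, n_out, seed=0):
    """Create fixed random weights for a two-layer network (stand-in for learned weights)."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    return W1, W2


def embed(value, mlp):
    """Compute an embedding on the fly: hash encoding -> ReLU layer -> linear layer."""
    W1, W2 = mlp
    x = hash_encode(value, n_hashes=len(W1[0]))
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


mlp = make_mlp(n_in=8, n_hidden=16, n_out=4)
emb_a = embed("user_1", mlp)
emb_a_again = embed("user_1", mlp)
```

The memory footprint is determined by the network weights rather than by the number of distinct feature values, since no per-value row is stored.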

C. CONVOLUTIONAL NEURAL NETWORK-BASED METHODS
CNNs have been thoroughly explored as an efficient and effective way to extract latent representations from various types of media. Though popularized in the context of CV, they are not only applicable to RSs that involve visual content but also to other types of data, such as textual and temporal information.

1) CNNs FOR TEXTUAL INFORMATION
As it is common to encounter bodies of text (such as reviews, descriptions, and news articles) in various recommendation scenarios, a wide range of new approaches propose to utilize CNNs to learn contextual representations efficiently. For instance, [146] proposes the Neural news recommendation model with Personalized Attention (NPA), which uses a CNN to learn the hidden representations of news articles based on their titles. Furthermore, two personalized attention mechanisms are introduced, at the word- and article-level respectively. This is intended to model how different users might perceive the same words in a title differently (with similar reasoning being applied to whole articles). Similarly, the Neural Recommendation with Personalized Attention (NRPA) proposed by [147] applies a hierarchical personalized attention mechanism to generate both user and item representations, considering textual user reviews of items as additional information. CNNs are used to extract semantic features of text reviews, while attention mechanisms are applied hierarchically over words and entire reviews. [150] proposes a neural news recommendation approach with long- and short-term user representations (LSTUR), which combines CNNs with GRUs to capture both long- and short-term dependencies between interactions. In a similar vein to previous approaches, news articles are encoded by passing their titles' embeddings through a CNN and an attention layer. Topics and sub-topics (represented by tags) are also projected to embeddings and concatenated with this representation. GRU networks are utilized to learn short-term user representations from their recently browsed news, which are combined with long-term representations (based on user embeddings) through either initialization of the GRU hidden states or by concatenation.

2) CNNs FOR SEQUENTIAL RECOMMENDATION
Multiple approaches embed the sequences (or sessions) of a user's interaction into a 2-dimensional latent matrix and treat them as an image. For instance, [149] proposes a Recurrent CNN model (RCNN), mixing LSTM and CNN networks to capture both long- and short-range user preferences from the user's interaction sequence. Recent hidden states of the recurrent layers are regarded as the ''image'', which convolutional filters scan for local sequential features. The authors of [144] propose Weave&Rec, a 3D CNN applied on word embeddings extracted from news articles. This approach aims to learn ''spatial'' features (i.e., content of the article) as well as temporal features (across different articles, seen as a sequence). Test articles are instead passed through a 2D CNN, also working on word embeddings, and the interaction between a user and the item is obtained through element-wise product of the 3D (user) and 2D (item) CNN outputs. The usage of 3D CNNs had also been explored by [145], which addresses session-based recommendation. In their approach, content features are modeled with character-level encoding to avoid expensive feature engineering steps.
The Convolutional Sequence Embedding Recommendation Model (Caser) proposed by [151] represents users as an L × d image where L is the length of the interaction sequence and d is the embedding dimension (embeddings are learned throughout the training process). Sequential patterns are regarded as local features and extracted through 2D convolutions, while vertical convolutions (i.e., filter size L × 1) are used to capture point-level sequential patterns across item representations. The authors of [152] address the issue of generative models in modeling long-range dependencies in item sequences, directly showcasing some limitations within the Caser model. Their model, named NextItNet, utilizes a stack of dilated 1D convolutional layers, as well as residual blocks to enable the training of deeper networks. Inspired by previous methods, [153] defines a general framework for training encoder-decoder recommenders named Gap-filling based Recommender (GRec). The authors showcase a CNN-based encoder-decoder architecture, where the two parts are jointly trained with a gap-filling mechanism (inspired by NLP's masked language modeling [155]), so as to introduce bidirectionality without data leakage. Similar to NextItNet, both the encoder and the decoder use stacked 1D dilated convolutional layers with skip connections.
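To illustrate why stacking dilated 1D convolutions helps with long-range dependencies, the sketch below applies a small stack of dilated convolutions to a scalar sequence and tracks how far a single input propagates. It is a simplified illustration and not the NextItNet implementation: it operates on scalars rather than item embeddings, omits residual blocks, and assumes left (causal) zero padding so that an output position never depends on later inputs.

```python
def dilated_causal_conv1d(seq, kernel, dilation):
    """1D convolution where kernel taps are `dilation` steps apart, looking only backwards."""
    K = len(kernel)
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j in range(K):
            idx = t - (K - 1 - j) * dilation  # position read by tap j
            if idx >= 0:                      # implicit zero padding on the left
                acc += kernel[j] * seq[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input positions that can influence one output of the stacked layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


# A single impulse at position 0 shows how far information propagates.
signal = [1.0] + [0.0] * 9
dilations = [1, 2, 4]
hidden = signal
for d in dilations:
    hidden = dilated_causal_conv1d(hidden, kernel=[1.0, 1.0], dilation=d)
```

With a kernel of size 2 and dilations 1, 2, 4, three layers already cover 8 past positions, whereas three undilated layers with the same kernel would cover only 4.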

3) OTHER CNN-BASED APPROACHES
Lastly, we introduce a few interesting CNN-based approaches that do not fall within other categories. In [148], the authors aim to explicitly model feature interactions of arbitrary order, deemed particularly important to express context-aware semantics. They propose a Multi-Branch Convolutional Network (MBCN) with three specialized branches. The first branch is a standard 1D convolutional layer that learns feature correlations in a vector-wise manner. The second branch is a dilated convolutional layer that was added with the idea of generating interactions among features in non-neighboring positions. The last layer models user, item, and context bias (e.g., a user that tends to give mostly positive ratings) for better recommendation.
In [156], a framework to bridge content- and collaborative-based representations is proposed. Textual information is utilized (though extensions to other types of metadata are proposed) to extract representations for completely cold items, i.e., with no prior interactions. The resulting Content Based to Collaborative Filtering (CB2CF) model learns a mapping from the word embeddings of item descriptions (the ''CB'' representation) to a representation learned through BPR [35] (the ''CF'' representation). This multi-view mapping is learned with a CNN, though the authors claim that both this architecture and the BPR model can be replaced with similar approaches, focusing on the connection between representations.
The independence assumption between items is challenged in [154], which focuses on the importance of item-item relationships in a CF problem. The authors propose the Co-occurrence pattern combined with CNN (CoCNN); this model is based on the assumption that the more two items appear together in users' interaction histories and are co-rated (i.e., similarly rated) by similar users, the closer their representations should be. The CNN learns representations from the co-occurrence matrix, directly applied to the embeddings. A different CNN model is used to learn pointwise user-item affinity and is jointly optimized with the previously described model.

D. RECURRENT NEURAL NETWORK-BASED METHODS
Recurrent Neural Networks (RNNs), by virtue of their intrinsic advantages in modeling sequential dependencies, are a strong candidate whenever dealing with interactions organized in sequences or sessions. Many recent approaches use more sophisticated recurrent units within their architecture, most popularly implementing a gating mechanism such as Long Short-Term Memory (LSTM) units [170] and Gated Recurrent Units (GRU) [171], such as to solve the various challenges faced by vanilla RNNs (e.g., the vanishing gradient problem). Furthermore, recent years have seen a dramatic increase in the proposal of approaches utilizing the attention mechanism [172], which was indeed first popularized in the context of recurrent networks. This enhancement can be summarized as a weighting strategy for different numerical components; a more detailed explanation is provided in Appendix A-A.

1) RNNs FOR ANONYMOUS SESSION-BASED RECOMMENDATION
Many candidate solutions for session-based recommendation are based on RNNs, a large portion of which deal with anonymous users (i.e., no historical information other than the current session is available). As a first example, [173] introduces the Neural Attentive Recommendation Machine (NARM), proposing an item-level attention mechanism to encode the user's global purpose in a session. The model presents itself as a neural encoder-decoder, where two encoders based on GRU layers encode global and local signals. An attention mechanism is applied to the hidden representations of each time-step t to emphasize it or ignore it. A collaborative framework is introduced in [159], which focuses on exploiting neighboring (but also anonymous) sessions. Two neural-based modules are applied in parallel: an ''Inner Memory Encoder'', that follows the NARM architecture, as well as an ''Outer Memory Encoder''. The first is composed of two sub-modules: one to capture global behavior from the user interaction sequence, the other to pay attention to specific behaviors and linearly combine them into a summary of the user's main purpose in a session. On the other hand, the Outer Memory Encoder extracts knowledge from similar sessions, effectively integrating a CF approach in session-based recommendation. The information from the two memory encoders is selectively combined through a fusion gating mechanism, and the recommendation score is computed through a bi-linear layer. In [163], a multitask learning approach is proposed, incorporating keywords from product titles as soft supervision signals. Such signals are used in a keyword-generation module, which extracts the intent from the session and integrates it in the final prediction, improving performance as well as explainability. A Transformer module is used for keyword generation, while the next-click predictor module is based on a recurrent framework that utilizes GRU layers. A bi-linear layer with softmax, as with the previous approach, is used to get the probability for each item. Keyword generation is integrated into the learning process by connecting the Transformer encoder to the item predictor. A different issue is tackled by TailNet [166], which addresses the long-tail problem in anonymous session-based recommendation. The authors propose a ''preference mechanism'' to learn to balance recommendations between popular and niche (i.e., within the long tail) items.

2) RNNs FOR SHORT-AND LONG-TERM MODELING
Some approaches instead tackle scenarios where user profiles are available and propose various ways to factor in long-term dependencies. The authors of [157], for instance, implement the Hierarchical Recurrent Network with metadata (HRNN-meta), which utilizes two different GRU models to learn intra- and inter-session representations. The idea of utilizing hierarchical recurrent architectures was first proposed by [95], which HRNN-meta builds on and extends by encoding time information as a learned embedding, allowing for more flexibility and efficiency. The authors integrate meta-data information by utilizing ''field-aware'' MLPs, allowing for multiple types of contextual data (other than time) to be integrated. The Sequential Deep Matching (SDM) [164] model also focuses on the evolving preferences of users by observing short- and long-term behaviors. In particular, we highlight the usage of multi-head attention on the output of an LSTM network to model the multiple interests of a user within the current session. A gated fusion module is utilized to merge global and local preference features. Intuitively, the latter combines a user representation with the short- and long-term representations, learning a gate vector that controls fusion behaviors in a similar fashion to LSTM gates. The authors recall a resemblance to attention-like models, though they argue this approach has more representational power. Similarly, the Hierarchical RNN model (Hi-RNN) [167] uses multiple GRU-based layers to represent both short- and long-term interactions, taking into consideration the time interval between inputs. Finally, the Streaming Session-based Recommendation Machine (SSRM) [165] incorporates an MF model into a GRU-based encoder model, intending to integrate collaborative information into a session-based model. They focus on a streaming session-based environment, in which they enhance the short-term representation captured by the RNN encoder with the historical long-term preferences captured by MF.
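The gated fusion idea described above for SDM can be sketched as an element-wise gate that interpolates between the short- and long-term representations. This is a minimal sketch under stated assumptions, not the exact formulation of [164]: the weight matrices W_u, W_s, W_l and the bias are hypothetical parameters that would be learned in practice, and the toy vectors are chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_fusion(user, short, long_, W_u, W_s, W_l, bias):
    """Blend short- and long-term vectors with a gate computed from all three inputs."""
    pre = [a + b + c + d for a, b, c, d in
           zip(matvec(W_u, user), matvec(W_s, short), matvec(W_l, long_), bias)]
    gate = [sigmoid(z) for z in pre]
    # Each dimension is a convex combination of the two representations.
    fused = [g * s + (1.0 - g) * l for g, s, l in zip(gate, short, long_)]
    return fused, gate


# Toy 2-dimensional representations and hypothetical (untrained) parameters.
user, short, long_ = [0.5, -0.5], [1.0, 0.0], [3.0, 2.0]
W_u = [[1.0, 0.0], [0.0, 1.0]]
W_s = [[0.5, 0.0], [0.0, 0.5]]
W_l = [[-0.5, 0.0], [0.0, -0.5]]
bias = [0.0, 0.0]
fused, gate = gated_fusion(user, short, long_, W_u, W_s, W_l, bias)
```

Since every gate value lies strictly between 0 and 1, each fused dimension stays between the corresponding short- and long-term values; with all weights set to zero the gate is 0.5 and the fusion reduces to a plain average.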

3) OTHER RNN-BASED APPROACHES
Lastly, we introduce some RNN-based methods which explore different research directions. The work by [160], for instance, explores cross-domain sequential recommendations to improve CTR accuracy. The proposed Dual Attentive Sequential Learning (DASL) learns cross-domain user representations using a dual embedding strategy, which extracts latent embeddings in both domains simultaneously through metric learning. The dual embeddings are then used to initialize a GRU layer, that updates its hidden state consuming the sequence of interacted items. The dual attention mechanism then matches the embeddings with candidate items to provide cross-domain recommendations, which are obtained through a final MLP block.
The Co-Attentive Multi-task Learning (CAML) model [161] tackles recommendation explainability through an encoder-selector-decoder architecture. An encoder network is used to obtain latent representations for users and items, utilizing what they call ''implicit factors'' (user embeddings) as well as words from item reviews. Then, a multi-pointer co-attention selector module is used to identify relevant features within reviews and concepts for both users and items. A multi-head decoder is used to generate predictions as well as a sentence explaining the recommendation in natural language. An FM is used for predictions, while a GRU-based module is used for sentence generation.
In [162], a deep LSTM-based model is proposed, meant to incorporate geographical and category information for next POI recommendation, such as to enrich sequential information. A personalized attention mechanism is used to weigh the importance of different time windows to improve recommendation accuracy.
Lastly, the authors of [158] propose to model the context of historical interactions more precisely, by factoring in ''what'', ''when'', and ''how'' the action took place. Most notably, they argue that session-based approaches could create a bottleneck in the way they aggregate data points in sessions, and hence distance their approach from such an assumption. Their three-step approach starts by applying self-attention to the input sequence, meant to capture item correlation and long-term dependencies (''what action''). The second stage is used to learn temporal influence between interactions and the current moment of recommendation, for which multiple kernel functions are proposed (''when''). The last stage is concerned with using the temporal scores and the item representations to understand the user purpose (''how'') in the session, tuning out noisy interactions that are probably less relevant and attributable to casual browsing; it is implemented with a bi-directional RNN to capture event contexts from the past and the future.

E. PURELY ATTENTION-BASED METHODS
As previously introduced, the attention mechanism has seen widespread use as an enhancement to various neural approaches. The authors of [70] introduced the Transformer, an architecture that has revolutionized the field of NLP and that crucially makes no use of recurrence, relying on attention as its main learning mechanism. RSs based on Transformers (or on the idea of relying largely on attention) have naturally become a popular approach to this task.

1) SELF-ATTENTION FOR SEQUENTIAL RECOMMENDATION
The application of self-attention modules, inspired by transformers, has been widely popular in recent proposals. A noteworthy example is the Self-Attention-based Sequential model (SASRec) [184], which attempts to capture long-term semantics in the interaction process while also being able to base the prediction on relatively few interactions. The attention mechanism seeks to identify relevant items within a user's history of interactions, basing the network's prediction on them. The Disentangled Self-Supervision (DSS) training strategy [188] aims to enhance SASRec's ability to capture multiple intentions. This approach utilizes self-supervision to reconstruct the sequence of future items as a whole (seq2seq), instead of individual items (seq2item). Moreover, the authors propose a disentanglement layer, which clusters intentions according to their distance to a set of prototypes. This is followed by an attention mechanism to encourage the model to learn user intentions over a number of latent categories.
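The way self-attention bases a prediction on the relevant items in a user's history can be sketched with a single attention head over a sequence of item embeddings. This is a simplified sketch and not the SASRec architecture: queries, keys and values are taken to be the raw embeddings (no learned projections, positional embeddings or feed-forward layers), and causal masking, so that each position attends only to itself and earlier items, is assumed as is common for next-item prediction.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(X):
    """Scaled dot-product self-attention where position t attends to positions 0..t only."""
    d = len(X[0])
    outputs, weights = [], []
    for t in range(len(X)):
        # Similarity of the item at position t with each item up to and including t.
        scores = [sum(a * b for a, b in zip(X[t], X[j])) / math.sqrt(d) for j in range(t + 1)]
        w = softmax(scores)
        # Output is the attention-weighted average of the attended item embeddings.
        outputs.append([sum(w[j] * X[j][f] for j in range(t + 1)) for f in range(d)])
        weights.append(w)
    return outputs, weights


# Toy sequence of 4 item embeddings with 2 dimensions (hypothetical values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = causal_self_attention(X)
```

The last row of `outputs` summarizes the whole history and would be the representation used to score candidate next items; the corresponding row of `weights` indicates which past items contributed most.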
In [185], it is argued that self-attention does not account for the time span between events, thus capturing sequential signals rather than patterns. They thus introduce various functional time feature mappings, from which they develop time embeddings compatible with self-attention. In a similar vein, [186] attempts to model both sequential behaviors as well as continuous timestamps (which measure a distance between those behaviors) with self-attention. They propose a self-modulating attention approach, which involves the re-weighting of attention coefficients according to the intensity function of temporal point processes, as well as continuous-time regularization to penalize the intensity of largely time-independent behavior data. The intuition is to adaptively and predictively re-weight past behaviors in their impact on the current score. In the same context but with a different approach, [174] proposes to tackle the sparsity of item-to-item transitions by examining the categories of items. They utilize self-attention to capture transition patterns within the same category (e.g., clothing, toys). A separate context encoder is used to predict the next interacted category, applying self-attention to interaction sessions. Finally, a collaborative module compares the users' category-specific preferences and integrates collaborative information based on users' similarities.
GeoSAN [181] also uses self-attention to model long-range dependencies, framing it in the context of sequential location recommendation. Here, the task is to predict the next location position based on the user trajectory and behaviors. The model is based on a Transformer architecture, with several modifications to handle geographical data. They also propose a new loss function based on importance sampling to obtain more informative negative samples. The Spatio-Temporal Attention Network (STAN) [182] improves over the previous work's performance by explicitly considering spatio-temporal information and the personalized item frequency (the number of times a user visits a location), using a bi-layer attention architecture.

2) OTHER ATTENTION-BASED APPROACHES
Lastly, we outline two notable Transformer-based frameworks. The Personalized Re-ranking Model (PRM) proposed in [190] is a modular component that can be stacked on top of existing recommendation approaches to perform a re-ranking of item candidate lists. It uses a Transformer structure to capture item-to-item influences and a personalized module to integrate user-level preferences. Also worth mentioning is the proposal by researchers at NVIDIA, who recently open-sourced the Transformers4Rec [191] library. Built upon the popular HuggingFace Transformers library [193], Transformers4Rec has the goal of encouraging the development of Transformer-based RSs, especially in sequential and session-based recommendation. The library includes various enhancements specific to the recommendation setting, and a general framework for training and evaluating different models on several built-in datasets with an incremental strategy.

F. GRAPH NEURAL NETWORK-BASED METHODS
GNNs have gained increasing popularity in recent years. Graphs have long been studied as particularly expressive structures, able to effectively capture dependencies and relationships between nodes [14], [219]; whenever a problem has an intuitive representation as a graph, approaches based on them may be able to reveal higher-order connectivity between its vertices. Many well-established approaches of popular neural network architectures have been generalized to arbitrarily structured graphs, most notably convolutions [220], [221], and have been shown to effectively propagate auxiliary information throughout the graph. There is also great interest in the application of GNNs to KGs, as we will showcase in this section. We refer to [220] for further details on graph convolutions.

1) GRAPH CONVOLUTIONAL NETWORKS
In recent years, works such as Neural Graph Collaborative Filtering (NGCF) [194] have paved the way for neural graph approaches in recommendation through the application of Graph Convolutional Networks (GCNs). The authors argue that earlier methods based on vectorial representations (i.e., embeddings), such as MF and other LFMs, can be lacking as they do not encode the collaborative signal expressed by interactions. The proposed bipartite graph structure allows the expressive modeling of high-order connectivity, which is injected in and propagated through the embedding process by utilizing an architecture akin to a standard GCN. LightGCN [203] simplifies the previous approach, yet obtains substantial improvements. The authors argue that, in the context of collaborative filtering, neighborhood aggregation is the most essential component of the GCN. The resulting network learns user and item embeddings by linear propagation on the user-item interaction graph, using a weighted sum of all layers' embeddings as the final embedding. The Self-supervised Graph Learning (SGL) paradigm [204] expands on the idea of LightGCN and explores the idea of self-supervision to supplement node representation learning via self-discrimination. In theory, this approach should mitigate bias, increase robustness to noise and encourage learning from hard negatives.
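To make the linear propagation described for LightGCN more concrete, the following Python sketch aggregates neighbor embeddings on a small bipartite user-item graph and averages the embeddings of all layers. It is a simplified illustration rather than the authors' implementation: the function names, the symmetric degree normalization, and the uniform layer weights are assumptions made for this example, and no training step is included.

```python
import math


def _aggregate(neighbors, source_emb, source_neighbors, dim):
    """Sum neighbor embeddings, scaled by 1 / sqrt(deg(node) * deg(neighbor))."""
    out = []
    for nbrs in neighbors:
        vec = [0.0] * dim
        for j in nbrs:
            weight = 1.0 / math.sqrt(len(nbrs) * len(source_neighbors[j]))
            for d in range(dim):
                vec[d] += weight * source_emb[j][d]
        out.append(vec)
    return out


def _mean_layers(layers, dim):
    """Uniformly weighted sum (mean) of the embeddings of all layers."""
    n_layers = len(layers)
    return [
        [sum(layer[row][d] for layer in layers) / n_layers for d in range(dim)]
        for row in range(len(layers[0]))
    ]


def lightgcn_propagate(interactions, user_emb, item_emb, num_layers=2):
    """Propagate embeddings linearly over the user-item interaction graph.

    interactions: list of (user_index, item_index) pairs.
    user_emb, item_emb: lists of embedding vectors (lists of floats).
    """
    dim = len(user_emb[0])
    user_nbrs = [[] for _ in user_emb]
    item_nbrs = [[] for _ in item_emb]
    for u, i in interactions:
        user_nbrs[u].append(i)
        item_nbrs[i].append(u)

    layers_u, layers_i = [user_emb], [item_emb]
    for _ in range(num_layers):
        cur_u, cur_i = layers_u[-1], layers_i[-1]
        # No feature transformation or non-linearity: aggregation only.
        layers_u.append(_aggregate(user_nbrs, cur_i, item_nbrs, dim))
        layers_i.append(_aggregate(item_nbrs, cur_u, user_nbrs, dim))
    return _mean_layers(layers_u, dim), _mean_layers(layers_i, dim)
```

A user-item score could then be obtained, for instance, as the inner product of the returned user and item vectors.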
The GCN-based PinSage [205] combines efficient random walks and graph convolutions to generate node (item) embeddings, so as to incorporate both graph structure and node feature information. It is particularly worth mentioning because of its architectural and training choices, which make the method viable on massive graphs with billions of nodes and edges. The work in [197] directly addresses the oversmoothing problem, in which node embeddings converge to a single set of values and become indistinguishable, resulting in poor performance. While the authors acknowledge that works such as LightGCN partially address this issue by simplifying the structure, they argue that it is still largely present, and propose a novel Interest-aware Message-Passing GCN (IMP-GCN), where convolutions are performed inside subgraphs. The subgraphs consist of users with similar interests (as well as their interacted items), which should avoid transmitting information between users with little in common. The subgraphs are generated by a dedicated model based on user features and graph structure information. By limiting the amount of ''negative'' information, the model is shown to be more resistant to the oversmoothing issue. In [202], interactions are diversified into multiple behavior types so as to counter the data sparsity and cold start issues. The authors integrate this concept into a GCN over a heterogeneous graph based on multiple types of behavioral data, arguing that GNNs are a strong candidate for learning the difficult semantics and impact of multiple types of behaviors.

2) GNNs AND KNOWLEDGE GRAPHS
As mentioned, KGs have been studied with increasing interest as effective solutions to sparsity and cold start problems. As a prime example, [208] utilizes interactions within KGs in order to break down the interaction independence assumption. This is achieved by exploiting the links between items and their attributes (which may then be connected to other items, acting as bridges). They propose a Knowledge Graph Attention network (KGAT), which explicitly models high-order connectivities in an end-to-end fashion. Embeddings for nodes (which may be users, items, or attributes) share information through recursive propagation, regulated by a discriminative attention mechanism that weighs the importance of neighbors. In [201] an end-to-end framework inspired by GCNs on a KG representation (KGCN) is proposed. The system is able to capture inter-item relatedness by mining their associated attributes in the KG, aggregating and incorporating neighborhood information with bias when calculating the representation of the items within the graph. An extended GNN architecture is proposed by [198], aimed at simultaneously capturing user preferences as well as relationships between items. The KG is transformed into a user-specific weighted graph to address the relational heterogeneity, which, in layman's terms, attempts to learn a scoring function to weight particular relationships for users. For instance, in a movie recommendation setting, some users might be more interested in a ''directed by'' relationship, while others in the ''lead actor'' relation. They also develop a regularization technique based on label smoothness to counter overfitting (the model is hence called KGNN-LS). The Collaborative Knowledge-aware Attentive Network (CKAN) [199] extends the previous two methods and describes a novel way to integrate KG information with latent collaborative signals. 
This is achieved through heterogeneous propagation (collaboration and KG) and a novel attentive embedding strategy to model different conditions affecting neighboring KG entities. The authors of the Knowledge Graph-based Intent Network (KGIN) [200] propose an attentive combination of KG relations to model the intents that lie behind a user-item interaction. A newly proposed information scheme for GNNs allows for the integration of such intent information within user and item representations. This framework also allows for interpretable results (through an understanding of the intent).

3) GNN FOR SESSION-BASED RECOMMENDERS
Graph-based approaches have also been used in context-aware environments, encoding sessions within a graph structure. The Target Attentive GNN (TAGNN) [97] investigates temporal transitions of items within a session. The authors argue that prior sequence-based approaches often compress sessions into a single fixed representation, failing to consider the target items to be predicted. By representing sessions as directed graphs and introducing a target-aware attention mechanism, their GNN architecture should instead be able to activate different user interests concerning varied target items. In [98], session-based recommendation is tackled in an environment where data is produced in an online manner (in ''streams''). The authors argue that previous online learning approaches do not model sequences adequately and may easily overfit new data, losing important historical information on long-term preferences. They propose to model sessions as session graphs, where user embeddings are treated as a global attribute for the graph (Global Attributed Graph, GAG for short), and perform graph convolutions to update such global attributes. They also develop a reservoir technique based on the Wasserstein distance, which they deem more effective in sampling streaming session data. The LESSR model (Lossless Edge-order preserving aggregation and Short-cut graph attention for Session-based Recommendation) [99] addresses two issues with previous graph representations of sessions. The first issue is that such representations are lossy, as multiple sessions could map to an identical graph structure. The authors argue for a directed multigraph representation whose information is aggregated in an edge-order preserving manner through a GRU module. The second issue is related to the propagation of long-term dependencies, which they address by introducing attention-based shortcut connections.

G. LEARNING-TO-RANK
As mentioned, it has been argued that approaches that try to directly predict ratings may be non-optimal [42], [229]. Ideally, solving a ranking problem should require the objective function to depend on the relative distances between candidates (preference or rank), rather than the absolute rating value, which should instead have little importance. However, as we mentioned, IR metrics that are often used in the context of recommendation evaluation cannot be easily used as optimization criteria due to their non-smooth nature [38]. In this section, we explore recent approaches that put a larger focus on the learning-to-rank side, often devising surrogate ranking loss functions in an attempt to bridge the gap between training and evaluation objectives.
Various influential works have been proposed in earlier years, devising proxy approaches to optimize ranking scores directly, most commonly the Normalized Discounted Cumulative Gain (NDCG) metric. COFIRANK [230] uses Maximum Margin Matrix Factorization to this end, while [36] crafts surrogate ranking losses in both a pointwise and pairwise scenario (proposing a heuristic approach for the latter's complexity). Other notable classes of algorithms that work towards this end were proposed by the authors of SoftRank [38] and LambdaRank [231].

1) PAIRWISE APPROACHES
Pairwise ranking approaches consider pairs of interactions rather than attempting to model the affinity between a single user and an item. It's worth noting that many pairwise approaches are based on the idea of Bayesian Personalized Ranking (BPR) [35], a general optimization criterion that tries to maximize the probability of binary comparison between an observed and an unobserved item, assuming the observed item will always be preferred. An example is that of DeepRank [227], which proposes a neural network model using the BPR loss for implicit feedback recommendation.
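The BPR criterion described above can be written for a single (user, observed item, unobserved item) triple as the negative log-sigmoid of the score difference. The sketch below assumes, purely for illustration, that scores are inner products of latent vectors (as in an MF model); the function names are hypothetical, and regularization and negative sampling are omitted.

```python
import math


def _dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def bpr_loss(user_vec, pos_item_vec, neg_item_vec):
    """Negative log-likelihood that the observed item outranks the unobserved one.

    Computes -log(sigmoid(s_pos - s_neg)) in a numerically stable way.
    """
    diff = _dot(user_vec, pos_item_vec) - _dot(user_vec, neg_item_vec)
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

The loss approaches zero as the observed item is scored increasingly higher than the unobserved one, and grows roughly linearly when the order is inverted.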
Earlier approaches such as RankBoost [232] have recently been revisited; as the name suggests, the original algorithm consists of the application of a boosting algorithm (an ensemble meta-learning algorithm widely used in classification) to the ranking framework. Effectively, this approach combines a collection of weak rankers into a single, more powerful ranking procedure. The original work proposed two pairwise ranking losses as optimization criteria; RankBoost+ [222] rectifies some issues related to the theoretical soundness of one of these approaches. Another example of a pairwise approach is given by JoVA [223], a VAE-based model which we discuss in Section IV-H1. Finally, we mention PushCR [226], an approach based on collaborative ranking (CR) that experiments with three convex loss functions for ranking to emphasize the top positions of the results list.

2) LISTWISE APPROACHES
While pairwise approaches have seen great advances, authors have argued against the fact that this class of algorithms implicitly assumes that the item comparisons (the pairs) are independent. The problem, however, remains hard to solve, because of the aforementioned non-smoothness of ranking functions, which makes them unsuitable as direct loss functions. Notable works such as ListNet [39] address this by projecting labels and scores onto the probability simplex, minimizing the cross-entropy between the resulting distributions. LambdaMART [233], on the other hand, dispenses with an explicit loss function entirely and formulates the gradients heuristically. While not recent, the latter approach is still considered to be among the best.
Due to its heuristic nature, LambdaMART's loss function is unknown, and it can only be assumed to be smooth, which makes theoretical analysis difficult. The work by [234] attempts to close this gap by defining a listwise ranking loss function based on cross-entropy. This modified cross-entropy loss is similar to ListNet's, and is proven to provide an upper bound over the NDCG in general IR settings, hence allowing NDCG-driven optimization for retrieval problems. The Stochastic Queuing Listwise Ranking (SQL-Rank) [40] is a listwise approach that applies probabilities to permutations of the set of interacted items for every user. This work, which extends the earlier ListNet, can handle both implicit and explicit feedback, as well as devising a graceful method to break ties through a stochastic shuffling process. The authors define a custom listwise loss for collaborative ranking, defined using the permutation probabilities, and highlight advantages over listwise methods that utilize the cross-entropy loss. The aforementioned DeepRank [227] also tests a listwise loss function, derived from the one used in ListRank-MF [228]. This method estimates the probability of an item being in the top position in a ranked list (i.e., top-one probability). The relation between users and items is modeled with the inner product, through an MF model. To introduce non-linearity in user and item representations, DeepRank replaces MF with an MLP with nonlinear activation functions. Unlike ListRank-MF, cross-entropy is used to optimize the top-k probability of items in the ranked list.
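As an illustration of the top-one probability idea mentioned above, the following sketch maps both the ground-truth labels and the predicted scores of a single list onto the probability simplex with a softmax, and then computes the cross-entropy between the two distributions. This is a generic sketch of a ListNet-style loss under these assumptions; the function names are hypothetical and it does not reproduce any specific implementation from the cited works.

```python
import math


def softmax(values):
    """Project a list of real values onto the probability simplex."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_one_cross_entropy(labels, scores):
    """Cross-entropy between top-one distributions of labels and scores."""
    target = softmax(labels)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))
```

Because the loss is a smooth function of the scores, it can be minimized with gradient-based methods, unlike the ranking metrics it stands in for.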

3) SETWISE APPROACHES
Some works have begun to incorporate setwise comparison in listwise approaches. While [224] praise the approach of SQL-Rank, they identify a weakness in the fact that only the upper bound, rather than the original negative log-likelihood, is optimized. To solve this, they propose SetRank, a setwise Bayesian approach for collaborative ranking that exploits set structures to better adapt to recommendation with implicit feedback data (in which ties are particularly difficult to break). Their preference structure assumes users always prefer observed items over the set of unobserved ones. Thus, there is no need to order unobserved items. They experiment with two different models, MF-SetRank and Deep-SetRank. The first one utilizes PMF, while the second one is based on the DeepMF method [235] (an earlier MLP-based approach). The authors of Set2SetRank [225] also explore ideas based on considering sets of items, proposing a model-agnostic framework which leverages both an item-to-set and a set-to-set comparison. The first is achieved by encouraging each observed item to be ranked higher than the set of unobserved ones. The second works on setwise distances, by assuming that the sum of distances between positive instances should be less than the distance between the set of observed items and the closest unobserved item (''hard negative''). Both utilize sampling approaches for the two sets.

H. OTHER METHODS
Lastly, we briefly mention two other classes of methods that have attracted much interest in recent years.

1) AUTOENCODER-BASED METHODS
Autoencoders are a type of encoder-decoder architecture in which the decoder maps back to the input space. This process forces the encoder to compress information and maintain the most important features in a low-dimensional space. Therefore, the task is to reconstruct the input with the least possible error. While generally unsupervised approaches, these architectures are often utilized with supervised learning methods to learn improved representations (embeddings) for raw input features, such as users and items in the context of RSs.
Different types of autoencoders exist, and we point to the comprehensive review from [7] for in-depth coverage. In our research, we found a rising interest in the application of a particular class of autoencoders, namely Variational Autoencoders (VAE), introduced in [239]. VAEs have a distinct probabilistic formulation, in which input samples are encoded as a probability distribution over the latent space factors, rather than a single value for each latent state attribute. This results in a representation of input data that resides in a smooth latent space.
In the aforementioned Joint Variational Autoencoder (JoVA) [223], two VAEs are assembled and jointly trained to understand user-user and item-item relationships with implicit feedback. One block reconstructs the rating matrix row-by-row (user representation), while the other reconstructs it column-by-column (item representation). The authors also propose a pairwise hinge-based loss function, to further specialize the method for top-n recommendation tasks. The Macro-micro Disentangled Variational Auto-Encoder (MacridVAE) [237] tackles the complex problem of entangled representations. Briefly, an entangled representation identifies latent factors that each map to more than one generative factor; in the context of recommendation, this can be roughly understood as the learned representation for interactions being related to many different facets of the users' decision-making process. The authors therefore explore the development of a more interpretable and robust disentangled representation, based on VAEs and an information-theoretic interpretation of such models to obtain macro (e.g., user intention) and micro (e.g., descriptive factors of the item being sought) disentanglement. The Bayesian Latent Organic Bandit model (BLOB) [236] shows how to combine ''bandit'' data, information that describes how the user reacted to a sequence of recommendations, with ''organic'' data, which are sequences of naturally occurring interactions. The proposed probabilistic algorithm makes use of both these data sources, integrating advantages of VAEs and bandit-based approaches.

2) CAPSULE NETWORK-BASED METHODS
Recent work explores the usage of Capsule Networks (CNs) to model dynamic user interests. The base unit in CNs is the capsule, which can be seen as a group of standard neurons (i.e., perceptrons). Unlike a perceptron, the output of a capsule is a vector instead of a scalar. An introduction to CNs is given in Appendix A-B, and we refer to [244] and [245] for further details on these architectures. In the RS domain, capsules attempt to model the reasonable assumption that each user is a composition of different intents and multi-domain interests that should be recognizable by looking at their interaction sequence [240]-[242], [246].
The authors of [240] use capsules to generate multiple interest embeddings for every user. The multi-interest layer receives average pooled item embeddings as well as user embeddings, and outputs a variable number of interest vectors generated through a Dynamic Routing (DR) approach. Then, scaled dot-product attention is used to compute the importance of user interests with respect to the target item. At serve time, the capsule module is used to generate user interests, and a nearest neighbor procedure is run to generate recommended candidates. In [241], a novel routing by bi-agreement algorithm is proposed, optimized for a binary sentiment analysis task over review texts. By using a self-attention mechanism over embedding and convolutional layers, the method also aims to provide insight on which expressions and aspects of user reviews are most decisive for the predicted sentiment. A general framework to extract multiple user representations is proposed by [242], so as to better capture a user's multiple interests. Both DR and self-attention mechanisms are used to generate these embeddings. The model is trained to predict the next interacted item in a sequential recommendation setting. At serve time, an approximate nearest neighbor search is used to find the top-n candidates for every user interest, and an aggregation module selects the best candidates for the user. The work by [243] leverages future user behavior, using DR to aggregate users' future behaviors into trend representations. An LSTM is used to compute sequence-aware user vectors. Then, a CF-inspired approach is used to select similar users and extract behavioral trends from them. A time-aware attention layer is applied to compute the future trend representation, which is concatenated with the user history embedding and used to predict the next item probability with a softmax operation.

3) NOTABLE MENTIONS
Lastly, we mention the existence of other noteworthy categories of methods, for which we however do not include a full section but rather point to other sources.
Reinforcement learning approaches have begun to garner attention in their deep learning variants, and the same can be said for adversarial network-based recommenders; [7] provides an overview of these methods. The multi-armed bandit is a reinforcement learning problem that exemplifies the exploitation-exploration dilemma. In the context of RSs, bandit-based algorithms [247], [248] have shown to be effective tools to promptly react to user feedback and trade-off between two goals: pleasing users by making safer bets based on historical behaviors (exploitation) and gaining knowledge about their tastes (exploration). The latter encourages showing more diverse recommendations in order to further improve user satisfaction in the long run. Reinforcement learning can also be used as an enhancement to other ML methods, as it is the case in the previously mentioned BLOB model [236]. An excellent resource on this topic is provided in [11].
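To illustrate the exploitation-exploration trade-off described above, the sketch below implements an epsilon-greedy policy, one of the simplest bandit strategies. It is chosen here only for illustration and is not the algorithm of [247], [248] or of BLOB [236]; the class name, the `epsilon` parameter, and the scalar reward (e.g., 1 for a click, 0 otherwise) are assumptions of this example.

```python
import random


class EpsilonGreedyBandit:
    """Each arm is a candidate item; rewards are observed user feedback."""

    def __init__(self, n_items, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_items    # times each item was recommended
        self.values = [0.0] * n_items  # running mean reward per item
        self.rng = random.Random(seed)

    def select(self):
        """Pick the item to recommend next."""
        if self.rng.random() < self.epsilon:
            # Exploration: try a random item to learn about user tastes.
            return self.rng.randrange(len(self.counts))
        # Exploitation: recommend the item with the best observed reward.
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, item, reward):
        """Incrementally update the mean reward of the recommended item."""
        self.counts[item] += 1
        self.values[item] += (reward - self.values[item]) / self.counts[item]
```

A larger `epsilon` yields more diverse recommendations at the cost of more frequent deviations from the currently best-known item.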
Counterfactual learning has recently attracted much interest as a strategy to learn more robust representations for users and items. For instance, CauseRec [249] is a sequential model that uses contrastive learning by modeling counterfactual data distributions. They focus on denoising user representation learning, intuitively considering the retrospective question ''how would the user representation change if we intervened on the observed (historical) behavior sequence?''. The ''counterfactual'' part lies in changing the behavior sequence to observe how the representation changes.
Recent approaches are also furthering the class of neighborhood methods, such as [250], which applies a k-NN model with item frequency data and temporal dynamics to a next-basket recommendation environment. Related to distance-based methods, the authors of [251] argue that factorization and neural models, though effective, violate the triangle inequality, losing valuable fine-grained preference information. They propose to approximate users and items with Gaussian distributions and use the Wasserstein distance as a distance (preference) function between users and items. The set-based model proposed in [252] (which we refer to as ''SetBased'') is a straightforward and explainable method, where every user is represented as a weighted bag of interests (tags). A conceptually simple probability model is used to estimate the likelihood of tags for each user, based on the set of personalized preferences as well as the item priors (i.e., the probability of an item being liked by any user).

V. EXPERIMENTAL FACTORS
A. DATASETS AND CONSIDERATIONS
In this section, we highlight several popular datasets and their statistics, as well as describing some considerations to be made whenever splitting a dataset for the recommendation task.

1) POPULAR DATASETS
We report in Table 12 some statistics of the most popular datasets used within the research works we reviewed. Along with the number of users, items, and interactions, the table further indicates:
• whether sessions are defined explicitly;
• whether interactions are in the form of explicit ratings (EX) or implicit feedback (IM);
• the availability of additional features for users (U), e.g., age; for items (I), e.g., title, description; or for the interaction itself (C), e.g., context and type of interaction;
• the domain of the dataset.
We noticed that in some cases, namely with datasets like Epinions and Foursquare, researchers often crawl the data themselves. Therefore, many different versions of these datasets exist, but not all of them are published and some may be custom-built for a particular work. In such cases, datasets listed in Table 12 describe the most common version that is publicly available.

2) APPLYING EVALUATION METRICS
In order to better understand the application of evaluation metrics, which will be discussed in Sections V-B and V-C, it is important to understand how datasets in this context are utilized and partitioned.
First and foremost, evaluation procedures are assumed to be applied in an ''offline'' scenario, i.e., on historical data. Data is also assumed to be split as is common in most ML scenarios, i.e., a training split utilized for model building, a validation one used for parameter tuning, and a testing set that is used exclusively for evaluation (Fig. 7). Typical approaches, such as hold-out and cross-validation, may be applied.
Validation and test portions of the data are not truly missing ratings but rather simulated through various holdout procedures. This assumption has been widely studied [2], and such evaluation items are often characterized as Missing Not At Random (MNAR) or subject to a selection bias [275], which can lead to possibly inaccurate evaluations. This is a lengthy topic with various complications, some of which are explored in the following sections. For now, we mention that common approaches include random splits, temporal splits (utilizing more recent ratings as test data), and pre-made, fixed splits. The approach we found to be most common is the temporal one, which is, however, not entirely devoid of issues, as it does assume a certain sequential behavior model in the data; regardless, it is usually considered a reasonable choice [2].

3) STRONG AND WEAK GENERALIZATION
Another consideration to be made about evaluation procedures is the choice between ''strong'' or ''weak'' generalization protocols [276]. As discussed previously, in order to evaluate a model's generalization abilities, users (or, in some contexts, anonymous sessions) should be divided into a training and testing set. Strong generalization refers to a split that ensures the model is tested against completely novel user profiles. However, not all methods (especially in the case of collaborative filtering) are designed to work with novel user profiles. Such approaches are tested on a weak generalization protocol, where the test set is comprised of interactions from users that have already been characterized by the model (such as in Fig. 7).
In datasets where the interaction timestamp is available, a chronological split (e.g., the first 80% of interactions for training, the last 20% for testing) is frequently used, though more traditional CF methods often prefer a random splitting strategy. The number of interactions reserved for testing is largely dataset- and method-specific. We found that session-based methods prefer to use one or a few target interactions [141], [188], while there is no clear preferred strategy among other methods.
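The two splitting strategies mentioned above can be sketched as follows. The example assumes interactions are stored as `(user, item, timestamp)` tuples and that the 80/20 cut is applied to the globally time-ordered log; both are assumptions of this sketch, since a per-user chronological cut is an equally plausible reading. The function names are hypothetical.

```python
from collections import defaultdict


def chronological_split(interactions, train_fraction=0.8):
    """Oldest interactions go to training, the most recent ones to testing."""
    ordered = sorted(interactions, key=lambda row: row[2])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def leave_last_out(interactions):
    """Hold out the single most recent interaction of every user as target."""
    by_user = defaultdict(list)
    for row in sorted(interactions, key=lambda r: r[2]):
        by_user[row[0]].append(row)
    train, test = [], []
    for rows in by_user.values():
        train.extend(rows[:-1])
        test.append(rows[-1])
    return train, test
```

The second function corresponds to the "one target interaction" setup; holding out a few interactions per user would only require slicing more than one element from the end of each user's history.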

B. ACCURACY METRICS
Here and in the following section we provide a description of the main evaluation metrics utilized in RSs. Note that, in discussing these metrics, it is common to use the terms ''relevant'' and ''irrelevant'' as an abstraction from the various types of interactions possible. Intuitively, a relevant item should be recommended (e.g., a positive implicit signal or a high explicit rating). Moving forward, we define all metrics for a generic user $u$, but in practice, reported metrics for a RS are always averaged over all users.
Though throughout this survey we have frequently mentioned various critiques of accuracy-based evaluation procedures, they are often still preferred because of their simplicity. This is particularly common in contexts such as CTR prediction or next item prediction. These metrics measure the error of a predicted rating w.r.t. the real rating, i.e., for a user $u$ and an unseen item $i$, $e_{ui} = \hat{r}_{ui} - r_{ui}$.

1) ROOT MEAN SQUARED ERROR
The Root Mean Squared Error (RMSE) is a metric commonly utilized in regression tasks, and is used to measure the difference between predicted and true values. A smaller RMSE indicates better performance, and the square-rooted version is usually preferred to the plain MSE, as its units are aligned with those of the ratings. Given a vector of predicted ratings $\hat{r}_u$ and the ground truth $r_u$, it may be defined as:
$$\mathrm{RMSE}(\hat{r}_u, r_u) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{r}_{ui} - r_{ui}\right)^2}$$
where $n$ is the number of test items (i.e., $n = |\hat{r}_u| = |r_u|$), $u$ is a user and $i$ is an item.

2) MEAN ABSOLUTE ERROR
The Mean Absolute Error (MAE) is another accuracy-based metric that is frequently used as an alternative. Notably, while RMSE tends to penalize large errors disproportionately (because of the squared term), MAE is more lenient in this regard:
$$\mathrm{MAE}(\hat{r}_u, r_u) = \frac{1}{n} \sum_{i=1}^{n} \left|\hat{r}_{ui} - r_{ui}\right|$$
MAE tends to better reflect accuracy when outliers have limited importance, while RMSE values the robustness of the prediction across various ratings more highly.
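The different sensitivity of the two metrics to large errors can be checked with a few lines of Python. The sketch below computes both metrics for a single user from two equally long lists of predicted and true ratings; the function names are chosen for this example only.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and true ratings."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def mae(predicted, actual):
    """Mean absolute error between predicted and true ratings."""
    n = len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / n
```

For instance, with predictions `[3, 5]` and true ratings `[3, 1]`, the single large error of 4 gives an MAE of 2, while the RMSE is the square root of 8 (about 2.83), reflecting the heavier penalty on large deviations.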

C. RETRIEVAL METRICS
Though comparably not as simple in terms of direct performance feedback, retrieval (or ranking) metrics based on information retrieval theory provide a more realistic perspective of the true usefulness of a RS. These metrics typically restrict the evaluation to the first $k$ items, and are hence commonly referred to as top-$k$ metrics. Given a catalog of $n$ items, consider a recommendation algorithm that produces a ranked list of such items. In order to make the formulations more digestible, we introduce the following notation: let $P = \{p_1, p_2, \ldots, p_n\}$ with $|P| = n$ be an ordered set of predicted items, generated by a scoring function, for a single user $u$ (which we omit in the notation for the sake of simplicity). Better scores imply a higher degree of relevance of the item for the user. $P$ is sorted in descending order with respect to the predicted scores (i.e., the first item is the best candidate), and $p_i$ indicates the item ranked at position $i$ (Fig. 8). Notably, the predicted score only matters for sorting purposes. Let $G$ with $|G| = m$ be the list of true relevant items for the same user. We call $I$ the set of all available items, both irrelevant and relevant. Whenever limiting such sets to the top $k$ elements, we will indicate the value as a parameter, e.g., $P(k) = \{p_i \in P \mid i \le k\}$. Also define $\mathbb{1}_G$ as an indicator function, formally defined as:
$$\mathbb{1}_G(p_i) = \begin{cases} 1 & \text{if } p_i \in G \\ 0 & \text{otherwise} \end{cases}$$

1) DISCOUNTED CUMULATIVE GAIN
The Discounted Cumulative Gain (DCG) is an overall measure of the usefulness (also called gain) of a list of retrieved items, weighted by how well the list is sorted. As mentioned, this is commonly restricted up to an arbitrary position $k \le n$. The relevance scores of individual items are summed, while a logarithmic discount factor is used to give more weight to higher positions and penalize lower ones. While different approaches exist, it is common to express a utility function $util$ as an exponential function of the relevance, so as to place a stronger emphasis on retrieving relevant items:
$$util(p_i) = 2^{rel(p_i)} - 1$$
where $rel(p_i)$ is the relevance of item $p_i$ (e.g., true rating/relevance of the item or a heuristic function thereof). For the sake of generality, we write $util(p_i)$ rather than specifying a particular utility function. Formally, DCG at $k$ can be understood as an inverse logarithmic reward on all positions $i$ that hold a relevant item:
$$\mathrm{DCG}@k(P) = \sum_{i=1}^{k} \frac{util(p_i)}{\log(i+1)}$$
The Normalized DCG (NDCG) further normalizes the score in the 0-1 range: the DCG score is divided by the ideal DCG score (IDCG@$k$), which is obtained by calculating the DCG on the ground truth of relevant items:
$$\mathrm{NDCG}@k(P) = \frac{\mathrm{DCG}@k(P)}{\mathrm{IDCG}@k}$$
NDCG is defined as standard when utilizing the inverse logarithmic decay (i.e., $\frac{1}{\log(i+1)}$). Note that the base of the logarithm is not important, as constant scaling will cancel out due to normalization [277].
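The computation of DCG and NDCG at a cutoff can be sketched as follows. The example assumes the commonly used exponential utility $2^{rel} - 1$ and a base-2 logarithm, and represents the ground truth as a dictionary from item to graded relevance, with missing items treated as having relevance 0; these choices and the function names are assumptions of this sketch.

```python
import math


def dcg_at_k(ranked_items, relevance, k):
    """Sum of exponential utilities, discounted by log2(position + 1)."""
    return sum(
        (2 ** relevance.get(item, 0) - 1) / math.log2(position + 1)
        for position, item in enumerate(ranked_items[:k], start=1)
    )


def ndcg_at_k(ranked_items, relevance, k):
    """DCG divided by the DCG of the ideal ordering of the relevant items."""
    ideal_order = sorted(relevance, key=relevance.get, reverse=True)
    idcg = dcg_at_k(ideal_order, relevance, k)
    if idcg == 0:
        return 0.0  # no relevant items: the score is defined as 0 here
    return dcg_at_k(ranked_items, relevance, k) / idcg
```

With binary relevance for items `a` and `b`, the ranking `["a", "x", "b"]` obtains a DCG@3 of 1.5, since the hit at position 3 is discounted by a factor of 2, whereas placing both relevant items first yields an NDCG@3 of 1.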

2) RECALL
The Recall at k is the fraction of the relevant items in G that are correctly recommended among the top-k scoring items of P:

$$\mathrm{Recall@}k(P) = \frac{|P(k) \cap G|}{|G|}$$

As a side note, if the total number of relevant items is greater than the cutoff value k (i.e., $|G| > k$), the value of this metric will be lower than 1 even for perfect rankings.

3) PRECISION
The Precision at k is the fraction of the top-k scoring items of P that are relevant:

$$\mathrm{Prec@}k(P) = \frac{|P(k) \cap G|}{k}$$

In this case, if $|G| > k$, multiple lists can achieve a perfect score as long as all of the top k items are relevant.
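The two set-based metrics can be sketched together, since they share the same numerator. The helper name and the toy data below are hypothetical; the sketch assumes a ranked list without duplicates and a non-empty set of relevant items.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Return (Precision@k, Recall@k) for one user."""
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / k, hits / len(relevant_items)

ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "b", "c"}                                   # |G| = 3
prec, rec = precision_recall_at_k(ranked, relevant, k=2)     # cutoff k < |G|
```

Here the ranking is perfect, yet only precision reaches 1: recall is capped at k/|G| = 2/3, which illustrates the side notes made for both metrics.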

4) AVERAGE PRECISION
The Average Precision at k (AP@k) is defined as the average of Precision@i over all positions i ≤ k that hold a true relevant item; the indicator function is used to enforce a value of 1 if the item at position i is truly relevant, and 0 otherwise. The average precision can also be interpreted as the area under the precision-recall curve. We note that, as this metric is most commonly used in its averaged (over users) form, AP@k is often used interchangeably with Mean Average Precision (MAP@k), since it is implied that it is only a useful statistic when the mean of AP@k over all users is taken.
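The following sketch shows one possible reading of this definition. It assumes that the sum of Precision@i is normalized by the number of relevant items found in the top k; other formulations divide by |G| or by min(k, |G|) instead, so this choice, like the function names, is an assumption.

```python
def average_precision_at_k(ranked_items, relevant_items, k):
    """AP@k: mean of Precision@i over positions i <= k that hold a relevant item."""
    relevant = set(relevant_items)
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:            # indicator function equals 1 at position i
            hits += 1
            precision_sum += hits / i   # Precision@i at this position
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings, ground_truths, k):
    """MAP@k: mean of AP@k over users (one ranking and one ground truth per user)."""
    aps = [average_precision_at_k(p, g, k) for p, g in zip(rankings, ground_truths)]
    return sum(aps) / len(aps)

example_ap = average_precision_at_k(["a", "x", "b", "y"], {"a", "b"}, k=4)
```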

5) F-SCORE
The F-score (or F-measure) combines the Precision and Recall scores in a single value, and their relative importance can be controlled with a β factor:

$$F_\beta(P) = (1 + \beta^2)\,\frac{\mathrm{Prec}(P) \cdot \mathrm{Recall}(P)}{(\beta^2 \cdot \mathrm{Prec}(P)) + \mathrm{Recall}(P)} \tag{13}$$

If β = 1, both terms are equally weighted, resulting in the harmonic mean of precision and recall (usually termed F1-score). This measure can be generalized to an F-measure@k using the previously defined Precision and Recall at k.
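A direct transcription of Eq. (13) is given below as a sketch. The function name and the convention of returning 0 when both precision and recall are 0 are assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score from precision and recall values, following Eq. (13)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0                       # assumed convention when both inputs are 0
    return (1 + beta ** 2) * precision * recall / denominator

f1 = f_beta(1.0, 2 / 3)                  # harmonic mean of 1 and 2/3
f2 = f_beta(1.0, 2 / 3, beta=2.0)        # beta > 1 gives more weight to recall
```

Passing Precision@k and Recall@k as inputs yields the F-measure@k mentioned above.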

6) RECEIVER OPERATING CHARACTERISTIC
The Receiver Operating Characteristic (ROC) is one of the possible approaches for evaluating the trade-off between the length of the recommendation list (k) and the percentage of relevant items retrieved. Note that the ROC evaluates a binary setting, and hence is best suited for implicit feedback environments. The ROC depends on two measures, namely the true-positive rate (TPR), which is the same as recall, and the false-positive rate (FPR, also called inverse recall), which measures the fraction of ground truth negatives (items not interacted with) incorrectly captured in the prediction:

$$\mathrm{FPR@}k(P) = \frac{|P(k) \setminus G|}{|I \setminus G|}$$

The latter can be seen as a "negative" recall. The ROC curve is obtained by plotting the FPR on the x-axis and the TPR on the y-axis for varying values of k.
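The construction of the curve can be sketched by sweeping k over a full ranking and recording one (FPR, TPR) point per cutoff. The sketch assumes that the ranked list covers the whole catalog I and that both relevant and irrelevant items exist; the function name is hypothetical.

```python
def roc_points(ranked_items, relevant_items):
    """Return the (FPR, TPR) points obtained for k = 0, 1, ..., n."""
    relevant = set(relevant_items)
    n_pos = len(relevant)
    n_neg = len(ranked_items) - n_pos    # |I \ G|, assuming the ranking covers I
    tp = fp = 0
    points = [(0.0, 0.0)]                # k = 0: nothing recommended yet
    for item in ranked_items:
        if item in relevant:
            tp += 1                      # one more relevant item retrieved
        else:
            fp += 1                      # one more negative incorrectly retrieved
        points.append((fp / n_neg, tp / n_pos))
    return points

example_points = roc_points(["a", "x", "b", "y"], {"a", "b"})
```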

7) AREA UNDER CURVE
The Area under the ROC Curve (simplified to AUC) measures the likelihood that a random relevant item is ranked higher (scored better) than a random irrelevant item [278]. It can be expressed in terms of utilities (i.e., the util function) or, alternatively, in terms of ranks (i.e., positions in the list), which requires the rank values to be sorted correctly [3].
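The pairwise interpretation can be sketched directly: count the fraction of (relevant, irrelevant) item pairs in which the relevant item receives the better score. Counting ties as half a win is a common convention that we assume here; the function name and the toy scores are hypothetical.

```python
def pairwise_auc(scores, relevant_items):
    """AUC as the fraction of (relevant, irrelevant) pairs that are ordered correctly.

    scores: dict mapping every item to its predicted score (higher is better).
    """
    positives = [s for item, s in scores.items() if item in relevant_items]
    negatives = [s for item, s in scores.items() if item not in relevant_items]
    # Ties count as half a correctly ordered pair (assumed convention).
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

example_auc = pairwise_auc({"a": 0.9, "x": 0.8, "b": 0.7, "y": 0.1}, {"a", "b"})
```

In the example, three of the four pairs are ordered correctly (only "b" is scored below "x"). The double loop is quadratic in the number of items, so a rank-based formulation would be preferable for large catalogs.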
Though AUC provides an objective and quantitative evaluation of the effectiveness of a particular method, and has many intuitive interpretations, this metric should be interpreted with care. One of its most notable weaknesses is that a method with a higher AUC is not necessarily strictly better than another, as the two ROC curves could cross (and, in practice, they often do) at different thresholds [279]. In that case, it is hard or impossible to determine which method dominates the other. Furthermore, it should be considered that the ROC treats higher- and lower-ranked items equally, and is thus unable to give greater importance to higher-ranked items [2].

8) HIT RATE
We make a brief mention of Hit Rate (HR), a metric that often appears with different definitions. In many cases, it is defined analogously to recall. Here, we describe it as measuring whether the prediction contains at least one relevant item in the top-k results. For a single prediction:

$$\mathrm{HR@}k(P) = \begin{cases} 1 & \text{if } |P(k) \cap G| > 0 \\ 0 & \text{otherwise} \end{cases}$$

Generally, the hit rate is more meaningful when averaged over users. The similarity with recall is evident: when exactly one relevant item exists for every user, this metric is equivalent to Recall@k.

9) MEAN RECIPROCAL RANK
The Mean Reciprocal Rank (MRR) is the average, over users, of the reciprocal rank of the first relevant item in each prediction. In other words, it measures, on average, where the first correct prediction lies. Assume the existence of a function rank, which returns the position of the first relevant item for a given prediction P:

$$\mathrm{RR}(P) = \frac{1}{\mathrm{rank}(P)}$$

If there is no relevant item, the reciprocal rank for that prediction is 0. Since ranking positions are explicitly taken into consideration, this metric emphasizes the order of the recommendations, whereas the hit rate only cares about the existence of a relevant item.
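The contrast between the hit rate and the reciprocal rank can be made concrete with a small sketch. The function names and the two toy users are hypothetical.

```python
def hit_at_k(ranked_items, relevant_items, k):
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if set(ranked_items[:k]) & set(relevant_items) else 0.0

def reciprocal_rank(ranked_items, relevant_items):
    """1 / position of the first relevant item, or 0.0 if none is present."""
    relevant = set(relevant_items)
    for position, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

# Two toy users: the relevant item is ranked second for the first user, first for the second.
rankings = [["x", "a", "y"], ["b", "x", "y"]]
ground_truths = [{"a"}, {"b"}]
hr_at_1 = sum(hit_at_k(p, g, 1) for p, g in zip(rankings, ground_truths)) / len(rankings)
hr_at_2 = sum(hit_at_k(p, g, 2) for p, g in zip(rankings, ground_truths)) / len(rankings)
mrr = sum(reciprocal_rank(p, g) for p, g in zip(rankings, ground_truths)) / len(rankings)
```

At k = 2 both users count as hits regardless of where the relevant item sits, whereas the mean reciprocal rank still distinguishes the two rankings.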

10) SUMMARY
There is no pre-defined "best" metric for evaluation, as each metric values different aspects of the final ranking differently. Accuracy-based metrics only care about the distance between the actual and the predicted score, without directly considering the actual ranking. For ranking metrics, NDCG is comparatively better than other approaches at distinguishing between higher- and lower-ranked items. In order to verify the trade-off between precision and recall for different values of k, F-scores, average precision, and AUC can give intuitive evaluation estimates. HR and MRR can also be useful, especially if the situation requires the predicted list to contain at least one relevant item (HR) or if it is particularly important for a relevant item to be present in the higher ranks (MRR).

D. SAMPLED METRICS
While discussing metrics, we referred to the generic set of all items, containing both observed and unobserved items. However, the total number of items available in a practical setting is often well above the hundreds of millions, which makes evaluation in real-world conditions challenging. Pointwise models, for example, would have to evaluate each user-item pair, resulting in substantial time requirements. Therefore, downsizing the set of items may be considered not only for training (with the aforementioned negative sampling approach) but also for the evaluation process, though this choice has a significant influence on the resulting metric values. With such considerations in mind, it comes as no surprise that many researchers use sampling strategies to speed up the evaluation process; since datasets are usually very sparse, some approaches sample a subset of unobserved items for every user. However, several studies have pointed out the difficulty of obtaining reliable performance results using metrics with sampling strategies [280]- [283]. Specifically, [280] demonstrates how sampled metrics are, in fact, not good indicators of model performance when compared to the same global, non-sampled metrics. As a consequence, it is not possible to reliably compare the performance of two methods using sampled metrics, even if the two adopt the same sampling strategy for evaluation. The authors also introduce corrected versions of popular IR metrics that account for the sampling bias at the cost of higher variance. They point out that obtaining statistically significant results is still challenging, and requires at the very least the execution of several evaluation runs to reduce variance. However, they conclude that the only way to remove sampling bias is to avoid sampling altogether.
A recent work [282] studies the impact of sampling on the recall@k measure, which is frequently used in implicit feedback recommendation settings. The authors demonstrate how this metric paired with sampling can be used to approximate the global metric, hence providing a more reliable measure. In another work, [284] proposes new methods to estimate the true unbiased rank distribution with approaches based on Maximum Likelihood Estimation and Maximum Entropy. However, it is still unclear how many samples should be used for a reliable evaluation.
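To illustrate why sampled and global metrics diverge, the toy sketch below compares the rank of a target item against the whole catalog with its rank against 100 uniformly sampled negatives. The catalog size, the random scores, and all names are assumptions made for illustration only; this is not the procedure or the corrected estimators of [280], [282], or [284].

```python
import random

def rank_of_target(scores, target, negatives):
    """1-based rank of the target among target + negatives (higher score is better)."""
    return 1 + sum(scores[j] > scores[target] for j in negatives)

rng = random.Random(0)                      # fixed seed for repeatability
n_items = 10_000
scores = {i: rng.random() for i in range(n_items)}   # stand-in for model scores
target = 0
all_negatives = [i for i in range(n_items) if i != target]
sampled_negatives = rng.sample(all_negatives, 100)

full_rank = rank_of_target(scores, target, all_negatives)
sampled_rank = rank_of_target(scores, target, sampled_negatives)
```

Because the sampled candidates are a subset of all negatives, the sampled rank can never be worse than the global rank, so a top-k metric computed on the sample is systematically optimistic for the same cutoff k.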

E. EVALUATION IN RECENT WORKS
To provide practical insight into the discussion on evaluation procedures, this section presents our findings on the usage of different evaluation procedures in recent methods as applied to three popular datasets, with a foreword on reproducibility issues. We briefly introduce the most relevant preprocessing choices and evaluation strategies they describe, referring to the published code used for experiments when available. We selected methods from Tables 13 and 14 as applied to two large review datasets (Netflix, ML-20M) and a common dataset for POI recommendation (Gowalla).
Foreword: The Issue of Reproducibility: In theory, research works that introduce new methods and deem them to be at a state-of-the-art level should clearly describe all the relevant details to make it possible for other researchers to validate their claim. Reproducibility of results should be ensured by publishing the training and evaluation code and, if possible, relevant data splits. Alternatively, authors ought to give precise instructions on how to generate and preprocess the data [26].
However, several studies, such as the ones in [23], [26], show that this is not always the case. Moreover, retrieval top-k metrics are often used with different parameters (i.e., different values of k). In many situations, we found it impossible to make a solid comparison between methods by looking at the reported performance metrics, even when they were reported on the same datasets. The main obstacles were different dataset splits or a lack of detail on how data were preprocessed and adapted to different tasks. We occasionally found it impossible to determine whether reported metrics had been computed using comparable data splits or whether they relied on sampling strategies, which would prevent direct comparison. As pointed out in [283], [311], the strategy selected to split data into training and testing sets (and possibly to generate sessions) from users' historical behavior can have a considerable impact on the measured performance. Hence, methods using different data splits, even when created from the same datasets, cannot always be compared reliably without repeating tests on the same preprocessed data.
Another point of contention is the conversion of datasets with explicit ratings into an implicit feedback setting, most often by interpreting higher ratings as a positive signal (e.g., applying a threshold such as r ≥ 4). This practice is widespread, seemingly disregarding the fact that explicit ratings are a much stronger preference indicator than implicitly gathered signals, which are by nature ambiguous and thus weaker. As an example, consider a movie RS utilizing the previously described procedure. While a rating above a high threshold (e.g., 4/5) describes a movie the user liked, an implicit signal only reveals that the movie has been watched or interacted with in specific ways. Regarding this practice, [23] points out that there seems to be no rationale for the choice of the threshold beyond the fact that others used it before. Metrics and datasets seem to be conveniently chosen and paired with inadvertently weak baselines, giving the impression of improving a few performance metrics, despite several works warning that there may not be a direct correlation between accuracy and improved recommendations [23], [312], [313]. While we understand that many of these works are custom-tailored to solve domain-specific problems and may well be worthy of attention, our goal is to show how slight variations in the problem formulation, data processing, and metrics of choice create a very fragmented landscape that lacks established reference benchmarking strategies [25], [283].

1) THE NETFLIX PRIZE DATASET
In our search of recent contributions from top conferences, four works using the Netflix Prize dataset [54] were selected. This dataset comprises about 100 million explicit rating values assigned to 17.700 movies by more than 450.000 users. All of the studied methods binarize the dataset to emulate implicit feedback by considering movies with rating ≥ 4 as observed interactions and evaluate the models with retrieval metrics.
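The binarization step mentioned above amounts to a simple filter. A minimal sketch follows, with a hypothetical function name and toy triples, and with the threshold exposed as a parameter since its value varies across works.

```python
def binarize_ratings(ratings, threshold=4.0):
    """Keep (user, item) pairs whose explicit rating is at least the threshold.

    ratings: iterable of (user, item, rating) triples.
    Pairs below the threshold are treated as unobserved, not as negatives.
    """
    return {(user, item) for user, item, rating in ratings if rating >= threshold}

toy_ratings = [("u1", "m1", 5.0), ("u1", "m2", 3.0), ("u2", "m1", 4.0), ("u2", "m3", 1.5)]
observed = binarize_ratings(toy_ratings)
```

Note that lowering the threshold to 3 would change the set of observed interactions, which is one reason why results obtained with different thresholds are hard to compare.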
We start by considering the following two methods. The Embarrassingly Shallow Autoencoder (EASE) [238] is a linear model geared towards sparse data, for which the authors report better ranking accuracy than state-of-the-art and deep models. The authors of MacridVAE [237] instead use VAEs to capture disentangled user representations. Both methods use the same preprocessing steps to extract implicit feedback from the explicit ratings. They also state that they follow the exact same splitting strategy, where 40.000 users are held out for evaluation and the rest are used for training; this is therefore a strong generalization protocol. Once the model is trained, it is given 80% of the click history of each held-out user, and the remaining 20% is used as the target. By inspecting the code available for MacridVAE, we found that all unobserved items are considered for evaluation (i.e., no negative sampling is used in evaluation). Both methods report results using NDCG@100 and Recall@20/50. Since these methods are evaluated under the same settings, their results can indeed be directly compared.
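A strong generalization split of this kind can be sketched as follows. This is not the code of EASE or MacridVAE: the function name, the random 80/20 partition of each held-out history, and the seed handling are assumptions.

```python
import random

def strong_generalization_split(user_items, n_heldout, fold_in_frac=0.8, seed=0):
    """Hold out whole users; split each held-out history into fold-in and target parts.

    user_items: dict mapping each user to a list of interacted items.
    Returns (train, test), where test[user] = (fold_in_items, target_items).
    """
    rng = random.Random(seed)
    users = sorted(user_items)
    rng.shuffle(users)
    heldout, train_users = users[:n_heldout], users[n_heldout:]
    train = {u: list(user_items[u]) for u in train_users}
    test = {}
    for u in heldout:
        items = list(user_items[u])
        rng.shuffle(items)
        cut = int(len(items) * fold_in_frac)
        test[u] = (items[:cut], items[cut:])   # model input, evaluation targets
    return train, test

toy_histories = {f"u{i}": [f"m{j}" for j in range(10)] for i in range(5)}
train, test = strong_generalization_split(toy_histories, n_heldout=2)
```

The defining property is that no held-out user contributes any interaction to training; under weak generalization, by contrast, every user appears in both sets.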
The authors of JoVA [223] report results using NDCG@1/5/10, but a random sample of 70.000 users is kept for training. Moreover, after inspecting the train/test splits generated from the ML-1M dataset (since the Netflix dataset splits are not shared in their repository), we can only assume that this model has been similarly evaluated on a weak generalization task, where interactions are randomly split and results are reported over 10% of each user's observed items. No negative sampling is used during evaluation. The last method we analyzed, the Deep Generative Ranking (DGR) [285] model, does not share its implementation, and little detail on the splitting strategy is provided. However, their conference paper seems to suggest that their evaluation on the Netflix dataset is conducted using all the available users, hence using a weak generalization framework. They also perform a separate evaluation on users with fewer than five ratings, but only on other datasets.

2) THE MOVIELENS20M DATASET
The second dataset we analyze is the popular Movielens dataset, specifically in its 20-million-interaction form. This dataset consists of about 20 million explicit movie ratings on a 1-5 scale (with half-points) given by 138.493 users to 27.278 items [255].
The DHE and DSS models [141], [188] both use the same splitting strategy, but do not publicly share the implementation of their experiments. In both cases, all user ratings (regardless of the rating value) are considered observed interactions and are sorted by timestamp. Then, the last two interactions are placed in the validation and test sets, respectively, and the rest are used for training. The authors of DSS further specify that their evaluation procedure relies on a negative sampling strategy: 100 items a user has never interacted with are randomly sampled according to their popularity and added to the test set. DHE results are reported using AUC, while DSS uses Recall@k, NDCG@k, and MRR. M2GRL [211] works with sessions of movie ratings, created by splitting a user's rating sequence whenever two consecutive ratings are more than one year apart. Additionally, sessions with more than 50 ratings are divided into two shorter ones. Ratings lower than 3 are deleted from the dataset, so this method uses a different threshold value than the previous ones. The MetaHIN [135] model does not share the code used for its experiments, and we were not able to clearly understand the adopted strategy. Results are reported using MAE, RMSE, and NDCG@5.
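The timestamp-based split described for DHE and DSS can be sketched for a single user as shown below. Since neither work publishes its code, the function name and the handling of short histories are assumptions.

```python
def leave_last_two_out(interactions):
    """Split one user's history into (train, validation_item, test_item).

    interactions: list of (item, timestamp) pairs; needs at least 3 entries.
    """
    if len(interactions) < 3:
        raise ValueError("at least three interactions are required")
    ordered = [item for item, _ in sorted(interactions, key=lambda pair: pair[1])]
    # Second-to-last interaction -> validation, last interaction -> test.
    return ordered[:-2], ordered[-2], ordered[-1]

history = [("m3", 30), ("m1", 10), ("m4", 40), ("m2", 20)]
train_items, validation_item, test_item = leave_last_two_out(history)
```

Each user therefore contributes exactly one test target, which is why such protocols measure the ability to recommend a single relevant item.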
The graph-based approaches KGCN [201], KGNN-LS [198] and CKAN [199] all propose to enhance recommendation using knowledge bases. The dataset is binarized using 4 as a threshold value for ratings that should be considered observed interactions, resulting in about 13.5 million interactions. During training, an equal number of unobserved interactions is randomly sampled. For evaluation, 40% of the total interactions are reserved and equally split between validation and testing, and the rest are used for training. The three models adopt a weak generalization scheme with no negative sampling in the evaluation process. Results of these methods can be compared on the reported AUC score for the CTR task.
The SetBased model proposed by [252] uses a strong generalization protocol, with a restricted dataset of about 17 million interactions and 5.800 items. To evaluate the binary relevance of predicted items, 1000 users are held-out and 20% of their ratings are used for testing. Again, a predicted rating of at least 4 is considered the minimum threshold for relevant items. MAP, NDCG, and MRR are reported, considering the list of the top-100 items. In the Ordinal NMF [100] approach, only users and movies with more than 20 interactions are kept, resulting in a smaller dataset of 20.000 users and 12.000 movies. A weak generalization scheme is also used here, as it commonly is for MF methods, without negative sampling in evaluation. Results for this method are reported using NDCG@100, but the task is framed as prediction of explicit ratings in the original scale, so it is hard to make a fair comparison with the others. EASE and MacridVAE follow the same evaluation protocol described previously for the Netflix dataset, with 10.000 held-out testing users.

3) THE GOWALLA DATASET
The Gowalla dataset [264] is a popular dataset of check-ins, where the users' friendship network is made available in the form of a graph with 190.000 nodes and 950.000 edges.
Three related graph-based methods, namely NGCF [194], LightGCN [203] and IMP-GCN [197], preprocess the dataset by removing users and locations with fewer than 10 check-ins, keeping about 1 million interactions and 40.000 locations. Performance on the test set is evaluated by using all not-visited locations for every user, and the test set is composed of 20% of randomly sampled interactions for every user, hence using a weak generalization protocol. Results are reported using Recall@20 and NDCG@20, making these methods comparable. GeoSAN [181] seems to work with a bigger version of the dataset, with 131.000 locations and almost 3 million interactions, obtained by removing locations visited fewer than 10 times and users with fewer than 20 interactions. For efficient evaluation, the last visited location is used as target (weak generalization), while the rest are used for training, and 500 of the closest locations to the target are selected as negative samples. Among these, 100 are selected by a model trained on the same task. The hit rate and NDCG@5/10 are hence reported on a pool of 101 candidates for each user.
In the SSRM and GAG [98], [165] models, the 30.000 top locations are kept, and user sessions are created by grouping all check-ins for a single day. Sessions with more than 20 or fewer than 2 items are then removed. Both methods simulate a streaming context for evaluation, in which 40% of the last check-ins in chronological order are split into 5 slices, and evaluation is conducted over each slice, giving all past interactions as input. Results are reported using Recall@20 and MRR. The LESSR [99] model uses the same preprocessing described for the two previous methods to generate sessions, but the evaluation protocol is different: the last 20% of interactions for every session is used as the target set. Even though these methods are all evaluated under weak generalization and report results on MRR@20, their results are hardly comparable. Negative sampling is not used in evaluation here, since the model predicts scores for all items.
In PMLAM [251], interactions are not considered as sequential, so evaluation is done similarly to purely CF algorithms, with a cleaned dataset of 1.2 million interactions. Five-fold cross-validation is used: observed interactions are randomly divided into five folds, with one fold used for testing and the rest for training, and the metrics are averaged over the test results of the five splits. Negative sampling is used for training, but we are not able to clearly understand whether it is also used during testing, since the implementation is not available at the time of writing. The authors of LightRec [142] approach evaluation with a similar strategy, but here, for every user, 10% of interactions are used for validation. The dataset is also reduced to about 800.000 samples, by removing users with fewer than 3 interactions. The code is published, and we found that their weak generalization evaluation protocol does not use negative sampling. However, results can only be tested on the ML-10M dataset, and the Gowalla splits are not shared.
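The five-fold procedure can be sketched generically as follows. The implementation of PMLAM was not available to us, so the function names, the seed, and the `evaluate` callback are hypothetical.

```python
import random

def k_fold_splits(interactions, n_folds=5, seed=0):
    """Yield (train, test) lists so that each interaction is tested exactly once."""
    shuffled = list(interactions)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(interactions, evaluate, n_folds=5, seed=0):
    """Average a user-supplied evaluate(train, test) -> float over all folds."""
    results = [evaluate(train, test)
               for train, test in k_fold_splits(interactions, n_folds, seed)]
    return sum(results) / len(results)

toy_interactions = [(u, i) for u in range(4) for i in range(5)]   # 20 (user, item) pairs
splits = list(k_fold_splits(toy_interactions))
```

Since interactions are shuffled globally rather than per user, a user may end up with no test interactions in a given fold; implementations that split per user would need an additional grouping step.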
The authors of STAN [182] use a subset of the dataset with 121.000 locations, 53.000 users, and 3.3M interactions. For every user with m check-ins, m − 3 training sequences are created. Each sequence i ∈ [1, m − 3] is composed of the first i items, with item i + 1 as the target item. The test set is composed of a single sequence of the first m − 1 items for each user, with the last check-in as the target item. No negative sampling is used in evaluation, as the model outputs predictions over the whole set of items. Likewise, Deep-RegionRs [169] uses a weak generalization protocol and predicts the next location given the sequence of previous check-ins. However, during testing, candidate locations appear to be sampled based on their distance from the correct one, and results are reported with different metrics, making it incomparable with STAN. HME [307] uses a subset of Gowalla with check-in data from Houston, and the most recent 10% of each user's check-ins are used for testing. Therefore, the method adopts a weak generalization strategy. Code is not published for this work. The CauseRec [249] approach uses a strong generalization protocol, and 10% of users are held out for testing. During evaluation, for each test user, 80% of interactions are used to learn the new user representation and the remaining interactions are the target items. We were not able to find an official implementation, and no further details are given.
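The prefix-based construction described for STAN can be sketched as follows. The function name is hypothetical, and only the training and test sequences stated in the text are generated; how the remaining prefix is used is not specified here.

```python
def prefix_sequences(checkins):
    """Build (train, test) next-item examples from one user's ordered check-ins.

    train: m - 3 pairs (first i items, item i + 1) for i = 1, ..., m - 3.
    test:  one pair (first m - 1 items, last item).
    """
    m = len(checkins)
    # checkins[:i] holds the first i items; checkins[i] is item i + 1 (0-based index).
    train = [(checkins[:i], checkins[i]) for i in range(1, m - 2)]
    test = (checkins[:m - 1], checkins[m - 1])
    return train, test

train_sequences, test_sequence = prefix_sequences(["c1", "c2", "c3", "c4", "c5"])
```

For a user with m = 5 check-ins this yields two training pairs and one test pair whose target is the final check-in.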
We briefly mention SSTPMF [118], POI-SMF [119] and Meta-SKR [168], methods which all appear to use different subsets of the Gowalla dataset. Interactions are selected from different time spans, or by filtering check-ins to a subset of cities. The splitting strategies differ from one another, and the authors do not publish the code for inspection of their results.
Summary and Discussion: Most methods we analyzed follow evaluation procedures defined by previous work. When in doubt about the evaluation procedure, we consulted the published code for experiments, which, to the best of our knowledge, is only available for two-thirds of the methods listed in Tables 13 and 14. However, with the exception of related methods or the ones that directly improve over one another, they are not directly comparable without extensively editing their implementations. This is mainly due to two reasons. First, as shown, it is common to find methods trained and tested on different subsets of the same dataset. Furthermore, these sometimes utilize different splitting strategies for training and testing sets. The second culprit resides in substantial variations in the evaluation objective: models that operate on the same dataset are often tested on different tasks. These include but are not limited to the prediction of the next n item(s), the selection of top-n items among candidates, and the incorporation of temporal ordering in both. For example, we showed how the Movielens20M dataset has been framed as a purely implicit collaborative filtering task, but also as a session-based context-aware problem. In two of the analyzed works (DHE, DSS) [141], [188], all explicit ratings were considered observed interactions, but it is easy to find other research in the literature in which a threshold value (e.g., 3 or 4) is used to convert explicit feedback to a binary form. Additionally, there is always the possibility of framing the task as explicit rating prediction, of which we studied one example (OrdNMF) [100]. Though NDCG can still be calculated on the resulting ranked list, it might be unfair to compare two methods with different optimization tasks in mind.
Even among methods working with the same implicit data, we found performance estimations reported using various top-k metrics (often with incomparable k values), as well as AUC. Even when AUC is used, the evaluation protocol can be very different. For instance, DHE and DSS evaluate the ability of the model to recommend a single relevant item, while the graph-based approaches KGCN, KGNN-LS, and CKAN [198], [199], [201] all evaluate on 20% of the interactions, with a variable number of relevant items. A similar strategy is adopted by the SetBased approach, EASE, and MacridVAE [237], [238], [252], but using a strong generalization protocol and measuring different metrics, computed over a different subset of held-out users (1000 for the SetBased approach, 10.000 for the others). NDCG@100 is reported for all three of the latter methods. Since evaluation metrics are already a controversial topic because of phenomena such as MNAR and the significance of accuracy, this fragmentation makes evaluation all the more difficult. Negative sampling in evaluation was declared to be used in only two of the analyzed works, namely DSS and GeoSAN [181], [188].
Both weak and strong generalization settings present these issues. Ideally, datasets should be standardized in the way they are split and treated, at least with regard to their testing procedures. This is further discussed in Section VI-D. Ultimately, we find that this great variety in evaluation strategies is worsened by the lack of effective benchmarks and platforms that should be used for consistent evaluation of different models [314]. In the NLP field, for instance, GLUE and SuperGLUE [315], as well as other sibling initiatives, provide strong frameworks for the evaluation of newer language models, ensuring fair and consistent results that can be easily compared with other baselines. Similar initiatives should be sought for the improvement of recommendation methods. Fortunately, the emergence of works such as the previously mentioned [23] and [24] has raised much concern, and various proposals for comprehensive recommendation frameworks are starting to emerge. Many of these issues are being addressed by the excellent works of [191], [283], [314], [316].

VI. CHALLENGES AND RESEARCH DIRECTIONS
The ubiquity of RSs in today's digital platforms motivates research and industry to closely monitor users' online experience with the aim of continuously improving it. At the same time, the influence of AI systems on users' behaviors raises legitimate concerns about the unwanted effects that a biased RS may have when used to deliver content (like, for instance, personalized news feeds, search results, and shopping advice). Furthermore, the growing need to adhere to strict data protection regulations has steered recent research towards the development of more reliable, transparent, and privacy-aware RSs. While we previously introduced the main technical difficulties encountered in the development of a RS, this section expands on this topic and introduces other major challenges and directions addressed in current research.

A. FACING BIAS AND FAVORING DIVERSITY
It is well known that, in data-driven approaches, the lack of sufficiently diverse data can create dangerous biases, especially in consumer-facing systems such as recommendation algorithms. In general, the concept of diversity implies that the set of proposed recommendations within a single recommended list should be as diverse as possible. The effects of bias are discussed in two recent surveys [317], [318], which systematically study the sources of bias (like input data and model design) and highlight how these flaws contribute to creating unfair results. For example, they argue that the user base contained in the historical training data usually reflects the behavior of an uneven user distribution, resulting in a tendency to under-represent smaller groups. Moreover, additional inductive biases exist within models, related to the assumptions about the nature of the target function of the method of choice.
These, however, are problems that concern every data-driven system. Nonetheless, it is also possible to find biases specific to RSs, such as the ones that may originate from the users' tendency to give feedback only to content that is particularly liked or disliked and to converge towards the majority behavior (a phenomenon termed "conformance bias"). Many works also highlight the influence of the long tail phenomenon we mentioned at the start of this work, where a small number of popular items represent a considerable part of user interactions. Feedback loops used within RSs to update user preference may reinforce this effect, known as the "Matthew effect" [317], which reduces recommendation diversity in favor of the most "likely likable" items.
Recent works try to mitigate bias effects, mostly through regularization techniques using multi-task objective functions or explicitly capturing the concept of diversity from past user interactions [291], [319]- [321]. Other notable works include the one from [322], which performs an empirical study on this phenomenon on a news dataset using different recommendation logic, finding that careful algorithm design can lead to diverse recommendations in line with manually curated news feeds. The authors of [323] test a serendipity-oriented approach based on a topic diversification algorithm to improve the variety of retrieved items.

B. EXPLAINABLE RECOMMENDER SYSTEMS
Explainability refers to the ability of users to understand why they have received a recommendation. This can be directly related to the concept of user trust, which can be thought of as similar to accuracy (though not entirely the same because of its intrinsic subjectivity, among other things). Many of the works we presented rely on various artificial neural network architectures to generate recommendations. In recent years, researchers have tried to make the results of these "black-box" architectures more understandable to human subjects. Surveys from [10], [324] comprehensively cover the latest efforts and emphasize the desirability of robust RSs that can be perceived as reliable and transparent by the users. Some works focus on explicitly modeling latent factors or user profiles [325] and propose the usage of template-based systems to generate user data [252], [326]. Many works use disentangled representation learning to assist in the separation of contextual representation into a number of disjoint user and item factors that support factor-based explanations [187], [237]. Some works have gone as far as proposing the usage of language models to generate a natural language explanation based on the internal user/item representations [73], [161], [288].

C. TOWARDS FEDERATED LEARNING
Changes in data policy laws have recently pushed for the development of recommendation solutions that strike a balance between personalization and user privacy. In contrast with traditional systems, where all data is processed by a centralized infrastructure, federated learning enables a distributed approach. Personalized models are updated directly on user devices and then transferred to the server to be aggregated into a global model [293]. We found several recent works proposing new solutions for federated RSs [293], [327]- [329]. One recently published work explores the effectiveness of FedAttack [330], a method for launching "poisoning" attacks on federated RSs. This work suggests that this paradigm may be vulnerable to specific adversarial attacks that may compromise the functioning of the target RS. In [290], the authors study memory-efficient recommenders to tackle the limitations of resource-constrained edge devices, proposing "elastic embeddings". Such embeddings are composed of smaller blocks (sub-embeddings), similar to compositional embeddings, though their components are exclusive and not shared.

D. IMPROVING EVALUATION PROTOCOLS FOR COMPREHENSIVE EVALUATION
As already discussed, few datasets are used consistently throughout different studies and results are often difficult to compare. For example, only one of the presented datasets provides an ''official'' data split for training and testing. As a consequence, most works adopt different splitting strategies or reuse datasets from previous work without clear indication of how to retrieve them.
More importantly, this situation effectively compels new research to select the dataset and evaluation strategy as a function of the methods it seeks to improve upon, since results would not be comparable otherwise. This would not necessarily be a negative thing if it were not for the highly fragmented dataset and evaluation landscape. The only exception we could find is the MIND dataset [268], a relatively recent dataset whose train, test, and validation data splits have been readily available since its release. The dataset portal 3 also allows the submission of predictions on an undisclosed test set, with results published on an official leaderboard. Benchmarks such as this ensure that models are evaluated fairly and always with the same evaluation protocol. We find that initiatives of this kind, close in spirit to the ones that have now become standard in domains such as NLP, can effectively mitigate the fragmentation and reproducibility issues that are becoming more frequent in current research.
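When no official split exists, one way to reduce ambiguity is to publish a fully deterministic splitting rule together with the results. The sketch below shows one such rule, a leave-last-out split by timestamp. It is an illustrative assumption and not the protocol of any specific dataset discussed here; the function name and the toy interaction log are hypothetical.

```python
from collections import defaultdict

def leave_last_out(interactions):
    """Split (user, item, timestamp) triples: each user's most recent
    interaction goes to test, the rest to train. Timestamp ties are broken
    by item id, so the same input always yields the same split."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    train, test = [], []
    for user in sorted(by_user):
        *history, last = sorted(by_user[user])
        train.extend((user, item, ts) for ts, item in history)
        test.append((user, last[1], last[0]))
    return train, test

log = [("u1", "a", 1), ("u1", "b", 3), ("u1", "c", 2), ("u2", "a", 5)]
train, test = leave_last_out(log)
```

Note that a user with a single interaction ends up in the test set only; whether to keep or filter such users is itself a protocol choice that should be reported.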

E. OTHER ISSUES AND EXTENSIONS
There exist other research issues and possible extensions that we did not address, some of which we briefly introduce here. We mentioned, though did not address directly, how researchers have sought algorithms that are both stable and robust, meaning that they are not affected by fake ratings or by significant changes in data patterns over time [2]. Some studies have addressed multi-criteria ratings, i.e., approaches that distinguish between (for example) like, dislike, and no interaction at all. Other properties related to user experience, such as non-intrusiveness, trustworthiness, and other matters related to privacy, have also been discussed with great interest [4]. Finally, much could be said about metrics related to fairness and novelty, which are partly related to the matters of bias and diversity we discussed before and are attracting growing interest in recent research [283].

VII. CONCLUSION
In this work, we provide a comprehensive overview of the main topics necessary to develop an understanding of recent developments in recommendation systems research. We begin by discussing the relevant factors that impact the design of a recommendation algorithm, such as data availability and the choice of evaluation metrics. We describe a data-oriented taxonomy in line with new developments in this area and present a selection of recent traditional and neural-based approaches classified using the newly introduced categorization. We provide statistics for the most popular datasets and discuss the most common evaluation metrics used to measure an algorithm's performance. We examine the various evaluation protocols used in the reviewed works and carry out an empirical analysis on three datasets. Our findings highlight a lack of clearly defined testing protocols and reference benchmarks, suggesting a dire need for systematic evaluation procedures. Finally, the survey closes with a description of the latest research trends and open challenges addressed in recent works.

APPENDIX A ARCHITECTURAL DETAILS
This section briefly outlines two influential architectural paradigms that are extensively used in neural-based methods and also in RSs research.

A. THE ATTENTION MECHANISM
Recent research has made extensive use of various types of attention mechanisms, which can be summarized as weighting strategies for different numerical components.

1) ORIGINS OF ATTENTION
Attention became truly ubiquitous when it saw applications in the domain of NLP, being used first in machine translation tasks [172] and later in the Transformer architecture [70]. The seminal work by [172] introduced additive attention as an enhancement over an encoder-decoder architecture based on bi-directional RNNs. Prior to this work, the standard approach to such encoder-decoder structures was to use a single, fixed-size context (the compressed hidden representation) as input to the decoding stage. However, long-term dependencies between tokens in the input sequence were difficult to encapsulate in such a representation, as the context was, in practice, not able to compress all relevant information for particularly long sequences. The authors therefore proposed to enrich the context vector fed to the decoder by instead providing all hidden states of the encoder, obtaining a different context $c_i$ for each target position of the sequence (sentence). The context vector $c_i$ for each target word $y_i$ is a weighted sum over all the hidden states $h_j$, which are the concatenation of backward and forward hidden states for input word j, as defined in Equation 18.
$$c_i = \sum_{j} \alpha_{ij} h_j \tag{18}$$
The weights $\alpha_{ij}$, which effectively measure the attention score between input word j and target word i, are computed by the attention model. These depend on the previous decoder state $s_{i-1}$ (before generating word $y_i$) and the hidden state $h_j$, as in Equation 19.
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \qquad e_{ij} = \mathrm{attention}(s_{i-1}, h_j) \tag{19}$$
In the above equation, the attention function (originally termed the alignment model) was parameterized as a feed-forward neural network jointly trained with the rest of the system. This approach allows the hidden states of each input word to influence, to different degrees, the context $c_i$ and therefore the generated word $y_i$ (which also depends on previously generated words).
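The computation of one context vector can be sketched in NumPy as follows. The alignment model is written as a single-hidden-layer feed-forward network with a tanh non-linearity, following the additive form of [172]; the parameter names (`W_s`, `W_h`, `v`), the dimensions, and the random values are illustrative assumptions, and no training is performed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract the max for numerical stability
    return e / e.sum()

def additive_attention(s_prev, H, W_s, W_h, v):
    """Context vector for one target position.
    s_prev: previous decoder state s_{i-1}, shape (d_s,)
    H:      encoder hidden states h_j,      shape (T, d_h)
    W_s, W_h, v: parameters of the feed-forward alignment model."""
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v   # one score per input word, (T,)
    alpha = softmax(scores)                        # attention weights over input words
    context = alpha @ H                            # weighted sum of hidden states
    return context, alpha

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 4, 3, 6                      # illustrative sizes
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W_s = rng.normal(size=(d_s, d_a))
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)
context, alpha = additive_attention(s_prev, H, W_s, W_h, v)
```

Because the weights are recomputed from the decoder state at every step, each target position receives its own context vector rather than a single fixed-size summary.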

2) THE TRANSFORMER ARCHITECTURE
In the Transformer architecture [70], a similar mechanism is applied to a different framework, notably without any recurrence involved. Having dispensed with recurrence, the sequential processing restrictions are lifted, allowing the authors to propose a novel encoder-decoder model that can process all input tokens in parallel. While a detailed description can be found in the original paper, we introduce the most important part of the architecture, which is the multi-head attention (MHA) layer. This layer uses ''scaled dot-product'' attention in order to achieve efficient computation of attention weights. Since the architecture makes no use of recurrence, all input tokens are fed to the model and embedded simultaneously. The resulting embedding matrix $X \in \mathbb{R}^{N \times dim}$ is projected into three different spaces through different linear transformations, generating three different input representations in $\mathbb{R}^{N \times d_k}$, as in Equation 20. These are dubbed queries (matrix Q), keys (K), and values (V), following an information retrieval naming convention.
$$Q = X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V} \tag{20}$$
Then, in a few efficient matrix operations defined in Equation 21, the whole self-attention matrix $Z \in \mathbb{R}^{N \times d_k}$ is computed, producing the context vector for every decoded position. The authors define this mechanism as ''self-attentive'' because keys, values, and queries all come from the same place (in their case, the output of an encoder layer).
$$Z = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{21}$$
In the above equation, the denominator $\sqrt{d_k}$ is a scaling factor, used to improve gradient stability. Intuitively, in the NLP context, the query is the word being looked at, while keys and values both represent the past memory. The query is checked against the key matrix; the output of the matrix multiplication is passed through a softmax, obtaining a set of weights that selects the values corresponding to the matching keys. The idea of multi-head attention is simply to linearly project the queries, keys, and values h times with a set of learned linear projections. These operations are performed in parallel, each on a (usually smaller) sub-space, so that the heads can learn multiple diverse representations. Their outputs are then concatenated and passed through a linear layer to obtain the summarized representation from all heads.
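The projection, scaled dot-product, and head concatenation steps described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the projection matrices are random rather than learned, masking, residual connections, and layer normalization are omitted, and the names `heads` and `W_o` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the (N, N) attention weights."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))    # each row sums to 1
    return weights @ V, weights

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) projection triples, one per head.
    W_o: output projection applied to the concatenated head outputs."""
    outputs = [scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)[0]
               for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
N, dim, h, d_k = 4, 8, 2, 4                      # illustrative sizes
X = rng.normal(size=(N, dim))
heads = [tuple(rng.normal(size=(dim, d_k)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, dim))
out = multi_head_attention(X, heads, W_o)

# Attention weights of the first head, for inspection.
W_q, W_k, W_v = heads[0]
_, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Each head works on a `d_k`-dimensional sub-space, so with `h * d_k == dim` the total cost is comparable to a single full-width attention while allowing the heads to attend to different patterns.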

B. CAPSULE NETWORKS
Recent work explores the usage of Capsule Networks [244] to model dynamic user interests. The base unit in Capsule Networks (CNs) is the capsule, which can be seen as a group of standard neurons (i.e., perceptrons). Unlike a perceptron, a capsule outputs a vector instead of a scalar. Capsules were first introduced in CV [331], and their operation on images is probably the easiest way to explain them. Every object in an image can be considered a composition of several sub-objects, all in a predictable position with respect to each other (e.g., eyes, mouth, and nose in a face). In the RS domain, authors translate this metaphor into the reasonable assumption that each user is a composition of different intents and multi-domain interests that should be recognizable by looking at the user's interaction sequence.

1) CAPSULES
A capsule is a specialized unit with a dual task (contextualized to images):
• recognize the presence of a single sub-object (estimate how likely a part of a whole object is present in an image);
• estimate the instantiation parameters of this part, computing a vector that describes the sub-object's spatial properties, such as its dimension, position, and rotation.
Hence, with respect to perceptrons, a capsule can capture much richer information about each object's spatial properties, and this information is propagated in the network and exploited in the training process. This stands in stark contrast with lossy operations often used in CV, like pooling [245]. Capsules are organized in layers, and their output is fed to the capsules of the next layer through weighted connections. Each successive layer of capsules specializes in recognizing higher-level objects, using the sub-object information captured by lower-level capsules.

2) DYNAMIC ROUTING
Since every capsule in a specific layer learns to recognize specific parts of an image with spatial information, the next layer must decide how to organize these parts consistently. The routing-by-agreement algorithm known as Dynamic Routing [245] is the key solution to this issue. To learn connection weights, each capsule tries to predict the output of every capsule in the following layer. This can be understood as an ''educated guess'' by the capsule about the object that is most likely made up of the recognized parts, and that should be found by the higher-level capsules. The entire process can be seen as a soft-clustering algorithm that creates clusters of capsules based on the agreement between their predictions and the target vectors. The prediction made by capsule i in layer l about the output of capsule j in layer l + 1 is computed as:
$$\hat{u}_{j|i} = W_{ij} u_i$$
In the equation above, $u_i$ is the activation vector of capsule i, and matrix $W_{ij}$ is used to learn the part-to-whole relationship between sub-objects and the higher-level objects recognized by the next layer. A weight matrix b stores the connection weights between capsule i in layer l and capsule j in layer l + 1 (entry $b_{ij}$). All entries of this matrix are initialized to 0. Then, a fixed number of iterations is performed to update the weights for each layer. At each iteration, the coupling coefficients are computed as follows:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k} \exp(b_{ik})}$$
Then, for each capsule j in layer l + 1, the weighted sum of the predictions made by the capsules in layer l is computed. Here the vector $s_j$ depends on the ''guesses'' of all lower capsules i:
$$s_j = \sum_{i} c_{ij} \hat{u}_{j|i}$$
The raw sum $s_j$ is an un-normalized vector whose norm can be larger than 1. Since we want the vector magnitude (L2 norm) to represent the probability of the capsule ''being right'' about the recognized part, a squashing non-linear activation is applied to obtain $v_j$, as in Equation 24.
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|} \tag{24}$$
To measure the agreement between capsules from subsequent layers, the dot product is computed between the actual output $v_j$ and the predicted output $\hat{u}_{j|i}$. This agreement score is used to update the connection weights:
$$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j \tag{25}$$
This way, capsules in layer l that were more in agreement with capsules in layer l + 1 send a stronger signal to the higher-level capsules than capsules that made a wrong prediction. After a few update rounds for each layer l, the algorithm proceeds to the next layer, until all capsule connections are weighted. This routing mechanism has more recently been improved using the Expectation-Maximization algorithm [332] in order to overcome some of the limitations of the former approach.
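The routing iterations between one pair of capsule layers can be sketched in NumPy as follows. This is a minimal illustration, assuming random (untrained) transformation matrices and three routing iterations; the function names, tensor shapes, and the small `eps` constant added for numerical safety are assumptions of this sketch.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink each vector so that its norm lies in [0, 1), keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u, W, num_iters=3):
    """u: lower-layer capsule outputs, shape (n_in, d_in)
    W: transformation matrices W_ij,  shape (n_in, n_out, d_out, d_in)
    Returns upper-layer outputs v (n_out, d_out) and the coupling
    coefficients c (n_in, n_out) used in the last iteration."""
    u_hat = np.einsum('ijkl,il->ijk', W, u)        # predictions of capsule i for capsule j
    b = np.zeros(u_hat.shape[:2])                  # connection weights, initialized to 0
    for _ in range(num_iters):
        c = softmax(b, axis=1)                     # each lower capsule distributes over upper ones
        s = np.einsum('ij,ijk->jk', c, u_hat)      # weighted sum of predictions per upper capsule
        v = squash(s)                              # normalized upper-layer outputs
        b = b + np.einsum('ijk,jk->ij', u_hat, v)  # agreement (dot product) update
    return v, c

rng = np.random.default_rng(0)
n_in, n_out, d_in, d_out = 6, 3, 4, 5              # illustrative sizes
u = rng.normal(size=(n_in, d_in))
W = rng.normal(size=(n_in, n_out, d_out, d_in))
v, c = dynamic_routing(u, W)
```

In a trained network the matrices `W` would be learned by backpropagation, while the routing loop itself has no trainable parameters and is recomputed for every input.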

ACKNOWLEDGMENT
(Matteo Marcuzzo and Alessandro Zangari are co-first authors.)
ALESSANDRO ZANGARI received the master's degree in computer science from the University of Padua, in 2020. He is an Associate Researcher with Digital Strategy Innovation and a Machine Learning Engineer with the Ca' Foscari University of Venice. His current research interests include deep learning applications for natural language processing algorithms, recommendation systems, computer vision, and interpretability of AI.
ANDREA ALBARELLI is currently a Professor for the multidisciplinary master program in data analytics for business and society with the Ca' Foscari University of Venice, where he is responsible for the artificial intelligence teaching.
He is a Researcher in the field of artificial intelligence, with a special focus on the design of disruptive data-driven methodologies to be applied to real-world scenarios. To this end, he works in close collaboration with companies willing to undertake a radical digital transformation process. His approach is end-to-end, spanning from the co-design of digital-first business models to the scientific advising needed to fulfill their methodological and technological infrastructure. He has led several technological transfer projects, resulting in research papers published in top international journals and presented in key engineering conferences. He received several scientific and industrial recognitions, including the NVIDIA Best Paper Award, for his research on 3D data processing; and innovation grants from companies like Electrolux and TIM, for his technical contributions.
ANDREA GASPARETTO received the M.Sc. degree in computer science from the University of Venice, Italy, in 2012, and the Ph.D. degree in computer science from the Ca' Foscari University of Venice. Since 2016, he has been a Researcher and a Teaching Assistant with the Management Department, Ca' Foscari University of Venice. His research interests lie in the artificial intelligence field, and more precisely in computer vision, shape analysis, retrieval and classification, and non-vectorial data models. VOLUME 10, 2022