Cross-Domain Meta-Learner for Cold-Start Recommendation

The cold-start problem is a major factor limiting the effectiveness of recommender systems: with too few interaction records available, predicting user preferences becomes difficult. At present, two main kinds of strategies address this problem from different perspectives. One is cross-domain recommendation (CDR), which introduces additional information through domain knowledge propagation with transfer learning. However, CDR methods follow traditional machine learning training processes and cannot address this typical few-shot problem from the perspective of optimization. The other, more recent type of method is based on meta-learning. Most of these approaches focus only on generating a meta-model that performs better on new tasks and ignore improvements based on cross-domain information. Therefore, it is necessary to design a novel approach that solves this problem with both domain knowledge and meta-optimization. To this end, we propose a novel cross-domain meta-learner for cold-start recommendation (MetaCDR). In MetaCDR, we design a domain knowledge meta-transfer module to connect the networks of different domains. In addition, we introduce a pretraining strategy to ensure efficiency. Experimental results show that MetaCDR performs significantly better than state-of-the-art models in a variety of scenarios.


INTRODUCTION
Faced with an increasingly severe information overload problem, recommendation systems are playing essential roles in online services [8], [16], [19]. An excellent recommendation system can accurately and quickly discover users' personalized preferences, which provides convenience to users and brings substantial economic benefits to businesses [7], [53], [65]. Most recommendation systems learn a given user's preferences from the user's historical interaction information to generate recommendation results. However, in the real world, new users and items will constantly enter the system. These new users and items with little available data severely limit the performance of recommender systems; this is the well-known cold-start problem [15], [51].
An intuitive method for solving the cold-start problem is to introduce more available data to the system [17], [51], [70], such as by obtaining additional item features or the demographic information of the examined user during the data collection phase (e.g., extracting this information from knowledge graphs) [54], [57]. Instead of relying on the availability of additional information and incurring the cost of obtaining it manually, a more attractive approach is to improve the model structure [60] or build a mapping function [36] to introduce knowledge from other domains; this is called cross-domain recommendation (CDR) [68]. Fig. 1 shows an example of a cross-domain scenario. In the real world, the same users in multiple domains can be aligned, and the user interaction records in other domains are considered auxiliary information in the current domain. In this regard, deep fusion networks [37] with transfer learning [39] (e.g., cross-stitch networks (CSNs)) [22], [61] have achieved remarkable results. However, most of these works have focused on building more complex neural networks to achieve high-quality information transmission while ignoring the important role of model optimization strategies in solving this typical few-shot problem. The core problem of cold-start recommendation is that new users or items have only a small number of interactions in the recommender system with which to model their features. Similarly, the few-shot problem is that only a small number of samples per class are available [56]. Therefore, it is feasible to use few-shot learning methods to solve the cold-start problem.
Recent research on meta-learning [21] has provided new ways to solve this few-shot problem from an optimization perspective. Among them, gradient-based meta-learning (e.g., model-agnostic meta-learning) [12] learns the shared information among tasks to adapt to a new task with a few parameter update steps. This method has achieved great success in solving the cold-start problem for recommender systems. It treats each user as a single task and learns general characteristics among users. When a new user arrives, only a small amount of interactive information is needed to predict the user's preferences. However, most of the current meta-learning models focus on generating a meta-model to perform better on new tasks [9], [26], [59], and only simple MLPs are used as the basic model. Improving the extraction of cross-domain information is also ignored, which leads to great restrictions on the usage scenarios of the resulting model.
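To make the gradient-based meta-learning idea concrete, the following sketch (our illustration, not the paper's implementation) applies MAML-style inner adaptation and outer updates to a toy family of one-parameter regression tasks, one "user" per task; the meta-gradient is approximated by finite differences for brevity:

```python
import numpy as np

def inner_adapt(theta, x, y, lr=0.1, steps=3):
    """Adapt the shared parameter to one task (user) by gradient descent on MSE."""
    for _ in range(steps):
        grad = 2 * np.mean((theta * x - y) * x)  # d/dtheta of mean (theta*x - y)^2
        theta = theta - lr * grad
    return theta

def maml_step(theta, tasks, inner_lr=0.1, outer_lr=0.05, eps=1e-4):
    """One outer update: average the query-set gradients of the adapted models.
    The meta-gradient is approximated by central finite differences."""
    meta_grad = 0.0
    for (xs, ys, xq, yq) in tasks:
        def query_loss(t0):
            t = inner_adapt(t0, xs, ys, lr=inner_lr)
            return np.mean((t * xq - yq) ** 2)
        meta_grad += (query_loss(theta + eps) - query_loss(theta - eps)) / (2 * eps)
    return theta - outer_lr * meta_grad / len(tasks)

rng = np.random.default_rng(0)
# Each "user" task: y = w_u * x with w_u near 2.0; support and query samples.
tasks = []
for _ in range(8):
    w = 2.0 + 0.1 * rng.standard_normal()
    xs, xq = rng.standard_normal(5), rng.standard_normal(5)
    tasks.append((xs, w * xs, xq, w * xq))

theta = 0.0
for _ in range(200):
    theta = maml_step(theta, tasks)
```

After meta-training, a new task needs only `inner_adapt` on a handful of support examples, mirroring how a cold-start user is handled.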
Although some meta-learning algorithms have focused on solving the cross-domain problem [52], [58], they are different from those required to solve the problems we describe above. We show the difference in Fig. 2. In previous works, although each task is obtained from a different domain, each task only contains samples from one domain. The challenge of these works is to determine the domains to which the tasks belong. However, in our work, each task contains samples from different domains. The challenge is to utilize domain relevance to transfer knowledge between different domains for better performance.
Therefore, to solve the cold-start problem, it is necessary to propose a model that makes full use of the advantages of cross-domain knowledge and introduces a model optimization strategy simultaneously. However, this task faces the following challenges: 1) How can the problem of cross-domain cold start be rationally redefined so that it can be applied to meta-optimization methods? Unlike traditional machine learning, meta-learning has special requirements for data and scenarios. Although the method of applying meta-learning to recommender systems has matured, determining how to transform the cold-start problem of cross-domain scenarios into a problem suitable for meta-optimization is still a challenge. 2) How can cross-domain knowledge transfer be achieved in a meta-learning setting? The current meta-learning-based models that can alleviate the cold-start problem are always based on simple MLP networks, which severely limit the expressive ability of the network and the obtained cross-domain knowledge. Introducing transfer learning networks inevitably raises the complex problem of adaptation between transfer learning and meta-learning. 3) How can the efficiency of the resulting model be ensured? Meta-learning usually consumes more resources and time than traditional training methods. Moreover, with a more complex network structure and larger amounts of data, this problem becomes more serious [12].
Considering the above challenges, we propose a novel cross-domain recommendation model via meta-learning called MetaCDR, which solves the cold-start problem through two resources: cross-domain knowledge and an optimization model. We define the cold-start problem in cross-domain scenarios as a new few-shot problem and optimize it with model-agnostic meta-learning (MAML) [12]. In this model, a module called DKMT is designed based on a CSN [37] to perform domain-knowledge meta-transfer. Finally, we propose a pretraining strategy to reduce the computational resources and time required for model training, thereby enhancing the practicality of MetaCDR.
The contributions of this paper are as follows: 1) We design a novel recommendation model with transfer learning and meta-learning called MetaCDR to solve the cold-start problem. To the best of our knowledge, this is the first attempt to solve this problem from the viewpoint of both cross-domain knowledge and model optimization. 2) We propose a module called DKMT, which is designed specifically for recommender systems, to perform knowledge transfer between different domains. 3) We introduce a pretraining strategy for MetaCDR to reduce the amount of resources and time consumed while achieving similar effects. 4) A sufficient number of experiments are performed to prove that the results of MetaCDR are significantly better than those of several state-of-the-art methods in various scenarios. We also conduct an ablation experiment and detailed analysis to verify the impact of each component of MetaCDR and show the effectiveness of DKMT. The structure of this paper is as follows: Section 2 introduces the related work. Section 3 defines the cold-start problem in cross-domain scenarios. Section 4 describes the structure and training process of MetaCDR in detail. Section 5 introduces the experimental settings and analyzes the results. In Section 6, we conclude this paper and introduce our future work.

Cross-Domain Recommendation
Cross-domain recommendation (CDR) [68] is a commonly used method for solving the cold-start problem [41] by alleviating data sparseness. By transferring and sharing information across different domains, the relationships between the domains and the semantics of user preferences can be explored to generate better recommendations [47], [48]. The key to this technology is the method of learning the complex relationships between different domains. Recently, many CDR approaches have been proposed [13], [28], [55]. Man et al. [36] proposed embedding and mapping methods that model domain relationships through neural networks. Utilizing a dual-objective optimization method, Zhu et al. [66] achieved simultaneous performance improvements in both the source and target domains. Hu et al. [22] proposed a deep cross network to realize the two-way transfer of knowledge between the two domains. Liu et al. [33] extended CoNet with image information to extract users' aesthetic preferences. Zhao et al. [62] integrated like-minded users with an end-to-end framework to further enhance the effect of CDR. Krishnan et al. [25] leveraged the contextual invariance across domains to simultaneously develop cross-domain and cross-system recommendations. Bonab et al. [2] explored different market-adaptation techniques inspired by state-of-the-art domain adaptation and meta-learning approaches and proposed a neural approach for market adaptation. Li et al. [29] presented a debiasing learning-based cross-domain recommendation framework with causal embedding to correct the data selection bias in cross-domain scenarios with a generalized propensity score and to estimate the propensity score when domain-specific confounders are unobserved. Sahu et al. [44] utilized matrix factorization, by which a rating matrix is decomposed into several submatrices. Li et al. [27] proposed a novel CDR method via regression analysis for cold-start users who never rated items in the target domains.
However, most of the current transfer-learning-based methods are devoted to sharing information more effectively between domains by improving the complex structures of cross-domain networks while ignoring the importance of the optimization for solving the few-shot problem. In this paper, we propose a novel model that incorporates a transfer-learning-based CDR network with an optimization approach to enhance the ability of the overall model to solve the cold-start problem.

Meta-Learning Recommendation
Meta-learning [21] is also known as learning how to learn. Unlike traditional machine learning methods, a meta-learning model is trained through many separate tasks to learn their similarities and differences and to obtain a base model that can be adapted to new tasks with rapid updating [52], [58]. Common meta-learning methods can be divided into three categories: metric-based [6], [45], [50], memory-based [14], [46], and optimization-based [30], [38], [43] approaches. Previous works have tried to utilize a variety of meta-learning methods in recommender systems to solve the cold-start problem and achieve good results.
Vartak et al. [49] used a meta-learning-based method to predict users' preferences for tweets based on their historical clicks. Du et al. [10] predicted user behavior via sequential recommendations in different domains using meta-learning. However, this method only learns common initialization parameters for each domain and does not consider the accurate alignment of fine-grained information across domains. Bharadhwaj [1] modeled each input user as a task with the optimization-based meta-learning method (MAML). Lee et al. [26] extended the above method by optimizing the parameters of the model in groups. Lu et al. [34] introduced heterogeneous information networks as additional information in a meta-learning environment to further alleviate the cold-start problem. Dong et al. [9] used a memory-augmented neural network to improve the model's effect with respect to solving the cold-start problem. Zheng et al. [64] used a matching network to address the sequential recommendation cold-start problem without side information. Yu et al. [59] solved the problem of neglecting minor users through a meta-learning approach with a personalized adaptive learning rate. Lin et al. [32] further alleviated the cold-start problem with neural processes. Zhu et al. [69] reduced the bias toward limited overlapping users in the embedding and mapping approach via a meta-network. Feng et al. [11] developed a contextual modulation meta-learning framework for efficient and complete recommendation. Zhu et al. [71] proposed an embedding cold-start approach with meta-scaling and shifting networks to avoid the effects of noisy interactions. However, current methods do not consider the role of cross-domain knowledge transfer in meta-learning recommendation.

PRELIMINARIES
In this section, we first provide a specific definition of the cold-start problem in CDR. Then, as the base model for the new work, a feature embedding method and a multilayer feed-forward neural network for preference prediction are introduced.

Problem Formulation
Two different domains (such as movies and books), both containing user features, item features, and interaction records, are called the source domain $D_s$ and the target domain $D_t$ according to the interaction-sparsity difference between them. Users who appear in both domains are called overlapping users. The features of these overlapping users are represented as a set $U$, and the item features are represented as sets $I_s$ and $I_t$. The interaction information between users and items is expressed as $R_{u,s}$ and $R_{u,t}$ through implicit feedback (such as clicks, browsing, or likes) [20] or explicit feedback (such as ratings) [26]. It is worth mentioning that interaction records for new users or items are often scarce; this is the cold-start problem in CDR.
Our task is to use the features of users and items and the interaction records among them to train a model that predicts users' item ratings. The function is expressed as follows:

$$\hat{r}_{u,s},\ \hat{r}_{u,t} = f_\theta(u, i_s, i_t), \qquad (1)$$

where $\hat{r}_{u,s}$ and $\hat{r}_{u,t}$ are the predicted ratings of user $u \in U$ for items $i_s \in I_s$ and $i_t \in I_t$ from the source domain $D_s$ and target domain $D_t$, respectively; $f$ is the prediction model, and $\theta$ is its parameter set. In our model, we treat each domain as a separate recommendation task and adopt a single-domain method to model each task separately. Then, a cross-domain network is used to connect the two domains and perform knowledge transfer, and a meta-learning strategy is used to train the domain models jointly. In the next section, the basic models are introduced before MetaCDR.

Embedding and Recommendation
In this section, we introduce the embedding method and the structure of the basic model for a single domain.
Embedding: $u \to e_u$; $i \to e_i$. Traditional recommender systems use one-hot vectors to represent the IDs of users and items, but such systems can only predict interactions between existing users and items; they cannot learn user preference information, and a one-hot ID vector is useless for a new user or item. Therefore, we use the demographic information of users (such as age, gender, and occupation) and the features of items (such as film directors or book genres). These features reveal users' potential preferences in the recommendation system. Specifically, we first divide the available numeric information into groups and represent it as integers, convert the categorical information into one-hot vectors, and then use a dimension-compression matrix for embedding as follows:

$$e_u = f_{\theta_u}(u) = [u_1 p_1; u_2 p_2; u_3 p_3; \ldots; u_N p_N]^T, \qquad (2)$$

where $e_u$ is the embedding vector of user $u$; $f$ is the embedding function for users with parameters $\theta_u$; $u_n$ is an integer or a $d_j$-dimensional one-hot vector representing user feature $n \in \{1, \ldots, N\}$; $p_n$ is the $d_e$-by-$d_j$ embedding matrix for the corresponding categorical content of user $u$; and $[\cdot\,;\cdot]$ is the concatenation operation. The items are embedded in a similar way:

$$e_i = f_{\theta_i}(i) = [i_1 q_1; i_2 q_2; \ldots; i_M q_M]^T, \qquad (3)$$

where $e_i$ is the embedding vector of item $i$; $f$ is the embedding function for items with parameters $\theta_i$; $i_m$ is an integer or a $d_k$-dimensional one-hot vector representing item feature $m \in \{1, \ldots, M\}$; $q_m$ is the $d_e$-by-$d_k$ embedding matrix for the corresponding categorical content of item $i$; and $[\cdot\,;\cdot]$ is the concatenation operation. Next, $e_u$ and $e_i$ are concatenated and fed into the recommendation model. Recommendation: $(e_u, e_i) \to \hat{y}_{u,i}$. We use MLPs to model user preferences and predict the ratings, implicit feedback, or dwell times for items.
The model can be expressed as:

$$x^1 = \sigma(W^1 [e_u; e_i] + b^1),$$
$$x^l = \sigma(W^l x^{l-1} + b^l), \quad l = 2, \ldots, L-1,$$
$$\hat{r}_{u,i} = W^L x^{L-1} + b^L, \qquad (4)$$

where $\hat{r}_{u,i}$ is the model's prediction of user feedback; MLP stands for a multilayer perceptron, and $\phi$ is the set of its parameters, including the weight matrices $W$ and bias vectors $b$; $\sigma$ is the activation function (here, the rectified linear unit, ReLU); and $e_u$ and $e_i$ are the embedding vectors of users and items, respectively.

METACDR

Fig. 3 shows the overall workflow of MetaCDR. In this section, the details of MetaCDR are presented. First, we introduce the cross-domain combination method of the network, including the sharing of the user embedding network and the cross-domain connections between fully connected networks in the meta-learning environment. Second, we redefine the cold-start problem in the CDR scenario as a few-shot problem. Third, we propose a meta-optimization method for MetaCDR. In addition, we design a pretraining strategy to greatly reduce the time and resource consumption.
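The single-domain base model described above (a per-feature embedding lookup followed by an MLP predictor) can be sketched as follows; the feature counts, embedding size, and layer widths here are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 3 categorical user features and 2 item features.
user_dims, item_dims = [2, 7, 21], [18, 5]
d_e = 4                                     # per-feature embedding size

P = [rng.standard_normal((d_e, d)) * 0.1 for d in user_dims]   # user matrices p_n
Q = [rng.standard_normal((d_e, d)) * 0.1 for d in item_dims]   # item matrices q_m

def embed(indices, mats):
    """Concatenate per-feature embeddings; multiplying an embedding matrix by a
    one-hot vector is just a column lookup."""
    return np.concatenate([mats[n][:, idx] for n, idx in enumerate(indices)])

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(layer_sizes):
    """phi = {W^l, b^l} for an MLP with the given layer sizes."""
    return [(rng.standard_normal((layer_sizes[l + 1], layer_sizes[l])) * 0.1,
             np.zeros(layer_sizes[l + 1])) for l in range(len(layer_sizes) - 1)]

def predict(phi, e_u, e_i):
    """ReLU hidden layers over [e_u; e_i], linear output for the rating."""
    x = np.concatenate([e_u, e_i])
    for W, b in phi[:-1]:
        x = relu(W @ x + b)
    W, b = phi[-1]
    return float((W @ x + b)[0])

e_u = embed([1, 3, 10], P)       # e.g., gender=1, age group=3, occupation=10
e_i = embed([4, 2], Q)
phi = init_mlp([len(e_u) + len(e_i), 64, 64, 1])
r_hat = predict(phi, e_u, e_i)
```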

User Embedding Sharing
Part A of Fig. 3 shows the structural details of MetaCDR, and the left side of its structure is the embedding part. To combine the networks of the source and target domains and share information, we first share the user embedding layer so that the same user feature has a consistent initial embedding in different domains, which helps the model focus on learning the mappings of the item characteristics between domains. Our method of obtaining the embedding vector for each domain is as follows:

$$V_s = [f_{\theta_u}(u); f_{\theta_s}(i_s)], \quad V_t = [f_{\theta_u}(u); f_{\theta_t}(i_t)], \qquad (5)$$

where $V_s$ is the input vector on the source-domain side and $V_t$ is the input vector on the target-domain side; $f$ represents the embedding functions; and $\theta_u$, $\theta_s$, and $\theta_t$ are the parameters of the embedding functions for users, source-domain items, and target-domain items, respectively.

Knowledge Transfer Between Domains
Before introducing the structure of DKMT, we first review the deep neural network transfer model called CSN [37], which has achieved significant results for multitask learning problems in the field of computer vision. We also note the problems that need to be solved when performing cross-domain knowledge transfer in the recommendation system and meta-learning environment. Given two convolutional neural network models, the CSN is used to connect the corresponding layers. Specifically, at location $(i, j)$ in the activation map, we have:

$$\begin{bmatrix} \tilde{x}^{ij}_A \\ \tilde{x}^{ij}_B \end{bmatrix} = \begin{bmatrix} \alpha_{AA} & \alpha_{AB} \\ \alpha_{BA} & \alpha_{BB} \end{bmatrix} \begin{bmatrix} x^{ij}_A \\ x^{ij}_B \end{bmatrix}, \qquad (6)$$

where $x^{ij}_A$ and $x^{ij}_B$ represent the current-layer inputs of networks A and B, respectively; $\alpha_{AA}$, $\alpha_{AB}$, $\alpha_{BA}$, and $\alpha_{BB}$ are the cross-stitch parameters used to implement knowledge transfer; and $\tilde{x}^{ij}_A$ and $\tilde{x}^{ij}_B$ are the inputs of the next layer of the networks.
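The cross-stitch operation just described is a 2x2 linear mixing of same-shaped activations from the two networks. A minimal sketch (illustrative alpha values):

```python
import numpy as np

def cross_stitch(x_A, x_B, alpha):
    """Cross-stitch unit: mix the two networks' activations with the 2x2
    parameter matrix alpha = [[a_AA, a_AB], [a_BA, a_BB]], applied at every
    location of the activation map."""
    stacked = np.stack([x_A, x_B])                 # shape (2, ...)
    mixed = np.tensordot(alpha, stacked, axes=1)   # linear combination per location
    return mixed[0], mixed[1]

alpha = np.array([[0.9, 0.1],
                  [0.1, 0.9]])      # mostly keep each network's own activations
x_A = np.ones((2, 2))
x_B = np.zeros((2, 2))
xt_A, xt_B = cross_stitch(x_A, x_B, alpha)
```

Note that the same four scalars mix every location, which is exactly the limitation the DKMT module below addresses for recommendation networks.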
Consider the following three issues: 1) The four weight parameters used by the CSN can only achieve content migration in the same dimensional space, but our model uses an N-layer fully connected neural network with different dimensions for each layer. 2) The CSN assumes that all dimensions of information are equally important. However, unlike images in computer vision, the importance of each user or item dimension in the recommendation system is different and requires an independent weight [5].
3) The CSN assumes that all information is worth migrating, but it is evident that not every feature is helpful in other domains, at least in the recommendation system. We need to find a way to make the model transfer knowledge more conservatively and to better adapt the meta-learning framework.
To solve the above problems, we design a module to realize domain knowledge meta-transfer, called DKMT. We show this module on the right side of Part A in Fig. 3, and we propose solutions to the above three problems.
For the first two problems, we use weight matrices to replace the scalar weights in the CSN, which is equivalent to a cross-domain fully connected network. This structure is expressed as:

$$x^l_s = \sigma(W^l_{ss} x^{l-1}_s + H^l_{ts} x^{l-1}_t + b^l_s),$$
$$x^l_t = \sigma(W^l_{tt} x^{l-1}_t + H^l_{st} x^{l-1}_s + b^l_t), \qquad (7)$$

where $x^l_s$ and $x^l_t$ represent the outputs of the cross network and are used as the inputs of the next layer in the source and target domains; $W^l_{ss}$ and $W^l_{tt}$ are the domain-specific weight matrices of the source and target domains, respectively, which are used to perform knowledge transfer within each domain; $H^l_{st}$ and $H^l_{ts}$ are the cross-domain weight matrices between the two domains, used to transfer knowledge from the source to the target domain and from the target to the source domain, respectively; $x^{l-1}_s$ and $x^{l-1}_t$ are the inputs of the cross network, i.e., the outputs of the previous layer in the two domains; and $b^l_s$ and $b^l_t$ are the biases in the two domains. In contrast to the CSN, $H^l_{st}$ and $H^l_{ts}$ are $d_l \times d_{l-1}$-dimensional parameter matrices, where $d_l$ is the dimensionality of $x^l_s$ and $x^l_t$ and $d_{l-1}$ is the dimensionality of $x^{l-1}_s$ and $x^{l-1}_t$. In this way, we perform knowledge transfer between layers of different dimensionality.
Since the effectiveness of meta-learning can be significantly reduced on complex models, we set $H^l_{st}$ and $H^l_{ts}$ to the same matrix $H^l$ to reduce the complexity of the model:

$$x^l_s = \sigma(W^l_{ss} x^{l-1}_s + H^l x^{l-1}_t + b^l_s),$$
$$x^l_t = \sigma(W^l_{tt} x^{l-1}_t + H^l x^{l-1}_s + b^l_t), \qquad (8)$$

where $H^l$ is the shared parameter matrix used for knowledge transfer between the two domains; the other symbols are the same as in Formula (7). For the last problem, we introduce a widely used sparsity-inducing regularization method, the least absolute shrinkage and selection operator (LASSO), on the knowledge transfer matrices $H^l$. As usual, LASSO regularization helps the model retain the most useful parameters through sparsity. The regularization term can be expressed as:

$$\Omega = \lambda \sum_l \sum_{i,j} |h^l_{i,j}|, \qquad (9)$$

where $\Omega$ stands for the LASSO regularization term, $H^l$ is the knowledge transfer matrix in the $l$-th layer, $h^l_{i,j}$ is the $(i,j)$-th parameter in the matrix, and $\lambda$ is the hyperparameter used to control the degree of sparsity.
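A hedged sketch of one DKMT layer with a shared transfer matrix and the LASSO penalty follows; the dimensions and initialization are illustrative assumptions, and the activation is taken to be ReLU as elsewhere in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(x, 0.0)

def dkmt_layer(x_s, x_t, W_ss, W_tt, H, b_s, b_t):
    """One DKMT layer: in-domain weights W_ss / W_tt plus a shared
    d_l-by-d_{l-1} transfer matrix H applied to the other domain's
    previous-layer output."""
    x_s_new = relu(W_ss @ x_s + H @ x_t + b_s)
    x_t_new = relu(W_tt @ x_t + H @ x_s + b_t)
    return x_s_new, x_t_new

def lasso_penalty(Hs, lam=0.01):
    """Omega = lam * sum over layers of sum |h_ij|; encourages sparse,
    conservative knowledge transfer."""
    return lam * sum(np.abs(H).sum() for H in Hs)

d_prev, d_next = 8, 4              # different dims per layer, unlike the CSN
W_ss, W_tt, H = (rng.standard_normal((d_next, d_prev)) * 0.1 for _ in range(3))
b_s, b_t = np.zeros(d_next), np.zeros(d_next)
x_s, x_t = dkmt_layer(rng.standard_normal(d_prev), rng.standard_normal(d_prev),
                      W_ss, W_tt, H, b_s, b_t)
```

Sharing a single `H` works here because both domain networks use the same layer sizes, so one matrix can map either domain's previous-layer output.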

Multilabel Loss Function
We define the loss as the sum of the mean squared errors (MSEs) of the two networks and the regularization term. At this stage, the loss function of MetaCDR can be expressed as:

$$L = L_s + L_t + \Omega = \frac{1}{K} \sum_{u, i_s} (r_s - \hat{r}_s)^2 + \frac{1}{K} \sum_{u, i_t} (r_t - \hat{r}_t)^2 + \Omega, \qquad (10)$$

where $L$ represents the overall loss; $L_s$ and $L_t$ represent the MSE loss functions for the source and target domains, respectively; $u \in U$, $i_s \in I_s$, and $i_t \in I_t$ are the users and items from the two domains; $r_s \in R_s$ and $r_t \in R_t$ are the ratings in the two domains, and $\hat{r}_s$ and $\hat{r}_t$ are their predicted values; $\Omega$ represents the regularization term, and $H^l \, (l \in \{1, \ldots, L\})$ are the parameters of DKMT; and $K$ is the number of interaction records.
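The combined loss just described can be sketched numerically (the ratings and transfer matrix below are made-up values for illustration):

```python
import numpy as np

def mse(r, r_hat):
    """Mean squared error over a list of (rating, prediction) pairs."""
    return float(np.mean((np.asarray(r) - np.asarray(r_hat)) ** 2))

def metacdr_loss(r_s, rhat_s, r_t, rhat_t, Hs, lam=0.01):
    """L = L_s + L_t + Omega: per-domain MSE plus the LASSO term on the
    transfer matrices H^l (sketch of the multilabel loss in the text)."""
    omega = lam * sum(np.abs(H).sum() for H in Hs)
    return mse(r_s, rhat_s) + mse(r_t, rhat_t) + omega

H1 = np.array([[0.5, -0.5],
               [0.0,  1.0]])
# Source-domain MSE = 0.125, target-domain MSE = 1.0, Omega = 0.1 * 2.0 = 0.2.
loss = metacdr_loss([4.0, 3.0], [3.5, 3.0], [5.0], [4.0], [H1], lam=0.1)
```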
Above, the network structure of MetaCDR is introduced. Next, we redefine the cold-start problem as a few-shot problem and introduce the utilized training and test processes based on meta-learning.

Data Processing
Part B of Fig. 3 shows the employed data preprocessing method. In MetaCDR, each task includes a user, the items that the user has rated in the source and target domains, and the corresponding ratings. Each domain of each task is divided into a support set and a query set.
To closely approximate a real scenario, we derive inspiration from the well-known meta-learning model matching network [50], use the latest fixed-length item sequence that the examined user has interacted with as the query set, and use the remaining items as the support set. Finally, we define the users or items that do not appear in the training phase as cold-start users or items.
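The support/query split described above can be sketched as follows (the interaction tuples are hypothetical, and `query_len` stands in for the fixed-length latest sequence):

```python
def split_task(interactions, query_len=10):
    """interactions: list of (item, rating) sorted oldest -> newest.
    The latest query_len records form the query set; the rest are support."""
    support = interactions[:-query_len]
    query = interactions[-query_len:]
    return support, query

# Hypothetical chronological history of one user in one domain.
history = [("item%d" % k, 3 + k % 3) for k in range(25)]
support, query = split_task(history, query_len=10)
```

This mirrors a real deployment: the model adapts on a user's older interactions and is evaluated on the most recent ones.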

Hierarchical Meta-Optimization
We utilize the idea of optimization-based meta-learning to optimize our model. We divide the input data into tasks according to different users. This can be understood as training a unique model for each user in the adaptation phase to better adapt to user interests and preferences.
As shown in Fig. 3, we divide the training process into two parts: relation-wise and semantic-wise updating. We hierarchically update different parameters during meta-training.
During relation-wise updating, the loss is computed via the task's support set and used to update the model based on a few steps of gradient descent for a task-adaptive model. Since the embedding matrix of the recommendation model occupies the vast majority of all parameters, updating all parameters at this stage would increase the computational cost and make it difficult to effectively approximate the task within a limited number of update steps. In addition, updating the parameters of the embedding layers during relation-wise updating would lead to frequent changes in user and item embeddings, which is not conducive to the model focusing on learning domain relations. Inspired by [40], we only update the parameters of the MLPs and DKMT in the relation-wise update.
We tried three different meta-optimization strategies; however, only the basic paradigm of gradient-based meta-learning is introduced here, and MetaCDR under each optimization strategy is described in Section 5.4. Fig. 3 shows the meta-optimization process of MetaCDR. To avoid repetition, we only show the equations in detail in Algorithm 1 and do not repeat them in the text. We divide the parameters that the model needs to optimize into three groups: 1) $\theta_e = \{\theta_u, \theta_s, \theta_t\}$, the parameters of the embedding networks; 2) $\theta_m = \{W^{\{1,\ldots,L\}}, b^{\{1,\ldots,L\}}\}$, the parameters of the fully connected neural networks; and 3) $\theta_h = H^{\{1,\ldots,L-1\}}$, the parameters of DKMT.
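A toy sketch of the relation-wise inner loop over the three parameter groups follows; this is our illustration, with a separable quadratic loss standing in for the support-set loss and finite differences standing in for backpropagation. Note that $\theta_e$ is left untouched, as described above:

```python
import numpy as np

def grad(loss_fn, params, key, eps=1e-5):
    """Central finite-difference gradient of a scalar loss w.r.t. one group."""
    g = np.zeros_like(params[key])
    for idx in np.ndindex(params[key].shape):
        p = {k: v.copy() for k, v in params.items()}
        p[key][idx] += eps
        up = loss_fn(p)
        p[key][idx] -= 2 * eps
        down = loss_fn(p)
        g[idx] = (up - down) / (2 * eps)
    return g

def relation_wise(params, loss_fn, lr=0.1, steps=2):
    """Inner loop: adapt only theta_m and theta_h; theta_e stays fixed."""
    p = {k: v.copy() for k, v in params.items()}
    for _ in range(steps):
        for key in ("theta_m", "theta_h"):
            p[key] -= lr * grad(loss_fn, p, key)
    return p

# Hypothetical per-group optima standing in for a task's support-set loss.
target = {"theta_e": np.array([1.0]), "theta_m": np.array([2.0]),
          "theta_h": np.array([0.5])}
loss_fn = lambda p: sum(float(np.sum((p[k] - target[k]) ** 2)) for k in p)

params = {k: np.zeros(1) for k in target}
adapted = relation_wise(params, loss_fn)
```

The semantic-wise (outer) update would then apply the query-set gradient to all three groups of the shared initialization, as in Algorithm 1.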
In the test phase, we use a small number of new user interaction records in the two domains to update the base model M base . With the advantages of meta-learning and the transfer of knowledge between domains, the model can quickly and accurately adapt to user preferences. After that, the model can provide rating predictions for other items.

Algorithm 1. Training of MetaCDR
Data: a set of meta-training tasks $\tau$; each task $\tau_u \in \tau$ consists of two support sets $\tau^{sou}_{sup}$ and $\tau^{tar}_{sup}$ from the two domains and two query sets $\tau^{sou}_{que}$ and $\tau^{tar}_{que}$ from the two domains.
Input: semantic-wise and relation-wise update step counts $s$ and $r$; global and local update learning rates $\alpha$ and $\beta$.
Result: the trained base model.
1: Randomly initialize the base model $M_{base}$ with the parameters $\theta = \{\theta_e, \theta_m, \theta_h\}$
2: while not converged do
3:   sample a batch of tasks $\tau_u \sim p(\tau)$
4:   for each task $\tau_u$ w.r.t. user $u$ do
5:     perform the relation-wise update via task $\tau_u$
6:   end
7:   perform the semantic-wise update
8: end

Pretraining Strategy
The combination of a complex network structure with meta-optimization, together with the large amount of data produced by combining two domains (via the Cartesian product), not only dramatically reduces the efficiency of the model (approximately 12 GB of GPU memory and 2 hours of training are required) but also introduces a risk of nonconvergence. Therefore, we design a pretraining method to optimize the training process.
We first use a method similar to neural collaborative filtering to train the two single-domain network parameter groups $\theta_e$ and $\theta_m$ with the traditional training method in an alternating manner. Then, we use the pretrained parameters to initialize the corresponding parameters in MetaCDR, randomly initialize the parameters $\theta_h$ of DKMT, and fix the parameters $\theta_e$. Here, we adopt a random strategy to select training samples from the combined data of the two domains. Finally, we obtain an evaluation model with a small number of training epochs.
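The pretraining handoff can be sketched as a small orchestration (our illustration; `pretrain_single_domain` is a stand-in for conventional non-meta training, and the parameter-vector sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

def pretrain_single_domain(n_params):
    """Stand-in for conventional (non-meta) training of one domain network;
    returns a trained parameter vector."""
    return rng.standard_normal(n_params) * 0.1

def build_metacdr_pt(n_embed, n_mlp, n_dkmt):
    """Initialize MetaCDR from pretrained parts, per the strategy in the text:
    copy theta_e and theta_m, randomly initialize theta_h, and freeze theta_e."""
    params = {
        "theta_e": pretrain_single_domain(n_embed),   # copied, then frozen
        "theta_m": pretrain_single_domain(n_mlp),     # copied, still trainable
        "theta_h": rng.standard_normal(n_dkmt) * 0.01,  # DKMT starts fresh
    }
    trainable = {"theta_m", "theta_h"}   # theta_e is fixed after pretraining
    return params, trainable

params, trainable = build_metacdr_pt(n_embed=32, n_mlp=64, n_dkmt=16)
```

Meta-training then only needs to adapt the small MLP and DKMT parameter groups, which is where the reported efficiency gain comes from.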
The pretrained model is called MetaCDR-PT. Section 5 proves that this pretraining method ensures the effectiveness of the model while greatly improving the training efficiency.

EXPERIMENTS AND DISCUSSION
In this section, we summarize the experimental results and analyze them to answer the following research questions.

Datasets
We choose two real-world datasets to evaluate our model: MovieLens 1M [18] and Douban. Table 1 shows the details of these datasets.
MovieLens 1M contains user rating records from IMDB for movies, as well as the features of users and movies. Similar to the method in [31], we divide the movies into a source domain (before 1998) and a target domain (after 1998) according to their release years, and the ratio is approximately 4:1. Then, to simulate a cold-start problem, we select the users with between 13 and 60 interaction records in each domain; the average gap in the interaction counts between the domains is approximately 24.77. The last ten interaction records are used as query sets, and the rest are used as support sets. In particular, for fairness, we use the support set in the evaluation data for the meta-learner as the training data for the non-meta-learning methods.
Douban is a real-world dataset crawled from the Douban website [67]. It contains many user ratings on movies, music, books and other items. We select movies and books as the source domain and target domain, respectively, and select users with between 13 and 80 interactions in both domains as the available data; the average gap in the interaction counts is approximately 17.48. Similar to the processing method used for the MovieLens dataset, we divide the data into a support set and query set for each task (user). For the non-meta-learning methods, the support sets in the evaluation set are also used as their training data.
For each dataset, the division ratio of the training, validation and test sets is 7:1:2. We set up four scenarios on each dataset. 1) Warm-Start: The model is evaluated with existing users and items. In addition, we adopt a real-world dataset from the e-commerce platform Amazon to study the effect of overlapping users, side information and feedback patterns on the models.
FM [42] is a classic method for recommendation based on the features of items and users. It can predict the personalized preferences of users by exploring the potential relationships between users and items through existing content and additional feature information.
NeuMF [20] is a state-of-the-art collaborative filtering model based on an MLP and generalized matrix factorization (GMF). We define its output module as a linear layer for rating prediction and embed the features of users and items as its inputs for the cold-start problem.
EMCDR [36] is an embedding and mapping approach for CDR. It first learns the embeddings of entities in the source domain and target domain and then uses a neural network to capture the mapping function between the embeddings of the same entity. In this paper, the two domains are set as the source domain and the target domain in turn.
MMoE [35] is a well-known multitask learning framework. It utilizes a gating network for each task on a mixture-of-experts structure. As suggested by [63], the embedding parameters are shared across all experts. The embedding vectors of users and items from the two domains are given to each expert. We set two towers to output scores for the two domains.
CSN [37] is a multitask model with a deep fusion network that was first applied in computer vision. Two networks are connected through cross-stitching to optimize the results with multitask learning.
SCoNet [22] is a state-of-the-art transfer learning model designed for CDR; it uses a parameter matrix to transfer knowledge between domains and uses Lasso to limit the degree of knowledge transfer.
MetaCS-DNN [1] optimizes an N-layer fully connected network to obtain embeddings and ratings with an idea similar to that of MAML. By converting each user into a task, it transforms the cold-start problem into a few-shot problem.
MeLU [26] is designed with a similar idea to that of MetaCS-DNN, except that it optimizes its personalized recommender network at all stages and only optimizes the general embedding network during the global update stage. That is, the parameters of the embedding network are updated only as the model learns about the commonalities between users.
MAMO [9] is designed with a memory-augmented neural network to store the personalized user gradient information, further improving the accuracy of recommendations in cold-start scenarios.
TMCDR [69] is a meta-learning-based embedding and mapping approach for cross-domain recommendation. Unlike EMCDR, TMCDR utilizes a meta-network for the mapping stage. For fairness, we set an MLP as the base model in the transfer stage for each domain and obtain the embedding vectors based on the trained embedding layers. Similar to EMCDR, the two domains are set as the source domain and the target domain in turn.
The source codes of FM, NeuMF, EMCDR, SCoNet, MeLU and MAMO are openly available, and we modify their data processing and output components to apply them in our experiments. We implement MetaCS-DNN with the code of MeLU, which follows a similar idea. We implement CSN ourselves.

Parameter Settings
For MetaCDR, we set MAML as the base meta-learner; the learning rates for semantic-wise and relation-wise updating are set to α = 0.01 and β = 0.001, respectively; the regularization parameter λ is set to 0.01; and the numbers of relation-wise and semantic-wise update steps are set to 5 and 1, respectively. The embedding dimensionality of each feature is set to 32. Two [32 × 8 → 64 → 64 → 1] MLPs are used as the basic model. The rectified linear unit (ReLU) is employed as the activation function, and optimization is conducted by adaptive moment estimation (Adam) [4]. We also use batch normalization [23] to speed up the convergence of the model. We set the batch size to 16 tasks, and the maximum numbers of epochs are set to 30 and 20 for MetaCDR on MovieLens and Douban, respectively. For MetaCDR-PT, the maximum numbers of epochs are only 10 and 5.
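The basic model described above can be sketched in PyTorch as follows. This is a minimal illustrative sketch, not the authors' code: the class name DomainMLP, the assumption that each input is the concatenation of 8 feature embeddings of dimension 32, and the optimizer wiring are our own.

```python
import torch
import torch.nn as nn

class DomainMLP(nn.Module):
    """One [32 x 8 -> 64 -> 64 -> 1] MLP per domain, with BatchNorm and ReLU."""
    def __init__(self, n_fields=8, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_fields * emb_dim, 64),  # 256 -> 64
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Linear(64, 1),                   # rating prediction head
        )

    def forward(self, x):
        # x: (batch, n_fields * emb_dim) concatenated feature embeddings
        return self.net(x)

# one network per domain, optimized jointly with Adam
source_net, target_net = DomainMLP(), DomainMLP()
optimizer = torch.optim.Adam(
    list(source_net.parameters()) + list(target_net.parameters()), lr=1e-3
)
```

A batch of 16 tasks would be flattened into interaction rows before being fed to each network, since BatchNorm1d operates over the batch dimension.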

Evaluation Metrics
We adopt three evaluation metrics, the mean absolute error (MAE), root-mean-square error (RMSE), and normalized discounted cumulative gain at rank K (nDCG@K), to evaluate MetaCDR and the other baseline models. Here, we set K = 5. The specific calculation methods are as follows:

$$\mathrm{MAE} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|I_u|} \sum_{i \in I_u} \left| r_{u,i} - \hat{r}_{u,i} \right|,$$

$$\mathrm{RMSE} = \frac{1}{|U|} \sum_{u \in U} \sqrt{\frac{1}{|I_u|} \sum_{i \in I_u} \left( r_{u,i} - \hat{r}_{u,i} \right)^2},$$

$$\mathrm{nDCG@}K = \frac{1}{|U|} \sum_{u \in U} \frac{\mathrm{DCG}_u@K}{\mathrm{IDCG}_u@K}, \qquad \mathrm{DCG}_u@K = \sum_{i=1}^{K} \frac{2^{r_{u,i}} - 1}{\log_2(i + 1)},$$

where U is the user set utilized in the test, I_u denotes the interaction records of user u, r_{u,i} and \hat{r}_{u,i} are the real rating and the predicted rating, respectively, and the items in DCG_u@K are ranked by predicted rating. The IDCG is the best possible DCG for each user. The MAE and RMSE measure the error incurred when predicting ratings, and lower MAE and RMSE values correspond to better model performance. The nDCG reflects the overall ranking quality for a given user, and a higher nDCG indicates better performance.
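The three metrics can be computed per user as in the following NumPy sketch of the standard definitions; the function names are our own, and the paper's exact averaging over users may differ.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between real and predicted ratings."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root-mean-square error between real and predicted ratings."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def ndcg_at_k(y_true, y_pred, k=5):
    """nDCG@k for one user: DCG of the predicted ranking over the ideal DCG."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(-y_pred)[:k]        # items ranked by predicted rating
    ideal = np.sort(y_true)[::-1][:k]      # best possible ordering (for IDCG)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum((2.0 ** y_true[order] - 1.0) * discounts[: len(order)])
    idcg = np.sum((2.0 ** ideal - 1.0) * discounts[: len(ideal)])
    return float(dcg / idcg) if idcg > 0 else 0.0
```

A perfect ranking gives nDCG@k = 1, and lower MAE/RMSE values indicate better rating predictions, matching the description above.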

Environment
All our experiments are conducted on a Linux server with a GPU (Tesla V100 with 32 GB of RAM) and CPU (Intel Xeon E5-2698 v4). The operating system is Ubuntu 16.04.6, and Python version 3.6.8 is used. The model is built on the deep learning library PyTorch, version 1.4.1.

Performance Comparison (RQ1)
In this section, we compare MetaCDR and its pretrained version MetaCDR-PT with several state-of-the-art baseline models. We design four scenarios for each dataset: warm-start, user cold-start, item cold-start, and user-item cold-start. Tables 2 and 3 show the performance of all models in the different domains of the two datasets under the four scenarios. MetaCDR and MetaCDR-PT outperform the state-of-the-art models in most datasets and scenarios, especially in the more severe cold-start scenarios. Further analysis of the results shows that the meta-learning methods usually perform better than the traditional methods and standard cross-domain methods when faced with a cold-start problem.
According to an in-depth analysis of the Amazon dataset, the average ratio of overlapping users to the total number of users in any two domains is less than 12% [24]; Fig. 6 shows three pairs of domains as examples. To make the experiments more realistic and to evaluate the robustness of the models, we set three small-overlap scenarios for each dataset with fewer overlapping users in the training phase. For fairness, we use the same user-item cold-start data to evaluate each scenario. Fig. 4 shows the performance of our method (MetaCDR), a meta-learning method (MeLU), a cross-domain method (SCoNet), and a traditional method (NeuMF) in the four scenarios. Evidently, as the number of overlapping users decreases, MetaCDR exhibits stronger robustness (a smaller increase in MAE) than the other baselines and achieves the best accuracy (lowest MAE).
To test the impact of different degrees of cold-start severity on model performance, we evaluate the traditional method (NeuMF), the meta-learning method (MeLU), the cross-domain method (SCoNet), and our MetaCDR while limiting the maximum size of each support set. The results in Fig. 5 show that as the support set shrinks, the performance of MetaCDR declines the least among all models. Therefore, MetaCDR is sufficiently robust and can address the cold-start problem well even when very limited data are available.

Hyperparameter Analysis (RQ2)
Next, we study the impacts of the hyperparameters on MetaCDR by adjusting them. To better demonstrate the behavior of our model in the most challenging scenario, the experiments below are all performed in the user-item cold-start scenario. In this section, we analyze the impacts of three hyperparameters on the effectiveness of the model: the numbers of semantic-wise and relation-wise update steps and the regularization coefficient λ.

Figs. 7 and 8 show the impact of the numbers of semantic-wise and relation-wise update steps on MetaCDR (MAE). Because similar experimental results are observed in terms of the RMSE and nDCG@5 metrics, we only report the MAE results; the number of update steps ranges from 1 to 5. The analysis shows that the impacts of the relation-wise and semantic-wise update steps on MetaCDR are relatively small in both the source and target domains; the model exhibits strong stability. The results show that the number of semantic-wise or relation-wise update steps has little effect on the model in the user-item cold-start scenario. However, we still choose to perform 5 steps in the relation-wise update phase because in other scenarios, multiple relation-wise updates often bring some improvement to the model.

Figs. 9 and 10 show the impact of the regularization coefficient λ, i.e., the sparsity of the DKMT parameters, on MetaCDR. We evaluate the effect of λ on the MovieLens and Douban datasets (in terms of the MAE, RMSE, and nDCG@5), selecting values from [0.0001, 0.001, 0.01, 0.1, 1]. The experimental results show that the effect is best when λ is set to approximately 0.01; as λ continues to decrease, the model performance changes only slightly, but as λ increases, the performance degrades significantly. This means that excessively sparse parameters limit the transfer of cross-domain knowledge.
Finally, we choose λ = 0.01 as the regularization coefficient of MetaCDR.
We explore the optimal locations and number of DKMTs through further experiments. The basic model we use is a [32 × 8 → 64 → 64 → 1] MLP, so we can add three DKMT structures: H_1 ∈ R^{256×64}, H_2 ∈ R^{64×64}, and H_3 ∈ R^{64×1}. We successively change the locations and number of DKMTs and evaluate their impacts. The results are shown in Table 4. Although the trends differ across domains, overall, the model with DKMTs mostly shows improvement of varying degrees over the model without DKMTs, which demonstrates that the DKMT structure is effective. Some models that have DKMTs with fewer parameters exhibit performance declines after a few epochs, so we train these models for fewer epochs, stopping early to obtain better results. Similarly, enabling all DKMT structures does not significantly improve the effectiveness of the model, as the use of too many parameters reduces the computational efficiency of MetaCDR; thus, we place DKMTs in the first two layers and omit them in the last layer.
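The DKMT cross-connections described above can be sketched as follows. This is a hypothetical illustration based on the description in the text (a transfer matrix H connecting matching layers of the two domain networks, with a Lasso/L1 penalty on H); the class name, the forward signature, and the zero initialization are our own assumptions.

```python
import torch
import torch.nn as nn

class DKMTLayer(nn.Module):
    """Hypothetical sketch of one DKMT cross-connection between the two
    domain networks: each domain's layer output receives a transfer term,
    the other domain's layer input projected through a matrix H whose
    L1 norm is penalized (weight lambda ~ 0.01) to keep transfer sparse."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # zero init: start with no cross-domain transfer (an assumption)
        self.H = nn.Parameter(torch.zeros(in_dim, out_dim))

    def forward(self, x_src, x_tgt, z_src, z_tgt):
        # z_d: this layer's own output in domain d (out_dim)
        # x_d: the same layer's input in domain d (in_dim)
        return z_src + x_tgt @ self.H, z_tgt + x_src @ self.H

    def l1(self):
        # Lasso term added to the training loss to sparsify transfer
        return self.H.abs().sum()

# e.g., the first cross-connection H_1 in R^{256 x 64}
dkmt1 = DKMTLayer(256, 64)
```

With DKMTs placed after the first two layers only (as chosen above), two such modules of shapes 256×64 and 64×64 would be instantiated.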

Impact of Meta-Learners
To test the impacts of different meta-learners on MetaCDR, we choose four meta-learners. MAML [12] is a classic gradient-based meta-learner. It uses support sets for task-adaptive local updates and query sets for global updates; we adapt it to the MetaCDR scenario. During the training procedure, first, all the parameters in the base model M_base are initialized, and a copy M′_base of the parameter set is generated. Second, we randomly select a batch of tasks t_u (for users u) and feed their support sets into the MetaCDR model to obtain prediction results. Third, we calculate the loss L and gradient G from the results to update the parameters θ_m and θ_h of M′_base and obtain the meta-model M_meta. This step explores the complex cross-domain relationship information and can be repeated several times to achieve the desired effect. It is worth noting that at this step, we have not changed the original model M_base but only updated M′_base. Fourth, the query sets are fed into the meta-model M_meta to obtain the loss L′ and gradient G′. In this step, we clamp the gradient range so that the model is adjusted more conservatively (this is essential for sensitive meta-learning models). Finally, all parameters in the base model M_base are updated with the gradient G′. In general, the overall form of the model update can be expressed as

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \mathcal{L}_q\bigl(\theta - \alpha \nabla_{\theta} \mathcal{L}_s(\theta)\bigr),$$

where L_s and L_q are the losses on the support and query sets, respectively. Thus far, the model has completed one batch update, and this process is repeated until the model converges, yielding the trained meta-model.
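The five-step procedure above can be sketched as a functional MAML update on a toy least-squares model. This is an illustrative sketch, not the MetaCDR implementation: the parameter dictionary, the quadratic loss, and the clamp range are our own choices; only the overall support/query update pattern and the gradient clamping follow the text.

```python
import torch

def maml_step(theta, tasks, alpha=0.01, beta=0.001, n_inner=5):
    """One meta-batch update. theta: dict of tensors (requires_grad=True).
    tasks: list of ((x_s, y_s), (x_q, y_q)) support/query pairs."""
    def loss(params, x, y):
        # toy linear model standing in for the base recommender network
        pred = x @ params["w"] + params["b"]
        return ((pred - y) ** 2).mean()

    meta_grads = {k: torch.zeros_like(v) for k, v in theta.items()}
    for (xs, ys), (xq, yq) in tasks:
        # local (task-adaptive) updates on a copy of the parameters;
        # the original theta is untouched, as in steps 1-3 of the text
        fast = {k: v.clone() for k, v in theta.items()}
        for _ in range(n_inner):
            g = torch.autograd.grad(loss(fast, xs, ys),
                                    list(fast.values()), create_graph=True)
            fast = {k: v - alpha * gi for (k, v), gi in zip(fast.items(), g)}
        # query loss drives the global update of the base parameters (step 4)
        g = torch.autograd.grad(loss(fast, xq, yq), list(theta.values()))
        for k, gi in zip(theta, g):
            meta_grads[k] += gi.clamp(-1.0, 1.0)  # clamp for conservative updates
    # step 5: update the base model with the accumulated query gradients
    return {k: (v - beta * meta_grads[k] / len(tasks)).detach()
            for k, v in theta.items()}
```

The `create_graph=True` in the inner loop keeps the second-order path so that the query gradient flows back to the original parameters, which is what distinguishes full MAML from its first-order variants discussed next.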
Meta-SGD [30] is another gradient-based meta-learner based on MAML. The difference is that Meta-SGD can not only update the parameters of the network but also adaptively adjust the learning direction and learning rate in meta-optimization.
First-Order MAML (FOMAML) [38] implements meta-updates with the first-order gradient. Unlike MAML, FOMAML performs local updates for each task from a batch in turn, and the global update is based on the parameters after the local updates. FOMAML is faster and consumes less memory because it does not need to compute Hessian matrices.
Reptile [38] is a new first-order meta-learner. Unlike FOMAML, Reptile does not need a training-test split for each task, which makes it more flexible in certain scenarios. Specifically, Reptile uses the same set of samples from one task for multistep local and global updates. Here, we update the embedding parameters in both the semantic-wise and relation-wise updates:

$$\theta \leftarrow \theta + \epsilon \, \frac{1}{K} \sum_{k=1}^{K} \left( \tilde{\theta}_k - \theta \right),$$

where K is the number of tasks in each task batch, \tilde{\theta}_k denotes the parameters after the multistep local updates on task k, and ε is the outer learning rate.
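Reptile's update rule can be sketched on a toy least-squares task as follows; the function name and the quadratic model are our own illustrative choices.

```python
import numpy as np

def reptile_step(theta, task_batches, inner_lr=0.01, outer_lr=0.001, n_inner=5):
    """One Reptile meta-update: run multistep SGD on each task using the same
    samples, then move theta toward the average adapted parameters."""
    adapted = []
    for x, y in task_batches:              # one (x, y) sample set per task
        w = theta.copy()
        for _ in range(n_inner):           # multistep local updates, no query split
            grad = 2.0 * x.T @ (x @ w - y) / len(x)
            w -= inner_lr * grad
        adapted.append(w)
    # global update: theta <- theta + eps * mean_k(theta_k~ - theta)
    return theta + outer_lr * (np.mean(adapted, axis=0) - theta)
```

Because the adapted parameters are simply averaged, no second-order gradients are needed, which is why Reptile is faster than full MAML.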
According to the results shown in Table 5, in most scenarios, MAML performs similarly to Meta-SGD and FOMAML, and MAML yields the most stable results. Reptile is faster but usually performs worse. Therefore, we adopt MAML in our model.

Ablation Experiment (RQ3)
Finally, we study the impact of meta-learning (MAML) and transfer learning (DKMT) on MetaCDR with ablation experiments.
Impact of Meta-Learning: To study the impact of meta-learning, we design three ablation models for comparison with MetaCDR: 1) All parameters are updated in the relation-wise update step: MetaCDR-AP. 2) As in traditional transfer learning, all training data are used for pretraining, and the base model is fine-tuned when new users arrive: MetaCDR-FT. 3) Only traditional methods are used to optimize the basic model, and the information of new users is also treated as training data without adaptation: MetaCDR-BM.
Figs. 11 and 12 show the comparisons among the above ablation models and MetaCDR. With respect to the three metrics in the two domains, MetaCDR always performs best. Moreover, MetaCDR-AP performs slightly worse than MetaCDR because the impact of hierarchical parameter optimization, which allows the model to focus on different information, is noticeable. The effects of MetaCDR-FT and MetaCDR-BM lag significantly behind those of MetaCDR, which shows that meta-learning plays a vital role in the model.
Impact of Transfer Learning: To understand how DKMT impacts this model, similar to the previous analysis, we design three ablation models for comparison with MetaCDR: 1) Two different cross-network parameters H s and H t are set for each DKMT: MetaCDR-DC. 2) DKMT is replaced with CSN; that is, the same weight is assigned to all information: MetaCDR-CSN. 3) Two recommendation networks are optimized independently in different domains, and only the user embedding is shared. The two networks are supervised by the labels obtained from the two domains separately and are optimized in an alternating manner: MetaCDR-OI.
Figs. 13 and 14 show the comparisons between MetaCDR and the above three ablation models. MetaCDR still performs best. The performance of MetaCDR-DC decreases because introducing many additional parameters clearly reduces the effectiveness of the model. The results of MetaCDR-OI are worse than those of the first two models, indicating that the DKMT structure is necessary. It is worth mentioning that the MetaCDR-CSN model yields the worst results in most scenarios and for most metrics because forcing all features to carry the same weight is not always conducive to the transfer of information.

Impact of Pretraining (RQ4)
In this section, we demonstrate the effectiveness of the proposed pretraining strategy through experimental comparisons. Table 6 shows the time and space consumption of the four different methods. FOREC [3] is a cross-market recommendation algorithm with pretraining. In fact, FOREC's pretraining strategy is similar to the meta-training in MetaCDR; the difference is that FOREC trains the meta-model on only a single network and adapts it to different markets. Therefore, for fairness, FOREC's network structure is set to be the same as MetaCDR's single-domain network; i.e., it uses the same embedding and MLP. MetaCDR-PT+ is another pretrained variant of MetaCDR that is trained with additional nonoverlapping users during single-domain pretraining. From the results, we find that the time and space efficiency of MetaCDR-PT/PT+ is much higher than that of the other two strategies.

Model Applicability Study (RQ5)
The Amazon dataset is used to evaluate the performance of MetaCDR in scenarios without side information and with implicit feedback. Unlike the previous datasets, Amazon has no user or item features, so we only conduct experiments in the user cold-start scenario. Due to the change in the feedback pattern, we introduce the new metrics Hit@10, area under the curve (AUC), and mean average precision at 10 (MAP@10) in this section. Furthermore, to study the impact of the network architecture on MetaCDR, we introduce MetaCDR-PNN in this experiment, which uses a PNN as the base network.
The experimental results are shown in Table 7. The meta-learning methods still show excellent performance in scenarios without side information and with implicit feedback, and MetaCDR achieves the best results on most metrics. Although the performance of MetaCDR-PNN is slightly worse than that of MetaCDR, it is still better than those of the other methods. We only conduct user cold-start experiments here to examine the difference in knowledge transfer between domains; the improvements in the user-item cold-start and item cold-start scenarios are not significant due to the lack of sufficient side information under our meta-learning setting.

Visualization (RQ6)
To understand what information is transferred by DKMT and what is adjusted via the adaptation of meta-learning, we show an 8 × 8 portion of the second layer of DKMT parameters in three cases in Fig. 15. We find that DKMT learns a unique weight for each feature dimension. In Fig. 15, the lighter the color, the closer the weight is to 0. A light-colored area indicates that the feature in this dimension plays a small role in cross-domain knowledge transfer, so the model learns a smaller weight for it. The dark blue and dark red areas indicate that the information in these dimensions plays an important role; i.e., DKMT can use the differences in user preferences between domains to capture the evolution of interest.
A comparative analysis of the heatmaps obtained under the above three conditions reveals the role of adaptation in meta-learning in MetaCDR. After updating the model with the information of two different users, the two DKMTs obtain different parameters. That is, in MetaCDR, a personalized model can be generated quickly for each user through meta-learning.

CONCLUSION
We construct a novel recommendation model called MetaCDR based on meta-learning and transfer learning to solve the cold-start problem through cross-domain knowledge and a model optimization strategy.
MetaCDR implements cross-domain knowledge transfer in a meta-learning setting through a DKMT module and updates its parameters hierarchically with meta-learning to learn the complex relationships between the given domains and an appropriate embedding method for user and item features. Moreover, we propose a novel pretraining strategy to make the developed model more applicable. The experimental results prove that the effect of MetaCDR is significantly better than those of the state-of-the-art models in various scenarios.
Renchu Guan (Member, IEEE) received the PhD degree from Jilin University, Changchun, China, in 2010. He is currently a professor with the College of Computer Science and Technology, Jilin University, China. He has published more than 40 papers. His research has been featured in IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Geoscience and Remote Sensing, Nature Communications, etc. He was the recipient of several grants from NSFC. His research interests include machine learning, bioinformatics, and knowledge engineering.
Haoyu Pang received the BSc degree in computer science and technology from Jilin University, Changchun China, in 2020. He is currently working toward the graduate degree with the College of Computer Science and Technology, Jilin University, Changchun, China. He is supervised by Prof. Renchu Guan. His major research interests include machine learning and recommender systems.
Fausto Giunchiglia received the PhD degree in computer engineering from the University of Genoa, Faculty of Engineering. He is currently a full professor with the Faculty of Science, University of Trento, Italy. His research interests include artificial intelligence, formal methods, knowledge management, and agent-oriented software engineering.