Multi-Objective Deep Reinforcement Learning for Recommendation Systems

Most existing recommendation systems (RSs) are primarily concerned with the accuracy of rating prediction and tend to recommend only popular items. However, non-accuracy metrics such as novelty and diversity should not be overlooked. Existing multi-objective (MO) RSs employ collaborative filtering combined with evolutionary algorithms to handle bi-objective optimization. Collaborative filtering suffers from the cold-start problem and is vulnerable to highly sparse environments, while evolutionary algorithms suffer from premature convergence and the curse of dimensionality. These limitations prompted this work to propose deep reinforcement learning (DRL) approaches for MO optimization in RSs. Several DRL works are available, but none has addressed MO RS problems. In this study, the performance of the proposed DRL approaches, which are based on Deep Q-Network, was investigated on the MO recommendation problem. The approaches were evaluated on a movie recommendation dataset using three conflicting metrics, namely precision, novelty, and diversity. The results demonstrate that the DRL approaches perform well in MO optimization and can recommend precise items while achieving high novelty and diversity, compared with a benchmark that uses a probabilistic-based multi-objective approach based on an evolutionary algorithm (PMOEA). Although PMOEA secured a higher average precision, it has lower novelty and diversity than the proposed DRL approaches. The DRL approaches surpassed the benchmark in the average of maximum novelty and the average of mean diversity, which indicates that a trade-off between accuracy and non-accuracy metrics is inevitable. In addition, the experiments revealed that incorporating user latent features enhanced the recommendation quality.


I. INTRODUCTION
The volume of information and data is growing exponentially, and vast amounts of information are available at our fingertips through online applications. Excessive information can make it difficult for online users to find what interests them or the information they are looking for. Therefore, recommendation systems (RSs) are applied to direct users through the vast information space toward items that could fulfill their needs. A recommendation algorithm aims to provide a list of relevant items that the user might be interested in, in order to secure user satisfaction and keep users active in the system. From a business perspective, an RS is a crucial and valuable tool for boosting revenue. For instance, an online streaming platform hosts a few million movies and series and thus requires a recommendation engine to generate playlists that anticipate its users' interests. When the suggested items match user interest, the subscription rate is expected to increase and the user's online streaming experience to improve.
Approaches for RSs may be categorized into six classes [1] as follows:
i. Context-Aware RSs, which provide recommendations according to the user's context.
ii. Group RSs, which consider the preferences of a group of users instead of a single user to generate recommendations.
iii. Multi-Criteria RSs, which address user preferences over various aspects of an item (such as cleanliness and location, for the recommendation of a hotel).
iv. Cross-Domain RSs, which utilize knowledge inferred in a source domain (e.g., movie) for recommendation to users in a target domain (e.g., music).
v. Multi-Stakeholder RSs, which consider several perspectives of users to generate the recommendation.
vi. Multi-Task RSs, which utilize ensemble methods to combine the output of several RSs, typically through a joint optimization over a shared network representation.
Most traditional RS approaches [2]- [9] focus only on the accuracy of rating prediction or on items with high ratings. Recently, several studies [10]- [13] have shown that non-accuracy metrics, including novelty and diversity, are significantly correlated with user satisfaction. Another finding [14] shows that diversification is one of the noteworthy factors that positively affect user satisfaction. Other scholars [13], [15] also reported that favourable recommendation quality is highly correlated with the novelty and diversity of the items. In other words, focusing solely on accuracy does not secure high-quality recommendation, since highly accurate items do not assure user satisfaction [16]. The argument is not that accuracy metrics should be neglected, but rather that other evaluation metrics should be considered simultaneously, hence multi-objective (MO) recommendation. This motivates the present work on solving the MO recommendation problem.
Among classical RSs, content-based filtering (CB) [5], [6], [8] and collaborative filtering (CF) [2]- [4] are commonly used. According to [9], [17], the former is limited in handling inter-dependent events, whereas the latter struggles with the cold-start problem and data sparsity [9], [17]- [23] when it lacks sufficient historical information on the relationship between users and items. A review study [24] on social media analysis by machine learning indicated that typical RS algorithms such as matrix factorization or Support Vector Machine (SVM) also suffer from cold-start, serendipity, and scalability problems. In addition, the CB method has difficulty suggesting items from categories that are new to the users, since it focuses only on content or item groups similar to what the user has already experienced [25]. Furthermore, CF techniques are limited in their ability to include side features for the query, such as the user's latent information. Hybrid approaches [26]- [28] tackle some limitations of the CB and CF approaches, but the main weakness of dealing with a new user or an item that has never been experienced persists [29]. Moreover, traditional recommendation approaches fail to consider feedback from users [30] and are inadequate for handling the MO problem.
Scalarization and population-based heuristic methods are the common techniques for multi-objective RSs (MORS) [1]. Scalarization transforms a multi-objective problem (MOP) into a single-objective problem (SOP), so that most existing optimization methods for SOPs can be reused. Population-based heuristics, in turn, utilize evolutionary algorithms (EAs) or swarm intelligence methods (SIs) to produce the Pareto optimal set. Scalarization methods are popular because of their simplicity. However, unlike Multi-Objective Evolutionary Algorithms (MOEAs), they may not be able to handle non-convex problems. MOEAs can handle both convex and concave problems, but they may suffer from premature convergence at local optima and from weak diversity, and they typically run into efficiency issues when processing large datasets.
Most of the existing MO optimization frameworks [31]- [33] integrate a genetic algorithm (GA) with the classical CF method despite being time- and resource-consuming [34]. This cost stems from the large number of iterations required to search for and populate the solutions, which constrains scaling up to large-scale optimization tasks. GA is a prominent technique among evolutionary computing (EC) or evolutionary algorithm (EA) approaches [35], and premature convergence is one of the critical weaknesses of EC approaches [36]. In GA, only the fittest solutions are selected, and this hill-climbing-like behavior leads to the premature convergence problem. In addition, with EC approaches [37] it is difficult to achieve a good density of points and to converge to optimal solutions [38].
There are also approaches that focus on personalization in MORS, such as the preference-based method called Extreme Dominance and Statistical Significance Tests, which defines a new Pareto-based dominance relation that guides the optimization search according to users' preferences [39]. Multi-criteria RSs based on deep learning, such as deep autoencoders, are employed to exploit the non-trivial, nonlinear, and hidden relations between users with regard to multi-criteria preferences [40]. A tensor model has also been used in multi-criteria RSs [41]; it combines aspects (e.g., users or countries, restaurants, multiple ratings, and cultural groups) and applies factorization (e.g., higher-order singular value decomposition) to process the inter-relations among the aspects, predict the missing values in the model, and then predict the rating.
In the last decade, reinforcement learning (RL) approaches have attracted the attention of many researchers as their applications have steadily increased. RL approaches have risen to prominence for solving complex decision-making problems. Q-learning [42] is one of the basic RL techniques and has been widely researched in various fields including electric power management [43], the Internet of things [44], and RSs [45]. Despite its demonstrated ability to learn an optimal strategy dynamically, the tabular mapping approach is inefficient in high-dimensional environments [46]. Therefore, various deep reinforcement learning (DRL) approaches such as Deep Q-Network (DQN) [47] and Deep Recurrent Q-Learning [48] have been proposed to overcome the shortcomings of basic tabular RL approaches in large-scale environments.
Several researchers have demonstrated that DRL approaches, such as Q-learning-based approaches, outperform the EC approach in various MO problems [49]- [51]. RL approaches in MO problems can optimize power consumption and voltage stability, and the DRL approach is better than the EA at attaining Pareto solutions and yields more accurate optimal points [50]. A significant study [49] on the multi-objective traveling salesman problem demonstrated that the DRL approach outperformed MO-based EC approaches in terms of solution convergence and handling of large sparse data. The authors presented a non-iterative solver and demonstrated that the DRL approach is more effective than EC approaches for solving MO problems.
The ability of DRL to explore the environment and make decisions autonomously has become very useful in RS applications. Several researchers have demonstrated the practicality of DRL in complex RS environments with single-objective algorithms [52]- [55]. However, a thorough search of the relevant literature indicates a lack of adequate research on the application of DRL techniques to the MO problem in the RS domain. Building on the works discussed above, this work investigates the potential of DRL approaches, and of RL more generally, to handle MO optimization problems in RSs. In this context, we developed a DQN-based approach to solve the MO problem in RS (called DQNMORS) and evaluated it according to three metrics, namely accuracy, novelty, and diversity. The salient features of the proposed MODRL approaches are as follows:
1) Our approaches do not rely on a rating predictor for MORS and optimize three evaluation metrics simultaneously, namely precision, novelty, and diversity. This paper also presents an extensive comparison between optimization by the scalarization method and by the Pareto filtering method.
2) Our approaches are based on DQN (called DQNMORS) and incorporate user latent features to help improve recommendation; we investigate extensively and show the impact of learning user latent features on performance.
3) Our approaches consider time-sequential rating data as one of the input types, and we study the impact of learning sequential rating data using a recurrent layer (through a model called recDQNMORS).
The remainder of this paper is organized as follows. Section 2 discusses literature review and related works. Section 3 presents the proposed DRL approaches for MO optimization in movie recommendation. Section 4 discusses the experimental results. Finally, Section 5 concludes the salient findings of this study.

II. LITERATURE REVIEW
As discussed in the previous section, traditional approaches such as CF, content-based, or even hybrid approaches suffer from the cold-start issue and are inadequate for learning autonomously from a dynamic environment [17], [24], [25], [30]. Moreover, these approaches are insufficient for handling MO problems. Most general RSs tend to serve users high-rated or popular items, or items similar to the user's previous preferences, in order to achieve high prediction accuracy. Although these accuracy metrics are adequate for evaluation purposes, recommendation quality should not rely completely on them. In other words, non-accuracy metrics should be considered as well [10]- [13], [16] in order to provide better recommendation quality.

A. MULTI-OBJECTIVE RECOMMENDATION APPROACHES
The existing techniques applied in RSs for handling the MO optimization problem are mostly EC techniques [31], [36], [56]- [58]. GA is one of the most popular approaches in the EC family, which is inspired by biological evolution mechanisms; examples include the nondominated neighbor immune algorithm [33], the decomposition-based MO evolutionary algorithm (MOEA/D) [32], and the nondominated sorting genetic algorithm II (NSGA-II) [31], [57]. These approaches are limited not only by time complexity and resource consumption [34], because they require a large number of iterations and a large population size for large-scale optimization [49], but also by premature convergence [34]. Furthermore, EC approaches have difficulty converging to optimal points [38] in higher-dimensional MO problems. Moreover, such approaches primarily perform optimization and must be coupled with other prediction algorithms to generate the recommendation list. Most existing MO studies employ the CF technique to predict ratings prior to optimization using EC (see Fig. 1): the pipeline starts with rating prediction, generates a candidate list, and then performs MO optimization.
In large RS applications such as e-commerce platforms, the database usually contains a vast number of users and at least millions of items that are actively browsed. However, only a small portion of the items are rated by users. Thus, it is impractical to predict the relationship between every user and all items, since both dimensions keep growing. Consequently, the many missing rating values leave the user-item matrix sparsely filled. Under such data sparsity, the CF approach also has difficulty computing the similarity between users or items to identify appropriate items to recommend [2]. The challenge grows further when recommendations must be generated for new users or items added to the system, since the EC technique is unable to perform optimization without complete item-rating data. The CF method learns only from the rating matrix, and it is difficult for it to learn other potentially useful features such as user latent features. In contrast, the adaptive self-learning nature of DRL, through environment exploration and the epsilon-greedy policy [47], gives the algorithm the capacity to cope with large, complex, and even sparse environments. With the epsilon-greedy policy, the DRL agent is allowed to explore the environment to search for better directions and to avoid being trapped at local minima or maxima. The agent performs a random action with probability ε and takes the greedy action with probability 1 − ε. In other words, the ε parameter determines the balance between exploration and exploitation, as shown in (1).
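The epsilon-greedy rule described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation; the function name `epsilon_greedy` and the representation of Q-values as a plain list are assumptions made for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values (hypothetical helper).

    With probability epsilon a uniformly random action is chosen (exploration);
    otherwise the action with the largest Q-value is chosen (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice; ties are resolved in favour of the lowest index.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon = 0` the rule is purely greedy, and with `epsilon = 1` it is purely random; values in between trade exploration against exploitation.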
The epsilon-greedy algorithm ensures that the whole action space can be explored by maintaining a certain exploration probability. Fig. 2 illustrates the interaction between an RL agent and its environment, where the agent observes the state from the environment and takes an action accordingly. Each action performed by the agent receives a reward as feedback. Several studies [53], [54], [59]- [64] have demonstrated the robustness of RL algorithms in complex RS applications. As suggested by [46], a function approximation algorithm such as the DRL approach is more suitable for problems with a vast state space than a tabular mapping function such as tabular Q-learning. One of the most extensively deployed DRL algorithms is DQN [47]. It utilizes a deep neural network to approximate Q-values and has been examined in a few innovative RS applications [59], [60], [62].
Inspired by a study [48] that highlighted DQN's advantage in RS by using multi-step, user-specific interactive recommendation with an explicit feedback mechanism, we represent the recommendation model as follows in order to adapt RL interaction to the dynamic environment:
• States, S. The state s ∈ S is the feature representation that expresses the interaction between a user and a movie; it represents the user's behavior in ascending timestamp order.
• Actions, A. An action a is executed on a state s. Each a ∈ A is a recommended item list with a fixed length L for each user. The items here are movies, so the action output is a list of movies, each denoted by a unique ID number.
• Rewards, R. R(s, a) is the immediate reward obtained by the agent for every action a executed in state s. Since the goal of the MO agent is to optimize the evaluation metrics, the reward is the summation of the metric values. A higher overall reward indicates a likely better performance.
A few studies [49]- [51], [65], [66] have demonstrated the robustness of RL algorithms in tackling MO problems (such as power system [38] and traveling salesman problem [54]) and in overcoming the shortcomings of EC techniques. The strong performance of DRL methods in MO problems provides strong motivation for this study to solve the MORS optimization problem using DQN-based approaches. When dealing with MO issues, there is no perfect solution that reaches the best value in all goals concurrently, and it is inevitable to sacrifice at least one target in order to enhance another.
To further explore the potential of RL in RS, a study that applied Q-learning in single-objective RS [67] pointed out that present ratings are significantly correlated with the sequence of past ratings. This finding encourages this work to further investigate the effect of learning ratings in sequence by using an LSTM layer. Sequential rating information refers to user-item rating data arranged in an ordered sequence, in which the position of each entry is significant. In [67], the entries of the training data included the explicit order of the ratings given by the user to the movies, which required additional effort to label the sequences of a large amount of data. In this work, the training data is sorted in ascending order of timestamp and stacked to be fed as input to the recurrent-based DQN agent without an explicit label of the rating order, as the recurrent layer can capture the sorted sequential data directly.
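As a rough illustration of this preprocessing step, the sketch below groups rating records by user and orders each user's ratings by ascending timestamp. The record layout `(user, item, rating, timestamp)` and the helper name `build_rating_sequences` are assumptions for the example, not details taken from the paper.

```python
from collections import defaultdict


def build_rating_sequences(ratings):
    """Group (user, item, rating, timestamp) records by user, oldest first.

    The record layout is an assumption for this sketch. No explicit order
    label is attached; the position in each list encodes the order.
    """
    per_user = defaultdict(list)
    for user, item, rating, timestamp in ratings:
        per_user[user].append((timestamp, item, rating))

    sequences = {}
    for user, events in per_user.items():
        events.sort(key=lambda e: e[0])  # ascending timestamp, stable for ties
        sequences[user] = [(item, rating) for _, item, rating in events]
    return sequences
```

The resulting per-user lists could then be stacked and passed to a recurrent layer, which consumes them in order.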
Studies on online RSs based on DRL are also available; they exploit the users' responses to the current recommendation results (immediate feedback) to optimize the recommendation strategy. In [68], a generative adversarial network (GAN) is used to exploit the users' immediate feedback with Q-learning and an actor-critic network. That work also proposed a deep generative adversarial network-based collaborative filtering approach to optimize the negative sampling method. Despite using a different DRL approach, it is similar to our proposed work in the sense that it uses a Q-network. However, since it does not handle MORS, we do not consider it as our benchmark.
Another popular DRL approach, policy gradient, is used in combination with an RNN [69] to propose a novel top-N model for long-term prediction in a single RS that focuses on hit rate (the fraction of users for whom the correct answer is included in the recommendation list) and Normalized Discounted Cumulative Gain (NDCG), which measures the quality of ranking. The authors also proposed an extended GRU cell named EMGRU, which can efficiently enhance recommendation accuracy by incorporating additional historical information to address the warm-start scenario in long-term prediction. A similar work on the policy gradient method with a dynamic recurrent scheme [70] designs a profile constructor with autonomous learning ability to make personalized course recommendations. The approach addresses the exploration-exploitation trade-off in constructing user profiles, while the recurrent scheme with context-aware learning exploits the user's current knowledge and explores future preferences.

B. EVALUATION METRICS
Optimization between the competing metrics is necessary to achieve good values in accuracy and non-accuracy metrics concurrently. As indicated by [31], there is a trade-off between the accuracy and the diversity of a recommendation: improving one sacrifices the other. The study [33] likewise found that the conflict between matching quality and the diversity function requires optimization. The relationship between accuracy and novelty is also competing, as shown in [32]. Hence, both accuracy and non-accuracy metrics are considered in the proposed algorithm.
Accuracy, commonly also known as the precision metric, is an essential evaluation tool that measures how precise the prediction results are. It is measured as the level of correctness between the predicted ratings and the actual ratings given by the user. It also indicates the proportion of recommended items with a high rating in the user's list of preferred items. Several studies [31]- [33], [57], [71] have used the precision function adopted in this work to evaluate each recommendation list, as defined in (2).
where Lu is the predicted recommendation list that contains the items for user u, Lu = [x1, x2, …, xn], and Tu is the list of actual items in the test set that user u rated highly. A high-rating item is an item that the user has rated 3 or above. L is the length of the recommendation list. A recommendation receives a higher precision when more of the predicted items appear in the test set's high-rated item list.
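A minimal sketch of this metric follows, assuming that (2) takes the usual form of the number of recommended items found in Tu divided by the list length L; the function name `precision` and its arguments are hypothetical.

```python
def precision(recommended, liked, list_length=None):
    """Fraction of the recommendation list found among the high-rated test items.

    Assumes the common form |Lu ∩ Tu| / L; `recommended` plays the role of Lu
    and `liked` the role of Tu. When `list_length` is omitted, the length of
    the recommendation list is used as L.
    """
    if list_length is None:
        list_length = len(recommended)
    if list_length == 0:
        return 0.0
    hits = len(set(recommended) & set(liked))
    return hits / list_length
```

For example, if two of four recommended movies appear in the user's high-rated test items, the precision is 0.5.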
The diversity function, in turn, quantifies the difference between the items in the recommendation list. This difference can be described by the variety of topics among the recommended items. Some works [33], [71] use intra-user diversity to assess the capability of recommending different items to a user. In [31], another kind of diversity is proposed, which is based on Shannon's entropy. The measurement in [31] is more comprehensive since it comprises three principal parts: the topic distribution, the number of different topics, and the distribution of topics for each item in the recommendation list. Hence, the diversity used for evaluating the recommendation list is formulated as in (3).
where Div(Lu) captures the number of topics and their distribution in the recommendation list Lu, in terms of the number of topics included in each item xi and the total number of topics in the recommendation list. More precisely, the diversity function is related to the topics of the items in the recommendation list.
Novelty is inversely related to the popularity of the recommended items. It measures the ability to recommend low-popularity items to the user, assuming that such items, which lie in the long tail, are considered novel by the user. According to [31], [57], the novelty function is defined as in (4).
where M is the total number of users and Nα is the number of ratings for item α. Recommended items with lower popularity, i.e., fewer ratings received, are considered novel and have a higher novelty value according to (4).
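The sketch below illustrates one common way to turn these quantities into a novelty score: the mean self-information log2(M / Nα) over the recommended items. This specific form is an assumption made for illustration and may differ in detail from (4); the function name `novelty` and the dictionary `rating_counts` are hypothetical.

```python
import math


def novelty(recommended, rating_counts, num_users):
    """Mean self-information log2(M / N_alpha) over the recommended items.

    Assumed form for this sketch: `num_users` is M and `rating_counts[item]`
    is N_alpha. Items with fewer ratings contribute a larger value.
    """
    if not recommended:
        return 0.0
    total = 0.0
    for item in recommended:
        n_alpha = rating_counts[item]  # number of ratings received by the item
        if n_alpha <= 0:
            raise ValueError("item %r has no ratings" % (item,))
        total += math.log2(num_users / n_alpha)
    return total / len(recommended)
```

Under this form, an item rated by every user contributes zero, while a long-tail item rated by few users contributes a large positive value.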
As demonstrated in the MO optimization research works [31], [32], the accuracy indicator and the non-accuracy indicators are contradictory. Existing works often focus only on bi-objective optimization between accuracy and one non-accuracy metric, i.e., either accuracy against diversity or accuracy against novelty. In order to evaluate the robustness of the proposed method, both diversity and novelty are optimized simultaneously with accuracy as an MO problem.

C. MULTI-OBJECTIVE OPTIMIZATION
Generally, an MO problem involves more than one objective, and there is no single best solution; instead, there may be several solutions. An MO problem can therefore be described mathematically as in (5).
where x is solution, n is the number of objective functions, and X is the set of feasible solutions. The purpose of MO optimization is to achieve the optimal solution by trade-off to a certain degree on any objective values, and each objective function is represented by a vector in multi-dimensional space.
The MO optimization methods are mainly divided into scalarization and Pareto methods [72]. The former transforms the multiple objectives into a scalar fitness function using the weighted-sum approach, as shown in (6).
where w is the weight assigned to each objective.
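For reference, the MO problem and its weighted-sum scalarization are commonly written as follows. This is the standard textbook form, stated with the symbols defined above (x, X, n, and the weights w); the maximization direction and the non-negativity of the weights are assumptions made here because all three metrics in this study are to be maximized.

```latex
\max_{x \in X} \; F(x) = \bigl(f_1(x),\, f_2(x),\, \ldots,\, f_n(x)\bigr),
\qquad
F_{\mathrm{ws}}(x) = \sum_{i=1}^{n} w_i \, f_i(x), \quad w_i \ge 0 .
```

The first expression treats the objectives as a vector, whereas the second collapses them into a single scalar that any single-objective optimizer can handle.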
In the Pareto method (7), a solution xa is Pareto optimal if there is no other solution xb such that f(xb) dominates f(xa). Improving one objective of a non-dominated solution generally requires degrading another objective. A non-dominated solution is also referred to as a Pareto optimal solution. Based on the literature search, precision, novelty, and diversity are the metrics most commonly combined for optimization in MORS and are thus the focus of this study. However, most existing works do not provide a Pareto-based solution that combines all three competing objectives concurrently, even though each of these metrics reflects an essential objective of a high-quality recommender system. This leaves room for improvement in MO RSs.
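The dominance relation described above can be sketched as a simple non-dominated filter. The sketch assumes that every objective is maximized and that each candidate is represented by a tuple of objective values; the names `dominates` and `pareto_front` are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives maximized).

    a dominates b when it is at least as good in every objective and
    strictly better in at least one.
    """
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For three candidates scored on (precision, novelty, diversity), a candidate that is worse in all three metrics than another is filtered out, while candidates that trade one metric for another are both kept.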

III. DEEP REINFORCEMENT LEARNING-BASED MULTI-OBJECTIVE RECOMMENDATION SYSTEM
Two types of DQN approach and a recurrent variant are proposed and adapted to the MORS environment to generate a recommendation item list for each user. The proposed DRL approaches are then compared with the benchmark work [31], which applied an EC technique coupled with the CF method. Three evaluation metrics are optimized concurrently, namely precision, novelty, and diversity. As discussed in the previous section, accuracy and non-accuracy metrics conflict with each other, such that maximizing one objective degrades the other objective(s).
In this MO optimization work, the effectiveness of the weighted-sum strategy and of the Pareto optimal filtering method is evaluated. In line with the benchmark [31], the length of the recommendation list L is fixed at 10 for all users. The recommendation output is evaluated in terms of the precision (2), diversity (3), and novelty (4) metrics as proposed by [31]. This work is distinguished from previous studies [31], [33], [57], which focus only on optimization between precision and either novelty or diversity, in that we simultaneously study the optimization among three objectives: precision, novelty, and diversity.

A. DQNMORS
When dealing with an enormous state space, using a large memory to store the values of all state-action pairs is impracticable. Moreover, exploring every state and updating the Q-values in a Q-table would be unrealistic. Therefore, the DQN method, which uses a function approximator to optimize the policy, is more practical. The working principle of the proposed DQN algorithm follows the RL mechanism illustrated in Fig. 2. The algorithm learns to predict items for the user based on feedback from its interaction with the environment. In the RL algorithm, the agent is the component that decides which items to recommend. It is responsible for acting in accordance with the state observed from the environment, and each action taken is rewarded with a corresponding value as feedback to the agent.
Our proposed algorithm, DQNMORS, is based on DQN and examined in the recommender environment. Fig. 3 shows the diagram of the proposed DQNMORS with optimization. The DQNMORS architecture consists of an experience buffer and two identical networks called the predicting (evaluation) network and the target network. Both networks are initialized with identical parameters. The Q-function approximator used to optimize the policy is a neural network that consists of 4 layers including the output layer. The two fully connected hidden layers are followed by an output layer with one unit for each valid action. The first hidden layer consists of 512 neurons and is followed by the second hidden layer, which consists of 1024 rectified units. Lastly, the output layer is made up of 1682 units, as there are 1682 unique movies in the dataset. The action performed on the respective state and the reward obtained are stored in the replay memory for experience replay as a tuple, e_t = (s_t, a_t, r_t, s_{t+1}), at each time step t. During training, the agent randomly samples minibatches of transitions from the replay memory and then performs gradient descent with respect to the network parameters. Random sampling breaks undesirable correlations between the samples and therefore reduces the variance of the updates. The target network is updated periodically with the parameters of the predicting network. It supplies the target values toward which the action values are regulated, thereby leading to a more stable learning process. The loss function applied in the DRL neural network is the mean squared error between the predicted Q-value and the target Q-value.
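The experience replay and loss computation described above can be sketched without any deep learning library. This is an illustrative sketch only: the class `ReplayMemory`, the helpers `td_target` and `mse_loss`, and the discount factor `gamma` are assumptions for the example (the text does not state a discount factor), and terminal states are not handled.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.buffer), batch_size)


def td_target(reward, next_q_values, gamma):
    """Target value r + gamma * max_a' Q_target(s', a') (gamma is assumed)."""
    return reward + gamma * max(next_q_values)


def mse_loss(predicted, targets):
    """Mean squared error between predicted and target Q-values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(predicted)
```

In a full agent, `next_q_values` would come from the target network and `predicted` from the predicting network, and the loss would be minimized by gradient descent on the predicting network's parameters.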

B. recDQNMORS
The sequence of past rating information is significantly correlated with the current ratings [67]. This motivates us to investigate further the impact of learning sequential data by applying a recurrent layer to the prediction task. Recurrent neural networks (RNNs) are mainly used to address the short-term memory issue of a basic neural network. A few studies [48], [73], [74] pioneered the integration of RNNs with DRL to learn sequential data. In the RS domain, [63], [75] employed a hybrid RNN in an RL algorithm and demonstrated the ability to capture long-term sequential information. We therefore fused a long short-term memory (LSTM) recurrent layer with the DQNMORS algorithm and named the resulting algorithm recDQNMORS (see Fig. 4).
LSTM [76] is an extension of the RNN architecture and is meant to address the short-term memory issue that occurs in a basic RNN because of the vanishing gradient effect. The proposed recDQNMORS approach is modified from the LSTM-based recurrent enhanced approach used in [77], which demonstrated the significant role of LSTM in handling ordered data. The LSTM is placed at the top layer of the network to handle the sequential input data.

C. OPTIMIZATION METHODS
Both the scalarization and Pareto methods are applied to the identical DQNMORS algorithm in order to determine which optimization method performs better. Based on the comparison among the DQNMORS approaches, the optimization method that yields the better result is adopted in the recDQNMORS approach for the subsequent experiment. For convenience, the names of the experimental algorithms are summarized in Table I. Both algorithms are applied to an identical MO environment but with different optimization methods. First, DQNMORS_ws uses the scalarization method, also called the weighted-sum method (6), to compute the reward value after each recommendation list is produced by the agent. The reward for a recommended list is the aggregation of the metric functions (2), (3), and (4), each multiplied by its corresponding weight, as shown in line 29 of the DQNMORS algorithm in Algorithm 1. The reward function in the proposed DQNMORS_ws framework is defined as the summation of the evaluation metrics because it directly reflects the quality of the recommended items to the DRL agent. Since the metrics are considered equally important, the weight assigned to each objective is 0.3. In contrast, DQNMORS_pf uses the Pareto method (7) to select the optimal recommendation list from the five recommendation lists generated by the agent for each individual user. Every recommended list is evaluated with the metric functions (2), (3), and (4), and only one optimum list is selected as the final recommendation list for that user.
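The weighted-sum reward used by DQNMORS_ws can be illustrated with the sketch below. It assumes the three metric values have already been computed for a recommendation list; the function name `weighted_sum_reward` is hypothetical, and only the equal weights of 0.3 are taken from the text.

```python
def weighted_sum_reward(precision, novelty, diversity,
                        weights=(0.3, 0.3, 0.3)):
    """Scalar reward: each metric value multiplied by its weight, then summed.

    The default weights follow the equal weighting of 0.3 per objective
    stated in the text; the argument order is an assumption of this sketch.
    """
    metrics = (precision, novelty, diversity)
    return sum(w * m for w, m in zip(weights, metrics))
```

Because the metrics may lie on different scales, a metric with a larger numeric range can dominate the scalar reward even under equal weights, which is one reason to compare this strategy against Pareto filtering.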

A. DATASET
To evaluate the performance of the proposed DRL approaches in RS, the agent and environment were designed to be interactive, and a well-known dataset from GroupLens Research, the MovieLens 100K dataset [78], was utilized. It comprises 100,000 ratings on a scale from 1 to 5, given by 943 individual users to 1,682 movies. Every user has rated at least 20 movies in the dataset. There are 19 genres in total, and each movie belongs to at least one genre. In this work, the user information was also exploited as latent input for the DRL agent, and the impact of including the latent user input is discussed in the next section. The user information includes age, gender, occupation, and demographics. These features are related to user personalization and act as a unique representation for every user. The test set contains 462 users and only those movie items rated 3 or above, to accommodate precision evaluation as in [31].
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3181164

B. INPUT LATENT STATES
The DRL agent interacts with the environment by continuously observing the input state and performing the corresponding action (explore or exploit). The state in an RS environment commonly refers to the interaction between the user and the item in the application (according to the dataset). Since the DRL approach can effectively incorporate side features such as user latent features into the input states, it has an advantage over algorithms that are incapable of learning latent features.
A comparison was carried out between DQNMORS with and without user latent input, alongside the user's movie-rating values, to validate the effect of incorporating user latent features on the recommendation result. To capture the essential information from the state, the input was embedded as an input vector. One-hot encoding is not suitable in this case because it provides no meaningful relations between vectors. Instead, the user information was embedded as a latent representation. The same setting was applied to recDQNMORS. Table II summarizes the differences between the input feature settings. It is hypothesized that the user latent input will benefit the DRL agent since the additional features are strongly related to user personalization and form a unique representation for every user in the dataset.
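The contrast between one-hot encoding and an embedded user representation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the vocabulary size, embedding width, feature scaling, and the names `occupation_table`, `user_latent`, and `build_state` are all hypothetical, and the embedding rows are randomly initialized here whereas in practice they would be learned.

```python
import random
from typing import List, Optional

random.seed(0)

N_OCCUPATIONS = 21  # assumed number of occupation categories
EMBED_DIM = 4       # assumed embedding width

# Embedding table: one dense row per category (random here, learned in practice).
occupation_table = [
    [random.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(N_OCCUPATIONS)
]


def one_hot(index: int, size: int) -> List[float]:
    """One-hot vector: distinct categories are always orthogonal."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def user_latent(age: int, gender_id: int, occupation_id: int) -> List[float]:
    """Dense user vector: scaled age, gender flag, occupation embedding row."""
    return [age / 100.0, float(gender_id)] + occupation_table[occupation_id]


def build_state(movie_ratings: List[float],
                user_vec: Optional[List[float]] = None) -> List[float]:
    """Input state: ratings scaled to [0, 1], optionally followed by user latent."""
    scaled = [r / 5.0 for r in movie_ratings]
    return scaled + (user_vec or [])
```

Calling `build_state(ratings)` corresponds to the movie-rating-only setting, while `build_state(ratings, user_latent(...))` corresponds to the setting with user latent input. The dot product of any two distinct one-hot vectors is zero, which is why one-hot codes carry no similarity information between categories.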

C. SEQUENTIAL INPUT STATES
Since the order of past ratings influences present ratings [67], historical sequential rating data could help the agent generate better predictions. To capture sequential input, an LSTM layer is applied on top of the dense neural network. The sequential rating information refers to the user-item rating data arranged in ascending order of timestamp. recDQNMORS is proposed to study the effect of learning the sequential rating input. To verify this assumption, recDQNMORS is compared with DQNMORS using the same input features and optimization method.
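A minimal sketch of how a user's ratings might be arranged into a sequential input is given below. The window length, the left padding value, and the function name `rating_sequence` are assumptions for illustration; the text only states that ratings are arranged in ascending order.

```python
from typing import List, Tuple

# One rating record: (movie_id, rating, timestamp).
Rating = Tuple[int, float, int]


def rating_sequence(ratings: List[Rating],
                    window: int = 5,
                    pad: float = 0.0) -> List[float]:
    """Return the most recent `window` ratings in ascending timestamp order.

    Shorter histories are left-padded with `pad` so every user yields a
    fixed-length sequence. `window` is assumed to be a positive integer.
    """
    ordered = sorted(ratings, key=lambda r: r[2])  # oldest first
    values = [r[1] for r in ordered][-window:]     # keep the latest ratings
    return [pad] * (window - len(values)) + values
```

The resulting fixed-length sequence is the kind of ordered input a recurrent layer would consume one step at a time.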

D. HYPERPARAMETER SETTING
Preliminary experiments were conducted to identify the optimum hyperparameter values for DQNMORS and recDQNMORS. Essential hyperparameters such as the learning rate, discount factor, and epsilon values directly control the agent's behavior in the learning process. The tuned hyperparameters are listed in Table III. In general, the hyperparameter-tuning experiments were executed over 30 independent runs to collect statistical results. The average metric values of 10 sample users were taken for analysis and plotted.

E. COMPLEXITY ANALYSIS
The computational complexity of proposed approaches is deduced from its pseudocode as introduced in

V. RESULTS AND ANALYSIS
Three experiments were conducted to substantiate the proposed algorithms, as summarized in Table IV. To investigate the influence of learning sequential input data

A. SCALARIZATION METHOD VERSUS PARETO METHODS
The results of the comparison between the scalarization (weighted-sum) method and the Pareto method are presented in this subsection. As shown in Fig. 5, the DQNMORS_pf_m_u approach with the Pareto method outperformed the DQNMORS_ws_m_u approach, which adopts the weighted-sum method, in precision and novelty by 13.04% and 2.56%, respectively. The average diversity of DQNMORS_pf_m_u is lower than that of DQNMORS_ws_m_u by 8.42%. Based on the results in Fig. 5, DQNMORS_pf_m_u is regarded as the better performer in terms of average metric values, despite its lower diversity. This indicates that the Pareto method is more effective for attaining higher accuracy while maintaining the non-accuracy metrics. It is better able to optimize multiple metrics because Pareto filtering operates on the solution space without requiring a weight factor for each objective.

B. USER LATENT FEATURE VERSUS MOVIE-RATING FEATURE
To investigate the impact of the user latent feature on performance, DQNMORS with user latent input (denoted DQNMORS_pf_m_u) is compared against DQNMORS with movie-rating feature input (denoted DQNMORS_pf_m). The average metrics obtained by each agent are shown in Fig. 6. Overall, the results support the hypothesis that user latent input contributes to better performance: DQNMORS_pf_m_u surpassed DQNMORS_pf_m on all evaluation metrics.
DQNMORS_pf_m_u achieved higher precision than DQNMORS_pf_m by 19.80% (Fig. 6). In average novelty, DQNMORS_pf_m_u outperformed DQNMORS_pf_m by 20.46%, and it attained a higher average diversity by 1.60%. As expected, DQNMORS_pf_m_u, which incorporates user features, can learn more context about the interaction between user and movie item, whereas DQNMORS_pf_m lacks this information, which affected its performance. Therefore, incorporating user latent features as side features beyond the query led to better recommendation results.

FIGURE 5. Average metrics values of recommendation results obtained by DQNMORS_ws_m_u and DQNMORS_pf_m_u for 10 sample users.

C. LEARNING SEQUENTIAL RATING INPUT
To examine the impact of learning sequential rating information, the recDQNMORS_pf_m_u algorithm was compared against the DQNMORS_pf_m_u algorithm. Both algorithms applied the Pareto method for optimization and used user latent input sorted in ascending order by timestamp. The only difference is the presence of the LSTM layer: recDQNMORS uses an LSTM layer to capture sequential input states, while DQNMORS is purely based on DQN without an LSTM layer. The comparison results are shown in Fig. 7.
The performance of recDQNMORS_pf_m_u with the LSTM layer did not meet expectations. The results show that learning sequential rating data does not enhance the recommendation, as its average precision was unexpectedly lower than that of DQNMORS_pf_m_u by 17.57% and its average novelty was lower by 4.68%.
Only in average diversity did recDQNMORS_pf_m_u perform better than DQNMORS_pf_m_u, by 2.66%. This contrary performance indicates that learning sequential past ratings has no positive effect on the agent.
The reason behind this result may be an inefficient representation of the sequential rating input state. First, the transition of a user's movie ratings across timestamps carries no meaningful context to represent changes in the user's preferences. By contrast, in [77], which used an LSTM layer to learn the trend of a stock price, the change in stock price provides meaningful signal information to the agent. In the RS environment, however, the change in a user's movie ratings did not provide any useful representation. In addition, the MovieLens dataset is highly sparse and contains gaps in the watching and rating sequences, so the LSTM layer has difficulty extracting sufficient historical data, which causes unstable learning. Therefore, the strength of the LSTM layer is not realized in this case.

D. PMOEA versus DQNMORS versus recDQNMORS
The performance of the DQNMORS and recDQNMORS algorithms was compared against the benchmark results from [31], which used probabilistic MO evolutionary algorithm (PMOEA) approaches, in terms of precision, novelty, and diversity. The average mean, minimum, and maximum values of the precision, novelty, and diversity metrics over 10 sample users are presented in Appendices 3, 4, and 5.
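One plausible reading of the "average of mean, minimum, and maximum" statistics is sketched below: for each sample user, the mean, minimum, and maximum of a metric are taken across independent runs, and each statistic is then averaged over the users. This aggregation order and the function name `summarize` are assumptions; the text does not spell out the exact procedure.

```python
from statistics import mean
from typing import Dict, List


def summarize(runs_per_user: Dict[str, List[float]]) -> Dict[str, float]:
    """Average over users of each user's mean, min, and max across runs.

    `runs_per_user` maps a user id to that user's metric value in each
    independent run (for example, precision in every run).
    """
    per_user = list(runs_per_user.values())
    return {
        "avg_mean": mean(mean(v) for v in per_user),
        "avg_min": mean(min(v) for v in per_user),
        "avg_max": mean(max(v) for v in per_user),
    }
```

Applying `summarize` once per metric gives three summary values per metric that can be compared directly across approaches.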
As evidenced by the results plotted in Fig. 8, the proposed DRL approaches are capable of concurrently handling multiple competing objectives in RS. In general, no approach achieved the best results on all metrics simultaneously. The PMOEA+CF_User technique from the benchmark obtained the highest average mean precision at 0.50, whereas the highest average mean precision from DQNMORS_pf_m_u is only 0.22. However, the average maximum precision values obtained by DQNMORS_ws_m_u and DQNMORS_pf_m_u, which reached 0.50 and 0.46, respectively, among the 10 sample users, are still competitive with the best benchmark result, as shown in Appendices 3, 4, and 5. DQNMORS_ws_m_u and DQNMORS_pf_m_u have higher maximum precision than PMOEA+CF_Item for Users 1, 4, and 9, while the maximum precision for Users 2 and 8 was achieved by PMOEA+CF_Item.
Although the benchmark PMOEA+CF_User achieved higher precision, it has lower novelty and diversity than the proposed DRL approaches. All DQNMORS and recDQNMORS variants except DQNMORS_pf_m have a higher average mean novelty than PMOEA+CF_User. recDQNMORS_pf_m achieved higher novelty than PMOEA+CF_Item for all sample users. In average maximum novelty, all the proposed DRL approaches surpassed all the PMOEA-based approaches. In terms of average minimum novelty, PMOEA+CF_Item has the highest value compared with the DRL approaches. Nevertheless, the majority of the PMOEA approaches have lower novelty than the proposed DRL approaches. The exploration-exploitation nature of DRL agents gives them a higher potential to explore a greater variety of items, which enables the agent to reach a wider range of items and contributes to better diversity. The DQNMORS and recDQNMORS approaches achieved striking results in diversity: all of them surpassed all PMOEA-based approaches in average mean diversity. DQNMORS_ws_m_u and DQNMORS_pf_m_u clearly achieved higher average mean, minimum, and maximum diversity than the PMOEA approaches. recDQNMORS_pf_m has the lowest mean diversity among the DRL approaches, but it still outperformed PMOEA+ProbS by 69% in average mean diversity.
In general, our proposed algorithms succeeded in optimizing three objectives simultaneously rather than two. An appropriate exploration level supports the DRL agent in prospecting for higher-reward actions; it encourages the agent to discover long-tail items that potentially have higher novelty and to diversify the categories in the recommendation list. The interactions between the DRL agent and the environment are governed by a balance of exploration and exploitation, which gives the agent an advantage over the GA, which is vulnerable to premature convergence. Premature convergence is generally a consequence of losing diversity within the population due to the GA operators. The crossover and mutation operators serve exploitation and exploration, respectively, by producing genes from the available parents. However, it is difficult to generate optimum solutions because a limited set of items dominates the sub-population, constraining it to converge to a local optimum. In contrast, the exploration-exploitation strategy in the DRL approaches is more flexible because it provides a larger probability of random selection and therefore a greater exploration rate to discover more novel and diverse items. The GA applied in [31], by comparison, has only a fixed probability of randomly selecting items from the sub-population.
The superiority of the DRL approach can also be explained by its ability to explicitly represent uncertainty in its transition function and to monitor dynamic changes in a highly sparse environment. The agent is able to predict and optimize the recommendation directly without relying on an additional, separate rating predictor as used in [31]. Although rating prediction tends to secure precision, it disregards novelty and diversity.
Nevertheless, balancing conflicting objectives requires additional effort, and the cost of considering non-accuracy metrics is that a certain degree of precision must be sacrificed. As a trade-off, the precision of the recommendations was affected. The learning performance of the DRL agent relies heavily on the reward obtained after every action. However, the reward function for an MO problem is always problem dependent, and it is difficult to justify its effectiveness. The GA shares this limitation because designing a fitness function is daunting. In addition, the learning rate was kept static throughout training, which may lengthen the time to converge, whereas a larger learning rate may cause the agent to learn a sub-optimal set of weights. Hence, a dynamic learning rate should be considered.
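As one illustration of what a dynamic learning rate could look like, the sketch below uses exponential decay with a lower bound. This schedule is a common choice rather than a method evaluated in this work, and the function name and all default values are hypothetical.

```python
def decayed_learning_rate(step: int,
                          initial_lr: float = 1e-3,
                          decay_rate: float = 0.96,
                          decay_steps: int = 1000,
                          min_lr: float = 1e-5) -> float:
    """Exponentially decayed learning rate, never lower than `min_lr`.

    Starts at `initial_lr` (larger early steps) and shrinks by a factor
    of `decay_rate` every `decay_steps` training steps.
    """
    lr = initial_lr * decay_rate ** (step / decay_steps)
    return max(lr, min_lr)
```

Such a schedule takes larger steps early in training and smaller steps later, which is one way to trade convergence time against the risk of settling on sub-optimal weights.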

VI. CONCLUSION
This work presented two DRL approaches, DQNMORS and recDQNMORS, capable of tackling the MO problem in the RS environment. These algorithms were proposed to optimize three different objectives, namely precision, novelty, and diversity. From the comparison of optimization methods, the Pareto method was observed to outperform the scalarization method. The DQNMORS approach was further investigated by incorporating user latent features as side features, and the results show that the additional feature input improved the recommendation performance. Furthermore, DQNMORS was extended with an LSTM layer to form recDQNMORS for learning sequential input data, which a regular neural network has difficulty capturing. Although recDQNMORS did not achieve better precision than DQNMORS owing to an ineffective input representation for the agent, it exhibited the ability to optimize and achieved better results in terms of novelty and diversity.
As for future directions, more advanced DRL approaches can be investigated in terms of robustness and complexity. For instance, a multiple-network DRL approach such as Actor-Critic has the potential to increase efficiency since it uses two networks to learn the value and policy functions. This work sets a benchmark for DRL-based approaches in RS applications for future research on this topic. Optimizing more than one objective concurrently will compromise at least one objective, and this remains a main challenge. Lastly, the sequential rating input requires a better technique to capture significant latent information in order to enhance sequential decision-making.

APPENDICES
Appendices 1 and 2 present the algorithms of the proposed DQNMORS and recDQNMORS, respectively, followed by the experiment results of all proposed algorithms against the benchmark in Appendices 3, 4, and 5.
6 if ε > randomly generated number then
7   Select random movie items with probability ε to compose P list of movies with length, L as action, $a_t$
8 else
9   Select the movie items by $a_t = \arg\max_{a}(Q(s_t, a))$ to compose P list of movies with length, L as action,
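The ε-greedy branch in the algorithm fragment above can be sketched in Python as follows. The sketch makes several assumptions that the fragment does not confirm: Q-values are given as a plain dictionary from movie id to estimated value, the random draw against ε is repeated for each of the P lists, and the greedy branch takes the L movies with the highest Q-values. The function name `compose_lists` and its parameters are hypothetical.

```python
import random
from typing import Dict, List


def compose_lists(q_values: Dict[int, float],
                  epsilon: float,
                  num_lists: int,
                  list_len: int,
                  rng: random.Random) -> List[List[int]]:
    """Compose `num_lists` recommendation lists of `list_len` movie ids.

    Assumption: one epsilon test per list; `list_len` must not exceed
    the number of available movies.
    """
    items = list(q_values)
    lists = []
    for _ in range(num_lists):
        if epsilon > rng.random():
            # Explore: sample distinct movies uniformly at random.
            chosen = rng.sample(items, list_len)
        else:
            # Exploit: take the movies with the highest estimated Q-values.
            chosen = sorted(items, key=q_values.get, reverse=True)[:list_len]
        lists.append(chosen)
    return lists
```

With `epsilon = 0.0` every list is purely greedy, and with `epsilon = 1.0` every list is sampled at random, so intermediate values control how often the agent explores less familiar items.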