User Preference-Based Demand Response for Smart Home Energy Management Using Multiobjective Reinforcement Learning

A well-designed demand response (DR) program is essential in a smart home to optimize energy usage according to user preferences. In this study, we propose a multiobjective reinforcement learning (MORL) algorithm to design a DR program. The proposed approach improves on conventional algorithms by mitigating the effect of changes in user preferences and by addressing the uncertainty induced by future prices and renewable energy generation. Because two Q-tables are used, the proposed algorithm simultaneously considers electricity cost and user dissatisfaction; when the user preference changes, the algorithm reuses previous experience to customize appliance scheduling and swiftly achieve the best objective value. The proposed algorithm generalizes well and can therefore be implemented in a smart home equipped with an energy storage system, a renewable energy source, and various types of appliances, including inflexible, time-flexible, and power-flexible ones. Numerical analysis using real-world data revealed that, under price and renewable uncertainty, the proposed approach delivers excellent performance after a change of user preference; it achieved an 8.44% cost reduction compared with mixed-integer nonlinear programming (MINLP)-based DR while increasing the dissatisfaction level by only 1.37% on average.


I. INTRODUCTION
Demand response (DR) [1] refers to the changes in the energy demand of devices in response to time-varying energy prices. A DR program can reduce the cost of energy through adjustment of appliance operation. In a smart home, an energy management system (EMS) [2] plays a crucial role in a DR program. An EMS can shift certain demands for appliances from peak price hours to nonpeak price hours. This adjustment reduces the demand for electricity during peak load times [3]. A well-designed DR scheme with an EMS can benefit users, improve human comfort level, reduce energy consumption (or electricity cost), and reduce the system's peak demand.
Incentive-based programs and price-based programs are the two major types of DR programs [2]. In incentive-based programs, incentive payments are given to induce lower energy usage at times of high electricity prices; users adjust their energy usage after the incentive is provided. In price-based programs, users obtain financial benefits and discounts in return for shifting their peak demand and reducing consumption at designated times. Users can respond to price changes promptly when adopting a price-based program [4], [5]. Price-based DR programs for smart homes have been widely studied in recent years [6], [7]. In price-based programs, the power utility sets a time-based electricity tariff to encourage users to voluntarily adjust their energy consumption. One commonly used tariff is real-time pricing (RTP), whose pricing signals change with time [8]-[10]. The peak price is set at the time of peak demand to encourage users to curtail their energy usage. In smart homes, an EMS can shift the energy demand to nonpeak price hours and thus reduce the total electricity cost. However, although this method helps reduce the electricity cost, the user comfort level can also decrease, which consequently increases user dissatisfaction because the energy demand required by users is only partially satisfied [11]. A well-designed DR scheme and an EMS can allow users to automatically manage demand and balance electricity cost against user dissatisfaction [12]. When a DR program is used to optimize energy resources, the objective is to minimize the electricity cost and user dissatisfaction. In [13]-[17], the authors applied mixed-integer nonlinear programming (MINLP) to achieve the optimal DR.
Meanwhile, optimizing energy resources often involves the use of an EMS, which has been used by, for example, companies and academic institutions such as Aggregate Industries, Thorn-Zumtobel, and Sheffield Hallam University to save energy costs. These approaches require complete information, such as full-day electricity prices and renewable supply, to determine the optimal scheduling. In practice, however, users are only provided with forecast electricity prices and renewable supply; data uncertainty must be considered.
To address this uncertainty, machine learning technologies have been used for energy management [18], [19]. Reinforcement learning (RL) is a machine learning-based decision-making algorithm that can handle an uncertain environment using limited information [20]. The use of RL involves identifying agents and the environment. Agents interact with the environment and receive feedback from it; this reward (feedback) is then used to construct a Q-table, which stores Q-values for evaluating each state-action pair.
Conventional RL algorithms have been applied to energy management in smart homes over the past few years [21]. Ruelens et al. [22] developed a batch RL technique called fitted Q-iteration to perform day-ahead scheduling, but only for power-flexible appliances. Wen et al. [23] proposed an RL-based algorithm to operate home appliances automatically but considered only time-flexible appliances. Lu et al. [24] proposed a multiagent RL technique to schedule the operation of various home appliances in a decentralized manner; however, they did not include renewable energy sources and energy storage systems, and the influence of the prediction error was not discussed. Xu et al. [25] developed a multiagent RL-based EMS but did not consider an energy storage system to reduce the energy bill. Remani et al. [26] proposed an RL-based model for a load balancing problem but trained only a single agent. Mathew et al. [27] developed a price-based deep-RL EMS. In modern smart homes, various home appliances and their interaction with the energy storage system as well as renewable energy sources should be considered.
In [22], [23], and [25], the authors formulated the DR problem as a single-objective RL (SORL) problem and used weighted-sum methods to combine multiple objectives.
Park et al. [28] and Lu et al. [24] considered the use of various weights, but they did not account for weight changes in response to user preferences. A user-friendly DR program can customize the scheduling of appliance operation according to user preferences. However, user preferences are typically assumed to be fixed, and possible changes over time are not considered. If user preferences change, learning algorithms can approximate an optimal scheduling after a certain period; however, the speed of approximating a new scheduling after a change of preference depends on various factors that have not been sufficiently investigated. To address changes in user preferences, a mechanism to expedite the learning process must be implemented to minimize the learning costs.
Multiobjective RL (MORL) [29] is preferred over SORL for problems with multiple conflicting objectives for two main reasons. First, MORL improves on SORL by developing diverse Pareto-optimal models with multiple objectives. Second, MORL is suitable for optimizing multiple objectives simultaneously in sequential decision-making problems. Although the concept of MORL has emerged recently, its application to smart home energy management has not been explored; further research is therefore required in this field.
Due to the advantages of MORL over SORL, we proposed a user preference-based DR program that uses MORL for energy management in a smart home. We implemented the proposed DR program in a smart home equipped with various types of home appliances, an energy storage system, and a renewable energy source. Appliances and the energy storage system are considered as agents that aim to approach optimal energy scheduling in the presence of uncertainty from renewable energy generation and electricity price. To handle a possible change in user preferences, two Q-tables were used in the proposed approach to simultaneously consider electricity cost and user dissatisfaction. After the user preference changes, the previous experience can be used to swiftly approach the best objective value without undergoing any time-consuming learning process.
The main contributions of this paper are as follows. First, the proposed user preference-based DR program using MORL algorithm can swiftly approach the best objective value after the user preference changes. An EMS is used to rapidly customize energy scheduling according to a change in the user preference. The concept of using the MORL to improve the learning speed and its application to home energy management has not been investigated in the literature. Second, the proposed home energy management approach is flexible and can address a wide range of appliances and other smart home components, including inflexible, time-flexible, power-flexible appliances, an energy storage system, and a renewable energy source, whereas existing approaches are applicable to only some of these devices. This feature of the proposed approach is critical because users desire a DR program that can handle various types of appliances, and smart homes are often equipped with an energy storage system and renewable energy source to improve the energy efficiency. Third, a numerical analysis was conducted using real-world data; the results revealed that the proposed approach outperformed existing methods in terms of electricity cost and user dissatisfaction while simultaneously considering the price and renewable energy uncertainty.
The remainder of this paper is organized as follows. Section II discusses the related work. Section III details the scheme of a smart home, system models of diverse categories of home appliances, and the energy storage system. In Section IV, a user preference-based DR program that uses the MORL algorithm is proposed with dedicated designs of states, actions and rewards. Section V presents our simulation results and a comparison of the existing smart home energy management methods. Finally, conclusions are drawn in Section VI.

II. RELATED WORK
In a smart home, a renewable system has been used to supply electricity to home appliances [30]- [32]. To predict future renewable supply, various weather forecasting methods have been developed [33]- [35]. For example, Cerrai et al. [36] proposed an outage prediction model for an electric distribution network.
A battery system is often used to reduce the variability of renewable energy. As such, battery models have been investigated [37], [38]. Price forecasting [39]- [41] or prediction of energy consumption [42]- [44] may be involved in a battery system to form an EMS.
Prediction errors were observed in conventional prediction algorithms [45]- [47]. To reduce the prediction error, for example, Kodaira et al. [48] proposed to use prediction intervals of estimated error based on prior predictions to approach optimal energy scheduling; Lee et al. [49] applied an error driven prediction modulation to evaluate the difference between the forecast load and actual load.
To improve energy efficiency, hybrid systems have been examined [50]-[52]. For example, Li et al. [53] considered photovoltaic battery energy storage systems as a black-start power source that can improve the regional power grid and broaden the application of photovoltaic (PV) power generation; Yulong et al. [54] proposed a hybrid energy storage system consisting of ultracapacitors and battery packs, which prevents the battery from large current impacts and increases the instantaneous battery capacity.
While energy scheduling within a smart home has been widely investigated, its impact can extend to a larger scale if transactive energy is applied. Transactive energy is related to economic and control techniques that can be employed to exchange energy or power flow within a region of interest, such as a multimicrogrid system or a community. For instance, Iqbal et al. [55] proposed a metaheuristic polar bear optimization method to solve a DR problem considering interconnected microgrids in a transactive energy market. Meanwhile, blockchain technologies can be employed to ensure the security of energy transactions [56]- [58].  Fig. 1 illustrates the smart home environment considered in this study. Typically, smart homes include an EMS, an energy storage system, a renewable energy source, and several home appliances. A user sends a full-day demand request list to the EMS.

III. SYSTEM MODELS AND PROBLEM FORMULATION
In this study, we primarily focus on the price-based programs to optimize energy resources. The electricity can be bought from a utility or supplied from renewable energy sources. The price of electricity will fluctuate due to several factors, including fuels, power plant costs, transmission and distribution system, weather conditions, and regulations [59]. The electricity price is automatically updated periodically in the EMS to evaluate the energy bill and a renewable energy source is used to charge the energy storage system. The EMS controls home appliances and the energy storage system to improve energy usage and reduce the energy bill. Home appliances are categorized into three groups, namely inflexible, time-flexible, and power-flexible [24], [60]. The following subsections present models of home appliances, energy storage system, and our problem formulation.

A. INFLEXIBLE APPLIANCES
Inflexible appliances have critical energy demand that must be satisfied in each time slot. Inflexible appliances, such as refrigerators, must operate continuously, and their energy demand cannot be adjusted. The energy consumption of inflexible appliance $n$ in time slot $h$ is denoted as $E^a_{n,h}$, where $n \in \{1, 2, \ldots, N_a\}$ and $h \in \{1, 2, \ldots, H\}$. Here, $N_a$ and $H$ represent the number of inflexible appliances and the optimization window, respectively. The energy consumption of all inflexible appliances is as follows:

$$E^a_h = \sum_{n=1}^{N_a} E^a_{n,h}$$
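As an illustration, the per-slot aggregation of inflexible-appliance consumption can be sketched as follows (the function and data names are illustrative, not from the paper's implementation):

```python
# Minimal sketch: total inflexible-appliance consumption in slot h,
# i.e., the sum of E^a_{n,h} over all appliances n.

def total_inflexible_energy(E_a, h):
    """E_a[n][h] is the consumption (kWh) of inflexible appliance n in slot h."""
    return sum(E_a[n][h] for n in range(len(E_a)))

# Example: two inflexible appliances over three time slots.
E_a = [[0.5, 0.5, 0.5],   # refrigerator
       [0.1, 0.1, 0.1]]   # router
print(total_inflexible_energy(E_a, 0))  # 0.6
```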

B. TIME-FLEXIBLE APPLIANCES
Time-flexible appliances, such as washing machines, can be scheduled by the EMS to operate at nonpeak hours to reduce the energy bill. For a time-flexible appliance, the user sends a request to the EMS and sets an operation period; the EMS must finish the job within this period. The energy consumption of time-flexible appliance $n$ in time slot $h$ is denoted as $E^b_{n,h}$, where $n \in \{1, 2, \ldots, N_b\}$, with $N_b$ representing the number of time-flexible appliances. The energy consumption of all time-flexible appliances during the observation period is as follows:

$$E^b_h = \sum_{n=1}^{N_b} E^b_{n,h}$$

We denote user dissatisfaction as $U^b_{n,h}(\theta^b_n, h^b_{n,s})$, where $\theta^b_n$ is a device-dependent dissatisfaction parameter and the actual start time $h^b_{n,s}$ must lie within the user-requested operation period. For example, user dissatisfaction can be linearly related to the difference between the actual start time $h^b_{n,s}$ and the requested start time.
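The linear dissatisfaction mentioned above can be sketched as follows; the exact functional form is not specified in the text, so a delay-proportional penalty is assumed here purely for illustration:

```python
def time_flexible_dissatisfaction(theta_b, h_start_actual, h_start_request):
    """One possible linear form (assumed): dissatisfaction grows with the
    delay between the scheduled start time and the requested start time,
    scaled by the device-dependent parameter theta_b."""
    return theta_b * max(h_start_actual - h_start_request, 0)

# Example: washing machine requested at slot 17, scheduled at slot 18.
print(time_flexible_dissatisfaction(0.8, 18, 17))  # 0.8
```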

C. POWER-FLEXIBLE APPLIANCES
Power-flexible appliances, such as air conditioners, are common in smart homes. Their operation can be adjusted to satisfy the energy demand during a requested time. Power-flexible appliances can operate with flexible power consumption between a minimum and a maximum energy demand. The energy consumption of power-flexible appliance $n$ in time slot $h$ is denoted by $E^c_{n,h}$, where $n \in \{1, 2, \ldots, N_c\}$ and $N_c$ is the number of power-flexible appliances. The energy consumption of all power-flexible appliances is calculated as follows:

$$E^c_h = \sum_{n=1}^{N_c} E^c_{n,h}$$

An EMS can reduce the energy cost through the adjustment of power-flexible appliances, but this adjustment can increase user dissatisfaction.
Here, $d^c_{n,\max}$ is the desired energy demand of power-flexible appliance $n$, and $d^c_{n,\min}$ is the minimum energy demand. The EMS controls the energy consumption within the range from $d^c_{n,\min}$ to $d^c_{n,\max}$. Furthermore, $\theta^c_n$ denotes the device-dependent dissatisfaction parameter of power-flexible appliance $n$, with $0 < \theta^c_n \le 1$; a larger $\theta^c_n$ implies a higher priority for appliance $n$.
The user dissatisfaction $U^c_{n,h}(\theta^c_n, E^c_{n,h})$ of power-flexible appliance $n$ is related to the energy consumption and the dissatisfaction parameter, subject to $d^c_{n,\min} \le E^c_{n,h} \le d^c_{n,\max}$. For example, $U^c_{n,h}(\theta^c_n, E^c_{n,h})$ can be formulated as in [24].
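Because the exact form of $U^c_{n,h}$ from [24] is not reproduced above, the sketch below assumes an illustrative quadratic penalty on the gap between desired and delivered energy:

```python
def power_flexible_dissatisfaction(theta_c, E_actual, d_max):
    """Sketch of a dissatisfaction function: the penalty grows with the gap
    between the desired demand d_max and the delivered energy E_actual.
    A quadratic penalty is an assumed illustrative choice, not necessarily
    the exact form used in [24]."""
    return theta_c * (d_max - E_actual) ** 2

# Example: appliance with priority 0.5 curtailed from 2.0 kWh to 1.0 kWh.
print(power_flexible_dissatisfaction(0.5, 1.0, 2.0))  # 0.5
```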

D. ENERGY STORAGE SYSTEM
In this study, a battery system was used as the energy storage system. A battery system can charge at nonpeak hours and discharge electrical energy to appliances at peak hours to reduce the energy bill. The energy level of the battery system in time slot $h$ is denoted by $B_h$ kWh, which satisfies the following constraint [62]:

$$B_{\min} \le B_h \le B_{\max} \quad (8)$$

where $B_{\min}$ and $B_{\max}$ are the minimum battery level and maximum battery capacity, respectively. Here, $B_h$ is mainly influenced by charging and discharging activities, denoted as $P_h$ kW; positive $P_h$ represents charging, and negative $P_h$ represents discharging. The renewable energy source charges the battery system in time slot $h$ with energy $E^{res}_h$ kWh. Power leakage occurs with a leakage rate $\beta$, where $0 < \beta \le 1$. The dynamics of the battery system can be expressed as follows [62], [63]:

$$B_{h+1} = \beta B_h + \eta P_h \Delta h + E^{res}_h \quad (9)$$

where $\Delta h$ is the length of a time slot and $\eta$ is the charging/discharging efficiency.
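A minimal sketch of the battery update described above, with the level clipped to $[B_{\min}, B_{\max}]$ (parameter values follow Section V; the exact equation in [62], [63] may differ in detail):

```python
def battery_step(B_h, P_h, E_res, beta=0.99, eta=0.99, dt=1.0,
                 B_min=1.0, B_max=12.0):
    """One step of the battery dynamics: leakage (beta), charging or
    discharging with efficiency eta over a slot of length dt, plus
    renewable inflow E_res, clipped to the feasible range."""
    B_next = beta * B_h + eta * P_h * dt + E_res
    return min(max(B_next, B_min), B_max)

# Charging example: 5*0.99 + 2*0.99 + 0.5, i.e., about 7.43 kWh.
print(battery_step(5.0, 2.0, 0.5))
```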
Let $E^{HA}_h = E^a_h + E^b_h + E^c_h$ denote the total energy consumption of all home appliances. Because of the energy demand of the appliances and the limited battery capacity, the charging/discharging power $P_h$ should satisfy

$$P^{\min}_h \le P_h \le P^{\max}_h \quad (11)$$

where the maximum power $P^{\max}_h$ and minimum power $P^{\min}_h$ in charging or discharging events are derived from the battery limits in (8) and (9). The electricity cost of a smart home in time slot $h$, denoted as $f_{1,h}$, is the product of the energy bought from the utility and the RTP $\lambda^{RTP}_h$ cent/kWh:

$$f_{1,h} = \lambda^{RTP}_h \left( E^{HA}_h + P_h \Delta h \right)$$

For a charging event, $P_h > 0$ and the term $E^{HA}_h + P_h \Delta h$ represents the energy consumption of the appliances plus battery charging. For a discharging event, $P_h < 0$ and the energy discharged from the battery, i.e., $P_h \Delta h < 0$, is used by the appliances; hence, the difference $E^{HA}_h + P_h \Delta h$ represents the total energy that must be bought from the utility.
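The per-slot electricity cost can be sketched as follows; the max(·, 0) term is an added assumption that surplus discharged energy is not sold back to the grid:

```python
def slot_cost(price_rtp, E_ha, P_h, dt=1.0):
    """Electricity cost f_{1,h}: energy bought from the utility times the
    real-time price. Discharging (P_h < 0) offsets appliance demand;
    clamping at zero assumes no sell-back to the grid (an assumption,
    not stated in the paper)."""
    return price_rtp * max(E_ha + P_h * dt, 0.0)

# Discharging example: 2.0 kWh demand, 0.5 kWh discharged, price 5 cent/kWh.
print(slot_cost(5.0, 2.0, -0.5))  # 7.5
```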
The total user dissatisfaction in time slot $h$ is denoted as $f_{2,h}$. The two objectives to be minimized are

$$\min \sum_{h=1}^{H} f_{1,h} \quad (14) \qquad \text{and} \qquad \min \sum_{h=1}^{H} f_{2,h}. \quad (15)$$

Consider a weight $w$ with $0 \le w \le 1$ that represents the user preference over the objectives; a larger weight indicates the objective the user prefers. A user preference-based DR in a smart home can then be realized by solving

$$\min \sum_{h=1}^{H} \left[ w f_{1,h} + (1-w) f_{2,h} \right] \quad (16)$$

subject to (3), (6), (8), and (11).
In this study, we focus on the dynamic adjustment in the EMS, where the agent can make decisions based on the user's preferences for the target, i.e., w and 1 − w. The weight may change with time; the EMS must reschedule energy when the user preference changes in the presence of uncertainty.

IV. USER PREFERENCE-BASED MULTIOBJECTIVE REINFORCEMENT LEARNING APPROACH
Owing to the uncertainty in electricity price and renewable generation, and possible change of user preference, this section develops a user preference-based DR approach that solves (16) using an MORL algorithm. The home appliances and battery system were considered as agents that interact with the environment (utility, renewable energy source, and the user). The EMS regulates the operation of appliances to reduce energy cost and dissatisfaction from the user. Furthermore, the EMS controls the charging/discharging of the battery. In this section, the framework of MORL is first presented, followed by our customized RL algorithms for energy control and scheduling of the charging/discharging of the battery and home appliances.
The framework of multiagent RL is based on a Markov decision process. Here, $S_n$ is the state set of agent $n$, and $s_{n,h} \in S_n$ is its state in time slot $h$. Given $s_{n,h}$, agent $n$ selects an action $a_{n,h}$ from its action set $A_n$, proceeds to the next state $s_{n,h+1}$, and receives a reward $r_{n,h+1}$ from the environment. The reward $r_{n,h+1}$ is used to evaluate action $a_{n,h}$. The main goal of agent $n$ is to determine an optimal policy $\pi^*_n$, i.e., an optimal mapping from states to actions that maximizes the expected cumulative reward. Given a state $s_{n,h}$, an action $a_{n,h}$, and a policy $\pi_n$, the action value of the pair $(s_{n,h}, a_{n,h})$ is defined by

$$q_{n,\pi_n}(s_{n,h}, a_{n,h}) = \mathbb{E}_{\pi_n}\left[ \sum_{h=1}^{H} \gamma^{h-1} r_{n,h} \,\middle|\, s_{n,h}, a_{n,h} \right], \quad \forall s_{n,h} \in S_n, \ \forall a_{n,h} \in A_n \quad (17)$$

where $\gamma$ is the discount factor. The action-value function under an optimal policy, denoted as $q^*_n(s_{n,h}, a_{n,h})$, is expressed as follows:

$$q^*_n(s_{n,h}, a_{n,h}) = \max_{\pi_n} q_{n,\pi_n}(s_{n,h}, a_{n,h}). \quad (18)$$
Q-learning is often used to achieve or approach the optimality in (18) [20], [64]. In Q-learning, an agent uses a Q-table to estimate $q^*_n$ and has two strategies for selecting an action: for exploration, the agent randomly selects an action to explore the state and action spaces with probability $\epsilon$; for exploitation, the agent selects the action that yields the highest Q-value in the current state with probability $1 - \epsilon$. This mechanism is termed $\epsilon$-greedy action selection. The greedy action $a^*_{n,h}$ is determined by

$$a^*_{n,h} = \arg\max_a Q_n(s_{n,h}, a). \quad (19)$$

Given these action selection strategies, Q-learning can approach $q^*_n$ through the following update rule:

$$Q_n(s_{n,h}, a_{n,h}) \leftarrow (1-\alpha) Q_n(s_{n,h}, a_{n,h}) + \alpha \left( r_{n,h+1} + \gamma \max_a Q_n(s_{n,h+1}, a) \right) \quad (20)$$

where $\alpha \in (0, 1]$ is the learning rate. In our smart home scenario, the user desires to minimize the cost and user dissatisfaction; we thus interpret the reward $r$ as a penalty, and the aforementioned maximizations are replaced by minimizations. We refer to home appliance $n$ as agent $n$; its state $s_{n,h}$ encodes the information the agent observes in time slot $h$, including the time index. Let $a_{n,h}$ denote the action performed by agent $n$ in time slot $h$. For inflexible appliances, the action space has only one option, ''on,'' because they cannot be rescheduled. For time-flexible appliances, the action space has two options, ''on'' and ''off,'' which determine when the appliance operates. For power-flexible appliances, the action space is a set of discrete levels $\{1, 2, \ldots, E^c_{n,\max}\}$; each element represents a level of energy consumption, and $E^c_{n,\max}$ is the maximum level. Each agent receives two reward signals as feedback from the environment.
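A minimal tabular sketch of $\epsilon$-greedy selection and the minimization form of the update rule in (20), using a Python `defaultdict` as the Q-table (illustrative names only, not the paper's implementation):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Pick a random action with probability eps; otherwise the action with
    the lowest Q-value (rewards are penalties here, so we minimize)."""
    if random.random() < eps:
        return random.choice(actions)
    return min(actions, key=lambda a: Q[(s, a)])

def q_update(Q, s, a, r, s_next, actions, alpha=0.9, gamma=0.8):
    """Tabular update: the minimization counterpart of (20)."""
    best_next = min(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

# Tiny usage example.
Q = defaultdict(float)
q_update(Q, s=0, a="on", r=1.0, s_next=1, actions=["on", "off"])
print(Q[(0, "on")])  # 0.9 * (1.0 + 0.8 * 0.0) = 0.9
```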
For time-flexible appliances, the two reward signals are the electricity cost incurred by the appliance and the dissatisfaction $U^b_{n,h}$ (22); for power-flexible appliances, the two reward signals are the electricity cost and the dissatisfaction $U^c_{n,h}$ (23). Both reward functions are designed to encourage agents to minimize objectives (14) and (15) by updating the Q-values $Q_{k,n}(s_{n,h}, a_{n,h})$ as follows:

$$Q_{k,n}(s_{n,h}, a_{n,h}) \leftarrow (1-\alpha) Q_{k,n}(s_{n,h}, a_{n,h}) + \alpha \left( r_{k,n,h+1} + \gamma \min_a Q_{k,n}(s_{n,h+1}, a) \right) \quad (24)$$

where $k \in \{1, 2\}$ indexes the objectives. We propose an MORL algorithm to expedite the learning process in response to possible changes of the user preference over the objectives. Algorithm 1 presents the pseudocode of the proposed user preference-based MORL algorithm for agent $n$. Agent $n$ selects the action derived from $wQ_{1,n} + (1-w)Q_{2,n}$. After the Q-tables $Q_{1,n}$ and $Q_{2,n}$ reach steady values, greedy action selection can readily be performed for any weight $w$; thus, the best objective value $wQ_{1,n} + (1-w)Q_{2,n}$ can be achieved without further learning when the user preference changes.
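The two-Q-table idea can be sketched as follows: each agent keeps one Q-table per objective and acts greedily against the scalarization $wQ_{1,n} + (1-w)Q_{2,n}$, so changing the weight $w$ requires no retraining (class, state, and reward values below are illustrative, not from the paper):

```python
from collections import defaultdict

class MORLAgent:
    """Sketch of the two-Q-table approach: Q1 tracks cost penalties, Q2
    tracks dissatisfaction penalties. Actions are chosen greedily against
    w*Q1 + (1-w)*Q2, so a new weight w needs no additional learning."""
    def __init__(self, actions, alpha=0.9, gamma=0.8):
        self.Q1 = defaultdict(float)
        self.Q2 = defaultdict(float)
        self.actions, self.alpha, self.gamma = actions, alpha, gamma

    def act(self, s, w):
        # Greedy selection against the scalarized Q-values (penalties -> min).
        return min(self.actions,
                   key=lambda a: w * self.Q1[(s, a)] + (1 - w) * self.Q2[(s, a)])

    def update(self, s, a, r1, r2, s_next):
        # Apply the update rule (24) to each objective's Q-table.
        for Q, r in ((self.Q1, r1), (self.Q2, r2)):
            best = min(Q[(s_next, b)] for b in self.actions)
            Q[(s, a)] = ((1 - self.alpha) * Q[(s, a)]
                         + self.alpha * (r + self.gamma * best))

agent = MORLAgent(actions=["on", "off"])
agent.update(s=0, a="on", r1=2.0, r2=0.5, s_next=1)
print(agent.act(0, w=1.0))  # off
```

After the preference changes, only `w` in `act` changes; the stored Q-tables are reused as-is, which is the source of the fast re-convergence discussed in Section V.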

Algorithm 1 Proposed User Preference Based MORL for Energy Scheduling of Home Appliance n
Require: dissatisfaction parameters $\theta^b_n$, $\theta^c_n$; user's request $t_{n,r}$, $t_{n,e}$, $d_{n,\min}$, $d_{n,\max}$; weight $w$. Initialize $Q_{k,n}(s_{n,h}, a_{n,h})$ arbitrarily.
Ensure: scheduling policy for home appliance $n$ derived from $wQ_{1,n} + (1-w)Q_{2,n}$.
1: Loop for each episode
2: For $h = 1, 2, \ldots, H$ do
3: Choose action $a_{n,h}$ on the basis of current state $s_{n,h}$ using $\epsilon$-greedy selection derived from $wQ_{1,n} + (1-w)Q_{2,n}$.
4: Take action $a_{n,h}$, obtain rewards $r_{1,n,h+1}$, $r_{2,n,h+1}$ and next state $s_{n,h+1}$.
5: Update $Q_{1,n}$ and $Q_{2,n}$ using (24).
6: $s_{n,h} \leftarrow s_{n,h+1}$.
7: End for
The EMS also considers the battery system as an agent in the smart home. Its state $s^{ess}_h$ is a vector that includes the time index $h$, the current electricity price $\lambda^{RTP}_h$, and the battery level $B_h$ (25). The action of the battery system agent, denoted by $a^{ess}_h$, is $P_h$ and is subject to (11). Unlike the agents associated with the home appliances, this agent receives a single reward signal from the environment: the electricity cost (26), because the goal of the agent is to reduce the energy bill. The agent's Q-table is updated as follows:

$$Q^{ess}(s^{ess}_h, a^{ess}_h) \leftarrow (1-\alpha) Q^{ess}(s^{ess}_h, a^{ess}_h) + \alpha \left( r^{ess}_{h+1} + \gamma \min_a Q^{ess}(s^{ess}_{h+1}, a) \right) \quad (27)$$

Algorithm 2 presents the pseudocode of the Q-learning for the battery system. The input contains the battery system parameters. The energy supplied by the renewable energy source is stored in the battery system. The battery system agent observes the battery state and the energy demand of the home appliances to charge/discharge energy accordingly.

Algorithm 2 Q-Learning for Control of Battery System
Require: battery system parameters $B_{\min}$, $B_{\max}$, $\beta$, $\eta$. Initialize $Q^{ess}$ arbitrarily.
Ensure: scheduling policy for the battery system derived from $Q^{ess}$.
1: Loop for each episode
2: For $h = 1, 2, \ldots, H$ do
3: Observe current state $s^{ess}_h$.
4: Choose action $a^{ess}_h$ on the basis of current state $s^{ess}_h$ using $\epsilon$-greedy selection derived from $Q^{ess}$.
5: Take action $a^{ess}_h$, obtain $r^{ess}_{h+1}$ and next state $s^{ess}_{h+1}$.
6: Calculate battery dynamics in (9).
7: Update $Q^{ess}$ using (27).
8: $s^{ess}_h \leftarrow s^{ess}_{h+1}$.
9: End for

Remark 1: The variation of renewable generation can affect the total energy bought from the utility. Because electricity prices change with time, the total energy bought from the utility affects the energy cost, which the learning agent aims to minimize. Fortunately, renewable generation generally follows a periodic pattern related to time; the agent can learn to perform energy scheduling by accessing the time index $h$ as well as other important states, such as the current electricity price $\lambda^{RTP}_h$ and battery level $B_h$. All this information is encoded in the state vector in (25) and, together with the historical information stored in the Q-table, allows the agent to plan upcoming activities.

V. NUMERICAL RESULTS
We examined energy scheduling in a smart home by using the proposed approach. In our analysis, the scheduling horizon starts at $h = 1$ and ends at $H = 24$, and one episode consists of 24 hours. The RTP signals were broadcast periodically by the utility, and day-ahead prices were used as forecast prices. Price uncertainty resulted from the difference between the real-time and forecast prices [65]. The prediction error of prices was evaluated using the mean absolute percentage error (MAPE):

$$\text{MAPE} = \frac{100\%}{H} \sum_{h=1}^{H} \left| \frac{\lambda^{RTP}_h - \lambda^D_h}{\lambda^{RTP}_h} \right|$$

where $\lambda^D_h$ cent/kWh represents the day-ahead pricing signal. Fig. 2 illustrates sample data, including solar energy generation and energy prices, from PJM [66]. Using the PJM data set and analyzing the electricity prices of each month in 2019, we found that the highest prediction errors occurred in January. The following three days were selected: January 2, 2019 (maximum MAPE), January 12, 2019 (50th percentile), and January 27, 2019 (minimum MAPE). Fig. 3 displays the RTP and day-ahead pricing of the selected days. A PV system was used as the renewable energy source. Fig. 4 illustrates the corresponding PV power output. A prediction error of 10% for power generation was considered in our analysis [67].
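The MAPE computation described above can be sketched as follows (the price values are illustrative):

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent, between real-time
    prices (actual) and day-ahead prices (forecast)."""
    assert len(actual) == len(forecast)
    return 100.0 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

# Example: two slots, each with a 10% absolute error.
print(mape([10.0, 20.0], [11.0, 18.0]))  # 10.0
```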
The smart home in this study was equipped with an inflexible appliance, a time-flexible appliance, three power-flexible appliances, a battery system, and a PV system [2]. Table 1 displays the requested appliance operation list of the smart home user. A Li-ion battery has a charging/discharging efficiency of about 90%, and its power leakage is about 0.1% to 0.2%; accordingly, $\eta = 0.99$ and the power leakage rate $\beta = 0.99$ were set in (9) [68]. The minimum battery level $B_{\min}$ and maximum battery capacity $B_{\max}$ in (8) were 1 and 12 kWh, respectively. For the learning algorithms, the parameters were tuned for the best performance: $\epsilon = 0.1$, $\alpha = 0.9$, and $\gamma = 0.8$.
The proposed MORL was used to realize the user preference-based DR. Fig. 5 displays the appliance scheduling obtained with the proposed DR program on January 2, 2019. The energy consumption in peak-price time slots was reduced. The peak electricity price occurred at $t = 13$, and the energy consumption of all power-flexible appliances was reduced in that time slot. Compared with normal operation from $t = 17$ to $t = 19$, the washing machine operated from $t = 18$ to $t = 20$, which reduced the energy bill. The proposed DR program was thus effective in reducing the energy bill.
The proposed approach was compared with SORL [24], normal operation, and MINLP algorithms [16]. All learning curves were averaged over 10 simulation runs. In MINLP-based DR with prior information, the actual RTP and PV supply of all future time slots were used to produce the ideal performance. MINLP-based DR without prior information, SORL-based DR, and the proposed MORL-based DR used forecast prices and forecast renewable energy generation for energy scheduling. For SORL, the reward function of agent $n$ in time slot $h$ is the weighted sum of the rewards in (22) and (23):

$$r_{n,h} = w \, r_{1,n,h} + (1-w) \, r_{2,n,h}$$

The associated Q-table was updated using (20). Normal operation refers to operating the appliances according to Table 1. MINLP-based DR used day-ahead pricing signals as the prediction of RTP signals to evaluate the energy cost for the day. Fig. 6 displays the learning curves of the compared methods on the selected days. The performance of the proposed MORL-based DR was closest to the ideal performance, followed by the SORL-based DR. The MINLP-based DR program performed slightly better than normal operation but was susceptible to prediction errors. From Fig. 6(a), with the highest prediction error, to Fig. 6(c), with the lowest, the performance of MINLP-based DR improved but remained worse than that of the proposed approach. As illustrated in Figs. 6(b) and 6(c), the MINLP-based DR exhibited better performance when the prediction errors were low. To verify the effectiveness and efficiency of the proposed approach in addressing user preference changes, Fig. 7 illustrates the learning curves of the SORL-based and proposed MORL-based DRs. Two weight settings, $w = 0.2$ and $w = 0.7$, were considered. In Fig. 7, the user changed the preference at episode 300 (dotted vertical line).
After the change in the user preference, the objective value achieved by the SORL-based DR required approximately 300 episodes training to converge to a steady value. By contrast, once the objective value achieved by the MORL-based DR reached a steady value, a jump to another steady objective value was observed when a change in user preference occurred. This quick convergence was attributed to the use of two Q-tables, which mitigated the effect of the change of weights.
Finally, the cost and user dissatisfaction of the proposed MORL-based and MINLP-based DRs under various preference weights were compared. Table 2 presents a summary of the average electricity cost and average user dissatisfaction level in January 2019. Normal operation incurred the same cost (¢32.34) in all cases because the appliances were not scheduled in response to prices. For the other energy scheduling methods, a larger weight $w$ yielded a lower electricity cost and higher user dissatisfaction. Compared with MINLP, the cost reductions achieved by our approach were 9.95%, 9.03%, and 6.35% for $w = 0.3$, 0.5, and 0.7, respectively (an 8.44% reduction on average), while the dissatisfaction level changed by 3.99%, 0.92%, and −0.79% for $w = 0.3$, 0.5, and 0.7, respectively (a 1.37% increase on average). The proposed MORL-based DR outperformed the MINLP-based DR in the presence of prediction errors in the pricing signals.

FIGURE 7. Learning curves on January 2, 2019, considering a change in user preference (episode 1 to episode 300: $w = 0.7$; episode 301 to episode 1500: $w = 0.2$). When the weight changes, the SORL-based DR undergoes additional learning to update its Q-table until it reaches a steady objective value. For the proposed MORL, the weight change does not affect the update rule (24), which uses rewards (22) and (23); thus, a steady objective value is reached immediately.

VI. CONCLUSION
This study investigated a user preference-based DR program that enables smart home users to optimize appliance scheduling by shifting or reducing energy consumption. In the literature, user preference has been assumed to be fixed, without considering possible changes over time. To address such changes, the MORL-based DR was proposed to achieve fast convergence to a steady objective value in the presence of price and renewable uncertainty. Numerical analysis using real-world data, including price signals and renewable energy generation, was conducted to illustrate the advantages of the proposed approach. The SORL-based DR required an extra learning period compared with the proposed approach. The MINLP-based DR, which uses an optimization algorithm instead of a learning algorithm, was not affected by a change in user preference but was susceptible to the uncertainty induced by electricity prices and renewable energy generation. The numerical analyses show that the proposed MORL-based DR outperformed the MINLP-based DR, reducing the energy cost by 8.44% while increasing user dissatisfaction by only 1.37% on average. Our future work includes the investigation of blockchain-enabled transactive energy markets for residential DR programs.
WEI-YU CHIU (Member, IEEE) received the Ph.D. degree in communications engineering from National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2010. He is currently an Associate Professor of electrical engineering with NTHU. His research interests include multiobjective optimization and reinforcement learning, and their applications to control systems, robotics, and smart energy systems. He was a recipient of the Youth Automatic Control Engineering Award bestowed by Chinese Automatic Control Society, in 2016, the Outstanding Young Scholar Academic Award bestowed by Taiwan Association of Systems Science and Engineering, in 2017, the Erasmus+ Program Fellowship funded by European Union (staff mobility for teaching), in 2018, and the Outstanding Youth Electrical Engineer Award bestowed by the Chinese Institute of Electrical Engineering, in 2020. From 2015 to 2018, he was an Organizer or the Chair of the International Workshop on Integrating Communications, Control, and Computing Technologies for Smart Grid (ICT4SG). He is a Subject Editor for IET Smart Grid.
WEI-JEN LIU received the B.S. degree in electronic engineering from the National Taipei University of Technology, Taipei, Taiwan, in 2020. He is currently pursuing the master's degree in electronic engineering at National Tsing Hua University, Hsinchu, Taiwan. His research interests include blockchain technologies, network security, quantitative trading, and renewable energy.