Building Energy Management for Demand Response Using Kernel Lifelong Learning

Demand response (DR) aims at improving the reliability and efficiency of the power grids by shaping the power demand over time. Given that building energy consumption constitutes a significant portion of the overall grid load, building energy management is a critical component for the DR portfolio. In this study, DR control policies for lighting and air-conditioner systems for the individual spaces in buildings are proposed. The policies are designed to achieve the energy reduction amount specified in the DR request while minimizing the user discomfort. A significant challenge is to cope with the uncertainty of various environmental factors such as the solar illuminance and ambient temperature, as well as the psycho-economic factors such as the energy usage preferences of the occupants. We employ a data-driven machine learning approach to tackle this challenge. Our novel idea is to take advantage of the structural similarity of the control policies across the spaces in a lifelong multi-task learning framework. To accommodate significant nonlinearity in efficient policies, a kernel-based learning approach is pursued. The dual decomposition method is employed to relax the constraint coupled across the spaces, which allows solving the overall learning problem via a series of unconstrained subproblems. The efficacy of the proposed method is verified by numerical experiments based on semi-real data sets.


I. INTRODUCTION
In recent years, demand response (DR) has become a key component of smart grid systems due to its significant potential for enhancing grid economy and reliability [1]. DR encourages energy users to adjust their energy consumption by offering financial rewards or imposing penalties in order to elicit a desirable balance between the supply and demand of electrical power. It is worth noting that the energy consumption of buildings accounts for more than 20% of the global energy consumption, which is expected to continue to rise. Thus, developing effective energy management strategies for buildings is a major DR challenge [2], [3]. A building can be viewed as a collection of numerous spaces, each of which contains controllable loads. Therefore, an energy management system (EMS) for buildings should be capable of controlling a large number of controllable loads effectively, while meeting the given DR requests from the utility, saving overall energy costs, and guaranteeing the comfort of the users occupying the individual rooms and spaces. Much research effort has been devoted to developing DR algorithms for controlling building energy consumption [4]-[16].
The associate editor coordinating the review of this manuscript and approving it for publication was Victor Hugo Albuquerque.
In this work, our goal is to develop a DR policy capable of achieving a given amount of reduction in building energy consumption, specified in the DR request signal, while minimizing the dissatisfaction levels of the occupants of each room. There are many types of commercial buildings, such as offices, schools, and hospitals, and the major energy-consuming appliances can differ considerably depending on the type of building. In this work, we consider buildings in which the energy consumption of the lighting and air conditioning (AC) systems constitutes a dominant portion of the total building energy usage [17], [18]. A key challenge is that one must cope with the stochasticity of various environmental variables such as the ambient sunlight and temperature. Furthermore, it is desired that the policy also respects the different energy usage preferences of the occupants. Acquiring the exact distributional information for such random processes can be challenging, especially when it is prone to change over time.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To tackle this challenge, a data-driven machine learning (ML) approach is taken, where the optimal policy is learned from the collected data. For efficient training and good generalization performance of the learned policies, one has to incorporate any prior information in the learning formulation. Our key idea is to recognize that the policies for different rooms share certain structural similarities, although there are variations from room to room due to the particularities of individual spaces. For example, the control policies for campus classrooms should have a lot in common, although there would be differences due to varying room sizes, directions/locations, and usage patterns.
Such shared structures can be captured in a multi-task learning (MTL) framework. In MTL, the classifiers or regressors are learned to solve multiple related tasks jointly by leveraging their similarities [19], [20]. Thus, MTL can attain a performance that exceeds what can be achieved by learning each of the tasks in isolation. In particular, MTL is useful when the numbers of data samples for the tasks are relatively small (or when their distributions are continuously changing so that the effective numbers for a given state are small), as the knowledge can be shared across tasks, and thus pooling all tasks' data facilitates the learning for the individual tasks.
As the number of tasks increases, the computational burden can become significant for the MTL training, with all tasks' data processed together. Furthermore, the classifiers need to be continuously updated when the data distribution is non-stationary. In some applications, new tasks can arrive, or additional data can be collected for existing tasks, after the classifiers have been trained. In this case, re-training the entire set of classifiers just to incorporate the new arrivals is not well justified. These issues can be addressed by the lifelong learning approach, where the MTL problems are solved in an online manner [21], [22]. Lifelong learning can transfer the knowledge acquired from novel tasks/data to all other tasks by continuously refining the basis of the shared representation of the classifiers. In our building DR setup, the framework can exploit the structural similarities of DR policies for different rooms, while reducing computational complexity, incorporating sequentially arriving measurements, and tracking slow variations of optimal policies over time.
For designing ML methods, one needs to specify the space of functions in which the best approximation of the optimal policy is searched for. The simplest example would be the space of linear functions, which map the features linearly to the target labels. Nonlinear relationships can be accommodated by adopting a parameterized family of functions, but selecting a proper parameterization often requires detailed domain knowledge and educated guesses. Kernel-based learning allows one to explore very flexible non-parametric classes of nonlinear functions, namely, reproducing kernel Hilbert spaces (RKHSs). A kernel-based efficient lifelong learning algorithm (KELLA) was developed recently [23], and it is adapted here for our problem. In particular, as our problem contains a constraint, whereas typical classification problems are formulated as unconstrained ones, a Lagrange duality-based method is employed to transform the overall problem into a series of smaller unconstrained learning problems.
Summarizing, the main contributions of this work are:
• A DR problem involving major controllable loads in building EMS is formulated, where the occupants' discomfort levels are minimized while meeting a specified DR target.
• The lifelong learning approach is employed to exploit structural similarities in the control policies for individual spaces in the building, accommodating the nonlinearities in a kernel-based learning framework.
• The coupling DR constraint is relaxed via the dual decomposition technique so that the lifelong learning method, originally developed for unconstrained ML problems, can be readily adapted to the optimization problem at hand.
• The proposed algorithm is tested by numerical experiments designed partly based on real data sets.
The rest of this paper is organized as follows. Related works are reviewed in Sec. II. The system model for the building EMS and the DR problem formulation are given in Sec. III. The overall solution architecture based on the dual decomposition method is described in Sec. IV. The proposed lifelong learning solution is derived in Sec. V. The proposed method is tested using semi-real data sets in Sec. VI. The conclusions are provided in Sec. VII.

II. RELATED WORKS
Existing building DR algorithms are summarized here. They can be categorized as optimization-based and ML-based.

A. OPTIMIZATION-BASED BUILDING DR ALGORITHMS
The building DR problems can be formulated as optimization problems of load schedules. A load dispatch model for residential buildings was formulated as a mixed integer linear program to achieve peak reduction and to meet DR requirements [4]. The role of heating, ventilation and air conditioning (HVAC) systems in commercial buildings as DR and ancillary service resources was explored in [5]. The control strategy for HVAC systems in commercial buildings was investigated in [6], where a transactive market mechanism was employed based on a convex optimization formulation. A load shedding strategy for lighting systems was studied, where the dimming levels were adjusted to meet the DR requirements without violating the minimum illuminance requested by users, through a constrained quadratic optimization formulation [7]. Based on the predicted load profiles, an optimal scheduling strategy for HVAC and energy storage systems was designed using model predictive control (MPC) to reduce the building energy bill [8]. In these works, however, the uncertainty due to various stochastic variables was often neglected by assuming the availability of accurate sensor readings and/or their predictions.
A number of studies focused on energy management for DR under uncertainty. A robust commercial building DR strategy was proposed employing a genetic algorithm under cooling load prediction uncertainty [9]. Limitations of conventional DR methods designed for individual buildings were observed in [10], where a simple coordinated DR strategy was developed to demonstrate the need for coordination in improving the building group-level performance.

B. ML-BASED DR ALGORITHMS
Recently, there have been numerous efforts to incorporate ML techniques into DR algorithms, in order to cope with the uncertainty present in realistic DR scenarios. They range from using ML techniques simply to improve the prediction of stochastic variables [11]-[13] to incorporating ML models directly into the optimization formulations [14]-[16].
A neural network (NN)-based prediction model for the energy consumption of air conditioning systems in office buildings was proposed in [11], where the averaging effect of the prediction errors was reported for aggregated curtailment for DR. A support vector regression (SVR)-based building load prediction method was proposed for commercial buildings, where the HVAC set points capturing the energy consumption trend were used as features [12]. A framework for forecasting the reduction in the individual users' energy consumption during the DR period was proposed using various ML methods including ordinary least-squares (OLS) regression, K-nearest neighbors (KNN), and SVR algorithms [13]. A reinforcement learning approach was taken for the residential building EMS, where the running time and energy allocation of each scheduled device were determined by learning individual consumers' preferences and price variation [14].
Several studies proposed pricing schemes for DR. A dynamic pricing algorithm for DR in a hierarchical energy market was proposed to simultaneously optimize the profit of service providers and the costs of consumers, where the uncertainty in the load demand and the wholesale energy price was tackled using a reinforcement learning method [15]. An optimal dynamic pricing strategy for retail energy was developed, where an online learning-based algorithm was proposed to learn the behaviors of customers by observing their responses to varying prices [16].

III. SYSTEM MODEL AND PROBLEM FORMULATION
First, the overall architecture of the considered building EMS is explained. Since we consider buildings in which the dominant electric energy consumption is ascribed to lighting and AC systems, we focus on controlling these two types of appliances for DR. Extensions for including additional types of appliances are straightforward. Fig. 1 illustrates a building with each of the $I$ rooms equipped with lighting and AC systems. An illuminometer and a thermometer in each room measure the ambient illuminance due to sunlight and the indoor temperature of the room, respectively. The smart meter records the energy consumption of the controllable appliances and communicates the information to the processing unit of the EMS. In addition, the smart meter controls the power consumption of the appliances according to the control signals sent from the EMS controller. The building EMS needs to meet a DR request, which is the overall reduction in the energy consumption in the building, by controlling the appliances in the individual rooms.
The energy consumption of the controllable appliances can be modeled as follows. Let $E_i^{\text{light}*}$ ($E_i^{\text{ac}*}$, respectively) be the energy consumption of the lighting (AC) system for room $i \in \{1, \ldots, I\}$ at the desired user comfort setting. Let $l_i^*$ denote the corresponding desired illuminance in lux, $l_i^{\text{sun}}$ the ambient illuminance due to natural sunlight in lux, $\eta_i$ the luminous efficacy of the lighting system of room $i$ in lm/kW, $S_i$ the surface area of room $i$ in m$^2$, and $\tau$ the time duration in hours (h). In this paper, it is assumed for simplicity that whenever a DR request arrives, the EMS makes the control decisions for the next hour, i.e., $\tau = 1$ is used. Similarly, let $t_i^*$ represent the desired temperature setting for room $i$ in °C, $t_i^{\text{cur}}$ the current indoor temperature in °C, $A_i$ the energy consumption variation with respect to the temperature variation during an hour in kWh/°C, and $B_i$ the minimum energy consumption required to maintain the current temperature for an hour in kWh. Then, $E_i^{\text{light}*}$ can be modeled in terms of these quantities as in (1) [24], with the corresponding model for $E_i^{\text{ac}*}$ given in (2). Note that the illuminance due to sunlight, $l_i^{\text{sun}}$, and the current indoor temperature, $t_i^{\text{cur}}$, are stochastic variables. However, they can be measured by an illuminometer and a thermometer, respectively. Since the desired illuminance and indoor temperature of each room often depend on the user preferences and the room type, we assume that $l_i^*$ and $t_i^*$ cannot be measured directly. In addition, we assume that the baseline energy consumption levels of the lighting and AC systems of room $i$ are determined by the EMS according to the typical energy consumption of these systems. That is, the control actions $a_i^{\text{light}}$ and $a_i^{\text{ac}}$ capture the fractions of the energy expended relative to the baseline energy consumption. When $a_i^{\text{light}}$ and $a_i^{\text{ac}}$ are less than 1, reductions in the energy consumption of the lighting and AC systems from the baseline levels occur.
In this study, the user discomfort level is quantified simply by the amount of deviation (reduction) in the energy consumption from the desired comfort setting. By quantifying the discomfort levels in terms of energy, combining the discomfort for two different systems into a single cost function becomes straightforward. Specifically, the difference in the energy consumption for system $j \in \{\text{light}, \text{ac}\}$ is the gap between the energy consumed at the desired comfort setting, $E_i^{j*}$, and the energy actually consumed under action $a_i^{j}$. Upon defining the vectors $E_i^* := [E_i^{\text{light}*}, E_i^{\text{ac}*}]^\top$ and $a_i := [a_i^{\text{light}}, a_i^{\text{ac}}]^\top$, the total user discomfort $D_i$ of room $i$ can then be modeled simply as the sum of the squared deviations over the two systems.
1 As our work develops an ML-based DR algorithm, the proposed method can accommodate a broad range of appliance models. Thus, in this work, relatively simple yet commonly employed models are adopted to set the stage for the algorithm development. It is also worth noting that the energy consumption breakdowns among the loads can depend on the size and purpose (residential/commercial) of the buildings/rooms. Such variations can be readily captured by setting the model parameters $S_i$, $\eta_i$, $A_i$, and $B_i$ appropriately [26].
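The appliance models and the squared-deviation discomfort can be illustrated with a minimal numerical sketch. The functional forms below are assumptions chosen only to match the stated units (lux × m² × h divided by the luminous efficacy for lighting, and a temperature gap scaled by $A_i$ plus the offset $B_i$ for the AC, in a cooling scenario); they are not claimed to be the exact models (1) and (2). All function names and numbers are hypothetical.

```python
import numpy as np

def desired_light_energy(l_star, l_sun, area, efficacy, tau=1.0):
    """ASSUMED lighting model: illuminance gap * area * duration / efficacy."""
    return (l_star - l_sun) * area * tau / efficacy

def desired_ac_energy(t_star, t_cur, a_coef, b_coef):
    """ASSUMED AC model (cooling): a_coef * temperature gap + b_coef."""
    return a_coef * (t_cur - t_star) + b_coef

def discomfort(e_star, action, e_base):
    """Sum over systems of the squared gap between the desired energy and
    the energy actually used, where actual = action * baseline."""
    e_star = np.asarray(e_star, dtype=float)
    actual = np.asarray(action, dtype=float) * np.asarray(e_base, dtype=float)
    return float(np.sum((e_star - actual) ** 2))

# Hypothetical room: lighting cut to 80% of baseline, AC left untouched.
print(discomfort([5.0, 5.8], [0.8, 1.0], [5.0, 5.8]))
```

Expressing both systems in energy units is what allows the two deviations to be added into a single scalar, as the text notes.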
The squared deviations are used here based on the intuitive notion that the user discomfort can increase quickly as the deviation grows. However, any other discomfort function can be employed in our algorithm, as long as it is twice differentiable. For example, one can alternatively model the discomfort level to be zero in a certain comfort range, and assess a step or a linear penalty outside the range. In such cases, a step function $u(x - a)$ can be approximated by the twice-differentiable function $\frac{1}{1 + e^{-(x-a)/b}}$, and the hinge loss function $\max\{0, x - a\}$ can likewise be replaced by a twice-differentiable approximation. Finally, in order to combine the user discomfort levels for different rooms in a balanced way, a normalized user discomfort $\hat{D}_i$ is defined for each room $i$. Let $E^{\text{DR}}$ be the DR request from the utility, which is a mandated reduction amount in the total building energy consumption. The energy reduction amount for room $i$ is denoted by $E_i^{\text{red}}$, as given in (9). The sum of $E_i^{\text{red}}$ over all rooms must be larger than or equal to $E^{\text{DR}}$ to satisfy the DR request.
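The smooth surrogates for the step and hinge penalties can be checked numerically. The logistic form of the step approximation is the one quoted in the text; the softplus form used for the hinge is an assumption here (it is the antiderivative of that logistic function, a common choice), and the function names are hypothetical.

```python
import numpy as np

def smooth_step(x, a, b):
    """Logistic approximation of the unit step u(x - a); sharper as b shrinks."""
    return 1.0 / (1.0 + np.exp(-(x - a) / b))

def smooth_hinge(x, a, b):
    """ASSUMED softplus approximation of max(0, x - a).
    np.logaddexp(0, z) evaluates log(1 + exp(z)) without overflow."""
    return b * np.logaddexp(0.0, (x - a) / b)

x = np.linspace(0.0, 2.0, 5)
print(smooth_step(x, a=1.0, b=0.1))
print(smooth_hinge(x, a=1.0, b=0.1))
```

Under this choice, the softplus curve lies above the exact hinge by at most $b \ln 2$ (attained at $x = a$), so $b$ trades smoothness against fidelity.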
Thus, our goal is to come up with a control policy that determines the actions $a_i$ based on the available measurements $l_i^{\text{sun}}$ and $t_i^{\text{cur}}$ for each room $i = 1, 2, \ldots, I$ so that the DR request is met and the overall user discomfort levels are minimized. Upon defining $\xi_i := [l_i^{\text{sun}}, t_i^{\text{cur}}]^\top$, the optimal control policy for each $i$ is a mapping $\pi_i : \xi_i \mapsto a_i$. Note that the actions can depend on $\xi_i$, which is readily available from sensor readings, but not on $E_i^*$. A stochastic optimization problem for obtaining the optimal policies can thus be formulated as
$$\min_{\{\pi_i\}} \ \sum_{i=1}^{I} \mathbb{E}\big[\hat{D}_i\big] \tag{10a}$$
$$\text{subject to} \ \sum_{i=1}^{I} \mathbb{E}\big[E_i^{\text{red}}\big] \ge E^{\text{DR}} \tag{10b}$$
with $a_i = \pi_i(\xi_i)$, where $\mathbb{E}[\cdot]$ denotes the expectation with respect to the random vectors $\{\xi_i\}$ and $\{E_i^*\}$.
Remark: The goal of our building DR is to implement the energy reduction in the controllable loads according to the given DR request at the minimum discomfort for the occupants. Thus, it is assumed that the background load fluctuations are taken into account by the utility when generating the DR request signal. On the other hand, it is worth noting that our ML-based framework can seamlessly incorporate the background load uncertainty with minor modifications in the formulation. Specifically, the change in the background load $L_i$ in the $i$-th room, where $L_i$ is random, can be included in the energy reduction in (9). Then, formulation (10) can be modified by using this augmented energy reduction, where the expectation is now taken with respect to not only $\{\xi_i\}$ and $\{E_i^*\}$ but also $\{L_i\}$.
In the following sections, our solution approach for (10) is described. The structure of our method is shown in Fig. 2, which consists of three parts. First, the overall problem (10) is divided into subproblems involving individual rooms using a dual decomposition method (Sec. IV-A). Then, the subproblems are tackled independently via single task learning (STL), coordinated by the dual variable (Sec. IV-B). The shared skills from the STL policies are collected in a kernel-based lifelong learning framework (Sec. V). The algorithm iterates these processes based on the sensor measurements and user preference inputs, and upon convergence, the optimal DR policy of each room is obtained as the solution to (10).

IV. SINGLE TASK LEARNING BASED SOLUTION
There are a couple of issues associated with solving the optimization problem (10). First, the optimal policies need to be found jointly for all rooms due to constraint (10b), which couples all rooms. For a large building with many rooms, this may be prohibitively complex. Second, since the problem involves expectations, the knowledge of the probability distributions for {ξ i } and {E * i } is required, which may not be readily available in practice. Even if the distributions can be estimated, calculating the integrals may significantly add to the computational complexity.
In the following, the Lagrange relaxation-based dual decomposition method is employed to decouple the overall problem into $I$ subproblems for individual rooms. Subsequently, the relaxed problem is tackled taking an ML approach. Thus, one can work with samples of $(\{\xi_i\}, \{E_i^*\})$ to approximate the expectations. Once the optimal policies are learned, they can be evaluated very quickly without requiring complex optimization, facilitating the implementation of the real-time control in the EMS.

A. DUAL DECOMPOSITION
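The constraint in (10b) can be relaxed via the Lagrange dual method, yielding a decoupled formulation, where each room's policy can be optimized separately, coordinated by the Lagrange multiplier [28]. Given a Lagrange multiplier $\nu \ge 0$, the Lagrangian for (10) can be expressed as
$$\mathcal{L}(\{\pi_i\}, \nu) = \sum_{i=1}^{I} \mathbb{E}\big[\hat{D}_i\big] + \nu \Big(E^{\text{DR}} - \sum_{i=1}^{I} \mathbb{E}\big[E_i^{\text{red}}\big]\Big). \tag{11}$$
The dual function can be obtained by minimizing the Lagrangian as
$$D(\nu) = \min_{\{\pi_i\}} \mathcal{L}(\{\pi_i\}, \nu). \tag{12}$$
Note that (12) can be split into $I$ subproblems, which can be solved independently for a given $\nu$. That is, upon defining a per-room cost function that combines the expected normalized discomfort of room $i$ with its expected energy reduction weighted by $-\nu$, the optimal control policies $\{\pi_i^*\}$ can be obtained by minimizing this cost over $\pi_i$ separately for each room, which constitutes the subproblems (14). Now, the optimal dual variable can be found by solving the dual problem defined as $\max_{\nu \ge 0} D(\nu)$.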
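Thus, $\nu$ can be updated based on the subgradient method as
$$\nu_{t+1} = \Big[\nu_t + \gamma_t \Big(E^{\text{DR}} - \sum_{i=1}^{I} \mathbb{E}\big[E_i^{\text{red}}\big]\Big)\Big]^+ \tag{16}$$
where $t$ denotes the iteration index, $\gamma_t$ denotes a small positive step size, $[\cdot]^+ := \max\{0, \cdot\}$, and the energy reductions are evaluated under the policies obtained from (14) for $\nu = \nu_t$. The overall algorithm iterates the solution of (14) and the dual update (16) until convergence. Upon convergence, the obtained optimal control policies $\{\pi_i^*\}$ will satisfy the DR request given in (10b) and minimize the total average discomfort across the building at the same time.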
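The interplay between the decoupled subproblems and the dual update can be sketched on a deterministic toy problem. The setup is an assumption made for illustration only: each room has a single scalar action, the discomfort is the unnormalized squared deviation, the reduction is measured against a hypothetical baseline, and no expectations or learning are involved, so each subproblem has a closed-form solution.

```python
import numpy as np

def solve_room(e_star, e_base, nu):
    """Closed-form minimizer over a in [0, 1] of the toy per-room cost
    (e_star - a*e_base)^2 - nu*(1 - a)*e_base."""
    return np.clip((e_star - nu / 2.0) / e_base, 0.0, 1.0)

def dual_ascent(e_star, e_base, e_dr, step=0.1, iters=500):
    """Alternate decoupled room updates with a projected subgradient step on nu."""
    nu = 0.0
    for _ in range(iters):
        a = solve_room(e_star, e_base, nu)          # rooms solved independently
        reduction = np.sum((1.0 - a) * e_base)      # total energy reduction
        nu = max(0.0, nu + step * (e_dr - reduction))  # projection onto nu >= 0
    return nu, solve_room(e_star, e_base, nu)

e_star = np.array([4.0, 5.0, 6.0])   # hypothetical desired energy per room
e_base = np.array([5.0, 6.0, 7.0])   # hypothetical baseline energy per room
nu, actions = dual_ascent(e_star, e_base, e_dr=6.0)
print(nu, actions, np.sum((1.0 - actions) * e_base))
```

The multiplier acts as a price on energy: it rises while the total reduction falls short of the request and stops changing once the constraint is met with equality.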

B. LEARNING-BASED SOLUTION
The remaining issue is to solve the stochastic optimization problem in (14). A pragmatic approach to approximate the expectations would be to collect representative samples $(\xi_{i,n}, E_{i,n}^*)$, $n = 1, 2, \ldots, N_i$, from the sensors installed in each room $i$ and replace the expectation in (14) with the corresponding sample average, which yields (17). Our approach is to search for the optimal policy $\pi_i^*$ in an appropriate function space using ML tools. To accommodate the nonlinear class of functions, kernel-based learning is employed. Given an RKHS $\mathcal{H}$ defined by a positive-definite kernel function $\kappa(\cdot, \cdot) : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$, the kernel function $\kappa$ determines a nonlinear mapping $\phi : \mathbb{R}^2 \to \mathcal{H}$. It can be postulated that the control policy $\pi_i(\xi_i)$ for each room $i$ can be well approximated by
$$\pi(\xi_i, \theta_i) := \big[h(\langle \theta_i^{\text{light}}, \phi(\xi_i) \rangle),\ h(\langle \theta_i^{\text{ac}}, \phi(\xi_i) \rangle)\big]^\top \tag{18}$$
where $\theta_i$ collects $\theta_i^{\text{light}}$ and $\theta_i^{\text{ac}}$, $\langle \cdot, \cdot \rangle$ represents the inner product on $\mathcal{H}$, and $h(x) := 1/(1 + e^{-x})$ ensures that each action is always within $[0, 1]$. Therefore, the optimal policy for room $i$ can be characterized by solving (17) with respect to $\theta_i$, which is referred to as problem (19). Upon defining $\Phi(\xi_i) := [\phi(\xi_{i,1}), \ldots, \phi(\xi_{i,N_i})] \in \mathcal{H}^{N_i}$ and a block diagonal matrix $\Phi_i^d := \mathrm{diag}\{\Phi(\xi_i), \Phi(\xi_i)\}$, the representer theorem guarantees that the solution to the empirical problem (19) can be represented as a linear combination of the features of the given input samples [29], that is, $\theta_i^* = \Phi_i^d w_i^*$ as in (20), where $w_i^*$ is the vector of combination coefficients. The significance of this step is in converting the optimization problem in (19), formulated possibly in an infinite-dimensional space, to that of finding vectors in $\mathbb{R}^{N_i}$. Note that (19) is an STL problem, viewing each room $i$ as a task, since the policy for each room is learned in isolation based on its own input features. An extension based on the MTL approach is introduced in the next section.
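A minimal sketch of how a policy of this kind can be evaluated through kernel evaluations alone is given below. The Gaussian kernel, its bandwidth, and the function names are assumptions for illustration; the text only requires a positive-definite kernel. The coefficient vectors play the role of the representer-theorem weights for the lighting and AC actions.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gram matrix of an (assumed) Gaussian kernel between rows of x and y."""
    sq_dist = (np.sum(x**2, axis=1)[:, None]
               + np.sum(y**2, axis=1)[None, :]
               - 2.0 * x @ y.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def policy(xi_new, xi_train, w_light, w_ac, bandwidth=1.0):
    """Actions h(<theta, phi(xi)>) with theta expressed via training features,
    so that each inner product reduces to a kernel row times a weight vector."""
    k = gaussian_kernel(xi_new, xi_train, bandwidth)     # shape (M, N)
    logits = np.stack([k @ w_light, k @ w_ac], axis=1)   # shape (M, 2)
    return 1.0 / (1.0 + np.exp(-logits))                 # logistic squashing

rng = np.random.default_rng(0)
xi_train = rng.normal(size=(30, 2))   # hypothetical standardized [l_sun, t_cur]
w_light = rng.normal(size=30)
w_ac = rng.normal(size=30)
print(policy(rng.normal(size=(3, 2)), xi_train, w_light, w_ac))
```

In practice the two measurements would likely need to be standardized before applying a single bandwidth, since illuminance and temperature live on very different scales.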

V. KERNEL LIFELONG LEARNING FOR DR
In the STL formulation, the control policies for the individual rooms are learned independently. However, it is reasonable to expect the policies to share intrinsic structures, although there would be differences as well. For example, the control policies for campus classrooms would have a lot in common due to their similarities in sizes and usage patterns. Such shared structures can be exploited in an MTL framework.
Specifically, a recently developed kernel-based efficient lifelong learning algorithm, called KELLA, is adapted here for the building DR problem [23]. KELLA performs kernel-based MTL in an online manner for a sequence of tasks, effectively transferring knowledge across different tasks, while significantly lowering the computational complexity. It also allows tracking of slow variations in the optimal policies.

A. KERNEL MTL-BASED DR FORMULATION
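One approach to capture the shared structure for MTL is to adopt a union-of-subspaces model for the learned classifiers/policies [20]. Specifically, in our setup, the parameter $\theta_i$ of each room $i$ is modeled as $L s_i$, a combination of the elements of a library $L$ shared by all rooms, weighted by a room-specific code vector $s_i$. The MTL problem (22) for all rooms can then be stated as the joint minimization, over $L$ and $\{s_i\}$, of the rooms' costs evaluated at $\theta_i = L s_i$ together with regularization terms on $s_i$ and $L$. Here, $\|s_i\|_1$, the $\ell_1$-norm of $s_i$, is a regularizer that promotes sparsity in $s_i$, and $\|L\|^2$ controls the complexity of the library. Parameters $\mu$ and $\lambda$ are positive weights adjusting the strengths of the regularizers. Solving (22) directly requires processing the samples for all rooms jointly, which may incur significant computational and memory requirements and hinders the online implementation. To mitigate this issue, a proxy is derived for the cost function from the second-order Taylor approximation around the STL optimal solutions $\{\theta_i^*\}$ [23]. That is, the cost of room $i$ at $\theta_i$ is approximated by its value at $\theta_i^*$ plus a quadratic term in $\|\theta_i - \theta_i^*\|_{H_i}^2$, where $H_i$ is the Hessian of the cost evaluated at $\theta_i^*$, and $\|v\|_H^2 := \langle v, Hv \rangle$. Note that the first-order term vanishes at $\theta_i = \theta_i^*$ due to the first-order optimality. The Hessian can be computed as in (25).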
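With these, (22) can be approximated as (26), in which each room's cost is replaced by the quadratic proxy $\|\theta_i^* - L s_i\|_{H_i}^2$. Note that (26) is reminiscent of the kernel dictionary learning formulation [30]. Invoking again the representer theorem, one can show that the library can be expressed through the features of the input samples and a coefficient matrix, as in (27), where $A := [(A^{\text{light}})^\top, (A^{\text{ac}})^\top]^\top$ is the coefficient matrix.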

B. LIFELONG LEARNING ALGORITHM
As the number of rooms $I$ increases, the law of large numbers (LLN) starts to kick in for the summation in (26). Thus, one can pursue the stochastic optimization problem (28), in which the summation over rooms is replaced by an expectation taken with respect to $\theta^*$ and $H$. This problem can be solved in an online manner based on the stochastic gradient descent (SGD) method using $\{\theta_i^*\}$ and $\{H_i\}$ as samples.
Suppose that the input data samples $\{\xi_{i,n}\}_{n=1}^{N_i}$ from room $i$ are acquired in the $i$-th iteration. Then, the STL solution $\theta_i^*$ is obtained from (19). Given the last iterate of the library $L(i-1)$, the sparse code $s_i$ is computed via (29). Then, the instantaneous gradient of the objective of (28) with respect to $L$ is obtained as in (30), from which the SGD update (31) for $L$ follows with a positive step size. Per (27), the update can be equivalently done in terms of the coefficient matrix $A \in \mathbb{R}^{2N \times K}$. Plugging (20), (27), and (25) into (31) yields the corresponding update rules for $A^{\text{light}}$ and $A^{\text{ac}}$.
Note that the kernel matrices $K_{i,i}$ and $K_{i,1:i-1}$ can be computed without actually specifying the nonlinear mapping $\phi$, as long as the kernel function $\kappa$ is given, which is often called the kernel trick [31].
In fact, the kernel trick is instrumental for the sparse coding step in (29), as it can be rewritten in terms of kernel matrices as (34). Problem (34) can be solved efficiently using various sparse coding solvers such as the SPAMS package [32].
Finally, $\theta_i^*$ is computed from the updated library and the sparse code $s_i$, from which the policy function for each room $i$ can be obtained as $\pi(\xi_i, \theta_i^*)$ in (18), again using the kernel trick. Overall, the lifelong learning is done concurrently with the dual variable update so that the DR request is satisfied. The algorithm is listed in Table 1. Matrix $A$ can be initialized by solving (26) using a data set involving a small number $I_0$ of rooms. The inner loop in lines 3-8 performs the lifelong learning for the $I$ rooms sequentially for the given value of the Lagrange multiplier $\nu_t$. Lines 10-12 represent the polishing step, in which the sparse codes for all the rooms are updated based on the most up-to-date library. Line 13 performs the dual variable update, where the expectation is taken again from the samples $\{(\xi_{i,n}, E_{i,n}^*)\}_{n=1}^{N_i}$ of each room $i$. The outer loop is repeated until the algorithm converges.
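The alternation between sparse coding and the library update can be sketched in a finite-dimensional setting. This is a simplified illustration under stated assumptions, not the kernelized algorithm of Table 1: the library is an explicit matrix rather than a coefficient matrix over kernel features, the sparse code is found with plain ISTA instead of a dedicated solver, the weights `mu` and `lam` are assumed to multiply the $\ell_1$ and library regularizers respectively, and all names and default values are hypothetical.

```python
import numpy as np

def sparse_code(theta, hess, lib, mu, iters=200):
    """ISTA for min_s mu*||s||_1 + (theta - lib@s)^T hess (theta - lib@s)."""
    gram = lib.T @ hess @ lib
    lipschitz = 2.0 * max(np.linalg.eigvalsh(gram)[-1], 1e-12)
    s = np.zeros(lib.shape[1])
    for _ in range(iters):
        grad = -2.0 * lib.T @ hess @ (theta - lib @ s)
        z = s - grad / lipschitz
        s = np.sign(z) * np.maximum(np.abs(z) - mu / lipschitz, 0.0)  # soft threshold
    return s

def library_step(theta, hess, lib, s, lam, step):
    """One SGD step on ||theta - lib@s||_H^2 + lam*||lib||_F^2 w.r.t. lib."""
    grad = -2.0 * hess @ np.outer(theta - lib @ s, s) + 2.0 * lam * lib
    return lib - step * grad

def lifelong_pass(thetas, hessians, lib, mu=0.01, lam=1e-3, step=0.05):
    """Process tasks one at a time: code against the current library, then refine it."""
    codes = []
    for theta, hess in zip(thetas, hessians):
        s = sparse_code(theta, hess, lib, mu)
        lib = library_step(theta, hess, lib, s, lam, step)
        codes.append(s)
    return lib, codes
```

Each task only touches its own single-task solution and Hessian, which is what keeps the per-room cost independent of the number of rooms already processed.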

VI. NUMERICAL EXPERIMENTS

A. EXPERIMENT SETUP
The performance of the proposed DR algorithm is verified using numerical experiments. In order to conduct the experiments in a more realistic setup, real data sets are employed to the extent possible. First, an illuminance data set collected in a solar-powered house called SML System was used [33], [34]. We sampled $I$ days' worth of data of direct sunlight illuminance levels and scaled them properly to simulate the ambient illuminance for $I$ different rooms. The temperature levels were also taken by sampling $I$ days' records from the data set in [35], which contains the data for 6 rooms during 31 days. Using the illuminance and temperature samples, the samples of the energy consumption at the desired comfort setting were generated based on (1) and (2). Here, the parameters were randomly generated as follows. For the lighting system, first the mean of the desired illuminance $\bar{l}_i^*$ for each room $i$ was selected uniformly from the interval $[500, 700]$, i.e., $\bar{l}_i^* \sim \mathcal{U}[500, 700]$. Then, $l_{i,n}^*$ for $n = 1, 2, \ldots, N_i$ was sampled from a Gaussian distribution with mean $\bar{l}_i^*$ and variance $10^2$, that is, $l_{i,n}^* \sim \mathcal{N}(\bar{l}_i^*, 10^2)$. The luminous efficacy was sampled from $\eta_i \sim \mathcal{N}(12.5, 0.2^2)$. The surface area was sampled as $S_i \sim \mathcal{U}[100, 150]$. For the AC system, the mean of the desired temperature $\bar{t}_i^*$ was set randomly from a uniform distribution $\mathcal{U}[22, 25]$, and $t_{i,n}^*$ was sampled as $t_{i,n}^* \sim \mathcal{N}(\bar{t}_i^*, 0.5^2)$. Similarly, with $\bar{A}_i \sim \mathcal{U}[1.5, 1.7]$ and $\bar{B}_i \sim \mathcal{U}[0.9, 1.1]$, the samples for $A_{i,n}$ and $B_{i,n}$ were taken as $A_{i,n} \sim \mathcal{N}(\bar{A}_i, 0.01^2)$ and $B_{i,n} \sim \mathcal{N}(\bar{B}_i, 0.01^2)$, respectively. The number of training samples was $N_i = 30$ for all $i$.
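The parameter sampling described above can be reproduced with a short sketch. The distributions and their parameters are taken from the text; the function name, the dictionary layout, and the seed are hypothetical, and the illuminance and temperature measurements themselves (which come from the cited data sets) are not simulated here.

```python
import numpy as np

def sample_room_parameters(num_rooms, num_samples=30, seed=0):
    """Draw per-room means, then per-sample values around those means."""
    rng = np.random.default_rng(seed)
    shape = (num_rooms, num_samples)
    l_mean = rng.uniform(500.0, 700.0, size=(num_rooms, 1))  # desired lux
    t_mean = rng.uniform(22.0, 25.0, size=(num_rooms, 1))    # desired deg C
    a_mean = rng.uniform(1.5, 1.7, size=(num_rooms, 1))
    b_mean = rng.uniform(0.9, 1.1, size=(num_rooms, 1))
    return {
        "l_star": rng.normal(l_mean, 10.0, size=shape),
        "t_star": rng.normal(t_mean, 0.5, size=shape),
        "A": rng.normal(a_mean, 0.01, size=shape),
        "B": rng.normal(b_mean, 0.01, size=shape),
        "eta": rng.normal(12.5, 0.2, size=num_rooms),   # luminous efficacy
        "S": rng.uniform(100.0, 150.0, size=num_rooms),  # surface area
    }

params = sample_room_parameters(num_rooms=100)
print({name: values.shape for name, values in params.items()})
```

Drawing the room-level means first and the sample-level values second mirrors the two-stage description in the text, so rooms differ systematically while samples within a room fluctuate only slightly.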

B. TEST RESULTS
First, the convergence of the proposed algorithm is verified. Fig. 3(a) shows the convergence of the dual variable $\nu_t$, and the corresponding average total reduction in the energy consumption is plotted in Fig. 3(b). The x-axis represents the outer iteration index $t$. Note that the requested DR amount $E^{\text{DR}}$ is satisfied in about 20 iterations. The convergence of a typical run of the inner SGD update loop is shown in Fig. 4, where the objective function values in (26) are plotted at different values of $i$. In order to ensure the convergence, here we actually used 5 epochs for the SGD updates, meaning that the samples of the $I$ rooms were presented to the learning algorithm 5 times. Note that solving for $\{w_i^*\}$ is needed only in the first epoch. It can be seen that the SGD update converges after processing around 100 rooms.
To assess the effectiveness of the proposed kernel lifelong learning-based DR algorithm, its performance is compared with that of four other methods, namely, the random, parametric STL, kernel STL, and parametric lifelong learning approaches. In the random policy, actions are randomly chosen among the actions satisfying the DR request. In the kernel STL approach, the control policy of each room is obtained by solving (19) for individual rooms without leveraging any shared structure. The parametric STL policy is the STL policy without using the kernel-based learning. It can be thought of as the kernel policy with the linear kernel function. Likewise, the parametric lifelong learning policy can be thought of as the kernel lifelong policy with linear kernels [21]. Fig. 5 depicts the difference between the action from the learned policy and the ideal action, for each room. The difference is calculated as the average Euclidean distance $\mathbb{E}[\|a_i^* - a_i\|_2]$ between the action vector $a_i = \pi(\xi_i, \theta_i^*)$ based on $\theta_i^*$ learned from the proposed algorithm, and the ideal action vector $a_i^*$, which is obtained by solving (10) numerically assuming the full knowledge of $\{E_{i,n}^*\}$, which is not available to the learning algorithm. Note that, for obtaining the performance metric, we used test samples $(\xi_{i,n}, E_{i,n}^*)$ that are separate from the samples used for training. As can be seen from the figure, the proposed algorithm achieves the lowest action difference with an average of around 0.1, whereas the kernel STL approach yields roughly double the value. This highlights the benefit of lifelong learning, which exploits the shared structure. Note also that the parametric approaches yield slightly worse performance compared to the kernel counterparts, illustrating the advantage of using nonlinear policies through kernel-based learning. Finally, the random policy is seen to yield far larger differences.
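The action-difference metric used in Fig. 5 amounts to a sample average of Euclidean distances over the test set. A minimal sketch, with a hypothetical function name and made-up numbers, could look as follows.

```python
import numpy as np

def mean_action_gap(ideal_actions, learned_actions):
    """Average Euclidean distance between ideal and learned action vectors.
    Each row holds one test sample's [lighting, AC] actions."""
    diff = (np.asarray(ideal_actions, dtype=float)
            - np.asarray(learned_actions, dtype=float))
    return float(np.mean(np.linalg.norm(diff, axis=1)))

ideal = [[1.0, 1.0], [0.5, 0.5]]     # hypothetical ideal actions
learned = [[1.0, 1.0], [0.2, 0.1]]   # hypothetical learned actions
print(mean_action_gap(ideal, learned))
```

Because the actions are fractions in $[0, 1]$, the metric is bounded by $\sqrt{2}$ for two controlled systems, which gives a scale for reading values such as 0.1.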
In Fig. 6, the average absolute difference between the cost function value under the learned policy and that under the ideal actions is plotted for every room $i$. A trend similar to what was observed in Fig. 5 emerges. That is, the proposed algorithm achieves the lowest cost differences compared to the other methods. In fact, the proposed policy is seen to achieve virtually the same cost as the ideal actions. The user discomfort levels obtained from different approaches are compared in Fig. 7, which shows $E[D_i]$ for all rooms. The minimization of the total discomfort is the prime objective of our DR formulation. It can be seen that the proposed algorithm achieves the lowest discomfort levels across all rooms among the five compared methods, and comes quite close to the ideal discomfort levels. The discomfort levels obtained from the parametric and the kernel STL policies are much higher than those from their lifelong learning counterparts. This is because the STL approaches occasionally produce actions that reduce the energy consumption excessively. In such cases, the actions are quite far from the actions of the proposed policy, which balances the DR constraint with the user discomfort effectively.
The average user discomfort across all rooms, i.e., $\frac{1}{I}\sum_{i} E[D_i]$, is shown in Fig. 8 when the DR request $E_{\mathrm{DR}}$ is varied. It is seen that the average discomfort increases as $E_{\mathrm{DR}}$ increases, regardless of the policy employed. This is expected since increasing $E_{\mathrm{DR}}$ requires a larger reduction in energy consumption, leading to higher discomfort for the occupants. Again, the proposed algorithm provides superior performance compared to the other considered methods, and produces the level closest to the discomfort level achieved by the ideal actions.
One of the advantages of the lifelong learning approach is that the library and optimal policy can be continuously refined based on newly arrived samples. This is particularly appealing when the distributions of the random variables change over time, for instance, due to seasonal effects. In contrast, batch MTL solvers would need to re-compute the library and policies whenever enough new measurements are obtained, without taking advantage of the existing solutions. To verify the advantage, we performed an experiment in which the seasonal variations in the illuminance and temperature are simulated over a period of 60 days. The DR requests are assumed to arrive once per day. The proposed algorithm updates the library $L$ whenever new samples for all rooms arrive, whereas the MTL counterpart computes the new library using the batch of samples from the last 15 or 30 days. Fig. 9 shows the resulting user discomfort variations. It can be noticed that the proposed approach can track the variation of the ideal actions, whereas the MTL approach yields step-like curves. In fact, it can be seen that the performance gap between our method and the ideal one gradually decreases, as the library in the lifelong learning DR algorithm is refined further using the accumulated knowledge over many days. On the other hand, the MTL curves are sometimes seen to worsen slightly before the next update.
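The contrast between online refinement and periodic batch re-computation can be sketched on a toy tracking problem. The example below is only an analogy, not the lifelong learning or MTL algorithm of the paper: a scalar quantity drifts linearly over 60 days, an online estimator takes one gradient step per day, and a batch estimator is refit every `window` days from the preceding `window` samples. The drift rate, noise level, learning rate, and the function name `simulate_tracking` are all assumptions.

```python
import numpy as np

def simulate_tracking(days=60, window=15, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(days)
    target = 20.0 + 0.2 * t                      # slowly drifting quantity (hypothetical)
    obs = target + rng.normal(0.0, 0.1, days)    # one noisy measurement per day

    online = np.empty(days)
    batch = np.empty(days)
    est_online = obs[0]
    est_batch = obs[0]
    for d in range(days):
        # Online: one stochastic-gradient step on 0.5*(est - obs)**2 each day.
        est_online += lr * (obs[d] - est_online)
        # Batch: refit from scratch only every `window` days, held fixed in between.
        if d > 0 and d % window == 0:
            est_batch = obs[d - window:d].mean()
        online[d] = est_online
        batch[d] = est_batch
    return target, online, batch

target, online, batch = simulate_tracking()
print("online mean abs error:", np.abs(online - target).mean())
print("batch  mean abs error:", np.abs(batch - target).mean())
```

In this sketch the batch estimate is piecewise constant, so its error grows between refits while the target keeps drifting, whereas the online estimate follows the drift with a small lag. This is the same qualitative pattern as the step-like MTL curves described above.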

VII. CONCLUSION
An ML-based DR algorithm for controlling adjustable loads, such as the lighting and AC systems in individual rooms of buildings, has been developed. The algorithm seeks a policy that sets suitable energy levels for the loads as a function of the current ambient illuminance and temperature measurements, such that the discomfort levels of the occupants are minimized while the DR requirement on the overall energy reduction is met. The uncertainty in the environmental variables and user preferences has been tackled via a data-driven ML approach. Furthermore, an MTL framework has been adopted to exploit the structural similarities in the optimal policies across the rooms. In particular, a lifelong learning method has been derived, which updates the shared representation of the policies in an online fashion, providing computational efficiency and the capability to track the optimal policies over time. Kernel-based learning was pursued to accommodate nonlinear policy structures as well. To cope with the DR constraint that couples all rooms, a dual decomposition technique was employed, which transformed the overall problem into a series of unconstrained stochastic optimization problems for individual rooms. The convergence and the performance advantage of the proposed method over benchmark policies were verified via numerical experiments based on semi-real data sets.