A Contextual Reinforcement Learning Approach for Electricity Consumption Forecasting in Buildings

The energy management of buildings plays a vital role in the energy sector. With that in mind, and targeting an accurate forecast of electricity consumption, the present paper aims to support the decision on the best prediction algorithm for each context. This may also increase energy usage related to renewables. The identification of different contexts is thus an advantage that may improve prediction accuracy. This paper proposes an innovative approach in which a decision tree is used to identify different contexts in energy patterns. One week of data sampled at five-minute intervals is used to test the proposed methodology. Each context is evaluated with a decision criterion based on reinforcement learning to find the most suitable forecasting algorithm. Two forecasting models, based on K-Nearest Neighbors and Artificial Neural Networks, are used in this paper to illustrate the application of the proposed methodology. The reinforcement learning criterion uses the Multi-armed Bandit algorithm. The obtained results validate the adequacy of the proposed methodology in two case studies: building and industry.


I. INTRODUCTION
An important aspect of improving energy management, namely in the presence of demand response programs, is the forecasting of electricity-consuming activities [1]. In fact, the authors of the present paper have previously published several works concerning electricity consumption forecasting [2]. K-Nearest Neighbors (KNN) and Artificial Neural Networks (ANN) have proved to be adequate techniques for an office building application. However, in some specific periods, here referred to as contexts, one of the algorithms performs better than the other. Moreover, reinforcement learning has been widely applied to power and energy systems problems [3], providing learning of decisions in complex modeling environments. The authors of the present paper have also used reinforcement learning in building environments, although not for consumption forecasting, in [4].
Electricity consumption forecasting is important to guarantee improved energy management in smart buildings [5]. Therefore, several studies in the literature use buildings with accessible data to investigate how different machine learning techniques can achieve more accurate predictions, as in [6].
Buildings equipped with smart grid technology take advantage of data generated from several sources, including smart meters, phasor measurement units, and various sensors [7]. Forecasting algorithms that use such data are essential for prediction activities. Artificial Neural Networks have the advantage of extracting and modeling unseen relationships and features. This ability gives neural networks more robust choices when used the right way [8]. The K-Nearest Neighbors algorithm is an alternative recommended for time series classification. However, the algorithm's performance requires a minimum quantity of labeled data [9]. The decrease of energy costs may be more effective with the assistance of modeling strategies that combine different forecasting algorithms, including Artificial Neural Networks and Random Forest [10]. In fact, the uncertainties of load demand in energy management present obstacles to achieving accurate forecasts. Reinforcement learning is recommended to overcome complex nonlinear issues, with a decision-making ability that optimizes the current solution to be more effective [11,12]. Reinforcement learning has a strong learning ability and high adaptability, combined with control and decision-making abilities. These are essential to ensure optimal outcomes in different scenarios, including robotics and distributed control [13]. Reinforcement learning is used for different applications according to the diversity of problems, including performance improvement. It is also stated that a few applications use reinforcement learning to improve prediction accuracy with different deep learning techniques, which is the case of this paper. Additionally, the learning method is also discussed, with Q-learning being a researched option [14].
Given the results of the above-mentioned literature, the methodology proposed in the present paper aims, in the first step, to identify different contexts using decision trees. Then, reinforcement learning is applied in each context to identify the most accurate forecasting model. The methodology innovates by overcoming the approach of selecting a single forecasting model for all the operational situations of a single consumer or building. For illustration purposes, models based on the ANN and KNN forecasting algorithms have been used. The motivation is to improve the forecasts obtained in recent research published by the authors of this paper [2]. Therefore, the authors reuse several forecasting aspects from [2], including the forecast horizon and forecast strategies. Two innovative topics are expected to improve these forecasts: the formation of new contexts through decision tree training, and the reinforcement learning evaluation of the most effective algorithm in different contexts. Moreover, the decision tree and reinforcement learning aspects are inspired by recent research published by the authors of this paper, respectively in [15] and [16].
After this introduction, Section II explains the proposed contextual approach, Section III details the case studies, and Section IV presents the obtained results. Finally, Section V presents the conclusions.

II. PROPOSED CONTEXTUAL APPROACH
This section explains the different phases of the proposed contextual approach. These include obtaining energy consumption forecasts, decision rule-based learning, definition of contexts, the learning process, and the selection of the best forecasting algorithm for the target context.
The main goal is to determine the best forecasting model for each of the different contexts. After energy consumption forecasts are obtained with different algorithms, a decision tree with rule-based learning defines different contexts. Later, a learning process evaluates the best algorithm for each context. The first step consists of obtaining energy consumption forecasts for five-minute periods with two algorithms: Artificial Neural Networks and K-Nearest Neighbors.
Afterwards, rule-based decision learning trains a decision tree with the forecasting data of both algorithms and additional factors from the current and previous periods.
These factors comprise time features, including the weekday and the current period, and quantitative data obtained from the previous period, including the consumption and data from two sensor devices. The two sensor factors are the CO2 level and a light variable with the value one or zero, corresponding respectively to light in the building or no activity at all. These two parameters have been selected following the validation made in [2].
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3180754 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The learning process evaluates the most suitable forecasting algorithm in different contexts. A set of agents performs this evaluation in an interactive environment through trial and error, using feedback from their actions, observations, and rewards. The observations correspond to the contexts defined previously in rule-based decision learning. The agent's action is triggered every five minutes and corresponds to the selection of a forecasting algorithm, either K-Nearest Neighbors or Artificial Neural Networks. The reward is calculated every five minutes, after the agent's algorithm selection, and represents how good the selection was for the current context. On the one hand, a reward of 0 corresponds to scenarios where the selected algorithm is the one with the higher forecasting error. On the other hand, a reward of 1 corresponds to scenarios where the selected forecasting algorithm has the lower forecasting error. Each obtained reward is used to update an average of rewards, which measures the reward performance over all five-minute periods.
In other words, the average of rewards measures how well the algorithm selection performs with respect to the expectation of lower forecasting error. In each context evaluation, the learning methods and the exploration and exploitation rates are updated. The learning method may be greedy or upper confidence bound. The exploration rate focuses on unexplored territory for each forecasting algorithm selection. The exploitation rate focuses on exploiting the knowledge already acquired about a particular forecasting algorithm selection. After evaluating the best forecasting algorithm for all five-minute periods, the multi-agent system is prepared to select the best forecasting algorithm for the target context. Then, according to the upper confidence bound and greedy learning methods, the action is calculated every five minutes according to (1) and (2), where:
• Nt(a): number of times the action has been selected before time t
• Q(t): current estimation
• c: degree of exploration
• a: maximizing action
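For reference, the standard textbook forms of the upper confidence bound and greedy selection rules can be written with the symbols listed above. It is an assumption that (1) and (2) follow these forms. The symbol $A_t$, the action chosen at time $t$, is introduced here for illustration, and $Q_t(a)$ and $N_t(a)$ correspond to Q(t) and Nt(a) in the list.

```latex
\begin{align*}
A_t &= \underset{a}{\arg\max}\left[\, Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}} \,\right]
      && \text{(upper confidence bound)} \\
A_t &= \underset{a}{\arg\max}\; Q_t(a)
      && \text{(greedy)}
\end{align*}
```

In the first rule, the square-root term grows for actions that have rarely been selected, so a larger $c$ favors exploration. The second rule always exploits the current estimate.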

III. CASE STUDIES
To illustrate the use of the proposed methodology, the implemented decision tree methodology studies a sample of data obtained from electric devices measuring different units and magnitudes. The methodology has been implemented, in this paper, for two case studies: a building case study and an industrial case study. The building case study covers a whole week, from 18 to 24 November 2019, in five-minute periods. Only this week is considered so that the data size matches the one studied in recent publications by the authors of this paper [15]. Table I presents the structure of the decision tree inputs, with the weekday, the allocated period, the consumption, the light, and the CO2. This table also includes the structure of the decision tree output, which is the forecasting algorithm to apply. Moreover, the input variables with nonlinear behavior are studied according to their profile from 18 to 24 November 2019 in Fig. 2. Temporal variables are therefore excluded from the analysis in Fig. 2, which keeps the consumption, light, and CO2 profiles. The light and CO2 sensors were added to the decision tree structure because previous research published by the authors of this paper concluded that these two factors have more influence on the consumption [17]. The case study examines the different factors according to a weekly profile and five-minute contexts. Five similar patterns are identified, representing the activity on each weekday, from Monday to Friday. These are followed by two similar patterns representing the low activity of the weekend. The consumption usually varies from 500 to 1500 W, as seen in the patterns from Monday to Thursday. On Friday, the consumption is higher, reaching values above 2000 W. During the weekend, the consumption varies around 600 W.
The light variable takes the values 0 and 1, representing respectively the absence or presence of light detected by the measuring devices. The CO2 devices present variations between 0 and 20%. The two sensors present null values during the whole weekend.
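The input and output structure described for Table I can be sketched as a small record type. This is an illustration only: the field names, the weekday encoding, and the rule that labels each sample with the algorithm whose forecast is closer to the actual consumption are assumptions, not details confirmed by the text.

```python
from dataclasses import dataclass


@dataclass
class TreeSample:
    """One five-minute sample for the decision tree (hypothetical layout)."""
    weekday: int          # 0 = Monday ... 6 = Sunday (assumed encoding)
    period: int           # allocated five-minute period (assumed to be an index)
    consumption_w: float  # consumption of the previous period, in W
    light: int            # 1 = light in the building, 0 = no activity
    co2: float            # CO2 reading of the previous period
    label: str            # output: "KNN" or "ANN"


def make_sample(weekday, period, consumption_w, light, co2,
                actual_w, knn_forecast_w, ann_forecast_w):
    """Label the sample with the algorithm that has the lower absolute error."""
    knn_error = abs(actual_w - knn_forecast_w)
    ann_error = abs(actual_w - ann_forecast_w)
    # Ties are given to KNN; this choice is arbitrary.
    label = "KNN" if knn_error <= ann_error else "ANN"
    return TreeSample(weekday, period, consumption_w, light, co2, label)


# Made-up values: KNN misses by 20 W and ANN by 50 W, so the label is "KNN".
sample = make_sample(0, 100, 800.0, 1, 12.0,
                     actual_w=900.0, knn_forecast_w=880.0, ann_forecast_w=950.0)
print(sample.label)
```

The numeric values in the usage lines are invented for the example and are not taken from the case study data.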
The reinforcement learning methodology evaluates the most suitable forecasting algorithm in five-minute periods from 18 to 24 November 2019. Each five-minute decision corresponds to the selection of a forecasting algorithm, either K-Nearest Neighbors or Artificial Neural Networks. One week with five-minute contexts is considered to allow comparison with other publications by the authors of this paper [16].
Regarding the industrial case study, which has been included for validation purposes, detailed information is not provided due to space limitations. Further details can be obtained in [18].

IV. RESULTS
This section presents the results of using the proposed methodology. These are obtained with the greedy learning method and according to four selected contexts (SC1, SC2, SC3, SC4).

A. BUILDING
The decision tree approach has been applied to the data in Section III, testing different tree depths. Three data samples show different parts of the day, classified as morning, afternoon, and night and labeled respectively a), b), and c), as seen in Fig. 3. These three samples correspond to previous research published by the authors of this paper [16] and are detailed in this case study to support known forecasts in distinct parts of the day. These forecasts are later used in the reinforcement learning evaluation of the most effective algorithm in different contexts. The K-Nearest Neighbors and Artificial Neural Networks predictions are close to the real consumption for almost all five-minute periods. The morning scenario presents consumption variations between 500 and 1500 W. The afternoon scenario presents variations between 500 and 1500 W and between 500 and 2500 W. Finally, the night scenario presents many variations between 500 and 600 W and sequences of five-minute periods reaching 1000 W.
The accuracy of the decision tree resulting from the depth parameterization is presented in Table II. Table II shows the results for the different depth parameterization values. Depth values between 2 and 4 are not large enough to reach accuracies greater than 66.96%. However, higher accuracies can be obtained by increasing the decision tree depth beyond 4. As seen in Table II, increasing the depth to 5 and 6 yields accuracies of 67.86% and 71.43%, respectively. Therefore, while no real improvements are seen for depths between 2 and 4, changing the depth to 5 and 6 improves the accuracy by 0.90 and 4.47 percentage points, respectively. The reason for these improvements is a higher complexity in the elaboration of decision rules: the higher the decision tree depth, the higher the complexity of the rules, possibly resulting in more accurate results. The decision tree accuracy results are similar to those of previous research by the authors of this paper [15].
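The accuracy gains quoted above follow directly from the reported figures. A short check, using only the percentages stated in the text (the variable names are illustrative):

```python
# Decision tree accuracies (%) quoted in the text for the building case study.
baseline = 66.96                         # upper bound reported for depths 2 to 4
accuracy_by_depth = {5: 67.86, 6: 71.43}

# Gain over the baseline, in percentage points.
gains = {depth: round(acc - baseline, 2)
         for depth, acc in accuracy_by_depth.items()}
print(gains)  # {5: 0.9, 6: 4.47}
```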
A simple set of rules illustrates the decision tree for a depth of two, as presented in Fig. 4. This scenario is a simple example that summarizes the logic of the decision tree rules. As identified previously in Table II, the scenario with a decision tree depth of 6 leads to more accurate results. Therefore, the rule splits of this scenario are analyzed in List 1. The decision tree presented in Fig. 4 shows very simple rules for a depth of 2. Two contexts are identified in the decision tree in Fig. 4: a) a weekday from Monday to Friday with consumption lower than or equal to 568.833 W, and b) a weekday from Monday to Friday with consumption higher than 568.833 W. List 1 presents very complex rules for a decision tree depth of 6, corresponding to a total of 46 contexts. These contexts present many differences, including whether the day is a weekday from Monday to Friday or a weekend day, and the specified ranges for consumption (cons), CO2 (CO2), and the allocated period (min). From these 46 contexts, several can be identified within the restrictions defined in a) and b).
Moreover, the selected contexts are identified within the restrictions defined in a) and b), separating small from large occurrences, and are labeled SC1, SC2, SC3, and SC4.
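The two depth-2 rules a) and b) can be written as a small lookup. This is a minimal sketch: the function name, the weekday encoding, and the choice to return `None` for weekend periods (which rules a) and b) do not cover) are assumptions made for illustration. Only the 568.833 W threshold comes from the text.

```python
from typing import Optional

THRESHOLD_W = 568.833  # consumption split reported for the depth-2 tree


def depth2_context(weekday: int, consumption_w: float) -> Optional[str]:
    """Map a five-minute period to context 'a' or 'b'.

    weekday: 0 = Monday ... 6 = Sunday (encoding assumed for this sketch).
    Returns None for weekend periods, which the two rules do not cover.
    """
    if not 0 <= weekday <= 4:
        return None
    return "a" if consumption_w <= THRESHOLD_W else "b"


print(depth2_context(0, 500.0))   # weekday, low consumption
print(depth2_context(2, 1200.0))  # weekday, high consumption
```

A depth-6 tree would refine each of these branches further with the CO2 and period splits mentioned for List 1.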
The learning phase studies the average rewards and the history of actions for five-minute periods and for all exploration and exploitation rates from 0.1 to 0.9 with the greedy learning method. These are presented respectively in Fig. 5 and Fig. 6, with the four contexts SC1, SC2, SC3, and SC4 labeled respectively a), b), c), and d).
The reward obtained every five minutes is either 0 or 1, representing algorithm selections with higher and lower forecasting errors, respectively. All presented scenarios start with an average reward of 1 in the first five minutes, followed by at least one different decision that causes the average reward to converge to an interval between 0.2 and 0.8. In scenario a), the average reward converges to values between 0.7 and 0.8 for low exploration rates. However, it tends to decrease to values between 0.4 and 0.7 as the exploration rate increases. In scenario b), the average reward converges to 0.6 for lower exploration rates and to 0.5 for higher ones. In scenario c), the average reward converges to 0.8 for low exploration rates. However, it tends to decrease to values between 0.3 and 0.8 as the exploration rate increases. In scenario d), the average reward converges to 0.5. As noted in scenarios b) and d), increasing the exploration rate makes the curves for the different exploitation rates converge towards a more similar pattern.
Thus, the exploitation rates of 0.1, 0.4, and 0.9 tend to converge to higher average rewards in some scenarios and for the different exploration rates. The history of actions associated with context SC1 for an exploitation rate of 0.9 is illustrated in Fig. 6.
The history of actions is illustrated for context SC2 for the three exploitation rates identified previously as frequently resulting in higher average rewards. These rates are 0.9, 0.1, and 0.4, labeled respectively a), b), and c) in Fig. 7. The history of actions for context SC1 illustrated in Fig. 6 shows long sequences of five-minute periods in which KNN is selected repeatedly. After nearly 75 five-minute periods, the agent finds it necessary to alternate between KNN and ANN; this is more frequent between periods 190 and 230 and between periods 260 and 297.
The history of actions for context SC2 shows two possible behaviors over long sequences of five-minute periods: either KNN is used repeatedly, as seen between periods 408 and 445, or the selection alternates very frequently between KNN and ANN, as seen between periods 445 and 482.
The histories of actions of context SC1, presented in Fig. 6, and of context SC2, presented in Fig. 7 and labeled a), b), and c), suggest a long-term learning approach that is more capable of alternating between KNN and ANN according to the five-minute context, rather than repeatedly selecting KNN. Lower exploitation rates tend to select KNN repeatedly over more five-minute periods, as evidenced in Fig. 7 when comparing scenario b) with scenarios a) and c), which have respectively low and higher exploitation rates. This is understandable, as low exploitation rates take more five-minute periods to acquire knowledge about KNN. Therefore, scenario a) has the advantage of acquiring more knowledge about a particular forecasting algorithm in fewer five-minute periods. The history of actions associated with context SC3 for an exploitation rate of 0.9 is illustrated in Fig. 8.
The history of actions is illustrated for context SC4 for the three exploitation rates identified as frequently resulting in higher average rewards. These rates are 0.9, 0.1, and 0.4, labeled respectively a), b), and c) in Fig. 9. The history of actions for context SC3 illustrated in Fig. 8 shows long sequences of five-minute periods in which KNN is selected repeatedly. After nearly 75 five-minute periods, the agent finds it necessary to alternate between KNN and ANN. This behavior is observed in several intervals of five-minute periods, including between 90 and 110, 120 and 150, 152 and 190, 192 and 294, and finally 197 and 334.
The history of actions for context SC4 shows two usual behaviors over long sequences of five-minute periods: either KNN is used repeatedly, as seen between periods 260 and 297, or the selection alternates very frequently between KNN and ANN, as seen between periods 297 and 334. Although these two behaviors are usual, scenario b), with a low exploitation rate of 0.1, shows that ANN can also be selected repeatedly over short sequences of five-minute periods, as seen between periods 112 and 149. This is understandable, as low exploitation rates need more time to acquire knowledge of ANN in five-minute contexts before having knowledge of both forecasting algorithms and reaching more pragmatic decisions.
The histories of actions of context SC3, presented in Fig. 8, and of context SC4, presented in Fig. 9 and labeled a), b), and c), suggest a long-term learning approach that is more capable of alternating between KNN and ANN according to the five-minute context, rather than repeatedly selecting KNN or ANN. Lower exploitation rates tend to select KNN or ANN repeatedly over more five-minute periods, as evidenced in Fig. 9 when comparing scenario b) with scenarios a) and c), which have respectively low and higher exploitation rates. This is understandable, as low exploitation rates take more five-minute periods to acquire knowledge about KNN or ANN. Therefore, scenario a) has the advantage of acquiring more knowledge about a particular forecasting algorithm in fewer five-minute periods.
The learning phase results can also be studied for the whole week from 18 to 24 November 2019 with no distinction of contexts. This analysis presents the average rewards for five-minute periods and all exploration and exploitation rates from 0.1 to 0.9, as illustrated in Fig. 10. The results in Fig. 10 show overall average rewards close to 0.6, which is more than reasonable.
With context distinction, higher average rewards can be obtained, close to 0.8 for context SC3, as illustrated in Fig. 5, scenario c).
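The average-reward curves discussed in this subsection can be illustrated with a toy run. The sketch below is not the authors' implementation: it assumes that the exploration rate acts as the ε of an ε-greedy policy, it uses made-up forecasting errors, and the function name `run_epsilon_greedy` is hypothetical. The reward follows the 0/1 scheme of Section II.

```python
import random


def run_epsilon_greedy(errors_knn, errors_ann, epsilon, seed=0):
    """Return the running average reward of an epsilon-greedy selector.

    errors_knn, errors_ann: absolute forecasting errors per five-minute period.
    epsilon: probability of picking a random algorithm (assumed exploration rate).
    """
    rng = random.Random(seed)
    q = [0.0, 0.0]   # estimated value of each action: 0 = KNN, 1 = ANN
    n = [0, 0]       # number of times each action has been selected
    avg_reward = 0.0
    history = []
    for t, (e_knn, e_ann) in enumerate(zip(errors_knn, errors_ann), start=1):
        if rng.random() < epsilon:
            action = rng.randrange(2)            # explore
        else:
            action = 0 if q[0] >= q[1] else 1    # exploit, ties go to KNN
        best = 0 if e_knn <= e_ann else 1        # algorithm with lower error
        reward = 1 if action == best else 0
        n[action] += 1
        q[action] += (reward - q[action]) / n[action]   # incremental mean
        avg_reward += (reward - avg_reward) / t         # running average reward
        history.append(avg_reward)
    return history


# Synthetic errors: KNN is better in even periods, ANN in odd ones.
knn = [1.0, 3.0] * 50
ann = [2.0, 2.0] * 50
for eps in (0.1, 0.5, 0.9):
    print(eps, round(run_epsilon_greedy(knn, ann, eps)[-1], 2))
```

With an exploration rate of 0 and this tie-breaking rule, the selector in this sketch never tries ANN, which illustrates why some exploration is kept.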

B. INDUSTRY
An identical simulation for industrial energy consumption compares the decision tree accuracies and the average rewards with those of the building simulation studied previously. The accuracy of the decision tree is obtained for different tree depths in an industrial use case, as shown in Table III. The decision tree accuracies in Table III lie between 60.42% and 61.11% for decision tree depths between two and five. The decision tree loses accuracy when the depth increases from five to six, with the accuracy decreasing from 61.11% to 56.25%. This is logical, as the use of time features and industrial energy consumption alone has its limitations when elaborating decision rules. Table III also shows the decision tree accuracy decreasing from 61.11% to 60.42% when the depth changes from three to four. However, increasing the decision tree depth from four to five improves the accuracy from 60.42% to 61.11%.
The average reward evaluation of the most effective forecasting algorithm in different five-minute contexts is also studied for the industrial case. This analysis considers all exploration and exploitation rates from 0.1 to 0.9 in the learning phase parameterization with the greedy method, as illustrated in Fig. 11. The average rewards in the industrial case show an initial value of one for all exploration rates, due to the selection of the most effective forecasting algorithm in the first five minutes. This is followed by at least one selection of the forecasting algorithm with lower accuracy, which leads the average reward to decrease from 1 to a value between 0.4 and 0.6. The average reward converges to 0.6 for exploration rates between 0.1 and 0.2 and to 0.5 for exploration rates between 0.3 and 0.9 until the evaluation of the last five-minute period.
The history of actions shows which forecasting algorithm is applied in the different five-minute periods. The K-Nearest Neighbors and Artificial Neural Networks applications alternate in different five-minute contexts for the industrial application with an exploitation rate of 0.4, as illustrated in Fig. 12. Some examples are observed between periods 1 and 37 and between periods 91 and 145.

V. CONCLUSIONS
This paper identifies suitable contexts through decision tree rules and analyzes the best forecasting model in different periods. The results obtained for the different decision tree depth values suggest that the decision tree is suitable for identifying contexts. It is also noted that increasing the depth sufficiently makes the decision rules complex enough to yield more accurate results. The results obtained in the learning phase for the greedy method show average rewards converging to more than reasonable values. It is noted that increasing the exploration rate may decrease the final average reward in some contexts. The history of actions presents two frequent patterns over long sequences of five-minute periods: selecting KNN or ANN repeatedly, or alternating between KNN and ANN. It is also noted that it is advantageous to use large exploitation rates to acquire more knowledge of a particular forecasting algorithm selection in fewer five-minute periods. Moreover, large exploitation rates lead to alternating between KNN and ANN in different five-minute contexts sooner than low exploitation rates. An analysis of the learning phase results for the whole period reveals that the use of contexts is advantageous for obtaining higher average rewards. The industrial use case also reaches accurate decision tree results; however, its accuracy is limited to a maximum of 61.11%, while the building application studied in this paper reaches a maximum accuracy of 71.43%. It is inferred that the lower decision tree accuracy in the industrial case is due to the lack of sensor data in the decision rules. Moreover, this problem may also explain why increasing the decision tree depth decreases the accuracy at some point. It is inferred that the rules built in the decision tree training are able to reach stronger logic when sensor data are included.
The analysis of the average rewards in the industrial use case has also shown more than reasonable forecasting algorithm selections in different contexts. The history of actions in the industrial use case has shown two similar behaviors: either alternating between K-Nearest Neighbors and Artificial Neural Networks, or selecting K-Nearest Neighbors repeatedly.