Combined DR Pricing and Voltage Control Using Reinforcement Learning Based Multi-Agents and Load Forecasting

The demand for energy around the world continues to increase at a very high rate. To meet this demand, it is imperative to employ efficient methods that minimize the total cost of supplying it. To this end, this paper proposes a multi-agent reinforcement learning system for time of use pricing based combined demand response and voltage control. A long short term memory network is employed for day-ahead load forecasting in order to cope with future uncertainties. The Q-learning algorithm is used because it is model free and hence does not require the agent(s) to have prior knowledge of the environment. Reinforcement learning plays a central role in this work, since it allows the agent(s) to determine their respective optimal behavior(s) autonomously, without explicit training by the end user. To allow effective cooperation among multiple agents, each household is controlled by its own agent, whereas all the household agents are directed by a master agent or service provider. In addition, a voltage control agent checks for voltage level violations in the system and removes them through optimal decision making. With the proposed system, not only is the overall cost of electricity reduced, but voltage level violations are also removed from the entire system. The implementation of this mechanism reduces the total average aggregated load demand from 5.23 kW to 3.86 kW, while reducing the total aggregated average cost from 94.01 Rs to 60.80 Rs.


I. INTRODUCTION
The continuous increase in demand for energy has put power systems around the globe under immense stress. The immediate solution that comes to mind is the expansion of power systems, but this comes with a major drawback: the huge cost associated with it [1]. An intelligent and viable method thus has to be employed to balance the demand and supply of energy without requiring large investments.
The associate editor coordinating the review of this manuscript and approving it for publication was Padmanabh Thakur.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Demand Response (DR) programs are frequently employed to solve the demand-supply imbalance without having to bear the heavy financial burden that would otherwise apply. DR programs are broadly categorized into two classes, namely, incentive based DR programs and price based DR programs. In incentive based programs, participants get payments for their agreement to curtail load consumption when demand is high. Incentive based schemes are categorized into three types: Direct Load Control (DLC), Interruptible/Curtailable (I/C) and Emergency DR programs. In the DLC scheme, the participants get payments for curtailing load consumption under a set limit. This program allows utilities to remotely power off customers' appliances. In I/C programs, the consumers are required to curtail load consumption in emergency scenarios. Consumers who fail to curtail their respective loads suffer penalties which are agreed upon when the scheme is initiated. Emergency DR programs are a mix of DLC and I/C programs and are thought of as market based programs [1]. In price based programs, customers are offered time varying electricity prices, which encourages them to shift their respective loads to low priced hours [2]. Price based schemes are divided into two classes, namely, real time pricing (RTP) and time of use (TOU) pricing. In TOU pricing, the electricity prices are high when demand is high (peak hours) and low when demand is low (off peak hours). The prices for both these sets of times remain constant and are predetermined. In RTP, on the other hand, the electricity prices vary frequently, e.g. hourly or by the minute, and customers may be offered price variations at intervals as short as five minutes [3].
The domain of DR is very broad; hence, a considerable body of related work is available in the literature. Reference [1] proposes a reinforcement learning (RL) based single agent system to shift controllable appliances from high demand hours to low demand hours, smoothing the load consumption profile and reducing electricity cost. A real time incentive based DR mechanism for smart grid systems with RL and a deep neural network (DNN) is presented in [2]. The DNN is used for load and price forecasting, while RL is used to achieve the optimal incentive rates. The author of [3] analyzes the emergence of various DR schemes as a result of falling technology costs and recognition of consumers' behavior in the electricity market. The author also sheds light on the problems associated with DR implementations across the United States of America, China and developed cities of Europe. Reference [4] implements a pricing mechanism that combines long short-term memory (LSTM) models and RL to solve the pricing problem of service providers when the consumers' response behavior is not known. In [5], an incentive based DR program with deep learning and RL is proposed, whereas in [6], an hour ahead DR algorithm for a home energy management system (HEMS) is implemented. It makes use of an artificial neural network (ANN) to predict future prices and a multi-agent RL system for making optimal decisions for various home appliances.
The author of [7] proposes a framework for home energy management (HEM) based on RL for achieving efficient residential DR. In [8], a hybrid price based DR system is proposed which is adaptable to pricing principles, while in [9], a deep RL based DR algorithm for smart facilities energy management is proposed to minimize electricity costs. The author of [10] presents a self scheduling model for HEMS, in which a formulation of a linear discomfort index (DI) is proposed, taking into account the preferences of customers in the daily operation of home appliances. An optimization model for residential DR, based on a deep deterministic policy gradient (DDPG) algorithm, is implemented in [11], whereas the author of [12] proposes an intelligent multi-microgrid (MMG) energy management method based on DNN and RL. Reference [13] proposes a dynamic pricing DR algorithm based on RL for energy management in a hierarchical electricity market, whereas the author of [14] proposes a real time DR mechanism for optimal home appliance scheduling using RL. Reference [15] establishes real time pricing models, taking into consideration price based DR measures, and formulates a real time pricing sale scheme. Reference [16] proposes a comprehensive pricing based DR for a smart home with different household appliances, while the author of [17] estimates customer elasticity for incentive based DR programs using survey data from two countries combined with a comprehensive residential load model. In [18], a real time price based DR scheme is incorporated into the allocation model of distributed generation (DG).
The author of [19] proposes a voltage management mechanism in unbalanced distribution networks through the implementation of residential DR and on-load tap changers (OLTCs). Reference [20] proposes a multi-agent system to obtain flexible price based DR in low voltage distribution networks, while in [21] and [22], a data driven, model free and closed loop control agent, trained using deep RL for voltage control, is proposed. The author of [23] proposes a two-time scale voltage regulation scheme for distribution grids.
To cover the gap in the literature, this study offers a multi-agent reinforcement learning system for time of use pricing based combined demand response and voltage control. In order to cope with future uncertainties, a long short term memory network is used for day-ahead load forecasting. Reinforcement learning agents are used to optimize home appliance scheduling and voltage management. Each home is controlled by its own agent, and all household agents are directed by a master agent or service provider to allow effective cooperation among the agents. In addition, the voltage control agent checks for voltage level violations in the system and removes them through optimal decision making. The proposed system lowers the total cost of electricity and also removes voltage level violations from the whole system. Its deployment decreases the total average aggregated load demand from 5.23 kW to 3.86 kW while lowering the overall aggregated average cost from 94.01 Rs to 60.80 Rs. The main contributions of this paper can be summarized as follows:
• Proposing a multi-agent reinforcement learning system for time of use pricing based combined demand response and voltage control.
• Precise load forecasting based on a long short-term memory (LSTM) network.
• Proposing effective cooperation among multiple agents where each household is controlled by its own agent; in turn, all the household agents are directed by a master agent or service provider.
• Minimizing the overall cost of electricity, besides removing voltage level violations.

II. PROBLEM FORMULATION
This work proposes a multi-agent system for TOU pricing based DR and voltage control that uses RL and LSTM and considers multiple households with varying load consumption profiles. The aim is to reduce the overall aggregated cost of electricity for all the households while keeping the voltage levels over the distribution network within the prescribed limits. In order to cope with future uncertainties, an LSTM network is used to predict the load consumption profile of each house for the next day. RL is then employed for the optimum scheduling of appliances, based on the priority list of each household, which not only reduces the overall cost of electricity but also makes sure that the comfort levels of the residents are not compromised. RL is advantageous in that it is model free. This means that an RL agent, which is the service provider (SP) in this case, does not require prior information about optimal appliance scheduling; instead, the SP discovers it from direct interaction with the customers or households (environment). The appliances are divided into two categories: Shiftable Appliances and Non-Shiftable Appliances. Shiftable Appliances can be rescheduled from their normal operating times if the SP requires load to be shifted. The appliances have different priority settings in each household, so the SP has to shift each appliance in accordance with the priority setting of the household concerned. This is achieved through an RL agent. Non-Shiftable Appliances, on the other hand, cannot be rescheduled or shifted to other times and have to be kept powered on until the household's need for the appliance is satisfied. The various shiftable and non-shiftable appliances relevant to this study, with their respective power ratings, are listed in Table 1.
The total energy consumed by non-shiftable and shiftable appliances is given by equation 1 and equation 2 respectively, whereas equation 3 represents the total energy consumed by non-shiftable and shiftable appliances combined.
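As a rough illustration of how these three totals relate, the sketch below sums the energy of each appliance class over one-minute steps and then adds the two classes. The appliance names, power ratings and on/off schedules are hypothetical placeholders, not values from Table 1, and the summation form is an assumption, since the equations themselves are not reproduced here.

```python
# Hypothetical appliance ratings in kW (placeholders, not the values of Table 1).
NON_SHIFTABLE_KW = {"refrigerator": 0.15, "lighting": 0.20}
SHIFTABLE_KW = {"washing_machine": 0.50, "dishwasher": 1.20}

STEP_HOURS = 1.0 / 60.0  # one-minute resolution, matching the data set


def energy_kwh(ratings_kw, schedule):
    """Sum rating * number of on-steps * step length over all appliances.

    schedule maps an appliance name to a list of 0/1 flags, one per time step.
    """
    total = 0.0
    for name, rating in ratings_kw.items():
        total += rating * sum(schedule[name]) * STEP_HOURS
    return total


def total_energy_kwh(non_shiftable_schedule, shiftable_schedule):
    """Return (non-shiftable, shiftable, combined) energy in kWh."""
    e_ns = energy_kwh(NON_SHIFTABLE_KW, non_shiftable_schedule)
    e_s = energy_kwh(SHIFTABLE_KW, shiftable_schedule)
    return e_ns, e_s, e_ns + e_s
```

Under these assumptions, shifting an appliance changes when its energy is consumed but not the daily total, which is why the cost saving comes from moving load to low priced hours.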

III. LSTM AND MULTI-AGENT RL BASED METHODOLOGY
The load consumption data of households is obtained from [24]. The data was collected from 42 households in Lahore, Pakistan, at one minute intervals, for a period of one year. This study considers load consumption data from 9 of these households. Because most of the households had nearly identical energy consumption patterns, the consumption pattern of each household was examined carefully, and households with distinctly different patterns were chosen in order to develop a more generalized mechanism. The following subsections present the LSTM and multi-agent RL based mechanism in detail.
A. LOAD FORECASTING WITH LSTM
LSTM [25] is basically a recurrent neural network (RNN), which is fundamentally different from traditional feed forward neural networks [26]. RNNs are sequential models, which means that they have the ability to establish correlation between previous and current information. This property of RNNs is particularly useful for time series problems such as load forecasting, where previous sequences of load data are used to predict future value(s) for various households, all having diverse load consumption patterns.
RNNs, however, have a major limitation known as gradient vanishing [27], [28]. Gradient vanishing means that the norm of the gradient for long-term components decreases very quickly to zero, restricting the capability of the model to learn long-term temporal correlations, whereas gradient exploding is the opposite scenario. To overcome this limitation, LSTM is frequently employed in forecasting problems. LSTM maintains an internal memory cell throughout its life cycle in order to establish temporal correlations, which makes it an improved version of the conventional RNN. The basic representation of an LSTM network is depicted in Fig. 1. In the compact form of the forward-pass equations of an LSTM cell with a forget gate, 'f_t' represents the activation vector of the forget gate while 'i_t' is the activation vector of the input gate.
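For reference, the standard compact formulation of the forward pass of an LSTM cell with a forget gate is reproduced below. This is the textbook form and is assumed, not confirmed, to match the equations the text refers to. The symbol names are assumptions following the usual convention: x_t is the input, h_t the hidden state, c_t the cell state, o_t the output gate, W, U and b the weights and biases, σ_g the sigmoid function, and σ_c and σ_h the hyperbolic tangent.

```latex
\begin{align}
f_t &= \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma_g(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma_g(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \sigma_c(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \sigma_h(c_t)
\end{align}
```

The additive update of c_t is what lets gradients flow across many time steps, which is the property the text relies on to overcome gradient vanishing.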

B. MULTI-AGENT RL BASED DECISION MAKING
RL is a machine learning technique which enables an agent to autonomously work out the optimal behavior in a probabilistic environment by maximizing the cumulative reward. An RL algorithm has six elements, namely, agent, environment, state space, action space, rewards and action value. At each time step, the agent executes an action, receives the numerical reward for that action and transitions to the next state. The goal of the agent is to maximize the cumulative reward; hence, it has to learn an optimal policy that allows it to choose the optimal action at each state. A general RL framework is depicted in Fig. 3. In order to perform the optimal action at each state, the RL problem is modeled as a Markov decision process (MDP). The MDP has the Markov property, which states that state transitions depend only on the current state and the current action performed, and do not depend on any prior environmental states or agent actions. Q-learning [29] is used to obtain the optimal policy ν because it can evaluate different actions for different states without needing a model of the environment. The fundamental mechanism of Q-learning is to assign a Q-value, Q(s, a), to each state-action pair and then update this value at each time step to optimize the agent's performance. The optimal Q-value, Q*(s, a), is the maximum discounted future reward obtained by performing action a at state s and following the optimal policy ν thereafter. When an action is performed under a particular policy ν, a pre-determined reward r(s, a) is received and the agent moves to a new state s_{t+1}. The action value Q(s, a) is simultaneously updated according to equation 11, where α denotes the learning rate, which determines how much the old value of Q(s_t, a_t) is affected by the new reward.
For instance, α = 0 means that the latest information obtained is not used in the learning process, and thus the reward obtained has no effect on the Q-value. If α = 1, only the new information is taken into account. γ is referred to as the discount rate and describes the relationship between the future and current rewards. It takes values in the range [0, 1]. When γ = 0, the agent takes into account only the current reward, while γ = 1 means that the agent weighs future rewards as heavily as the current reward.
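The roles of α and γ can be made concrete with a minimal sketch of one tabular update. It assumes the standard Q-learning rule, Q(s_t, a_t) ← Q(s_t, a_t) + α [r + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)], which is presumed to be the update the text describes; the function name and the list-of-lists table layout are illustrative choices.

```python
def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new value.

    q is a list of lists indexed as q[state][action].
    """
    best_next = max(q[next_state])          # greedy estimate of future value
    td_target = reward + gamma * best_next  # reward plus discounted future value
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]
```

With alpha = 0 the stored value is left unchanged, with alpha = 1 it is overwritten by the new target, and with gamma = 0 the target reduces to the immediate reward, matching the limiting cases discussed above.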
The application of RL in the proposed multi-agent setting enables the master agent (SP) to take the appropriate action (schedule appliances for each household) in each state (combination of aggregated load demand and electricity cost of nine households). It also allows the voltage control agent to monitor the voltage across the distribution network and take the appropriate action to keep it within the prescribed limits. The SP is chosen as the master agent in the appliance scheduling setting because it is responsible for power dispatch and scheduling in a power system. Each agent in the system has its own set of states, actions and corresponding Q-values and aims to obtain the optimal Q-value, Q*(s, a). This framework is explained in the following sub-section. Fig. 4 shows the overall framework of the proposed appliance scheduling and voltage control algorithm for TOU pricing based DR using multi-agent RL. An LSTM network forecasts the minute-by-minute load of each household for the next day, and at each time instant the master agent (SP) receives the aggregated load and electricity cost of all the households. Based on the combination of both, the master agent takes the appropriate action. The pair of demand and cost constitutes the state of the master agent.

C. LSTM AND MULTI-AGENT RL FRAMEWORK FOR APPLIANCE SCHEDULING AND VOLTAGE CONTROL
The demand is categorized into three levels (high, average and low), while the cost is categorized into two levels (high and low) [1].
The master agent thus has six possible states, the indexes of which are given in Table 2.
There are three actions available to the master agent, i.e. shifting, valley filling and do nothing, as given in equation 15.
Fuzzy logic is employed as the reward function for determining the action of each agent in a certain state. Fuzzy logic deals with approximate values instead of exact values. Fig. 5 depicts the overall mechanism behind the reward function implementation. When both the aggregated demand and the aggregated cost are high at a particular time instant, the master agent directs the agents of each household to curtail load consumption (shift appliances based on the priority setting of each). Conversely, when both the demand and the cost are low, the master agent selects the valley filling action, directing each household agent to turn on the shifted appliances. For all the other states, the master agent directs the household agents to remain in their respective present states (do nothing action). Fig. 6 depicts the overall RL framework for the proposed appliance scheduling system.
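The state and action logic described above can be sketched as follows. This is only an illustration of the rule the reward function is meant to favor, not the fuzzy reward function or the learned policy itself; the threshold values and all names are hypothetical, since the actual level boundaries are not given here.

```python
from enum import Enum


class Action(Enum):
    SHIFT = "shift"
    VALLEY_FILL = "valley_fill"
    DO_NOTHING = "do_nothing"


# Hypothetical thresholds; the actual level boundaries are not given here.
DEMAND_LOW_KW = 3.0
DEMAND_HIGH_KW = 6.0
COST_HIGH_RS = 80.0


def demand_level(demand_kw):
    """Map aggregated demand to one of three levels."""
    if demand_kw >= DEMAND_HIGH_KW:
        return "high"
    if demand_kw <= DEMAND_LOW_KW:
        return "low"
    return "average"


def cost_level(cost_rs):
    """Map aggregated cost to one of two levels."""
    return "high" if cost_rs >= COST_HIGH_RS else "low"


def master_state(demand_kw, cost_rs):
    """A state is the (demand level, cost level) pair: 3 x 2 = 6 states."""
    return demand_level(demand_kw), cost_level(cost_rs)


def preferred_action(state):
    """Action the master agent is expected to settle on in each state."""
    demand, cost = state
    if demand == "high" and cost == "high":
        return Action.SHIFT
    if demand == "low" and cost == "low":
        return Action.VALLEY_FILL
    return Action.DO_NOTHING
```

For example, `preferred_action(master_state(7.0, 90.0))` evaluates to `Action.SHIFT` under these placeholder thresholds.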
The household agents constitute the environment of the master agent, while the shiftable appliances constitute the environment of the household agents. At each time instant, the master agent takes an action depending on the state. The master agent's actions determine the behavior of the household agents, which then take the appropriate actions as directed by the master agent. After this, the voltage control agent monitors the voltage levels across the distribution network at each time instant, and whenever the voltage level is at or below 0.95 p.u., it acts to raise the voltage. The voltage control mechanism is based on sensitivity analysis, where S_IP and S_IQ are the sensitivity matrices with respect to the real and reactive parts of the current, whereas R and X are the real and reactive parts of the impedances in the impedance matrix [Z] [30]. Fig. 7 depicts the diagram of a 10 bus radial distribution system. The voltage control agent monitors the voltage at each bus in the network at every time instant, and whenever the voltage falls below the prescribed threshold, it requests the corresponding household to curtail load. It then checks the voltage at the bus again and keeps requesting subsequent households to curtail load until the voltage level violation is removed. Fig. 8 shows the flow chart of the proposed voltage control algorithm.
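The check-and-curtail loop described above can be sketched as below. This is a simplified illustration, not the sensitivity-based method of the paper: it assumes a single constant voltage rise per kW curtailed (`sensitivity_pu_per_kw`, a hypothetical parameter) in place of the sensitivity matrices, and the household names and curtailable loads are placeholders.

```python
V_MIN_PU = 0.95  # lower voltage limit used in the text


def restore_voltage(bus_voltage_pu, households, sensitivity_pu_per_kw):
    """Request curtailment household by household until the limit is met.

    households: list of (name, curtailable_kw) in the order they are asked.
    sensitivity_pu_per_kw: assumed constant voltage rise per kW curtailed.
    Returns the final voltage and the names of the households that curtailed.
    """
    curtailed = []
    voltage = bus_voltage_pu
    for name, curtailable_kw in households:
        if voltage > V_MIN_PU:
            break  # violation removed, stop asking
        voltage += sensitivity_pu_per_kw * curtailable_kw
        curtailed.append(name)
    return voltage, curtailed
```

If the list of households is exhausted before the limit is met, the function simply returns the best voltage reached; a fuller model would recompute the bus voltages from a power flow after each request.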
All the agents in the multi-agent RL setting follow the epsilon greedy policy to achieve a balance between exploration and exploitation. In exploration, an agent tries out actions to learn more about its environment, sacrificing immediate reward for the sake of future rewards. In exploitation, the agent takes the best known action at the current state, without regard for future rewards. Following the epsilon greedy policy, the agent either selects a random action with probability ε, or selects a greedy action (the best possible action at the current state according to the Q-table) with probability 1 − ε. As a result, the agent explores its action space with an element of randomness, but does not become completely random. After executing an action at a given state, the agent receives a numerical reward r(s_t, a_t) and transitions to the next state s_{t+1}. This procedure is repeated till the end of the day. Algorithm 1 explains the complete mechanism of the proposed system.
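A minimal sketch of epsilon greedy action selection over one row of a Q-table is given below. The function name and the tie-breaking rule (lowest index wins) are illustrative choices, not details taken from the paper.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of the Q-table.

    With probability epsilon a uniformly random action is returned,
    otherwise the action with the largest Q-value (ties go to the lowest index).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])
```

Passing a seeded `random.Random` instance as `rng` makes a training run reproducible.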

IV. RESULTS AND PERFORMANCE ANALYSIS
The performances of the LSTM model for load forecasting of all the households and multi-agent RL algorithm for TOU pricing based demand response (optimal appliance scheduling) and voltage control are presented in this section.

A. LOAD FORECASTING MODEL
An LSTM model was employed to predict the minute-by-minute variation in load consumption of all households for one day. The historical load consumption data was obtained from households based in Lahore, Pakistan. Since real time demand is not implementable in Pakistan, there was a need to forecast future load demand. The load consumption data set for a period of one year, i.e. 1 June 2018 to 31 May 2019, was available, where 65% of the data set was used to train the LSTM model while the rest of the data set was used for testing the model. The load consumption of individual households was predicted by the LSTM model, and the forecasted consumption of each household was summed and fed to the master agent as one of its state parameters. The forecasting simulations were run in Python on Google Colab. Fig. 9 and Fig. 10 depict the comparisons of actual load vs forecasted load with the LSTM model for household 4 and household 7, the households for which the LSTM gave the highest and lowest performance respectively. It can be seen that the LSTM model follows the variations in load over time closely for both households. The mean absolute error (MAE) and mean absolute percentage error (MAPE), represented by equation 17 and equation 18 respectively, are the performance metrics used to evaluate the LSTM model. Table 3 compares the performance of the LSTM model on each household's load consumption data, while Table 4 shows the experimental error in the value function for the master agent over the course of training.

Algorithm 1 Appliance Scheduling and Voltage Control With Multi-Agent RL System
Do day ahead load forecasting with LSTM
Set γ, ε and α parameters and define rewards r(s_t, a_t) for each agent
Initialize Q(s_t, a_t) to zero
for each iteration do
    for each time step do
        for each agent do
            Choose a random state s_t
            Select a random action a_t from all possible actions for the chosen state
            Execute the chosen action a_t, receive a numerical reward r(s_t, a_t) and observe the next state s_{t+1}
            Determine the maximum Q-value for the next state in the Q-matrix
            Update Q(s_t, a_t) using equation 11
            Set the next state as current state
        end for
    end for
end for
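The two forecasting metrics named above, MAE and MAPE, can be computed as in the sketch below. It uses their standard definitions, which are assumed to match equations 17 and 18; the function names are illustrative.

```python
def mae(actual, forecast):
    """Mean absolute error, in the units of the load (e.g. kW)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is non-zero; zero-load samples would need
    to be skipped or handled separately.
    """
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)
```

MAE is reported in the units of the load, whereas MAPE is scale free, which makes it the more convenient of the two for comparing households with very different consumption levels.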

B. TOU PRICING BASED DR ALGORITHM
A multi-agent system was employed for optimal scheduling of household appliances and voltage control at each bus of the distribution network. Each household was controlled by its own, separately trained agent, while the master agent (SP) controlled all the household agents. The household agents operated on the directions of the master agent to control the various shiftable household appliances, whereas the voltage control agent monitored the voltage at each bus and maintained it within the prescribed limit, i.e. V > 0.95 p.u. Each household naturally has a different priority setting for each appliance; thus, each household agent was trained to shift or turn on appliances in accordance with the priority setting of its corresponding household.
The hyperparameters of the Q-learning algorithm, i.e. α, γ and ε, were all set to 0.9 to achieve a balance between the agent striving for future rewards and, at the same time, giving importance to current rewards. These values also maintained a balance with regard to how much the latest reward received affected the Q-value.
Each agent was trained for a considerable amount of time, which allowed the Q-values of each agent to converge to their respective maximums. The agents were then able to choose the optimal actions for appliance scheduling and voltage control in a given state. The effectiveness of the overall multi-agent system in decreasing the aggregated load consumption can be seen in Fig. 11. The implementation of the DR algorithm reduced the overall load consumption significantly compared to the scenario where DR was not employed. The average load consumption was reduced from 5.23 kW without DR to 3.86 kW with DR. There is a small window of time where valley filling was done, i.e. the appliances shifted from peak hours were turned back on. Fig. 12 depicts the total cost of electricity at each time instant without and with the multi-agent DR algorithm. The cost of electricity was markedly reduced with the TOU pricing based DR algorithm compared to the situation where the DR algorithm was not employed. The average cost without the proposed DR strategy was 94.01 Rs, compared to 60.80 Rs with it. Fig. 13 and Fig. 14 depict the effect of the multi-agent system on the voltage levels at bus 4 and bus 10 respectively. Without the mechanism proposed in this study, the voltage levels violate the prescribed limit. The DR algorithm, along with the voltage control agent, maintained the voltage levels at each bus within the prescribed limits, i.e. V > 0.95 p.u. This added to the effectiveness of the proposed system, where not only was the total cost of electricity reduced, but the voltage levels across the distribution network were also kept within an acceptable range.
Table 5 shows the maximum voltage, minimum voltage, standard deviation of the voltage, average load and average cost at each bus in the distribution network for three different scenarios, i.e. without DR, with DR and voltage control applied on all buses, and with DR but voltage control applied on only the five buses farthest from the source bus. The standard deviation of the voltage is markedly reduced after the implementation of the proposed algorithm compared to the case without DR and voltage control. Moreover, employing voltage control only on the five buses farthest from the source bus also yields very good results, whereby the voltage level at each bus remains at acceptable levels, i.e. V > 0.9 p.u., and hence the SP has to pay less for load curtailment to the remaining four households, adding to its profitability.
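The headline reductions reported in this section can also be expressed in relative terms. The short calculation below uses only the average load and cost figures quoted in the text; the helper name is illustrative.

```python
def relative_reduction_percent(before, after):
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


load_drop = relative_reduction_percent(5.23, 3.86)    # average load, kW
cost_drop = relative_reduction_percent(94.01, 60.80)  # average cost, Rs
print(f"load: {load_drop:.1f} %, cost: {cost_drop:.1f} %")
```

This corresponds to a drop of roughly 26 % in average aggregated load and roughly 35 % in average aggregated cost.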
Latency in communication forms a core part of the proposed voltage control strategy. Fig. 15 shows the number of communication iterations between the agents required to keep the voltage levels within the prescribed limit. Each machine-to-machine interaction takes 20 ms on average, based on a study of LTE network communication [31].

V. CONCLUSION
In this study, a multi-agent RL system was proposed for TOU pricing based DR and voltage control, the aim being to reduce the total cost of electricity and remove voltage level violations from the system. An LSTM network was employed for day ahead load forecasting to cope with future uncertainties. The RL system in combination with the LSTM network was used to make the optimal decisions with regard to appliance scheduling of each household and voltage control. The effectiveness of the proposed scheme is shown by the simulation results, in which not only was the overall cost of electricity reduced, but voltage levels were also maintained within the prescribed limits. The work done in this paper is summarized as follows:
1. This paper implemented a decentralized, multi-agent DR system, where each household's load was controlled by its respective agent, and each household agent was controlled by a master agent (SP).
2. The household agents did not need to communicate with each other, reducing the overall complexity of the system, and also making sure that the privacy of the customers was not compromised.
3. This study also took into account the diversity of the households or customers with respect to their load consumption patterns, making sure that the appliance scheduling for each household was done according to its respective preference or priority.
4. This paper employed RL for the optimum scheduling of appliances for each household. RL is adaptive and model free, allowing the SP to independently determine the optimum appliance schedule for each household, without needing to have prior knowledge about the system.
5. This paper accomplished real time performance by predicting the load consumption data of each household through the use of LSTM networks, thus mitigating future uncertainties.
6. Apart from the optimum scheduling of appliances, this work also implemented a mechanism to maintain the voltage levels across the distribution network of all the 9 households within prescribed limits, using a separate RL agent for voltage control.
7. The electricity costs with and without DR were compared.
In the future, the aim is to extend this work to incentive based DR, which also forms a core part of the DR field. Furthermore, if available, Real Time Pricing will also be employed in future work.