Electric Water Heaters Management via Reinforcement Learning With Time-Delay in Isolated Microgrids

Isolated microgrids powered by renewable energy sources, battery storage, and backup diesel generators need appropriate demand response to utilize available energy efficiently and reduce diesel consumption. However, real-time demand-side management has become a significant challenge due to communication time-delay. In this paper, a distributed model-free strategy is proposed to manage the demand of Electric Water Heater (EWH) units. Distributed artificial intelligence based on Reinforcement Learning (RL) is adopted to independently control 150 EWHs using a virtual tariff. Two strategies are proposed to generate the virtual tariff, and they are compared to investigate the impact of communication time-delay on the proposed RL algorithm in a real-time control scenario. The first strategy is based on measuring the battery State of Charge (SOC) in real time, while the second is based on predicting the SOC 24 hours in advance using an Artificial Neural Network (ANN). The results show that communication time-delay greatly influences the convergence of the first method, while the second method shows high immunity. The results also show that the proposed algorithm reduces energy consumption by an average of 8.91% (6.675 kW) for each EWH, which demonstrates the viability of the proposed approach.

α: The learning rate.
η: The learning rate coefficient.
µ: The mean value.
ρ: The mass density of the water.
σ: The standard deviation value.
ϕ_1(t): The Gaussian density.
ϕ_2(t): The stochastic delay density.
A: The cross-sectional vector area.
a_t: The action.
E_t: Reward for running the EWH.
erf(·): The error function.
L_t: Reward for water tank temperature.
M_f: The mass of water in the full tank.
M_t: The demand for inlet cold water to the tank.
r_t: Total reward function.
s_t: The state.
t: The time index.
T_o: The ambient temperature (°C).
T_t: The current average water temperature in the tank.
Tariff_k: Virtual tariff.
Temp_l: Water temperature.
ToD: Time of day.
v: The flow velocity of the mass elements.
UA: The heat loss coefficient.

The associate editor coordinating the review of this manuscript and approving it for publication was Tallha Akram.

I. INTRODUCTION
The world is rapidly turning into a global village, and the requirement for energy and other related services is also increasing. However, 1.4 billion people worldwide still lack access to electricity, and about 85% of them live in rural areas [1]. The CO2 emissions from the electricity and commercial heat used in buildings have increased to 10 GtCO2 [2]. Due to the depletion of fossil fuels and their associated environmental impact, more distributed generators based on Renewable Energy Sources (RES) are penetrating the current power systems market. This will not only mitigate the global climate change caused by fossil fuels but also support the social and economic development of remote and isolated communities [3]-[5]. Energy storage is considered an essential element to balance generation and demand. Energy management of storage and non-critical loads is also vital to improve the economic performance and reliability of an environmentally friendly power system [6]. Domestic hot water consumption accounts for up to 40% of the total domestic energy usage [7]. An effective control strategy needs to optimize the total power consumption, including domestic hot water consumption, among renewable generators, energy storage systems, and other facilities to minimize fuel consumption while meeting load demand. It can not only help these traditional power networks upgrade to smart grids, but also reduce the cost of fossil fuels in the entire island power system, optimize the energy structure, and reduce greenhouse gas emissions. Many control and optimization approaches have been investigated to achieve optimal results in energy systems, such as Linear Programming (LP) [8], Mixed Integer Linear Programming (MILP) [9], [10], Mixed Integer Non-Linear Programming (MINLP) [11], and Genetic Algorithm (GA) [12]. These existing traditional analytical approaches are quite cumbersome and need several simplifying assumptions.
They all require a detailed mathematical model of the system and some of them require system linearisation.
Artificial Intelligence (AI) based methods can, however, perform complex non-linear non-convex optimization and predict the energy demand and generation without the need for a mathematical model [13]. There has been a growing interest in the application of AI-based algorithms in energy systems. Several methods have also been presented to predict power consumption in energy systems, including Artificial Neural Networks (ANN) [14], Multiple Linear Regression (MLR) [15], Support Vector Machine (SVM) [16], and Decision Tree (DT) [17]. For energy prediction in buildings, Mechaqrane and Zouak [18] presented a performance comparison between a linear Auto-Regressive model with exogenous input (ARX) and a neural network ARX (NNARX) model to forecast the indoor temperature, with the latter resulting in improved efficiency.
In recent years, RL has been used to implement Demand Response (DR) and distributed energy management strategies for smart homes and smart grids [19]-[21]. Xu et al. proposed a completely distributed multi-agent RL approach to optimize the reactive power dispatch. The proposed Q-learning algorithm can increase the learning speed and achieve near-optimal solutions [22]. In [23], a cooperative RL algorithm is proposed for distributed economic dispatch without using a specific mathematical model. In [24], a Markov decision process (MDP) was used to model the energy trading process and an RL algorithm was utilized to optimize the decisions in the MDP. The simulation results verified the performance of the proposed demand side management system. When interacting with a specific environment, RL-based optimization algorithms can learn and choose actions based on experience [25]. In contrast, traditional optimization methods need specific mathematical models of the system and its environment, which require a large amount of data, knowledge of control, and expertise. In [26], a distributed energy management strategy for a combined heat and power system and a vanadium redox battery was introduced to optimize the discharging policy using RL. A deep RL based energy trading scheme with multiple microgrids was proposed in [27] to optimize the energy trading policy. An RL based distributed energy management scheme was presented in [28] to maximise the profit through energy management and load scheduling without prior information. Another distributed operation strategy was proposed in [29] to operate a community battery energy storage system based on a double deep Q-learning method. In [30], the authors presented a decentralised MDP to solve an online decentralised and cooperative dispatch problem in order to calculate the approximate Q-value function considering communication delay.
These studies, however, did not consider the demand side management of Electric Water Heater (EWH) units. In fact, EWHs are responsible for nearly 30% of the electricity utilised by domestic consumers in winter-dominated climates [31].
The application of RL for demand side management of EWHs has started to receive some attention in the literature. Al-Jabery et al. [32] proposed a fuzzy Q-learning method to control an EWH, and showed that the proposed algorithm could achieve global convergence. In [33], the proposed Q-learning and action dependent heuristic dynamic programming methods are shown to reduce the cost of domestic EWH energy consumption by approximately 26% and 21%, respectively. A batch RL approach was presented in [34] to control a cluster of 100 EWHs to decrease the daily cost within a learning period of 45 days. The study in [35] applied a fitted Q-iteration algorithm to an EWH to control the heater's ON/OFF actions. It is shown that energy consumption was reduced by 15% in comparison to that when a thermostat controller was used. Somer et al. [36] proposed a model-based RL approach to optimize the heating cycles of an EWH to maximise the self-consumption of the local PV generation. Six residential buildings were tested and the self-consumption of PV generation was increased substantially. Another RL scheme to optimize the hot water production was presented in [37]. A set of 32 houses in the Netherlands was used, and the energy consumption was reduced by roughly 20% without affecting customers' comfort.
In the above studies, EWHs are considered as standalone systems with their own constraints, and they are not considered as an integral part of a larger power network that also includes intermittent RES, limited capacity energy storage systems, diesel generators, and Information and Communication Technology (ICT) systems. Taking a comprehensive approach that also includes the influence of time-delay is important to realise reliable smart grids in practice. Moreover, in many islands, the energy tariff is subsidised and fixed, and thus there is no incentive for consumers to change their consumption behaviours. Furthermore, lacking knowledge of each consumer's demand profile makes centralised control of EWH demand less efficient in reducing total power demand while satisfying each individual consumer's comfort requirements. Therefore, this paper proposes an intelligent hierarchically distributed strategy based on RL to control EWH units in isolated microgrids. The distributed controllers use a virtual dynamic tariff that is generated centrally. The virtual tariff can be determined and broadcasted hourly using direct measurement of the battery's SOC. Realising that this method makes the system prone to communication delays and packet loss, an alternative approach that is based on prediction is proposed. This method requires the tariff to be broadcasted only once a day. The main highlights and contributions of the paper can be summarised as follows:
1) This paper proposes a distributed control framework for isolated networks whose energy tariff is fixed (as is the case on Ushant Island). The proposed framework uses distributed controllers to optimize 150 electric water heaters independently. The main consideration is that the water consumption habits of each household are different, so the distributed approach can control the water temperature more accurately and reduce diesel energy consumption.
2) A distributed RL controller based on a distributed Q-learning algorithm is adopted. It can learn how to choose actions based on experience and be directly applied in real time to reduce diesel consumption effectively with different EWH demand profiles. Seven scenarios with different combinations of RES and energy storage are considered. Simulation results show that the diesel energy consumption with the RL algorithm is reduced by an average of 8.91% (6.675 kW) compared to controlling the EWHs by traditional hysteresis control. The proposed algorithm can therefore support the service provider in optimizing the overall energy operation.

3) A dynamic virtual tariff is proposed as a cost indicator to provide a directive/incentive signal for the local RL-based controllers to optimize diesel consumption. To investigate the impact of potential time delays on the RL algorithm, two approaches, a direct measurement (DM) strategy and a prediction strategy, are proposed for generating the virtual tariff. The simulation results show that communication time-delay produces fluctuations during the iterations and also affects the final convergence results. This demonstrates that the prediction strategy allows the framework to execute the algorithm while ensuring communication quality. Results show that the errors caused by the prediction strategy are negligible.
The rest of this paper is organised as follows. The microgrid network is described in Section II along with the proposed virtual tariff. The electric water heater mathematical model and the time-delay model are introduced in Section III. The proposed algorithms are introduced in Section IV. In Section V, simulation results for different scenarios are presented. Finally, Section VI presents the conclusion.

II. MICROGRID DESCRIPTION AND DYNAMIC VIRTUAL TARIFF
The standalone microgrid under study is shown in Fig. 1. It consists of RES, a battery energy storage system (BESS), a diesel generator, and domestic loads. When the diesel generator is not operating, the battery unit acts as the grid forming unit controlling the bus voltage and frequency, and hence absorbing surplus power and supplying shortage power. However, the battery has a finite capacity and thus when it is fully charged, renewable energy production has to be curtailed. Similarly, when it is fully discharged, either some of the loads have to be shed or the diesel generator has to be dispatched. The required capacity of a BESS is normally determined by a set of various factors, such as uncovered energy demand, and excess renewable energy generation, in addition to the technical and financial constraints. If the battery is to be sized to completely eliminate the need for the diesel generator, i.e., rely 100% on RES, the battery capacity has to be large enough to cover any shortage in energy even if it happens very rarely. This may result in a high capital cost of the battery and, therefore, it is more economical to size the battery to cover 80% or 90% of the renewable energy generation and rely on the diesel generator to cover the rest. On the other hand, DR can play an important role in reducing the diesel usage and the battery size.

A. VIRTUAL TARIFF
The general purpose of any DR is to shift the load demand to time periods when the electricity price is low. However, in many islands like Ushant, the energy tariff is fixed and thus traditional demand response becomes difficult. To deal with this hurdle, a dynamic virtual tariff is proposed to optimize the distributed operation of EWHs independently. This tariff is generated at the energy management system (EMS) based on the surplus/shortage of renewable energy generation and hence the battery SOC and the consumption of diesel.
When the SOC is at its maximum limit, there is a surplus of renewable energy. When the SOC is between its maximum and minimum limits, the renewable energy and the battery are able to supply the power demand. However, when the SOC reaches its minimum limit, there is a shortage of energy and the diesel generator must be started to cover the deficiency. Therefore, the tariff can be simply divided into three levels to reflect the surplus/shortage of renewable energy. The value of the proposed tariff is on a scale of 1 to 3. When the SOC is at its maximum limit, the tariff is set to level 1. When the SOC lies between 30% and 100%, the tariff is set to level 2. When the SOC reaches its minimum value of 30%, the tariff is set to level 3, which means that the battery is fully exhausted, discharge is not allowed, and the diesel generator is operating.
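The three-level rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `virtual_tariff` is hypothetical, and treating 100% as the maximum SOC limit is an assumption read from the stated 30% to 100% range.

```python
SOC_MIN = 30.0   # minimum SOC limit in percent (stated in the text)
SOC_MAX = 100.0  # maximum SOC limit in percent (assumed to be full charge)


def virtual_tariff(soc_percent: float) -> int:
    """Map a battery SOC in percent to the three-level virtual tariff."""
    if soc_percent >= SOC_MAX:
        return 1  # battery full: renewable surplus
    if soc_percent <= SOC_MIN:
        return 3  # battery exhausted: diesel generator operating
    return 2      # renewables plus battery cover the demand
```

For example, `virtual_tariff(65.0)` returns level 2, while a reading at the 30% floor returns level 3.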
The proposed control structure is shown in Fig. 1. It consists of a centralised controller that generates the virtual tariff at the central EMS and distributed controllers for EWHs. Two strategies for generating the virtual tariff are proposed:

1) DIRECT MEASUREMENT STRATEGY
Every hour, the battery SOC is measured directly from the Battery Management System (BMS) and the virtual tariff is then determined, as explained above, and broadcasted to the EWHs' controllers in real time. This method is based on real data but it is prone to communication delays and packet loss.

2) PREDICTION STRATEGY
At the start of each day, the historical data is used to predict the generation of RES and load demand for a 24-hour horizon. Generation of renewable energy sources and load demand can be predicted with high accuracy [38], [39], and thus they are assumed to be known during the optimization process. Two years' historical data is used to train an Artificial Neural Network (ANN) model to predict renewable energy generation and load demand, and is updated every 24 hours. A battery model is then used to calculate the SOC profile for 24 hours. The virtual tariff is then calculated and broadcasted to the distributed controllers. This strategy broadcasts the tariff once a day which will reduce the potential impact of communication delays or packet loss in advance.
Once the tariff is broadcasted, the distributed RL controllers will select appropriate actions to operate the EWHs locally to minimise virtual cost in real-time which will result in a reduction in diesel consumption in the island but at the same time satisfy consumers' requirements in terms of maintaining comfortable water temperature.
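As a rough illustration of the prediction strategy, the sketch below turns a predicted 24-hour generation and load profile into an SOC profile and a tariff profile. The battery model is an assumption (a lossless energy bucket stepped hourly), since the text does not specify the model used; the 0.2 MW / 2 MWh rating is taken from the simulation setup, and the function name `daily_tariff_profile` is hypothetical.

```python
CAPACITY_MWH = 2.0     # battery energy capacity (from the simulation setup)
POWER_LIMIT_MW = 0.2   # battery power rating (from the simulation setup)
SOC_MIN, SOC_MAX = 0.30, 1.00  # SOC limits as fractions of capacity


def daily_tariff_profile(gen_mw, load_mw, soc0):
    """Return (soc, tariff) lists for hourly predicted generation and load.

    Assumes a lossless battery and one-hour time steps.
    """
    soc = soc0
    socs, tariffs = [], []
    for gen, load in zip(gen_mw, load_mw):
        # Battery absorbs surplus or supplies shortage, within its power rating.
        power = max(-POWER_LIMIT_MW, min(POWER_LIMIT_MW, gen - load))
        soc += power * 1.0 / CAPACITY_MWH          # 1.0 h step
        soc = max(SOC_MIN, min(SOC_MAX, soc))      # enforce SOC limits
        socs.append(soc)
        if soc >= SOC_MAX:
            tariffs.append(1)   # surplus: renewable energy would be curtailed
        elif soc <= SOC_MIN:
            tariffs.append(3)   # shortage: diesel generator needed
        else:
            tariffs.append(2)
    return socs, tariffs
```

The resulting 24-element tariff list is what would be broadcast once at the start of the day under this strategy.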

III. SYSTEM MODELLING

A. ELECTRIC WATER HEATER MODEL
The thermal model of the EWH describes the dynamic heat-power exchange while considering the inlet cold water and environmental conditions. The dynamic thermal model can be obtained using the Equivalent Thermal Parameter (ETP) approach [40], [41]. When the EWH is ON between time t and t + 1, the temperature at t + 1 can be obtained from the ETP model, where $\alpha = \frac{1}{RC}$, $\beta = \frac{T_o}{RC}$ and $R = \frac{1}{UA}$. Q is proportional to the power rating of the EWH.
On the other hand, when the EWH is OFF between time t and t + 1, Q is zero and the temperature at t + 1 drops due to the thermal loss and inlet cold water.
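To make the ON/OFF behaviour concrete, the following sketch steps a generic first-order ETP model, $\frac{dT}{dt} = -\alpha T + \beta + \frac{Q}{C}$, over one interval using its closed-form solution. This is an assumed textbook form that is consistent with the definitions of α, β and R given above; it is not necessarily the exact equation used in this paper, and it ignores the inlet cold water. All numerical defaults (tank capacity, heat loss coefficient, heater power, ambient temperature) are hypothetical.

```python
import math


def ewh_step(temp_c, heater_on, dt_s=600.0, q_w=3000.0,
             c_j_per_k=627_900.0, ua_w_per_k=2.0, ambient_c=20.0):
    """Advance the tank temperature by dt_s seconds (no water draw).

    Assumed model: C dT/dt = Q - UA (T - T_o), solved exactly for constant Q.
    """
    r = 1.0 / ua_w_per_k                 # thermal resistance R = 1 / UA
    alpha = 1.0 / (r * c_j_per_k)        # alpha = 1 / (R C)
    q = q_w if heater_on else 0.0        # Q is zero when the EWH is OFF
    steady = ambient_c + q * r           # temperature reached as t -> infinity
    return steady + (temp_c - steady) * math.exp(-alpha * dt_s)
```

With these illustrative values, a 10-minute ON interval raises the temperature and an OFF interval lets it decay slowly towards ambient, which matches the qualitative behaviour described in the text.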
Consumed hot water is continuously replaced by cold water through the tank inlet. Therefore, the water temperature must account for this inlet flow. Combining equations (1) to (3) gives the mathematical function that describes the dynamics of the EWH.

In the time-delay model, $p + q = 1$ and $\phi_1(t) * \phi_2(t) = \int_0^t \phi_1(u)\,\phi_2(t - u)\,du$. $\phi_1(t)$ is the deterministic delay density, which can be approximated by a Gaussian density with mean $\mu$ and standard deviation $\sigma$, i.e. $\phi_1(t) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(t-\mu)^2}{2\sigma^2}}$. $\phi_2(t) = \lambda e^{-\lambda t}$ is the stochastic delay density, which is assumed to follow an exponential distribution generated by an alternating renewal process with mean closure-period length $\lambda^{-1}$.
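The convolution of the Gaussian and exponential delay densities can be evaluated numerically. The sketch below uses a trapezoidal rule for $\int_0^t \phi_1(u)\phi_2(t-u)\,du$; the function names and any parameter values passed in are illustrative assumptions, not values from the paper.

```python
import math


def gaussian_density(t, mu, sigma):
    """phi_1(t): Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-((t - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def exponential_density(t, lam):
    """phi_2(t) = lam * exp(-lam * t) for t >= 0, zero otherwise."""
    return lam * math.exp(-lam * t) if t >= 0.0 else 0.0


def convolved_density(t, mu, sigma, lam, steps=2000):
    """Trapezoidal estimate of the convolution integral over [0, t]."""
    if t <= 0.0:
        return 0.0
    h = t / steps
    total = 0.0
    for i in range(steps + 1):
        u = i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        # max() guards against a tiny negative argument from rounding at u = t.
        total += weight * gaussian_density(u, mu, sigma) * exponential_density(max(0.0, t - u), lam)
    return total * h
```

When the Gaussian has negligible mass below zero, the convolved density integrates to approximately one, which gives a quick sanity check on the chosen µ, σ and λ.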

IV. PROPOSED REINFORCEMENT LEARNING FOR EWH CONTROL
Reinforcement learning is an area of machine learning concerned with how to take actions in an unknown environment so as to maximise a cumulative reward. It learns by modifying an optimization policy in real-time through interacting with the environment and using past experience. The dynamic EWH problem is modelled as a discrete finite MDP. In this model, the EWH operation (ON/OFF) depends on the virtual tariff and the water temperature. RL elements including state and action spaces, reward function, learning and exploration rates, and discount factor are described in detail in the following subsections:

1) STATE SPACE
The state variables are time of day ($ToD_j$), virtual tariff ($Tariff_k$) and water temperature ($Temp_l$):

$$s_t = \{ToD_j,\ Tariff_k,\ Temp_l\}$$

where ToD is discretised into J = 144 intervals (24 × 6, one every 10 minutes), the virtual tariff is divided into K = 3 levels in the range of 1 to 3, and the water temperature is divided into L = 5 levels between 55 °C and 70 °C.
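One possible way to index this discrete state space in a Q-table is sketched below. The equal-width temperature bins and the flat index ordering are assumptions made for illustration; the text only fixes J = 144, K = 3 and L = 5, which gives 144 × 3 × 5 = 2160 states.

```python
J, K, L = 144, 3, 5            # ToD slots, tariff levels, temperature levels
TEMP_MIN, TEMP_MAX = 55.0, 70.0


def tod_index(hour, minute):
    """10-minute slot of the day, 0..143."""
    return (hour * 60 + minute) // 10


def temp_index(temp_c):
    """Equal-width temperature bin, 0..4 (assumed 3 degC per bin)."""
    width = (TEMP_MAX - TEMP_MIN) / L
    idx = int((temp_c - TEMP_MIN) // width)
    return max(0, min(L - 1, idx))   # clamp readings outside 55..70 degC


def state_index(hour, minute, tariff, temp_c):
    """Flatten (ToD_j, Tariff_k, Temp_l) into one integer in 0..J*K*L-1."""
    j = tod_index(hour, minute)
    k = tariff - 1                    # tariff levels 1..3 map to 0..2
    l = temp_index(temp_c)
    return (j * K + k) * L + l
```

A flat integer index like this lets the Q-table be stored as a simple 2160 × 2 array, one column per ON/OFF action.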

2) ACTION SPACE
The action space consists of the ON/OFF commands for each EWH.

3) REWARD FUNCTION
The total reward $r_t$ is built from $E_t$, the reward for running the EWH, and $L_t$, the reward for water tank temperature. $r_t$ is used to update the Q-table and to encourage the agent to choose the appropriate action. $E_t$ is based on the virtual tariff, and $L_t$ reflects consumer preferences and comfort requirements. $L_t$ is represented by Fig. 2. It applies a large negative penalty when the temperature goes outside the range of 55 °C to 70 °C. It is a dimensionless coefficient. Furthermore, $L_t$ takes its highest value when the temperature is about 68 °C, which reflects consumer preference. Other preferences can be implemented by modifying the reward function. The main purpose of the reward function is to update the Q-table and let the agent know the quality of different actions. During the iterative process, the reward value trains the RL agent to choose the best action with high probability.
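A reward of this kind could be sketched as follows. Everything numerical here is a hypothetical stand-in: the actual shape of the temperature reward is given by Fig. 2, which is not reproduced in the text, and the additive combination of the two terms is an assumption. Only the qualitative features (tariff-based running cost, large penalty outside 55 °C to 70 °C, peak near 68 °C) come from the description above.

```python
TEMP_MIN, TEMP_MAX, TEMP_PREF = 55.0, 70.0, 68.0
OUT_OF_RANGE_PENALTY = -100.0   # hypothetical magnitude


def running_reward(heater_on, tariff):
    """E_t: cost of running the EWH, larger at higher virtual tariff (assumed linear)."""
    return -float(tariff) if heater_on else 0.0


def comfort_reward(temp_c):
    """L_t: dimensionless comfort term, peaking near the preferred 68 degC."""
    if temp_c < TEMP_MIN or temp_c > TEMP_MAX:
        return OUT_OF_RANGE_PENALTY
    # Assumed triangular shape: 1 at 68 degC, falling to 0 at 55 degC.
    return 1.0 - abs(temp_c - TEMP_PREF) / (TEMP_PREF - TEMP_MIN)


def total_reward(heater_on, tariff, temp_c):
    """r_t: assumed here to be the sum of E_t and L_t."""
    return running_reward(heater_on, tariff) + comfort_reward(temp_c)
```

Changing `TEMP_PREF` or the shape of `comfort_reward` is how a different consumer preference would be expressed in this sketch.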

4) Q-LEARNING
In the Q-learning algorithm, an action at a given state is chosen to explore or exploit the future reward value. The Q-value table $Q(s_t, a_t)$ is updated at each iteration. The highest value for each state s in the Q-table corresponds to the highest expected reward after taking an action. The optimal updating policy based on the Bellman equation [43] is expressed as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right) \quad (12)$$

where α controls how much previous learning is retained in the update of the Q-table and γ is the discount factor. α starts at 0.9, and after 80 days of training it becomes 0.15.
To ensure exploration, an ε-greedy policy is selected [25]. The strategy either picks an arbitrary action with probability ε, or takes the action corresponding to the maximum value in the Q-value table. ε starts at 0.8 to enable sufficient levels of exploration, and after 80 days of training it becomes 0.01 as the focus moves to exploiting the optimal policy. Note that both α and ε decrease with the number of days, so that sufficient exploration is ensured even as the learning process goes on. The pseudo code of the proposed RL algorithm shows the detailed DR algorithm, including the prediction strategy and the DM strategy. For the RL agent to learn the optimal policy, it has to explore actions that are less rewarding in order to learn from experience. Therefore, it is wise to train the agent offline using historical load/generation data and a mathematical model for the EWH before commissioning. This avoids operating the real EWHs suboptimally during the learning period.
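The decay of α and ε over training days can be illustrated with a simple schedule. The geometric interpolation below is an assumption: the text only fixes the end points (α from 0.9 to 0.15 and ε from 0.8 to 0.01 after 80 days), and the helper names are hypothetical.

```python
def decayed(start, end, day, horizon_days=80):
    """Geometric interpolation from start (day 0) to end (day >= horizon_days)."""
    frac = min(max(day, 0), horizon_days) / horizon_days
    return start * (end / start) ** frac


def learning_rate(day):
    """alpha: 0.9 at day 0, 0.15 from day 80 onwards."""
    return decayed(0.9, 0.15, day)


def exploration_rate(day):
    """epsilon: 0.8 at day 0, 0.01 from day 80 onwards."""
    return decayed(0.8, 0.01, day)
```

Any monotone schedule with the same end points would serve the same purpose; the geometric form is chosen here only because it keeps both rates strictly positive.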
During offline training, two years' data is provided to train the RL algorithm day by day. Once the training is established, RL can then be commissioned to control EWHs in real time in a model-free fashion; it applies the ON/OFF actions to the real EWH, measures its reward and updates its parameters accordingly. If there is a difference between the model and/or the hot water demand used in the model and those in practice, RL can also adapt to this change thanks to it learning capability.
At the end of each day, the microgrid load/generation data of that day are fed back to the ANN to keep updating the historical data that is used for prediction, as shown in Fig. 1.

Algorithm: EWH-Based Reinforcement
6: Choose a_t from current s_t via ε-greedy policy
7: Take action a_t
8: Obtain reward r(s_t, a_t) and next new state s_{t+1}
9: Q(s_t, a_t) ← Q(s_t, a_t) + α (r(s_t, a_t) + γ max Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))
10: Output the optimal policy
11: End Process
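The recoverable steps of the algorithm (ε-greedy choice, action, reward, Q-table update) correspond to standard tabular Q-learning, which can be sketched as below. This is a generic illustration rather than the authors' Matlab code: `env_step` is a hypothetical callable standing in for the EWH model or the real heater, and states can be any hashable value.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1)  # 0 = OFF, 1 = ON


def make_q_table():
    """Q-table that lazily creates a zero row for unseen states."""
    return defaultdict(lambda: [0.0 for _ in ACTIONS])


def epsilon_greedy(q, state, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    values = q[state]
    return max(ACTIONS, key=lambda a: values[a])


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """One Q-learning update towards reward + gamma * max Q(next_state, .)."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


def run_episode(q, env_step, start_state, n_steps, alpha, gamma, epsilon, rng=random):
    """Interact for n_steps; env_step(state, action) -> (next_state, reward)."""
    state = start_state
    for _ in range(n_steps):
        action = epsilon_greedy(q, state, epsilon, rng)
        next_state, reward = env_step(state, action)
        q_update(q, state, action, reward, next_state, alpha, gamma)
        state = next_state
    return q
```

During offline training `env_step` would wrap the EWH mathematical model; after commissioning, the same loop would read the measured temperature and broadcast tariff instead.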

V. SIMULATION RESULTS
Numerical simulation has been carried out to assess the performance of the proposed DR. The microgrid shown in Fig. 1 has been used in this simulation. The training data is obtained from Ushant Island in France for the period from January 1st, 2014 to December 30th, 2015. Another data set from 2016 is used for real-time testing. The hot water demands of the 150 EWH units follow a Poisson distribution whose rate is proportional to the hourly average household hot water usage adopted from [44]. A 0.2 MW/2 MWh Lithium-ion battery storage is used. Seven different renewable energy generation scenarios are explored, as shown in Table 1 [45]. Scenarios 1, 2 and 3 consider wind and solar PV generation, while scenarios 4, 5 and 6 consider solar PV and tidal generation. Scenario 7 consists of all three types of RES. The diesel generator supplies power only if the load demand cannot be met by the RES and the battery. The ANN model is a long short-term memory (LSTM) network with 256 units. Adam, which is a replacement optimization algorithm for stochastic gradient descent for training deep learning models, is selected as the optimizer with a learning rate of 0.01, implemented in Python. The RL models are established and tested in Matlab.
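The Poisson demand generation described above could be sketched as follows using only the Python standard library. The hourly mean profile `HOURLY_MEAN` is a made-up placeholder (the actual profile comes from [44] and is not reproduced here), and the sampler uses Knuth's multiplication method, which is suitable for the small rates involved.

```python
import math
import random

# Hypothetical hourly mean demand for one household (24 values, arbitrary units).
HOURLY_MEAN = [0.2] * 6 + [1.5, 2.0, 1.0] + [0.5] * 8 + [1.5, 2.0, 1.5, 1.0] + [0.3] * 3


def poisson_sample(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def daily_demand(n_units, rng):
    """One day of hourly demand draws for n_units independent EWHs."""
    return [[poisson_sample(mean, rng) for mean in HOURLY_MEAN] for _ in range(n_units)]
```

Calling `daily_demand(150, random.Random(1))` yields 150 independent 24-hour profiles, so each simulated household sees a different hot water usage pattern, which is the motivation for distributed rather than centralised control.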

A. PERFORMANCE OF THE PROPOSED RL-BASED STRATEGY
The generation and demand data for scenario 3 with a 2 MWh storage are shown in Fig. 3(a). Battery power and SOC as well as the power from the diesel generator are shown in Fig. 3(b). The virtual tariff is generated by the centralized EMS and is shown in Fig. 3(c) along with surplus and deficit powers. It is clear that the virtual tariff can accurately describe the current state of energy storage and the surplus/deficit of renewable energy, i.e. the state of energy in the whole microgrid.

1) OFF-LINE SIMULATION OF THE RL-BASED STRATEGY (ONE DAY DATA)
The purpose of this simulation is to demonstrate the ability of RL to achieve optimal performance. Using one day's virtual tariff, the RL algorithm updates the Q-table and repeats the iterations on the same daily data until convergence is achieved. The energy consumption of an EWH for both strategies is shown in Fig. 4, along with the results obtained using a GA optimization algorithm and the traditional hysteresis control. Two other global optimization approaches, the Simulated Annealing (SA) algorithm and the Particle Swarm (PS) algorithm, are also utilized to verify the experimental results of RL, as shown in Table 2. An optimal solution can only be achieved if a continuous state/action space is used. Furthermore, because of the large search space, the computation is expensive and time-consuming if all state-action pairs need to be visited. The proposed RL can quickly search for sub-optimal solutions and perform real-time control. The results demonstrate that the proposed RL algorithm can reach near-optimal results within a few iterations. The energy consumption using the DM strategy and the prediction strategy is 62.53 kWh and 63.73 kWh, respectively. Both values are very close to the GA result of 61.33 kWh. The energy consumption when the EWH is controlled by the traditional hysteresis controller is 90.67 kWh. Table 3 shows the effect of modifying a single hyper-parameter on the experimental results; appropriate hyper-parameter settings achieve better results than the alternatives. However, when the time-delay model is considered, it is shown that the time-delay can lead to large fluctuations and poor convergence. The GA optimizer finds the optimal solution from the simulations, and this will always happen unless the GA uses a different model, e.g. a linearised model or a model without noise. The RL controller converges quickly to the optimal solution whilst directly interacting with the environment, i.e. without relying on the simulation model. The oscillations are caused by the controller trying to explore new action-state pairs. The superiority of the proposed RL strategy with the prediction strategy is clearly demonstrated.

Two years of historical generation/demand data from Ushant Island are used to generate the virtual tariff for two years. The virtual tariff according to the direct measurement of SOC is then used to train the RL agent offline using the EWH mathematical model. Energy consumption is shown in Fig. 5(a). The ability of the RL approach to track the optimum cost achieved by the GA algorithm is clear.
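The one-day energy figures reported above (62.53 kWh, 63.73 kWh, 61.33 kWh and 90.67 kWh) can be put in relative terms with a short calculation. The snippet only restates the reported numbers; the helper names are illustrative.

```python
# One-day EWH energy use in kWh, as reported in the text.
ENERGY_KWH = {"rl_dm": 62.53, "rl_prediction": 63.73, "ga": 61.33, "hysteresis": 90.67}


def saving_vs(baseline_kwh, value_kwh):
    """Relative saving of value_kwh against baseline_kwh, in percent."""
    return 100.0 * (baseline_kwh - value_kwh) / baseline_kwh


def gap_to(reference_kwh, value_kwh):
    """Relative excess of value_kwh over reference_kwh, in percent."""
    return 100.0 * (value_kwh - reference_kwh) / reference_kwh


for name in ("rl_dm", "rl_prediction"):
    saving = saving_vs(ENERGY_KWH["hysteresis"], ENERGY_KWH[name])
    gap = gap_to(ENERGY_KWH["ga"], ENERGY_KWH[name])
    print(f"{name}: {saving:.1f}% below hysteresis, {gap:.1f}% above GA")
```

On this single-day example, both RL variants use roughly 30% less energy than hysteresis control while staying within about 2% to 4% of the GA result.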

3) REAL TIME CONTROL OF EWH
The trained RL is used to control 150 EWH units in real time as explained in subsection IV-4. The yearly island load data of 2016 and the resources from scenario 3 are used. The virtual tariff is generated and broadcasted to the EWHs in two ways, as explained in Section II-A: daily broadcast using ANN prediction of SOC, and hourly broadcast using direct measurement of SOC. The trained RL agent issues the ON/OFF actions on an hourly basis. At the end of each hour, the reward is calculated and the next action is chosen. Fig. 5(b) shows the energy consumption for one year along with the results obtained using the GA. Each day has its own optimal consumption value and the proposed RL strategy is able to track this effectively. Both strategies for virtual tariff generation reduce energy consumption significantly, and the results are close to the global optimization policy when time-delay is not considered. Using RL with the DM strategy for generating the virtual tariff reduces diesel consumption by 8.91% (6.675 kW) compared to controlling the EWHs by traditional hysteresis control. If the virtual tariff is generated by the prediction strategy, diesel consumption is reduced by 8.85%, which is only 0.06 percentage points less than with the DM strategy. This difference, caused by the prediction error, is minimal. The advantages of using the ANN are the avoidance of hourly communication with the EWH units and the ability to provide customers with the virtual tariff profile in advance.
The proposed DR scheme is applied to the seven RES scenarios shown in Table 1. The total generation and diesel consumption are presented in Fig. 6 with 150 EWH units being controlled by hysteresis, GA, and direct measurement strategy based RL controllers. It can be noticed that both the RL algorithm and the GA algorithm save diesel cost significantly compared with the traditional hysteresis control, especially in scenario 7. However, the GA requires knowledge of all information in advance and spends a lot of computing resources to get optimal results. The annual energy consumption when using the RL strategy is very close to that of the GA-based strategy in all scenarios, yet the proposed RL algorithm achieves these near-optimal results in real-time control with no previous knowledge of EWH models. Furthermore, the yearly summary of the seven scenarios indicates that the RL strategy can cover enough energy demand in scenario 7 and reduce diesel generator consumption significantly. Compared to the other six scenarios, scenario 3 generates more than 150 MWh of renewable energy. However, the total diesel cost in the case of hysteresis controlled EWHs shows that there is a substantial surplus of renewable energy not being utilised due to the limitation of the battery size. The results in Fig. 7 for scenario 3 show the diesel consumption cost for different battery sizes. The larger the battery capacity, the more diesel energy is saved. However, considering the battery cost and service life, and the energy consumption of the entire island, the 2 MWh capacity energy storage is chosen. Fig. 8 shows the energy consumption performance of an EWH based on a typical virtual tariff profile (a) when it is controlled by the proposed RL using the DM strategy (b) and a simple hysteresis thermostat (c). The virtual price in Fig. 8(a) represents three different prices under three different states (renewable energy only, renewable energy and storage energy only, and diesel consumption only) according to the different electricity prices of different utility companies. It can be seen that the temperatures in both strategies are kept within the required temperature range (55 °C to 70 °C). However, Fig. 8(b) shows that the RL based strategy can shift the ON commands to periods when the tariff is low. This means that the RL agent tends to store energy in the water when there is a surplus of energy by keeping the temperature near its maximum. Meanwhile, it keeps the water temperature just above the minimum during a shortage of energy. Furthermore, RL results in less total energy consumption than the hysteresis control approach.
In summary, all the results verify the performance of the proposed DR strategy based on the RL algorithms. It is capable of learning a cost-effective way for EWH management under different conditions, without requiring information about the model in advance.

VI. CONCLUSION
An intelligent distributed real-time DR based on RL has been proposed to manage the demand of 150 EWHs in isolated islands. To overcome the problem of a fixed electricity price, an adaptive virtual tariff that reflects the status of the battery and the diesel generator has been generated and used in the reward function of the RL algorithm. Two methods for generating the virtual tariff have been proposed: the DM strategy and the prediction strategy. Simulation results show that the prediction strategy achieves performance comparable to the DM strategy while making the algorithm less dependent on communication time-delay. The prediction strategy can also be used to encourage customers to arrange the use of other electrical equipment in advance to reduce total energy consumption. The performance of the proposed distributed controllers is assessed by simulation, which shows the ability of RL to learn the optimal control policy. It is shown that employing the proposed RL algorithm results in an average 8.91% (6.675 kW) reduction in the usage of diesel generators for each electric water heater.
However, Q-learning can provide only near-optimal solutions. In future work, we seek to consider state-of-the-art reinforcement learning algorithms, such as deep reinforcement learning and Bayesian reinforcement learning, to generate better exploration strategies and minimize the objective function [46], [47]. In addition, hyper-parameter adjustment has a significant impact on the performance of the RL algorithm, and hence we plan to further optimize the parameters using state-of-the-art algorithms, e.g., Bayesian optimization and Monte Carlo methods [48].

where he developed a number of commercial products that include grid connected inverters, dc/dc converters for electric turbo compounding systems, and sensorless drives for high-speed permanent magnet machines. He is currently an Associate Professor of renewable energy with the University of Exeter, U.K. His research interests include grid-connected inverters, microgrids, smart energy management systems, and control of wave energy converters.

VOLUME 9, 2021