Optimizing the Post-Disaster Control of Islanded Microgrid: A Multi-Agent Deep Reinforcement Learning Approach

Extreme disasters may interrupt the power supply to the distribution system (DS), forcing the DS to operate in island mode as an islanded microgrid (MG). In order to improve the post-disaster resilience of the DS and to provide a longer power supply to as many loads as possible with limited generation resources, this paper proposes a multi-agent deep reinforcement learning (DRL) method that realizes dual control on the source and load sides of the MG. The problem of resilience improvement is converted to a sequential decision making problem whose objective is to maximize the cumulative MG utility value over the power outage duration. A multi-agent DRL model is proposed to solve the sequential decision making problem, and a dual control policy comprising energy storage management and a load shedding strategy is put forward to maximize the utility value of the MG. A reinforcement learning (RL) environment for the islanded MG, based on OpenAI Gym and OpenDSS, is constructed as a simulator; it has a general interface compatible with OpenAI Gym and can be published to it. Numerical simulations are performed for an MG equipped with wind turbines, diesel generators, and storage devices to validate the effectiveness of the proposed method. The influences of the available generation resources and the power outage duration on the control policy are discussed, validating the strong adaptability of the proposed method under different conditions.


I. INTRODUCTION
A. BACKGROUND
In recent years, the frequent occurrence of extreme disasters, such as earthquakes, hurricanes, and floods, has significantly disrupted the normal operation of infrastructure and caused major inconvenience and economic losses to residents through the loss of electricity, water, and communication. A 2012 Congressional Research Service study estimates the inflation-adjusted cost of weather-related outages at 25 to 70 billion dollars annually in the U.S. [1]. The severe power outages caused by these extreme disasters have highlighted the importance and urgency of improving the resilience of the distribution system (DS). Resilience measures the ability of a DS to withstand and recover from extreme disasters [2].

B. RELATED WORKS
In the last decades, various methods have been proposed to enhance the resilience of the power grid. These methods can be divided into two categories based on the timeline: pre-disaster preparation and post-disaster decision making.
From the viewpoint of pre-disaster preparation, some studies focus on the impacts of natural disasters on electric power systems, trying to understand the causes of blackouts and explore ways to prepare and harden the grid [3]-[5]. A resilient defender-attacker-defender game framework is proposed in [6] to coordinate hardening and distributed generation resource allocation with the objective of minimizing system damage from disasters. In [7], weather information is integrated into distribution damage assessment, which helps explain how different weather metrics impact the distribution grid. Other studies, from the perspective of post-disaster decision making, focus on faster restoration of the system. Research on using microgrids (MGs) to restore the DS is reviewed in [5], [8]. Reference [9] proposes a novel distribution system operational approach that forms multiple MGs energized by DGs from the radial distribution system in real-time operations to restore critical loads after a power outage. A hierarchical energy management framework based on multiple MGs is proposed in [10] for resilience enhancement. A cost-effective system-level restoration scheme is presented in [11] to improve power grid resilience. An MG dispatch solution is proposed in [12] for emergency electric service restoration after a disaster. A methodology for MG management and control that maximizes the duration of electricity supply in emergency situations is proposed in [13]. The feasibility of control strategies for operating an MG after it becomes isolated is described in [14]. MGs can enhance post-disaster resilience by improving generation availability (e.g., fuel cells, microturbines, wind turbines, photovoltaic panels) when the utility power of the DS is unavailable [5].
During extreme natural disasters and their aftermath, the generation resources within the DS are limited and hard to replenish due to direct or indirect damage to the power grid and transportation [12]. It is therefore necessary to manage the generation resources within the system appropriately to prevent a complete outage. In addition, load shedding can be adopted, where non-critical loads are shed gradually to maintain continuous power supply to critical loads [2], [13].
However, the uncertainties in renewable energy, load, system energy storage, and power outage duration, together with the complex hybrid control on both the load and source sides, pose many technical challenges. The uncertainties make predicting the future difficult, so decisions must be made from the information available at each moment. The hybrid control on both sides results in a large search space and a high optimization cost, making policy updating more difficult.
To address these challenges, effective methods are required. Classical (convex) optimization is one conventional approach; it requires a specific mathematical model as well as predictions of the future. Robust optimization (RO) and stochastic programming (SP) can handle uncertainty. They typically formulate a multi-stage or multi-layer optimization problem and transform it into a deterministic problem to solve. However, the computation time and model complexity of such methods become enormous in complicated, high-dimensional scenarios, and feasibility cannot be guaranteed. Moreover, the solution obtained by RO or SP is pre-determined: the actual operation plan is implemented according to the pre-computed solution, so real-time control cannot be realized.
In [15], a state-based strategy is proposed, where decisions are made based on the states observed during the unfolding events. Neither RO nor SP is suitable for mapping sequentially varying real-time states to optimal strategies. To overcome this, a Markov decision process (MDP) is employed to make state-based decisions in a stochastic environment caused by weather events; the action is chosen according to the MDP state (i.e., the available information at each decision point). Although dynamic programming can solve the MDP problem, it cannot deal with the uncertainty and suffers from the curse of dimensionality: the state space is huge in high-dimensional problems. Approximate dynamic programming (ADP) or RL can mitigate the curse of dimensionality, and the policy learned by RL can realize real-time control.
Since AlphaGo was proposed in 2016 and defeated a world champion in the game of Go [16], deep reinforcement learning (DRL) has set off a new wave of research. With model-free algorithms and learning from experience, DRL has solved many previously intractable problems, such as robot control [17], autonomous driving [18], and many kinds of game playing [16]. DRL is a powerful tool in scenarios with large uncertainties and can handle the aforementioned challenges effectively. After sufficient learning, a well-trained RL agent obtains a decision policy that enables real-time control. Some researchers have also used RL to control MGs [19] and DSs [20], but the application of DRL to resilience control remains under-researched. Inspired by the successful application of DRL in games, this paper uses DRL to enhance the post-disaster resilience of the DS; the method relies only on current information, without predicting the future.

C. CONTRIBUTIONS
In this paper, we consider the aftermath of a natural disaster in which the power supply of the DS is interrupted. The outage duration of the DS depends on the repair process and the severity of the damage caused by the disaster. Before the repair of the DS is completed, the DS has to supply its loads with its internal resources. To improve the post-disaster resilience of the DS, a longer power supply to as many loads as possible with limited generation resources over the power outage duration is necessary. In this paper, the resilience-enhancing problem is converted to a decision making problem, and a hybrid control comprising energy storage management and a load shedding policy is proposed to make full use of the limited generation resources within the system, thus improving resilience. The major contributions of this paper include:
• A multi-agent DRL model based on MDP is developed for the sequential decision making problem. A dual optimal control policy on the source and load sides is achieved to improve resilience. Test results validate the strong adaptability of the proposed method under various conditions, such as different available generation resources and MG power outage durations.
• An RL environment for islanded MG operation based on OpenAI Gym is constructed and used as the task simulator, providing an easy-to-use RL task interface. The limitations of the generation resources and power flow, as well as the uncertainties within the MG, are all considered in the environment.
The remainder of this paper is organized as follows. Section II formulates the decision making problem for an MG with limited generation resources. Section III develops the MDP and RL models for the sequential decision making problem and proposes a multi-agent DRL control algorithm. Section IV describes the islanded MG model and constructs an RL environment based on OpenAI Gym. In Section V, the numerical results are presented to validate the proposed algorithm. Conclusions are drawn in Section VI.

II. PROBLEM FORMULATION

A. GENERIC MG MODEL
An MG is a small-scale low-voltage DS that comprises controllable loads and several small modular generation and storage systems, and provides electrical and heat supply [combined heat and power (CHP)] to local loads [21]. From the standpoint of energy generation and consumption, the MG can be generalized into four types of devices: intermittent distributed generators (DGs), such as wind turbines and photovoltaic modules; dispatchable DGs, such as fuel and natural gas generators; local loads with different priorities; and electric energy storage devices. It is worth noting that some devices in the MG have varying operating characteristics. Due to the influence of weather, human behaviours, and other factors, the output power of intermittent DGs and the load demand have great uncertainties and vary significantly under different conditions. Continuous operation of dispatchable DGs and storage devices depends on the available generation resources, i.e., the fuel reserve (FR) of the DGs and the state of charge (SOC) of the battery storage devices [2].
1) PROBLEM DESCRIPTION
When a disaster strikes the DS and interrupts the supply from the main power grid, an islanded MG forms and has to use internal generation resources to power its loads. After a period of time T_D, which is typically the power outage duration of the DS, the DS service is restored by utility repair crews and the DS returns to normal conditions. To improve the resilience of the DS, a longer power supply to as many loads as possible with the limited generation resources within time T_D is necessary. The utility value of the system power supply can be used as a measure of post-disaster resilience: a more adequate and reasonable utilization of energy after a disaster results in a more resilient power system. The problem of resilience improvement can thus be converted to increasing the cumulative utility value of the DS over the time period T_D. The purpose of islanded MG control is to find a policy maximizing the cumulative MG utility value over T_D with the limited generation resources.

2) PROBLEM MODELING
The islanded MG control problem over the time period T D is modeled as a sequential decision making problem. Sequential decisions are characterized by a decision-maker choosing among various actions after taking an observation of the system at different points in time, in order to control and optimize the performance of a dynamic stochastic system [22].
In this paper, the decision at each point in time is the dispatchable DG output control on the source side and the load shedding action on the load side. The objective is to maximize the cumulative MG utility value over the time period T_D and can be written as

\max_{\pi} \int_{0}^{T_D} R^{\pi}(t)\,dt \qquad (1)

where π represents the decision policy. T_D is the time period for MG control; it is usually the power outage time of the DS, but it can also be selected by the MG operator. After T_D, the DS is restored or supplemental generation resources become available. R^π(t) is the utility value function of the MG under policy π. R(t) can be measured by the load supply income and the planned and unplanned outage losses. At time instant t it can be calculated as

R(t) = \sum_{k=1}^{N_L} p_{L_k}(t)\, b_{L_k}\big(o_{L_k}(t)\big) \qquad (2)

b_{L_k}\big(o_{L_k}(t)\big) = i_{L_k} I(s_{L_k} = n) - c^{p}_{L_k} I(s_{L_k} = p) - c^{u}_{L_k} I(s_{L_k} = u)

where p_{L_k}(t) denotes the active power of load L_k at t; N_L is the number of total loads; b_{L_k} reflects the supply income and outage cost of L_k in $/kWh; i_{L_k}, c^p_{L_k}, and c^u_{L_k} are the supply income, planned outage cost, and unplanned outage cost of L_k, respectively; o_{L_k} is the operation state vector of load L_k; s_{L_k} represents the power supply status of L_k, which includes normal operation n, planned outage p, and unplanned outage u; and I(x) is an indicator function that equals 1 if x is true and 0 otherwise.
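The per-instant utility value R(t) can be sketched in a few lines of Python. This is an illustrative implementation of the weighted sum over loads described above; the dictionary field names and the sample data are placeholders, not values from the paper.

```python
# Sketch of the MG utility value: sum of p_Lk * b_Lk over all loads, where
# b_Lk is the supply income for served loads and a (negative) outage cost otherwise.
NORMAL, PLANNED, UNPLANNED = "n", "p", "u"

def utility_value(loads):
    """Return R(t) for a list of load dictionaries (field names are illustrative)."""
    total = 0.0
    for load in loads:
        p = load["p_kw"]  # active power of the load at time t (kW)
        if load["status"] == NORMAL:
            total += p * load["income"]          # supply income i_Lk ($/kWh)
        elif load["status"] == PLANNED:
            total -= p * load["planned_cost"]    # planned outage cost c^p_Lk
        else:
            total -= p * load["unplanned_cost"]  # unplanned outage cost c^u_Lk
    return total

loads = [
    {"p_kw": 100.0, "status": NORMAL,    "income": 1.0, "planned_cost": 2.0, "unplanned_cost": 5.0},
    {"p_kw": 50.0,  "status": PLANNED,   "income": 1.0, "planned_cost": 2.0, "unplanned_cost": 5.0},
    {"p_kw": 20.0,  "status": UNPLANNED, "income": 1.0, "planned_cost": 2.0, "unplanned_cost": 5.0},
]
print(utility_value(loads))  # 100*1 - 50*2 - 20*5 = -100.0
```

Unplanned outages are penalized more heavily than planned ones, which is what pushes a learned policy toward shedding low-priority loads deliberately rather than letting the system collapse.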
The time period T_D can also be discretized into N decision stages, and the objective in (1) is then expressed as

\max_{\pi} \sum_{n=1}^{N} R^{\pi}(n) \qquad (3)

where R^π(n) is the utility value at each decision point.

3) CONSTRAINTS
During the operation of the microgrid, constraints including power flow and resource limits should be satisfied:

P_i(t) = \mathrm{Re}\Big[ V_i(t) \sum_{j \in B} Y_{ij}^{*} V_j^{*}(t) \Big], \quad \forall i \in B \qquad (4)

Q_i(t) = \mathrm{Im}\Big[ V_i(t) \sum_{j \in B} Y_{ij}^{*} V_j^{*}(t) \Big], \quad \forall i \in B \qquad (5)

V_i^{\min} \le V_i(t) \le V_i^{\max}, \quad \forall i \in B \qquad (6)

I_l(t) \le I_l^{\max}, \quad \forall l \in L \qquad (7)

S_l(t) \le S_l^{\max}, \quad \forall l \in L \qquad (8)

P_g^{\min} \le P_g(t) \le P_g^{\max}, \quad \forall g \in G \qquad (9)

Q_g^{\min} \le Q_g(t) \le Q_g^{\max}, \quad \forall g \in G \qquad (10)

0 \le E_g(t) \le E_g^{M}, \quad \forall g \in G \qquad (11)

where B, L, G are the sets of buses, lines, and DGs in the MG, respectively; P_i(t) and Q_i(t) are the injected active and reactive power of bus i at time t; V_i(t), V_i^min, and V_i^max are the voltage of bus i and its lower and upper limits; Y_ij is the admittance; I_l(t) and S_l(t) are the current and apparent power of line l at time t, and I_l^max and S_l^max are their upper limits; P_g(t) and Q_g(t) are the active and reactive power of DG g at time t, bounded by their respective limits; E_g(t) is the available generation resource of DG g at time t, and E_g^M is the maximum possible value of E_g. In this paper, E_g represents the fuel reserve of the DGs or the SOC of the battery devices, measured as equivalent electric energy in kWh. The energy conversion efficiency is taken into account in the computation of E_g. In this paper, the MG is relatively small in scale. Micro gas turbines and battery energy storage, whose outputs can change rapidly, are commonly used for power supply so that the MG can remain stable under fast changes in load demand. Therefore, the ramp constraints of the DGs are ignored in this paper.
It is challenging to solve the problem in (1)-(11) using traditional optimization methods. Inspired by the successful application of RL in games, learning-based methods can be adopted to handle this problem. By establishing a decision making model, designing a learning algorithm, building a learning environment, and verifying the effectiveness, the solution can be derived effectively. These steps are described in detail in the following sections.

III. MULTI-AGENT DRL MODEL OF POST-DISASTER CONTROL OF MG
In this section, an RL model based on MDP is developed for the sequential decision making problem. Basic information about MDPs and RL is given in Section III-A. The corresponding MDP and RL models are established in Section III-B. A multi-agent DRL model for this decision making problem is proposed in Section III-C.

A. MDP MODEL FOR THE SEQUENTIAL DECISION MAKING PROBLEM
An MDP is a discrete-time stochastic control process that provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP usually comprises: a state space S; an action space A; an initial state distribution with density p_0(s_0); a stationary transition dynamics distribution with conditional density p(s_{t+1}|s_t, a_t) satisfying the Markov property p(s_{t+1}|s_0, a_0, s_1, a_1, ..., s_t, a_t) = p(s_{t+1}|s_t, a_t) for any trajectory s_0, a_0, ..., s_T, a_T in the state-action space, where s_t ∈ S and a_t ∈ A; and a reward function r : S × A → R.
In this MG sequential decision making problem, the MDP elements are designed as follows. The state should include the information required to make appropriate decisions: the power flow, the system's remaining available generation resources, and the remaining power outage time,

s_t = (P_L, Q_L, P_G, Q_G, V_M, E_G, t_r) \qquad (12)

where P_L, Q_L, P_G, Q_G denote the active and reactive power of the load demand and the DGs, respectively; V_M denotes the voltage magnitudes of the system; E_G is the remaining available generation resources of the DGs; and t_r ∈ [0, T_D] is the remaining power outage time. The action comprises the dispatchable DG output control and the load shedding action,

a_t = (P_G, L_S) \qquad (13)

where L_S denotes the load shedding action. The reward should be consistent with the system objective, so the utility value of the MG is used as the reward:

r(t) = R^{\pi}(t) \qquad (14)
In an MDP, if the transition probabilities or rewards are unknown, the problem becomes one of RL [23]; the transition probabilities can then be accessed through a simulator, which is typically restarted many times from a uniformly random initial state. In this MDP, the transition probabilities cannot be obtained because of the uncertainties, so RL is used to solve the problem.

B. RL MODEL
We study RL and control problems in which an agent interacts with an environment by sequentially choosing actions over a sequence of time steps, in order to maximize a cumulative reward. A policy in RL is used to select actions in the MDP. The agent interacts with the environment using its policy and generates a trajectory of states, actions, and rewards s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T over S × A × R. The return G_t is the total discounted reward from time step t onwards,

G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k \qquad (15)

where γ is the discount factor indicating how much the next step affects the current step, 0 < γ < 1. The value function V^π(s) and the action-value function Q^π(s, a) are defined as the expected return,

V^{\pi}(s) = \mathbb{E}[G_t \mid s_t = s], \qquad Q^{\pi}(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a] \qquad (16)

where the initial state s_0 is s and the initial action a_0 is a. The agent's goal is to obtain a policy that maximizes the cumulative discounted reward from the initial state. The ordering of policies is defined as π' ≥ π ⇔ V^{π'}(s) ≥ V^{π}(s), ∀s ∈ S, and the optimal action-value function is Q^*(s, a) = max_π Q^π(s, a). Q-learning is a classical RL algorithm in which a Q-value Q(s, a) is stored for each state s and action a, and updated via the Bellman equation,

Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big] \qquad (17)

where α is the learning rate, s' is the resulting state after taking action a in state s, and a' is an action that can be selected in state s'. In state s, the action-choosing policy of Q-learning selects the action a that maximizes Q(s, a) according to the ε-greedy algorithm.
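The tabular Q-learning update in (17) can be sketched as follows. The toy chain MDP (move right to reach a goal) is purely illustrative and has nothing to do with the MG model; it only exercises the ε-greedy selection and the Bellman update.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning with epsilon-greedy action selection.
# Toy dynamics: action 1 moves one state to the right, action 0 stays put;
# a reward of 1 is received only on reaching the terminal state.
def q_learning(n_states=5, n_actions=2, episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            # epsilon-greedy policy: explore with probability eps, else act greedily
            if rng.random() < eps:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[(s, a_)])
            s_next = s + 1 if a == 1 else s
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Bellman update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = max(Q[(s_next, a_)] for a_ in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Q = q_learning()
# After training, moving right dominates staying in every non-terminal state.
print(all(Q[(s, 1)] > Q[(s, 0)] for s in range(4)))
```

This table-based form is exactly what stops scaling once the state is a full power-flow snapshot, which motivates the function approximators (DQN, DDPG) discussed next.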
When the state space is large, Q-learning becomes impractical, since a very large Q-table is needed to store the Q-value of every state-action pair. A deep Q-network (DQN) [24], [25] uses a neural network function approximator with weights θ as a Q-network to fit the Q-value function. The Q-network can be trained by minimizing a loss function L_k(θ_k) that changes at each iteration k,

L_k(\theta_k) = \mathbb{E}_{s,a}\big[ (y_k - Q(s, a; \theta_k))^2 \big] \qquad (18)

where y_k = \mathbb{E}_{s'}[r + \gamma \max_{a'} Q(s', a'; \theta_{k-1}) \mid s, a] is the target for iteration k. Differentiating the loss function with respect to the weights yields the gradient

\nabla_{\theta_k} L_k(\theta_k) = \mathbb{E}_{s,a,s'}\Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta_{k-1}) - Q(s, a; \theta_k) \big) \nabla_{\theta_k} Q(s, a; \theta_k) \Big] \qquad (19)

DQN successfully solves problems with high-dimensional observation spaces. Nevertheless, the curse of dimensionality remains serious when the action space is high-dimensional or continuous. Deep Deterministic Policy Gradient (DDPG) [26] was proposed to solve the continuous control problem; it combines the actor-critic approach with DQN and the Deterministic Policy Gradient (DPG) algorithm [27]. Compared with DQN, DDPG adds an actor network that generates a specific action from the current state, a_t = µ_θ(s_t). The actor is updated by applying the chain rule to the expected return from the initial state with respect to the actor parameters,

\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s \sim \rho^{\beta}}\Big[ \nabla_{a} Q(s, a; \theta^{Q}) \big|_{a=\mu(s)} \nabla_{\theta^{\mu}} \mu(s; \theta^{\mu}) \Big] \qquad (20)

where J is the expected return G_0 from the initial state, ρ^β is the state distribution under the behavior (exploration) policy, and θ^Q and θ^µ are the parameters of the critic network and actor network, respectively. Details about DDPG are available in [26].
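The DQN target y_k = r + γ max_a' Q(s', a'; θ_{k-1}) can be sketched without a real network by substituting a precomputed matrix of target-network outputs. This is an illustrative NumPy fragment, not the paper's implementation; the terminal-state masking (`done`) is a standard detail we assume here.

```python
import numpy as np

# Batched DQN targets: y = r + gamma * max_a' Q(s', a'), with no bootstrap
# at terminal states. q_next stands in for the target network's output.
def dqn_targets(rewards, q_next, gamma=0.99, done=None):
    """rewards: (batch,); q_next: (batch, n_actions); done: (batch,) booleans."""
    y = rewards + gamma * q_next.max(axis=1)
    if done is not None:
        y = np.where(done, rewards, y)  # terminal transitions keep only the reward
    return y

rewards = np.array([1.0, 0.0, -1.0])
q_next = np.array([[0.5, 1.0],
                   [2.0, 0.0],
                   [0.0, 0.0]])
done = np.array([False, False, True])
y = dqn_targets(rewards, q_next, done=done)
print(y)  # ≈ [1.99, 1.98, -1.0]
```

In training, the Q-network weights would then be moved along the gradient in (19) to reduce the squared error between Q(s, a; θ_k) and these targets.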

C. MULTI-AGENT DRL MODEL
The islanded MG control is a control problem with a discrete-continuous hybrid action space. On the source side, the DG output control is continuous; on the load side, the load shedding control is discrete. If only one agent were used for such a hybrid action space, the policy update would be extremely difficult. P-DQN was proposed in [28] to cope with discrete-continuous hybrid action control in the game King of Glory, but its continuous action is a low-level parameter associated with the high-level discrete action, and the discrete and continuous actions must be executed simultaneously. In our decision making problem, the actions on the load and source sides are at the same level and can be executed at different times, so P-DQN is not directly applicable.
We propose a double-agent DRL model with a load agent and a DGs agent on the load and source sides, respectively. The two agents control the MG together to maximize the utility value. The design of the double-agent DRL model is shown in FIGURE 1. The load agent and the DGs agent are implemented with DQN and DDPG, respectively. The two agents interact with the environment independently, without communicating with each other. They receive the same state from the environment, execute their own actions to alter the state of the environment, and receive their own rewards according to reward shaping. In this mechanism, however, each agent effectively becomes part of the other agent's environment, so the two agents interact with and influence each other.
Both agents receive the same state s_t specified in Section III-A; they can also perform feature selection considering the difference in their tasks. The action of the load agent a^L_t, a discrete value such as a^L_t = 0, 1, 2, ..., is the load shedding action L_S specified in Section III-A, where each discrete value represents shedding loads of specific priorities in the MG. The action of the DGs agent a^G_t is the output of the controllable DGs. The merged action a_t = (a^G_t, a^L_t) is executed on the islanded MG.
Another important task is designing the rewards of the two agents. Because their tasks differ, reward shaping is used to redesign the rewards for better learning performance. Note that the load agent decides how much power the system needs, and the DGs agent decides how to provide it. Since the load agent has a direct influence on the utility value of the MG, it is designed as a far-sighted agent with a large γ of 0.99, whose goal is to maximize the utility value of the MG over T_D; its reward is designed to be exactly the MG utility value at t. On the source side, however, an explicit expression of the control objective is hard to obtain, although we know that the regulation of generation resources truly influences the power supply duration of the MG, and thus the resilience. A simple idea is to give both agents the same reward, but in our experiments the convergence of this design was so poor that a new reward design was needed. Inspired by the idea of optimal power flow (OPF), we introduce a factor f_G to measure the resilience potential of each generator. Considering that the key to improving resilience is to use limited resources to provide more durable power for more loads, a generator with greater and more durable power supply potential is regarded as having higher resilience potential. On the source side, improving resilience means minimizing the "resilience potential cost", just as OPF minimizes the economic cost. The reward of the DGs agent and the resilience potential are designed such that r^G_t is the reward for the DGs agent; f_{G_i} is the resilience potential factor of DG G_i, which depends on the DG capacity S_{G_i} and the remaining resources E_{G_i}(t); σ is the penalty factor; and H(p_G(t)) represents the violation of the constraints described in (7)-(10), which is added to the reward r^G_t as a penalty.
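The DGs-agent reward can be sketched as follows. The paper does not spell out the exact functional form of f_G here, so this fragment assumes a simple inverse relation to capacity and remaining resources (a high-capacity, well-stocked generator is "cheaper" to dispatch); the function name, the quadratic-free penalty term, and all numbers are illustrative.

```python
# Illustrative DGs-agent reward: negative "resilience potential cost" minus a
# constraint-violation penalty. The factor f is an ASSUMED form, not the paper's.
def dgs_reward(p_out, capacities, energies, sigma=10.0, violations=0.0):
    """p_out: DG outputs at t (kW); capacities: S_Gi (kW); energies: E_Gi(t) (kWh);
    violations: aggregated constraint violation H(p_G(t))."""
    cost = 0.0
    for p, s, e in zip(p_out, capacities, energies):
        f = 1.0 / (s * max(e, 1e-6))  # assumed factor: more capacity/energy -> lower cost
        cost += f * p
    return -cost - sigma * violations

# Drawing power from the larger, fuller generator yields a higher (less negative)
# reward than drawing the same power from the small, depleted one.
r_big = dgs_reward([100.0, 0.0], [500.0, 100.0], [800.0, 50.0])
r_small = dgs_reward([0.0, 100.0], [500.0, 100.0], [800.0, 50.0])
print(r_big > r_small)  # True
```

Whatever concrete form f_G takes, the design intent is the same: steer dispatch toward generators whose remaining resources can sustain supply longest, while σ·H keeps the outputs inside the feasible region of (7)-(10).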
The DGs agent is designed as a short-sighted agent with a small γ of 0.01, whose goal is to optimize the resilience potential cost at time t. In this way, in the long run, the load shedding strategy ensures continuous power supply for as many loads as possible, while in the short term the generator output strategy guarantees the optimal resilience potential cost.
Under this mechanism the two agents may interact with each other, and if one agent performs poorly, the other will be affected. For example, if the load agent sheds all loads all the time, the DGs agent cannot work at all. Our solution is to train the two agents alternately until they work well together: while one agent is learning, the other stays fixed, and then the roles are exchanged. Because the effect of the load agent on the utility value of the system is more direct, the load agent is trained first. The full procedure of the double-agent DRL is shown in Algorithm 1. The DGs agent and the load agent are initialized with DDPG and DQN, respectively. The training scenarios are generated by sampling the prior probability distributions described in the Appendix. At each decision point of an episode, the action is chosen with the ε-greedy method. After executing the action, the experience is stored in the replay buffer. The two agents are trained alternately, and the neural networks are updated based on the gradient information. The policy obtained after training is evaluated on the test scenario. In our design, the two agents do not communicate with each other; introducing an appropriate communication mechanism to improve control performance is left as future work.
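The alternating schedule of Algorithm 1, where flag = int(episode / N_TI) % 2 decides which agent is trainable while the other stays frozen, can be sketched as below. The agents and environment are placeholders; for a clean block pattern this sketch indexes episodes from 0, whereas Algorithm 1 counts from 1.

```python
# Sketch of the alternating training schedule: every N_TI episodes the trainable
# agent flips between the load agent (trained first) and the DGs agent.
def alternating_schedule(n_episodes, n_ti):
    """Yield (episode, trainable) pairs, trainable in {'load', 'dgs'}."""
    for episode in range(n_episodes):
        flag = (episode // n_ti) % 2  # mirrors flag = int(episode / N_TI) % 2
        yield episode, "load" if flag == 0 else "dgs"

schedule = [who for _, who in alternating_schedule(8, 2)]
print(schedule)  # ['load', 'load', 'dgs', 'dgs', 'load', 'load', 'dgs', 'dgs']
```

In the full algorithm, only the trainable agent samples its replay buffer and applies its gradient update in each episode; both agents still act every step, so the frozen agent keeps shaping the trainable agent's effective environment.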

IV. RL ENVIRONMENT OF ISLANDED MG
There are mainly two objects in RL: the environment and the agent. The environment is the object that the agent interacts with and tries to learn about. Sometimes there are so many uncertainties in the environment that all the agent can do is interact with it constantly, acquiring experience to improve itself. Furthermore, the environment also provides the agent with a platform for training and testing. In this paper, the environment and the agent are exactly the MG and the MG operator, respectively.

Algorithm 1 Double-Agent Deep Reinforcement Learning Using DQN and DDPG
For the DGs agent:
    Randomly initialize the critic network Q_G(s, a_G; θ^{Q_G}) and the actor µ(s; θ^µ) with weights θ^{Q_G} and θ^µ
    Initialize the target networks Q'_G and µ' with weights θ^{Q'_G} ← θ^{Q_G} and θ^{µ'} ← θ^µ
    Initialize the replay buffer R_G
For the load agent:
    Randomly initialize the Q-network Q_L(s, a_L; θ^{Q_L}) with weights θ^{Q_L}
    Initialize the target network Q'_L with weights θ^{Q'_L} ← θ^{Q_L}
    Initialize the replay buffer R_L
Initialize N_TI and M for the training iterations
for episode = 1, M do
    flag = int(episode / N_TI) % 2
    Initialize a random process N for action exploration
    Receive the initial state s
    for t = 1, T_D do
        With probability ε select a random action a_t
        With probability 1 − ε select the action a_t = [a^G_t, a^L_t]:
            load action a^L_t = argmax_{a_L} Q_L(s_t, a_L; θ^{Q_L})
            DGs action a^G_t = µ(s_t; θ^µ) + N_t according to the current policy and exploration noise
        Execute action a_t, observe r_t, s_{t+1}, and store the transitions in R_L and R_G
        if flag == 0 then (train the load agent)
            Sample a random minibatch of N transitions (s_i, a^L_i, r^L_i, s_{i+1}) from R_L
            Set y^L_i = r^L_i + γ_L max_{a'_L} Q'_L(s_{i+1}, a'_L; θ^{Q'_L})
            Update Q_L by minimizing the DQN loss
        else (train the DGs agent)
            Sample a random minibatch of N transitions (s_i, a^G_i, r^G_i, s_{i+1}) from R_G
            Set y^G_i = r^G_i + γ_G Q'_G(s_{i+1}, µ'(s_{i+1}; θ^{µ'}); θ^{Q'_G})
            Update the critic by minimizing the loss
            Update the actor policy using the sampled policy gradient
        end if
    end for
end for
In this section, we introduce how the RL environment is established. The environment is essentially a game of islanded MG control, in which the agent needs to learn to control the grid to keep it stable for a period of time, and the quality of the operation is reflected in the game score. First, the uncertainties of the components within the MG, such as DGs, loads, wind turbines, and batteries, are described. Then an RL environment for islanded MG operation is constructed based on OpenAI Gym [29] and OpenDSS [30], with a general interface compatible with OpenAI Gym, and the mechanism by which uncertainty is reflected in the environment is introduced.

A. UNCERTAINTIES WITHIN THE MG
In the real world, the environment is unknown to the agent and full of uncertainty; the agent's task is to learn to understand it. The uncertainties within the MG mainly come from renewable energy, load demand [31], system energy storage [2], and the power outage duration [3] of the DS.
Renewable energy in an MG mainly comes from wind turbines and solar panels, whose power outputs are intermittent and uncertain because of weather and other factors. Beta distributions are often used to describe their output uncertainties [31]-[33], while the uncertainties of load demand can be described by a normal distribution [31]. The system energy storage and the power outage duration T_D usually differ between circumstances. For example, when the DS loses power, different resource allocation schemes may cause the system energy storage to vary greatly. Moreover, many factors involving power system characteristics, geographic characteristics, climatic variables, and the repair process influence the power outage duration T_D. Both the system energy storage and the power outage duration T_D are important to the decision-maker controlling the MG. In this paper, in order to simulate scenarios with different system energy storage and power outage durations, we use uniform distributions to describe their uncertainties. The detailed uncertainty models of these elements are given in the Appendices.
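Scenario sampling from these distributions can be sketched as follows. The distribution parameters here are illustrative placeholders, not the values from the paper's Appendices.

```python
import numpy as np

# Sample one training scenario: Beta-distributed per-unit wind output,
# normally distributed per-unit load, and uniformly distributed initial
# storage level and outage duration. Parameters are illustrative only.
rng = np.random.default_rng(42)

def sample_scenario(n_steps=24):
    wind = rng.beta(2.0, 5.0, size=n_steps)              # per-unit WT output
    load = rng.normal(loc=1.0, scale=0.1, size=n_steps)  # per-unit load demand
    e0 = rng.uniform(0.2, 1.0)                           # initial storage SOC
    t_d = rng.uniform(12.0, 48.0)                        # outage duration T_D (h)
    return wind, load, e0, t_d

wind, load, e0, t_d = sample_scenario()
print(wind.shape, 0.0 <= wind.min() and wind.max() <= 1.0)
```

Drawing a fresh (T_D, E_0, wind, load) tuple at the start of every episode is what exposes the agents to the full range of conditions rather than a single fixed outage.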

B. ISLANDED MG RL ENVIRONMENT BASED ON OpenAI AND OpenDSS
In order to provide a platform for training and testing the DRL algorithms, an islanded MG control platform based on OpenAI Gym and OpenDSS is constructed; it is available to researchers in related fields. OpenAI Gym is a toolkit for developing and comparing RL algorithms that defines the interface standard between agent and environment. Nonlinear power flow equations are used for power flow analysis, and OpenDSS, a comprehensive electric power system simulation tool primarily for electric utility DSs, serves as the power flow simulation tool in this environment design. A large amount of simulation data for RL training is readily obtained by running power flow simulations with OpenDSS.
In RL, the agent obtains observations or state information from the environment, takes an action that is executed on the environment, and receives a reward used to improve its policy. When building an RL environment, the design of the state, action, and reward must be considered. In this MG control problem, they are designed as in (12), (13), and (14), respectively.
The design mechanism of this MG control RL environment is shown in FIGURE 2, which describes the interaction between agent and environment and the flow of the environment design in one episode. In each episode, the agent completes a full game until the simulation time reaches the power outage duration T_D, at which point the game of MG control terminates. Then a new game restarts, and the agent learns over a large number of games.
In one episode, the flow of the environment design is as follows:
(1) Start. A new episode starts.
(2) Parameter initialization. The MG case is initialized, along with related parameters including the power outage duration T_D and the available generation resources, which are obtained by sampling their probability distribution models described in Section III-A. In different episodes, T_D and the available generation resources at t = 0 may differ. At the beginning of each episode, the probability models of T_D and the available generation resources are sampled to generate their specific values, as shown by the blue boxes and arrows in the left part of FIGURE 2.
(3) State reset. The environment gives the reset state to the agent. This is the initial state from which the agent begins to control the MG.
(4) Receive and execute action. The environment receives an action from the agent. The action is executed on the environment, and OpenDSS is used as the simulation tool to calculate the power flow.

V. SIMULATION RESULTS
In this section, we validate our method through several case studies. OpenDSS is used as the power flow simulation tool, and Python 3.6.5 is used to implement the RL agent. The calculations are carried out on a desktop PC with a 2.60 GHz CPU (Intel(R) Xeon(R) E5-2670 0) and 64 GB RAM.

A. CASE INFORMATION
The MG used to validate the proposed method is shown in FIGURE 3. It includes 7 buses (B1, B2, . . . , B7, excluding the point of common coupling PCC), 1 transformer Tr, 2 battery storage systems BT1 and BT2, 1 diesel generator DG1, 2 wind turbines WT1 and WT2, and 4 loads L1, L2, L3, and L4 with various priorities. The specific parameters of the MG are given in Table 1.
The energy sources in this case include 2 battery storage systems, 1 diesel generator, and 2 wind turbines. BT1, with a capacity of 500 kVA and a rated power of 400 kW, serves as the master source in the MG and is used to balance the system power flow; thus, bus B2 is the slack bus. Both BT2 and DG1 have a rated generation power of 50 kW and are treated as adjustable DGs. The wind turbines are considered non-adjustable and are modeled as negative loads. The loads are classified into 3 subsets according to their priorities: primary load L1, secondary loads L2 and L3, and tertiary load L4. Their supply incomes and planned and unplanned outage losses are defined in Table 2. Correspondingly, the load shedding action L_S is defined as shedding loads of different priorities, as shown in Table 3. The load shedding strategy sheds less critical loads gradually to ensure a continuous power supply to the more critical loads; as L_S increases from 0 to 3, more critical loads are shed.
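One plausible encoding of the priority-based shedding actions is sketched below. The exact mapping of L_S to shed loads is given in Table 3 of the paper, so the sets here are illustrative assumptions that are merely consistent with the description that higher L_S sheds more critical loads:

```python
# Hypothetical mapping of load shedding action L_S to the set of shed
# loads; the actual mapping is defined in Table 3.
SHED_SETS = {
    0: set(),                        # shed nothing
    1: {"L4"},                       # shed the tertiary load
    2: {"L2", "L3", "L4"},           # also shed the secondary loads
    3: {"L1", "L2", "L3", "L4"},     # shed all loads
}

def served_loads(l_s):
    """Loads still supplied under shedding action l_s."""
    all_loads = {"L1", "L2", "L3", "L4"}
    return all_loads - SHED_SETS[l_s]
```

An encoding of this form makes the discrete action space of the load agent a single integer in {0, 1, 2, 3}.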

B. BASE CASE STUDY
The outage duration of a DS depends on the repair process and the severity of the damage caused by the disaster. The blackout in the United States and Canada on August 14, 2003 [34] lasted 29 hours, and India's blackout in 2012 [35] lasted nearly 2 days. In order to simulate the operation of the DS in extreme situations, the outage duration of the DS is chosen to be 30-40 hours in this paper. Assume that the MG is disconnected from the DS with full generation resources available; the time period T_D is set to 35 hours, and the initial available generation resources are assumed to be 100%. Each training and test scenario is generated according to the probability distribution models described in Section IV-A, so the agent encounters a brand-new scenario during the test.
In order to demonstrate the effectiveness of the proposed method, several strategies are used for comparison. On the load side, constant load shedding strategies that always use actions L_S = 0, L_S = 1, and L_S = 2 are denoted as Strategy 0, Strategy 1, and Strategy 2, respectively. The control policies with only the load agent and with double agents are denoted as Strategy 3 and Strategy 4. On the source side, a manual adjustment (MA) method is used for comparison: in each regulation, the controllable DGs are used in ascending order of the resilience potential factor described in (23), i.e., the DGs with a low resilience potential factor are used first. The DG outputs of all strategies except Strategy 4 are controlled with the manual adjustment method. The test results are shown in FIGURE 4 and FIGURE 5.
The difference between Strategy 3 and Strategy 4 lies in the DG control policy, which in turn causes the difference in the load shedding actions. As shown in FIGURE 8, the two policies differ in the output of DG1. Under manual adjustment, the output of DG1 is more conservative. The DG control policy learned by RL is bolder: it not only meets the load demand but even charges the master battery at some time instants, as shown in FIGURE 6. This ensures that the master battery retains sufficient resources; for this reason, Strategy 4 does not shed all loads in the later stages, as shown in FIGURE 5.
FIGURE 9 shows the outputs of the 3 adjustable devices and the load demand (including the negative loads, i.e., the wind turbines) in one test scenario under the double-RL-agent strategy. As system resources are consumed, more loads are gradually shed. As seen in FIGURE 9, the load demand becomes relatively smaller as time increases. The overall power supply is slightly larger than the load demand because of the power losses in the transmission lines. BT1 serves as the master source and is used to balance the system power flow: it discharges when the load is heavy and charges when most loads are shed, to absorb the outputs of the wind turbines.
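The manual adjustment baseline's dispatch order can be sketched as a one-line sort. The resilience potential factors used in the usage example are hypothetical values, since (23) is defined earlier in the paper:

```python
def manual_adjustment_order(factors):
    """Return controllable DG names sorted by their resilience potential
    factor (eq. (23)), lowest first, as the manual adjustment baseline
    dispatches them. `factors` maps a DG name to its (example) factor."""
    return sorted(factors, key=factors.get)
```

For instance, with hypothetical factors {"DG1": 0.8, "BT2": 0.3}, the baseline would dispatch BT2 before DG1.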
We sample the probability distributions described in the Appendices and randomly generate 10 test scenarios. Assume that the MG is disconnected from the DS with full generation resources available and that the power outage duration is again set to 35 hours. The utility values of the 5 strategies under different scenarios are shown in FIGURE 10. Among these 10 test scenarios, the utility value of Strategy 4 (double agents) remains the highest. The utility values of Strategy 3 (only the load agent) are close to those of Strategy 4 in certain scenarios and remain the second highest in most scenarios. The utility values of all strategies fluctuate across scenarios because of the uncertainties of the loads and wind turbines. The results in different test scenarios demonstrate that the proposed method copes well with the uncertainty in the system.

C. CASE STUDY UNDER VARIOUS CONDITIONS
For a given MG, the initial available generation resources and the power outage duration T D for MG control will influence the control policy. In this section, the proposed method is tested under variations of these factors. The results demonstrate that the proposed method is able to adapt to various conditions to make full use of the limited resources and maximize the utility value of the MG.

1) INITIAL AVAILABLE GENERATION RESOURCES
When the MG is disconnected from the DS, the initial available generation resources are assumed to be 80%, 85%, 90%, 95%, and 100%, respectively. The duration T D for MG control remains constant.
The MG utility values with double agents under various initial available generation resources are shown in FIGURE 11, and the load shedding actions are shown in FIGURE 12. FIGURE 12 shows that as the initial available generation resources decrease, the agent sheds loads of the same level earlier, and the larger load shedding actions account for a greater proportion. Table 4 shows the utility values of the different strategies at T_D = 35 h; the utility value of the double-RL-agent control policy remains the largest as E_G(0) varies. This demonstrates the capability of the proposed method to adapt to various generation resource conditions.

2) POWER OUTAGE DURATION T_D FOR MG CONTROL
In this case, the power outage duration T_D for MG control is varied from 30 h to 40 h in intervals of 2 h, while the initial available generation resources remain constant.
The MG utility values of the different strategies with different T_D are shown in Table 5, which shows that the MG utility value of the double agents (Strategy 4) is the largest for every T_D. Table 6 shows the proportion of each load shedding action for each T_D: more loads are shed as T_D increases, indicating that more generation resources are used to supply the high-priority loads.
The agent can automatically adjust the load shedding strategy according to the length of the control period. The results indicate the capability of the proposed method to adapt to various control time periods.

D. PERFORMANCE EVALUATION
RL is a kind of method that makes decisions based on the information of the current state; it follows the Markov property. It does not require future information and takes an action according to the current state alone. In contrast, many previous state-of-the-art methods, such as classic optimization, robust optimization, and dynamic programming, require predictions of the future to make decisions. Moreover, in high-dimensional problems, their computational costs are huge.
Although it makes no explicit prediction of the future, RL can learn future trends from the training process on historical data and make decisions accordingly. In this paper, several constant load shedding strategies are used as comparisons, and among these strategies the performance of the double-agent RL is the best. The training process takes about 10 hours on a desktop PC with a 2.60 GHz CPU (Intel(R) Xeon(R) E5-2670 0) and 64 GB RAM to achieve the aforementioned training effect. After training, the well-learned RL agent can derive the solution for one test scenario within one second. The training speed can be further improved by using GPUs and parallel simulation techniques.
Theoretically, methods that utilize future information can achieve better performance than RL because they exploit more information; therefore, this paper does not take such methods as comparisons. In [36], an RL method achieves an effect close to an optimization benchmark with a perfect forecast. In [19], the proposed cooperative RL algorithm outperforms a scenario-based algorithm. In our latest research, a well-learned RL agent without any prediction of the future achieves a performance close to that of state-of-the-art dynamic programming with perfect prediction, but with less computation time: the dynamic programming takes about 2 minutes, while RL yields close results within one second. The authors believe that how to utilize future information in RL is also a problem worthy of study.

VI. CONCLUSION
In this paper, a resilience enhancement problem is converted to a sequential decision making problem. A multi-agent DRL model is proposed to control an islanded MG with limited generation resources. An RL environment for islanded MG operation based on OpenAI and OpenDSS is constructed, which has a general interface compatible with, and can be published to, OpenAI Gym. The proposed RL policy is applied to an MG with wind turbines, diesel generators, and storage devices. It realizes a dual control: energy storage management on the source side and load shedding policies on the load side. The policy maximizes the utility value of the MG over a limited time period, thus improving resilience. Test results demonstrate its effectiveness under various conditions, such as different available generation resources and MG control time periods.
Our future work will use a larger-scale case to validate the scalability of the proposed method, consider parallel simulation to accelerate the agent training process, and introduce an appropriate communication mechanism between agents to improve the control performance.

APPENDICES
The uncertainty of renewable energy can be described by a beta distribution with shape parameters α and β, which models the occurrence of real power values x when a certain value p has been forecasted. Given the normalized predicted power output p and the variance σ^2 of the beta distribution, α and β can be calculated by matching the first two moments:
α = p ( p(1 − p)/σ^2 − 1 ), β = (1 − p) ( p(1 − p)/σ^2 − 1 ).
Generally speaking, there is a positive correlation between the forecast error and the predicted power output; a linear fit for the standard deviation as a function of the predicted power, proposed in [33], is
σ_W = 0.249p + 0.035. (29)
With the predicted DG outputs and the formulas above, the parameters of the beta distributions for the current prediction data can be calculated. Meanwhile, the uncertainties of the load demands can be described by a normal distribution [31], and a linear fit for the standard deviation σ_L of the load as a function of the predicted power p can be obtained similarly.
The fluctuations of the load demand profiles and the wind turbine outputs are shown in FIGURE 14 and FIGURE 13 (per-unit values). 1000 profiles of the load demands and the wind turbine outputs are shown, and the dashed lines are the average values.
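The appendix computation can be sketched in a few lines; `beta_params` implements the moment-matching relations between (p, σ) and (α, β) described above, and `wind_sigma` implements the linear fit (29):

```python
def beta_params(p, sigma):
    """Moment-matching beta shape parameters (alpha, beta) for a
    distribution with mean p and standard deviation sigma."""
    k = p * (1.0 - p) / sigma**2 - 1.0
    return p * k, (1.0 - p) * k

def wind_sigma(p):
    """Linear fit (29): standard deviation of the wind forecast error
    as a function of the normalized predicted power p."""
    return 0.249 * p + 0.035
```

By construction, the resulting beta distribution has mean α/(α + β) = p, so sampled wind outputs fluctuate around the forecast with the spread given by (29).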