Energy Management of Networked Microgrids With Real-Time Pricing by Reinforcement Learning

Coordinating the microgrids (MGs) in the distribution network is a critical task for the distribution system operator (DSO), which could be achieved by setting prices as incentive signals. The high uncertainty of loads and renewable resources motivates the DSO to adopt real-time prices. The MGs require reference price sequences for a long time horizon in advance to make generation plans. However, due to privacy concerns in practice, the MGs may not provide adequate information for the DSO to build a closed-form model. This causes challenges to the implementation of the conventional model-based methods. In this paper, the framework of the coordination system through real-time prices is proposed. In this bi-level framework, the DSO sets real-time reference price sequences as the incentive signals, based on which the MGs make the generation and charging plan. The model-free reinforcement learning (RL) is applied to optimize the pricing policy when the response behavior of the MGs is unknown to the DSO. To deal with the large action space of this problem, the reference policy is incorporated into the RL algorithm for efficiency improvement. The numerical result shows that the minimized cost obtained by the developed model-free RL algorithm is close to the model-based method while the private information is preserved.


Energy Management of Networked Microgrids With
Real-Time Pricing by Reinforcement Learning Gaochen Cui, Student Member, IEEE, Qing-Shan Jia , Senior Member, IEEE, and Xiaohong Guan , Life Fellow, IEEE Abstract-Coordinating the microgrids (MGs) in the distribution network is a critical task for the distribution system operator (DSO), which could be achieved by setting prices as incentive signals.The high uncertainty of loads and renewable resources motivates the DSO to adopt real-time prices.The MGs require reference price sequences for a long time horizon in advance to make generation plans.However, due to privacy concerns in practice, the MGs may not provide adequate information for the DSO to build a closed-form model.This causes challenges to the implementation of the conventional model-based methods.In this paper, the framework of the coordination system through realtime prices is proposed.In this bi-level framework, the DSO sets real-time reference price sequences as the incentive signals, based on which the MGs make the generation and charging plan.The model-free reinforcement learning (RL) is applied to optimize the pricing policy when the response behavior of the MGs is unknown to the DSO.To deal with the large action space of this problem, the reference policy is incorporated into the RL algorithm for efficiency improvement.The numerical result shows that the minimized cost obtained by the developed model-free RL algorithm is close to the model-based method while the private information is preserved.
Index Terms-Reinforcement learning, microgrids, distribution network, energy management.

I. INTRODUCTION
I N A SMART distribution network, the energy management system is faced with a complex network with distributed energy resources and energy storage (ES) units [1].MGs consist of loads, generators, ES units, and renewable energy sources such as photovoltaic (PV) units.An MG could be both an electricity consumer and prosumer [2].The DSO normally has the decision authority to solve the energy management problem for the entire system.Coordinating the resources in these MGs is a critical task for the DSO to achieve high energy efficiency.However, the DSO and MGs are owned by different organizations, therefore the DSO usually has no authority to directly command the MGs.An effective way is to motivate the MGs through pricing signals [3], while the real-time prices could be applied to handle the high uncertainty of loads and renewable energy sources [4].Moreover, the MGs usually require a reference price sequence for a long time horizon (e.g., 24 hours) divided into multiple time slots (e.g., 5 minutes) in advance to make the generation plan [5].Therefore in this paper, the policy to set the real-time reference price sequence is optimized.
The DSO pricing task could be modeled as a bi-level optimization problem [3].At the upper level, the DSO decides the reference price sequence.At the lower level, the MG makes its generation plan for the received reference prices.This bi-level optimization problem could be transformed into a mixed-integer linear programming (MILP) problem using the Karush-Kuhn-Tucker (KKT) conditions and solved by commercial solvers [6].However, in practice, the MGs may not provide their private information about the response behavior toward the prices.In this circumstance, the model-based methods encounter challenges since the closed-form model is hard to build.
The above problem could be transformed into a Markov decision process (MDP) problem and solved by the modelfree RL [7] that optimizes the DSO pricing policy by learning from experience.In recent years, RL has been successfully applied to power systems [8].Online pricing algorithms of demand response and for MGs based on RL are developed in [4], [5].However, in this paper, the DSO agent decides the price sequence for a much longer time horizon at each time instant.This results in a much larger action space and causes challenges to the training process.
In this paper, the following contributions are made for addressing the above problems.(i) The price-based real-time coordination framework is established and formulated as a bi-level optimization problem.In this framework, the DSO transmits prices and collects metering data only once during each time slot, and thus is more tolerable for short-term communication failures.Moreover, the MGs would receive a reference price sequence to consider the long-term profit when making plans for the resources.(ii) This problem is transformed into an MDP that incorporates the responding behavior of the MGs, which could be solved by the modelfree RL method when the behavior is unknown to the DSO.
The reference policy-based RL algorithm is developed that improves the training efficiency, while the state-of-the-art RL methods fail to significantly improve the policy in finite time due to the large action space in practice.(iii) Numerical experiment shows that the developed RL algorithm is independent of the closed-form model at the expense of less than 5% economic cost compared with the conventional model-based method.

II. LITERATURE REVIEW
Coordinating the MGs in a distribution network is a hot topic in the last decade.The smart metering devices enable the operator to apply efficient dispatch with an accurate demandside model [9].One approach is to manage the power flow through the distributed scheme [10].This approach usually heavily relies on the communication network [11], and the issues such as packet loss are compensated in [12].The smallscale energy resources and load demand at each MG are modeled and solved by the particle swarm method in [13].A convex multi-objective optimization problem for multi-MG is formulated and solved by an online algorithm in [14].However, these MGs may belong to different owners in practice, so the DSO couldn't directly control them.In this circumstance, prices could be applied as incentive signals to coordinate the MGs.The pricing task is generally modeled as a bi-level optimization problem [3].A hierarchical market structure is proposed in [15] to guarantee the goals of both DSO and MGs.A bi-level stochastic model is formulated and transformed into a MILP problem that could be solved by commercial solvers to maximize the DSO profit in [6].This method is also utilized to coordinate the distributed resources in the virtual power plants [16], [17].
However, the above methods require the response behavior of the MGs to build the closed-form model, which is hard to acquire in practice.The responding behavior of the MGs could be approximated by a neural network [18], while the optimization problem is still hard to solve due to the nonlinearity.An alternative method is that the DSO and MGs bargain through sequentially solving the local optimization problem [19], [20], which heavily relies on the communication network.And whether the MGs would participate in this process is questionable.
Real-time electricity pricing models potentially lead to high energy efficiency as the stochasticity increases [21].The realtime pricing could be applied to cope with the discrepancy in the actual consumer behavior and the load levels from forecast and planning [22], where the model of demand elasticity is critical [23].It is also applied to energy management for electric vehicles [24].
Model-free RL methods could address this problem by interacting with the system and learning from experience to improve the policy.With deep learning showing strong feature extraction ability in computer vision and natural language processing [25], RL is empowered to deal with complicated tasks from games [26] to robotics [27].To handle the large state space, parametrized methods are developed to improve the efficiency [7].Deep neural networks further improve the learning ability to defeat humans in hundreds of Atari games [28] by the strong feature extraction ability.Similarly, the policy is also parametrized to handle large action space and is trained with the actor-critic framework [29].
In recent years, RL has been widely applied to power systems [8].A consensus transfer Q-learning for decentralized generation command dispatch of automatic generation control is developed in [30].The multi-agent RL is applied to Volt-VAR control in power distribution networks in [31].An RL-based online optimal control method is developed for the hybrid ES system in AC-DC MGs in [32].The problem of setting the tap positions of load tap changers for voltage regulation in power distribution systems is solved by RL in [33].Efficient economic dispatch for MGs is achieved by a cooperative RL algorithm in [34].A fully distributed multi-agent RL method for optimal reactive power dispatch is developed in [35].An RL-based distributed optimal power flow algorithm is developed that reduces the computational complexity of the conventional linear programming approach while addressing the stochastic nature of the energy resources and loads in [36].A hierarchical MG model considering communication uncertainty is developed and solved using RL in [37].
To solve the pricing problem, the RL algorithm is applied to solve dynamic pricing and energy consumption scheduling problem in [38].An RL-based method for online pricing of demand response is developed in [4].An RL-based gametheoretic approach is developed to solve the pricing problem for networked MGs in [39].However, these methods generate the prices for only a short time.In this paper, we deal with the circumstance where the MGs require a reference price sequence for a much longer time horizon to make the generation and charging plan.In this case, the large action space results in low efficiency in practice.Thus in this paper, a reference policy-based RL algorithm is developed to address this problem.

N
The set of all nodes in the distribution network.

N MG
The set of MG nodes.

E
The set of distribution network lines.
The time slot index.τ ∈ N The time slot index in a sequence.

T ∈ N
The number of time slots in a time horizon.t ∈ R The time duration of a time slot.λ HV t ∈ R The actual marginal price for buying electricity from the high voltage grid at time t.λHV t ∈ R T The predicted marginal price for buying electricity from the high voltage grid at time t for the horizon from t + 1 to t + T. PG i ∈ R The maximum active power output of the generator in the MG at node i. P G i ∈ R The minimum active power output of the generator in the MG at node i.
The maximum ramping rate of the generator in the MG at node i.
The maximum charging power of the ES unit in the MG at node i.
The maximum discharging power of the ES unit in the MG at node i.

Si ∈ R
The maximum energy storage of the ES unit in the MG at node i. S i ∈ R The minimum energy storage of the ES unit in the MG at node i. PPV i,t ∈ R T The predicted PV active power at node i, time t for the horizon from t + 1 to t + T. PL i,t ∈ R T The predicted load active power at node i, time t for the horizon from t + 1 to t + T. QL i,t ∈ R T The predicted load reactive power at node i, time t for the horizon from t + 1 to t + T. P PV i,t ∈ R The actual PV active power at node i, time t.P L i,t ∈ R The actual active power of the loads at node i, time t.Q L i,t ∈ R The actual reactive power of the loads at node i, time t.C G i ∈ R The cost coefficient of the generator connected in the MG at node i.

Vi ∈ R
The maximum voltage magnitude at node i.
The minimum voltage magnitude at node i. λ ∈ R The maximum price adjustment between two continuous time slots.

IV. PROBLEM FORMULATION
In this section, we first establish the real-time price-based coordination framework.In this bi-level system, the DSO aims at minimizing the cost of supplying electricity, and the MGs aim at maximizing their own profit by making electricity transactions with the DSO.Then, the pricing problem of the DSO is formulated as an optimization problem in closed form.

A. The Overall Structure
The operational framework of the DSO and MGs in the distribution network is shown in Fig. 1 t.
In this framework, the DSO transmits the reference prices and receives the metering data once every t (several minutes).Thus, it is with greater tolerance for packet loss than the method which requires the DSO and MGs to repeatedly solve the local problems and transmit variables during each time slot [19], [20].

B. The DSO Optimization Problem
At the upper level, the DSO agent sets the reference price sequences λMG i,t during time slot t.The MGs respond to λMG i,t according to their strategies at the lower level.The upper-level problem P 0 is formulated in (1a)-(1n).
is the strategy adopted by the MG which yields P i,t with respect to the reference price sequence λMG i,t−1 .Node 0 is connected to the high voltage grid and set as the voltage balance node.Index τ is the time index in the horizon of predictions and reference prices at each time instant t.T is the length of the horizon.
The first term in the objective function (1b) for the DSO is the cost of buying electricity from the MGs or the income by selling electricity to the MGs, while the second term is the cost of buying electricity from the high voltage grid.t is omitted since it is constant.We assume that the loads at i ∈ N /N MG are inflexible and the electric prices at these nodes are fixed.In this case, the objective function (1b) is equivalent to the profit of the DSO running the distribution network.Constraints (1c) and (1d) are the active and reactive power balance at each node.Constraint (1e) is the Distflow equation [40] which is widely applied to model the optimal power flow in the distribution network.Constraint (1h) is to guarantee that the voltage level of each node is within a predefined range for safety.Constraint (1i) guarantees that the variation of the reference price for the same time slot given at two continuous time slots does not exceed a certain range.This constraint is to prevent the DSO from tricking the MGs with a higher price into improving generator output in advance and then lowering the price when clearing the transacted electricity.Constraint (1j) is to limit the prices for the MGs.Constraints (1k) and (1l) are the power balance equations at nodes i ∈ N /N MG .Constraint (1m) is the strategy adopted by the MGs which is unknown to the DSO in practice.
The strategy function P * i,t−1 ( λMG i,t−1 ) is usually associated with the resources and states in the MG.Based on the strategy, the MG coordinates its generator, PV, and ES to optimize its objective, such as maximizing its generation profit when selling electricity to the DSO or minimizing its cost when buying from the DSO.A model of the strategy is given in the next subsection.
At the upper level, at time t, the DSO agent decides λMG i,t based on the predictions to optimize the upper-level problem P 0 .The electricity transacted during time slot t+1 is cleared at price λ MG i,t+1 = λMG i,t (1).Then the DSO observes the predictions at time t + 1 and decides λMG i,t+1 .So and so forth.Problem P 0 is generally non-convex since the objective function (1b) is with bi-linear terms λ MG i,t P i,t .Moreover, constraint (1m) may also result in a non-convex feasible region since P i,t is the solution to the lower-level optimization problem.

C. The MG Optimization Problem
In this subsection, we provide the MG optimization problem [13] as an instance of the response behaviors.We note that the developed method in this paper is not limited to this model, but is suitable for other MG strategies as well. 1he linear programming problem where i,t = {P i,t+1 , Pi,t , PG i,t , PS i,t , Ŝi,t } is the set of decision variables of the MG at node i ∈ N MG .
In objective function (2b), the first term is to minimize the cost of buying electricity when Pi,t (τ ) < 0 or to maximize the profit for selling electricity when Pi,t (τ ) > 0. Constraint (2c) is the active power balance of the node where the MG is located.The active power injection equals the sum of the generator output, ES unit (dis)charging, PV unit output, and loads.It is assumed that only the load produces reactive power, and thus only the active power is the decision variable in this model.Constraints (2d) and (2e) are to restrict the ramping rate of the generator.Constraints (2f) and (2g) are the energy balance of the ES.Constraints (2h), (2i), and (2j) are to restrict the generator power output, the ES (dis)charging power, and the stored energy, respectively.Constraint (2k) calculates the actual active power injection at node i to the distribution network, which is the feedback to the DSO.
The lower-level problems P LP i,t of the MGs are a part of the upper-level problem P 0 .At each time t, the DSO agent decides the reference price λMG i,t to minimize the long-term average cost.Then, the MGs solve the lower-level problem P LP i,t to determine the generation P G i,t+1 = PG i,t (1) and the (dis)charging power P S i,t+1 = PS i,t (1), and thus feed back the power injection P i,t+1 to the DSO agent in (1m).So and so forth.

V. THE PROPOSED METHODOLOGY
In this section, we first transform the pricing problem into an MDP problem.Then, a model-free RL algorithm is developed to solve this problem without the knowledge of the MGs' response behavior.To deal with the large state space, a deep neural network structure is developed to decrease the feature space.To address the difficulty of low exploration efficiency caused by the large action space, the reference policy is incorporated into the algorithm.

A. The MDP Problem
For the DSO, the system described in the previous section is modeled as an MDP, which is characterized by a tuple S, A, P, c , where S is the finite state space with cardinality |S|, A is the finite action space with cardinality |A| = T • |N MG |, P(s |s, a) : S × A × S → [0, 1] is the state transition probability from s ∈ S to s ∈ S determined by action a ∈ A, and c(s, a) : S × A → R is the cost function received by the agent.The agent executes policy π(s, a) : S × A → [0, 1] with a∈A π(s, a) = 1, where π(s, a) is the probability of choosing action a at state s.The physical system is continuous in R, while S is considered to be finite since the measurement accuracy is limited and the values are bounded.Action space A is considered to be finite because the price is also bounded and is quoted in increments of $0.01 in this paper.Requirement of state and action spaces to be finite is standard in most studies on MDPs [7], which makes it convenient to define the MDP and the corresponding objective function.
At each time t, the DSO agent stays at state s t ∈ S, where , the predictions and current system state.Then, it takes action a t ∼ π(s t , a), where a t is { λMG i,t , i ∈ N MG }, i.e., the reference prices.The cost will be received by the agent, where Pen(V t ) is a large number if ∃V i,t that violates the safe constraint and equals 0 otherwise.Then the agent steps to state s t+1 P(s|s t , a t ), i.e., the predictions and system state at time t + 1, where P(s|s t , a t ) is the state transition probability.Probability P(s |s, a) is implicit in the bi-level system defined in Fig. 1.It is determined by the following factors: variation of the load, the renewable energy, and the real-time electric price of the high voltage network; the measurement and prediction error; the response behavior of the MGs that determines the power injection by the MGs during the next time slot; the probability distribution of the facility state which are unobservable to the DSO such as the generator output and the state of charge of the ES units.The model-free RL algorithm does not require the closed-form of transition probability P. Since the cardinality of the state space is large in practice, methods that store a distribution over actions for each state are impractical.Thus, we apply parametrized functions to generate the distribution of randomized policy π θ with parameter θ ∈ for the agent, where ⊂ R m is a convex and compact set.Normal distribution Normal(μ θ , ) is applied to represent pricing policy π θ , where μ θ : S → R |A| is the parametrized function mapping the state space S to the mean vector.The covariance matrix is a diagonal matrix with fixed elements.When the DSO agent takes an action, a vector is sampled from π θ and then rounded to 2 decimal places to produce a t .The probability density of Normal(μ θ (s), ) is taken as π θ (s, a) for computing simplicity.
Assumption 1 implies that the MDP is with a stationary distribution d θ (s) induced by policy π θ .The objective of the agent is to find the optimal policy to minimize the expected long-term average cost, which is given by Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

B. Reference Policy-Based RL Algorithm
Given a policy π θ with parameter θ , the relative action-value function is defined by (5) and the state-value function is defined by which satisfies the Poisson equation [7] J(θ To optimize objective ( 4) over θ , it is needed to calculate the gradient.The result in [7] shows that the gradient of J(θ ) w.r.t.θ is given by when Assumption 1 holds.This means that the policy gradient could be estimated under the current stationary distribution of state and action.However, in practice, Q θ (s, a) is hard to obtain, so a parametrized function Q φ (s, a) with parameter φ ∈ R n is utilized to approximate Q θ (s, a).The update of φ is called the critic step.In the critic step, it is aimed at minimizing the error of approximating Q θ induced by parametrization.For each state s and action a, the residual is defined as The objective function is given by ( 10) The update of θ is the step.In the actor step, we replace Q θ (s, a) in ( 8) with Q φ (s, a) to estimate the policy gradient, and push θ in the opposite direction of the policy gradient to minimize J(θ ).
One difficulty in solving this MDP problem is the high action dimension.In our case study, the price sequences for 4 MGs are generated for the next 24 hours divided into 5 minutes, then the cardinality of the action space, |A|, would be 24 × 12 × 4 = 1152, which causes the agent hard to reach an optimized policy.The experiment shows that if the initial point of θ is random, the regular RL algorithm will converge to the policy with poor performance.To address this problem, we propose to incorporate a reference policy into the regular RL algorithm as shown in Algorithm 1 and Fig. 2. The reference policy could assist the agent to generate a reasonable policy at the beginning of training.
We use deep neural networks to represent function Q φ (s, a) and μ θ (s).It is not suitable to apply a fully connected layer as the first layer, because the large space of S and A would result in an extremely large weighting matrix.To address this problem, we develop the neural network structure as shown in Fig. 3, where the prediction sequences The DSO observes state s 0 3: The DSO takes action a 0 ∼ π θ 0 (s 0 ) 4: repeat 5: The DSO observes state s t+1 6: The DSO takes action a t+1 ∼ π θ t (s t+1 ) 7: The DSO receives c t+1 8: \\Critic Step 9: 11: \\Actor Step 13:  into the block that consists of 1D convolution layers and maxpooling layers, which could reduce the feature space.Second, the extracted feature vectors are flattened and concatenated with {P i,t , Q i,t , V i,t , i ∈ N }.Third, the fully connected layers are fed with the concatenated vector and generate the final Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.output.This deep neural network structure is similar to the structures in [41], [42], which are applied for the analysis of the aggregated load power.Detailed information on the 1D convolution and fully connected layers is omitted in this paper due to space limitations.
In Algorithm 1, α t and β t are the step sizes.The distribution network system stays at the initial state s 0 , and the parameters of the DSO agent are initialized as φ 0 and θ 0 .At each time t + 1, the DSO agent observes the state s t+1 and takes action a t+1 ∼ π θ t (s t+1 ) in line 5 and 6.Then the agent executes the critic step and the actor step.In the critic step, the estimator, μ c t+1 , for J(θ t ) is updated in line 9.The TD error, which is the residual value of equation ( 7), is calculated in line 10.The parameter of the action-value function, φ t , is updated to minimize the TD error in line 11.In the actor step, A t is the gradient of J(θ ) and B t is the gradient of (μ θ t (s t ) − μ * (s t )) (μ θ t (s t ) − μ * (s t )).The truncated λHV i,t , i.e., is set as the reference policy.This reference policy could be easily generated when the prediction for the real-time prices λHV i,t of the high voltage network is acquired.The policy parameter is updated to the direction of −β t (A t + γ t B t ) in line 15, which is a stochastic gradient descent step with γ t as the weighting parameter that decays to 0 as t → ∞.The initial value γ 0 balances the convergence to the reference value and the policy improvement through exploration.Since γ t → 0, a large γ 0 would not deteriorate the final optimized policy.However, if γ 0 is too large, the policy will be locked around the reference policy for a long time, which delays further policy improvement.In the following experiment, γ 0 is set as 100 with a decay coefficient 0.97, i.e., γ t+1 = 0.97 • γ t .These values are determined by experimental experience and the scale of the action-value Q is a critical factor.The entire training process is repeated until the policy converges or the user terminates the program.

A. Experimental Settings
In this section, we apply the developed Algorithm 1 to coordinate the 4 MGs in the IEEE 33-node distribution network as shown in Fig. 4.Each MG owns inflexible loads, a generator, an ES unit, and a PV unit.A simulator for this system is built, where each MG locally solves P LP i,t at time t to decide the power output of its generator and ES unit at time t + 1.The load data, PV data, and real-time electric price of the high voltage network are downloaded from PJM. 2 The prediction sequences such as PL i,t are simulated by adding an error vector in which the τ th element obeys Gaussian distribution N(0, σ e τ ) to the actual values, where σ e τ increases as τ increases.The time horizon of predictions and reference price sequences is 24 hours and divided into time slots of 5 minutes.Thus, T = 24 × 12 = 288.The simulator is built with Python [43] and the power flow is calculated using Fig. 6.The curves of λ MG i,t , P G i,t , and S i,t of the MGs with P LP i,t as the lower-level problem.
pandapower [44].Moreover, to illustrate the convergence of Algorithm 1, we also add 4 more MGs to the distribution network at buses i ∈ {4, 17, 25, 28} and test the algorithm.
To further evaluate Algorithm 1, the lower-level problem is replaced by P QP i,t , where the cost functions of the generators are quadratic w.r.t.P G i,t .Moreover, constraint (2f) is replaced by where η c i and η d i are the charging and discharging efficiency factors and so is constraint (2g).
To evaluate the trained policy, we set up the baseline solved by the model-based method.We formulate a finite horizon bi-level problem P 0 t by replacing (1a) with min 0 f ( 0 , T) and replacing the predictions P and Q in P LP i,t (P QP i,t ) with their actual values.P 0 t is solved every T time steps.The optimal solutions are set as the baseline.This problem could be transformed into a MILP (MIQP) problem and solved by commercial solvers.The outline is as follows and the whole process is similar to [6].First, the optimal solution of each problem P LP i,t (P QP i,t ) is represented by the KKT conditions since the problem is linear and the strong duality holds.In this way, the constraint (1m) could be replaced by these KKT conditions.Second, the complementary slackness conditions, which include bi-linear terms, are transformed into mixed-integer linear constraints.Third, the bi-linear term of λ MG i,t P i,t in the objective function is transformed into a linear term w.r.t. the dual function of P LP i,t (P QP i,t ) since the strong duality holds.Thus, P 0 t is transformed into a MILP problem.Gurobi [45] is implemented to the global optimal solution.In addition, TD3 [46] and SAC [47], which are state-of-the-art Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.RL methods, are directly applied to solve this MDP problem, while they both fail to significantly reduce the cost in finite time. 3his baseline is hard to achieve since the DSO does not know P LP i,t ( λMG i,t ) (P QP i,t ( λMG i,t )) in practice.Algorithm 1 is intended to manage this situation by iteratively evaluating and improving the current policy.In the following subsection, we show the experimental results.

B. Experimental Results
We first show the results with P LP i,t as the lower-level problem.The costs of the DSO during the training process are shown in Fig. 5 as the solid lines and the baseline is shown as the dashed lines.It could be found that Algorithm 1 finally converges to a locally optimal policy.
The comparison of the pricing policies is shown in TABLE I, where "Initial policy" is the policy with randomized θ 0 and "Reference policy" is μ * defined in (11).It is shown that the initial randomized policy is improved from the average cost of $400.06/5min to $323.32/5min in the 4-MG system and $404.04/5min to $240.35/5min in the 8-MG system, while the baseline is $317.13/5minand $228.14/5min,respectively.The result shows that Algorithm 1 could generate an optimized pricing policy which is close to the model-based method under incomplete information.It could be also found that the number of MGs affects the convergence of Algorithm 1 since the Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.action and state space of the system grows larger as the MG number increases.In Fig. 6, the states of the coordinated facilities in each MG are shown.In general, when the value of λ HV t is high, the DSO would raise the price to motivate the MG to generate more active power.The resulting output of the generators and E.S. units also increases as λ HV t does.Moreover, the DSO agent is adapted to the local cost C G i .In the setting of the system, The DSO agent sets the local price λ MG i,t around C G i to maximize its benefit.This results in the divergence of the behaviors of different MGs.For example, the generator at node 20 (orange line) works at its maximum for the longest time, while the one at node 12 (green line) is for the shortest time.This is compatible with the intuition that the generator with the highest cost should be the last choice to generate active power.
The costs with P QP i,t as the lower-level problem are shown in TABLE II.The results show that Algorithm 1 is also effective for the response behavior that the MG considers quadratic cost functions and (dis)charging loss.Compared with the initial policy and the reference policy, the optimized costs by Algorithm 1 are closer to the baselines which is optimized by the model-based method in both systems with 4 and 8 MGs.
The states of the coordinated facilities in each MG are shown in Fig. 7.According to the results, the quadratic cost functions of the MGs lead to more variable λ MG i,t .This is because, without the quadratic term, the MGs would maximize the generator output for any positive return, i.e., λ MG i,t > C G i .The quadratic cost also leads to more fluctuations in P G i,t since the optimal output continuously varies as λ MG i,t changes.In addition, the (dis)charging loss causes the energy storage more stable since these operations result in energy loss.
During each time slot, the DSO agent only spends around 0.002 seconds computing the reference price sequences for all MGs, which is much smaller than t = 5 minutes.This is because the DSO agent only needs to perform a forward propagation of the neural network at each time.Thus, Algorithm 1 is suitable for online dispatch.

VII. CONCLUSION
This paper studies the optimization problem of the pricing policy to coordinate multiple MGs in the distribution network.
In practice, the MGs may not provide their response behavior for the DSO due to privacy concerns.Thus, this bi-level system is transformed into an MDP, where the DSO is the agent.The pricing policy is optimized by the developed modelfree RL algorithm.The numerical result shows that the policy optimized by our algorithm performs almost as well as the conventional model-based method, while the former is more practical by privacy preservation for the MGs.The number of MGs affects the convergence.And the optimized pricing policy encourages the generator with lower cost to generate more power.Moreover, it shows that the developed algorithm is also effective when the MGs consider quadratic cost functions and (dis)charging loss.

Fig. 1 .
Fig. 1.The bi-level operational structure of the distribution network with multiple MGs.

Fig. 2 .
Fig. 2. The flowchart of training the DSO agent.

Fig. 3 .
Fig.3.The neural network structure for Q φ (s, a) and μ θ (s).Compared with μ θ (s), the Q φ (s, a) network needs the action as additional input, which is λMG i,t .

Fig. 5 .
Fig. 5.The cost of the DSO during the training process with P LP i,t as the lower-level problem (the response behavior of the MGs).

Fig. 7 .
Fig. 7.The curves of λ MG i,t , P G i,t , and S i,t of the MGs with P QP i,t as the lower-level problem.
The minimum price at node i.The active power injection at node i, time t.Pi,t ∈ R T The planned active power injection at node i, time t for interval from t + 1 to t + T. Q i,t ∈ R The reactive power injection at node i, time t.
V i,t ∈ R The voltage magnitude at node i, time t.Vi,t ∈ R T The planned voltage magnitude at node i, time t for interval from t + 1 to t + T. P ij,t ∈ R The active power flow from node i to node j through line at time t.Q ij,t ∈ R The reactive power flow from node i to node j through line at time t.P G i,t ∈ R The active power output of the generator in the MG connected at node i, time t.PG i,t ∈ R T The planned active power output of the generator in the MG connected at node i, time t for the horizon from t + 1 to t + T. P S i,t ∈ R The (dis)charging power of the ES unit in the MG connected at node i, time t.PS i,t ∈ R T The planned (dis)charging power of the ES unit in the MG connected at node i, time t for the horizon from t + 1 to t + T. S i,t ∈ R The energy stored in the MG at node i, time t.Ŝi,t ∈ R T The planned energy stored in the MG at node i, time t for the horizon from t + 1 to t + T. λ MG i,t ∈ R The price for the MG at node i, time t.λMG i,t ∈ R T The reference price for the MG at node i, time t for the horizon from t + 1 to t + T.
. Without losing generality, it is assumed that each MG owns a generator, a PV unit, an ES unit, and inflexible loads.At each time t, the DSO and MGs consider the horizon of the next T time slots of time duration t.(a) The DSO has access to the predicted PV generation PPV i,t ∈ R T at all MGs, loads PL T for buying from the high voltage grid for the next T time slots.(b) Based on the prediction, the DSO decides the reference price sequences λMG In our model of the strategy adopted by each MG, similarly, the first elements of PG i,t , QL i,t ∈ R T at all nodes, and the price λHV t ∈ R i,t ∈ R T for the next T time slots for each MG.(c) The MGs make local predictions and receive the reference price sequences.(d) Each MG solves its local lowerlevel problem to plan for its generator output PG i,t ∈ R T and ES (dis)charging power PS i,t ∈ R T for the next T time slots.(e) At time slot t + 1, the first element of λMG i,t , i.e., λMG i,t (1), is set as Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.the clearing price λ MG i,t+1 = λMG i,t (1).

TABLE I COMPARISON
OF COSTS (U.S. $/5 MINUTES) OF THE DSO WITH P LP i,t AS THE LOWER-LEVEL PROBLEM TABLE II COMPARISON OF COSTS (U.S. $/5 MINUTES) OF THE DSO WITH P