Risk-Based Reserve Scheduling for Active Distribution Networks Based on an Improved Proximal Policy Optimization Algorithm

Achieving the coordinated optimization of tie-line reserve and energy storage on two timescales is a key issue in reserve scheduling for active distribution networks. To effectively manage the uncertainty of renewable energy, we propose a risk-based reserve scheduling method for active distribution networks based on an improved proximal policy optimization algorithm. The reserve scheduling problem is formulated as a two-timescale multistage stochastic programming model, in which intraday real-time operation is constructed as a multistage stochastic programming model. With the help of more accurate and realistic simulation of intraday operation scenarios, the energy storage can be fully utilized to improve the system's ramping and peak-regulation capability, reduce the reserve pressure on the main grid, and thereby improve the flexibility of system operation. An improved proximal policy optimization algorithm is proposed to solve the optimization problem in a model-free manner while effectively handling conditional value-at-risk constraints. The effectiveness of the proposed method is verified on a modified IEEE 33-bus test case.

INDEX TERMS Active distribution network, multistage stochastic programming, conditional value-at-risk, reserve scheduling, deep reinforcement learning.

NOMENCLATURE

Indices
t  Periods, t = 1, ..., T.

Parameters
λ_{G,t}  Time-of-use price of electricity purchased from the main grid in period t.
λ^{R+}, λ^{R−}  Costs of unit up and down reserve.
λ^{loss}  Cost of unit network losses.
\overline{U}, \underline{U}  Upper and lower limits of the node voltage amplitude.
\overline{P}^G, \underline{P}^G  Upper and lower limits of the active power of the root node.
\overline{Q}^G, \underline{Q}^G  Upper and lower limits of the reactive power of the root node.

Variables and Functions
C_{RESC,t}  Curtailment cost of renewable energy in period t.
C_{LL,t}  Cost of load shedding in period t.
π_0  Day-ahead policy function.
π  Intraday policy function.
P^π_{s_t s_{t+1}}  Probability of the system transitioning from state s_t to the next state s_{t+1} under policy π.
r_0  Day-ahead reward function.
r_t  Intraday reward function in period t.
R_t  Intraday auxiliary reward function in period t.
c_t  Cost function in period t.
c^H_t  Auxiliary cost function in period t.
µ_π  Stationary distribution of states s under policy π.
λ_k  Lagrange multiplier of constraint k.
V^π_t  Intraday value function corresponding to the intraday policy π in period t.
V^{π_0,π}_0  Day-ahead value function corresponding to the intraday policy π and the day-ahead policy π_0.
V^{c_k,π}_t  Intraday cost value function corresponding to the intraday policy π in period t.
V^{c_k,π_0,π}_0  Day-ahead cost value function corresponding to the intraday policy π and the day-ahead policy π_0.
V^{H,l,π}_t  Intraday auxiliary cost value function corresponding to the intraday policy π in period t.
V^{H,l,π_0,π}_0  Day-ahead auxiliary cost value function corresponding to the intraday policy π and the day-ahead policy π_0.
L  Lagrangian function.
V^{L,l,π,λ}_t  Intraday Lagrangian value function for the given l, λ, and intraday policy π in period t.
V^{L,l,π_0,π,λ}_0  Day-ahead Lagrangian value function for the given l, λ, intraday policy π, and day-ahead policy π_0.
Q^L  Intraday Lagrangian action value function.

I. INTRODUCTION
With the increasing integration of renewable energy resources, it is difficult for the traditional reserve scheduling method based on thermal power units to meet the needs of power system development, in terms of both the reserve scheduling algorithm and the reserve providers [1]. In terms of reserve providers, the operating reserve required by the power system continues to rise, and it is challenging to increase the consumption rate of renewable energy by adding thermal power units to meet the operational reserve requirements. At the same time, because the output of renewable energy fluctuates randomly and strongly, higher ramping and peak-regulation capability is required of the reserve providers. When relying only on traditional thermal power units as reserve providers, it is difficult to meet the demands of economical system operation and of ramping and peak-shaving capability. Therefore, fully developing the initiative of an active distribution network (ADN) and realizing the coordinated scheduling of multiple reserve resources can effectively improve the economy and flexibility of system operation [2]. In terms of the reserve scheduling algorithm, unlike load, which fluctuates regularly with a small disturbance amplitude, renewable energy has a large random disturbance amplitude, and its fluctuation mechanism is difficult to elucidate. It is therefore challenging to determine an appropriate reserve ratio, which greatly reduces the applicability of traditional deterministic reserve scheduling methods [3]. Optimizing the reserve only from the perspective of operational sufficiency cannot guarantee that the reserve capacity can feasibly be called upon during actual operation, and it is difficult to meet the system's demand for operational flexibility.
To handle the impact of a high proportion of renewable energy and to ensure the feasibility of invoking reserve capacity in actual operating scenarios, robust optimization and two-stage stochastic programming (TSSP) have become the two main algorithms for reserve scheduling. Value-at-risk (VaR) and conditional value-at-risk (CVaR) are also often used in reserve scheduling to measure power system operation risk. Reference [4] sampled wind power output scenarios based on the scenario optimization method and constructed a two-stage unit commitment and reserve optimization model that configures the spinning and non-spinning reserve and considers the network constraints. Reference [5] built a multi-timescale power system scheduling operation model based on the scenario optimization method and optimized the reserve for the next scheduling stage in chronological order. Reference [6] further considered the dynamic frequency characteristics of the system. Reference [7] established a robust energy and reserve co-optimization model considering the uncertainty of wind power output based on two-stage adaptive robust optimization. Reference [8] considered multiple uncertain factors, such as wind power, load, and forced outages of generator sets, in an affine robust reserve scheduling problem. Reference [9] constructed an affine robust look-ahead scheduling model to configure intraday flexibility reserves. Reference [10] proposed a data-driven probabilistic characterization of the real-time balancing stage to inform the day-ahead scheduling problem of an ESS owner. Reference [11] proposed a coordinated robust reserve scheduling model for coupled transmission and distribution networks and presented a two-layer iterative process to enhance the convergence of the standard alternating direction method of multipliers. Reference [12] proposed reserve security constraints to ensure the reserve dispatchability of pumped-storage hydropower despite state-of-charge deviations. Robust optimization focuses more on safety than on economics, so its results may be conservative. TSSP, which has been widely used in power systems, divides the problem into two stages, a here-and-now stage and a wait-and-see (recourse) stage; with the help of the deterministic second-stage subproblem, the influence of different outcomes of intraday uncertainty on the reserve scheduling is considered. However, in TSSP the second-stage decision is calculated under the assumption that the uncertain parameters of all 24 periods are observed simultaneously, ignoring the time causality of the uncertainties. It was shown in [13] that the unit on-off status provided by the two-stage robust optimization approach leads to an infeasible real-time dispatch if non-anticipativity is considered.
The development of measurement and communication technology provides the prerequisite for applying multistage stochastic programming (MSSP) to the optimal operation of power systems. The MSSP model represents the interaction between decision making and the unfolding of uncertainty more accurately and realistically. At the same time, MSSP adjusts decisions dynamically and adaptively with the help of online measurements, which makes decision making more accurate and targeted and deals more effectively with uncertain renewable energy. Compared with TSSP, MSSP considers the time causality of the uncertainties, which leads to tree-like coupling of the uncertain scenarios, forming a scenario tree [14]. The practical solution is therefore to decouple the MSSP problem between stages with the help of dynamic programming (DP). Such solutions can be divided into two categories according to how the value function is approximated. In the first, the value function is underapproximated by linear cuts, as in stochastic dual dynamic programming (SDDP) [14]. In the second, the value function is approximated by a deep neural network (DNN), as in deep reinforcement learning (DRL) [15], which is currently widely used in the artificial intelligence field. Under the assumption that the value function is piecewise linear and convex, the value function in SDDP is underapproximated by linear cuts, and various proofs of the almost sure convergence of SDDP under mild assumptions have been proposed. Reference [16] developed SDDP to solve the MSSP model of joint economic energy and reserve dispatch. Reference [17] proposed an improved SDDP algorithm to solve the MSSP model of unit commitment. Reference [18], based on the SDDP period decomposition, further realized a spatial decomposition with the help of Lagrangian relaxation of the power balance equation, which can effectively solve large-scale problems. However, when decision variables are coupled across more than two stages, SDDP cannot share the linear cuts that underapproximate the value function among subproblems of the same stage. Thus, SDDP cannot overcome the curse of outcome dimensionality when facing controllable resources coupled over the full scheduling cycle, such as energy storage and grouping switching capacitor banks (CBs), and therefore has difficulty dealing with uncertain renewable energy.
As a representative of the new generation of artificial intelligence algorithms, DRL is a decision science for solving the optimal control problem of Markov decision processes (MDPs) whose models are not completely known. Because the explicit probability distribution of the uncertainties is replaced by feedback measurement information, DRL solves MDP problems in a model-free, trial-and-error manner [15]. In our previous work, DRL was validated to efficiently solve MSSP problems in a model-free manner [19]. The three most representative DRL algorithms are proximal policy optimization (PPO) [20], twin delayed deep deterministic policy gradient (TD3) [21], and soft actor-critic (SAC) [22]. However, the traditional DRL algorithm reconstructs the optimization problem into an MDP, in which some system operating constraints are difficult to handle. Safe reinforcement learning (SRL) can effectively manage constrained optimization problems by transforming them into a constrained Markov decision process (CMDP). Current SRL algorithms mainly fall into two categories: the primal-dual optimization (PDO) algorithm [23] and the constrained policy optimization (CPO) algorithm [24]. PDO relaxes the original problem into an unconstrained optimization problem by constructing a Lagrangian function. CPO solves a relatively simple one-step constrained optimization problem at every update step, ensuring that each update does not violate the constraints while improving performance. Reference [25] proposed a safe off-policy DRL algorithm to solve Volt-VAR control problems in a model-free manner. Reference [26] formulated EV charging scheduling as a CMDP and proposed an SRL solution based on CPO to handle the charging constraint. However, unlike general system operating constraints, CVaR constraints cannot be handled directly by existing SRL algorithms, because the VaR values must be solved for during the optimization. Therefore, this paper proposes an improved PPO algorithm to solve the reserve scheduling problem for ADN based on CVaR constraints. The main contributions of this paper are as follows: 1) The reserve scheduling for ADN based on CVaR constraints is constructed as a two-scale multistage stochastic programming (2-MSSP) model. Different from the traditional TSSP, the model constructs the real-time operation of the ADN as an MSSP model. The MSSP model considers the time causality of the intraday uncertainty realization, which makes the intraday operation scenario simulation more accurate and realistic. It can make full use of energy storage to improve the system's ramping and peak-regulation capability, reduce the reserve pressure on the main grid, improve the flexibility of system operation, and further improve the effectiveness of the day-ahead tie-line base power and reserve scheduling.
2) To manage the CVaR constraints, an improved PPO algorithm is proposed. Based on the PPO algorithm, the proposed algorithm adopts the constraint-handling technique of the PDO algorithm. The 2-MSSP model is reconstructed as a CMDP, and a Lagrangian function is constructed to further relax it into an unconstrained optimization problem. The parameters of the algorithm are optimized by the gradient descent method.
The rest of the paper is organized as follows. Section II presents the mathematical formulation of the 2-MSSP model for the ADN reserve scheduling problem considering uncertain renewable energy. In Section III, an improved PPO algorithm is proposed to solve the ADN risk-based reserve scheduling problem. Section IV presents a case study focused on the modified IEEE 33-bus test system. Finally, the conclusion is presented in Section V.

II. MODELING THE ADN RISK-BASED RESERVE SCHEDULING PROBLEM
This paper considers a two-timescale coordinated reserve scheduling problem for ADN: on the day-ahead timescale, the tie-line reserve is considered, and on the intraday timescale, the state of the energy storage in different scenarios is considered. The dispatchable units considered in this section include distributed generators (DGs), energy storage systems (ESSs), and grouping switching capacitor banks (CBs). First, the system operation risk is defined based on CVaR. On this basis, the general form of the two-timescale coordinated ADN reserve scheduling model is given, and the model is further transformed into a 2-MSSP model.

A. POWER SYSTEM OPERATION RISK
When the actual available renewable energy output of the system is greater than the upper limit of the absorbable capacity, that is, when the system's downward reserve capacity is insufficient, there is a risk of renewable energy curtailment. Conversely, when the actual available renewable energy output is less than the lower bound of the absorbable capacity, that is, when the system's upward reserve capacity is insufficient, there is a risk of load shedding. This paper defines power system operational risk based on CVaR. The curtailment cost of renewable energy C_{RESC,t} and the cost of load shedding C_{LL,t} are therefore given first, as shown in (1)-(2).

C_{RESC,t} = \lambda^{RESC} \Delta t \sum_{i \in N} ( \overline{P}^{DG}_{i,t} - P^{DG}_{i,t} )   (1)

C_{LL,t} = \lambda^{LL} \Delta t \sum_{i \in N} \Delta P^{Load}_{i,t}   (2)

where t is the period, t = 1, ..., T; λ^{RESC} is the curtailment cost per unit of renewable energy; λ^{LL} is the cost per unit of load shedding; N is the set of buses i; Δt is the period length, taken as Δt = 1 h in this article; \overline{P}^{DG}_{i,t} is the maximum allowable active power output of renewable energy at bus i in period t; P^{DG}_{i,t} is the actual active power output of renewable energy at bus i in period t; and ΔP^{Load}_{i,t} is the load shedding power at bus i in period t.
VaR refers to the expected maximum loss of the system at a given confidence level within a certain period. At confidence level 1 − ε, the VaR of renewable energy curtailment in period t, VaR^+_{ε,t}, and the VaR of load shedding, VaR^−_{ε,t}, are shown in (3)-(4).

VaR^+_{\varepsilon,t} = \min \{ l^+_t : \Pr ( C_{RESC,t} \le l^+_t ) \ge 1 - \varepsilon \}   (3)

VaR^-_{\varepsilon,t} = \min \{ l^-_t : \Pr ( C_{LL,t} \le l^-_t ) \ge 1 - \varepsilon \}   (4)

where Pr(·) represents probability; l^+_t and l^-_t are the auxiliary variables of VaR^+_{ε,t} and VaR^-_{ε,t} in period t, and l = (l^+_1, l^-_1, ..., l^+_T, l^-_T). CVaR refers to the average loss of the system under the condition that the loss exceeds the given VaR value [27]. At confidence level 1 − ε, the CVaR of renewable energy curtailment in period t, CVaR^+_{ε,t}, and the CVaR of load shedding, CVaR^-_{ε,t}, are shown in (5)-(6).

CVaR^+_{\varepsilon,t} = E [ C_{RESC,t} \mid C_{RESC,t} \ge VaR^+_{\varepsilon,t} ]   (5)

CVaR^-_{\varepsilon,t} = E [ C_{LL,t} \mid C_{LL,t} \ge VaR^-_{\varepsilon,t} ]   (6)
where ξ_t represents the random variables in period t. Thus, the system CVaR constraint can be given as shown in (7).

\sum_{t=1}^{T} ( CVaR^+_{\varepsilon,t} + CVaR^-_{\varepsilon,t} ) \le \beta_3   (7)

where T is the number of periods, taken as T = 24 in this article, and β_3 is the allowable upper limit of the system operation risk. VaR^+_{ε,t} and VaR^-_{ε,t} are associated with the system uncertainty scenarios and scheduling decisions and cannot be given prior to the system decisions. Therefore, in practical power system applications, a mathematical programming method is usually used: by introducing auxiliary functions, the computation of CVaR is transformed into a mathematical optimization problem, and the VaR values are solved simultaneously while the system optimization problem is solved. The auxiliary functions H^+_{ε,t} and H^-_{ε,t} are shown in (8)-(9).

H^+_{\varepsilon,t} = l^+_t + \frac{1}{\varepsilon} E [ \max ( C_{RESC,t} - l^+_t, 0 ) ]   (8)

H^-_{\varepsilon,t} = l^-_t + \frac{1}{\varepsilon} E [ \max ( C_{LL,t} - l^-_t, 0 ) ]   (9)
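As an illustration of how (8)-(9) operationalize the risk measure, the following Python sketch estimates VaR, CVaR, and the auxiliary function from sampled period costs; the sample data and function names are illustrative only and are not part of the scheduling model.

```python
import numpy as np

def var_cvar_from_samples(costs, eps=0.05):
    """Empirical VaR/CVaR of a loss sample at confidence level 1 - eps."""
    var = np.quantile(costs, 1.0 - eps)   # VaR: the (1-eps)-quantile of the loss
    cvar = costs[costs >= var].mean()     # CVaR: mean loss beyond VaR
    return var, cvar

def auxiliary_H(costs, l, eps=0.05):
    """Rockafellar-Uryasev auxiliary function: l + E[max(C - l, 0)] / eps."""
    return l + np.mean(np.maximum(costs - l, 0.0)) / eps

rng = np.random.default_rng(0)
c_resc = rng.gamma(shape=2.0, scale=50.0, size=10000)  # illustrative curtailment costs
var, cvar = var_cvar_from_samples(c_resc)

# Minimizing H over the auxiliary variable l recovers CVaR; the minimizer is VaR.
grid = np.linspace(0.0, c_resc.max(), 2000)
H = np.array([auxiliary_H(c_resc, l) for l in grid])
print(var, cvar, grid[H.argmin()], H.min())  # H.min() ~ CVaR, argmin ~ VaR
```

This is precisely why the model can treat l^+_t and l^-_t as ordinary decision variables: driving the auxiliary functions to their minimum yields the VaR and CVaR values as a by-product of the optimization.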

B. THE GENERAL FORM OF THE MODEL

1) OBJECTIVE FUNCTION
The objective function is shown in (12)-(14) and is composed of the day-ahead cost C_0 and the intraday cost C_t. The day-ahead cost is composed of the power purchase cost of the day-ahead base power of the tie line and the up and down reserve cost. The intraday cost is composed of the electricity cost and the network loss cost incurred by invoking the reserve capacity in actual operation.

\min \; C_0 + E \left[ \sum_{t=1}^{T} C_t \right]   (12)

C_0 = \sum_{t=1}^{T} ( \lambda_{G,t} P^G_t + \lambda^{R+} R^+_t + \lambda^{R-} R^-_t ) \Delta t   (13)

C_t = ( \lambda_{G,t} \Delta P^G_t + \lambda^{loss} P^{loss}_t ) \Delta t   (14)

where λ_{G,t} is the time-of-use price of electricity purchased from the main grid in period t; λ^{R+} and λ^{R−} are the costs of unit up and down reserve; λ^{loss} is the cost of unit network losses; P^G_t is the base power of the tie line in period t; P^{loss}_t is the network loss in period t; ΔP^G_t is the adjustment power of the tie line in period t relative to the base power; and R^+_t and R^-_t are the up and down reserve in period t.
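For concreteness, the cost structure of (13)-(14) can be sketched as follows (NumPy arrays indexed by period; the variable names are ours, and Δt = 1 h as in the paper):

```python
import numpy as np

def day_ahead_cost(lam_G, P_G, lam_Rp, R_up, lam_Rm, R_dn, dt=1.0):
    """C_0 (13): energy purchase at the day-ahead tie-line base power
    plus the cost of the scheduled up/down reserve capacity."""
    return float(np.sum((lam_G * P_G + lam_Rp * R_up + lam_Rm * R_dn) * dt))

def intraday_cost(lam_G_t, dP_G_t, lam_loss, P_loss_t, dt=1.0):
    """C_t (14): cost of the tie-line adjustment power actually invoked
    in real time plus the cost of network losses."""
    return (lam_G_t * dP_G_t + lam_loss * P_loss_t) * dt
```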

2) BRANCH FLOW MODEL
This paper uses branch flow model simulation to generate the data set required by DRL. Based on Ohm's law and Kirchhoff's current law, and assuming that each phase of the ADN is balanced, the branch power flow model of the ADN can be derived as shown in (15).

where M is the set of branches ij; P_{ij,t}, Q_{ij,t}, and I_{ij,t} are the active power, reactive power, and current amplitude of branch ij in period t, respectively; R_{ij} and X_{ij} are the resistance and reactance of branch ij, respectively; U_{i,t} is the voltage amplitude of bus i in period t; Q^{DG}_{i,t} is the reactive power of renewable energy at bus i in period t; P^{Load}_{i,t} and Q^{Load}_{i,t} are the active and reactive load power of bus i in period t; P^E_{i,t} is the ESS active power of bus i in period t; Q^{CB}_{i,t} is the reactive power of the grouping switching CB at bus i in period t; and i : i → j represents the set of upstream nodes whose power flows to node j among all nodes connected to node j.
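Since (15) is not reproduced above, the following sketch states the standard Baran-Wu DistFlow branch recursion that the text describes; the sign convention (generation, ESS discharge, and load shedding taken as injections at the bus) is our assumption.

```python
def distflow_branch(U2_i, P_ij, Q_ij, R_ij, X_ij):
    """One DistFlow branch step with squared voltage U2 and squared current I2."""
    I2 = (P_ij**2 + Q_ij**2) / U2_i                                  # branch current
    U2_j = U2_i - 2.0 * (R_ij * P_ij + X_ij * Q_ij) + (R_ij**2 + X_ij**2) * I2
    return I2, U2_j, P_ij - R_ij * I2, Q_ij - X_ij * I2              # power reaching bus j

def active_balance_residual(P_in, P_out, P_dg, P_ess, P_shed, P_load):
    """Active power balance at bus j: inflows plus injections equal
    outflows plus load; zero when the flow solution is consistent."""
    return P_in + P_dg + P_ess + P_shed - P_load - P_out
```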

3) SAFE OPERATION CONSTRAINT
To simplify the problem, only node voltage constraints are considered in this paper; the current amplitude of each branch is assumed by default to remain within its allowable range.

where \overline{U} and \underline{U} are the upper and lower limits of the node voltage amplitude.

4) RENEWABLE ENERGY OPERATION CONSTRAINT
The reactive power output of renewable energy considers the limitations of inverter capacity and power factor.
where S^{DG}_i is the capacity of the renewable energy inverter at bus i, and cos φ_{i,t} is the power factor of the renewable energy output at bus i in period t.

5) TIE LINE POWER CONSTRAINT
In this paper, the constraints on the day-ahead base power and the up and down reserve, as well as the constraints on the adjustment active power and the reactive power during real-time operation, are considered.

where \overline{P}^G and \underline{P}^G are the upper and lower limits of the active power of the root node; Q^G_t is the reactive power of the root node in period t; and \overline{Q}^G and \underline{Q}^G are the upper and lower limits of the reactive power of the root node.

6) ESS OPERATION CONSTRAINTS
This paper considers the capacity, charge-discharge efficiency, and maximum output power constraints of the ESS.

7) GROUPING SWITCHING CB OPERATION CONSTRAINTS
This paper considers the constraints of the maximum number of groups and the maximum number of daily adjustments for grouping switching CB.
where k^{CB}_{i,t} is the number of switched-on CB groups at bus i in period t; Q^{CB}_i is the reactive power output of each CB group at bus i; \overline{k}^{CB}_i is the upper limit of the number of CB groups at bus i; and \overline{K}^{CB} is the upper limit on the number of adjustments within a day.

8) LOAD SHEDDING CONSTRAINTS
In this paper, because the adjustment range of the tie-line power is limited, load shedding is introduced as a power balancing measure, and the operational risk of load shedding is evaluated by means of CVaR. The load shedding at a node is not allowed to exceed the node load.

C. THE 2-MSSP MODEL
To deal with the coordinated optimization of the tie line and energy storage on two timescales, the reserve scheduling for ADN based on CVaR constraints is constructed as a 2-MSSP model. Different from the traditional TSSP, the model constructs the real-time operation of the ADN as an MSSP model. The MSSP model considers the time causality of the intraday uncertainty realization, which makes the intraday operation scenario simulation more accurate and realistic. It can make full use of energy storage to improve the system's ramping and peak-regulation capability, reduce the reserve pressure on the main grid, improve the flexibility of system operation, and further improve the effectiveness of the day-ahead tie-line base power and reserve scheduling. The general form of the objective function (12) is transformed into the form of a 2-MSSP model, as shown in (27)-(28).
where the day-ahead decision variables are x_0 = (x^1_0, ...); ξ_{[1,t]} = (ξ_1, ..., ξ_t) is the trajectory of uncertainty outcomes up to period t; the intraday decision trajectory is x_{[1,t]} = (x_1, ..., x_t); X_t is the feasible domain of the decision variables; and X^π_t represents the control policy π, that is, the mapping from the uncertainty outcome trajectory to the decision variables, x_t = X^π_t(ξ_{[1,t]}). Problem (27) corresponds to the day-ahead optimization, which is carried out before any uncertainty is realized; its result serves as the basis for, and directly affects, the intraday optimization. Problem (28) corresponds to the intraday optimization and takes the form of an MSSP model. It is worth noting that the optimization of the MSSP model differs from that of the TSSP model: the decision is a function not only of the period t but also of the trajectory of uncertainties ξ_{[1,t]}. There is no longer only one decision in each period, but a set of candidate decisions corresponding to the outcomes of the uncertainties.
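The nested structure of (27)-(28) can be made concrete by Monte Carlo rollout: fix a day-ahead decision, then follow an intraday policy along sampled uncertainty trajectories, revealing ξ_t one period at a time. A minimal sketch, in which every callable is a placeholder we introduce for illustration:

```python
import numpy as np

def evaluate_2mssp(x0, C0, sample_trajectory, intraday_policy, stage_cost,
                   n_rollouts=100, T=24):
    """Estimate C0(x0) + E[sum_t C_t] for a fixed day-ahead decision x0."""
    totals = []
    for _ in range(n_rollouts):
        xi = sample_trajectory(T)        # one realization (xi_1, ..., xi_T)
        cost, memory = 0.0, None
        for t in range(T):
            # Non-anticipativity: the decision at t sees only xi[:t+1],
            # so each uncertainty outcome gets its own candidate decision.
            x_t, memory = intraday_policy(x0, xi[:t + 1], memory)
            cost += stage_cost(t, x0, x_t, xi[t])
        totals.append(cost)
    return C0(x0) + float(np.mean(totals))
```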

III. RISK-BASED RESERVE SCHEDULING FOR ADN BASED ON AN IMPROVED PPO ALGORITHM
In this section, an improved PPO algorithm is proposed for the two-timescale reserve scheduling problem of ADN based on CVaR constraints. First, to deal with the CVaR constraints, the node voltage constraints, and the tie-line power constraints, the constructed 2-MSSP problem is transformed into a CMDP. Second, a Lagrangian function is constructed to relax the original problem into an unconstrained optimization problem; the Lagrange multipliers and the auxiliary decision variables of the CVaR constraints are optimized simultaneously in the process of iteratively solving for the optimal policy. Third, to solve the 2-MSSP problem, policy functions are constructed for the day-ahead optimization and the intraday optimization. The policy functions are optimized based on the policy gradient (PG) algorithm, and the proximal policy update technique of the PPO algorithm is applied to the intraday policy function. Finally, to handle the coexistence of discrete and continuous control variables in the intraday optimization, an action value function is introduced to guide the optimization of the discrete control variables.

A. FORMULATING THE 2-MSSP PROBLEM AS A CMDP
DRL usually uses an MDP to reconstruct MSSP problems and introduces the concept of the value function to achieve inter-period decoupling of sequential decision problems. However, when the system model is reconstructed into state transition equations, some system operating constraints are difficult to handle. Therefore, this section transforms the two-timescale reserve scheduling problem for ADN based on CVaR constraints into a CMDP and constructs the corresponding value functions. While realizing the inter-period decoupling of the optimization problem, the system operation constraints and the CVaR constraints are satisfied.

1) RECONSTRUCT 2-MSSP PROBLEMS THROUGH CMDP
The CMDP, without considering the discount factor, is composed of the five-tuple of state s_t, action x_t, state transition probability P^π_{s_t s_{t+1}}, reward function r_t, and cost function c_t, as shown in (29). The state s_t is the core of the MDP and contains all the information needed to make decisions; it usually consists of auxiliary variables and random variables. To realize the inter-period decoupling of the proposed 2-MSSP problem, it is necessary to decouple both the inter-period coupling of the intraday optimization and the coupling between the day-ahead optimization and the intraday optimization. On the one hand, to decouple the inter-period coupling of the intraday optimization, auxiliary variables are constructed as shown in (30).

By introducing the auxiliary variables K^{CB}_{t-1} and K^{CB}_t, the no-after-effect property of the intraday decisions can be achieved. On the other hand, to decouple the coupling between the day-ahead optimization and the intraday optimization, the day-ahead decision result is introduced into the state as an auxiliary variable. Therefore, the reconstructed state is shown in (31).
With the help of the reconstructed state, the day-ahead and intraday policy functions are constructed as stochastic policies, as shown in (32)-(33).
The state transition probability P^π_{s_t s_{t+1}} represents the probability of the system transitioning from the state s_t to the next state s_{t+1} under the policy π. Since DRL is a model-free algorithm, this paper does not need to derive the specific expression of P^π_{s_t s_{t+1}}. The reward function r_t is usually the value of the objective function in a single period. The definitions of the reward functions are shown in (34)-(35).
To deal with model constraints that are difficult to handle within an MDP, the CMDP adds a cost function to the MDP. The constraints of the proposed 2-MSSP problem that need this additional processing include the node voltage constraints, the tie-line power constraints, and the CVaR constraints. The cost function c_t is constructed as shown in (36)-(40).

It can be seen that c^1_t corresponds to the node voltage constraints, c^2_t corresponds to the tie-line power constraints, and c^3_t and c^4_t correspond to the CVaR constraints.
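A sketch of the four cost signals as hinge penalties on their respective limits (the variable names are our own; the CVaR-related signals reuse the auxiliary function values of (8)-(9)):

```python
import numpy as np

def cost_signals(U, U_min, U_max, P_tie, P_tie_min, P_tie_max, H_plus, H_minus):
    """Per-period CMDP cost signals: zero when a constraint holds, positive otherwise."""
    c1 = float(np.sum(np.maximum(U - U_max, 0.0)       # node voltage violations
                      + np.maximum(U_min - U, 0.0)))
    c2 = max(P_tie - P_tie_max, 0.0) + max(P_tie_min - P_tie, 0.0)  # tie-line power
    c3, c4 = H_plus, H_minus   # CVaR terms, summed over t against beta_3 as in (7)
    return c1, c2, c3, c4
```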

2) CONSTRUCT THE VALUE FUNCTION
Value functions are used in DRL to balance the cost of the current period against that of future periods. First, we construct the value functions corresponding to the reward function r_t. The value functions corresponding to the intraday policy and the day-ahead policy are shown in (41)-(42).
where V^π_t is the value function corresponding to the intraday policy π, and V^{π_0,π}_0 is the value function corresponding to the day-ahead policy π_0.
Second, we construct the value functions corresponding to the node voltage constraint and the tie-line power constraint. The cost value functions corresponding to the intraday policy and the day-ahead policy are shown in (43)-(44).
where V^{c_k,π}_t is the cost value function corresponding to the intraday policy π, and V^{c_k,π_0,π}_0 is the cost value function corresponding to the day-ahead policy π_0.
By constructing the cost value function, the node voltage constraint and the tie line power constraint can be transformed as shown in (45).
where β_k is the allowable upper limit of node voltage violations and tie-line power violations. When β_k is set to 0, because of the non-negativity of the cost functions c^1_t and c^2_t, the node voltage constraint and the tie-line power constraint can be considered satisfied. As a result, the proposed 2-MSSP problem is transformed into (46).
Finally, to deal with the CVaR constraints, an auxiliary cost function c^H_t is constructed as shown in (47).

Furthermore, we construct the cost value functions corresponding to the auxiliary cost function under the intraday policy and the day-ahead policy, as shown in (48)-(49).

According to the properties of CVaR and its auxiliary function, under the premise of a given day-ahead and intraday policy, Equation (50) can be considered to hold.
As a result, the proposed 2-MSSP problem is transformed into (51).
In summary, by reconstructing the problem into a CMDP and constructing the cost value function corresponding to the cost function c t , it is possible to achieve the decoupling of the decision problem between periods and satisfy the system operation constraints and CVaR constraints.

B. PG ALGORITHM
The current mainstream DRL algorithms can be divided into two categories. The first category consists of algorithms based on Q-learning, with representative algorithms including the deep Q network (DQN), deep deterministic policy gradient (DDPG), and TD3. These algorithms transform the MSSP problem into a series of TSSP problems through inter-period decoupling, corresponding to the Bellman optimality equations. They solve for the optimal value function by iterative sampling and then use the optimal value function to determine the optimal policy. This type of algorithm takes the value function as its core, and the policy function is only used to assist in solving the nonlinear program. The second category consists of algorithms based on the PG algorithm, with representative algorithms including trust region policy optimization (TRPO), PPO, and SAC. These algorithms construct a parameterized policy function and use the gradient descent method to solve for the optimal policy directly. This type of algorithm takes the policy function as its core, and the value function serves as the loss function of the gradient descent method; moreover, unlike in Q-learning, the value function corresponds to the Bellman expectation equation. Similar to the Q-learning-based algorithms, PG also needs to address the curse of dimensionality in the state, action, and outcome spaces and to solve the optimization problem in a model-free manner [15]. The difference is that PG does not need to solve a series of TSSP problems: it directly optimizes the policy function parameters by gradient descent and gradually improves the policy toward the optimum, while its value function is used as a loss function to assist the iterative policy update. This section illustrates the basic mechanism of PG, so it is based on the conventional MDP rather than the previously proposed CMDP.

1) INTRODUCING THE PARAMETERIZED POLICY FUNCTIONS
PG takes the policy function parameterized by θ as the core, as shown in (52).
On the basis of the constructed parameterized policy function, PG uses the gradient descent method to minimize the loss function J(θ), update the parameters θ, and directly solve for the optimal policy π*, as shown in (53).

\theta^{n} = \theta^{n-1} - \alpha_\theta \nabla_\theta J(\theta) |_{\theta = \theta^{n-1}}   (53)

where α_θ is the update step size of the policy parameters θ. PG usually takes the initial state value function V^{π_θ}(s_0) as the loss function. Through the parameterized policy function, PG can deal with continuous state and action variables and overcome the curse of action dimensionality. At the same time, because PG directly uses the policy function to make decisions, it avoids computing the optimization returns and system state transitions in advance with the help of a system model during the entire iterative calculation, which is a model-free manner different from that of Q-learning.

2) UPDATE THE POLICY FUNCTION PARAMETERS
We take the initial state value function V^{π_θ}(s_0) as the loss function; its gradient with respect to the policy parameters is shown in (54). Reference [15] provides a detailed derivation of the policy gradient.
where ρ^π(s) = \sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi) is the stationary distribution of states s under π_θ; Q^{π_θ}(s, x) is the action value function; and A_t is the advantage function, which, owing to the expectation operation, can be equivalently replaced by a variety of expressions. By introducing the likelihood ratio, the policy gradient is transformed into an expectation that can be approximated by the average of sample experience, thereby overcoming the curse of outcome dimensionality.
We update the policy parameters using sampling iterations. Assuming that a 24-hour trajectory is sampled each time, the update method of the policy parameters is shown in (55).
where the superscript n represents the iteration number, and s^n_t and x^n_t are the sampled state and action of the nth iteration at stage t. By iteratively updating the policy parameters through sampling, the entire iterative calculation can be focused on the optimal subspace, and the curse of state dimensionality can be overcome.
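A minimal sampled update in the sense of (55), using a linear-Gaussian policy as our illustrative parameterization (the paper's policies are neural networks):

```python
import numpy as np

def gaussian_logpi_grad(theta, s, x, sigma=0.1):
    """grad_theta log pi_theta(x|s) for x ~ N(theta^T s, sigma^2)."""
    return ((x - theta @ s) / sigma**2) * s

def pg_update(theta, states, actions, advantages, alpha=1e-3):
    """One likelihood-ratio policy-gradient step over a sampled 24-period trajectory."""
    grad = np.zeros_like(theta)
    for s_t, x_t, A_t in zip(states, actions, advantages):
        grad += A_t * gaussian_logpi_grad(theta, s_t, x_t)
    return theta + alpha * grad   # ascend the return; flip the sign for a cost
```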

3) INTRODUCING THE PARAMETERIZED VALUE FUNCTIONS
This paper uses the temporal difference (TD) error as an unbiased estimate of the advantage function. Thus, the value function parameterized by φ is introduced, and the TD-error is defined as shown in (56).

\delta_t = r_t + V(s_{t+1} | \phi) - V(s_t | \phi)   (56)

Therefore, the update method of the policy parameters can be rewritten as shown in (57)-(58).

\theta^{n} = \theta^{n-1} + \alpha_\theta \sum_{t=1}^{T} \delta^n_t \nabla_\theta \ln \pi_\theta (x^n_t | s^n_t) |_{\theta = \theta^{n-1}}   (57)

\delta^n_t = r^n_t + V(s^n_{t+1} | \phi^{n-1}) - V(s^n_t | \phi^{n-1})   (58)

where δ^n_t is the TD-error of the nth iteration at stage t, and φ^n is the value function parameter of the nth iteration. The parameterized value function must also be iteratively updated along with the policy function. Its loss function and parameter update method are shown in (59)-(60).
where J^n_t is the loss function of the nth iteration at stage t. In PG, the value function is used only as a loss function to assist in updating the policy function and does not directly participate in the optimization.
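The critic update of (59)-(60) reduces to the following TD(0) step; the linear value function is our simplification of the paper's neural network critic:

```python
import numpy as np

def td0_update(phi, s_t, r_t, s_next, alpha_phi=1e-2, terminal=False):
    """One TD(0) step; no discount factor, consistent with the CMDP above."""
    v_next = 0.0 if terminal else phi @ s_next
    delta = r_t + v_next - phi @ s_t        # TD-error, the advantage estimate
    phi = phi + alpha_phi * delta * s_t     # semi-gradient descent on 0.5*delta^2
    return phi, delta
```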

C. CONSTRAINED PG ALGORITHM
This paper uses the same constraint-handling technique as the PDO algorithm. The original problem is relaxed into an unconstrained optimization problem by constructing a Lagrangian function. The Lagrange multipliers and the auxiliary decision variables of the CVaR constraints are optimized simultaneously in the process of iteratively solving for the optimal policy. This section first constructs the Lagrangian function and its corresponding auxiliary value function. Then, according to the gradient of the Lagrangian function with respect to each parameter, the update method for each parameter is derived. Finally, the preconditions and assumptions of the algorithm are given.

1) CONSTRUCTING THE LAGRANGIAN FUNCTION
Before constructing the Lagrangian function, we first construct the day-ahead policy function π^0_ϕ, parameterized by ϕ, and the intraday policy function π_θ, parameterized by θ, as shown in (61)-(62).
The optimization problem corresponding to (51) is transformed into an unconstrained problem using the Lagrangian relaxation method, and the corresponding Lagrangian function is shown in (63).
where λ_k is the Lagrange multiplier. When the optimization problem is transformed into the unconstrained optimization problem shown in (63), the loss function corresponding to PG also becomes L(l, ϕ, θ, λ). Direct optimization of this loss function would require one parameterized value function and three cost value functions, that is, four sets of parameters in total. Therefore, to simplify the parameter update process, the auxiliary reward function R_t and its corresponding auxiliary value functions are constructed as shown in (64)-(66).
where V^{L,l,π_θ,λ}_t is the auxiliary value function corresponding to the given l, λ, and intraday policy π_θ, and V^{L,l,π^0_ϕ,π_θ,λ}_0 is the auxiliary value function corresponding to the given l, λ, and day-ahead policy π^0_ϕ. Because the value function and the cost value functions constructed in this paper all satisfy the recursion, the auxiliary value function can be used directly as the loss function to guide the parameter updates. We introduce an auxiliary value function V^{L,l,π_θ,λ}_t parameterized by φ, as shown in (67).

The auxiliary value function V^{L,l,π^0_ϕ,π_θ,λ}_0 corresponds to only a single number, so the estimate shown in (68) is used.
Thus far, the loss function corresponding to the unconstrained optimization problem of (63) has been constructed.

2) PARAMETER UPDATE
The parameters that need to be updated for the unconstrained optimization problem of (63) include the auxiliary value function parameters φ and V_{L0}, the day-ahead and intraday policy parameters ϕ and θ, the auxiliary variables l, and the Lagrange multipliers λ. All of the above parameters are updated using gradient descent, assuming that M trajectories are sampled in each iteration.

where the superscript n represents the iteration number and the superscript m the trajectory number; δ^{L,n,m}_0 is the TD-error of the auxiliary value function V^{L,l,π^0_ϕ,π_θ,λ}_0 for the mth trajectory of the nth iteration; and δ^{L,n,m}_t is the TD-error of the auxiliary value function V^{L,l,π_θ,λ}_t for the mth trajectory of the nth iteration. For example, the gradient of the Lagrangian with respect to the multiplier λ_3 is

\nabla_{\lambda_3} L(l, \varphi, \theta, \lambda) = \beta_3 - V^{H,l,\pi^0_\varphi,\pi_\theta}_0   (83)

and α_λ is the update step size of the Lagrange multipliers λ.
To simplify the algorithm, this paper uses the sum of the sampled costs to replace the initial value of the cost value function, avoiding the establishment of additional parameterized cost value functions.
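The simultaneous update of the multipliers and the CVaR auxiliary variables then follows standard dual ascent; a sketch, under our convention that the violation of constraint k is measured as the sampled cost sum minus β_k:

```python
import numpy as np

def dual_updates(lam, l, sampled_cost_sums, betas, H_subgrad,
                 alpha_lam=1e-3, alpha_l=1e-3):
    """Ascend the Lagrangian in lambda and descend in the VaR auxiliaries l.
    sampled_cost_sums[k] stands in for the cost value function, as in the text;
    H_subgrad is a sampled subgradient of the auxiliary function w.r.t. l,
    i.e. 1 - Pr(C > l)/eps for the Rockafellar-Uryasev form (8)-(9)."""
    lam = np.maximum(lam + alpha_lam * (np.asarray(sampled_cost_sums)
                                        - np.asarray(betas)), 0.0)  # project to lam >= 0
    l = l - alpha_l * np.asarray(H_subgrad)
    return lam, l
```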
It is worth noting that in a strict policy iteration process, the value function should be iterated to convergence in each iteration, that is, to the value function corresponding to the given parameters l^n, λ^n, ϕ^n, and θ^n of the current iteration. However, in the actual calculation process, especially when the gap between the current policy and the optimal policy is large, this complete iterative calculation greatly slows down the algorithm. Therefore, this paper adopts a generalized policy iteration framework: in each iteration, the value function parameters are updated only a limited number of times to increase the solution efficiency of the algorithm.

3) PRECONDITIONS FOR THE ALGORITHM
It has been shown that this iterative parameter update guarantees convergence to a locally optimal and feasible solution when the following three assumptions hold [20], [28]. First, V^{L,l,π^0_ϕ,π_θ,λ}_0 is bounded for all policies π^0_ϕ and π_θ. Second, every local minimum of V^{c_i,π_0,π}_0 and V^{H,l,π_0,π}_0 is a feasible solution. Third, the update step sizes meet the requirements of (85)-(87).

D. PROXIMAL POLICY OPTIMIZATION
Among the series of algorithms based on the PG algorithm, PPO is recognized as the most representative algorithm.
To ensure the stability of the algorithm, PPO restricts the iterative update of the policy parameters to a trust region to guarantee policy improvement. The concept of the importance sampling rate is introduced to evaluate the degree of policy change, as shown in (88).
where ratio_t(θ) is the importance sampling rate with respect to the policy parameters θ. The importance sampling rate is the probability ratio of taking action x_t in state s_t under the old and new policy parameters, and it can be used to measure the degree of policy change. To limit the region of the policy update, the MT samples collected in each iteration are not all used in a single update; instead, they are used in multiple batches, and the policy is updated gradually. That is, an inner loop is embedded in the algorithm iteration: the policy is updated multiple times within each iteration, and the degree of policy change is evaluated. When the change of some action probabilities exceeds the limit, the update of those action probabilities is stopped for the remainder of the inner loop. In this way, the update range of the policy is limited. This paper uses one trajectory sample for each pass of the inner loop. It is worth noting that because the policy changes during the inner loop, the corresponding loss function can no longer use the auxiliary value function directly; an importance sampling correction must be applied to the loss function. The gradient of the modified intraday policy loss function with respect to the policy parameters, together with the intraday policy loss function, is shown in (89)-(90).
Similarly, the loss function of the day-ahead policy can be modified as shown in (91).
PPO achieves the purpose of limiting the update range of the policy by clipping the importance sampling rate within a certain range. The clipped loss functions of the intraday and day-ahead policy are shown in (92-93).
where clip indicates that the importance sampling rate is clipped to the range [1 − c, 1 + c]. The gradient of the clipped loss function of the intraday policy with respect to the policy parameters can be analyzed case by case, and the resulting update of the policy parameter θ is rewritten as shown in (110).
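The clipped surrogate of (92)-(93) can be sketched per sample as follows; the clipping is exactly what freezes the update of an action probability once its ratio leaves the trusted interval:

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantage, c=0.2):
    """PPO objective for one sample: min(ratio * A, clip(ratio, 1-c, 1+c) * A)."""
    ratio = np.exp(logp_new - logp_old)        # importance sampling rate (88)
    clipped = np.clip(ratio, 1.0 - c, 1.0 + c)
    # Outside [1-c, 1+c] on the advantageous side, the min selects the clipped
    # branch, whose gradient w.r.t. the new policy parameters is zero.
    return np.minimum(ratio * advantage, clipped * advantage)
```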
The summary of the proposed improved PPO algorithm is shown in Algorithm 1.

Algorithm 1 The Improved PPO Algorithm
Orthogonally initialize the day-ahead and intraday policy function networks, the auxiliary value function network, and the auxiliary action value function network with parameters ϕ, θ, φ, and ψ. Initialize the auxiliary variables l, the Lagrange multipliers λ, the auxiliary value V_{L0}, the confidence level 1 − ε, the cost tolerances β, and the maximum number of iterations N.
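Because the printed pseudocode shows only the initialization step, the following Python skeleton restates the full loop as we read it from Sections III-C and III-D; every function is a placeholder for the corresponding step in the text, not a library call.

```python
def improved_ppo(env, N, M, T=24):
    """Training skeleton: M sampled trajectories per iteration, a PPO inner
    loop over single trajectories, and dual updates of the multipliers."""
    phi, V_L0 = init_critics()                    # auxiliary value function (67)-(68)
    varphi, theta, psi = init_policies()          # day-ahead, intraday, action-value
    lam, l = init_multipliers(), init_var_auxiliaries()
    for n in range(N):
        x0 = sample_day_ahead(varphi)                              # policy (61)
        batch = [rollout(env, x0, theta, T) for _ in range(M)]     # intraday (62)
        batch = relabel_with_auxiliary_rewards(batch, lam, l)      # R_t of (64)
        phi, V_L0 = update_critics(phi, V_L0, batch)               # limited TD steps
        for traj in batch:                        # inner loop: one trajectory per pass
            theta = ppo_update(theta, traj, phi, clip=0.2)         # clipped (92)
            varphi = day_ahead_update(varphi, traj, V_L0)          # clipped (93)
        lam, l = dual_updates(lam, l, batch)      # ascend lambda, descend l
    return varphi, theta
```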

IV. CASE STUDY
The case study is implemented in Python 3.6 with the JetBrains PyCharm 2018 editor and solved with the help of TensorFlow 1.12.

A. SYSTEM SETTINGS
This paper uses the modified IEEE 33-bus test system shown in Fig. 1 to verify the feasibility and effectiveness of the proposed risk-based reserve scheduling for ADN based on the improved PPO algorithm. The locations of the PV, WT, and ESS units were selected by a bilevel programming model. The upper level considers multiple load levels and solves for the locations and capacities of PV, WT, and ESS with the objective of minimizing the investment cost. The lower level considers the charging and discharging process and the energy balance constraints of the ESS and further optimizes the ESS locations and capacities with the help of several typical daily scenarios. The locations and capacities of PV, WT, and ESS are finally obtained through the iterative calculation of the two levels. For the detailed parameters of the IEEE 33-bus test system, please refer to [29]. In Fig. 1, WT denotes wind power, and PV denotes photovoltaics. The specific parameters are shown in Appendix Tables 3-7. The electricity purchase price of the main grid adopts the data of reference [19]. λ^{R+} and λ^{R−} are 40/(MW·h), λ^{loss} is 100/(MW·h), λ^{RESC} is 400/(MW·h), and λ^{LL} is 2000/(MW·h). β_3 is set to 5000, and the confidence level 1 − ε is set to 0.95. The power reference value is 100 MVA. Considering that the area covered by the distribution system is not large, and to keep the result analysis simple, all PV units adopt the same prediction curve; the predicted DG output is shown in Fig. 2. The maximum load of the system is 3.715 MW + 2.3 Mvar, and the maximum total active output of renewable energy is 5.458 MW.
To characterize the uncertainty of renewable energy, it is assumed that the renewable energy output follows a normal distribution. The predicted value is taken as the expected value µ, and one-tenth of the expected value, 0.1µ, is taken as the standard deviation σ. The uncertainty of renewable energy is thus modeled as N(µ, (0.1µ)²). To ensure that the intraday operation scenarios selected in this paper are representative, the Latin hypercube sampling method is used to generate scenarios: at each stage, 10 samples are generated, and a scenario tree with different transition probabilities at each stage is constructed [30].
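A minimal Latin hypercube sampler for the assumed N(µ, (0.1µ)²) forecast-error model (implemented directly so that the stratification is explicit; truncating negative output at zero is our assumption):

```python
import numpy as np
from scipy.stats import norm

def lhs_normal_scenarios(mu, n_samples=10, sigma_ratio=0.1, seed=None):
    """Latin hypercube samples of renewable output ~ N(mu, (sigma_ratio*mu)^2)."""
    rng = np.random.default_rng(seed)
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    d = mu.size
    # One stratum per sample and dimension; strata are shuffled per dimension.
    strata = np.stack([rng.permutation(n_samples) for _ in range(d)], axis=1)
    u = (strata + rng.random((n_samples, d))) / n_samples   # stratified U(0,1)
    z = norm.ppf(u)                                         # inverse-CDF transform
    return np.maximum(mu + sigma_ratio * mu * z, 0.0)

scenarios = lhs_normal_scenarios(mu=[2.0, 1.5], n_samples=10, seed=1)
```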

B. CONVERGENCE
The improved PPO algorithm proposed in this paper uses a day-ahead policy function network, an intraday policy function network, an auxiliary value function network, and an auxiliary action value function network. All of the above networks are fully connected neural networks with three hidden layers. The hyperparameter settings of the learning algorithm are shown in Table 1.

The algorithm hyperparameters in Table 1 are the optimal values after tuning. Fig. 3 shows the convergence of the auxiliary reward sum. Because the value of the auxiliary reward function R_t is affected by the value of the Lagrange multipliers λ, at the beginning of the iteration process, when the policy is poor, the auxiliary reward sum rises rapidly together with the Lagrange multipliers. When the policy gradually improves and the constraints are satisfied, the Lagrange multipliers are reduced to 0, and the auxiliary reward sum becomes equal to the reward sum. Because of the large variation of the auxiliary reward sum, the convergence of the reward sum is also given, as shown in Fig. 4. Fig. 4 shows the total cost of the sampled scenarios during the iterative calculation, that is, the reward sum. Because the scenarios calculated in each iteration are randomly selected, oscillations remain after the values converge. Fig. 5 shows the convergence of the auxiliary value V_{L0} during the iterative process. Like the auxiliary reward sum, at the beginning of the iteration process, the auxiliary value increases rapidly under the influence of the Lagrange multipliers. When the Lagrange multipliers are reduced to 0, the auxiliary value gradually decreases and converges. At the same time, owing to the small step size, the trend of the auxiliary value is relatively stable.

C. DISPATCH RESULT
To demonstrate the control effect of the proposed algorithm, this part directly shows the day-ahead decision results, as shown in Fig. 6. For the intraday decision results, a scenario is randomly selected from the scenario tree as the actual output curve of the renewable energy. The charging and discharging power of the ESS and the SOC curves of the ESS are shown in Figs. 7-8.

Fig. 6 shows the 24-hour tie-line base power and reserve curves, where P^+_t = P^G_t + R^+_t and P^-_t = P^G_t - R^-_t define the allowable fluctuation range of the tie-line power. It can be seen from the base power curve that the presence of renewable energy greatly reduces the electricity purchased by the distribution system from the main grid. Additionally, during hours 11-14, when photovoltaic power is high, the system sells electricity to the main grid within the allowable voltage range to reduce the total power purchase cost. As can be seen from the reserve curve, the amount of reserve purchased is related to the degree of volatility of the renewable energy uncertainty. Even during hours 11-14, the distribution network still needs to purchase reserve from the main grid to handle the uncertainty of the renewable energy. Fig. 7 shows the 24-hour charging and discharging of the ESS, and Fig. 8 shows the 24-hour SOC change of the energy storage devices. ESS1 and ESS2 represent the ESSs integrated at buses 8 and 26, respectively. In real-time control, the ESS is used to deal with the rapid fluctuations of renewable energy that are difficult for traditional thermal units to handle because of ramp-rate limitations, to reduce the reserve pressure on the main grid, and to improve the flexibility of system operation. As can be seen from the figures, because the extracted scenario is not an extreme one, the ESS output is still conservative and leaves some adjustment margin as a reserve, which can reduce the reserve pressure of the tie line to a certain extent. Additionally, in the day-ahead optimal dispatch, the reverse power flow can be reduced by absorbing power, thereby suppressing the voltage rise caused by renewable energy to a certain extent and further accommodating renewable energy. During hours 10-14, when photovoltaic output is high, the ESSs absorb power to limit the voltage level of the distribution network. During the low-voltage hours 6-8 and 18-20, the ESSs discharge to raise the system voltage. During hours 0-3, when the load is light and the electricity price is low, the ESSs absorb power to ensure sufficient stored energy.

D. PERFORMANCE COMPARISON
To reflect the effectiveness of the proposed algorithm, its performance is compared with that of the traditional TSSP algorithm, in which the second-order cone relaxation technique is used to convexify the power flow equations [31]. Table 2 compares the performance of the proposed algorithm and the traditional TSSP algorithm. The total cost of the proposed algorithm is 3.9% lower than that of the traditional TSSP algorithm, the energy cost is reduced by 3.6%, and the reserve cost is reduced by 9.7%. The proposed algorithm effectively reduces the curtailment of renewable energy and load shedding, and the network loss is also slightly reduced. The TSSP model used in traditional ADN reserve scheduling does not consider the temporal causality of uncertainty, so its estimate of the impact of uncertainty is too optimistic. The MSSP model corresponding to the intraday optimization can more accurately and realistically simulate the actual intraday operation scenarios and thus realizes the coordinated optimization of the tie-line reserve and the energy storage on two timescales. At the same time, a single set of TSSP decisions must hedge against all of the day's uncertainties, giving up part of the renewable energy to ensure that the decision is feasible in all scenarios. The MSSP captures the dynamics of the unfolding uncertainties over time, and its decisions can be adaptively adjusted according to different outcomes of renewable energy. Compared with TSSP, the optimization of MSSP is more targeted, accommodates the uncertain renewable energy more effectively, and reduces the electricity purchase cost. It should be noted that the calculation time of MSSP is much longer than that of TSSP.

V. CONCLUSION
Achieving the coordinated optimization of tie-line reserve and energy storage on two timescales is a key issue in reserve scheduling for ADN. However, the TSSP model used in traditional ADN reserve scheduling does not consider the temporal causality of uncertainty, and its estimate of the impact of uncertainty is too optimistic. In this regard, this paper constructs the reserve scheduling for ADN as a 2-MSSP model. The MSSP model corresponding to the intraday optimization can more accurately and realistically simulate the actual intraday operation scenarios and thus realizes the coordinated optimization of the tie-line reserve and the energy storage on two timescales. At the same time, the algorithm proposed in this paper reconstructs the optimization problem as a CMDP based on the PPO algorithm and constructs a Lagrangian function to further relax it into an unconstrained optimization problem, which can effectively deal with the CVaR constraints. This paper considers only the coordination of energy storage and the tie-line reserve; however, demand response is also an important component of ADN reserves, and we leave this problem for future research.

APPENDIX
The parameters of WT, PV, ESS and grouping switching CB are as follows: