Coordination for Multienergy Microgrids Using Multiagent Reinforcement Learning

Multienergy microgrids (MEMGs) have significant potential to offer high energy utilization efficiency and system flexibility. The coordination of these MEMGs is challenging because of the various system dynamics and uncertainties and the need to preserve privacy. This article proposes a double auction (DA)-market-based coordination framework in which MEMGs can not only schedule their own energy components but also trade energy with others in the DA market. We then formulate this problem as a Markov game and propose a multiagent reinforcement learning method that uses the public information of the DA market to enhance training stability while preserving privacy. Case studies involving a real-world scenario validate the superior performance of the proposed method in reducing both the energy costs and the carbon emissions.


I. INTRODUCTION
A. Background and Motivation
Power systems are undergoing a significant transition from fossil fuel resources to renewable energy resources (RES) in pursuit of decarbonization, promising to address environmental concerns [1]. However, the less controllable and predictable RES introduce new challenges to power system planning and operation [2]. In this respect, there has been a significant increase in the development of multienergy systems (MESs) that couple electricity, gas, and heat, constituting a significant opportunity to provide the flexibility of shifting across multiple energy vectors and resulting in a cost-effective and reliable system [3]. Increasing attention has recently been paid to studying the MES inside a microgrid, forming the multienergy microgrid (MEMG) [4], [5]. An MEMG is composed of various energy loads, generators, storages, and converters under the microgrid concept. The benefits of using the MEMG have been discussed in many studies [5]. Instead of scheduling each energy vector independently, the integrated approach deals more efficiently with the complementary and synergistic effects of the MES, therefore boosting the operation efficiency of the MEMG.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Gas and electricity are the two main input energy sources for MEMGs. The gas retail market is normally indifferent to MEMGs, allowing them to buy gas but not sell it back [6]. The deregulated electricity retail market is more active and flexible: MEMGs with RES can sell electricity back to the grid at the feed-in tariff (FiT) [7]. However, when MEMGs need to import energy from the grid, the time-of-use (ToU) prices, which are higher than the FiT issued by the same utility company, can present a dilemma for the MEMGs' net import decision making [8]. Furthermore, when MEMGs participate in the traditional market, they act independently to manage their supply-demand balance. This is not optimal, as the lack of coordination with others leaves untapped the full potential of energy flexibility for achieving the overall system supply-demand balance [9]. To this end, efficient coordinated management of local MEMGs is urgently needed to maximize the economic benefit and the system flexibility.

B. Literature Review
So far, the existing literature on the coordinated management of multiple MEMGs can be classified into two categories. The first one focuses on the design of a centralized framework that employs a central operator to manage all the local resources [10]. Although such a framework provides a theoretical solution for social welfare maximization, it exhibits various drawbacks in practice. Specifically, the central operator needs to acquire mathematical models and collect all the technical parameters of local resources, thereby raising privacy concerns. The second one focuses on the design of a decentralized framework that allows the MEMGs to manage their own resources independently with limited information exchange, preserving their privacy. Currently, alternating direction method of multipliers [11], [12], Lagrangian relaxation [13], [14], consensus algorithm [15], and bilateral contract [16] are popular methods in the decentralized framework for solving the coordination management of multiple MEMGs. However, the optimality of solutions is not guaranteed under such a decentralized framework without a central coordinator [10].
To this end, a double auction (DA) market [17] is a framework that combines the advantages of the centralized and decentralized frameworks and thus has the potential to form local coordination among a group of MEMGs. More specifically, an auctioneer, as a third-party coordinator, is responsible for clearing the market to ensure market efficiency, which is close to the optimum of a centralized framework [18]. On the other hand, MEMGs can manage their resources independently and submit only the bidding information (i.e., price-quantity bids) to the auctioneer. As such, privacy can be preserved in a manner similar to the decentralized framework. However, MEMGs in the DA market face a complex quotation decision process, so selecting an appropriate trading strategy in such a complicated market environment is challenging. Zero intelligence (ZI) is a fundamental trading strategy adopted by traders in the DA market [19]. Specifically, ZI draws the price bid uniformly at random between FiT and ToU and solves a day-ahead self-optimization problem to obtain the quantity bid submitted to the DA market. However, the randomized price bid does not capture the market dynamics [20]. Furthermore, the preoptimized quantity bid requires complete MEMG mathematical models, technical parameters, and accurate forecasts of uncertainties, which are generally impractical to obtain in real-world applications [21].
In view of the above drawbacks of the ZI strategy, reinforcement learning (RL) [22] offers a model-free and data-driven control method for sequential decision-making problems, in which the agents within MEMGs gradually learn the optimal trading strategies from the experiences acquired through repeated interactions with the environment (MEMGs and DA market), without prior knowledge of the MEMGs. In addition, RL as an online learning method can make use of the increasing data acquired from the environment to learn the optimal control strategies and to cope with the uncertainties that are encapsulated in the data [23].
Previous works have successfully applied various RL methods to energy management problems in power systems, as reviewed in [24]. The majority of them, however, only consider the energy management problem of a single entity, e.g., a smart energy hub [25] and a residential multienergy home [23], and employ single-agent reinforcement learning methods. On the other hand, research on the application of multiagent reinforcement learning (MARL) to power systems is still sparse, particularly for the MEMG coordination management problem studied here. The most straightforward approach to solving a multiagent problem is independent reinforcement learning (IRL), in which each agent trains its own control policy based on local information. Independent deep Q-network [26] and independent deep deterministic policy gradient (IDDPG) [27] have been applied to the energy management problems of multiple MGs, where each agent treats others as part of the environment and learns its own policy without considering others' policies. However, directly applying IRL methods to a multiagent setting is problematic, since the environment appears nonstationary from the view of every agent [28]. To overcome this issue, multiagent deep deterministic policy gradient (MADDPG), an extension of IDDPG to a multiagent setting, has been proposed to address the energy trading problem among microgrids [29]. Each agent in MADDPG trains a centralized Q-value function (critic) with access to all agents' observations and actions to stabilize the training performance. During execution, the decentralized actor of each agent makes decisions based on its local observation. However, MADDPG mainly suffers from the following: 1) a privacy concern: each agent must know the local observations and actions of all the other agents; and 2) a stability concern: the learned Q-values may be overestimated, which can lead to suboptimal policies [30].

C. Article Contributions
To address the privacy and instability limitations discussed above, this article proposes a novel MARL method for multiple MEMGs to provide autonomous control and trading policies for local energy coordination in a DA market. Specifically, the contributions are as follows.
1) The flexibility due to the local electricity trading among different MEMGs and the coupled energy conversions in each MEMG is explored. The examined problem is complex because of various system dynamics and uncertainties. A DA-market-based coordination framework is proposed to obtain good performance with privacy preservation. To the best of our knowledge, this is the first work to adopt the DA market mechanism in a local energy community with multiple MEMGs.
2) A novel DA-MATD3 method is proposed, which inherits the ability of the multi-agent twin delayed deep deterministic policy gradient (MATD3) to perform well in a multiagent environment with various system dynamics and uncertainties and addresses privacy concerns using a DA market framework. Specifically, the DA-MATD3 method integrates the key information of the DA market into the state-of-the-art MATD3 algorithm by connecting the critic networks of the agents with the DA market order books. To the best of our knowledge, this is the first work to integrate the DA market information into the MATD3 algorithm.

D. Article Organization
The rest of this article is organized as follows. Section II formulates the examined coordination problem of multiple MEMGs in a DA market. Section III proposes the DA-MATD3 method. Section IV presents the case studies to evaluate the effectiveness of the proposed method. Finally, Section V concludes this article.

A. Problem Setting
We focus on a local energy community consisting of a group of MEMGs, as depicted in Fig. 1. In detail, the set of components of the proposed MEMGs includes: 1) two types of consumption loads: electric load (EL) and heat load (HL); 2) two types of RES generators: solar photovoltaic (PV) and wind generator (WG); 3) two types of storage units: electric energy storage (EES), and thermal energy storage (TES); and 4) four types of energy converters: combined heat and power (CHP) engine, fuel cell (FC), electric heat pump (EHP), and gas boiler (GB). The MEMGs are categorized into three groups: 1) residential MEMGs with the energy portfolio of EL, HL, PV, EES, TES, FC, and GB; 2) commercial MEMGs with the energy portfolio of EL, HL, PV, EES, TES, EHP, and GB; and 3) industrial MEMGs with the energy portfolio of EL, HL, WG, EES, TES, CHP, and GB.
In order to incentivize MEMGs to cooperatively participate in local trading, we introduce a DA market driven by its high trading efficiency [17]. As shown in Fig. 1, the options of each MEMG to supply its consumption loads are diverse. First, MEMGs can manage their own installed energy resources to supply EL and HL. Second, MEMGs can trade their electricity with each other in the DA market. Third, MEMGs are allowed to buy/sell their unbalanced electricity with the utility company at the grid buy/sell prices. Finally, MEMGs can purchase natural gas from the gas grid. The decision-making problem is processed for each hour across a daily horizon, with the objective of minimizing energy cost and carbon emission. At each hour, each microgrid central controller (MGCC) [31] equipped in an MEMG can manage its energy schedules and trading decisions based on: 1) the grid information of energy and carbon price signals; 2) the local information of its consumption loads, renewable generations, and the status of controllable components; and 3) the community information of DA market trading prices and quantities.

B. Multienergy Microgrids
This section provides the detailed mathematical models of the four energy converters (CHP, FC, EHP, and GB) and the two energy storage units (EES and TES).
1) Energy Converters: CHP, as a single-input multioutput converter, is characterized by its high energy efficiency compared to independent electricity and heat sources. Its coupled heat and electricity generation is modeled by (1)-(3), where constraints (1) and (2) indicate the efficiency of CHP in converting natural gas into electric and heat power, respectively, and the gas input is limited by its power capacity as expressed in (3). Like CHP engines, an FC is also a single-input multioutput converter, characterized by its higher combined efficiency and lower emissions. Given its high thermal efficiency and low operating temperature, the FC is more suitable for individual residents with high heat demands. The mathematical model of the FC is similar to the CHP model (1)-(3). Apart from CHP and the FC, the studied MEMGs also include the EHP and the GB as energy converters. The EHP produces heat energy by consuming electricity, as presented in (4). The GB is a vessel that converts natural gas into heat energy; the generation of heat from natural gas via the GB is given in (6). The power inputs of the EHP and the GB are limited by their individual capacities, as expressed in (5) and (7), respectively.
2) Energy Storage Units: The energy storage units offer high flexibility: they can redistribute load between off-peak and peak periods and absorb free RES for future use when energy prices are at their peak. The mathematical model of an EES unit includes equality (8), which corresponds to the storage dynamic transition of battery energy content, taking into account the charging and discharging energy losses, and constraint (9), which expresses the lower and upper bounds of battery energy content. The following constraints (10) and (11)

C. DA Market
The DA market matches multiple buyers (MEMGs with an energy deficit) and sellers (MEMGs with an energy surplus) who are interested in local trading and is deemed a highly efficient mechanism [17]. It is widely used in the trading of a variety of commodities, including equities and electricity. In this article, we apply the DA market to local electricity trading, while heat energy cannot be traded in the community. In general, a DA market lasts for a fixed period of time, known as the auction period (1 h). It allows traders to submit their bids/offers at the beginning of each auction period; then, the auctioneer (DA market operator) clears the market and publishes the public market outcomes (trading prices and quantities) at the end of each auction period. More specifically, a DA market comprises the following: 1) a set of buyers $B$, where each buyer $b \in B$ defines its trading price $p_b$ and quantity $q_b$, meaning that buyer $b$ would like to buy $q_b$ amount of energy at price $p_b$; 2) a set of sellers $S$, where each seller $s \in S$ defines its trading price $p_s$ and quantity $q_s$, meaning that seller $s$ would like to sell $q_s$ amount of energy at price $p_s$; and 3) a public order book managed by an auctioneer, where all the accepted bids and offers are listed. Bids submitted by buyers are sorted by decreasing buy price and queued in the buy order book $k^b(b, p_b, q_b)$, while offers submitted by sellers are sorted by increasing sell price and queued in the sell order book $k^s(s, p_s, q_s)$.
Algorithm 1: DA Market Clearing Algorithm.
1: Collect price-quantity bids/offers at auction period $t$
2: Allocate order books $k^b_t(b, p_{b,t}, q_{b,t})$ and $k^s_t(s, p_{s,t}, q_{s,t})$ at auction period $t$
3: Initialize $b = s = 1$
4: while $p_{b,t} \geq p_{s,t}$ do
5:   match the trading energy: $q^l_t = \min(q_{b,t}, q_{s,t})$
6:   calculate the trading price: $p^l_t = (p_{b,t} + p_{s,t})/2$

The pseudocode of the DA market clearing process is given in Algorithm 1. Once an auction period begins, traders submit their order information, consisting of a trading price and a corresponding energy quantity, to the market, where it is collected by the auctioneer (step 1). All the submitted orders are allocated in the order books (step 2). The clearing process iterates down the order books and attempts to match each buy order with a sell order (steps 3-12) until the buy price is less than the sell price or no unmatched order remains (steps 13 and 14). Specifically, when two orders are matched, the auctioneer sets the trading price between the matched buy price and sell price using the traditional midpricing method [17] (step 6), while the trading quantity is equal to the lower of the two matched quantities (step 5). Owing to the sorting principle and the clearing algorithm, the clearing results maximize the social welfare [17]. Finally, at the end of the auction period, the remaining quantity of energy and the unmatched orders are balanced with the utility company at the grid electricity prices. It should be noted that the submitted prices of all the traders are bounded between the grid sell (FiT) and buy (ToU) prices, which guarantees that trading in the DA market is economically beneficial compared with trading directly with the utility company [21].
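The clearing loop described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions rather than the article's implementation: the `Order` record, the function name `clear_double_auction`, the tolerance `EPS`, and the returned data layout are all hypothetical, and the midpoint price is taken as the simple average of the matched buy and sell prices.

```python
from dataclasses import dataclass, replace


@dataclass
class Order:
    trader: str      # hypothetical trader label
    price: float     # submitted price, $/kWh
    quantity: float  # remaining quantity, kWh (non-negative)


EPS = 1e-12  # tolerance for treating a quantity as exhausted (assumption)


def clear_double_auction(bids, offers):
    """Match buy and sell orders; return trades and unmatched quantities."""
    # Copy the orders so the caller's lists are not modified.
    buys = sorted((replace(o) for o in bids), key=lambda o: -o.price)
    sells = sorted((replace(o) for o in offers), key=lambda o: o.price)
    trades = []
    b = s = 0
    # Walk down both books while the best buy price covers the best sell price.
    while b < len(buys) and s < len(sells) and buys[b].price >= sells[s].price:
        qty = min(buys[b].quantity, sells[s].quantity)
        price = 0.5 * (buys[b].price + sells[s].price)  # midpoint pricing
        if qty > EPS:
            trades.append((buys[b].trader, sells[s].trader, price, qty))
        buys[b].quantity -= qty
        sells[s].quantity -= qty
        # Advance past any order that has been fully matched.
        if buys[b].quantity <= EPS:
            b += 1
        if sells[s].quantity <= EPS:
            s += 1
    # Leftovers would be settled with the utility company at grid prices.
    unmatched_buys = {o.trader: o.quantity for o in buys if o.quantity > EPS}
    unmatched_sells = {o.trader: o.quantity for o in sells if o.quantity > EPS}
    return trades, unmatched_buys, unmatched_sells
```

Each returned trade is a tuple of buyer, seller, cleared price, and cleared quantity; the two dictionaries hold the quantities that would be balanced with the utility company.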

D. Energy Coordination as the Markov Decision Process
The above-introduced DA market can be formulated as a multiagent coordination problem in the form of a finite partially observable Markov decision process (POMDP) [22] with discrete time steps. The POMDP is defined by a set of states $\mathcal{S}$ describing the global state of the environment $E$ (DA market), a collection of local observations $\{O_{1:I}\}$, a collection of action sets $\{A_{1:I}\}$, a collection of reward functions $\{R_{1:I}\}$, and a state transition function $\mathcal{T}(s, a_{1:I}, \omega)$, where $\omega$ is the environment stochasticity representing uncertain parameters. The time interval between two consecutive time steps is one auction period ($\Delta t = 1$ h). At time step $t$, each agent $i$ chooses an action $a_{i,t}$ according to its policy $\pi_i(a_{i,t} | o_{i,t})$ conditional on its local observation $o_{i,t}$ and executes $a_{i,t}$ in the environment $E$. The environment then moves into the next state according to the transition function $\mathcal{T}$. Each agent $i$ obtains the reward $r_{i,t}$ and the next local observation $o_{i,t+1}$. The objective of each agent $i$ is to maximize the cumulative discounted reward $R_i = \sum_{t=1}^{T} \gamma^{t-1} r_{i,t}$, where $\gamma \in [0, 1]$ is the discount factor and $T$ is the daily horizon of 24 h. In detail, the components of the POMDP for the proposed coordination problem are defined as follows.
1) Agents: An agent is a computation entity within each MGCC of the MEMG, who can directly manage the controllable components in each MEMG and the trading strategies in the DA market.
2) Environment: The environment includes MEMGs defined in Section II-B, and the DA market defined in Section II-C.
3) Observation: Each MGCC agent $i$ at time step $t$ receives a local observation $o_{i,t}$ that varies across MEMG categories and consists of two parts: 1) the exogenous state, which is unaffected by the action and includes the sensor data of prices (the grid electricity buy and sell prices, the gas price, and the carbon price) as well as the measured data of the consumption loads $L_{i,t} = [P^l_{i,t}, Q^l_{i,t}]$ representing EL and HL, the renewable generation of PV $P^{pv}_{i,t}$, and WG $P^{wg}_{i,t}$; and 2) the endogenous state, which serves as the feedback signal of the agent's executed action and represents the system dynamics, including the energy content of EES and TES, $E^{ees}_{i,t}$ and $E^{tes}_{i,t}$.
4) Action: Each MGCC agent $i$ at time step $t$ selects an action $a_{i,t}$ that varies across MEMG categories and consists of two parts: 1) the price decision $a^p_{i,t} \in [0, 1]$ representing the magnitude of the willing price submitted to the DA market as a ratio of the FiT and ToU price difference [...]
5) State Transition: The state transition from time step $t$ to $t+1$ follows $s_{t+1} = \mathcal{T}(s_t, a_{1:I,t}, \omega_t)$, influenced by the combination of the environment state $s_t$, all agents' actions $a_{1:I,t}$, and the environment stochasticity $\omega_t$. In the examined problem, the stochasticity corresponds to the exogenous states $\omega_t = [L_{1:I,t}, P^{pv}_{1:I,t}, P^{wg}_{1:I,t}]$, which are decoupled from the agents' actions and are characterized by inherent variability. In the machine learning area, RL translates this problem to a data-driven approach that learns the stochastic characteristics directly from the data sources [22].
By contrast, the state transitions of the endogenous states $S^{ees}_{i,t}$ and $S^{tes}_{i,t}$ are determined by actions $a^{ees}_{i,t}$ and $a^{tes}_{i,t}$, respectively. Taking EES as an example, the mutually exclusive charging and discharging powers $P^{eesc}_{i,t}$ and $P^{eesd}_{i,t}$ are managed by action $a^{ees}_{i,t}$ and are also restricted by the minimum/maximum energy levels $\underline{E}^{ees}_i$, $\overline{E}^{ees}_i$ and the charging/discharging efficiencies $\eta^{eesc}_i$ and $\eta^{eesd}_i$, as expressed in (14) and (15), where $[\cdot]^{+/-} = \max/\min\{\cdot, 0\}$. Given the charging and discharging powers $P^{eesc}_{i,t}$ and $P^{eesd}_{i,t}$ and the efficiencies $\eta^{eesc}_i$ and $\eta^{eesd}_i$, the state transition of $E^{ees}_{i,t}$ from $t$ to $t+1$ is expressed in (16). The charging and discharging powers $Q^{tesc}_{i,t}$ and $Q^{tesd}_{i,t}$ as well as the state transition $E^{tes}_{i,t}$ of TES can be derived in a similar manner to the EES model (14)-(16).
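To make the action-to-power mapping concrete, the following Python sketch shows one plausible way a single storage action could be turned into mutually exclusive charging and discharging powers and then into the next energy level. It is an illustration under explicit assumptions, not the article's equations: the action range of [-1, 1], the power rating `p_max`, the sign convention (discharging power is non-positive), and the function name `ees_step` are all hypothetical.

```python
def ees_step(energy, action, p_max, e_min, e_max, eta_c, eta_d, dt=1.0):
    """Map one storage action to (charge power, discharge power, next energy).

    Assumptions (not from the source): action lies in [-1, 1], positive means
    charging, and the discharging power is reported as a non-positive number.
    """
    request = action * p_max  # requested power, kW
    # Charging is capped by the headroom left below the maximum energy level.
    p_charge = max(min(request, (e_max - energy) / (eta_c * dt)), 0.0)
    # Discharging is capped by the energy available above the minimum level.
    p_discharge = min(max(request, (e_min - energy) * eta_d / dt), 0.0)
    # Charging losses reduce what is stored; discharging losses increase
    # what must be withdrawn from the store.
    next_energy = energy + eta_c * p_charge * dt + p_discharge * dt / eta_d
    return p_charge, p_discharge, next_energy
```

Because the same requested power feeds both the positive-part and negative-part operators, at most one of the two returned powers is nonzero, and the next energy level stays within the assumed bounds.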
To this end, the electricity quantity $q_{i,t}$ submitted to the DA market by each agent $i$ at time step $t$ is expressed in (17) as the sum of its individual electric demand and supply powers, where a positive value represents the electricity demand to buy, while a negative value represents the electricity generation to sell in the DA market. After collecting the price-quantity bids $(p_{i,t}, q_{i,t})$ from all the participating agents, the auctioneer allocates the order books $k^b_t(i, p_{i,t}, q_{i,t}), \forall i \in B$, and $k^s_t(i, p_{i,t}, q_{i,t}), \forall i \in S$, clears the DA market (see Algorithm 1), and publishes the market outcomes $[p^l_{1:I,t}, q^l_{1:I,t}, q^g_{1:I,t}, k^b_t, k^s_t]$, which comprise: 1) the local information of the cleared trading price $p^l_{i,t}$, the cleared trading quantity $q^l_{i,t}$, and the remaining/unmatched quantity balanced with the utility company $q^g_{i,t}$ for each agent $i$; and 2) the public market information of the updated order books $k^b_t$ and $k^s_t$.
6) Reward Function: The reward function for each agent $i$ at time step $t$ consists of two parts: 1) the energy and environment costs and 2) the penalty imposed to avoid constraint violations of the MES operation model. Specifically, agents that are successfully matched in the DA market receive the cleared local trading price $p^l_{i,t}$ and quantity $q^l_{i,t}$, from which each agent $i$ can calculate its electricity cost/revenue in the DA market, and the remaining/unmatched quantity $q^g_{i,t}$ is bought from or sold to the utility company at the ToU price $\lambda^b_t$ or the FiT $\lambda^s_t$. For agents that are not matched in the DA market, the whole quantity $q^g_{i,t} = q_{i,t}$ (i.e., $q^l_{i,t} = 0$) is traded directly at $\lambda^b_t$ or $\lambda^s_t$.
As a result, the reward term corresponding to the electricity cost for each agent $i$ at time step $t$ accounts for both the local trade at $p^l_{i,t}$ and the remaining quantity traded with the utility company at $\lambda^b_t$ or $\lambda^s_t$. Furthermore, the reward terms corresponding to the gas cost and the environment cost outside the DA market for each agent $i$ at time step $t$ are computed from the gas quantity purchased from the natural gas grid, which varies for the three kinds of MEMGs. According to (17), the electricity demand and supply in each MEMG can always be balanced through the internal system together with the external DA market at each time step. However, the heat demand and supply may not be balanced, since extra heat cannot be sold back to the grid. More specifically, the power schedules of the components (i.e., FC, GB, EHP, CHP, and TES) controlled by actions only respect their individual operation models (e.g., power capacity); they do not ensure that the heat demand and supply are always balanced. The main cause of this issue is that the RL algorithm selects the action dimensions independently, which removes the coupling that an optimization-based approach would enforce. To adequately account for the operation constraint of the heat demand-supply balance, we introduce a penalty term $r^p_{i,t}$ for each agent in the reward function, which penalizes the extent of violation of the heat demand-supply balance constraint, with $\kappa$ denoting a large (negative) penalty weighting factor to ensure its feasibility, as given in (20). Thus, the final reward function $r_{i,t}$ of each MGCC agent $i$ at time step $t$ combines the electricity, gas, and environment cost terms with the penalty term.
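The structure of such a reward can be sketched in a few lines of Python. This is a hedged illustration only: the function name `agent_reward`, its argument names, the choice to attribute carbon emissions to gas consumption alone, and the use of the absolute heat imbalance in the penalty are assumptions made for the example and are not taken from the article's equations.

```python
def agent_reward(q_local, p_local, q_grid, tou, fit, gas, gas_price,
                 carbon_price, emission_factor, heat_supply, heat_demand,
                 kappa):
    """Return a per-step reward as negative cost plus a heat-balance penalty.

    Sign convention (assumed): positive quantities are purchases, negative
    quantities are sales, and kappa is a negative weighting factor.
    """
    # Residual electricity is bought at ToU or sold at FiT.
    grid_price = tou if q_grid > 0 else fit
    electricity_cost = p_local * q_local + grid_price * q_grid
    gas_cost = gas_price * gas
    # Assumption: emissions are attributed to natural gas use only.
    carbon_cost = carbon_price * emission_factor * gas
    # Penalize any mismatch between heat supply and heat demand.
    penalty = kappa * abs(heat_supply - heat_demand)
    return -(electricity_cost + gas_cost + carbon_cost) + penalty
```

With a large negative `kappa`, any heat imbalance dominates the cost terms, which pushes the learned policy toward schedules that respect the heat balance.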

III. PROPOSED MARL METHOD
To solve the POMDP defined above, we propose a novel MARL method named DA-MATD3 with its general flowchart being shown in Fig. 2. DA-MATD3 derives three concrete implementation details that are insightful and particularly critical to our proposed MEMG energy management coordination problem: 1) learning an abstracted Q-value function for each agent through the DA market public order books to protect the private information of each MEMG; 2) forming an actor-critic architecture to handle the high-dimensional continuous state and the action spaces of the MEMGs; and 3) taking advantage of double critic networks in the twin delayed deep deterministic policy gradient (DDPG) (TD3) algorithm [32] to address the Q-value overestimation problem, thereby stabilizing the training performance.

A. Abstracted Q-Value Function
As discussed in Section I-B, it is challenging to directly acquire the local observations and actions of other agents in our problem, since the MEMGs are not willing to share their energy portfolios, technical parameters, and energy usage behaviors. This article, thus, assumes that the agents can use the public order books, which epitomize the key information of the DA market (thereby abstracting all agents' price-quantity bid information), in the centralized training process. This design protects the privacy of all the agents. To this effect, we approximate the centralized Q-value as $Q_i(o_{1:I}, a_{1:I}) \approx Q_i(o_i, a_i, k_i)$, where $k_i = \{k^b_j, k^s_j, \forall j \in I \setminus \{i\}\}$ denotes the combination of the buy and sell order books of all the agents other than agent $i$ in the DA market. $k_i$ is an embedded function of the order books $k^b_j$ and $k^s_j$ that not only abstracts all other agents' observations (e.g., $E^l_j$, $P^{pv}_j$, and $P^{wg}_j$) as well as their actions, i.e., the price bids $a^p_j$ and the quantity bids resulting from their energy decisions (e.g., $a^{ees}_j$, $a^{fc}_j$, $a^{ehp}_j$, and $a^{chp}_j$), but also reflects the DA market dynamics of local trading activities. As a result, this combination provides a good approximation of the agents' observations and actions as well as the DA market dynamics. By incorporating $k_i$ into the critic estimation, each agent can make informed decisions that account for the impact of other agents' actions without knowing their energy portfolios and usage activities, which protects the privacy of each MEMG.
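One way to picture the abstracted critic input is as a plain feature vector built from an agent's own observation and action plus the other agents' order book entries. The Python sketch below is illustrative only: the tuple layout `(trader, price, quantity)`, the fixed-length zero padding controlled by `max_orders`, and the function names are assumptions made for this example, since the article does not specify how the order books are encoded.

```python
def abstract_order_books(buy_book, sell_book, agent_id, max_orders):
    """Flatten other agents' (price, quantity) entries to a fixed length."""
    features = []
    for book in (buy_book, sell_book):
        # Keep only the entries submitted by the other agents.
        entries = [(p, q) for trader, p, q in book if trader != agent_id]
        entries = entries[:max_orders]
        # Zero-pad so the critic always sees the same input size (assumption).
        entries += [(0.0, 0.0)] * (max_orders - len(entries))
        for price, quantity in entries:
            features.extend([price, quantity])
    return features


def critic_input(obs, action, buy_book, sell_book, agent_id, max_orders):
    """Concatenate local observation, local action, and order book features."""
    books = abstract_order_books(buy_book, sell_book, agent_id, max_orders)
    return list(obs) + list(action) + books
```

Only bid prices and quantities enter the vector, so no agent's component states or technical parameters are exposed to the critic of another agent.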

B. MATD3
MATD3 [30], an extension of TD3 to multiagent setup, addresses the stability concern that occurred in conventional MADDPG by three key features: 1) using a pair of critics that estimate the current Q-value via a separate target value function; 2) updating the policy less frequently (delayed update) than the Q-value function; and 3) smoothing the target policy by using a (noise) regularization technique.
1) Twin Critic Networks: The overestimation bias in the conventional MADDPG method has been discussed in [30]. Inspired by double Q-learning [33], which uses a separate target Q-value function to estimate the current Q-value and thus reduces the bias, we introduce for each agent $i$ two separate online critic networks ($Q_{i,1}$ and $Q_{i,2}$) parameterized by $\theta_{i,1}$ and $\theta_{i,2}$, along with two target critic networks ($Q'_{i,1}$ and $Q'_{i,2}$) parameterized by $\theta'_{i,1}$ and $\theta'_{i,2}$. Each target critic then yields its own target value for the critic update. However, the two estimates are generally not equal, and the higher one may be overestimated. Therefore, we make a slight change to double Q-learning and take the minimum of the two estimates as the target Q-value for each agent $i$. With this improvement, MATD3 trains two critic networks simultaneously and picks the minimum of their values, thus alleviating the overestimation phenomenon.
2) Delayed Policy Updates: Another potential failure in MADDPG is the variance, which generates noisy gradients during the policy update, thus slowing down the update speed and leading to poor performance [30]. Similar to MADDPG, MATD3 also introduces target networks to achieve stability. Apart from this, the algorithm also delays the actor network update until the critic network has been updated a fixed number of times. As such, the updates of the actor and critic networks are decoupled, i.e., the actor network is updated at a lower frequency than the critic network, so that an accurate Q-value is obtained before it is used to update the policy. This less frequent policy update relies on a Q-value estimate with lower variance, resulting in better policy performance.

3) Target Policy Smoothing Regularization: The final technique of MATD3 is smoothing the target policy. Deterministic policies tend to produce a high-variance target when updating the critic; this is caused by overfitting to narrow peaks in the Q-value estimate [30]. MATD3 reduces this variance by adding clipped Gaussian noise $\epsilon = \mathrm{clip}(\mathcal{N}(0, \sigma^2), -c, c)$ to the actions in the critic update: $a'_i = \mu'_i(o_{i,t+1}) + \epsilon$. This serves as a regularization, such that all the actions within this small area have similar Q-values, thereby reducing the variance in the associated estimations. The complete target, then, resolves to $y_i = r_{i,t} + \gamma \min_{j=1,2} Q'_{i,j}(o_{i,t+1}, a'_i, k_{i,t+1})$, where $\mu'_i$ and $Q'_{i,j}$ denote the target actor and target critic networks.
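The two ingredients of the target, clipped noise on the target action and the minimum over two critic estimates, can be sketched in plain Python. This is a minimal illustration with scalar Q-values passed in directly; the function names are hypothetical, and in practice the two Q-values would come from the target critic networks evaluated at the smoothed action.

```python
import random


def clipped_noise(sigma, c, rng):
    """Draw Gaussian noise with std sigma and clip it to [-c, c]."""
    return max(-c, min(c, rng.gauss(0.0, sigma)))


def smoothed_action(target_action, sigma, c, rng):
    """Add independent clipped noise to every dimension of the target action."""
    return [a + clipped_noise(sigma, c, rng) for a in target_action]


def clipped_double_q_target(reward, gamma, q1_next, q2_next):
    """Bootstrap from the smaller of the two target critic estimates."""
    return reward + gamma * min(q1_next, q2_next)
```

Taking the minimum biases the target downward, which counteracts the overestimation discussed above, while the clipped noise keeps the smoothed action close to the target policy's output.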

C. Training Process
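DA-MATD3 is an off-policy MARL method that requires past experiences to update the networks. To this end, an experience replay buffer $D_i$ is employed for each agent $i$. The buffer is a cache storing the past experiences of agent $i$ acquired from the environment, where an experience is a transition tuple $(o_{i,t}, a_{i,t}, r_{i,t}, k_{i,t}, o_{i,t+1}, k_{i,t+1})$. At each time step $t$, we uniformly sample a minibatch of $N$ experiences from each agent's corresponding replay buffer $D_i$ to compute the mean-squared temporal difference (TD) error of the two online critic networks as
$$L(\theta_{i,j}) = \frac{1}{N} \sum \left( y_i - Q_{i,j}(o_{i,t}, a_{i,t}, k_{i,t}) \right)^2, \quad j = 1, 2$$
where
$$y_i = r_{i,t} + \gamma \min_{j=1,2} Q'_{i,j}(o_{i,t+1}, a'_i, k_{i,t+1}), \quad a'_i = \mu'_i(o_{i,t+1}) + \epsilon.$$
The online actor network employs the delayed update after $d$ critic updates, and its policy gradient can be expressed as
$$\nabla_{\phi_i} J(\phi_i) \approx \frac{1}{N} \sum \nabla_{\phi_i} \mu_i(o_{i,t}) \, \nabla_{a_i} Q_{i,1}(o_{i,t}, a_i, k_{i,t}) \big|_{a_i = \mu_i(o_{i,t})}.$$
The target networks of the two critics and the actor are also updated with the same delay, after $d$ critic updates, as
$$\theta'_{i,j} \leftarrow \tau \theta_{i,j} + (1 - \tau) \theta'_{i,j}, \quad j = 1, 2, \qquad \phi'_i \leftarrow \tau \phi_i + (1 - \tau) \phi'_i$$
where $\tau$ is the soft update rate for the target networks. Moreover, in order to help the agents explore the environment and acquire more valuable experiences, we add random Gaussian noise $\mathcal{N}(0, \sigma_t^2)$ to the online policy $\mu_i(o_{i,t})$ of each agent $i$, constructing an exploration policy
$$\hat{\mu}_i(o_{i,t}) = \mu_i(o_{i,t}) + \mathcal{N}(0, \sigma_t^2).$$
Finally, the overall training process of the proposed DA-MATD3 is summarized in Algorithm 2.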
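The bookkeeping around the network updates (replay storage, uniform sampling, soft target updates, and the delayed actor schedule) can be sketched without any deep learning library. The Python code below is a simplified illustration: weights are represented as flat lists of floats, and the names `ReplayBuffer`, `soft_update`, and `actor_update_due` are hypothetical helpers rather than part of the article's implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity cache of transition tuples for a single agent."""

    def __init__(self, capacity):
        # The oldest experiences are discarded once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, books, next_obs, next_books):
        self.storage.append((obs, action, reward, books, next_obs, next_books))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored experiences.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau toward its online counterpart."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]


def actor_update_due(step, d):
    """Return True on the steps where the delayed actor update is applied."""
    return step % d == 0
```

With `d = 2`, for example, the critics are updated at every step while the actor and the target networks are refreshed only on every second step.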

A. Experimental Setup and Implementation 1) Experiment Setup:
We implement experiments on a realworld dataset recorded from Open Energy Data Initiative [34] and RWTH Aachen University [35]. We collect the corresponding EL, HL, and PV and wind power of residential, commercial, and industrial users with hourly resolution for our experiments. Then, these energy users can be classified and aggregated into three MEMGs. To further account for the uncertainties, we add the Gaussian noise [zero mean and 5% standard deviation (std)] to the original collected data as the train set, while using the original collected data as the test set. The operating parameters of MEMGs' controllable components are derived from [36]. ToU tariff [37] selected as the grid electricity buy price varying for the time: 0.1129 $/kWh at 20:01-17:00 (next day) and 0.2499 $/kWh at 17:01-20:00. FiT as the grid electricity sell price, natural gas price, and carbon price are flat over the day at 0.04 $/kWh [38], 0.0338 $/kWh [39], and 0.0316 $/kg [40], respectively. The averaged carbon emission of using natural gas is 0.245 kg CO 2 /kWh [40]. Algorithm 2: DA-MATD3 for I Agents.
1: Initialize weights $\theta_{i,1}$, $\theta_{i,2}$, and $\phi_i$ for the online networks and copy them to the target network weights $\theta'_{i,1}$, $\theta'_{i,2}$, and $\phi'_i$ for each agent $i$
2: Initialize replay buffer $D_i$ for each agent $i$
3: for episode (i.e., trading day) = 1 to $M$ do
4:   Initialize the environment $E$ and Gaussian noise $\mathcal{N}(0, \sigma_t^2)$
5:   for time step (i.e., 1 h) $t$ = 1 to $T$ do
6:     For agent $i$, select action $a_{i,t} = \hat{\mu}_i(o_{i,t})$ in (33)
7:     Execute actions $a_{1:I,t}$ in the DA market, then observe reward $r_{i,t}$, next observation $o_{i,t+1}$, and order books $k_{i,t+1}$
8:     Update local observations for next time step $o_{i,t} \leftarrow o_{i,t+1}$
10:    for agent $i$ = 1 to $I$ do
11:      Sample uniformly a minibatch of $N$ experiences from $D_i$
12:      Compute critic target value in (28)
13:      Update two online critic networks in (26) and (27)
14:      if $t$ mod $d$ = 0 then
15:        Update online actor network in (29)

We compare the proposed DA-MATD3 with the conventional ZI strategy and three state-of-the-art MARL methods: IDDPG, MADDPG, and MATD3. To further evaluate the benefit of the energy coordination architecture, we also benchmark against a scenario in which each MGCC agent trades independently with the utility company using DDPG, without MEMG energy coordination (UDDPG).
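The price parameters of the experiment setup can be collected in a small helper, sketched below. The hour-ending indexing (hour 18 covers 17:01-18:00) and the function names are assumptions chosen for illustration; only the numerical values come from the setup described above.

```python
# Price parameters quoted in the experiment setup.
TOU_OFFPEAK = 0.1129       # $/kWh, 20:01-17:00 (next day)
TOU_PEAK = 0.2499          # $/kWh, 17:01-20:00
FIT = 0.04                 # $/kWh, grid sell price
GAS_PRICE = 0.0338         # $/kWh
CARBON_PRICE = 0.0316      # $/kg
CARBON_INTENSITY = 0.245   # kg CO2 per kWh of natural gas

def tou_buy_price(hour_ending: int) -> float:
    """Grid buy price for the hourly interval (hour_ending - 1, hour_ending].

    Assumption: hours are indexed 1..24 by the end of the interval."""
    if not 1 <= hour_ending <= 24:
        raise ValueError("hour_ending must be in 1..24")
    return TOU_PEAK if 18 <= hour_ending <= 20 else TOU_OFFPEAK

def gas_cost(gas_kwh: float) -> float:
    """Fuel cost plus carbon cost of burning gas_kwh of natural gas."""
    return gas_kwh * (GAS_PRICE + CARBON_INTENSITY * CARBON_PRICE)

if __name__ == "__main__":
    print([tou_buy_price(h) for h in (17, 18, 20, 21)])
    print(round(gas_cost(100.0), 4))
```

Under these values, the carbon charge adds roughly 0.0077 $/kWh to the gas price, so the effective cost of gas remains well below the off-peak electricity price.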
3) Implementations and Hyperparameter Selections: For all five examined MARL methods, we use the Adam optimizer [41] for both the actor and critic networks with the same learning rate $\alpha = 10^{-3}$ [30]. The sizes of the replay buffer $D$ and the minibatch $N$ are $10^{5}$ and $10^{2}$ [30], respectively. We employ a soft update rate $\tau = 10^{-2}$ [30] and a discount rate $\gamma = 0.9$. The delay step is $d = 2$ [30] for DA-MATD3. For all the networks, we use multilayer perceptrons (MLPs) with two hidden layers of 400 and 300 units, respectively. The sigmoid activation function is used at the actor output layer, and the outputs are then scaled linearly to the individual action spaces. For all the examined methods, we run $5 \times 10^{3}$ episodes to evaluate their training performance, with ten random seeds for both environment and network initialization. The values of the hyperparameters $\alpha$, $\tau$, and $d$ were set based on the original MATD3 paper [30]. A grid search [42] was used to select the value of $\gamma$ that gave the best performance.
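As an illustration of the actor architecture described above, the following NumPy sketch builds a two-hidden-layer MLP (400 and 300 units) with a sigmoid output that is scaled linearly to the action bounds. The ReLU hidden activation, the weight initialization, and the function names are assumptions, since the text specifies only the layer sizes and the output activation.

```python
import numpy as np

# Values quoted in the implementation details.
HYPERPARAMS = {"lr": 1e-3, "buffer_size": 10**5, "batch_size": 10**2,
               "tau": 1e-2, "gamma": 0.9, "policy_delay": 2,
               "hidden": (400, 300), "episodes": 5 * 10**3, "seeds": 10}

def init_actor(obs_dim, act_dim, rng, hidden=HYPERPARAMS["hidden"]):
    """Return a list of (weight, bias) pairs; the initialization is an assumption."""
    sizes = (obs_dim, *hidden, act_dim)
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def actor_forward(params, obs, low, high):
    """Map an observation to an action inside [low, high]."""
    x = obs
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)          # ReLU hidden layers (assumed)
    w, b = params[-1]
    s = 1.0 / (1.0 + np.exp(-(x @ w + b)))      # sigmoid output in (0, 1)
    return low + s * (high - low)               # linear scaling to the action space

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_actor(obs_dim=6, act_dim=2, rng=rng)
    low, high = np.array([-1.0, 0.0]), np.array([1.0, 5.0])
    print(actor_forward(params, np.ones(6), low, high))
```

Because the sigmoid output lies strictly between 0 and 1, the scaled action always respects the component limits without any additional projection step.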

B. Performance Evaluation
We compare the training performance of the five examined MARL methods and the conventional ZI strategy, and then evaluate them on the test set. Specifically, Fig. 3 illustrates the convergence curves of the episodic reward of the three MEMGs for the different control methods, where the solid lines and the shaded areas depict, respectively, the moving average over 50 episodes and the oscillations of the reward during the training process. The mean and std of the three MEMGs' aggregated reward at convergence are compared in Fig. 4. Furthermore, the energy (electricity and gas) costs and carbon emissions for the test dataset are presented in Table I for comparison.
Our first observation from Fig. 3 is that all five MARL methods show an upward trend, meaning that their policies improve during training, even for the UDDPG method, which does not exploit the benefits of energy coordination. On the other hand, IDDPG, the most straightforward MARL method, exhibits the largest oscillations and unstable learning behavior, ultimately failing to reach an optimal policy (it yields the highest carbon emission). As discussed in Section III-A, this is because IDDPG focuses on local information while ignoring the others' behaviors, rendering the environment dynamics nonstationary. MADDPG and MATD3, with centralized training, effectively mitigate this nonstationarity and exhibit superior training performance. Furthermore, MATD3, owing to its double critic networks (more accurate Q-value estimation), achieves a higher reward than MADDPG. However, both methods suffer from a privacy issue, since the centralized critic requires all other agents' local observations and actions. In contrast, our proposed DA-MATD3 method learns the DA market dynamics directly by extracting the others' observations and actions from the DA market public order books.
In addition, the performance of the traditional ZI strategy during the training process is illustrated in Fig. 3. ZI, as a static control method, shows no upward trend; its reward remains flat over the 5000 episodes.
The mean and std of the aggregated rewards of the three MEMGs are quantified in Fig. 4. The figure shows that DA-MATD3 performs best, achieving the highest reward among all six control methods. DA-MATD3 also has a lower std than MATD3, MADDPG, IDDPG, and ZI, indicating that it is more effective in stabilizing the training performance. UDDPG obtains a much lower reward than DA-MATD3, even though its std is lower than that of DA-MATD3. The reason is that UDDPG does not consider a DA market; therefore, the economic benefits of energy coordination cannot be obtained. The test results presented in Table I are consistent with the training results in Fig. 3. The proposed DA-MATD3 achieves 7.31%, 6.50%, 6.25%, 4.67%, and 2.52% lower total energy costs and carbon emissions than UDDPG, ZI, IDDPG, MADDPG, and MATD3, respectively.
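The statistics reported above (a 50-episode moving average, the mean and std across random seeds, and relative cost reductions) can be computed as in the following sketch. The trailing-window convention, the use of the population std, and the function names are assumptions, and the numbers in the usage example are synthetic rather than taken from the article.

```python
import numpy as np

def moving_average(rewards, window=50):
    """Moving average over full windows only (output is shorter than input)."""
    rewards = np.asarray(rewards, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")

def seed_statistics(final_rewards):
    """Mean and population std of the converged reward across random seeds."""
    arr = np.asarray(final_rewards, dtype=float)
    return float(arr.mean()), float(arr.std())

def percent_reduction(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    curve = np.linspace(-100.0, -60.0, 500) + rng.normal(0.0, 5.0, 500)  # synthetic rewards
    print(moving_average(curve)[:3])
    print(seed_statistics([-61.0, -59.5, -60.2]))
    print(percent_reduction(200.0, 150.0))
```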

C. Analysis of Multienergy Management
To further validate the policies learned by DA-MATD3 on the test set, we provide the energy management schedules of the three MEMGs for both the electric and heat supplies in Fig. 5. The residential MEMG features abundant PV production during the mid-day hours, high EL peaks during the night hours, and a relatively flat HL profile. Owing to its high combined electricity and heating generation efficiency, the FC learns to supply both EL and HL over the day, apart from the mid-day hours when PV is available. Furthermore, the MGCC learns to use the flexibility of the storage units (EES and TES) to charge when energy prices are low or PV is abundant and to discharge when the energy price is high or HL is at its peak. Finally, the GB acts as a backup component to supply HL when the FC is not in use.
Similar to the residential MEMG, the commercial MEMG also features abundant PV, but its HL is concentrated during the daytime. Without a converter from natural gas to electricity, the electricity grid and PV are the major sources of EL supply. The EHP is used to supply HL during the mid-day hours by converting free PV electricity into heat, while the EES and TES also exhibit their flexibility by charging with cheap and free energy and discharging during the peak demand hours. Finally, the GB in the heat sector supplies the remaining HL.
Unlike the residential and commercial MEMGs, the industrial MEMG installs a WG, and its energy usage mainly focuses on EL. It can be observed that the abundant WG production supplies EL, charges the EES, and feeds the surplus to the grid to obtain extra revenue. The electricity grid partly supplies EL during the mid-day hours, when wind output is low. In the heating sector, the CHP accounts for the major proportion of the HL supply, while the TES learns to discharge to reduce CHP usage when energy prices are high.
It can be concluded that the proposed DA-MATD3 is able to learn effective energy management policies for all three MEMGs in response to various price signals, demand patterns, and renewable outputs. In addition, the complementary effect among the multienergy vectors (the interaction between electric and heat supplies) is also verified by the above analysis.

D. Benefits of Energy Coordination
Having demonstrated the superiority of the DA-MATD3 method over the state-of-the-art MARL methods and analyzed the energy schedules of the three MEMGs, this section compares the trading strategies under the dynamic DA-MATD3 method with those under the static ZI policy and quantifies the benefits of energy coordination among the three MEMGs. Fig. 6 shows the net load (positive for consumption and negative for generation) of the three MEMGs under UDDPG, which has no energy coordination, and under ZI and DA-MATD3, which both have energy coordination but follow different trading strategies. The dashed lines, serving as baselines, represent the aggregated load of electric demand and renewable generation. Fig. 7 illustrates the local trading quantities and the average trading prices under the ZI and DA-MATD3 methods.
When energy coordination is allowed in the DA market, MEMGs with an energy surplus or deficiency are incentivized to trade locally. As a result, compared with UDDPG, the generation and demand of the three MEMGs in Fig. 6 are both reduced under ZI and DA-MATD3, since a portion of the energy is balanced locally in the DA market, which is also confirmed in Fig. 7. The figure shows that the DA-MATD3 method trades more frequently and in greater quantities than the ZI method, for the following reasons.
1) For the DA-MATD3 method, the agents are trained to select suitable trading prices, so that the buyers and the sellers can close more trading deals. For the ZI method, the trading prices of the MEMGs are chosen randomly within the range between the FiT and the ToU, which limits how often the trading deals succeed.
2) For the DA-MATD3 method, the agents are more likely to trade larger quantities in the DA market to reduce their costs, since each agent considers the others' trading strategies. For the ZI method, each MEMG decides its energy trading quantity without considering the trading strategies of the other MEMGs.
TABLE II: COMMUNITY DAILY INTERNAL, EXTERNAL TRADING QUANTITIES, AND ENERGY COSTS UNDER UDDPG, ZI, AND DA-MATD3 METHODS
More importantly, compared with the nonstrategic sampling behavior of the ZI method, the MGCC agents under DA-MATD3 learn to trade a large amount of energy locally, thereby reducing their dependence on the utility company. These results are also validated in Table II: 1) there is no internal trading under UDDPG, so the net demand and generation (7382 kWh in total) are all bought at the high ToU and sold at the low FiT; 2) ZI achieves a total cost saving of $89 through 1929 kWh of internal trading within the DA market; and 3) DA-MATD3 achieves the lowest total energy cost with the highest internal trading of 7263 kWh. In relative terms, DA-MATD3 achieves 2.82/1.76 times lower external trading with the utility company (a better local demand-generation balance) and 30.65%/20.54% lower energy cost (greater economic benefits of local trading) than the UDDPG/ZI methods.
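The effect of random ZI pricing on the number of successful deals can be illustrated with a generic double-auction clearing sketch. This is not the article's market mechanism: the price-priority matching with midpoint pricing, the function names, and the default price range (the FiT and the off-peak ToU price) are all assumptions made for illustration.

```python
import random

FIT, TOU_OFFPEAK = 0.04, 0.1129  # $/kWh; default range for ZI prices (assumed)

def zi_price(rng, low=FIT, high=TOU_OFFPEAK):
    """Zero-intelligence order price: uniform draw between the FiT and the ToU."""
    return rng.uniform(low, high)

def match_orders(bids, asks):
    """Match (price, quantity) buy and sell orders by price priority.

    Highest bids meet lowest asks while bid >= ask; each trade clears at the
    midpoint price (an assumed rule). Returns a list of (price, quantity)."""
    bids = [list(o) for o in sorted(bids, key=lambda o: -o[0])]
    asks = [list(o) for o in sorted(asks, key=lambda o: o[0])]
    trades, i, j = [], 0, 0
    while i < len(bids) and j < len(asks) and bids[i][0] >= asks[j][0]:
        qty = min(bids[i][1], asks[j][1])
        trades.append((0.5 * (bids[i][0] + asks[j][0]), qty))
        bids[i][1] -= qty
        asks[j][1] -= qty
        if bids[i][1] == 0:   # buy order fully filled
            i += 1
        if asks[j][1] == 0:   # sell order fully filled
            j += 1
    return trades

if __name__ == "__main__":
    rng = random.Random(0)
    bids = [(zi_price(rng), 5.0), (zi_price(rng), 3.0)]
    asks = [(zi_price(rng), 4.0), (zi_price(rng), 4.0)]
    print(match_orders(bids, asks))
```

With randomly drawn prices, a bid often falls below the cheapest ask and no deal is struck, whereas agents that learn to price consistently above the asks (as buyers) or below the bids (as sellers) clear more energy locally.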

V. CONCLUSION
This article proposed a novel MARL method to address the energy coordination problem of MEMGs trading locally in a highly efficient DA market, which incentivizes MEMGs to participate in local trading through economic benefits. The examined MEMGs, featuring various demand and renewable characteristics, were categorized into residential, commercial, and industrial MEMGs. The proposed MARL method, named DA-MATD3: 1) constructs the centralized critic by abstracting the others' observations and actions from the DA market public information, thereby preserving the MEMGs' privacy while capturing the market dynamics and 2) uses a pair of critic networks to overcome the Q-value overestimation issue and stabilize the training performance. The effectiveness of the proposed DA-MATD3 method was evaluated through simulations using a real-world setting. Specifically, the proposed method achieved superior performance in reducing both energy costs and carbon emissions compared with the conventional ZI strategy and the state-of-the-art MARL methods. Finally, the trading strategies and outcomes were analyzed to show the significant economic benefits that the community gains from more internal energy trading among the three MEMGs within the DA market.
Future work aims to extend the proposed method in two directions. First, the DA market introduced in this article focuses on electricity trading. Future work will explore a new market mechanism enabling multienergy trading within a local MEMG community. Second, although this article focuses on a local energy community, the proposed method can be extended to a larger and wider energy community with the following changes: 1) in the system model, transmission losses need to be considered, since long-distance transmission incurs energy losses; 2) the matching algorithm in the DA market should take distance into consideration when matching a buyer and a seller; and 3) distribution network constraints need to be considered, since different distribution networks often have different constraints, such as transformer and line limitations, phase unbalance, and voltage stability.