Model-Based Approach on Multi-Agent Deep Reinforcement Learning With Multiple Clusters for Peer-To-Peer Energy Trading

The peer-to-peer (P2P) energy trading system has the potential to transform the current household energy system by sharing energy among residents. As the number of customers employing distributed energy resources (DERs) such as rooftop solar increases, innovation in the double auction (DA) market system is becoming more significant. In this paper, a novel model-based, multi-agent asynchronous advantage actor-centralized-critic with communication (MB-A3C3) approach is carried out. Previous studies are limited in that they suffer from the unpredictable behavior of renewable energy resources and the large number of prosumers in the peer-to-peer market. For the model-based strategy, we forecast the trading price and trading quantity in the daily energy trading system to overcome this unpredictability. For the large number of prosumers, a multi-agent, multithreaded RL has been chosen as our backbone since prosumers' behavior can be diverse; time-series clustering is introduced based on their daily trading behavior. With its environmental model and multithreaded mechanism, MB-A3C3 proves most efficient in carrying out tasks with regard to both time and precision. The model is run on a large-scale, real-world hourly 2012–2013 dataset of 300 households with rooftop solar systems installed in Sydney, New South Wales (NSW), Australia. Results reveal that the MB-A3C3 approach outperforms other reinforcement learning methods (MADDPG and A3C3), producing lower community energy bills for the 300 households. When internal trade (trading among houses) increased and external trade (trading with the grid) decreased, our multi-agent RL (MB-A3C3) significantly lowered energy bills by 17%. In closing the gap between real-world and theoretical problems, the algorithms herein aid in reducing customers' electricity bills.


I. INTRODUCTION
The energy sector is constantly innovating. In recent years, however, it has been continually disrupted by the ''four Ds'' of energy: decarbonisation, decentralisation, digitalisation, and democratisation. Multi-agent systems (MASs) can deal with grid disruptions caused by renewable energy sources and the system's widely dispersed nature [1]. In the energy economy, more effort is needed to establish a comprehensive system for the market's volatile structure.
The associate editor coordinating the review of this manuscript and approving it for publication was Mouloud Denai.
Digitalization of the energy sector entails greater use of technology, data and advanced systems to better manage energy.
Peer-to-peer (P2P) energy trading involves a participant submitting bids to a trading system in which a market operator manages transactions based on the available data, the quantity required, and the price. Then, employing the double-sided auction approach, all orders are matched; traders can specify the quantity and price at which they want to trade within the boundaries of the prices set directly by the grid [24], [25]. Based on previous studies of the P2P market, traders in the double auction (DA) market frequently use a zero intelligence (ZI) trading strategy [26], [27], [28], [29], [30]. The order price that ZI traders determine is a random surplus offset from its value within a particular range, e.g., between FiT and ToU. Considering the strategies of all participants along with trading prices, energy supply, and energy consumption, the energy market is incredibly dynamic. Many previous works have addressed the DA market as an optimization problem using the reinforcement learning (RL) framework [30], [31], [32], [33], [34], [35], [36], [37].
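As a minimal sketch of the ZI strategy described above, the order below draws a uniformly random price between the FiT and ToU grid bounds, so any matched trade is no worse than trading with the grid. The function name, default tariff values, and quantity cap are illustrative assumptions, not values from this paper:

```python
import random

def zi_order(role, fit=0.08, tou=0.30, qty_max=5.0, rng=random):
    """Zero-intelligence order: price drawn uniformly between the
    feed-in tariff (FiT) and time-of-use (ToU) grid prices.
    role: 'buy' or 'sell'; quantity bounds are illustrative."""
    price = rng.uniform(fit, tou)          # random surplus offset within grid bounds
    quantity = rng.uniform(0.0, qty_max)   # energy requested/offered (kWh)
    return {"role": role, "price": price, "quantity": quantity}
```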
Deep reinforcement learning (DRL) is a subfield of machine learning that combines RL with deep learning. DRL is a fully automated approach that uses a range of inputs from current energy markets to determine maximum profits for intraday market bidding [38], [39], single-sided energy markets [40], and power trading competitions [41]. In the P2P market, traditional Q-learning has been applied as a management algorithm to maximize profits through participation in P2P energy trading [42], [43]. The deep Q-learning (DQN) algorithm based on an LSTM model has also been used to analyze time-dependent information, and the deep deterministic policy gradient (DDPG) has been put forward to probe strategic bidding in the energy market [44], [45], [46].
To obtain optimum learning for multi-agent decision-making in dynamic and uncertain environments, multi-agent reinforcement learning (MARL) algorithms for collaborative Markov decision processes (MDPs) have been introduced and examined [47]. In energy trading, MARL is capable of optimizing and reducing costs [31], [32], [33]. Thus, the multi-agent deep deterministic policy gradient (MADDPG) has been enhanced to improve peer-to-peer energy trading in the double auction market [30], [34], [35], [36], [37]. A3C3, which outperforms MADDPG, has been introduced as a distributed asynchronous actor-critic algorithm in a multi-agent setting with differentiable communication and a centralized critic [48].
Recently, model-based reinforcement learning (MBRL) has demonstrated promising results in a variety of domains, yielding superior bidding strategies. In conjunction with MBRL, the Dyna architecture has been used to improve interaction with a modeled environment through learning and planning based on both real-world and simulated experience [49]. On the multi-joint dynamics with contact (MuJoCo) benchmark, advanced MBRL algorithms have been able to optimize the reward function [50], [51], [52]. In the energy sector, MBRL has notably been applied to wind energy bidding for a single-agent system, achieving minimized energy costs [53]. Neither MBRL nor A3C3 has yet been applied to P2P energy trading tasks, but such algorithms may successfully outperform standard benchmarks [54], [55].
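The Dyna loop mentioned above (learn a model from real transitions, then replay simulated transitions from that model for extra planning updates) can be sketched on a toy problem. The 5-state chain environment, hyperparameters, and `dyna_q` helper below are illustrative assumptions, not this paper's setup:

```python
import random

def dyna_q(episodes=50, planning_steps=10, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Toy Dyna-Q sketch: a 5-state chain where action 1 moves right
    (reward 1 on reaching the goal) and action 0 moves left. Each real
    transition updates Q and a learned model; the model then replays
    simulated experience for extra planning updates."""
    rng = random.Random(seed)
    n_states, goal = 5, 4
    Q = [[0.0, 0.0] for _ in range(n_states)]
    model = {}  # (s, a) -> (r, s'): the learned environment model

    def step(s, a):
        s2 = max(0, s - 1) if a == 0 else min(goal, s + 1)
        return (1.0 if s2 == goal else 0.0), s2

    for _ in range(episodes):
        s = 0
        while s != goal:
            # epsilon-greedy with random tie-breaking
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                best = max(Q[s])
                a = rng.choice([x for x in (0, 1) if Q[s][x] == best])
            r, s2 = step(s, a)
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            model[(s, a)] = (r, s2)                  # learn the model
            for _ in range(planning_steps):          # plan from simulated experience
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                Q[ps][pa] += alpha * (pr + gamma * max(Q[ps2]) - Q[ps][pa])
            s = s2
    return Q
```

The planning loop is what distinguishes model-based from model-free learning here: most value updates come from the learned model rather than from real environment steps.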
In this paper, a novel multi-agent deep reinforcement learning algorithm called ''the model-based asynchronous advantage actor-centralized-critic with communication (MB-A3C3)'' is introduced. MB-A3C3 investigates P2P energy trading among solar-installed households as a multi-agent decision-making model for both competitive and cooperative tasks. Contributions are summarized as follows:
• MADRL: multi-agent deep reinforcement learning. This technique can effectively handle complex data and many agents because it utilizes a deep learning architecture and a multithreaded framework with communication channels. To reduce training time, it can also be scaled horizontally.
• Agent's daily trading behavior clustering: According to previous research, little consideration has been shown towards prosumers' behavioral traits. This problem is addressed by classifying prosumers into clusters based on their daily trading habits.
• Model-based framework: The model-based concept has been integrated with MADRL, viz. ''MB-MADRL''. MB-MADRL can tackle the problem of a lack of local knowledge by allowing agents to build a functional representation of their environment. MBRL was built on the Dyna architecture with a multivariate LSTM to forecast the entire environmental state 24 hours ahead, allowing for better policy execution than model-free reinforcement learning (MFRL). With a robust forecasting technique, MB-A3C3 represents each cluster by a centralized environmental model; this information enables MB-A3C3 to optimize processes accurately.
This study sets out to forecast the trading price and trading quantity in the daily energy trading system. Through the application of RL algorithms, we use the data to predict both trading price and trading quantity; little research has been done in this field previously. The clustering and forecasting methods used in our model-based RL make this work novel. By facilitating local power and energy balance, we hope to transform the current household energy system by enabling households to obtain lower energy bills. The algorithms have been developed one contribution at a time and applied to an actual dataset, revealing their potential to optimize the distribution network.

II. P2P ENERGY TRADING
In the traditional market paradigm, producers and consumers deal with merchants depending on their net consumption. Peer-to-peer trading, however, necessitates new technology and business models with market regulations that govern the P2P archetype [56]. Before trading with a retailer, producers share their production and consumption in local markets at an internal price that is typically set between the export and retail prices. Consumers can be thought of as a subset of producers who do not own any local power operations. Producers and consumers confront a complicated quota-decision process because renewable sources of energy, such as solar photovoltaic (PV) generation, are stochastic. Choosing a suitable trading strategy is challenging since all players' strategies are updated in real time.

A. THE DOUBLE AUCTION MARKET MECHANISM
The double auction (DA) market connects many customers and producers engaged in the energy market [57], [58]. In the electricity market, the auction term is set at a specific length of time, i.e., an hourly resolution [59]. The procedure is as follows:
1) Traders send their orders to the market whenever an auction period begins. Orders comprise a trading price and an energy quantity.
2) Purchase orders are matched against sale orders by a matching algorithm.
3) When two orders are matched, the auctioneer uses the classic mid-pricing approach to determine the market clearing price; the transaction quantity is the minimum quantity of the two matched orders.
At the end of the auction, the auctioneer balances the remaining energy and unmatched orders with the utility company at grid pricing: time-of-use (ToU) and feed-in tariff (FiT) [58], [59]. All traders' pricing schemes are bounded by FiT and ToU to guarantee economic benefits: bid and ask prices always lie within the grid prices, and the clearing price sits at the center of the buy-sell gap [60].
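The clearing steps above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the order representation, default tariff values, and function name are assumptions.

```python
def clear_double_auction(bids, asks, tou=0.30, fit=0.08):
    """One-period DA clearing (sketch). bids/asks: lists of (price, qty).
    Matched pairs clear at the mid-price; residual energy settles with
    the grid at ToU (buying) or FiT (selling)."""
    bids = [list(b) for b in sorted(bids, key=lambda o: -o[0])]  # highest bid first
    asks = [list(a) for a in sorted(asks, key=lambda o: o[0])]   # lowest ask first
    trades = []
    i = j = 0
    while i < len(bids) and j < len(asks) and bids[i][0] >= asks[j][0]:
        price = (bids[i][0] + asks[j][0]) / 2   # classic mid-pricing
        qty = min(bids[i][1], asks[j][1])       # minimum quantity of the matched pair
        trades.append((price, qty))
        bids[i][1] -= qty
        asks[j][1] -= qty
        if bids[i][1] == 0:
            i += 1
        if asks[j][1] == 0:
            j += 1
    grid_buy = sum(b[1] for b in bids[i:])      # unmatched demand, bought at ToU
    grid_sell = sum(a[1] for a in asks[j:])     # unmatched supply, sold at FiT
    return trades, grid_buy * tou, grid_sell * fit
```

For example, a bid of (0.25, 2) against an ask of (0.10, 1) clears 1 unit at the mid-price 0.175, with any leftover quantity settling at grid prices.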

B. PROBLEM STATEMENT AND FORMULATION
The above-mentioned double auction market clearing procedure is modeled as multi-agent decision-making, defined as a decentralized, partially observable Markov decision process (Dec-POMDP) with discrete time steps [61]. Each agent n selects an action a_{n,t} based on its policy and private observation o_{n,t} at time step t. The N agents share a set of global states S, a collection of private observations O, a collection of action sets A, a collection of reward functions R, and a state transition function T. One auction period (t = 1 h) is the time span between two sequential stages. The trading of solar energy, in modified form, can be expressed as in Eqs. (1), (2), and (3) [30]:

s_{n,t} = \left( P^{inf}_{n,t}, E^{es}_{n,t}, \lambda^{b}_{t}, \lambda^{s}_{t}, q^{da,actual}_{n,t-1}, \lambda^{i,actual}_{n,t-1}, q^{da,forecast}_{n,t}, \lambda^{i,forecast}_{n,t} \right)   (1)

where s_{n,t} is the state of agent n at time step t; P^{inf}_{n,t} and E^{es}_{n,t} are the inflexible load information and the energy storage (ES) battery energy content at time step t; \lambda^{b}_{t} and \lambda^{s}_{t} are the grid information for ToU and FiT at time step t; q^{da,actual}_{n,t-1} and \lambda^{i,actual}_{n,t-1} are the previous trading quantity and price at time step (t − 1); and q^{da,forecast}_{n,t} and \lambda^{i,forecast}_{n,t} are the forecast trading quantity and price at time step t.

a_{n,t} = \left( a^{q}_{n,t}, a^{p}_{n,t} \right)   (2)

where a_{n,t} is the action of agent n at time step t, and a^{q}_{n,t} and a^{p}_{n,t} represent the energy and price decisions submitted to the DA market at time step t.
where r_{n,t} is the immediate reward that agent n obtains at time step t when the action is executed according to s_{n,t}. At step t, agent n receives its reward r_{n,t} as the negative cost of its energy bill, resulting from the DA market clearing procedure. Agents who are successfully cleared receive the local price \lambda^{i}_{n,t} and the cleared quantity q^{da}_{n,t}; each agent n can then calculate its corresponding cost in the DA market. The remaining unmatched quantity q^{grid}_{n,t} is bought or sold through the utility company at ToU \lambda^{b}_{t} or FiT \lambda^{s}_{t}. If agents cannot be cleared in the DA market at all, their entire quantity q^{grid}_{n,t} = q^{da}_{n,t} is immediately exchanged at ToU \lambda^{b}_{t} or FiT \lambda^{s}_{t}.
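The per-step settlement just described can be illustrated as below. The function signature and default tariffs are hypothetical, and the exact form of Eq. (3) in the paper may differ; this only shows the negative-bill convention:

```python
def step_reward(role, q_da, lam_local, q_grid, tou=0.30, fit=0.08):
    """Immediate reward r_{n,t} as the negative energy bill (sketch).
    Cleared energy q_da settles at the local DA price lam_local; the
    unmatched residual q_grid settles at ToU (buying) or FiT (selling)."""
    if role == "buy":
        bill = lam_local * q_da + tou * q_grid      # costs are positive
    else:
        bill = -(lam_local * q_da + fit * q_grid)   # revenue: a negative bill
    return -bill
```

Since the local price lies between FiT and ToU, a buyer cleared locally receives a higher (less negative) reward than one settling the same quantity at ToU.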

III. RELATED WORKS
Traders in the double auction market utilize the ZI strategy as a fundamental and popular trading method whereby they determine their order price as a random surplus offset from its value, drawn from a uniform distribution over a relevant interval, viz. between ToU and FiT [58], [59]. Because the actual market is highly dynamic in real time, participants face a complicated quotation-decision process, and choosing an appropriate trading plan in such a complex market situation is difficult.
MARL is a framework for investigating the sequential decision-making problems of agents (producers and consumers) [62], [63]. MARL can also be applied to smart grid applications, i.e., P2P energy trading in the DA market [54].
A. MULTI-AGENT DEEP DETERMINISTIC POLICY GRADIENT (MADDPG) [64]
MADDPG unites the multi-agent actor-critic (MAAC) method with the DDPG algorithm. The algorithm utilizes a multi-agent policy gradient in which decentralized agents develop a centralized critic based on all agents' observations and behaviors. Each agent has its own actor and critic network, similar to a single-agent actor-critic architecture. The actor network takes the agent's current state and suggests an action. The critic component, however, differs somewhat from standard single-agent DDPG: each agent's critic network sees the actions and observations of all other agents. The critic network thus has a better perspective of what is going on, whereas the actor network only has access to its own agent's observations. The critic network outputs an estimated reward, taking the full observations and actions of all agents as input; the actor network outputs a suggested action for that particular agent. The critic network is active only during training; at execution time, it is not available.

B. ASYNCHRONOUS ADVANTAGE ACTOR-CENTRALIZED-CRITIC WITH COMMUNICATION (A3C3) [48]
Extended from the asynchronous advantage actor-critic (A3C), network updates are carried out by multiple workers using a distributed approach [48]. As depicted in Fig. 1a, A3C3 has distributed worker threads that use actor-critic methods to asynchronously optimize value, policy, and communication networks for the agents. Multiple workers asynchronously update all networks for each agent by making periodic local copies of the networks, using them to compute gradients, and applying the gradients to the global networks. In Fig. 1b, the agent architecture in an A3C3 worker is composed of three networks: 1) a policy (or actor) network, which outputs an action, 2) a communication network, which outputs an outgoing message, and 3) a value (or critic) network, which outputs a value estimate.
It is acknowledged that A3C3 can learn policies that are very successful and can attain goals in a shorter time than MADDPG [48]. Although A3C3 has never been applied in P2P energy trading, it has been adopted as our core model since A3C3 is seen to outperform MADDPG.

IV. PROPOSED METHOD
In this paper, the model-based deep reinforcement learning algorithm MB-A3C3 is implemented. Fig. 2 demonstrates the schema of MB-A3C3, which consists of three modules. In Module 1, A3C3 is employed to collect environmental data, including agents' information, actions, and energy bills for the trading period. In Module 2, agents, whether buyers or sellers, are classified according to their daily trading behavior and the environmental data from Module 1. After that, agents' trading quantity and price are predicted via a forecasting module (Module 3), using the clusters' centralized data obtained from Module 2. In the testing phase, the current state is then utilized to formulate the predicted trading quantity and price. The model assesses the current state and provides a policy that results in an action in the double auction market: the trading quantity and price. The energy bill is determined using all of these variables, and MBRL is utilized to forecast future trading quantities and prices. The A3C3 network was enhanced with a 1-dimensional convolutional architecture, giving A3C3-Conv1D. To develop an optimal trading strategy for each agent, model parameters are updated using experience and reward information. This strategy is carried out as a policy model within a customized P2P energy trading environment. Furthermore, policies have constraints (maximum bounds) to make them more realistic: 1) A household's energy storage constrains the offered trading quantity, with minimum and maximum energy levels between 2 and 10 kWh [65]. 2) Trading prices are provided in Table 1. As set by the grid, ToU is the flexible purchase price for each period, while FiT is the fixed sale price for the entire day; the agent's trading price output, whether buying or selling, is limited to the grid prices. 3) When trading in the double auction market, the network capacity threshold is considered to be the peak demand.
The algorithm maintains a daily peak demand of 600 kW, which satisfies the capacity of the network [34].

A3C3's agent is represented by an actor, a centralized critic, and an additional communication network, as detailed below:

1) ACTOR NETWORK
As depicted in Fig. 3, the local policy is learned by the actor network. The actor receives all agents' observations and broadcast messages as input. The output layer of the network generates a probability distribution over the agent's actions and is based directly on the action space of the environment.

2) CENTRALIZED CRITIC NETWORK
In Fig. 4, the agent's centralized critic network is given, combining the observations of all other agents with additional information from the environment. If the environment allows access to its underlying state, the centralized observations become the entire environmental state s_t. The policy is thus evaluated by the centralized critic.

3) COMMUNICATION NETWORK
In Fig. 5, the communication network of the agent is depicted. The output layer has a rectified linear unit (ReLU) activation function to generate messages; other output architectures, such as continuous-valued messages, are also supported. A communication protocol between agents is learned by the communication network.
After receiving trading information from all agents, the double auction market mechanism matches orders and calculates energy bills, which define the reward for each agent. Unmatched orders trade their energy at the prices listed in Table 1.

B. AGENT'S DAILY TRADING BEHAVIOR CLUSTERING
For day-to-day trading, agents are grouped together using dynamic time warping (DTW). Each group's similar trading behavior is then aggregated as a centralized dataset for environmental modeling. DTW is utilized to measure the similarity between agents' daily trading-quantity profiles. Because it permits one-to-many alignments, DTW computes the lowest cumulative distance between all points. DTW is a more appropriate distance measure here than Euclidean distance: points may be matched across time shifts, so the comparison focuses on the shape of the series rather than on pointwise positions. Moreover, two time series need not be of identical length, which is a requirement of Euclidean distance, since Euclidean distance compares data points one to one [66]. The optimal k for k-means is selected based on the elbow [67] and silhouette [68] methods.
Due to the large number of agents and their diverse behavior, it is assumed that an agent's daily behavior differs hour by hour. Accordingly, the 300 agents are organized into four clusters based on their daily trading behavior. In the literature, DTW is frequently used in conjunction with k-medoids and hierarchical approaches; in some articles, DTW is used with k-means [69]. Among non-traditional approaches, DTW has also been coupled with random swap and hybrid methods [70].
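A minimal dynamic-programming sketch of the DTW distance used for the clustering above is given below; the absolute-difference local cost is an assumption for illustration, and the actual implementation in the paper may differ:

```python
def dtw_distance(a, b):
    """Dynamic time warping between two (possibly unequal-length)
    daily trading-quantity profiles. Unlike Euclidean distance,
    DTW's one-to-many alignment compares shapes, not positions."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = cost of the best alignment of a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # one-to-many: stretch a
                                 D[i][j - 1],      # one-to-many: stretch b
                                 D[i - 1][j - 1])  # one-to-one match
    return D[n][m]
```

Note that a profile and its time-stretched copy get distance zero, which is exactly the shape-over-position property that motivates DTW for daily trading profiles.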

C. MODEL-BASED MULTI-AGENT DEEP REINFORCEMENT LEARNING (MB-MADRL) FRAMEWORK
In Fig. 6, the multivariate LSTM is depicted; it consists of six time-dependent variables. The hidden output (h_1, ..., h_6) is passed from one step of the network to the next. The LSTM takes into account not just the preceding hour of the input sequence but the prior 24 h. This technique is used to calculate the state's predicted trading quantity (q^{da}_{n,t}) and price (\lambda^{i}_{n,t}). Because it multiplies the outputs of the hidden states by trainable weights, the multivariate LSTM is used to forecast an agent's trading quantity and price, whereas a typical LSTM network simply uses the latest hidden state as output [71].
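The 24-h lookback input described above can be illustrated with a sliding-window builder that turns the multivariate series into supervised samples. The function name and the assumption that the first two features are trading quantity and price are illustrative, not taken from the paper:

```python
def make_windows(series, lookback=24):
    """Build supervised samples for a multivariate next-hour forecaster
    (sketch): each input is the previous `lookback` hourly feature
    vectors; each target is the next step's first two features, assumed
    here to be trading quantity and trading price."""
    X, y = [], []
    for t in range(lookback, len(series)):
        X.append(series[t - lookback:t])  # the prior 24 h of all features
        y.append(series[t][:2])           # next-hour quantity and price
    return X, y
```

Each sample thus carries the full prior day of all six variables, matching the "prior 24 h" input the multivariate LSTM consumes.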

D. THE OVERALL PROCESS OF MB-A3C3
In Algorithm 1, the MB-A3C3 algorithm is demonstrated. The process begins with A3C3-Conv1D collecting the environmental data. Ten random runs of the training process then continue until the energy bills from the DA market mechanism, calculated in accordance with Eq. (3) using actual data, stabilize. Next, DTW is applied for time-series clustering to categorize the agents. For forecasting the next 24 h of trading quantity and price, each cluster's data is collected to combine the environmental data with the multivariate LSTM. After that, the MBRL process is applied for the testing phase.

Algorithm 1 MB-A3C3 Algorithm
1. Initialize inputs r_t, η_v, η_u, η_w, T_max, t_max, γ, β, and output π
2. Initialize environments: state, action, and reward
3. Assume global shared parameter vectors (θ_µ, θ_v, θ_w), global shared counter T = 0, and thread-specific parameter vectors (θ'_µ, θ'_v, θ'_w)
4. Run A3C3 with D_train to collect agents' trajectories D_env = (s, a, r)
5. Apply time-series clustering with DTW on D_env
6. Aggregate each cluster's dataset into D_centralized
7. Train the environment model on D_centralized for each cluster
8. Run A3C3 with the environment model from step 7 on D_test
for episode = 1 : T_max do
    1. Reset gradients of the actor, centralized critic, and communication networks
    2. Perform a_t according to policy π(a_t | s_t; θ_µ) and the constraints
    3. Every actor sends its (s_t, a_t, r_t) to the environment model according to the agent's cluster
    4. The environment model predicts a_t and sends it to the A3C3 model
    5. A3C3 acts with r_t = argmax r_t(a_t^1, a_t^2, ..., a_t^n), receives r_t, then transfers to the new state
    if T mod t_max == 0 then
        Accumulate gradients with respect to θ_v, θ_µ, and θ_w
    9. Update the network gradients: θ_µ by dθ_µ, θ_v by dθ_v, and θ_w by dθ_w
    10. Reset the gradients of the actor, centralized critic, and communication networks: dθ_µ ← 0, dθ_v ← 0, and dθ_w ← 0
    11. Reset θ'_µ = θ_µ, θ'_v = θ_v, and θ'_w = θ_w

Here, r_t is the reward function; η_v, η_u, and η_w are the learning rates of the actor, centralized critic, and communication networks; T_max is the maximum training episode and t_max is the update time-step; γ is the discount factor; β is the entropy regularization term; π is the policy; and D_train, D_env, D_centralized, and D_test are the training, environment, centralized-environment, and testing datasets, respectively.

V. EXPERIMENTAL SETUP
The experiment was carried out after assessing the data from Ausgrid's electricity network [72]. The publicly available dataset contains load and rooftop PV generation for 300 residential customers in NSW and the adjacent rural areas. Data was collected over a three-year period; both load and PV generation were measured at 30-min intervals. The algorithm was run in a real-time simulation manner [30], [34], [35], [36], [37], [73], [74]. Although the dataset's period was 2012–2013, the data obtained from the 300 households was integrated and processed in real time through the sharing economy.
A. EXPERIMENTAL DATA
Between July 1, 2010 and June 30, 2013, data was gathered from the 300 randomly selected solar customers in NSW. Customers had a gross-metered solar system installed and were invoiced on a domestic tariff. Data from June 1, 2012 to May 31, 2013 was utilized to evaluate performance against the baseline. The various types of data, matching the annual statistics of the solar home datasets, are described in [75].

B. HYPERPARAMETERS SETTING AND DETAILS
In Fig. 2, the MB-A3C3 hyperparameters are specified [76]. In addition, TensorFlow and OpenAI Gym were used within a Dyna framework to evaluate the reinforcement learning algorithms; a customized environment using P2P energy trading data was provided.

C. EVALUATION
In Eq. (3), by employing the MB-A3C3 model, the reward function is utilized to reduce each agent's energy bill. The action taken by the MB-A3C3 algorithm during training determines the energy bill, which is the cumulative reward for each episode of 24 steps. Accordingly, over 4,000 episodes, 10 independent runs with 10 random seeds were carried out for random initialization. During training, every 100 episodes and following the baseline paper, the effectiveness of the households' energy management strategies on the test dataset was examined. MB-A3C3 was evaluated on how well it performed against the policy model under the three modules: 1) clustering: agent trading behavior, 2) forecasting: trading quantity and price, and 3) the MBRL framework.

VI. RESULTS

A. OVERALL RESULTS
In Table 2, the average community's internal trade, external trade, and net energy bills per day for 8 and 300 households are compared across MARL algorithms. For the multi-agent model, there is one agent per household in the experiment with 8 households (no clustering is applied); in the experiment with 300 households, there is only one agent per cluster. It is expected that the algorithm should increase internal trade within communities while reducing external trade directly with the main grid. The baseline MADDPG was extended (Section III-A) from 8 to 300 households to ensure the validity of the algorithm [30]. Of all 12 algorithms, the MB-A3C3 (LSTM)-DTW algorithm was found to be the winner ($654.95). Compared with MADDPG ($789.85), household energy bills fell by more than $100, and energy bills turned out to be 17% lower than trading with the grid ($790.51). At the end of the trading day, the community's net energy bills were greatly reduced by the algorithm. Meanwhile, internal trade increased, external trade decreased, and peak energy demand dropped from above 600 kW to 589.26 kW. In Figs. 7 (a and b), the training time of the multithreaded algorithms (A3C3 and MB-A3C3) is compared with the single-threaded MADDPG. Despite consuming more training time, the MB-A3C3 (LSTM)-DTW algorithm assessed all relevant data via agent clustering and a model of the environment. When the number of households increased from 8 to 300, training time reached 1,767.38 min with one model per agent; when assigned to the model-based MB-A3C3, it proved to be 149.36 min. This outcome reduces the time taken for forecasting.
In Fig. 8, the community's average energy bill per day on the training set is presented. The convergence of the five RL algorithms' rewards during the training phase is depicted to illustrate the superior performance of MB-A3C3 (LSTM)-DTW over the other algorithms, providing faster convergence and lower energy bills. When trading within a community, the algorithm is optimized under certain constraints and environments, and an agent's energy bill is reduced by the price incentive scheme in the algorithm. The reward tends to be lower early on, as agents have no knowledge or experience of how to trade during the first stage. After the training phase, the optimized network parameters, which result from multithreaded mechanisms, deep learning networks, agent clustering, and environmental models, efficiently lower the community's energy bills, as shown by the green line in the graph.

B. EFFECT OF MULTITHREADED AND DEEP LEARNING IN POLICY MODEL
In this section, it is seen that A3C3-FF can outperform the baseline MADDPG. Performance is further improved by applying deep learning techniques, e.g., Conv1D rather than the feed-forward (FF) architecture. In Table 2, the performance of the A3C3-Conv1D model is superior to that of the single-threaded MADDPG, attaining energy-bill reductions of 9.86% (from $34.59 to $31.18) and 7.25% (from $789.85 to $732.61) for 8 and 300 households, respectively.
In Fig. 9, comparing results across the different network architectures, the A3C3-Conv1D algorithm outperformed A3C3-FF and A3C3-LSTM, revealing much lower energy bills for both 8 households (Fig. 9a) and the extended 300 households (Fig. 9b). When a policy model considers the correlation between observations over a short timestep to take a proper action, the CNN performs better than the LSTM: the LSTM is usually applied to processing long sequences of data, whereas the CNN is designed to exploit ''spatial correlation'' in data.
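Why Conv1D captures short-timestep correlation can be seen from a bare valid 1-D convolution: each output mixes only a short window of consecutive time steps. This sketch is purely illustrative and unrelated to the actual network weights:

```python
def conv1d(x, kernel, bias=0.0):
    """Valid 1-D convolution (sketch): each output combines a short
    window of consecutive time steps, which is how a Conv1D policy
    layer captures short-range correlation between observations."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]
```

A difference kernel such as [1, 0, -1], for instance, responds only to local changes across neighboring steps, regardless of where in the sequence they occur.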

C. EFFECT OF AGENT'S TRADING BEHAVIOR TIME SERIES CLUSTERING
In this experiment, the 300 agents (households) are clustered based on their trading behaviors. Three strategies are compared: (1) eight households, randomly selected; (2) location, based on postcode; and (3) time-series k-means. In Fig. 10, the optimal-k results are shown; k = 4 was chosen.
For the k-means method, two distance measures were compared: DTW and Euclidean. DTW was chosen for our clustering since its silhouette score, a clustering performance measure, proved higher than the Euclidean score: 0.23 versus 0.17. In this paper, time-series k-means is also called ''DTW''. Based on DTW's calculations, each household is classified into a cluster. Fig. 11 depicts the diverse trading quantities among the four clusters, which vary over time.
In Fig. 12, a comparison is made of the three clustering methods applied to our winner from the previous experiment viz. MB-A3C3 (LSTM), as seen in Table 2. Results demonstrate that DTW proved to be the winner, revealing the cheapest energy bill ($654.95) for the 300 households.

D. EFFECT OF FORECASTING MODELS IN MB-MADRL FRAMEWORK
In Table 3, the multivariate LSTM excels in terms of both RMSE and MAPE on the testing set over the GRU and the transformer. The winner, the multivariate LSTM, shows a marginal error of only 0.0344 dollars per kWh (15.82%) for the trading price and 0.0263 kWh (10.39%) for the trading quantity, thus providing the least forecasting error. In forecasting both trading prices and trading energy, our research has broken new ground.
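For reference, the two error metrics reported in Table 3 can be computed as follows; these are the standard definitions, and the paper's exact averaging over households or horizons may differ:

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error (%); assumes no zero targets."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)
```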
In Fig. 13, the trading price and quantity forecasting results, as determined by the winner (the multivariate LSTM) for one randomly selected household, are depicted. In Fig. 13a, the predicted trading price differs quite dramatically from the actual values due to fluctuations in the householder's decisions: unlike the trading quantity, which results from the agents' consumption-generation activity, the trading price is volatile, making its patterns more difficult to capture. In Fig. 13b, the forecasting result for the trading quantity is very promising, since there is a pattern in energy usage (trading quantity), signifying its stable trend. With an accurate forecast, the policy model can learn to act and minimize energy bills more efficiently. As shown in Table 2, MB-A3C3 (LSTM)-DTW outperformed the other algorithms by providing higher internal trade, lower external trade, and reduced community energy bills for both 8 and 300 households.

VII. DISCUSSION

A. MBRL WITH FORECASTING MODEL
As investigated in Section IV, the MBRL framework begins by collecting environmental data and training the model to forecast. It is a requirement for MBRL that the forecasting model be accurate to ensure precise information for agents. The algorithm must be able to utilize the productive information to optimize the reward for the community's energy bill.

B. NUMBER OF K IN CLUSTERING METHOD
The clustering method was introduced to reduce the number of forecasting models (one model per cluster), assuming that homes in the same cluster behave similarly. Since it is quite costly to develop a forecasting model separately for each household (a total of 300 households), three clustering techniques were tested to determine the winner: random matching, location-based clustering, and k-means (DTW) clustering.
The results of clustering depend on the number of clusters (k); the bias-variance trade-off determines the cluster number. A large number of clusters produces a small bias but risks overfitting, while a small number of clusters produces low variance, which is sometimes favorable for generalization or interpretation and typically good for prediction. Fig. 14 inspects the energy bill and peak demand from k = 2 to 10 using the winner's clustering method (k-means with DTW): the community's energy bill and peak demand for 300 households diverge between k = 4 and 7, and a tight race begins between k = 8 and 10. It is projected that if k were increased to 300, the result would remain the same while requiring significant computational resources. Significantly, the selected number of clusters (k = 4) exhibits the lowest energy bill and peak demand.
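The quantity inspected by the elbow method is the within-cluster sum of squared distances (inertia), which shrinks as k grows. A one-dimensional sketch (an illustrative simplification of the multivariate daily profiles) is:

```python
def inertia(points, centers):
    """Within-cluster sum of squared distances, the quantity plotted by
    the elbow method: each point counts toward its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)
```

Plotting this value for k = 1, 2, 3, ... and picking the bend where further clusters stop paying for themselves is exactly the elbow criterion discussed above.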

VIII. CONCLUSION
In this paper, a model-based multi-agent deep reinforcement learning algorithm called MB-A3C3 is presented. First, the baseline A3C3 was enhanced using a 1D convolutional network. Second, RL can support a large number of households (agents) by clustering houses based on their trading behaviors using dynamic time warping (DTW). Third, the environment was forecast using a multivariate LSTM; this is the model-based RL component. Both the multivariate LSTM and the CNN are seen to improve multi-agent deep reinforcement learning. For large-scale households, the time-series clustering strategy based on trading behavior was utilized as an agent-based model. The experiment was conducted on the Ausgrid dataset of 300 households in NSW, Australia. Results demonstrate that our MB-A3C3, being less time-consuming and less complex, proved superior to other RL algorithms, producing costs 17% lower than traditional grid trading. Significantly, MB-A3C3 leveraged internal trading between households, thereby decreasing external trading under the grid's price incentives and constraints. The algorithms herein can potentially aid in reducing customers' electricity bills. Further research should investigate various regulations to embrace more real-world scenarios of electricity consumers, producers, and power system operators, creating more opportunities for P2P energy trading. Moreover, adding other related factors, e.g., weather and system information, can make the approach more accurate, as training agents with more factors can provide more optimized policies.

DECLARATION OF COMPETING INTEREST
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article. VOLUME 10, 2022

DATA AVAILABILITY
Datasets related to this article can be found at an open-source online data repository hosted at Data to share (see: https://www.ausgrid.com.au/Industry/Our-Research/Data-to-share/Solar-home-electricity-data).