A Multiagent Federated Reinforcement Learning Approach for Plug-In Electric Vehicle Fleet Charging Coordination in a Residential Community

The increasing penetration of distributed renewable energy and electric vehicles (EVs) in local microgrids and residential communities poses a great challenge to balancing system stability and economic benefits. This paper proposes a decentralized framework based on an efficient federated deep reinforcement learning method for plug-in electric vehicle (PEV) fleet charging management in a residential community that is equipped with a photovoltaic and battery energy storage system and connected to a local transformer. Firstly, the PEV charging management framework is described as a virtual EV charging station coordinating charging tasks by sharing public information with distributed agents. Then, an individual preference model of each PEV is developed, considering heterogeneous PEV charging anxiety, battery degradation, and a collective penalty. Subsequently, we propose an attention-weighted federated soft actor-critic method to efficiently seek the coordinated scheduling of PEV fleet charging in a distributed way, where scalability and privacy protection are ensured through attention-based information sharing. Finally, a real-world case study is conducted to validate the effectiveness and feasibility of the proposed approach.

NOMENCLATURE

p_t,k    Charging power for PEV k at time slot t (kW)
P_t^total    Total load of the whole residential community at time slot t (kW)
c_t    Upper-bound power to avoid overload at time slot t (kW)
P_g,t    Power supply from the grid at time slot t (kW)
P_pv,t    PV generation power at time slot t (kW)
P_b,t    Discharging power of the BESS at time slot t (kW)
P_l,t    Non-PEV household load demand at time slot t (kW)
P_ava,t    Available power for PEV fleet charging at time slot t (kW)
t_α,k, t_β,k    Arrival and departure time slots of PEV k's charging task
ρ_t    Historical electricity price ($/kWh)
ω_k    Weight factor evaluating the contribution of local model k to the global model

With the prevalence of environmental preservation awareness and the growing willingness to live a low-carbon life, electric vehicles (EVs) have gradually been popularized among ordinary families as a clean and efficient means of transportation [1], [2]. However, the uncertainty in stochastic human behavior and charging inconvenience make it difficult to grasp the charging needs of individual EVs, which might induce instability in the local power system without well-ordered management. Other ambient factors, such as the electricity price [3], photovoltaic (PV) generation [4], and weather conditions [5], further increase the uncertainty of energy management [6]. As a result, EV fleet charging scheduling is crucial for improving energy efficiency and flattening the possible peak load under these stochastic factors.

Generally, traditional research tends to formulate the EV charging scheduling issue as an optimization problem that maximizes the owners' profits by scheduling charging or discharging strategies [7]. Conventional model-based research focuses on modeling the EV charging scheduling issue as a particular optimization problem and solving it by linear programming [8], mixed-integer linear programming [9], or dynamic programming [10]. A robust optimization approach has been applied to PEV charging scenarios with multiple uncertain factors [11]. Besides, a Markov decision process (MDP) model is proposed in [12] to describe the dynamic evolution of energy supply and demand for EV charging. In summary, these existing studies on EV charging scheduling depend on an explicit optimization model and solve the problem under the assumption of a fully observable environment. However, the optimization performance might be heavily influenced by the model's accuracy.

With the development of artificial intelligence, deep reinforcement learning (DRL) techniques have been implemented in a variety of applications, especially in EV charging strategies, by automatically interacting with the changing environment. For instance, Vandael et al. [13] proposed a batch RL technique to find a day-ahead consumption plan based on the learned charging behavior of EVs. Chiş et al. [14] proposed a fitted Q-iteration batch reinforcement learning algorithm to learn an optimal cost-reducing charging policy. Facing the uncertainty from both real-time price signals and traffic conditions, Wan et al. [15] developed a model-free DRL framework to adaptively learn the dynamics of the changing environment and determine the optimal scheduling strategy. Wang et al. [16] likewise learned a charging policy in an uncertain environment.
However, along with the increasing privacy concerns over individual data, an efficient and decentralized DRL framework with a privacy-preserving property needs further investigation.

As a promising solution for privacy preservation, Zhuo et al. [25] proposed a federated-learning (FL) based reinforcement learning framework to instruct the training of multiple RL agents in a distributed way. The foundational theory is derived from the FL structure proposed by Google in 2016 and has been widely applied in many areas of information and communication science [26]. Recently, federated deep reinforcement learning (FDRL) has been applied to energy management and has achieved desirable privacy-preserving performance. Specifically, Lee et al. [27] proposed a novel FDRL approach for the energy management of smart homes with home appliances. Additionally, they proposed a privacy-preserving FDRL framework for maximizing the profits of multiple smart EV charging stations (EVCSs) integrated with PV and energy storage systems (ESS) under a dynamic pricing strategy [28]. However, the existing applications of FDRL to energy management rely on equal weights for distributed agents when jointly building the global model and fail to consider the differentiated contributions of heterogeneous agents.

To narrow the research gaps above, the present work proposes a distributed FDRL framework for coordinating plug-in electric vehicle (PEV) fleet charging in a residential community. A virtual EV charging station (EVCS) is regarded as a global agent that aggregates local household PEVs to coordinate charging and prevent transformer overload events. Further, we propose an efficient attention-weighted federated reinforcement learning algorithm embedded with a soft actor-critic (SAC) engine (AWFSAC) to solve the problem. Combining offline training and online implementation, the individual local agents gradually learn to make coordinated policies under the instruction of the EVCS.

The contributions of this paper are as follows:

(1) We propose a partially observable MDP (POMDP) architecture to model the dynamics of the single-PEV charging scheduling problem. A comprehensive preference setting for PEV charging is introduced, which considers charging anxiety, battery degradation, and a collective overload penalty.

(2) We propose a multi-agent framework for the PEV fleet charging problem and model it as a decentralized POMDP (Dec-POMDP). More specifically, each agent arranges its charging scheduling strategy independently while interacting with a virtual EVCS to avoid overload in a coordinated manner.

(3) To address the privacy concerns of each agent, we design an attention-based federated reinforcement learning algorithm to solve the problem, which produces a coordinated strategy for PEV fleet charging in a distributed manner.

The rest of the paper is organized as follows. Section II describes the overall structure and models the virtual EVCS and the distributed PEVs with charging preferences. A Dec-POMDP model is formulated, and an efficient multi-agent AWFSAC approach is proposed in Section III. A case study based on real data is conducted in Section IV to demonstrate the effectiveness of the proposed method, followed by the conclusion in Section V.

As shown in Figure 1, we propose a PEV fleet charging framework on a residential community scale to achieve coordinated charging management.

where SOC_t is the remaining energy in the battery at time slot t; p_t is the charging or discharging power; η_c and η_d represent the energy transfer efficiencies of charging and discharging, respectively; p_max is the rated power of charging and discharging; and SOC_min and SOC_max refer to the lower and upper limits of the battery capacity. The PEV battery is fully charged when SOC_t reaches SOC_max.
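The SOC dynamics to which the preceding definitions refer are not legible in this copy; the following is a standard reconstruction consistent with those definitions, with Δt the slot length and E_cap the battery capacity (both assumed symbols not named in the surviving text):

\[
SOC_{t+1} =
\begin{cases}
SOC_{t} + \dfrac{\eta_{c}\, p_{t}\, \Delta t}{E_{cap}}, & p_{t} \ge 0 \ \ (\text{charging}) \\[4pt]
SOC_{t} + \dfrac{p_{t}\, \Delta t}{\eta_{d}\, E_{cap}}, & p_{t} < 0 \ \ (\text{discharging})
\end{cases}
\qquad
SOC_{min} \le SOC_{t} \le SOC_{max}, \quad |p_{t}| \le p_{max}
\]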

Unlike the charging process of the BESS, PEV charging has to fulfill the additional energy requirement of storing sufficient electricity for the next trip before departure. Accordingly, the total energy stored during the charging process from arrival t_α to departure t_β is represented below. Notably, the other charging preference requirements for PEV charging are built in the following Section II-B.

Range anxiety refers to the concern of failing to reach the planned destination before the battery power is exhausted. Range anxiety is influenced by the range evaluation, which depends on the driver's experience. As the driver's experience with the EV grows, the driver gradually makes precise range evaluations and eventually avoids overestimating the range requirement [31]. This means that drivers become more experienced in evaluating range during trips, reducing range anxiety.
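The total-energy expression referenced at the start of this subsection is likewise missing from this copy; a plausible reconstruction from the surrounding definitions (again with Δt and E_cap as assumed symbols) requires the energy delivered over the plug-in window to raise the battery from its arrival level to the expected departure level:

\[
\sum_{t = t_{\alpha}}^{t_{\beta}} \eta_{c}\, p_{t}\, \Delta t \;\ge\; \left( SOC_{\beta} - SOC_{t_{\alpha}} \right) E_{cap}
\]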

The PEV charging anxiety (CA) is defined as a comprehensive evaluation model reflecting individual charging satisfaction at each time slot. The preference considers both the driver's charging anxiety over the remaining energy (RE) and over the remaining charging time (RCT). Specifically, RE describes the driver's concern about failing to reach the PEV charger with the limited remaining energy, while RCT reflects the driver's worry about uncertain travel events emerging during the battery charging period before departure. Inspired by [31], the modeling of CA, including RE and RCT, can be described with an index, where SOC_β,k is the expected SOC level for agent k at the departure time.
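The exact CA index from [31] is not reproduced in this copy. Purely for illustration, one functional form consistent with the qualitative description above (anxiety grows as the SOC gap persists and the remaining charging time shrinks) could be written as follows, where the normalization and product structure are assumptions rather than the paper's model:

\[
CA_{t,k} \;=\; \underbrace{\max\!\left( 0,\; \frac{SOC_{\beta,k} - SOC_{t,k}}{SOC_{\beta,k}} \right)}_{\text{RE component}}
\;\times\;
\underbrace{\frac{1}{t_{\beta,k} - t + 1}}_{\text{RCT component}},
\qquad t_{\alpha,k} \le t \le t_{\beta,k}
\]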
where the corresponding term is the battery aging cost, and κ and υ are the model parameters.

where p̄_k,t is the approximation of the collective power consumption obtained from the EVCS. Accordingly, each agent can acquire information regarding the collective actions of other agents and take it as a reference when making its own decisions. When an overload occurs, each household agent receives a penalty depending on its energy consumption during the overload hours. It can be treated as the cost for coordination described in [33], where P_coordinate,t is the cost for coordination; c_t refers to the upper-bound power at each time slot; and P_t,k and P_t^total represent the energy consumption of each agent and the total load at time slot t, respectively.

In this section, the EVCS aims to coordinate household PEV charging in a residential community subject to the local energy resources and the transformer load limitation. Its overall objective is to minimize the overall cost of PEV fleet charging under a series of constraints. Because a group of PEVs may be plugged into the distribution network concurrently, transformer overload (TO) can occur at peak load, which causes a penalty to each participant. Thus, the TO is described as a physical constraint in the EVCS model. It should be noted that we consider an upper bound P_TO^max on the transformer capability and an overload safety factor η_t of the EVCS [34]. The overall model and constraints are listed below, where P_t is the total charging power of the PEV fleet at time slot t; P_ava,t is the available power for PEV fleet charging at each time slot; P_g,t, P_pv,t, and P_b,t refer to the power supplied by the grid, PV, and BESS, respectively; P_l,t is the power demand of the household load; P_TO^max is the TO constraint on the power purchased and transmitted from the grid; and p_k,max and p_k,min refer to the upper and lower bounds of each PEV's charging power.

The model of the overall charging scheduling problem combines the above objective and constraints. Specifically, the observation for individual PEV agent k at time slot t is defined to include ρ_{t−T+1}, . . . , ρ_t, the historical T electricity prices.
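The EVCS power-balance and overload constraints listed above are garbled in this extraction; the sketch below restates them using only the quantities just defined (N denotes the number of household PEVs and is an assumed symbol), and should be read as a reconstruction rather than the paper's exact formulation:

\[
P_{ava,t} = P_{g,t} + P_{pv,t} + P_{b,t} - P_{l,t}, \qquad
P_{t} = \sum_{k=1}^{N} p_{t,k} \le P_{ava,t}, \qquad
P_{g,t} \le \eta_{t}\, P_{TO}^{max}, \qquad
p_{k,min} \le p_{t,k} \le p_{k,max}
\]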

where p_t,k is the charging power of agent k at time slot t.

3) REWARD

The reward r_1,t considers both the energy cost and the charging anxiety penalty. Before the anxiety period, the drivers care more about budget saving when making charging decisions. When docked during the anxiety period, the drivers gradually become more worried about their next trip as the CA index increases. Eventually, they prefer to get fully charged as soon as possible before the departure time. Here, SOC_t,k and SOC_x,k refer to the current SOC at time slot t and the expected SOC during the charging anxiety period, respectively, and t_x,k is the initial time slot of the charging anxiety.

The reward r_2,t is the cost of battery degradation. As described in Section II, charging at the rated power or frequently shifting between charging and discharging modes is detrimental to the battery life span, which is regarded as a running cost.
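The expression for r_1,t is not legible in this copy; a hedged sketch consistent with the description above (energy cost outside the anxiety period, an additional SOC-gap penalty within it; μ is an assumed weighting coefficient and Δt the slot length) is:

\[
r_{1,t} =
\begin{cases}
-\,\rho_{t}\, p_{t,k}\, \Delta t, & t < t_{x,k} \\[4pt]
-\,\rho_{t}\, p_{t,k}\, \Delta t \;-\; \mu \max\!\left( 0,\; SOC_{x,k} - SOC_{t,k} \right), & t_{x,k} \le t \le t_{\beta,k}
\end{cases}
\]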
Another reward, r_3,t, is triggered when a TO event happens. This common reward is designed as a peak-load penalty shared among agents when a large number of PEVs are charging or discharging at the same time. For the collective charging behavior, the penalty is assigned to each local agent depending on its power contribution during the overload period.

Unlike the single-agent formulation, the dynamic evolution of the environment in a multi-agent setting is influenced by the joint action of all the distributed agents. As a result, each PEV agent aiming to maximize its profit needs to interact not only with the environment but also with the other PEV agents. This work concentrates on multi-agent scheduling domains considering participants with either cooperative or competitive behaviors under a partially observable environment.

With multiple agents considered in the same environment, the single-agent POMDP problem becomes a Dec-POMDP model, which is suitable for coordination and decision making among multiple agents [35]. The model is a multi-agent extension of the POMDP and includes five parts ⟨S_set, O_set, A_set, T(·), R_set⟩: a state set S_set at the present time slot, an observation set O_set with partial information, action sets A_1, A_2, . . . , A_N for all the agents, a transition function T(·), and a reward function set R_set for all agents.

• N is the number of agents.

• S_set = S_1 × · · · × S_N is the shared state space of the agents.

where H(π(·|s_t)) is the policy entropy, and α represents the temperature parameter that balances the trade-off between entropy and reward.
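The entropy-regularized objective to which these definitions refer is the standard SAC objective [37] and can be written as:

\[
J(\pi) \;=\; \sum_{t=0}^{T} \mathbb{E}_{(s_{t}, a_{t}) \sim \rho_{\pi}} \Big[\, r(s_{t}, a_{t}) \;+\; \alpha\, \mathcal{H}\big( \pi(\cdot \mid s_{t}) \big) \Big]
\]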

The optimization aim of SAC is to find the optimal policy that maximizes this objective.
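The corresponding equation is not legible in this copy; in the standard SAC formulation [37] it reads:

\[
\pi^{*} \;=\; \arg\max_{\pi} \, J(\pi)
\]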
where π* is the optimal policy among all the possible options.

For the policy evaluation, the soft Q-function can be computed iteratively, where V_θ̄(s_{t+1}) refers to the value given by the target Q-network. In the policy updating process, the policy parameters are updated towards the exponential of the soft Q-function to guarantee an improvement in performance. The policy function is parameterized by φ, and its parameters can be updated by minimizing the expected Kullback-Leibler (KL) divergence, where φ represents the weights of the policy network.
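The soft Q-function backup and the KL policy objective referenced above are the standard SAC updates [37], [38]; they are restated here because the original equations are missing from this extraction:

\[
Q(s_{t}, a_{t}) \;=\; r(s_{t}, a_{t}) + \gamma\, \mathbb{E}_{s_{t+1}} \big[ V_{\bar{\theta}}(s_{t+1}) \big],
\qquad
V_{\bar{\theta}}(s_{t}) \;=\; \mathbb{E}_{a_{t} \sim \pi} \big[ Q_{\bar{\theta}}(s_{t}, a_{t}) - \alpha \log \pi(a_{t} \mid s_{t}) \big]
\]

\[
J_{\pi}(\phi) \;=\; \mathbb{E}_{s_{t} \sim D} \left[ D_{KL}\!\left( \pi_{\phi}(\cdot \mid s_{t}) \,\Big\|\, \frac{\exp\!\big( Q_{\theta}(s_{t}, \cdot)/\alpha \big)}{Z_{\theta}(s_{t})} \right) \right]
\]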

The training process of SAC is described in Algorithm 1; the specific equations and derivations can be found in [37] and [38]. In this paper, SAC is used as the engine for both the centralized optimization and the local agent training in the proposed distributed framework.

Algorithm 1 SAC Training Engine
1: Initialize the Q-function parameters θ_1, θ_2, the policy parameter φ, and the target network parameters θ̄_1 ← θ_1, θ̄_2 ← θ_2
2: Initialize an empty replay buffer D
3: repeat
4: Observe state s_t and choose action a_t from the policy a_t ∼ π_φ(a_t|s_t)
5: Execute a_t in the environment
6: Observe the next state s_{t+1} and reward r_t
7: Store (s_t, a_t, r_t, s_{t+1}) in the replay buffer D
8: for each gradient step do
9: Update the Q-function parameters θ_i ← θ_i − λ_Q · ∇_{θ_i} J_Q(θ_i) for i = 1, 2
10: Update the policy parameter φ ← φ − λ_π · ∇_φ J_π(φ)
11: Adjust the temperature coefficient α with α ← α − λ · ∇_α J(α)
12: Update the target network parameters θ̄_i ← τ · θ_i + (1 − τ) · θ̄_i
13: end for
14: until convergence

The global model building and federated interaction proceed as follows. Firstly, the predefined indicators are sent and encoded as keys for sharing parameters. Then, the attention weights are obtained through query calculation. After that, the parameters are shared based on the calculated attention-based weights to build the global network through weighted summation. Subsequently, the global network broadcasts its parameters to the local agents, with which the local models update their parameters to prepare for the next round of local training. Finally, one interaction among the distributed local agents is completed through the federated learning framework.
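As a concrete illustration of this attention-weighted aggregation, the sketch below computes per-agent weights from a global query and agent keys and builds the global parameters as a weighted sum. The names (attention_weights, build_global_model, query, keys, local_params) and the fixed-length key/query encoding are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def attention_weights(query, keys):
    """Softmax over scaled dot-product scores between the global query
    and each local agent's key vector (one key per agent)."""
    keys = np.stack(keys)                        # shape: (n_agents, d)
    scores = keys @ query / np.sqrt(len(query))  # shape: (n_agents,)
    scores = scores - scores.max()               # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def build_global_model(local_params, query, keys):
    """Attention-weighted sum of local parameter sets -> global parameters.
    local_params: list of dicts {layer_name: ndarray}, one dict per agent."""
    weights = attention_weights(query, keys)
    global_params = {
        name: sum(w * params[name] for w, params in zip(weights, local_params))
        for name in local_params[0]
    }
    return global_params, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_agents, key_dim = 4, 8
    keys = [rng.normal(size=key_dim) for _ in range(n_agents)]  # per-agent indicators
    query = rng.normal(size=key_dim)                            # global query vector
    local_params = [{"actor/w": rng.normal(size=(6, 4))} for _ in range(n_agents)]

    global_params, weights = build_global_model(local_params, query, keys)
    print("attention weights:", np.round(weights, 3))
    # The global parameters would then be broadcast back to every local agent.
```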

The whole training procedure of the proposed distributed PEV fleet charging algorithm is described in Algorithm 2. At time slot t, each distributed agent k makes its charging decision a_k,t based on the local observation o_k,t and the local policy π_φ,t. After the actions are executed, the environment transitions to a new state under the joint action of all agents a_t = [a_1,t, a_2,t, . . . , a_N,t]. The local agent receives the real energy consumption broadcast by the EVCS and stores the prices and the total consumption in the replay buffer D_1,k. Meanwhile, the transition (o_k,t, a_k,t, r_k,t, o_k,t+1) is stored in buffer D_2,k. At the gradient update step, the collective behavior model P_σ,k is updated with the data in D_1,k, and the transitions in D_2,k are updated with the collective power consumption from D_1,k. Then the local agent trains its model with the updated transition data in D_2,k. After a certain number of iterations, the training of the local agents is suspended, and all the local agents jointly build the federated global model with the attention-based weights and the shared network parameters at each communication round. Subsequently, the global model broadcasts its network parameters to all distributed agents, based on which the local agents proceed with training in the upcoming iterations. Finally, the whole process iterates until convergence.

Algorithm 2 Proposed Multi-Agent Algorithm for PEV Fleet Charging Scheduling
1: Initialize the charging preference setting for each LA
2: Initialize the Q-value, advantage, and weights θ of the actions of each LA
3: Initialize the global server's model θ_G and the sharing batch [ψ_1, ψ_2, . . . , ψ_N]
4: Initialize two empty replay buffers D_1,k and D_2,k for each agent
5: Initialize the collective policy model P_σ,k for each agent
6: for communication round = 1, Max round do
7: for training episode j = 1, periodic training episode do
8: for each state transition step do
9: Get decision action a_k,t from policy a_k,t ∼ π_φ(a_k,t|s_k,t) for each agent
10: Execute action a_t = (a_1,t, a_2,t, . . . , a_N,t)
11: Obtain the next state s_{t+1} and reward r_k,t
12: Calculate the real energy consumption
13: Store the electricity price and real energy consumption in D_1,k
14: Store the transition (o_k,t, a_k,t, r_k,t, o_k,t+1) in replay buffer D_2,k
15: end for
16: for each gradient update step do
17: Update the weights of the collective policy model P_σ,k
18: Update the buffer D_2,k with the current P_σ,k
19: Calculate the loss of the value network, then the losses of the actor and critic networks
20: Update the Q-function parameters θ_i ← θ_i − λ_Q · ∇_{θ_i} J_Q(θ_i) for i = 1, 2
21: Update the policy parameter φ ← φ − λ_π · ∇_φ J_π(φ)
22: Adjust the temperature coefficient α with α ← α − λ · ∇_α J(α)
23: Update the target network parameters θ̄_i ← τ · θ_i + (1 − τ) · θ̄_i
24: end for
25: end for
26: Transmit the local model ψ_k to the GA; the GA stores all the local models from the LAs in the batch
27: If the batch is filled with all the local models, then:
28: Calculate the attention-based weight w_k with Q and K_k
29: Update the old global model with the new one: ψ_G ← Σ_k w_k · ψ_k
30: Broadcast the global model to all local agents
31: Replace the old local model for PEV charging with the updated global model: ψ_k ← ψ_G
32: end for

The real EV trip data of 10,000 cars are adopted from the UK National Travel Survey datasets [41]. As shown in Fig. 1, the randomness of travel demand is more significant on weekends than on weekdays. The distribution of EV charging events is fitted with kernel density estimation and resampled as the datasets of EV departure and arrival times. The residential community, assumed to have a maximum capability of 40 kWh, holds a neighborhood of 20 households, each with one EV. The Nissan Leaf is considered a typical EV prototype with a 24 kWh battery [44]. We assume that the travel plan of each EV follows the UK driving pattern based on the dataset mentioned above, with the same arrival and departure customs. Besides, we assume that each household PEV charger can shift continuously between charging and discharging modes with a rated charging power of 3.3 kW. The time resolution is defined as 60 min, with 24 time slots per day. The distributions of other variables are listed in Table 1.
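The kernel-density fitting and resampling of charging-event times described above can be sketched as follows; the arrays of observed arrival and departure hours are hypothetical placeholders, not the National Travel Survey data themselves:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical observed plug-in/plug-out times (hours of day) standing in for
# the trip-survey data; replace with the real extracted arrival/departure hours.
rng = np.random.default_rng(42)
arrival_hours = np.clip(rng.normal(loc=18.0, scale=1.5, size=1000), 0.0, 23.99)
departure_hours = np.clip(rng.normal(loc=7.5, scale=1.0, size=1000), 0.0, 23.99)

# Fit kernel density estimates to the empirical distributions ...
arrival_kde = gaussian_kde(arrival_hours)
departure_kde = gaussian_kde(departure_hours)

# ... and resample synthetic arrival/departure times for the 20 simulated PEVs.
n_ev = 20
sim_arrivals = np.clip(arrival_kde.resample(n_ev).ravel(), 0.0, 23.99)
sim_departures = np.clip(departure_kde.resample(n_ev).ravel(), 0.0, 23.99)

# Map to the paper's 60-min resolution (24 slots per day).
arrival_slots = sim_arrivals.astype(int)
departure_slots = sim_departures.astype(int)
print(arrival_slots)
print(departure_slots)
```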

The implementation of the DRL framework includes two main parts: the offline training and online application processes. The training process is critical for the neural network to learn sequential decision-making skills based on data derived from accumulated interaction with the environment. The training is conducted on a computer with an Intel Core i7-7700HQ CPU @ 3.80 GHz × 4. All the algorithms are implemented on the Python 3.6.10 platform, and the DRL-based algorithms are implemented using TensorFlow 1.15.0. Besides, the multi-agent environment is developed as a custom environment based on OpenAI Gym [45]. The hyperparameters of our SAC training engine are listed in Table 2, and the architectures of the actor and critic neural networks for each PEV agent are summarized in Table 3 and Table 4.

As shown in Figure 4, the episode reward exhibited a random search at the beginning, when insufficient experience had been learned from the limited interaction data. Then, the training curve went through fluctuations and gradually converged to a stable policy as the iterations increased. Besides, the algorithm losses displayed the same converging trends in Figures 5-7, where they initially went through random fluctuations until they fell into a stable convergence. This proves that the SAC engine is suitable for solving the PEV charging problem described in our paper.

As a comparison, the training performance of the proposed distributed FDRL algorithm is illustrated in Figure 8 and Figure 9. For convenience of illustration, we only display the episode reward curve for one typical distributed agent during the whole training process.

As shown in Figure 11, the PEV agent executes charging or discharging in response to the fluctuation of the electricity prices from 9 p.m. to 11 a.m. the next day. Before reaching the charging anxiety period, the agent manages to discharge during the high-price hours and charge when the price decreases, which means the driver is price-sensitive and prefers to participate in V2G for budget saving. Subsequently, the PEV driver goes through the anxiety period as the departure time approaches. To relieve the charging anxiety, the battery is required to be charged to a desirable capacity level for the next trip. Accordingly, the PEV agent chooses an eager charging strategy to get through the anxiety period and complete the charging task before the departure time. With the learned charging strategy, the agent has successfully managed the EV charging power by responding to the changing electricity price.

In comparison, the charging performance of another PEV with a different preference setting can be seen in Figure 12. In this charging scenario, the PEV arrived home at roughly 6 p.m. and planned to depart at around 7 a.m. on the following day. As illustrated, the charger always chooses charging actions while ignoring the remarkable fluctuation of the electricity prices. That means the PEV driver is not price-sensitive and prefers quick charging, keeping the battery energy at a high level in case of any unexpected travel event in the near future.

FIGURE 11. The charging strategy within a typical day for PEV agent I.

Theoretically optimal benchmark: the benchmark optimization approach is adopted from prior work in which it is used to solve a similar electric vehicle charging strategy optimization problem. Specifically, the benchmark algorithm is based on an accurate PEV scheduling model assuming perfect observation of the environment. At every time slot, the environmental information is fully known to each agent ahead of time, including future electricity prices, the drivers' behavior, and other uncertain factors. The problem is solved by the computation engine Cplex. Notably, the outcome is regarded as theoretically optimal but hard to realize.

Centralized SAC (CSAC): The benchmark DRL algorithm adopts a centralized framework embedded with SAC to consider all the PEV charging decisions with a single global agent, which deals with multiple observations and actions within one neural network. It is worth noting that the benchmark RL algorithm is conducted with the same reward functions as the proposed approach, except that the whole process is trained by a single NN. Besides, the overload penalty for each PEV agent is assigned centrally depending on its contribution to the overload incidents. In this framework, the global agent has the information of all PEVs to make the coordination. However, it is not easy to realize in practice.

Case studies based on real-world data demonstrate that the proposed distributed approach outperforms the benchmark centralized DRL algorithm in terms of providing a coordinated charging strategy. Besides, because direct sharing of local individual data is avoided, the privacy concern of data leakage is well addressed by the proposed method. Finally, the introduction of the attention-weighted technique improves the learning performance and efficiency of the FedAvgSAC method. Furthermore, the limitations of the proposed FDRL-based approach, in terms of its capability to accommodate more realistic situations and other applications in smart grids, will be further investigated in future work.