I. Introduction
Intelligent Transportation Systems (ITS) are expected to support the integration of electric vehicles (EVs) into the grid through advanced information and communication technologies (ICT) powered by AI mechanisms and vehicle-to-grid (V2G) schemes. Indeed, the prospect of a future ITS infrastructure that includes connected EVs and V2G is attracting significant attention from researchers. Part of this attention is devoted to AI-based mechanisms for integrating EVs with the smart grid to flatten the peak load [1], [2], [3], [4]. AI mechanisms such as reinforcement learning (RL) are considered particularly promising when EVs are modeled as ITS agents that can learn from the environment and perform actions to receive rewards [5]. The underlying concept of RL is that agents can be autonomous, i.e., their behavior is learned independently by interacting with the environment [6]. This distinctive attribute makes RL particularly suitable for demand response (DR) applications [7], [8], [9], specifically peak load shaving.
The future integration of ITS and smart grids gives utility companies the opportunity to engage EVs of all kinds to effectively reduce the peak load. In practice, however, it is not realistic to assume that EVs of all brands can communicate with each other to reach a coordinated decision; rather, each EV takes local actions based on a partial observation of the system [8]. It is therefore important that the employed mechanism does not need to observe the entire environment to reach a decision. Furthermore, centralized optimization mechanisms require full access to EV data to reach an optimized decision, which does not necessarily maximize the reward of the participating agents. As the number of EVs increases, the communication overhead and the computational complexity make finding an optimal solution intractable.
It is therefore more natural to treat the problem with a distributed mechanism in which each EV agent learns and makes decisions to maximize its own reward. Although distributed solutions may not achieve the optimal social welfare of the participating EVs, multi-agent reinforcement learning (MARL) allows the agents to learn cooperatively to maximize the reward function. This paper capitalizes on this fundamental idea and proposes a novel MARL system that schedules multiple EVs efficiently and effectively to reduce the peak load. The proposed MARL approach is based on the actor-critic framework for optimized day-ahead discharging scheduling and coordination of EVs.