Reinforcement Learning-Based Control of Signalized Intersections Having Platoons

Smart transportation cities are based on intelligent systems and data sharing, whereas human drivers generally have limited capabilities and imperfect traffic observations. Connected and Autonomous Vehicles (CAVs) perceive their surroundings through data sharing over Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communications, which improves driving behavior and reduces traffic delays and fuel consumption. This paper proposes a Double Agent (DA) intelligent traffic signal module based on Reinforcement Learning (RL). The first agent, the Velocity Agent (VA), aims to minimize fuel consumption by controlling the speed of platoons and single CAVs crossing a signalized intersection, while the second agent, the Signal Agent (SA), reduces traffic delays through signal sequencing and phasing. Several simulation studies have been conducted for a signalized intersection with different traffic flows, and the performance of the single agent with only the VA, the DA with both VA and SA, and the Intelligent Driver Model (IDM) are compared. It is shown that the proposed DA solution improves the average delay by 47.3% and the fuel efficiency by 13.6% compared to the IDM.


I. INTRODUCTION
Human-driven vehicles frequently experience sudden traffic changes that result in consecutive vehicle stops, known as "traffic oscillations". These oscillations have negative impacts such as increased safety risks and higher fuel consumption [1]. The lack of data sharing among drivers is one of the triggers of traffic oscillations. As shown in [2], vehicles on congested highways are forced to repeatedly decelerate and accelerate. Signalized intersections are another cause of traffic oscillations, as they organize the traffic flow by alternating between green and red phases, which results in a chain of stopped vehicles during red phases [3]. Variable Speed Limit (VSL) is one solution that regulates the moving speed using real-time traffic data. VSL is implemented in signalized intersections [4], though its performance depends on the compliance of drivers and the variance in vehicle dynamics [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Nuno Garcia.
The emergence of smart cities has created the need for smarter transportation systems that depend on Connected and Autonomous Vehicles (CAVs). Controlling CAVs as they travel through signalized intersections has been a focus of research as a way of improving transportation safety and efficiency by employing Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communications [6], [7]. Platoons of CAVs are even more promising: a platoon is a group of CAVs moving in the same direction, with a single CAV at the front as the "leader" and the succeeding CAVs as "followers", each maintaining a gap distance to the preceding vehicle while following the leader. CAVs are capable of lane changing using different controllers, such as the longitudinal and lateral controllers in [8] and the PID controllers in [9]. Lane changing is required for merging into platoons, as in [10], in which a non-platoon CAV is merged into a cooperative adaptive cruise-controlled platoon. Unmerging is also possible, as shown in [11], which includes merging and unmerging scenarios using distributed and consensus approaches. Platooning is intended to improve traffic management, shorten travel times, and enlarge traffic capacity [12]. The first main controller in intersections for the CAVs is responsible for the speed reference. In unsignalized intersections, different controllers for CAVs have been proposed that aim to optimize traffic by scheduling the crossing order of all CAVs. Autonomous intersection management is proposed in [13]; it splits the intersection into resources and assigns them to CAVs in a First-In-First-Out (FIFO) manner. The system was later modified in [14] to account for the dynamic information of all vehicle agents instead of simply applying FIFO.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In [15], a decentralized energy-optimal control framework is proposed for CAVs, and the approach is extended in [16] to include turns and account for the joint energy-time optimal solution. Multiple intersection scenarios are simulated in [17], [18] using the optimal control approach.
In signalized intersections, the trajectory optimization in [19] provides a smooth path for CAVs to cross the signals without stopping at red signals in static environments. An optimal eco-driving control is presented in [20], which employs a data-driven approach to account for the uncertainty in signal time phasing and uses dynamic programming for optimization. Lastly, RL-based velocity agents have been developed to locally control CAVs for avoiding obstacles [21] and risky behaviors [22]. In [23], a deep deterministic policy gradient RL-based approach is used to control CAV behavior, and it shows major improvements for various signal scenarios. It is also shown in [24], [25] that RL-based schemes can be utilized to improve traffic performance.
The second main controller is the traffic signal phasing and sequencing controller. The main aim of a traffic signal controller is to reduce the average delay of vehicles crossing an intersection and, consequently, to increase the traffic throughput. Traditional fixed signalized intersections perform poorly, especially under asymmetric traffic flows, so different methods have been developed to control the traffic signals intelligently in order to reduce human involvement, reduce traffic delays and congestion, and, most importantly, keep pace with the development of smart cities in terms of communication. Signal controllers based on fuzzy logic with neural networks are presented in [26], a model predictive controller in [27], and Reinforcement Learning (RL) controllers, which are considered a cheaper smart solution in India, in [28]. The large number of states in RL systems such as signalized intersections has motivated researchers to find generalization techniques such as linear function approximation in [29], state complexity reduction using self-organizing maps in [30], and deep learning in [31], [32].
This paper addresses the problem of combining two smart systems working simultaneously together in signalized intersections in smart cities, including platoons and single CAVs, as shown in Fig. 1, using the RL approach. Toward this goal, this study combines smart signal adaptation with CAVs speed controller to provide a full system and observe the results for a Double Agent (DA) setup working simultaneously. The DA setup decomposes into two agents. The first agent is the Velocity Agent (VA), which is responsible for providing the speed reference for CAVs and platoons inside the intersection. The second agent is the Signal Agent (SA), which is accountable for providing the change in traffic signal sequencing and time phasing for each traffic phase. In this paper, Reinforcement Learning (RL) is used as the main control scheme for both managing an intersection's traffic signals and specifying an optimum speed reference for each individual and platoon of CAVs crossing the intersection.
The main contribution of the paper is that the proposed RL method combines signal sequencing to minimize all vehicle delays and speed trajectory referencing to minimize fuel consumption, and it is shown that in comparison with the benchmark of the Intelligent Driver Model (IDM), the proposed solution has significant improvement in both decreasing the vehicle delays and fuel consumption. To the best of the authors' knowledge, the simultaneous design of both CAVs speed control and traffic signal phasing control has not been considered in the literature, and previous works only investigated each problem separately.
The rest of this paper is organized as follows. Section II introduces the problem formulation, and Section III presents the proposed double agent RL-based methodology. Section IV provides the training procedures for both the velocity and signal agents. Section V demonstrates the performance of the DA solution, and Section VI shows the results of the single-agent (VA only) and DA setups while comparing them with the IDM benchmark in various scenarios. Finally, the conclusion and future research directions are provided in Section VII.

II. PROBLEM FORMULATION
A signalized intersection is a solution to prevent potential crashes and organize the traffic flow. Traditional fixed signal timing performs poorly under high, unsymmetrical traffic flow, resulting in large delays. In addition, introducing CAVs and platoons into smart cities requires speed referencing so that they can cross the signalized intersection without stopping and thereby minimize fuel consumption. These two problems can be solved by adopting a smart signalized intersection that uses V2I and V2V communications to provide the speed trajectory for the CAVs and platoons together with the signal sequencing and time phasing of the traffic signals. In this work, the Mcity environment presented in [33] and [34] is considered; it is an intelligent transportation city formulated for CAVs using different control methodologies and communication modes such as V2V and V2I. The model has been modified to account for platoons and is reviewed in this section. Table 1 lists the notation used in this paper for reference.

A. ENVIRONMENT MODEL
The intersection consists of four road segments: West (W), South (S), East (E), and North (N). Each segment is represented by a single lane. The traffic signal-enabled directions associated with this design are W-E and S-N. The intersection has a Control Zone (CZ), which is the whole area covered by the smart signal controller, referred to as the "Signal-Coordinator"; the CZ lane length is $R$. The central square area of the intersection is called the Merging Zone (MZ), with width $S$. It is the area of possible lateral platoon-CAV collisions, and its traffic flow is controlled by the traffic signals. The entry position of the $i$-th CAV at the Entry Point (EP) is denoted as $x_i = 0$. The overall model is presented in Fig. 2(a). Let $M(t) \in \mathbb{N}$ be the overall number of CAVs that have entered the CZ, following a first-in-first-out queue system. Before a platoon enters the intersection, the agent employs I2V communication with the platoon leader and assigns the platoon a unique integer ID $i = M(t) + 1$; $M(t)$ is then updated by adding $N_i$, the number of CAVs in the $i$-th platoon, as $M(t) = M(t) + N_i$. Hereafter, the integer $i_p$ denotes the platoon preceding platoon $i$ in the same lane, as shown in Fig. 2(b). The system deploys V2V and V2I communication as shown in Fig. 2(b): the platoon leader continuously communicates with the signal coordinator over V2I to send its vehicle information and receive a speed reference, while V2V is used continuously when two platoons are in the same lane to prevent potential accidents.
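The ID bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name `SignalCoordinator`, the method `register_platoon`, and the simplification that the preceding platoon $i_p$ is the most recently registered platoon in the same lane are all hypothetical.

```python
class SignalCoordinator:
    """Hypothetical bookkeeping for platoon IDs in the control zone."""

    def __init__(self):
        self.m = 0              # M(t): total number of CAVs that entered the CZ
        self.last_in_lane = {}  # lane -> ID of the most recent platoon (assumption)

    def register_platoon(self, lane, n_cavs):
        """Assign ID i = M(t) + 1, then update M(t) <- M(t) + N_i."""
        platoon_id = self.m + 1
        self.m += n_cavs
        preceding_id = self.last_in_lane.get(lane)  # i_p, or None if lane was empty
        self.last_in_lane[lane] = platoon_id
        return platoon_id, preceding_id
```

For example, registering a 3-CAV platoon in lane W and then a 2-CAV platoon in lane S yields IDs 1 and 4, because the second ID is offset by the size of the first platoon.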

B. PLATOON MODEL
One of the important aspects in controlling the speed of CAVs is the safety distance between vehicles. In general, the safety distance depends on the speed of the succeeding vehicle and is expressed in terms of time. Specifically, the 2-second rule applies to vehicles traveling at a speed below 12.5 m/s, whereas the 3-second rule applies to all vehicles with no speed limit [35]. The dynamics of a platoon of CAVs are assumed to be optimal, hence a platoon is represented by a length attribute, where each vehicle's average length is denoted as $V_L$, the gap between two vehicles is $G$, and $N$ is the number of CAVs in the platoon (see Fig. 3). Accordingly, the length $L_i$ of platoon $i$ can be continuously calculated as in (1), where $G_i = 2v_i + S_0$ is a variable gap that depends on the velocity of the platoon $v_i$ and is calculated to maintain the 2-second rule with a minimum gap distance $S_0$. Based on (1), each platoon can be represented by one long vehicle, where the followers in each platoon follow the leader using the 2-second rule.
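As a rough illustration of the length attribute, the sketch below computes the gap $G_i = 2v_i + S_0$ stated in the text and a platoon length. The length formula used here, $N$ vehicle lengths plus $N-1$ gaps, is an assumption made for this example; the exact form of (1) is not reproduced in this text. The function names are hypothetical.

```python
def platoon_gap(v, s0, headway=2.0):
    """Velocity-dependent gap G_i = 2 * v_i + S_0 (2-second rule plus minimum gap)."""
    return headway * v + s0


def platoon_length(n, v, vehicle_length, s0):
    """Assumed length: N vehicles of length V_L plus N - 1 gaps between them."""
    return n * vehicle_length + (n - 1) * platoon_gap(v, s0)
```

Under this assumption, a 3-CAV platoon at 10 m/s with $V_L = 5$ m and $S_0 = 2$ m occupies 3 * 5 + 2 * 22 = 59 m, which shows how strongly the represented length grows with speed.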

C. INTELLIGENT DRIVER MODEL
The Intelligent Driver Model (IDM) will be used for comparison purposes and is illustrated by Treiber in [36].
IDM is used to model human driving behavior according to (2), where the instantaneous velocity and position of the ego vehicle are $v_i$ and $x_i$, respectively, $v_f$ is the maximum road velocity, $S_0$ is the minimum distance gap between vehicles, $S_i^*$ is the desired gap between the ego vehicle and the preceding vehicle, and $T$ is the safe headway time, which depends on the reaction time of the driver. $u_{max}$ and $u_{min}$ are the highest and lowest accelerations of the vehicle. The distance between the ego vehicle and the preceding vehicle is given by (3). The resulting $u_i$ is the acceleration computed from $S_i^*$. Note that the same notation applies to the platoon leader.
Equations (2) and (3) are applied at signalized intersections through four cases. Case 1) There is no preceding vehicle and the light is green, in which case we set $S_i^* = 0$. Case 2) There is no preceding vehicle but the light is red, in which case we substitute $v_{i_p} = 0$ in (2) and $V_L = 0$ in (3). Case 3) A preceding vehicle exists and either the light is green or the vehicle has already passed the signals; this case uses (2) and (3) as they are. Case 4) A preceding vehicle exists and the light is red. This case combines Case 2 and Case 3: the calculations of both cases are evaluated, and the reference acceleration is the maximum acceleration result of the two evaluations.
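For readers who want to experiment with the benchmark, the following is a minimal sketch of the commonly used IDM acceleration law. It assumes the standard Treiber form with acceleration exponent `delta = 4`, uses $|u_{min}|$ as the comfortable deceleration, and clamps the dynamic part of the desired gap at zero, which is a common implementation choice. None of these details is confirmed by the text above, and the function name and argument names are hypothetical.

```python
import math


def idm_acceleration(v, v_f, u_max, u_min, s0, T, gap=None, v_lead=0.0, delta=4.0):
    """Standard IDM acceleration; delta=4 is an assumption, not from the paper.

    gap=None corresponds to Case 1 (free road, desired-gap term set to zero).
    Passing the distance to the stop line as gap with v_lead=0 mimics Case 2.
    """
    free_term = 1.0 - (v / v_f) ** delta
    if gap is None:
        return u_max * free_term
    # Desired gap S*: minimum gap plus headway and braking-interaction terms.
    dynamic = v * T + v * (v - v_lead) / (2.0 * math.sqrt(u_max * abs(u_min)))
    s_star = s0 + max(0.0, dynamic)
    return u_max * (free_term - (s_star / gap) ** 2)
```

In this sketch a vehicle at rest on a free road accelerates at `u_max`, and the interaction term can only reduce the acceleration relative to the free-road value.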

III. METHODOLOGY
The proposed traffic signal scheme employs a reinforcement learning approach to construct a DA system. The DA consists of the VA and the SA. The SA controls the traffic signal phasing, and the VA sends a speed reference to the platoon leader. This section contains four subsections: A) Reinforcement Learning Background, B) Velocity Agent, C) Signal Agent, and D) Override System.

A. REINFORCEMENT LEARNING BACKGROUND
Reinforcement learning (RL) is the process of training machine learning models to make a sequence of decisions in a specific environment. The RL agent learns through trial and error to attain a goal and find the optimal sequence of decisions. The basic theory of reinforcement learning is shown in Fig. 4: at each time step $t$, the agent gets an observation $s_t$ from the environment's state space $S$. The agent then determines the next action $a_t$ from the action space $A$ based on the state $s_t$ and applies it to the environment $E$. The environment responds to this action with a transition to a new state $s_{t+1} \in S$, for which the agent receives a reward $r_{t+1}$.

1) Q-LEARNING
Q-Learning (QL) is an algorithm widely used in reinforcement learning for finding the optimal policy $\pi^*$ by maintaining a Q-matrix of state-action pairs, denoted as $Q(s, a)$, which contains the action value of each action in each state. The action value is an estimate of the future rewards that will be collected if this particular action is taken. $Q(s_t, a_t)$ is estimated from multiple updates performed at time step $t + 1$, after receiving $r_{t+1}$ from performing action $a_t$ in state $s_t$, according to

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \qquad (6)$$

where $\alpha$ is the step size and $\gamma$ is the discount factor of the future rewards. The actions are chosen using policies such as $\varepsilon$-greedy, where at each state the agent performs, with probability $\varepsilon$, an action that does not have the highest Q-value, and with probability $1 - \varepsilon$ is greedy and chooses the action with the highest Q-value.
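The tabular update and the $\varepsilon$-greedy rule can be sketched as follows. This is a generic illustration rather than the paper's code: the Q-table is stored as a list of lists, and the function names `epsilon_greedy` and `q_update` are hypothetical.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon pick a non-greedy action, otherwise the greedy one."""
    greedy = max(range(len(q_row)), key=q_row.__getitem__)
    if len(q_row) > 1 and rng.random() < epsilon:
        others = [a for a in range(len(q_row)) if a != greedy]
        return rng.choice(others)
    return greedy


def q_update(q, s, a, reward, s_next, alpha, gamma):
    """One tabular Q-learning step: move Q(s, a) toward the TD target by alpha."""
    td_target = reward + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]
```

With `alpha = 0.5`, `gamma = 0.9`, a reward of 1, and a best next-state value of 2, an initial value of 0 moves to 0.5 * (1 + 1.8) = 1.4.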

B. VELOCITY AGENT
The VA is the main agent in the system and holds all the information about the vehicles in the system. As shown in Fig. 5, once a platoon is at the EP, the leader receives its unique identification number $i$ and sends the number of CAVs in the platoon, $N_i$, and its lane to the VA. The velocity agent then continuously sends a reference speed $v_r$ according to the platoon's current speed and position, using RL with a time step $t_v$. The state, action, and reward definitions of the VA are elaborated in this section. Because of the constant flow of information inside the VA, it also computes the signal state, which is illustrated in the second part of this section.

1) VELOCITY AGENT STATE DEFINITION
The VA receives information from all platoons inside the CZ. The state of the VA is a 4-element vector in which $X_{v,1}^i$ refers to the signal light status of the current platoon's lane, i.e., whether it is green or not, and $X_{v,2}^i$ represents the golden binary state, which evaluates whether the platoon is able to pass, as follows:

$$X_{v,2}^i = \begin{cases} 1, & t_i < t_{left}^{lane_i} \text{ and the light is green} \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

where $t_i$ is the time required for the last vehicle inside platoon $i$ to arrive at the MZ, $t_{left}^{lane_i}$ represents the time left until the signal status of the platoon's lane $lane_i \in \{W-E, S-N\}$ switches, and $x_i$ is the current position of the platoon leader.
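The golden state check can be illustrated with a short sketch. The binary rule follows the definition in the text. The expression used for $t_i$, the remaining distance of the leader to the MZ plus the platoon length divided by the current speed, is an assumption made for this example, as are the function names and the convention that the MZ starts at a distance equal to the CZ lane length from the EP.

```python
def time_to_merge_zone(x_leader, platoon_len, v, cz_length):
    """Assumed t_i: time for the last vehicle to reach the MZ at constant speed v."""
    if v <= 0:
        return float("inf")
    return (cz_length - x_leader + platoon_len) / v


def golden_state(t_i, t_left, light_is_green):
    """Binary golden state: 1 if t_i < t_left and the light is green, else 0."""
    return 1 if (light_is_green and t_i < t_left) else 0
```

For instance, a 50 m platoon whose leader is 100 m from the MZ at 10 m/s needs 15 s under this assumption, so it is in the golden state only if more than 15 s of green remain.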

2) VELOCITY AGENT ACTION-REWARD DEFINITIONS
The VA action definition $A_v$ is simply the linear distribution between 0 and the road speed $v_f$, i.e., $A_v = [1, 2, \ldots, (v_f + 1)]$, mapped to the velocity reference $v_r = a_v - 1$, where $a_v \in A_v$ is the chosen action. The reward system is a normalized weighted sum of different rewards: the reward based on the platoon's velocity, $r_{v,1}$, with weight $w_1$; the reward for reaching the golden state, $r_{v,2}$, with weight $w_2$; and the rewards for whether the platoon crossed a green or a red signal, $r_{v,3}$ and $r_{v,4}$, with weights $w_3$ and $w_4$, respectively. The velocity reward $r_{v,1}$ is calculated with a reward exponent factor $m$ that controls the exponential level of the reward for different velocities. The weights, shown in Table 2, are modified according to the position of the platoon leader in the intersection. In the first half of the lane, the reward gives all the weight to the velocity, since the golden state is either achievable with the maximum speed or unachievable in the case of a green signal phase and a long distance to the MZ. After the halfway point, the golden state has more weight, as the main goal of the agent is to find the golden state with the minimum delay, in other words with the highest possible speed. Moreover, a reward is added for crossing a green signal and a penalty for passing a red signal; each has weight 1, since the platoon can pass either a green or a red signal. Finally, the reward function can be expressed as $R_s^i = \sum_{j=1}^{4} r_{v,j} w_j$.
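The action mapping and the weighted reward sum can be sketched as below. The mapping $v_r = a_v - 1$ and the four-term weighted sum come from the text; the function names are hypothetical, and any weight values passed in are placeholders, since the entries of Table 2 are not reproduced here.

```python
def action_space(v_f):
    """A_v = [1, 2, ..., v_f + 1] for an integer road speed v_f."""
    return list(range(1, int(v_f) + 2))


def velocity_reference(a_v):
    """Map a chosen action to a speed reference: v_r = a_v - 1."""
    return a_v - 1


def weighted_reward(rewards, weights):
    """Sum of r_{v,j} * w_j over the four reward components."""
    if len(rewards) != 4 or len(weights) != 4:
        raise ValueError("expected four reward components and four weights")
    return sum(r * w for r, w in zip(rewards, weights))
```

The offset by one means that action 1 requests a full stop and the last action requests the road speed $v_f$.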

C. SIGNAL AGENT
In this paper, an RL agent is also used to control the traffic signal sequencing. Since this system is built for platoons of CAVs in the intersection, the signal timing must be known in advance by the platoon leader to determine whether the platoon can pass within the green phase; the VA uses this information to find the optimum reference speed when the platoon cannot pass. This leads to the action definition $A_s$ as a fixed-sequence, variable-timing set of {10 s, 15 s, 20 s, 25 s}. The traffic signals therefore alternate between S-N and W-E according to the action $a_s \in A_s$. The SA state vector is composed of five elements. The first element is the current signal phase, i.e., whether S-N or W-E is green. The remaining elements represent the number of vehicles in the CZ that have not yet reached the MZ. We split each lane into two independent areas and count the vehicles inside each area: the first area is the 35% of $R$ nearest the MZ, and the second area is the remaining part, which mostly contains the vehicles that entered the CZ recently. We chose this proportion because the VA is expected to reach its minimum speed in the first area. Therefore, the second and third elements of the state vector are the numbers of vehicles in the W/E first and second areas, respectively, and the fourth and fifth elements are the numbers of vehicles in the S/N first and second areas, respectively. Lastly, the reward is the negative sum of the delays of all platoons that have not yet crossed the MZ. The time step of the actions is $t_s$, a variable that equals the action $a_s$ taken at the previous time step $(t-1)$.
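A possible construction of the five-element SA state is sketched below. It assumes that vehicle positions are measured from the EP, that the MZ begins at a distance equal to the CZ lane length, and that the first area is therefore the last 35% of the lane before the MZ. The function name and the tuple layout are hypothetical.

```python
def signal_state(we_green, we_positions, sn_positions, cz_length, near_fraction=0.35):
    """Five-element SA state: phase flag plus near/far vehicle counts per direction.

    Positions are distances from the entry point of vehicles that have not yet
    reached the MZ (assumed to start at x = cz_length).
    """
    boundary = (1.0 - near_fraction) * cz_length

    def counts(positions):
        near = sum(1 for x in positions if boundary <= x < cz_length)
        far = sum(1 for x in positions if 0 <= x < boundary)
        return near, far

    we_near, we_far = counts(we_positions)
    sn_near, sn_far = counts(sn_positions)
    return (1 if we_green else 0, we_near, we_far, sn_near, sn_far)
```

Counting only two areas per direction keeps the Q-table small, which matters for a tabular method whose state space grows with the product of all element ranges.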
The delay of the $i$-th platoon is calculated by evaluating the expected arrival time at the EP as in (11), where $v_e^i$ is the entry velocity of platoon $i$, and then comparing it with the new expected arrival time at each time step $t_s$. The delay is then multiplied by $N_i$ to obtain the reward function $R_s = -\sum_{i=j}^{z} N_i (t_a^i - t_n^i)$, where $j$ and $z$ are the earliest and latest platoon IDs in the CZ that have not yet entered the MZ.
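The delay-based reward can be sketched as follows. The text states that the reward is the negative sum of platoon delays weighted by $N_i$. The free-flow estimate of the arrival time used here (CZ length divided by entry speed) and the sign convention that delay equals the current estimate minus the entry estimate are assumptions for this example, and the function names are hypothetical.

```python
def expected_arrival(entry_time, entry_speed, cz_length):
    """Assumed free-flow arrival time at the MZ, estimated at the entry point."""
    return entry_time + cz_length / entry_speed


def signal_reward(platoons):
    """Negative sum of N_i * delay_i over platoons that have not reached the MZ.

    Each platoon is a tuple (n_cavs, eta_at_entry, eta_now).
    """
    return -sum(n * (eta_now - eta_entry) for n, eta_entry, eta_now in platoons)
```

Weighting by $N_i$ makes a delayed 3-CAV platoon cost three times as much as a delayed single CAV, so the agent is pushed to serve larger platoons first.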

D. OVERRIDE SYSTEM
An override system for the VA is implemented for three main reasons: 1) to maintain a safe distance to the preceding vehicle, 2) to assist the VA in finding the golden state faster, and 3) to ensure that no red signal is passed. The override system is built on the condition that if the golden state is not achieved after passing 75% of the control zone length $R$, the speed keeps being reduced until the golden state is achieved. Furthermore, to maintain a safe gap distance, the 2-second rule is tested continuously; once the test fails, the platoon leader sets its own speed to follow the speed of the preceding vehicle using V2V communication. Lastly, the system ignores any accelerating speed action from the VA if the platoon will not be able to pass a green signal. To summarize, the override system takes control from the VA whenever any of the following occurs:
1. Two consecutive CAVs break the 2-second rule.
2. A red signal crossing might occur.
3. The golden state is not achieved after crossing a distance of $\frac{3}{4}R$ from the EP.
How often the override system takes control ranges from 0% to 70%, depending on the lane traffic and the episode scenario (traffic signal phasing).
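The three override rules can be sketched as a single decision function. This is a simplified illustration under stated assumptions: the order in which the rules are checked, the fixed speed decrement `dv`, the time-gap test `gap < headway * v_now`, and all names are hypothetical choices, not details given in the text.

```python
def override_speed(v_ref, v_now, golden, x_leader, cz_length, can_pass_green,
                   gap=None, v_preceding=None, headway=2.0, dv=1.0):
    """Return the speed command after applying the three override rules."""
    # Rule 1: 2-second rule violated, so follow the preceding vehicle's speed.
    if gap is not None and v_preceding is not None and gap < headway * v_now:
        return v_preceding
    # Rule 3: no golden state beyond 3/4 of the CZ, so keep reducing the speed.
    if not golden and x_leader > 0.75 * cz_length:
        return max(v_now - dv, 0.0)
    # Rule 2: ignore accelerating actions when the green phase cannot be caught.
    if not can_pass_green and v_ref > v_now:
        return v_now
    return v_ref
```

When no rule fires, the function returns the VA's reference unchanged, so the override acts only as a safety layer on top of the learned policy.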
The full communications between the proposed DA intersection controller and the platoons and traffic signals are shown in Fig. 5. Note that the light status is the current signal phase status and $v_r$ is the velocity reference sent to the platoon leader from the VA set of actions.

IV. TRAINING SETTINGS
In this paper, we use Q-Learning to train the DA, as reviewed in Section III-A.

A. VELOCITY AGENT TRAINING
The VA is trained first, under the following environment specifications: 1) The step size is $\alpha_v$ and the discount factor is $\gamma_v$. 2) The signal phasing consists of random actions drawn with equal probability from $A_s$. 3) There is a single platoon per lane, so the environment is accident-free. 4) The entry velocity $v_e$ is constant and equal to $v_f$. 5) The number of vehicles in a platoon, $N_i$, is random in $[N_{min}, N_{max}]$ with equal probability. 6) The actions are chosen using the $\varepsilon$-greedy policy. The training is divided into four phases and proceeds as follows:
Step 1: The value of $\varepsilon$ is set to $\varepsilon_s$ for $n_{v1}$ episodes (1st phase) and then decays with a learning rate $l_v$ for $n_{v2}$ episodes until $\varepsilon$ reaches approximately zero (2nd phase).
Step 2: The override system is activated for n v3 episodes (3 rd phase).
Step 3: The override system is deactivated for $n_{v4}$ episodes and the system is trained in the final stage with $\varepsilon = 0$ (4th phase).
An episode starts when the platoon enters from the EP and ends when it enters the MZ. Each training step is essential: the first step ensures that the agent has enough initial exploration of all the actions in each state, and the second step helps the agent exploit more potentially optimal actions and puts it on the right path. Lastly, the third step ensures that, after the override system is removed, any bad greedy actions are penalized and their Q-values reduced. The values of $n_{v1}$, $n_{v2}$ and $n_{v3}$ are chosen through trial and error. The training results are shown in Fig. 6, where the average reward per step is calculated by dividing the total reward of the episode by the number of steps in the episode. The average reward per step is used because episodes differ randomly for each vehicle due to the randomness of the signal phases, which also results in large fluctuations in the training. Initially, with a high value of $\varepsilon$, the agent crosses red signals in some episodes, which leads to a significant drop in the average reward, to −0.2846. After Step 3, the drops are eliminated, the reward averages 0.7021, and no red signal crossing is detected.
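The four-phase schedule above can be sketched as two small helper functions. The multiplicative form of the decay and the choice of $\varepsilon = 0$ during the override phase are assumptions for this illustration; the text only states that $\varepsilon$ decays to approximately zero and is zero in the final phase. The function names and the `decay` parameter are hypothetical.

```python
def epsilon_schedule(episode, eps_start, n_v1, n_v2, decay):
    """Epsilon for a 0-based episode index across the training phases.

    Phase 1: constant eps_start for n_v1 episodes.
    Phase 2: multiplicative decay for n_v2 episodes (assumed decay form).
    Phases 3 and 4: fully greedy (epsilon = 0, assumed for phase 3).
    """
    if episode < n_v1:
        return eps_start
    if episode < n_v1 + n_v2:
        return eps_start * decay ** (episode - n_v1 + 1)
    return 0.0


def override_active(episode, n_v1, n_v2, n_v3):
    """The override system is switched on only during phase 3."""
    start = n_v1 + n_v2
    return start <= episode < start + n_v3
```

Separating the exploration schedule from the override switch makes it easy to try other phase lengths, which the text notes were tuned by trial and error.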

B. SIGNAL AGENT TRAINING
Once the VA is optimized and tested, the SA must be optimized to handle different traffic flows from all directions. The SA training is done through repeated training sessions, where a training session is a session of constant traffic flow for $n_{sa}$ steps.
The following SA training assumptions are made:
1) The SA step size is $\alpha_s$ and its discount factor is $\gamma_s$.
2) The entry velocity $v_e$ is constant and equal to $v_f$.
3) Traffic flows from each direction are constant within each training session.
4) The normalized reward, shown in Fig. 7, is the negative sum of delay divided by the number of delayed vehicles at each step.
The same training steps are repeated for different traffic flows. In Fig. 7, which shows the training for the Case-1 traffic flow in Table 3, bad actions lead to large delays and cause massive spikes, while by the end of the training these spikes are eliminated.
The signal agent is trained for urban intersections with a maximum traffic flow of 700 veh/h for each direction. However, small spikes remain, since the normalized reward considers only the vehicles that are being delayed, not all the vehicles in the CZ. The values of $S_0$, $A$ and $T$ are obtained from [36] and are shown in Table 3 together with the other values used for training both the velocity and signal agents.

V. SYSTEM PERFORMANCE
We use the modified Mcity model as the MATLAB/Simulink environment for measuring the performance. The model is expected to work under various traffic demands from all directions. $N_i$ is generated in $[N_{min}, N_{max}]$ with equal probability. In this section, we look closely at one scenario, Case I from Table 4.
The simulation trajectories of the CAV platoon positions are plotted against time for the double agent in the W, S, E, and N lanes in Figs. 8, 9, 10, and 11, respectively. The platoons have learned to reduce their speed in advance, before reaching the MZ (at 400 m), to avoid traffic oscillations. It can also be noticed that the signal agent changes the traffic signal phasing according to the traffic demand: it provides more green phases to the W-E direction than to the S-N direction, as W-E has higher traffic. The average velocity of all vehicles in the four lanes that have not yet crossed the MZ is shown in Fig. 12. The minimum gap of each platoon with respect to the preceding platoon is continuously recorded and updated to ensure that operation is accident-free and satisfies the $S_0$ constraint, as shown in Fig. 13. One limitation of the DA is that the SA only seeks to minimize the total delay over all lanes. Giving longer signal phases to the lanes with higher traffic flow can trigger a drop in the average speed of the low-traffic lanes, to which the SA gives a very short phase. This can be seen in Fig. 12 for the North/South lanes at time = 25 s, where the platoon was forced to reduce its speed due to the short signal phase; since it was the only platoon in the lane at that time (Fig. 9 and Fig. 11), the average speed equaled the platoon's significantly reduced speed. However, in most scenarios this should not trigger a traffic oscillation, because it happens in low-traffic lanes, unless there is a succeeding vehicle that is forced to stop and reach the minimum gap $S_0$, as shown for several platoons in Fig. 13.

VI. SYSTEM EVALUATION COMPARISON WITH BENCHMARK
In this section, three different systems are evaluated: 1) the main benchmark, which is the IDM with fixed signal timing; 2) the single agent, which is the developed VA with fixed signal timing; and 3) the DA setup, which combines the velocity and signal agents.
To choose a suitable fixed signal period for comparison with our system, we analyzed the options available in $A_s$ with IDM vehicles for the four cases described below; each case was run with each fixed time for 1 hour of simulation, and the results are shown in Table 5. The best-performing time was 25 s in Case I, Case II, and Case IV, whereas 10 s was sufficient for Case III. Since alternating between the best timings would itself constitute a smart system, we compare against a single constant period and chose the 20-second phasing as the average best action. The traffic signals have a 3-second yellow light between switches. Four traffic flow scenarios are simulated, as shown in Table 4: Case I: unsymmetrical, considerably high traffic flow; Case II: symmetrical, considerably high traffic flow; Case III: unsymmetrical low traffic flow; and Case IV: extremely unsymmetrical traffic flow.
The comparison among the different solutions focuses mainly on the average delay and the average fuel consumption. The simulation is conducted for each scenario for 1 hour and the corresponding results are presented in Table 6. The delay is calculated by subtracting the estimated arrival time $t_{a,i}$ at the entry point from the actual arrival time at the MZ, and the fuel consumption is accumulated, as briefly illustrated in the previous sections, until the MZ is passed. The improvement is measured relative to the IDM system as the benchmark. Table 6 shows that the VA and DA perform better than the IDM approach in three out of four cases. In Case III, the traffic flow is low and symmetric as well, which is the perfect scenario for IDMs with fixed signal timing. However, it is not a practical scenario, as most urban intersections have unsymmetrical traffic flows. In the remaining cases, the single and double agents outperform the IDM approach with a significant improvement in delay and fuel efficiency, for two main reasons: I) the platoons are more efficient in maximizing the throughput of the green signals, since more CAVs pass the green signal together when they arrive as a platoon, and II) the velocity agent eliminates the traffic oscillations generally caused by human-driven vehicles, which cause large delays and extra fuel consumption. Fig. 14 shows the traffic oscillations caused by the IDM approach. Fig. 15 and The average speeds of the IDM, the single agent, and the DA are plotted in Fig. 17, Fig. 18 and Fig. 19, respectively. We can see that the average speeds of the single agent (Fig. 18) and the DA (Fig. 19) do not show the limitation of short phases mentioned earlier, which eliminates the sharp drops in the average speed. This is mainly due to the high symmetrical traffic flow, for which the single-agent signal phases are always 20 seconds and the DA signal phases reach 25 seconds in many states.
Averaged over the four cases, the single agent improves the delay by 43.7% and the fuel consumption by 8.8% compared to the benchmark. Implementing the DA system leads to more efficient results, especially in the unsymmetrical traffic flow cases, which are the most practical scenario. The DA improves the average delay by 47.3% and is on average 13.6% more fuel-efficient than the benchmark.

VII. CONCLUSION
In this study, we proposed a double-agent RL scheme based on Q-Learning to control platoons of CAVs at signalized intersections in future smart cities. The training of the double agent is performed in a decentralized manner, where the velocity agent is trained first and then executed while the SA is trained. There are two main improvements in the presented system: 1) The first improvement is a reduction of the average delay of CAVs passing urban intersections, with an average improvement of 47.3%.
2) The second improvement is an average fuel efficiency gain of 13.6%, which is critical to consider in the long term. Two main points are drawn from the results. The first is that introducing platoons in higher traffic flows can effectively reduce fuel consumption and average delays. The second is that the SA can adapt to both symmetrical and unsymmetrical traffic flows, which is highly needed. In the proposed design, the override system plays a major role in assisting the VA to avoid potential accidents and find the golden state. Our aim in the future is to add neural networks to both agents in order to remove the override system from the VA, making it a fully smart system, and to reach a better policy in the SA. We also plan to simulate traffic conflicts for accident detection.

ACKNOWLEDGMENT
The findings achieved herein are solely the responsibility of the authors. The open-access publication of this article was funded by Qatar National Library.