Learning from Oracle Demonstrations – A new approach to develop Autonomous Intersection Management control algorithms based on Multi-Agent Deep Reinforcement Learning

Worldwide, many companies are working towards safe and innovative control systems for Autonomous Vehicles (AVs). A key component is the Autonomous Intersection Management (AIM) system, which operates at the level of traffic intersections and manages the right-of-way of AVs, improving flow and safety. AIM traditionally relies on control policies based on simple rules. However, Deep Reinforcement Learning (DRL) can provide advanced control policies, with the advantage of reacting proactively and forecasting hazardous situations. The main drawback of DRL is training time, which is short for simple tasks but far from negligible when addressing real-world problems with multiple agents. Learning from Demonstrations (LfD) emerged to solve this problem, speeding up training significantly and reducing the exploration problem. The challenge is that LfD requires an expert from which to extract new demonstrations. Therefore, in this paper, we propose to use an agent, previously trained by imitation learning, to act as an expert and thus leverage LfD. We name this new agent the Oracle, and our new approach Learning from Oracle Demonstrations (LfOD). We have implemented this novel method on top of the TD3 DRL algorithm, incorporating significant changes to TD3 that allow the use of Oracle demonstrations. The complete version is called TD3fOD. The results obtained in the AIM training scenario show that TD3fOD notably improves the learning process compared with TD3 and DDPGfD, speeding up learning by 5-6 times, while the policy found offers both significantly lower variance and better learning ability. The testing scenario also shows a relevant improvement in multiple key performance metrics compared to other vehicle control techniques for AIM, such as reducing waiting time by more than 90% and significantly decreasing fuel or electricity consumption and emissions, highlighting the benefits of LfOD.


I. INTRODUCTION
Deep Reinforcement Learning (DRL) has demonstrated a remarkable ability to master several complex real-world sequential decision-making problems [1]-[6]. However, this success is currently limited due to the extensive training process required by traditional DRL algorithms, which need days to months of training and tens or hundreds of graphics cards in parallel [7], [8].
A field of work where DRL has been extensively studied is the control of Autonomous Vehicles (AVs). This field pairs naturally with computer simulators, which provide an excellent framework for developing new advanced control systems. To be as efficient as possible, the management of these AVs in cities should be done collectively, with centralized information. Such centralized control would make it possible to obtain an intelligent system capable of controlling all AVs simultaneously, in real time, while guaranteeing a high degree of safety.
The study of Autonomous Intersection Management (AIM) [9], [10] began a few years ago, even before the large-scale deployment of the first AVs. However, these AIMs based their operation on simple rules and were unable to achieve advanced control policies with truly intelligent and proactive behavior. That was the scenario until the development of RAIM [11], an AIM whose operation is based on DRL along with other advanced deep learning techniques. However, the main problem with this approach is that it needs many interactions with the environment to achieve good performance.
Recently, researchers have been studying new alternatives to speed up the training of DRL algorithms, either by imitating a behavior (Imitation Learning, IL [12]), imitating observations (Imitation from Observation, IfO [13]), or using an initial phase of supervised learning on demonstrations (tuples of {state, action, new_state, reward}) offered by an expert, followed by progressive training of the pre-trained policy using DRL (Learning from Demonstrations, LfD [14]). Several previous efforts have demonstrated that LfD can significantly accelerate the training of DRL algorithms in environments where expert demonstrations are available [14]-[16]. However, although we consider that LfD has potential for improvement, there are environments where the demonstrations of an expert cannot be extracted (totally or partially), such as the traffic simulators used to train new AV control systems (e.g., SUMO [17]). In these simulators, each vehicle has its own internally modeled controller, and it is not possible to extract demonstrations from each vehicle in a straightforward way.
Due to the above limitations, this work proposes a new approach within the field of LfD that uses an IL-trained agent to model the hidden expert controllers. This agent allows us to extract the knowledge of a hidden expert (new experiences) and ask, for each state, what action the hidden expert would have taken. In other words, the IL-trained agent can ask for each state: "What would my expert say I should do in this state?" In this way, in environments where there is no expert (or the expert is hidden) from which to extract demonstrations, we can train an agent that imitates the hidden expert's behavior and can be treated as such to leverage LfD when further training another agent via DRL. We call this new agent an Oracle. To enable the use of demonstrations offered by the Oracle, we propose several modifications to the DRL algorithm used, TD3: i) a modification to the error equation that updates the TD3 actor so that the error produced by the RL action is gradually taken into account; ii) the introduction of several parameters for a smooth and progressive transition between LfD and RL (τ₁ and τ₂); and iii) the use of two replay buffers, one for demonstrations to train the Oracle and the other for TD3, in addition to the use of Prioritized Experience Replay (PER) to accelerate learning. Following the nomenclature used in previous works, we have called this new approach Learning from Oracle Demonstrations (LfOD), and the resulting DRL algorithm TD3fOD. TD3fOD is used to train an AIM algorithm previously proposed for an autonomous vehicle traffic scenario [11]. The results show a notable improvement over not using LfD, speeding up training by 5-6 times and noticeably reducing the variance of the policy obtained. Compared with DDPGfD, the use of the Oracle triples the training speed and considerably reduces the variance of the control behavior during policy training.
Finally, in a test scenario, the AIM algorithm trained with TD3fOD is compared with other autonomous vehicle control algorithms; the results show an improvement in all the studied metrics, reducing the waiting time by about 95%, among other factors.
Thanks to our proposal, it is possible to recover the hidden expert (learned by imitation as the Oracle) in those simulators where demonstrations cannot be extracted directly (or only with great difficulty), and thus take advantage of the benefits offered by LfD (training acceleration and more robust policies) for the development of new complex control algorithms.
The primary motivation of this work lies in the development of a new algorithm to accelerate agent training using DRL and LfD, which can be applied to any problem where there is no agent from which to extract demonstrations. Furthermore, the development of advanced cooperative control systems for AVs and Multi-Agent DRL-based systems could be sped up by the contributions of this work.
Therefore, the main contributions of this work are: (i) proposing a new LfD approach that can be used in environments where there are no experts from which to extract demonstrations, taking advantage of hidden-agent demonstrations; (ii) demonstrating that using an Oracle in LfOD speeds up the training of DRL algorithms, reducing training time and the variance of the trained policies; and (iii) developing a new AIM trained with the proposed algorithm (RAIM with LfOD), capable of improving the performance of AIM algorithms in a cooperative autonomous vehicle control scenario.
The rest of the article is organized as follows. Section II provides the background of DRL, IL, and LfD. A review of previous related work on both AIM and LfD is discussed in Section III. Section IV details our proposal, TD3fOD. The experimental setup, simulation scenarios, and the explanation of the modifications made to TD3 and AIM are included in Section V. Section VI shows the results obtained in both the training and the testing scenarios. Finally, Section VII concludes the paper.

II. BACKGROUND
This section explains DRL, Multi-Agent DRL (MADRL), and the algorithm modified in this work, TD3. In addition, we detail how IL works, and, finally, we explain the basics of LfD.

A. DEEP REINFORCEMENT LEARNING
Reinforcement Learning (RL) is an area of machine learning in which an agent learns to complete a task in an environment where it can take an action and receive a reward for that action. The agent's goal is to find a policy that performs actions maximizing the rewards accumulated during the entire task, known as the expected discounted total reward. The environment where the agent is located is usually modeled by a Markov Decision Process (MDP), because many RL algorithms employ dynamic programming techniques to solve these MDPs. An MDP is defined by the tuple ⟨S, A, P, R, γ⟩, where S represents the set of states of an environment, A represents the set of actions that the agent can take, P is the transition function P: S × A × S → [0,1] that determines the probability of transitioning from any state s ∈ S to any state s′ ∈ S when the action a ∈ A is taken, R is the reward function R: S × A × S → ℝ, and γ ∈ [0,1] represents the discount factor that adjusts the trade-off between immediate and future rewards.
Resolving an MDP generates a policy π: S → A, which maps the states s ∈ S to the actions a ∈ A. An optimal policy π* maximizes the expected discounted total reward for all states. This approach to finding the optimal policy can be formulated through the state-action value function (Q-function): Q^π(s, a) = E_π[ Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s, a_t = a ]. This Q-function determines the expected reward obtained by starting from the state s, taking the action a, and thereafter following the policy π.
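As a toy illustration (our own, not from the paper), the discounted total reward that the Q-function estimates can be computed for a single finite trajectory as follows:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted total reward: sum_k gamma^k * r_k, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three steps of reward 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

The Q-function generalizes this quantity to an expectation over all trajectories that start in state s with action a and then follow π.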
If we focus on DRL, the introduction of Neural Networks (NNs) into traditional RL algorithms has been a great revolution, considerably speeding up learning and allowing these algorithms to be applied to tasks that previously seemed impossible, because NNs can act as function approximators of the policy to be learned.
Within DRL, there are several approaches, but the one that has attracted the most attention in recent years is based on actor-critic methods for continuous control problems. Both the actor (a policy that decides what action to take in each state) and the critic (which, given a state and an action, estimates the expected reward, or Q-value, indicating to the actor whether the action is likely to be good) are modeled by NNs.
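The actor-critic split can be sketched minimally as below; the linear models, random weights, and dimensions are illustrative placeholders, not the networks used in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

class Actor:
    """Deterministic policy: maps a state to a continuous, bounded action."""
    def __init__(self, state_dim, action_dim):
        self.W = rng.normal(size=(action_dim, state_dim)) * 0.1
    def __call__(self, state):
        return np.tanh(self.W @ state)  # actions bounded to [-1, 1]

class Critic:
    """Q-function: maps a (state, action) pair to an expected-reward estimate."""
    def __init__(self, state_dim, action_dim):
        self.w = rng.normal(size=state_dim + action_dim) * 0.1
    def __call__(self, state, action):
        return float(self.w @ np.concatenate([state, action]))

state = np.ones(4)
actor, critic = Actor(4, 2), Critic(4, 2)
action = actor(state)            # the actor proposes an action
q_value = critic(state, action)  # the critic scores that action
```

During training, the critic's score is what drives the gradient updates of the actor.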

B. MULTI-AGENT DEEP REINFORCEMENT LEARNING
MADRL [18] is a subset of RL problems in which multiple agents interact with each other and with their environment, where each agent attempts to learn a policy while all agents learn jointly.
Within MADRL, there are two different learning approaches depending on the problem to be solved: cooperative multi-agent learning [19], in which agents cooperate to maximize the total cumulative reward; and competitive multi-agent learning [20], in which agents compete with each other to obtain the highest possible reward individually (or from the group they belong to).
For the training and execution of these algorithms, different techniques have been developed that take advantage of the benefits offered by collective learning. One of them is centralized training and decentralized execution [21]. Here, training is performed centrally, in an environment to which each agent sends its information, and the control policy of each agent is obtained. Then, each agent obtains its policy and executes it in a decentralized manner. Another approach is centralized training and centralized execution [22]. In this case, agents are trained centrally and then executed by a centralized controller. Last, we have decentralized training and decentralized execution, in which agents are trained in a decentralized manner and execute their policies individually. The benefit of centralized training is better knowledge of the entire environment, so the policy is found more quickly and robustly. However, it is not always possible to centralize knowledge for training due to communication limitations, making decentralized training necessary. In decentralized training, each agent only has local knowledge for training; thus, each agent updates its policy individually while sharing it with the other agents. Finally, decentralized execution means that each agent executes its policy on its own decentralized controller.

C. TD3
TD3 (Twin Delayed Deep Deterministic Policy Gradient) [23] improves its predecessor DDPG through three key mechanisms. Clipped Double-Q Learning: instead of using only one critic network, TD3 adds a second critic and reduces estimation bias by selecting the smallest Q-value of the two critics, encouraging underestimation of Q-values. This underestimation bias is not a problem, since low values do not propagate through the algorithm the way overestimated values do. Thus, it provides a more stable approximation and improves the stability of the whole algorithm.

Target Policy Smoothing: to reduce the overfitting produced by high-variance target values when updating the critics, TD3 adds a small noise to each selected target action. In addition, it performs a double clipping, first on the added noise and then on the noisy action. This feature reduces the variance of the selected actions and results in more stable Q-values.
Delayed Policy Updates: TD3 updates the policy and target networks less frequently than the Q-functions, providing more stable and efficient training. The original paper [23] suggests updating the policy and target networks every two updates of the Q-functions. In addition, the policy network π_θ is updated with a gradient ascent step, simply by maximizing (1).
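A minimal sketch of how these three mechanisms combine when computing a critic target (function names and hyperparameter values below are illustrative, following the description above, not the paper's implementation):

```python
import numpy as np

def td3_target(r, s_next, q1_target, q2_target, pi_target, gamma=0.99,
               sigma=0.2, noise_clip=0.5, act_limit=1.0, seed=0):
    """One TD3 critic target: smoothed target action + clipped double-Q."""
    rng = np.random.default_rng(seed)
    # Target policy smoothing: clip the noise, then clip the noisy action
    noise = np.clip(rng.normal(0.0, sigma, size=1), -noise_clip, noise_clip)
    a_next = np.clip(pi_target(s_next) + noise, -act_limit, act_limit)
    # Clipped double-Q: keep the smaller of the two target-critic estimates
    q_min = min(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * q_min

# With constant critics Q1 = 1 and Q2 = 2, the smaller estimate is used
target = td3_target(0.0, np.zeros(4),
                    lambda s, a: 1.0, lambda s, a: 2.0,
                    lambda s: np.zeros(1))
```

Delayed policy updates then apply the actor's gradient ascent step of (1) only once every two critic updates.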

D. IMITATION LEARNING
IL arises from the need to train an agent more efficiently than through RL alone. With IL, an agent learns a policy by "imitating" an expert who knows what action to take in each state. Through Supervised Learning (SL), IL is more effective than RL when an expert can demonstrate the desired behavior and thus teach a policy [30]. Moreover, the problem of reward shaping is eliminated: there is no need to carefully select the reward received by the agent or to design a hand-coded reward function that changes smoothly in order to achieve a stable and consistent policy. Within IL, there are several alternatives. Behavioral Cloning (BC) is the simplest form of IL, where a policy is trained on expert demonstrations; that is, an expert agent produces pairs of state-action demonstrations. These demonstrations are used in traditional SL to obtain a policy whose behavior clones the expert. BC can work excellently for some applications where the entire state-action space is explored. However, in most cases, BC can be problematic. The main concern is that SL assumes the samples used are i.i.d. (independent and identically distributed), but that assumption cannot be guaranteed due to the nature of the sample-capturing process.
Further, when the trained agent takes control, it can make mistakes in predicting actions that lead to states never seen during the expert-supervised training. In these states, the agent's behavior can lead to hazardous situations known as compounding errors [12], from which the agent may never recover. An example of compounding errors is depicted in Fig. 1. Several alternatives have been proposed to obtain more samples online while testing the trained policies, resolving the problems presented by BC.
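BC reduces to ordinary supervised regression on state-action pairs. A minimal sketch (our own), with a hypothetical linear expert whose action is twice the first state feature:

```python
import numpy as np

def behavioral_cloning(states, expert_actions):
    """Fit a linear policy a = s @ W by least squares on demonstrations."""
    W, *_ = np.linalg.lstsq(states, expert_actions, rcond=None)
    return lambda s: s @ W

rng = np.random.default_rng(1)
S = rng.normal(size=(100, 3))   # demonstration states
A = 2.0 * S[:, :1]              # hypothetical expert: twice the first feature
policy = behavioral_cloning(S, A)
```

Note that this treats the demonstrations as i.i.d. samples, which, as discussed above, does not hold for states gathered along trajectories.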

Direct Policy Learning (DPL) is an enhanced version of BC. It is an iterative process in which expert feedback is collected during the training loop. The process starts by collecting demonstrations from the expert, which serve to train the agent. After the first training, the trained policy is rolled out, and the newly visited states are stored. Then, the expert is asked what actions it would take in those new states, yielding new demonstrations. These new demonstrations (feedback) provide more data to train the agent, again using SL. This loop continues until convergence (see Fig. 2). It is important to store and use all the collected demonstrations so the agent remembers the mistakes it made in the past.
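The DPL loop can be sketched as follows, here on a hypothetical 1-D control task where the expert drives the state toward zero (everything in this snippet is illustrative, not the paper's setup):

```python
import numpy as np

def direct_policy_learning(expert, rollout, train, n_iters=3):
    """DAgger-style DPL: roll out the learner, query the expert on the
    visited states, aggregate ALL demonstrations, and retrain with SL."""
    states = rollout(expert)                 # initial expert demonstrations
    actions = [expert(s) for s in states]
    policy = train(states, actions)
    for _ in range(n_iters):
        visited = rollout(policy)            # roll out the current learner
        states = states + visited            # keep every past demonstration
        actions = actions + [expert(s) for s in visited]  # expert feedback
        policy = train(states, actions)      # supervised re-training
    return policy

expert = lambda s: -0.5 * s                  # hypothetical expert controller

def rollout(policy, n=20):
    s, out = 1.0, []
    for _ in range(n):
        out.append(s)
        s = s + policy(s)                    # trivial 1-D dynamics
    return out

def train(states, actions):
    S, A = np.array(states), np.array(actions)
    k = float((S * A).sum() / (S * S).sum())  # least-squares slope
    return lambda s: k * s

policy = direct_policy_learning(expert, rollout, train)
```

Because the dataset aggregates every round of expert feedback, the learner also covers states it only reaches through its own mistakes.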
Within this group of IL methods, there are several powerful algorithms, most notably SEARN (Search-based Structured Prediction). However, all these methods share a major drawback: they only use SL to obtain behaviors similar to the expert's and do not employ any RL technique to obtain superior behavior. In addition, they need an online expert from which to obtain feedback about the states visited during rollout.

E. LEARNING FROM DEMONSTRATIONS
Learning from Demonstrations (LfD) was introduced to overcome the limitations of DPL. LfD first appeared in DeepMind's Deep Q-Learning from Demonstrations (DQfD) work [14], elegantly unifying IL and RL.
LfD employs an expert's experiences (demonstrations, which may be suboptimal) to pre-train an agent through SL, and then uses RL algorithms to improve the learned policy. However, simply using SL on the demonstration data and then applying RL to the pre-trained policy is not ideal. What LfD does is employ the demonstration data throughout the training process. Thus, the agent finds a policy that eventually surpasses the expert. In summary, LfD allows initializing (pre-training) an agent through expert demonstrations and then using RL to discover a better policy by interacting with the environment, while still using the expert demonstrations so as not to forget the base policy. What differentiates IL from LfD is that the latter only has expert demonstrations, a sequence of (sₜ, aₜ, rₜ, sₜ₊₁) tuples, rather than an online expert to obtain feedback from, as well as a training stage via RL to improve the learned policy.
LfD was presented using the Deep Q-Network (DQN) algorithm. DQN controls agents in discrete action spaces, and the transition from a discrete action space to a continuous one is not trivial. A modification of DDPG that allowed the use of demonstrations for its training was introduced in [15], namely, DDPGfD.
DDPGfD stores both the demonstrations collected from the expert and the experiences collected by the agent in a replay buffer. In addition, it proposes a set of improvements, such as a mix of 1-step and n-step return losses, learning multiple times per environment step, and L2 regularization losses. However, DQfD and DDPGfD share a main drawback: when the agent begins to take control, the internal parameters obtained during pre-training can become misadjusted, which can lead to forgetting everything that was pre-learned.
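The buffer behavior can be sketched as follows (our illustration of the DDPGfD idea): demonstrations are retained for the whole of training, while agent experience is overwritten ring-buffer style once capacity is reached.

```python
import random

class DemoReplayBuffer:
    """Replay buffer that keeps expert demonstrations permanently and
    overwrites only agent experience when full (DDPGfD-style sketch)."""
    def __init__(self, capacity, demos):
        self.demos = list(demos)   # never evicted
        self.agent = []
        self.capacity = capacity
        self.idx = 0
    def add(self, transition):
        if len(self.agent) < self.capacity:
            self.agent.append(transition)
        else:
            self.agent[self.idx] = transition          # ring-buffer overwrite
            self.idx = (self.idx + 1) % self.capacity
    def sample(self, batch_size):
        pool = self.demos + self.agent                 # demos always sampleable
        return random.sample(pool, min(batch_size, len(pool)))

buf = DemoReplayBuffer(2, demos=["d1", "d2"])
for t in ["t1", "t2", "t3"]:   # third add overwrites the oldest agent entry
    buf.add(t)
```

Keeping the demonstrations permanently in the sampling pool is what lets the demonstration data shape every stage of training.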

III. RELATED WORKS
This section summarizes works from the related literature that address AIM and LfD.

A. AUTONOMOUS INTERSECTION MANAGEMENT
AIM has emerged as an alternative way to control AVs at traffic-light-regulated intersections. In [34], Dresner et al. proposed the first AIM, which regulated the crossing of AVs at intersections using a reservation-based method following a "First Come, First Served" (FCFS) policy and eliminating traffic lights. This policy worked as follows: when a vehicle approached the intersection, it requested a reservation for the space-time it needed to cross the intersection. If the reservation did not conflict with another vehicle's reservation, the intersection accepted it, and the vehicle followed the route it had requested. Otherwise, the vehicle received a reservation denial and slowed down to request another reservation later, in search of available space-time slots.
The first results obtained showed that FCFS could outperform traffic light control in terms of flow and delay. Later works [35], [36] proposed alternative control protocols that included non-autonomous vehicles (FCFS-LIGHT) and emergency vehicles (FCFS-EMERG). The authors also proposed a mechanism to switch among policies (FCFS, FCFS-LIGHT, and FCFS-EMERG) depending on intersection conditions, improving performance by using the policy that best suited each situation [37]. Their results outperformed traditional traffic light control.
A more detailed study of the FCFS protocol was presented in [38], where it was tested against an optimized traditional traffic light. The results showed that FCFS reduced the delay with respect to the traditional traffic light by more than 90%. An improvement of FCFS, also based on reservations, was proposed by Huang et al. [39]. They suggested that when the intersection sent the denial of a reservation, it also send a recommended deceleration speed so that the vehicle reached the stop line as it stopped. Furthermore, this algorithm separated the vehicles into three groups according to their current and past status. The proposed algorithm was compared to a roundabout and a traffic light, but not to the original FCFS. The results showed an 85% reduction in delay and a 50% reduction in fuel consumption.
Because FCFS does not consider any mechanism for grouping (batching) requests that have the same direction, several enhancements were proposed [37], [40] in which request batching was used to improve the flow of the intersection, either by gathering more requests to make smarter decisions or by allowing vehicles in the same flow to pass in batches. The results showed an improvement over both FCFS and traffic light control, doubling the flow and reducing the delay by 85%.
Other approaches to AV control use mathematical optimization to allocate the right-of-way [41]-[45]. The results achieved by these proposals were similar to those of the previously discussed algorithms. However, the resolution complexity of these algorithms increases significantly as the number of vehicles grows. Consequently, they face a sizeable computational complexity problem, making them unfeasible for the real-time control AIM requires. Within this approach, we can find the work developed by Wu et al. [46]. The authors allowed all movements in all lanes and developed two modules, one in charge of deciding the time instant at which the vehicle should enter the intersection and another in charge of deciding in which entry lane the vehicle should be placed and which exit lane it should take. This work formulated the problem as Mixed-Integer Linear Programming (MILP) to solve the proposed set of equations and constraints. The results showed the potential of AIMs; however, due to the approach followed, many problems remain open. A heuristic approach is followed in [47] to resolve spatiotemporal conflicts among AVs, modeling the conflict points within the intersection as points of interest. The SUMO tool was used to simulate the behavior of the vehicles. The results showed that the proposed system offered a shorter vehicle waiting time than other IM schemes and a traffic-light-based system. Additionally, there are other novel algorithms motivated by different fields, such as those inspired by auctions [48], those that use ant colony-based optimization [49], or those that use Monte Carlo Tree Search to obtain the priority order to be assigned to vehicles [50].
Although the previous works showed promising results, Levin et al. [51] demonstrated that further study of the proposed algorithms is necessary since, under certain situations, FCFS may present inappropriate behaviors that can lead to undesirable results. The control policies require a detailed and in-depth study before they become operational in real control systems. Furthermore, as can be seen, all the proposed algorithms are based on simple approaches, which neither analyze the past or future behavior of the intersection nor consider the consequences of actions taken in future states. One promising way to address these limitations is the use of RL for vehicle control.
If we focus on RL, very few works have applied this technique to AIM, although it has been widely studied for traffic light control [52]-[58]. Particularly interesting is the work proposed in [59], which calculates a priority order to be assigned to each vehicle using Multi-Agent RL (MARL). The results, compared with FCFS and with a variant proposed by the same authors, namely Longest-Queue-First (LQF), showed that MARL could obtain a sequence of decisions that reduced the delay by more than 60%.
An approach based on DRL can be found in [60]. In this case, the authors employed an ego-centric policy trained with reinforcement and attention learning mechanisms to develop the intersection control system. The results showed that the policy outperformed other control systems under different traffic conditions. However, this approach leaves the control to each vehicle individually and therefore cannot exploit all the advantages that AIMs can offer, i.e., the benefits of centralizing the knowledge of all AVs in a single agent. Another work applying DRL is found in [61]. In this case, the proposal models different types of AVs with different behaviors and, through a game model based on cognitive hierarchy, allows the AVs to adapt to the reactions of the other AVs. Although the results shown are promising, the performance of the proposed solution needs to be studied in more complex environments, with more lanes and higher vehicular flow.
In our opinion, it makes perfect sense to use RL to control AVs at intersections. Using RL, the system can learn and acquire in-depth knowledge of AV control through trial-and-error. Additionally, we expect that RL will provide a safer and faster solution that helps overcome the limitations of existing AIM algorithms.

B. LEARNING FROM DEMONSTRATIONS
RL allows solving complex problems and can provide advanced control policies. Although there are many techniques for agent optimization, one has generated significant interest in recent years: Learning from Demonstrations (LfD). This technique allows pre-training a policy quickly using the demonstrations of an expert and later applying RL to find another policy that improves on the expert's, as described in [14]. In that work, a DQN algorithm was adapted to incorporate expert demonstrations. The results showed a notable training speed-up, allowing the tasks of Atari games to be fulfilled much earlier and finding a better policy than the one offered by the human demonstrations.
DDPG from Demonstrations (DDPGfD) was proposed in [15] to control agents in continuous action spaces while incorporating demonstrations. The demonstrations and the actions taken by the agent were stored in a replay buffer for an unlimited time. The results showed the benefits of learning from demonstrations: the obtained policy performs the tasks more efficiently than the demonstrations and solves them in 2-4 times fewer steps than DDPG.
Another approach built on DDPG was presented by Nair et al. [ ]. However, due to the inherent design of the traffic simulators used in this work, there is no trained expert agent from which to obtain demonstrations. For this reason, we decided to investigate the opportunity of training an agent that we can query (the Oracle) while training a controller (through TD3) to control AVs. As a result, we propose in this paper a new method called TD3 from Oracle Demonstrations (TD3fOD).

IV. TD3 FROM ORACLE DEMONSTRATIONS -TD3FOD
Our method combines TD3 with demonstrations extracted from an expert (the Oracle). The Oracle is trained by BC so that demonstrations can be obtained continuously, optimizing the extracted knowledge and improving and speeding up the learning process. Below, we describe our method and evaluate these insights in our experiments.
The novelty of our algorithm is an Oracle from which to obtain new demonstrations. This Oracle is trained by BC on the collected experiences of the expert and modifies the parameters of π_θ (the TD3 actor) using soft_update (a soft copy of the parameters). This soft_update is inspired by the one used by Mnih et al. [65]. In this case, the weights of the π_θ network (θ) are updated as depicted in (2).
By employing soft_update, we force the actor to learn more slowly than the Oracle, increasing the stability of the training. To adjust the importance that the Oracle has on the actor, the parameter τ₁ decreases smoothly along the simulations following (3).
The h parameter adjusts the smoothness and the number of simulations from which learning through RL has more importance than the learning done via the Oracle's soft_update.
As can be seen in (3), at the beginning of the training the parameters of π_θ will be very similar to the Oracle's (τ₁ being practically 1). However, the importance of the Oracle in the updates of π_θ is reduced progressively until the number of simulations n >> h, where τ₁ is practically 0 and cancels the first term of (3), canceling the changes in π_θ due to the Oracle soft_update. This evolution can be verified in Fig. 3.
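The schedule and the soft copy can be sketched as follows; the sigmoid-like decay below is an assumed stand-in for the exact expression in (3), with h playing the smoothness/threshold role described above:

```python
import math

def tau1(n, h):
    """Assumed sigmoid-like decay of the Oracle's influence with the
    simulation index n: ~1 while n << h, ~0 once n >> h."""
    return 1.0 / (1.0 + math.exp((n - h) / (0.1 * h)))

def soft_update(actor_w, oracle_w, t1):
    """Eq. (2)-style soft copy: theta <- t1 * theta_oracle + (1 - t1) * theta."""
    return [t1 * ow + (1.0 - t1) * aw for aw, ow in zip(actor_w, oracle_w)]

# Early in training the actor tracks the Oracle; late in training it does not
early = soft_update([0.0, 0.0], [1.0, 1.0], tau1(0, 1000))
late = soft_update([0.0, 0.0], [1.0, 1.0], tau1(10000, 1000))
```

The late-stage update leaves the actor's weights essentially untouched, which is what hands control of learning over to RL.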
In addition, our algorithm improves TD3 in several aspects: 1) A modification of the error equation so that the importance of the error produced by RL actions increases progressively. More specifically, we modified (1) (which was used to update π_θ); the new form is shown in (4), where the factor τ₂ controls the relevance of Q₁ (the Q-values of critic Q1) in the update of π_θ. τ₂ is defined in (5). The evolution of both τ₁ and τ₂ throughout the simulations can be seen in Fig. 3.
2) Incorporation of two replay buffers, one for Imitation (Oracle training) and one for RL. Moreover, these buffers use PER to speed up training by using experiences from which each network can learn more information. While PER is a technique used for RL, the original PER paper [66] suggests that it can also be employed in supervised learning.
3) The Oracle has been added to provide an expert agent from which to obtain new demonstrations, since the simulator does not offer that feature. The Oracle is trained through BC on the experiences extracted from the simulator. These experiences are stored in the Imitation replay buffer (with a fixed size); when the buffer is full, new experiences replace the older ones. 4) Finally, we add an exponential increase factor that allows π_θ to take control of the vehicles spontaneously and incrementally; that is, at each timestep there is a probability that π_θ, instead of the simulator (the expert), carries out the control of the vehicles. This probability increases smoothly over time until a simulation is reached in which π_θ always controls all vehicles. This operation offers more stability at the beginning of the training and a gradual, smooth transition from BC to RL. Furthermore, because there is a small probability that the action is taken by π_θ and not by the expert (which can be considered a kind of "sticky actions" or action noise), the proposed procedure allows the Oracle to explore a large set of states at the start of the training, with all the benefits that this offers, and reduces compounding errors. The complete TD3fOD algorithm is broken down into Algorithm 1, Algorithm 2, and Algorithm 3.
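Item 4 can be sketched as a probabilistic handover; the ramp shape below is an assumption for illustration (the text only specifies that the probability grows smoothly until π_θ always controls the vehicles):

```python
import math
import random

def control_probability(n, n_total, k=5.0):
    """Assumed exponential ramp from 0 to 1 over n_total simulations:
    probability that pi_theta, not the expert, controls the vehicles."""
    return min(1.0, (math.exp(k * n / n_total) - 1.0) / (math.exp(k) - 1.0))

def who_controls(n, n_total, draw=random.random):
    """Per-timestep coin flip between the learner (pi_theta) and the expert."""
    return "learner" if draw() < control_probability(n, n_total) else "expert"
```

At n = 0 the expert always drives; by n = n_total the learner always does, giving the gradual BC-to-RL transition described above.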

V. EXPERIMENTAL SETUP
Our algorithm focuses on simulators or environments where there is no expert to ask or obtain feedback about past experiences (although it is possible to obtain new demonstrations).
However, because the controller is internally modeled and cannot be accessed due to the nature of the simulator (lack of an API and/or closed-source software licensing), our algorithm exploits new demonstrations to build an expert to query (the Oracle) and then trains a new control system by RL via LfD. By taking advantage of the benefits offered by LfD, training is greatly accelerated and improved. A simulator with these characteristics is SUMO [17], a microscopic simulator in which each vehicle is explicitly simulated; it is widely used by the scientific community and by urban planners to obtain better traffic controllers or optimize existing ones. For this reason, we decided to use this simulator for this study. TD3fOD was programmed in Python 3.7 and PyTorch 1.5.0. A 16-core CPU was used, together with an Nvidia 2080 Ti GPU.

A. RAIM OVER TD3FOD (RAIMFOD)
TD3fOD was used to train RAIM [11], an algorithm developed for AIM systems. RAIM controls the speed of AVs in the surroundings of intersections, notably increasing flow and safety and significantly reducing waiting time, pollutant emissions, and both fuel and electricity consumption. RAIM leverages the advantages of MADRL to find a policy that controls vehicles intelligently, collectively, and collaboratively. RAIM belongs to MADRL's Centralized Training and Centralized Execution cooperative approach, in which vehicles send their states to the AIM, and the AIM itself computes the action for each AV (centralized training and execution in the AIM) toward a common goal [19].
In the original article, RAIM was trained through curriculum-based learning, increasing the simulated vehicle flow once some stability in the results was achieved. That solution optimized the system but required a large number of simulations and considerable training time. By using TD3fOD, we aim to accelerate learning and reduce the number of simulations without resorting to curriculum-based learning. The new approach is called RAIM from Oracle Demonstrations (RAIMfOD). The network architecture, for both the actor and the critics, consisted of four fully connected layers (448, 128, 50, and 1 neurons, respectively). The network input comprised the characteristics of the vehicles to be controlled (e.g., position, speed, route, lane), and the output indicated the speed at which each vehicle should drive during the next time interval. These features, as well as the inner workings of RAIM, are described in detail in the original RAIM article [67].
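The actor's forward pass with the layer widths reported above can be sketched in a few lines. Only the widths (448, 128, 50, 1) come from the text; the input dimension, activations, weight initialization, and output scaling below are our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM = 20                      # hypothetical per-vehicle feature count
LAYERS = [STATE_DIM, 448, 128, 50, 1]  # widths from the paper, input dim assumed

# He-style random initialization (our choice, not the paper's).
weights = [rng.standard_normal((m, n)) * np.sqrt(2.0 / m)
           for m, n in zip(LAYERS[:-1], LAYERS[1:])]
biases = [np.zeros(n) for n in LAYERS[1:]]

def actor_forward(state, v_max=15.0):
    """Map a vehicle's state vector to a target speed in [0, v_max] m/s."""
    h = state
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ w + b, 0.0)           # ReLU hidden layers
    out = h @ weights[-1] + biases[-1]
    return v_max * (1.0 / (1.0 + np.exp(-out)))  # squash to a valid speed
```

A sigmoid-scaled output is one common way to keep the commanded speed within physical limits; the original implementation may use a different bounding scheme.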

B. TRAINING SCENARIO
The training scenario was used to optimize TD3fOD/RAIMfOD. It consisted of an intersection with four branches and three lanes per branch, allowing left turns, right turns, and straight movements, one movement per lane. A representation of the simulated intersection can be seen in Fig. 4.
As a reward signal, the following sparse reward was designed. Each agent (vehicle) received, at each timestep: +10 (strong positive reward) when it crossed the intersection, -10 (strong negative reward) when it collided with another vehicle, and -timestep otherwise, to promote crossing the intersection as quickly as possible. Table I includes the values of the hyperparameters. These hyperparameters and the reward values were selected empirically, offering notable performance and a stable training process.
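The sparse reward above is simple enough to transcribe directly; the function signature and flag names below are ours.

```python
# Per-vehicle, per-timestep reward as described in the text:
# +10 on crossing, -10 on collision, -timestep otherwise.

def vehicle_reward(crossed: bool, collided: bool, timestep: int) -> float:
    if collided:
        return -10.0         # strong negative reward on collision
    if crossed:
        return +10.0         # strong positive reward for crossing
    return -float(timestep)  # growing penalty encourages crossing quickly
```

Note that the "otherwise" penalty grows with the timestep, so lingering near the intersection becomes increasingly costly.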

C. TESTING SCENARIO
A test scenario was incorporated to test the ability of our algorithm to face never-before-seen situations. For this purpose, a scenario with a traffic distribution presenting multiple variations was proposed, with low flows (500 veh/h), medium flows (1000 veh/h), and high flows (2000 veh/h). In addition, it presented both symmetric and asymmetric traffic with respect to the branches of origin, North/South (N/S) and West/East (W/E). The intersection was the same as in the previous scenario, with three lanes, where left, right, and straight movements were allowed. The time distribution of the simulated flow can be seen in Fig. 5. The following key performance metrics were used to compare the different algorithms: travel time, waiting time, time loss due to congestion, and pollution and consumption metrics (CO, CO2, HC, PMx, NOx, and fuel and electricity consumption). The vehicle distribution used was 35% diesel cars, 35% gasoline cars, and 30% zero-emission electric cars.

VI. RESULTS
This section shows the results obtained in both the training and test scenarios. Time loss measures the delay accumulated by driving below the ideal (free-flow) speed and is defined as total_duration * (1 - speed/ideal_speed). The results show a significant improvement in several aspects.
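The time-loss definition above is a one-liner; the function name and argument names below are ours.

```python
# Time loss: the fraction of the ideal (free-flow) speed that was lost,
# accumulated over the trip duration.

def time_loss(total_duration: float, speed: float, ideal_speed: float) -> float:
    return total_duration * (1.0 - speed / ideal_speed)
```

For example, a 100-second trip driven at half the free-flow speed accumulates 50 seconds of time loss, while a trip at exactly the ideal speed loses nothing.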

A. TRAINING SCENARIO
The main improvement is the increase in training speed: TD3fOD reduces the number of simulations needed by a factor of 5 to 6 while achieving even better performance than the original TD3-trained RAIM, with reduced variance and a much more stable and robust policy. Compared to DDPGfD, the use of an Oracle from which to extract feedback, together with the improvements proposed for TD3, yields a threefold speedup and considerably reduces the variance in the control behavior of the trained RAIM policy. Fig. 6b shows that RAIM with TD3fOD (RAIMfOD) is able to reduce time loss to below 20 seconds within 200 simulations, whereas the original RAIM must train for more than 1500 simulations to obtain a policy that brings time loss down to 20 seconds. Fig. 6b also depicts the reward metric: TD3fOD reduces the number of simulations by more than a factor of 6, accelerating the training of new advanced control systems. Finally, comparing the training results with those obtained with traditional control techniques shows that the metrics are much better with RAIM and RAIMfOD.
In Fig. 6a and Fig. 6b, three distinct phases of RAIMfOD can be observed. Pre-training runs from simulation 0 to 100, filling the Imitation and RL replay buffers. From simulation 100 to 250, pre-training ends and RAIMfOD proper starts, visible as a shift in the metrics' trend around iteration 100. In this range, the transition begins between learning by "soft-copying" the Oracle and the RL of TD3, with the simulator still taking most actions and the TD3 actor acting as a "sticky action." From simulation 250 onwards, the τ1 and τ2 curves cross, and most actions are carried out by the TD3 actor, allowing a better control policy to be found and the results to be optimized further. This highlights the notable performance of LfOD: through a smoothed step from a pre-trained policy provided by the Oracle to the policy learned by RL, the final policy outperforms the one offered by the expert.

B. TESTING SCENARIO
We demonstrate the algorithm's ability to generalize and adapt to new situations in the test scenario, illustrating the benefits offered by MADRL and LfOD. The results obtained are shown in Table II. RAIM with TD3fOD and the original RAIM obtain very similar results, demonstrating that both algorithms find solutions with very notable performance; however, RAIMfOD finds its policy much earlier and with much lower variance thanks to LfOD, as can be seen in Fig. 6. From the results included in Table II, we confirm a reduction in time loss of between 86% and 95% and a reduction in waiting time of between 93% and 97%, which translates into a reduction in travel time of between 53% and 72%. Regarding pollutant gas emissions, a significant improvement is achieved in all metrics, with all studied variables decreasing by up to over 50%. Finally, in terms of fuel and electricity consumption, a reduction of between 5% and 29% is achieved for combustion vehicles, and between 24% and 34% for electric vehicles. These results confirm the potential of LfOD and MADRL systems for the centralized control of AVs.

C. DISCUSSION
The results obtained in this work demonstrate the benefits of using MADRL with TD3fOD to solve real-world problems. This solution has the potential to operate in real time, decreasing the time needed to train new systems and improving the performance of existing ones. TD3fOD finds a policy that significantly improves on the results obtained by traditional training techniques, resulting in more stable and robust policies.
When the performance of RAIM with TD3fOD (RAIMfOD) is compared to that of the original RAIM, learning is between 5 and 6 times faster and the variance of the results is substantially smaller. Furthermore, compared to RAIM with DDPGfD, the advantages offered by TD3 and the use of the Oracle reduce the number of simulations by up to a factor of 3, while also significantly reducing the variance of the policy during training. Finally, in the test scenario, the analyzed metrics outperform the original RAIM, showing the robustness of the proposed algorithm in never-before-seen scenarios: an improvement of between 53% and 72% in travel time and a reduction in waiting time of between 93% and 97%. Moreover, in most cases pollutant gas emissions are reduced by more than 50% and energy or fuel consumption by nearly 30%.

VII. CONCLUSIONS
The success of AVs depends on advances in the various components of driving and control systems, as well as on understanding and handling the unpredictable situations that can arise in complex driving environments. The application of MADRL allows the development of dynamic systems capable of adapting to many conditions and acting collectively and proactively, anticipating dangerous situations and, ultimately, preventing accidents and increasing flow. LfD can provide a simple way to find adaptive control policies capable of solving highly complex tasks in different fields of work, such as teaching robots how to walk, navigating dangerous traffic, handling obstacles, avoiding collisions with other road users, and performing safe and efficient maneuvers at intersections.
To enable the use of expert demonstrations in environments where no expert is accessible, we have proposed in this work the use of an Oracle in LfD, yielding LfOD.
The Oracle is trained by Imitation Learning and can then be used to teach an agent trained by RL from its demonstrations. This original approach facilitates using LfD in environments with no expert from whom to obtain feedback: an agent can be trained much more quickly, achieving a better policy than the expert's and presenting lower variance in the results. TD3 was the algorithm modified to use the demonstrations offered by the Oracle; following the nomenclature used in LfD algorithms, we call the new proposal TD3 from Oracle Demonstrations (TD3fOD). The modifications made to TD3 were: i) incorporation of an Oracle trained by Imitation Learning from the states extracted from the simulator; ii) several parameters enabling a smooth and progressive transition between LfOD and RL; and iii) the use of two replay buffers, one for the demonstrations that train the Oracle and one for RL, together with PER to speed up learning. TD3fOD was applied in the SUMO traffic simulator to speed up the learning of an AIM system. The only AIM to date that used RL was RAIM, so this algorithm was trained on top of TD3fOD (RAIM over TD3fOD). The results obtained in the training scenario demonstrated that TD3fOD achieved more efficient learning than TD3, finding a control policy faster and speeding up training by 5-6 times. In addition, the policy found offered significantly lower variance, providing more robust results, and outperformed the policy shown by the expert. TD3fOD also achieved good results in the testing scenario, improving on RAIM in all studied metrics and with lower variance. These results highlight the benefits offered by LfOD.
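Modification iii) relies on prioritized experience replay. A minimal proportional-PER sketch is shown below; the class and parameter names are ours, following the standard PER formulation rather than the paper's code.

```python
import random

class PrioritizedBuffer:
    """Proportional prioritized replay: transitions with larger TD error
    are sampled more often, which speeds up learning on informative data."""

    def __init__(self, alpha=0.6):
        self.alpha = alpha               # 0 = uniform sampling, 1 = fully greedy
        self.items, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        # Small epsilon keeps every transition sampleable.
        self.items.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, k, rng=random):
        # Draw k transitions with probability proportional to priority.
        return rng.choices(self.items, weights=self.priorities, k=k)
```

In TD3fOD this buffer would hold the RL transitions, while the separate Imitation buffer feeds the Oracle's BC training; a full implementation would also apply importance-sampling weights to correct the sampling bias, which this sketch omits.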
RAIM over TD3fOD reduces waiting time by between 93% and 97%, which also translates into a reduction of up to 50% in the emission of contaminating gases compared to traditional vehicle control techniques such as traffic lights and advanced techniques such as iREDVD. In terms of consumption, combustion vehicles reduce their fuel consumption by up to 29% and electric vehicles their electricity consumption by up to 34%. Furthermore, comparing TD3fOD with another LfD algorithm, DDPGfD, shows that the proposed LfOD approach speeds up training, reducing the number of interactions with the simulator by up to a factor of 3.
Thanks to our proposal, it is possible to extract the controller hidden inside simulators where direct access is impossible (or too complicated), learning it by imitation into an Oracle, and thus to exploit the benefits offered by LfD (training acceleration and more robust policies) for the development of new complex control algorithms. The proposed LfOD approach is applicable to different RL algorithms. The main contributions of this work are the development of an expert agent (Oracle) for environments where none exists, so that LfD can be leveraged, and the incorporation of this approach into the TD3 DRL algorithm, with substantial modifications to TD3 to adapt the training process to the presence of the Oracle.
As future work, we plan to include multiple hierarchical systems that enable level-based control of the different actors in a complete network of traffic intersections and explore the development of new algorithms that can use LfOD in other domains.