An End-to-End Curriculum Learning Approach for Autonomous Driving Scenarios

In this work, we combine Curriculum Learning with Deep Reinforcement Learning to learn, without any prior domain knowledge, an end-to-end competitive driving policy for the CARLA autonomous driving simulator. To our knowledge, we are the first to provide consistent results of our driving policy on all towns available in CARLA. Our approach divides the reinforcement learning phase into multiple stages of increasing difficulty, such that our agent is guided towards learning an increasingly better driving policy. The agent architecture comprises various neural networks that complement the main convolutional backbone, represented by a ShuffleNet V2. Further contributions are (i) a novel value decomposition scheme for learning the value function in a stable way, and (ii) an ad-hoc function for normalizing the growth in size of the gradients. We show both quantitative and qualitative results of the learned driving policy.

and level 5 (full automation) by the SAE J3016 standard [1]. High-to-full autonomous vehicles must master the tasks known as perception, planning, and control [2], [3]. Perception refers to the ability of an autonomous system to collect information and extract relevant knowledge from the environment. To do so, an autonomous vehicle needs to understand the driving scenario (environmental perception), to compute its pose and motion (localization), and to determine which portions of the driving space are occupied by other objects (occupancy grids). Planning relies on the output of the perception component to devise an obstacle-free route that the vehicle has to follow in order to avoid any collision while reaching its intended destination. The planned route is made of high-level commands that do not tell the vehicle's software how to actually implement them in terms of torques and forces. Finally, motion control accounts for this, converting high-level commands into low-level actions, namely the specific torque and force values to be applied to the vehicle's actuators to make it move and steer properly. For this purpose, both level 4 and level 5 autonomous vehicles are equipped with a variety of exteroceptive sensors, like cameras, LiDAR, RADAR, and ultrasonic sensors, to perceive the external environment including dynamic and static objects, and proprioceptive sensors, like IMUs, tachometers, and altimeters, for internal vehicle state monitoring [4]. Moreover, high sensor redundancy along with sensor fusion are often necessary to achieve improved performance and high robustness, especially in degraded driving and weather conditions. The tasks of perception, planning, and control can be solved in isolation or jointly. The isolated approach corresponds to a modular pipeline in which each module is separate and performs a specific task [4].
The resulting system suffers from error propagation: the modules are designed by humans and therefore potentially imperfect, and every small error propagates through the system, compounding with the errors of other modules. In short, the isolated approach is neither optimal nor reliable. These weaknesses motivate the choice of the end-to-end driving paradigm. With end-to-end driving, the perception, planning, and control tasks are solved jointly and are not represented explicitly. These systems have a more functional design and are easier to develop and maintain.
In general, we can distinguish various categories of system architectures for autonomous vehicle design, which may or may not account for connectivity among vehicles [4]:
• Ego-only systems (or standalone vehicles) do not share information with other autonomous vehicles. A standalone vehicle uses only its own knowledge to devise driving decisions. The lack of connectivity makes this category of AVs simpler to design compared to vehicles that are connected together.
• Connected systems are able to distribute the basic operations of automated driving among other autonomous vehicles, thus forming a connected multi-agent system. In this way, vehicles can share detailed driving information and use such additional information to make better decisions. Communication among vehicles requires a specific infrastructure and communication protocols, besides being able to efficiently transmit and store large amounts of data.
• Modular systems are structured as a pipeline of separate components (as discussed previously), each of them solving a specific task. The main advantage is that the complex problem of autonomous driving can be decomposed into a set of smaller and easier-to-solve problems.
• End-to-end driving generates ego-motion directly from (raw) sensory inputs (e.g. RGB camera images), without the need to design any intermediate module. Ego-motion can be either the continuous operation of the steering wheel and pedals (i.e. acceleration and braking) or a discrete set of actions. End-to-end driving is simple to implement, but often leads to less interpretable systems.
Imitation Learning [5], [6] is the preferred approach for end-to-end driving, given its design simplicity and optimization stability, despite requiring a considerable amount of expert data for learning a competitive policy. Deep Reinforcement Learning (RL) is gaining interest for its encouraging results in the field [7], [8], without requiring the collection of expert trajectories: just a real or simulated environment (e.g. CARLA [9], or AirSim [10]) is needed instead.
Moreover, RL can potentially discover better-than-expert behavior, since it maximizes the agent's performance with respect to a designed reward function.
In this paper, we provide the following contributions:
• We combine the Proximal Policy Optimization (PPO) [11] algorithm with Curriculum Learning [12], showing how to learn an end-to-end urban driving policy for the CARLA driving simulator [9].
• We evaluate our curriculum-based agent on various metrics, towns, weather conditions, and traffic scenarios. To our knowledge, we are the first to demonstrate consistent results on all towns provided by CARLA, by training the agent on only one town.
• Moreover, we point out two important sources of instability in reinforcement learning algorithms: learning the value function V(s), and normalizing the estimated advantage function A(s, a).
• Finally, we provide two novel techniques to solve these issues. The two methods can be applied to any value-based RL algorithm, as well as to actor-critic algorithms. More notably, the technique we use to learn the value function is general enough to be employed in almost any ML regression problem.
The paper is organized as follows: Section II defines and describes the related work on the topic, categorizing it into (i) Autonomous Driving approaches based on Deep Learning techniques, (ii) Reinforcement Learning for Autonomous Driving, and (iii) Autonomous Driving Simulators. Section III introduces several formalisms and definitions to ease the reader's understanding of the background context of the paper. Section IV presents the proposed approach. Section V shows the results obtained on the CARLA towns. Finally, Section VI concludes the paper.

II. RELATED WORK
A. Deep Learning-Based Autonomous Driving
Deep learning-based end-to-end driving systems aim to achieve human-like driving simply by learning a mapping function from inputs to output targets, thus being able to imitate human experts. These inputs are often (monocular) camera images, while the targets can be quantities like the steering angle, the vehicle's speed, the route-following direction, throttle and braking values, or even high-level commands.
Reference [13] trained a convolutional neural network to map raw pixels from a single front-facing camera directly to steering commands. The authors managed to drive in traffic on local roads, on highways, and even in areas with unclear visual guidance. To correct the vehicle drifting from the ground-truth trajectory, the authors employed two additional cameras to record left and right shifts. The authors evaluated their system by measuring the autonomy metric, finding it to be autonomous 98% of the time. To mitigate this shifting problem, [14] developed a sensor setup that provides a 360-degree view of the area surrounding the vehicle by using eight cameras. Their driving model uses multiple Convolutional Neural Networks (CNNs) as feature encoders, four Long Short-Term Memory (LSTM) recurrent networks [15] as temporal encoders, and a fully-connected network to incorporate map information. Their system is trained to minimize the mean squared error (MSE) on speed and steering angle.
Reference [5] proposes to condition the imitation learning procedure on a high-level routing command (i.e. a one-hot encoded vector), such that trained policies can be controlled at test time by a passenger or by a topological planner. The authors evaluated the approach in a simulated urban environment provided by the CARLA driving simulator [9] and on a physical system: a 1/5-scale truck. For goal-based navigation, they recorded a success rate of 88% in Town 1 (the training scenario), and of 64% in Town 2 (the testing scenario): two of the simplest towns available.
End-to-end behavioral cloning is appealing for its simplicity and scalability, but it has limitations [6], such as: dataset bias and overfitting when data is not diverse enough, generalization issues towards dynamic objects seen during training, and domain shift between the off-line training experience and the on-line behavior. Despite these limitations, behavioral cloning can still achieve state-of-the-art results, as demonstrated by [6]. In fact, the authors proposed a ResNet-based [16] architecture with a speed prediction branch. According to them, in the presence of large amounts of data a deep model can reduce both bias and variance over the data, while also generalizing better when learning reactions to dynamic objects and traffic lights in complex urban environments. The authors also proposed a novel CARLA driving benchmark, called NoCrash, in which the ability of the ego vehicle is tested on three urban scenarios with different weather conditions: empty town with no dynamic objects, regular traffic with a moderate amount of cars and pedestrians, and dense traffic with a large number of vehicles and pedestrians.
Reference [17] proposed the first direct perception method, an emerging paradigm that combines both end-to-end learning and control algorithms, named Conditional Affordance Learning (CAL), to handle traffic lights and speed signs by using image-level labels, as well as smooth car-following, resulting in a significant reduction of traffic accidents in simulation. Their CAL agent consists of a neural network that predicts six types of affordances from the input observation, and a lateral and longitudinal controller which predicts the throttle, brake, and steering values.
Reference [18] proposed the first interpretable neural motion planner for learning to drive autonomously in complex urban scenarios that include traffic-light handling, yielding, and interactions with multiple road-users. Their model employs a convolutional backbone to predict the bounding boxes of other actors, as well as a space-time cost volume for planning. The input representation consists of Lidar point clouds coupled with annotated HD maps of the road. The space-time cost volume represents the goodness of each location that the self-driving car can take within a planning horizon. Their model is trained end-to-end with a multi-task objective: the planning loss encourages the minimum cost plan to be similar to the trajectory performed by human demonstrators, and the perception loss encourages the intermediate representations to produce accurate 3D detection and motion forecasting. According to the authors, the combination of these two losses ensures the interpretability of the intermediate representations.

B. Reinforcement Learning-Based Autonomous Driving
Reference [8] demonstrated the first application of deep reinforcement learning to autonomous driving. Their model is able to learn a policy for lane following in a handful of training episodes using a single monocular image as input. The authors used the Deep Deterministic Policy Gradient (DDPG) algorithm [19] with prioritized experience replay [20], with all exploration and optimization performed on-vehicle. Their state space consists of monocular camera images compressed by a learned Variational Auto-Encoder (VAE) [21], together with the observed vehicle speed and steering angle. The authors defined a two-dimensional continuous action space: steering angle and speed set-point. They utilized a 250-meter section of road for real-world driving experiments. Their best-performing model is capable of solving a simple lane-following task in half an hour.
Reference [7] proposes Controllable Imitative Reinforcement Learning (CIRL) to learn a driving policy based only on vision inputs from the CARLA simulator [9]. CIRL adopts a two-stage learning procedure: a first imitation stage pretrains the actor's network on ground-truth actions recorded from human driving videos, and the subsequent reinforcement learning stage employs DDPG [19] to improve the driving policy. According to the authors, the first imitation stage is necessary to prevent DDPG from falling into local optima due to poor exploration. CIRL uses a four-branch network with a speed prediction branch, similar to [6]. The authors conducted experiments on the CARLA benchmark, showing that CIRL's performance is comparable to the best imitation learning methods, such as CIL [5], CAL [17], and CIRLS [6].
Training a competitive driving policy from high-dimensional observations is often too difficult or expensive for RL. Reference [22] proposes to visually encode the perception and routing information the agent receives into a bird's-eye view image, which is further compressed by a VAE [21]. To reduce training complexity, the authors employed the frame-skip trick, in which each action taken by the ego vehicle is repeated for the subsequent k = 4 frames. The authors evaluated their approach on CARLA [9], specifically on a challenging roundabout scenario in Town 3. They compared three RL algorithms: Double DQN [23], TD3 [24], and SAC [25]. The latter achieved the best performance.
Reference [26] proposed a multi-objective DQN agent, motivated by the fact that a multi-objective approach can help overcome the difficulty of designing a scalar reward that properly weighs each performance criterion. Furthermore, the authors suggest that when each aspect is learned separately, it is possible to choose which aspect to explore in a given state. In particular, they learned a separate agent for each objective which, collectively, form a combined policy that takes all these objectives into account. The authors trained the agent on two four-way intersecting roads with random surrounding traffic provided by the SUMO traffic simulator [27], demonstrating a very low infraction rate.

C. Autonomous Driving Simulators
Autonomous driving research requires a considerable amount of diversified data, collected on a variety of driving scenarios and under different weather conditions as well. Collecting such an amount of data in the real world is difficult, time-consuming, and costly. Moreover, driving datasets often focus only on specific aspects of the driving task, and are collected with specific sensor modalities (e.g. RGB cameras vs LiDAR sensors).
An increasingly popular alternative to real-world data are autonomous driving simulators. Modern driving simulators like CARLA (Car Learning to Act) [9] and AirSim [10] provide realistic 3D graphics and physics simulation, traffic management, weather conditions, a variety of sensors, pedestrian management, different vehicles, and various driving scenarios as well. In particular, AirSim also supports autonomous aerial vehicles, like drones. These kinds of simulators are very flexible, providing an easy way to collect data in different driving scenarios and weather conditions, with different vehicles and sensor modalities. TORCS (The Open Racing Car Simulator) [28] is a modular, multi-agent car simulator that focuses on racing scenarios, instead. Compared to CARLA and AirSim, TORCS has lower-quality graphics, no traffic and pedestrian simulation, and a limited set of sensors. Other kinds of driving simulators focus solely on traffic simulation. SUMO (Simulation of Urban Mobility) [27] is a microscopic traffic simulation tool that models each vehicle and its dynamics individually. In particular, SUMO can even simulate railways and the CO2 emissions of individual vehicles.

III. BACKGROUND
In this section we provide basic formalism and results about Reinforcement Learning [29], Generalized Advantage Estimation [30], and Proximal Policy Optimization [11], needed for understanding and developing the subsequent sections.

A. Reinforcement Learning
Reinforcement Learning (RL) [29] is a learning paradigm for decision-making problems that provides a formalism for modeling behavior: a software or physical agent learns how to take optimal actions within an environment (i.e. a real or simulated world) by trial and error, guided only by positive or negative scalar reward signals (sometimes called reinforcements).
Formally, an environment is a Markov Decision Process (MDP) represented by a tuple ⟨S, A, P, r, γ⟩, in which: S is the state space, A is the action space, P(s' | s, a) is the transition model (also called the environment dynamics) with which it is possible to predict the evolution of the environment's state, r : S × A → R is the reward function, and γ is the discount factor. The state space defines all the possible states s ∈ S (of the environment) that can be experienced by the agent, while the action space depicts all the possible actions a ∈ A that the agent can predict. If the state space is not fully observable, the agent instead perceives observations o ∈ O, which are yielded by the environment itself. The observation space O contains only a partial amount of the information described by S; the rest (such as the environment's internal state) is hidden. In order to recover such hidden information, the agent usually retains (or processes somehow) the full (or partial) history of the previous observations, i.e. o_{1:t}, up to the current timestep t. This setting is usually referred to as a partially observable Markov decision process (POMDP).
The agent derives actions according to its policy π : S → A, which can be either deterministic, a_t = π(s_t), or stochastic, π(a_t | s_t), mapping states s_t to actions a_t. Note that in a partially observable setting (i.e. a POMDP) the true states are not available to the agent, which instead derives actions by conditioning on (one or more) past observations: π(a_t | o_{j:t}), where the index j (j ≤ t) indicates how many past observations are considered. For our purposes, we restrict the policy to be a Deep Neural Network (DNN) [31], π_θ with learnable parameters θ, that samples actions from a probability distribution, i.e. a_t ∼ π_θ(· | s_t). In our case, the agent predicts two continuous actions, so we need to sample them from a continuous probability distribution such as a Gaussian. Motivated by [32], we use a Beta distribution instead, which, apart from outperforming the Gaussian distribution, is particularly suited for continuous actions that are also bounded.
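Sampling a bounded action from a Beta distribution can be sketched as follows (a minimal NumPy illustration; the function name is ours, and in practice the shape parameters α and β would be predicted by the policy network):

```python
import numpy as np

def sample_beta_action(alpha, beta, rng):
    """Sample an action from Beta(alpha, beta), whose support is [0, 1],
    then rescale it to the bounded action range [-1, 1]."""
    x = rng.beta(alpha, beta)   # x in [0, 1]
    return 2.0 * x - 1.0        # action in [-1, 1]

# Unlike a Gaussian, the Beta density has bounded support, so no
# clipping of sampled actions is ever needed.
rng = np.random.default_rng(42)
action = sample_beta_action(alpha=2.0, beta=2.0, rng=rng)
assert -1.0 <= action <= 1.0
```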
In order to learn the desired behavior, the agent has to interact with the target environment: at the first timestep (t = 0) the environment provides the agent with an initial state s_0 ∼ ρ(s_0), sampled from the initial state distribution ρ(s_0), usually implicitly defined by the environment. Then, the agent uses its policy to predict and execute an action a_0 affecting the environment, which results in state s_1 according to the environment dynamics, i.e. s_1 ∼ P(· | s_0, a_0). Consequently, the environment evaluates the newly reached state s_1 with its reward function, providing the agent with the respective immediate reward r_0 = r(s_0, a_0). The interaction loop then repeats for the next timestep, until either a final state or the maximum number of timesteps has been reached. In general, the interaction loop proceeds as follows: at a generic timestep t the agent experiences a state s_t, then it computes an action a_t resulting in state s_{t+1}, for which it receives a reward r_t = r(s_t, a_t) from the environment. In practice, we consider finite-horizon episodes of maximum length T.
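The interaction loop above can be sketched as follows (a toy stand-in environment, not CARLA; all names are illustrative):

```python
import numpy as np

class ToyEnv:
    """Minimal stand-in for a CARLA-like environment (purely illustrative)."""
    def __init__(self, horizon=5):
        self.horizon, self.t = horizon, 0
    def reset(self):
        self.t = 0
        return np.zeros(2)                      # initial state s_0 ~ rho(s_0)
    def step(self, action):
        self.t += 1
        next_state = np.full(2, float(self.t))  # s_{t+1} ~ P(. | s_t, a_t)
        reward = -abs(float(action))            # r_t = r(s_t, a_t)
        done = self.t >= self.horizon           # finite horizon T reached?
        return next_state, reward, done

env = ToyEnv(horizon=5)
state, total_reward, done = env.reset(), 0.0, False
while not done:              # the generic interaction loop
    action = 0.5             # a_t ~ pi(. | s_t); fixed here for brevity
    state, reward, done = env.step(action)
    total_reward += reward
```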

B. Proximal Policy Optimization
Proximal Policy Optimization (PPO) [11] is a model-free RL algorithm from the policy optimization family that aims to learn policies in a faster, more efficient, and more robust way compared to vanilla policy gradient [33] and TRPO [34]. In general, the aim of RL algorithms is to indirectly maximize the performance objective J(θ) in order to maximize the agent's performance on the given task:

J(θ) = E_{π_θ}[ Σ_{t=0}^{T−1} γ^t r_t ]    (1)

Maximizing the performance objective J(θ) means maximizing the expected sum of discounted rewards, seeking a policy π* = arg max_π J(θ) that achieves maximal performance (i.e. Σ_t r_t is maximal). The objective (1) is stochastic (since the rewards result from states and actions sampled by following π), apart from not being directly differentiable. Hence, policy optimization algorithms (like other RL methods) instead optimize a surrogate objective J̃(θ), called the policy gradient:

J̃(θ) = E_t[ log π_θ(a_t | s_t) A(s_t, a_t) ]    (2)

where π_θ is a policy parameterized by θ, and A(s, a) is the advantage function. The PPO algorithm optimizes a slightly different policy gradient objective to maximize J(θ). In particular, we utilize the following clipping objective variant (borrowing notation from [11]):

L_clip(θ) = E_t[ min( ratio_t(θ) Â_t, clip(ratio_t(θ), 1 − ε, 1 + ε) Â_t ) ]    (3)

where ratio_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) denotes the probability ratio between the current policy π_θ and the old policy π_θ_old, Â_t represents the advantages estimated by using GAE, and, lastly, the function clip(·) truncates ratio_t(θ) at 1 − ε if the advantages Â_t are negative, otherwise the ratio is clipped at 1 + ε. The hyper-parameter ε is usually set to 0.2. In practice, we also add an entropy regularization term H[π_θ] to the objective (3) to ensure diverse enough actions.
Equation (3) depicts the clipping objective used by PPO to improve the policy's parameters θ; moreover, the clipping function ensures that the current policy is not too different from the old policy, so that divergent behavior is less likely. Finally, the agent's parameters θ are updated by performing multiple gradient steps (usually with the Adam optimizer [35]) with respect to (3). The update rule is:

θ' = θ + η ∇_θ L_clip(θ)

where θ' are the new parameters, and η is the learning rate.
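A minimal NumPy sketch of the clipped objective (3) might look like this (function and variable names are our own; real implementations operate on log-probabilities produced by the policy network):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Per-batch PPO clipping objective L_clip (to be maximized).
    ratio_t = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), computed
    from log-probabilities for numerical stability."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantages, clipped * advantages).mean()

# A large positive-advantage step is capped at (1 + eps) * A_t:
big_step = ppo_clip_objective(np.log([2.0]), np.log([1.0]), np.array([1.0]))
assert abs(big_step - 1.2) < 1e-9
```

The `min` between the unclipped and clipped terms is what removes the incentive to move the ratio outside [1 − ε, 1 + ε], keeping the new policy close to the old one.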

C. Generalized Advantage Estimation
Many RL algorithms belonging to the policy optimization family, such as REINFORCE [33], TRPO [34], PPO [11], and A3C [36], require estimating the advantage function A(s, a) in order to learn the desired behavior. The advantage function is defined as the difference between the action-value function Q(s, a) and the state-value function V(s):

A(s, a) = Q(s, a) − V(s)

Intuitively, the advantage function tells us how much the action a is better or worse than the average action while being in state s. In particular, the action a is better than average if Q(s, a) > V(s), and worse than average if Q(s, a) < V(s). To estimate the advantage function it is only necessary to learn either the state-value function V(s) or the action-value function Q(s, a), since each can be defined in terms of the other. In particular, we use the Generalized Advantage Estimation (GAE) [30] technique, an exponentially-weighted estimator of the advantage function that only requires a learned state-value function. The GAE estimator has two hyper-parameters, γ ∈ (0, 1] and λ ∈ [0, 1], which allow us to trade variance for bias. The generalized advantage estimator GAE(γ, λ) is defined as an exponentially-weighted sum of TD residuals:

Â_t^{GAE(γ,λ)} = Σ_{k=0}^{∞} (γλ)^k δ_{t+k}

It builds on the n-step return estimator Â_t^{(n)}, defined as the sum of n TD-residual terms δ_{t+k} = r_{t+k} + γ V_φ(s_{t+k+1}) − V_φ(s_{t+k}):

Â_t^{(n)} = Σ_{k=0}^{n−1} γ^k δ_{t+k} = −V_φ(s_t) + r_t + γ r_{t+1} + γ² r_{t+2} + · · · + γ^{n−1} r_{t+n−1} + γ^n V_φ(s_{t+n})

where V_φ is a learned state-value function.
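The GAE estimator is commonly computed by a backward recursion over one episode; a minimal NumPy sketch (names are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single episode.
    `values` must contain one extra bootstrap entry: len(values) == len(rewards) + 1.
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t = sum_k (gamma * lam)^k * delta_{t+k}, via backward recursion."""
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages
```

With λ = 0 the estimator reduces to the one-step TD residual (low variance, high bias); with λ = 1 it becomes the full discounted sum of residuals (high variance, low bias).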

IV. PROPOSED APPROACH
A. Learning Environment
The learning environment E = ⟨S, O, A, P, r, γ⟩, formally a Partially Observable Markov Decision Process (POMDP), defines the task that the agent has to complete. This environment E was built by combining the CARLA driving simulator (version 0.9.9) [9] and OpenAI's gym library [37]:
• State space S: implicitly defined by CARLA, containing ground-truth information about the whole world. The agent cannot observe the environment's state s_t ∈ S, thus the states are hidden. At each timestep t, the state s_t yields the corresponding observation o_t, which is what the agent observes.
• Observation space O: each observation is a tuple o_t = (I, G, V, N), where: I is a 90 × 360 × 3 image obtained by concatenating (along the width axis) three 90 × 120 × 3 images from left, middle, and right RGB camera sensors; G is a 9-dimensional vector that encodes road features; V is a 4-dimensional vector that embeds vehicle features; and, lastly, N is a 5-dimensional vector that contains navigational features. The road features G comprise: three Boolean values (is_intersection, is_junction, and is_at_traffic_light), the speed limit (a float), and the traffic light state (a 5-dimensional one-hot vector). The vehicle features V contain: the current vehicle speed, the actual throttle and brake values, and the vehicle similarity score v_sim ∈ [−1, 1] w.r.t. the next planned route waypoint (center of a road segment). Lastly, the navigational features N include n = 5 distances between the actual location of the vehicle and the next n planned route waypoints.
• Action space A: composed of two continuous actions with values in the range [−1, +1]. These two actions are the accelerator or brake value, and the steering angle.
• Transition dynamics P(s_{t+1} | s_t, a_t): defines how the environment's state s_t ∈ S evolves in time due to the application of the predicted actions a_t ∈ A. The transition dynamics is defined by CARLA, and is not explicitly learned by the agent (model-free setting).
• Reward function r: penalizes any collision, as well as following the wrong planned route. Compared to other methods [7], [22], our reward function is simple, as it relates the vehicle's speed, direction, infractions, and collisions in an intuitive way, thus avoiding the need to optimally weight many different terms: where v_speed is the vehicle's speed, v_sim is the vehicle's (cosine) similarity with the next waypoint w, d_w is the l2 distance between the vehicle's position and w, s_limit is the speed limit, and, lastly, c_p is the penalty for colliding with objects, vehicles, and pedestrians.
• Discount factor γ ∈ (0, 1]: future rewards are discounted by a factor of γ at each timestep.

B. Data Augmentation
As demonstrated by previous work [5], [17], [39] data augmentation is crucial to let the agent generalize across different towns and weather conditions. Similarly to [5], the augmentations used are: color distortion (i.e. changes in contrast, brightness, saturation, and hue), Gaussian blur, Gaussian noise, salt-and-pepper noise, cutout, and coarse dropout. Each augmentation function is applied with a certain probability and intensity (see Fig. 1).
Geometrical transformations, commonly used for image detection tasks, including horizontal or vertical flipping, rotation, and shearing, are not applied in this case since they would significantly alter the driving scene.
Note that data augmentation has only been used in the last two stages of the reinforced curriculum learning procedure (more details in Section IV-F).
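The augmentation step can be sketched as follows (a minimal NumPy illustration of an illustrative subset of the transforms; the probabilities and intensities here are placeholders, not the tuned values used in training):

```python
import numpy as np

def augment(image, rng=None, p=0.5):
    """Apply each augmentation independently with probability p.
    `image` is an HxWxC float array in [0, 1]. Illustrative subset:
    Gaussian noise, salt-and-pepper noise, and cutout."""
    if rng is None:
        rng = np.random.default_rng()
    out = image.copy()
    if rng.random() < p:  # Gaussian noise
        out = np.clip(out + rng.normal(0.0, 0.05, out.shape), 0.0, 1.0)
    if rng.random() < p:  # salt-and-pepper noise on ~2% of pixels
        mask = rng.random(out.shape[:2])
        out[mask < 0.01] = 0.0
        out[mask > 0.99] = 1.0
    if rng.random() < p:  # cutout: zero out a random rectangular patch
        h, w = out.shape[:2]
        y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
        out[y:y + h // 4, x:x + w // 4] = 0.0
    return out
```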

C. Agent Architecture
The agent is implemented by a deep neural network [31] that takes the current observation o_t as input, and outputs the next action a_t ∼ π_θ(z_t) along with its value v_t = V_φ(z_t), where z_t = P_ψ(o_t). The deep neural network representing the agent has two branches: the policy branch π_θ with parameters θ (the actor), and the value branch V_φ with parameters φ (the critic). The policy branch samples actions from a Beta distribution, as motivated by [32]. The value branch outputs the values v of the states s, which are used to estimate the advantage function A(s, a) with the GAE [30] technique. Both branches share a common neural network P_ψ with parameters ψ that processes observations o into an intermediate representation z. Given the four components o_t^i of each observation, the network P_ψ is applied sequentially on each o_t^i, yielding four z_t^i which are aggregated by Gated Recurrent Units (GRUs) [40] to obtain z_t. Moreover, P_ψ embeds a ShuffleNet V2 [41] to process image data. Finally, both V_φ and π_θ are feed-forward NNs with two SiLU-activated [42] layers of 320 units each, with batch normalization [43].
The overall architecture of the agent is depicted in Fig. 2. The blue rectangles indicate fully-connected (or dense) layers. The blue circle, i.e. ⊕, denotes layer concatenation along the first dimension (or axis), where the batch dimension is at axis zero.

Fig. 2. The neural network architecture of the proposed agent (with minor omissions). The first half depicts the shared network P_ψ, while the second half shows, respectively from top to bottom, the value V_φ and policy π_θ branches. At the center, the outputs of the first half of the network are first concatenated and then linearly combined, before being fed to both the value and policy branches.

The shared network P_ψ (first half) processes each component of the observation tensor o_t^i separately; the results are independently aggregated by GRU layers [40] into single vectors. Then, the output of the concatenation is linearly combined and fed to the two branches. Lastly, values are decomposed into two numbers, bases b and exponents e, as motivated in the following section.

D. Learning the Value Function
The value function is learned by minimizing the squared error between the predicted values v = V_φ(s_t) and the returns R:

L_v(φ) = ‖v − R‖²₂

where R = Σ_{i=t}^{T−1} γ^i r_i is the discounted sum of rewards from timestep t to the end of the episode, T − 1.
Notice that when the quantity ‖v − R‖²₂ is large, because the estimate v is far from the ground-truth return R, the (norm of the) gradient ∇_φ L_v(φ) is also large, so the parameters φ receive a big update that can make training less stable. A commonly used practice to reduce variance is to normalize both values and returns to have zero mean and unit variance, so that the magnitude of the error is always small. However, the statistics needed to normalize values and returns without bias are not known in advance; for this reason this approach is biased: the scale of such quantities changes as the performance of the agent improves.
The following outlines the approach used to learn the value function in a stable and accurate way, without any normalization bias: both the values v and the returns R are respectively decomposed into bases b ∈ [−1, 1] and exponents e ∈ [0, k], such that

v = b_v · 10^{e_v},  R = b_R · 10^{e_R}

where k ∈ N is a positive constant that should be large enough to represent even the largest returns. For example, we set k = 6 so that even returns up to ±10^6 can be properly represented. With such a base-exponent decomposition, learning the value function is a matter of regressing both bases and exponents; the new loss function L_v(φ) is defined as follows:

L_v(φ) = (b_v − b_R)² / 4 + (e_v − e_R)² / k²

Hence, even large errors now lie in a small interval because both the base b and the exponent e take values in a small interval, and so the gradient ∇_φ L_v(φ) is always reasonably small, resulting in more stable training. Note that the bases b have a different scale than the exponents e, so we normalize the two error terms (by respectively dividing by 4 and k²) such that they contribute equally to the loss value, once again avoiding the need to weight them. The normalizing coefficients are obtained by considering the worst case of the squared differences. Since the bases b ∈ [−1, 1], the worst case (i.e. the largest error value) is (1 − (−1))² = 4: supposing b_v = 1 and b_R = −1 (or vice-versa). Similarly for the exponents, (0 − k)² = k², since e ∈ [0, k], again supposing e_v = 0 and e_R = k (or vice-versa).

Fig. 3. Example of value function learned through base-exponent decomposition. In the leftmost plot, the learned value function compared to returns; in the center plot, the regression of bases; in the rightmost plot, the regression of exponents.
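The decomposition and the resulting loss can be sketched in NumPy as follows (the rule mapping a scalar to its base and exponent is one consistent choice of ours; the authors' exact mapping may differ):

```python
import numpy as np

K = 6  # exponents lie in [0, K]: magnitudes up to 10**K are representable

def decompose(x):
    """Split a scalar x into base b in [-1, 1] and exponent e in [0, K]
    such that x == b * 10**e (one consistent mapping; illustrative)."""
    e = float(np.clip(np.ceil(np.log10(max(abs(x), 1e-12))), 0, K))
    b = x / 10.0 ** e
    return b, e

def value_loss(b_v, e_v, b_r, e_r):
    """Squared errors on bases and exponents, normalized by their
    worst-case values 4 and K**2 so both terms contribute equally."""
    return (b_v - b_r) ** 2 / 4.0 + (e_v - e_r) ** 2 / K ** 2

b, e = decompose(2500.0)
assert abs(b * 10.0 ** e - 2500.0) < 1e-9 and -1.0 <= b <= 1.0 and 0 <= e <= K
```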

E. Sign-Preserving Advantage Normalization
The estimated advantages Â_t directly affect the norm of the gradient ∇_θ L^clip(θ) of PPO's policy objective (3), since they appear as a multiplicative factor. Consequently, if the advantages are large, the norm of ∇_θ L^clip(θ) is also large, resulting in a considerable change of the policy's parameters and thus of the agent's behavior, which may easily diverge; alternatively, we could lower the learning rate by several factors, potentially slowing down training. Note that the magnitude of the advantages strictly depends on the quality of the learned value function: poorly estimated values imply large advantages, since Â ≈ R − V_φ(s), where V_φ is the learned value function and R the true return. It is therefore important to scale the advantages to a reasonable range, without introducing any bias, in order to stabilize learning (Fig. 4).
For this purpose we propose a sign-preserving normalization function, which normalizes positive values separately from negative ones. The function is defined by the following TensorFlow 2 [44] code. Advantages normalized with this function have the benefit of keeping the same sign (and, thus, the same meaning) as the original advantages (Fig. 5), while having a small and controllable scale, which we argue contributes to stabilizing training. Preserving the sign is an important property that avoids detrimental gradient-flipping issues, in which better-than-average actions and worse-than-average actions get mismatched in the policy update. Widely used normalization techniques, such as min-max normalization and standardization (i.e. zero-mean unit-variance normalization), lack this property. In particular, min-max normalization maps values to the range [0, 1], such that the minimum value corresponds to 0 and the maximum to 1: the normalized advantages would thus always be positive, and the sign would be lost. Similarly, standardization makes negative every value below the mean, even when the original advantage was positive.
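A minimal plain-Python sketch of the idea follows (our reconstruction, not the authors' exact TensorFlow 2 implementation; the specific scaling choice, dividing positives and negatives by their respective maximum magnitudes, is an assumption):

```python
def sign_preserving_normalize(advantages, eps=1e-8):
    """Normalize positive and negative advantages separately,
    mapping them into [-1, 1] while preserving each sign."""
    pos_max = max((a for a in advantages if a > 0), default=0.0)
    neg_max = max((-a for a in advantages if a < 0), default=0.0)
    out = []
    for a in advantages:
        if a > 0:
            out.append(a / (pos_max + eps))   # lands in (0, 1]
        elif a < 0:
            out.append(a / (neg_max + eps))   # lands in [-1, 0)
        else:
            out.append(0.0)
    return out
```

Because positives and negatives are scaled independently, each normalized advantage keeps the sign of the original one, unlike min-max normalization or standardization.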

F. Reinforced Curriculum Learning
Since the problem of autonomous driving is extremely complex, we adopt a stage-based learning procedure for our PPO agent, inspired by Curriculum Learning [12]. We divide the whole reinforcement learning procedure into five stages of growing difficulty, such that the agent is guided to learn increasingly complex behaviors. Each stage has its own version of the learning environment (E) that emphasizes specific aspects of the driving task. All the following stages take place in Town03, one of the most complete and challenging towns available in CARLA.
• Stage 1: the agent's starting point is sampled from a fixed set of n = 10 locations (determined by fixing the random generator's seed). In this stage the agent has to respect the speed limits, and there are no dynamic objects other than the vehicle controlled by the agent (no-traffic scenario).
• Stage 2: n is set to 50, to let the agent experience more diverse starting locations. In addition, the simulator randomly spawns up to 50 pedestrians walking freely across the map, possibly crossing streets.
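For illustration, the first two stages could be encoded as a small per-stage configuration table (a sketch only; the field names and the `make_env_config` helper are hypothetical, and only the parameter values stated above are drawn from the text):

```python
# Hypothetical per-stage environment configuration (sketch).
STAGES = {
    1: {"num_spawn_points": 10, "max_pedestrians": 0,  "town": "Town03"},
    2: {"num_spawn_points": 50, "max_pedestrians": 50, "town": "Town03"},
}

def make_env_config(stage):
    """Return the environment settings for a given curriculum stage."""
    return STAGES[stage]
```

Keeping each stage as data makes it straightforward to advance the curriculum by simply rebuilding the environment with the next stage's settings.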

V. RESULTS
In this section we provide both quantitative (Table I) and qualitative (Fig. 6) results of the driving policy resulting from our reinforced curriculum learning procedure [45].

A. Evaluation Procedure
We perform an extensive evaluation of our agent against six metrics, on all of CARLA's towns, under different weather conditions, and in the three traffic scenarios proposed in the NoCrash benchmark [6]:
• Metrics: collision rate, similarity, waypoint distance, speed, total reward, and timesteps.
• Towns: in CARLA each town has its own unique features.
We trained our agent on one town only, Town03, and evaluated it on Town01, Town02, Town03, Town04, Town05, Town06, Town07, and Town10.
• Weather: we evaluate on two disjoint sets of weather presets. The first set (described in Section IV-F) was used only for training; the other is novel for the agent: [WetCloudyNoon, WetCloudySunset, CloudySunset, HardRainNoon, MidRainyNoon, MidRainSunset].
• Traffic: as in the NoCrash benchmark [6], we evaluate our agent on three different traffic scenarios: no traffic (without any pedestrians or vehicles), regular traffic (50 pedestrians and 50 vehicles), and dense traffic (200 pedestrians and 100 vehicles).
We also evaluate the benefit of curriculum learning by comparing the same agent trained with and without the curriculum: we refer to the former as curriculum (C) and to the latter as standard (S). Moreover, we provide the (non-trivial) baseline performance of an agent with the same architecture as the other two, but with random weights kept fixed for the entire evaluation procedure: we refer to this agent as untrained (U). Notice that the untrained agent is a stronger (but still naive) baseline than a purely random-guess agent, which completely discards the input observations it receives and simply samples actions uniformly. Relative performance, aggregated over the three traffic scenarios as well as the two weather sets, is shown in Table I. Qualitative results are provided in Fig. 6.

B. Discussion
From the detailed evaluation results, we point out two major weaknesses of our approach: the agent struggles (1) at coordinating acceleration and braking, and (2) at recognizing obstacles. This results in low speed (about 9 km/h) as well as many collisions. Such behavior could be due to a lack of exploration, limited network capacity and/or architecture, or various difficulties in optimizing the policy gradient.
We also demonstrate the following: (1) driving behavior emerges without leveraging any domain knowledge; (2) this behavior is robust and consistent across towns and weather conditions; and (3) the stage-based reinforcement learning procedure proves competitive with, and even better than, plain reinforcement learning.

VI. CONCLUSION
Deep reinforcement learning is still a relatively young field with many unexplored research directions, yet it enables us to solve even complex decision-making problems in a completely end-to-end fashion, without leveraging any domain-specific knowledge or expensive sets of highly-annotated data. In contrast, imitation learning is a stronger approach for autonomous driving, but one that heavily relies on datasets of high quality and large size, which should also provide demonstrations of recovery from driving mistakes in order to learn a reliable driving policy.
Although our approach is not yet competitive with the state-of-the-art (CIRL [7], CAL [17], and CILRS [6]), we demonstrate emerging driving behavior that is consistent across all CARLA towns and robust to changes in weather. To our knowledge, we are the first to provide baseline performance on all towns, and to demonstrate such consistency. We also provide a decomposition of the returns that allows learning the value function in a stable and accurate way, as well as a proper normalization function for the estimated advantages.