On the Road With 16 Neurons: Towards Interpretable and Manipulable Latent Representations for Visual Predictions in Driving Scenarios

This paper proposes a strategy for visual perception in the context of autonomous driving. Humans, when not distracted or drunk, are still the best drivers you can currently find. For this reason, we take inspiration from two theoretical ideas about the human mind and its neural organization. The first idea concerns how the brain uses structures of neuron ensembles that expand and compress information to extract abstract concepts from visual experience and code them into compact representations. The second idea suggests that these neural perceptual representations are not neutral but functional to predicting the future state of affairs in the environment. Similarly, the prediction mechanism is not neutral but oriented to the planning of future action. We identify within the deep learning framework two artificial counterparts of the aforementioned neurocognitive theories. We find a correspondence between the first theoretical idea and the architecture of convolutional autoencoders, while we translate the second theory into a training procedure that learns compact representations which are not neutral but oriented to driving tasks, from two distinct perspectives. From a static perspective, we force separate groups of neural units in the compact representations to represent specific concepts crucial to the driving task distinctly. From a dynamic perspective, we bias the compact representations to predict how the current road scenario will change in the future. We successfully learn compact representations that use as few as 16 neural units for each of the two basic driving concepts we consider: cars and lanes. We maintain the two concepts separated in the latent space to facilitate the interpretation and manipulation of the perceptual representations. The source code for this paper is available at https://github.com/3lis/rnn_vae.


I. INTRODUCTION
Road traffic injuries are the leading cause of death for the age group between 5 and 29 years [1]. The World Health Organization reported that in 2018 the number of road traffic deaths was 16 times larger than in war conflicts from that same year [1]. This suggests that mitigation of motor vehicle accidents will probably be one of the most beneficial outcomes expected from artificial intelligence and automation [2]. In fact, in the US only 2% of vehicle crashes are due to technical failures; the rest is attributable to the human drivers. Among the major causes of accidents are inattention, fast or reckless driving, illegal maneuvers, the influence of alcohol or drugs, and tiredness [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Junhua Li.
Self-driving cars will be immune to all the risk factors that depend on human drivers. The development of fully autonomous vehicles has always been considered a coveted achievement for modern society. The research in this field has a long history that dates back to the late 70s [4], but it became a reality, at an unusually fast pace, no longer than a decade ago [5]. While most of the components of a self-driving system (such as sensors) have improved at the typical rate of technological progress without any specific crucial innovations, the impressive advances have been mainly fueled by the emerging ''deep'' versions of artificial neural networks [6]-[8].
FIGURE 1. The idea behind our approach. (a) A first model learns to represent visual scenarios into compact vectors that are at once semantically organized and temporally coherent. By exploiting semantic segmentation as a supporting task, the model forces separate groups of neurons to distinctly represent the basic concepts of cars and lanes, while self-supervision is adopted to bias the internal representation towards the ability to predict the dynamics of objects in the scene. (b) A second neural network uses the compact representations to perform imagery and predict long-term future frames.
Since the early beginnings, the greatest challenge for autonomous driving systems has been the perception and understanding of the road environment. This is precisely one of the successful fields of application of deep neural models [9]-[11], which have quickly become the method of choice for driving scene perception [12]-[15]. However, despite the impressive progress, perception remains the major obstacle towards fully autonomous vehicles. The core of this issue can be identified in the narrow conception of ''perception'' usually assumed in autonomous driving, which lacks a fundamental aspect: to gather knowledge about objects and events in the environment oriented to plan future actions [16], [17]. Hence, perception is not a mere elucidation of objects in the world but the detection of action possibilities.
In this respect, it might be useful to reflect on how humans are able to drive. When not distracted, asleep, or deliberately engaged in dangerous maneuvering, humans are excellent at driving, as at many other complex and highly specialized sensorimotor behaviors. How the human brain realizes such sensorimotor behaviors is far from being fully understood, but a few general neurocognitive theories try to shed light on this. We believe it is useful to borrow in particular two theoretical ideas to design the perception strategy of autonomous vehicles.
The first neurocognitive theory we take inspiration from concerns how sensory information is coded into low-dimensional representations in the brain. These perceptual representations can capture aspects related to the actions that caused the perceptual stimulus. Because of the sensorial information saved in the representations, the brain can recreate the original stimulus in an approximated form, during a phenomenon called mental imagery [18], [19]. One of the first pieces of evidence of these internal representations was found in the work of Damasio [20], who identified neuron ensembles exhibiting a convergent structure, where neural signals are projected onto multiple cortical regions in a many-to-one fashion. Damasio later developed a broader theory [21] identifying more sophisticated neural structures he called convergence-divergence zones (CDZs). In this case, the very same neuron ensembles can perform both convergent and divergent projections, depending on the current action the brain is engaged with: the convergent flow is dominant during perceptual recognition, while the divergent flow occurs during mental imagery. For this reason, CDZs have been recognized as a crucial component in the formation of concepts in the brain [22]. Therefore, we believe it useful to design an artificial model with a similar hierarchical architecture for learning the abstract concepts relevant to the driving context.
The second theoretical idea concerns the nature of the neural representations in the brain. In most cases, neural representations are not abstract representations of the environment but neural states functional to predicting the state of affairs in the future environment. The ability to predict appears to be the primary goal of intelligence [23], [24]. There is evidence for the existence in the brain of various circuits that provide prediction from perceptual representations. In particular, two forms of prediction, procedural and declarative, are typically acknowledged in different brain structures [25]. However, one of the most popular theories in the field interprets the mental mechanism of prediction in mathematical terms [26], [27]. This theory, called predictive brain, explains the behavior of the brain as the minimization of free-energy, a quantity that can be expressed in mathematical form. We will show how this formulation can be adopted as a loss function to train our model.
Our work aims to learn conceptual representations of the driving scenario from visual information. We intend to learn compact and informative representations that can be useful for a variety of downstream driving tasks, primarily the tasks requiring predictive capabilities. We propose a cognitive-inspired approach that forces the representations to be oriented to the driving tasks, under two distinct perspectives.
1) From a static perspective, we force separate groups of neural units to encode specific concepts crucial in the driving task distinctly. Specifically, we use as few as 16 neurons for each of the two basic concepts we adopt: cars and lanes. The latent space is explicitly partitioned in regions that encode different concepts so that they can be manipulated individually.
2) From a dynamic perspective, we bias the compact representations to predict how the current road scene would change in the future.
Albeit this work does not fully develop visual mental imagery, it constitutes progress from mere perception to the creation of manipulable concepts that may increase the cognition abilities of intelligent vehicles, such as action-selection based on online imagery or developing improved sensorimotor abilities based on episodic simulation.
We achieve the conceptual representations by implementing an artificial neural model in line with the two aforementioned neurocognitive theories. We would like to note the term ''neural'' in artificial neural models by no means implies a faithful replication of the computations performed by biological neurons. On the contrary, the mathematics of deep learning shares little resemblance with the way the brain works [28], [29]. However, we identify two methods within the framework of artificial neural networks (ANNs) that appear, at least in part, rough algorithmic counterparts of the neurocognitive theories described above. Specifically, the CDZs may find a correspondence in the idea of convolutional autoencoders [30], while the predictive brain theory resonates with the adoption of Bayesian variational inference in combination with autoencoders [31], [32].
This work is part of the H2020 project Dreams4Cars,¹ aimed at developing an artificial driving agent inspired by the neurocognition of human driving [33]. In the following section, we describe the objective of our work in more detail, and we discuss in §III the most significant related works. In §IV, we describe the implementation of four different neural models that successfully learn informative and compact representations. Lastly, §V presents the results of our models on the SYNTHIA dataset.

II. WHAT THIS PAPER IS (AND IS NOT) ABOUT
In the next section, we will review other works in the domain of autonomous driving that share objectives or methods with our proposal. Before that, we consider it useful to frame our proposal within the broader context of computer vision, trying to clarify similarities and differences between our approach and other relevant works in the domain of computer vision.
When looking at the results produced here, for example Figs. 7 and 9, it may seem that the outcome of our model is essentially image segmentation. Image segmentation is the process of partitioning an image into meaningful subsets, and it has been one of the popular tasks in classical image processing [34]-[36] and continues to be a major topic in the era of deep learning for computer vision [37]-[40]. However, image segmentation has limited relevance to our work. Even if the outputs of the networks here presented indeed include the segmentation of cars and lanes, this is not the objective of the model. Our model aims at learning representations of the driving scenario that can be exploited for imagination in the driving context. We want these representations to be, first of all, meaningful. The representations must bear a semantic explanation, i.e., parts of the latent space are associated with concepts useful in the context of driving: cars and lanes in this case, but the work is open to further extensions such as pedestrians or bikes. The model learns these meaningful representations by exploiting semantic segmentation as a supporting task, as we will show in §IV-B, using a multi-decoder network which forces the partitioning of the internal representations into distinct concepts. In this context, segmentation can therefore be considered a practical way to achieve the separation of the semantic concepts in the latent space. Although this idea of partitioning the latent space may look like an expedient, we think that it may be related to the notion of topographic organization largely present in the brain, where similar concepts are encoded in close groups of neurons [41]-[43].
¹ www.dreams4cars.eu
Besides having a semantic organization, the representations learned by our model have a second important feature: they can be exploited for imagery, much like the brain's CDZs do, as described in §I. In this context, imagery is essentially constructing a static scene with attention to the conceptual entities considered: cars and lanes in our case. This process can result from a latent representation of a scenario seen before, or it can be triggered by a prediction of a future scenario based on past ones. It can also result from manipulating a latent space, generating scenarios the model has never seen before. In conclusion, it is now evident that semantic segmentation is just a byproduct of our entire model and not its primary focus.
Having clarified the role of segmentation in our work, we want to discuss the connection with another important machine learning domain called self-supervision. Unlike unsupervised learning, self-supervision is not motivated by biological plausibility; it is instead a way around the ever-present issue of manual data labeling in large datasets of images [44], [45]. Usually, self-supervision is realized by designing pretext tasks without any particular relevance for the agent but useful for the automatic generation of pseudo-labels. While learning to solve the pretext tasks, the model is forced to capture certain visual features of images that are ideally useful for the core task of the agent.
The computer vision community has proposed several kinds of creative pretext tasks for self-supervision. A prevalent task is colorization [46], where a color image is first converted to gray level, and the model learns to reconstruct the color version. Another kind of task is solving jigsaw puzzles made from patches of the input image [47]. There are also self-supervision tasks that are indeed useful to the overall objective of the model, but where the labeling is obtained by analytical methods [48]: a common example is the exploitation of the epipolar constraints in a stereo image pair as supervision for training a monocular image depth estimation model [49].
On the other hand, a small number of approaches exploit prediction as a self-supervision task. Our model adopts this idea, using prediction of future frames to bias the internal representation towards the ability to learn the dynamics of objects in the scene. In this sense, prediction for self-supervision shows a connection with the cognitive idea of predictive brain we mentioned before in §I.
Still, not all approaches maintain a sound cognitive account of prediction in the context of vision. For example, [50] arranges images in overlapping blocks by rows and columns, scanned in sequence with recursive networks attempting to ''predict'' the next block. This account of prediction is clearly an artifact with no correspondence in a cognitive agent. Instead, our work aims to include effective forms of prediction: prediction as imagination, and prediction as the construction of a probable future scenario. The DeepMind research group also widely adopts prediction for self-supervision [51]-[53] in a way more similar to ours.
One of the few works based on a cognitive account of prediction is the model proposed by Ha and Schmidhuber [54]. This model shares some fundamental components with our architectures: the use of variational autoencoders and recursive neural networks. There is, however, a significant difference in the objectives of the models. The work of Ha and Schmidhuber is a complete agent and includes other components not considered in our model, like a controller responsible for determining the course of actions of the agent. Their wider architecture comes at the expense of a very shallow perceptual capability. Much like complex neural networks of the past generation, this model is an interesting proof of concept working in synthetic simplified examples. The simple game-like scenario on which the model has been tested has an overly simplified visual appearance, with no perspective and a very low resolution. Conversely, our aim is not training an agent, but learning the perceptual capability needed for visual imagery, including the projection of hypothetical driving scenarios in visual space.

III. RELATED WORKS
It is not uncommon for works adopting neural networks for perception in autonomous vehicles to declare virtues of a neurocognitive inspiration [55]-[57]. However, often these ideas do not transfer the specific brain mechanisms into algorithms. To the best of our knowledge, the two neurocognitive principles embraced by this work, Damasio's CDZs and Friston's predictive brain, have not been proposed in any work on perception for autonomous driving. Besides, the striking similarity between the formulation of brain predictivity given by Friston and the variational autoencoder algorithm seems to remain unnoticed, with few exceptions [58].
The idea of autoencoder has been at the heart of the ''deep'' turn of ANNs [59]-[61], and the variational version has rapidly gained attention [62]. Still, in the domain of autonomous vehicle perception, this architecture is not as popular as other approaches like end-to-end. In the end-to-end strategy, images from a front-facing camera are fed into a stack of convolutions, followed by feedforward layers which generate the low-level commands. The first attempt in this direction dates before the rise of deep learning [63], and it has been the groundwork for the later popular NVIDIA's PilotNet [13], [64]. One of the most severe drawbacks of end-to-end systems based on static frame processing is the erratic variation of steering wheel angle within short time periods. A potential solution is to provide a temporal context in the models, combining convolutions with recurrent networks [65].
Still, the most appealing feature of the end-to-end strategy, dispensing with internal representations, is also the primary source of its troubles. Learning the entire range of road scenarios from steering supervision alone, considering all possible appearances of objects relevant to the drive, is not achievable in practical settings. For this reason, several more recent proposals suggest the inclusion of intermediate representations, such as the so-called mid-to-mid strategy used in ChauffeurNet [66], Waymo's autonomous driving system. ChauffeurNet is essentially made of a convolutional network that consumes the input data to generate an intermediate representation with the format of a top-down view of the surrounding area and salient objects. Besides, ChauffeurNet has several higher-level networks that iteratively predict information useful for driving. Another work [67] proposes to overcome the object agnosticism of the end-to-end approach with an object-centric deep learning system for autonomous vehicles. In this proposal, a first convolutional neural module takes an image and produces an intermediate representation. Then, other downstream networks are diversified depending on a taxonomy of object-related structures in the intermediate representation, and the structures are lastly converted into discrete driving actions. The system proposed by Valeo Vision also uses an internal representation [68] constructed using a standard ResNet50 model [69] with the top fully-connected layers removed. The feature representation is shared across many tasks relevant to visual perception in automated driving such as object detection, semantic segmentation, and depth estimation. All the downstream tasks are realized using the top parts of standard models like YOLO [70] for object detection or FCN-8 [37] for semantic segmentation.
None of the works reviewed so far builds the internal representations through the idea of the autoencoder. We found just two notable exceptions in the field of perception for autonomous driving. The first one is by the company comma.ai [71], where the latent representations of 2048 neurons are obtained with a variational autoencoder using input images of 160 × 80 pixels. Once trained, the latent representations are used for predicting successor frames in time with a recurrent neural network. The second exception is a work by Toyota in collaboration with MIT [72], which proposes a variational autoencoder learning representations of 25 neurons. The entire internal representations are decoded to restore the input image of 200 × 66 pixels as in a standard autoencoder. Besides, one neuron of the representation is interpreted as steering angle, so end-to-end supervision for this neuron is mixed in the total training loss.
There are similarities between these last two approaches and the one we present, but also fundamental differences. The latent representation of Amini et al. [72] does not take into account the crucial time dimension of the perceptual driving scenario. On the other hand, Santana and Hotz [71] include their internal representation in a recursive network for prediction, but time dependency is not exploited when learning the compact representation. Moreover, the comma.ai model is agnostic about the meaning of the neurons composing the latent representation, while Amini et al. assign meaning to just the single neuron coding steering angles. We already discussed in §II how a key strategy of our model is to assign conceptual meaning to separate groups of neurons in the latent representation. In contexts different from autonomous vehicles, the idea is not new. For example, in computer vision, [73] proposed a work for the generation of head poses using a latent space with separate representations for viewpoints, lighting conditions, and shape variations. Also, in [74] the latent vector is partitioned in semantic content and geometric coding. We will show in §IV-B how our partitioning of the latent spaces differs from these approaches.

IV. THE NEURAL MODELS
In this section, we present the details of our approach. We propose a model composed of two different networks: a first network generates compact representations of visual scenarios; a second network manipulates the latent vectors to predict future scenarios and to perform a rudimentary form of mental imagery.
Concerning the first part of the model, we have experimented with several architectures, all sharing the common feature of a hierarchical arrangement similar to the CDZs in the brain and following the strategy described in §I and §II. We compare three of these architectures, and each can be interpreted as a step forward in developing a more sophisticated way to learn the internal representations. Note that this series of steps can be interpreted as the opposite of what is commonly referred to as ''ablation study''.
To summarize, here we present:
• three different autoencoder networks (Net1, Net2, Net3) with increasingly sophisticated approaches to learning internal representations of the driving scenario;
• a recurrent neural network (Net4) which performs predictions and imagery, working exclusively with the latent representations created by the previous networks.

A. NET1: VARIATIONAL AUTOENCODER
The first model we present is the most basic one. When talking about representation learning, the first architecture that comes to mind is the autoencoder. This is the simplest model of the family, composed of two sub-networks:
$$g_\Phi : \mathcal{X} \rightarrow \mathcal{Z},$$
$$f_\Theta : \mathcal{Z} \rightarrow \mathcal{X}.$$
The first sub-network is called encoder and computes the compact representation $z \in \mathcal{Z}$ of a high-dimensional input $x \in \mathcal{X}$. This network is determined by its set of parameters $\Phi$. The second sub-network is the decoder (often called the generative network), which reconstructs the high-dimensional data $x \in \mathcal{X}$ from the low-dimensional compact representation $z \in \mathcal{Z}$. This network is determined by the set of parameters $\Theta$. When training the autoencoder, the parameters $\Phi$ and $\Theta$ are learned by minimizing the error between input samples $x_i$ and the outputs $f_\Theta(g_\Phi(x_i))$.
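The encoder/decoder composition and the reconstruction objective can be illustrated with a deliberately tiny example. The following is a minimal sketch, assuming linear maps in place of the convolutional stacks used in the paper; the sizes `X_DIM` and `Z_DIM`, the `params` dictionary, the learning rate, and the helper names are all hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_DIM, Z_DIM = 64, 8  # toy sizes (hypothetical), not 256x256 images or 128 neurons

# Hypothetical linear stand-ins for the encoder and decoder parameters.
params = {
    "enc": rng.normal(scale=0.1, size=(X_DIM, Z_DIM)),
    "dec": rng.normal(scale=0.1, size=(Z_DIM, X_DIM)),
}

def g(x, params):
    """Encoder: map inputs of shape (n, X_DIM) to latents of shape (n, Z_DIM)."""
    return x @ params["enc"]

def f(z, params):
    """Decoder: map latents back to the input space."""
    return z @ params["dec"]

def reconstruction_error(x, params):
    """Mean squared error between the inputs x and the outputs f(g(x))."""
    return float(np.mean((x - f(g(x, params), params)) ** 2))

def gradient_step(x, params, lr=0.5):
    """One plain gradient-descent step on the mean squared reconstruction error."""
    z = g(x, params)
    residual = f(z, params) - x
    scale = 2.0 / residual.size
    grad_dec = scale * z.T @ residual
    grad_enc = scale * x.T @ (residual @ params["dec"].T)
    params["enc"] -= lr * grad_enc
    params["dec"] -= lr * grad_dec

x = rng.normal(size=(32, X_DIM))
initial = reconstruction_error(x, params)
for _ in range(20):
    gradient_step(x, params)
final = reconstruction_error(x, params)
```

Both sets of parameters are updated from the same reconstruction error, which is the point of the autoencoder: no labels are involved at this stage.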
A substantial improvement in the architecture of autoencoders comes with the integration with variational Bayesian methods. We refer to Appendix VI for a detailed mathematical definition. The variational autoencoder can learn a more ordered representation compared to the standard autoencoder. However, there is much room for improvement, especially in our case where we want to focus only on learning representations of driving scenarios. Therefore, we present here our implementation of the variational autoencoder mainly as a comparison with the next models. Fig. 2 depicts the architecture of our variational autoencoder (Net1), while Table 6 shows the numbers of layers and the parameters adopted in the final version of the model. The input of the network is a single RGB image of 256 × 256 pixels. The encoder is composed of a stack of 4 convolutions and 2 fully-connected layers, converging to a latent space of 128 neurons. The decoder has a structure symmetric to the encoder, mapping the 128 neurons back to an image of 256 × 256. The network is trained to optimize the loss function in equation (15) in a totally unsupervised way.

B. NET2: TOPOLOGICAL AUTOENCODER
The next model we present shares most of its architecture with the previous one. The crucial improvement is the introduction of a semantic organization in the latent spaces. As discussed in §I, the human brain projects sensory information, especially visual, into compact representations through the CDZ structures. Some of these representations constitute the conceptual space, where neural activations encode the entities in the environment that produced the perceptual stimuli. We can take inspiration from this theory and use the hierarchical architecture of CDZs as a ''blueprint'' to design a more sophisticated neural network, which can learn representations that are not only in terms of visual features but also in terms of useful concepts.
In the driving context, the entire road scenario is informative. However, from a conceptual point of view, it is not immediately necessary to infer categories for every entity present in a scene. Within the aims and limits of this paper, it is useful to project in conceptual space the entities most relevant to the driving task. Therefore, for simplicity in this model, we choose to consider the two main concepts of cars and lanes. Fig. 3 presents the architecture of the topological autoencoder (Net2), composed of one shared encoder and three independent decoders. The choice of parameters is similar to Net1, as Table 7 shows. The encoder and each of the three decoders maintain the same structure as in Net1, and the size of the latent space remains unchanged. Still, the internal organization of the latent space is forcibly partitioned. The grey decoder of Fig. 3 works in the visual space, just like the decoder of Fig. 2, mapping all the 128 neurons of the latent vector $z$ altogether back into an RGB image. This decoder learns to reconstruct the input image and is trained in an unsupervised way. Instead, the decoder colored in green takes only a sub-vector $z_C$ of 16 neurons from the latent space and produces a matrix $x_C$ of 256 × 256 probability values. The sub-vector of 16 neurons is trained to represent the cars concept, and the output matrix can be interpreted as a semantic segmentation of the input image, where values indicate the probability of the presence of cars entities. Similarly, the violet decoder maps only a sub-vector $z_L$ of 16 neurons representing the lanes concept into a probability matrix $x_L$ for lanes entities. These two decoders require supervised learning: their output is converted into binary images by applying a threshold, and they are trained to minimize the reconstruction error with respect to the semantic segmentation of the input images.
As we mentioned in §II, the segmentation here can be considered a mere byproduct of the network, and the goal remains the meaningful latent representations.
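The partitioning of the latent vector can be made concrete with a few lines of array slicing. The following is a minimal sketch, assuming that the cars segment occupies the first 16 entries and the lanes segment the last 16 entries of the 128-dimensional vector; this layout, the threshold of 0.5, and the function names `split_latent`, `binarize`, and `swap_cars` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

N_V, N_C, N_L = 128, 16, 16  # latent sizes reported in the text

def split_latent(z):
    """Return the (cars, lanes) sub-vectors of a latent vector.

    Assumed layout: [cars | generic visual features | lanes].
    """
    z = np.asarray(z)
    if z.shape[-1] != N_V:
        raise ValueError(f"expected last dimension {N_V}, got {z.shape[-1]}")
    return z[..., :N_C], z[..., N_V - N_L:]

def binarize(prob_map, threshold=0.5):
    """Turn a decoder probability map into a binary mask (threshold is assumed)."""
    return (np.asarray(prob_map) >= threshold).astype(np.uint8)

def swap_cars(z_a, z_b):
    """Copy of z_a whose cars segment is replaced by the cars segment of z_b."""
    out = np.array(z_a, dtype=float, copy=True)
    out[..., :N_C] = np.asarray(z_b)[..., :N_C]
    return out
```

A function such as `swap_cars` hints at why a partitioned latent space is convenient: one concept can be edited while the rest of the representation is left untouched.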
We already discussed in §III that the idea of partitioning the latent vector into semantic components is not new. However, our approach is different: while we keep the two segments $z_C$ and $z_L$ disjoint, the entire $z$ learns representations in the visual space. That is why the grey decoder of Fig. 3 takes as input the entire latent space. In this way, we try to adhere to the CDZ theoretical idea, as we explicitly force the network to pay attention to the cars and lanes entities in the environment. Another advantage of our approach in partitioning the latent space concerns the well-known crucial issue of lack of transparency in deep neural networks. In most models, no information is available about what exactly makes the models arrive at their predictions [75], [76]. We can mitigate the issue by explicitly assigning semantic meaning to the components of the inner representation.
To give a mathematical description, the model is composed of four sub-networks:
$$g_\Phi : \mathcal{X} \rightarrow \mathcal{Z},$$
$$f_{\Theta_V} : \mathcal{Z} \rightarrow \mathcal{X},$$
$$f_{\Theta_C} : \mathcal{Z}_C \rightarrow \mathcal{X}_C,$$
$$f_{\Theta_L} : \mathcal{Z}_L \rightarrow \mathcal{X}_L.$$
The subscript $V$ denotes the visual space, and the subscripts $C$ and $L$ refer to the cars and lanes concepts, respectively. For each latent vector $z$ we have:
$$z = [z_C, \tilde{z}, z_L],$$
where $z_C$ and $z_L$ are the two sub-vectors representing the cars and lanes concepts, respectively. The segment in between, $\tilde{z}$, encodes the remaining generic visual features, while the entire latent vector $z$ is a representation in the visual space. The final version of the model has $N_V = 128$ and $N_C = N_L = 16$. We will discuss this choice in §V-B, while other parameters and learning rate are included in Tables 7 and 8.
By calling $\Theta = [\Theta_V, \Theta_C, \Theta_L]$ the vector of parameters of all decoders, the loss functions of the model can be derived from the basic equation (15). At each batch iteration $b$, a random batch $B \subset D$ is presented, and the loss (1) is computed as the sum of a Kullback-Leibler term $E_K$, scaled by a cost factor that depends on $b$, and three reconstruction terms $E_V$, $E_C$, $E_L$, weighted by $\lambda_V$, $\lambda_C$, $\lambda_L$, respectively.
A few observations are due on the differences between this loss function (1) and the basic one (15). First of all, we apply a delay in the contribution of the Kullback-Leibler divergence in the term $E_K$. This strategy is called KL annealing and was first introduced in the context of variational autoencoders for language modeling [77]. The reason is that, at the beginning of training, the encoder is unlikely to provide any meaningful probability distribution $q_\Phi(z|x)$. Therefore, there is a cost factor for the KL component, which is set initially at a small value $k_0$ and gradually increased up to 1.0 with a time constant $\kappa$.
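The KL annealing schedule can be sketched in a few lines. The text only states that the cost factor starts at a small value k_0 and grows to 1.0 with a time constant κ, so the exponential saturation used below is an assumption, as are the default values of `k0` and `kappa` and the function names `kl_cost_factor` and `total_loss`.

```python
import math

def kl_cost_factor(b, k0=0.01, kappa=1000.0):
    """KL weight at batch iteration b: equals k0 at b = 0 and tends to 1.0.

    The exponential form and the default values are assumptions.
    """
    return 1.0 - (1.0 - k0) * math.exp(-b / kappa)

def total_loss(b, e_k, e_v, e_c, e_l, lam_v=1.0, lam_c=1.0, lam_l=1.0):
    """Combine the annealed KL term with the three weighted reconstruction terms."""
    return kl_cost_factor(b) * e_k + lam_v * e_v + lam_c * e_c + lam_l * e_l
```

With such a schedule, the reconstruction terms dominate the early iterations, and the regularizing effect of the KL term is phased in once the encoder outputs have become informative.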
A second difference in the loss function lies in the terms $E_V$, $E_C$, $E_L$. They represent the reconstruction errors of the visual scenario and the conceptual entities. The term $E_V$ computes the error in the visual space using the entire latent vector $z$, and it corresponds precisely to the second component in the basic loss (15). The other two terms $E_C$ and $E_L$ compute the error in the conceptual space and are slightly different. Only the relevant portion of the latent vector is considered, as symbolized by the projection operators $\Pi_C$, $\Pi_L$.
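The structure of loss (1) discussed in this section can be sketched as follows. This is a reading of the prose under stated assumptions, not a transcription: the symbol $k_b$ for the KL cost factor at batch iteration $b$, the prior $p(z)$, and the notation $\Phi$, $\Theta$, $\Pi$ for encoder parameters, decoder parameters, and projection operators are introduced here for illustration.

```latex
\begin{align}
\mathcal{L}(\Theta, \Phi \mid B) &= k_b\, E_K + \lambda_V E_V + \lambda_C E_C + \lambda_L E_L, \\
E_K &= \sum_{x \in B} D_{\mathrm{KL}}\big(q_\Phi(z \mid x) \,\|\, p(z)\big), \\
E_V &= -\sum_{x \in B} \mathbb{E}_{z \sim q_\Phi(z \mid x)}\big[\log p_{\Theta_V}(x \mid z)\big], \\
E_C &= -\sum_{x \in B} \mathbb{E}_{z_C \sim \Pi_C(q_\Phi(z \mid x))}\big[\log p_{C}(x_C \mid z_C)\big], \\
E_L &= -\sum_{x \in B} \mathbb{E}_{z_L \sim \Pi_L(q_\Phi(z \mid x))}\big[\log p_{L}(x_L \mid z_L)\big].
\end{align}
```

Here $p_C$ and $p_L$ stand for the class-weighted likelihoods of the cars and lanes decoders, which depend on $\Theta_C$ and $\Theta_L$, respectively.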
Another difference is the use of a variant of the cross entropy in $E_C$, $E_L$, indicated with the symbols $p_C$ and $p_L$. This variant takes into account the large imbalance between the number of pixels belonging to a concept and all the other pixels, which is typical in ordinary driving scenes. Following the method first introduced in the context of medical image processing [78], we compensate for this asymmetry by weighing the contribution of true and false pixels with $P$, the ratio of true pixels over all the pixels in the dataset. In the computation of $P$, $M$ is the number of images in the dataset, $N$ is the number of pixels in an image, and $s$ is a parameter used to smooth the effect of weighting by the probability of ground truth: a value evaluated empirically as valid is 4. The term $y_{i,j}$ is the value of the $i$-th pixel (in flattened order) of the $j$-th target image of the dataset. We use a set of target images for each semantic concept. Hence, we have a set of car labels composed of binary images where white pixels indicate the presence of cars in the scene, and a set of lane labels where white pixels correspond to lane markings. Lastly, in the loss equation (1) the contributions of the terms $E_V$, $E_C$, $E_L$ are weighted by the parameters $\lambda_V$, $\lambda_C$, $\lambda_L$. The purpose of these parameters is mainly to normalize the range of the errors, which varies widely from visual space to conceptual spaces. For this reason, typically $\lambda_V \neq \lambda_C = \lambda_L$.
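The class-balanced cross entropy can be illustrated as below. This is a minimal sketch under explicit assumptions: the smoothing parameter s is taken to act as the exponent 1/s on the ratio of true pixels, and true pixels are weighted by (1 − P) while false pixels are weighted by P. Neither choice is confirmed by the text, and the function names are hypothetical.

```python
import numpy as np

def smoothed_true_ratio(targets, s=4.0):
    """Ratio of true pixels over all pixels, smoothed by the exponent 1/s.

    targets: binary array of shape (M, N), one flattened target image per row.
    The role of s as an exponent is an assumption.
    """
    return float(np.mean(targets)) ** (1.0 / s)

def weighted_cross_entropy(y_true, y_prob, P, eps=1e-7):
    """Binary cross entropy with true pixels weighted by (1 - P), false ones by P.

    The weighting scheme is an assumption chosen for illustration.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    pos = -(1.0 - P) * y_true * np.log(y_prob)
    neg = -P * (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(np.mean(pos + neg))
```

With s = 4, a concept covering 1/16 of the pixels gives P = 0.5, so the weighting is much milder than the raw ratio would suggest; this is the smoothing effect the parameter is meant to provide.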

C. NET3: TEMPORAL AUTOENCODER
The next model is the final step in our development of an autoencoder able to learn meaningful representations of the driving scenario. We made it clear in §I that our work aims to learn representations oriented to the driving task from a static and a dynamic perspective. In Net2, we include the static perspective, i.e., a conceptual organization of the latent representations. In our third model, Net3, we also include the dynamic perspective by forcing a temporal consistency in the representations.
We achieve representations consistent in the temporal dimension with the inclusion of a recursive module in the architecture of Net2 and the use of self-supervision, as already mentioned in §II. In this way, the model learns how the concepts represented in the latent space will change in future driving scenarios. However, the predictions this model can make are still short-term, whereas longer-term predictions will be the subject of Net4. Fig. 4 shows the architecture of Net3, and Table 9 describes the parameters of the final model. The model shares substantially the same architecture as Net2, except for an additional module based on a simple recursive neural network.
The training procedure, however, is significantly different from that of the previous network. Let us introduce the notation $x^{(t)}$ to indicate the frame $t$ steps ahead of frame $x$. Similarly, $z^{(t)}$ refers to the latent representation of the image $t$ steps ahead of that represented by $z$. At each iteration of the training, the inputs of the model are two consecutive frames $x$ and $x^{(1)}$, which are fed to the common encoder. The encoder computes two latent representations $z$ and $z^{(1)}$, which are passed to an RNN trained to predict the latent vector $z^{(2)}$ containing the representation of the successive frame in the sequence. Then, all three latent vectors are expanded using the same three-decoder structure already seen in Net2, so that the overall model is trained to generate visual and segmented output images for the three frames $x$, $x^{(1)}$, $x^{(2)}$.
The novel recursive sub-network of the model can be described by a function $h$ that maps the pair $(z, z^{(1)})$ to a predicted latent vector approximating $z^{(2)}$. This module is implemented using a basic recursive neural network (RNN) [79] with a time window of 2 and its own set of parameters.
The formulation of the loss used for training the network is similar to equation (1), with additional terms for the recursive prediction. The first term is the same loss of equation (1), computed on the frame $x$. For the sake of legibility, the latent vector predicted by the recursive sub-network is written as a single symbol, equal to $h(z, g(x^{(1)}))$, where $g$ is the encoder. The additional terms form two groups, each mirroring $E_V$, $E_C$, $E_L$. The contribution of the first group is similar to that of $E_V$, $E_C$, $E_L$, as its terms represent the errors in the reconstruction of the frame successor of $x$. The temporal coherence is measured by the second group, whose terms represent the error between the frame 2 steps ahead of $x$ and the images decoded from the latent vector predicted by the recursive sub-network $h$.
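The structure of this loss, one group of terms per frame with the third group evaluated on the predicted latent vector, can be sketched as follows. This is only an illustration of the data flow: `encode`, `predict` and `frame_loss` are hypothetical callables standing in for the encoder, the recursive module $h$ and the per-frame error terms, whose exact expressions are not reproduced here.

```python
import numpy as np

def net3_loss(x, x1, x2, encode, predict, frame_loss):
    """Sum of the three loss groups for one training triplet (x, x1, x2).

    encode, predict and frame_loss are hypothetical stand-ins for the
    encoder, the recursive module h and the per-frame error terms.
    """
    z = encode(x)              # latent vector of the current frame
    z1 = encode(x1)            # latent vector of the successor frame
    z2_pred = predict(z, z1)   # latent vector predicted for the frame after x1
    return frame_loss(x, z) + frame_loss(x1, z1) + frame_loss(x2, z2_pred)

# Toy stand-ins, only to make the sketch executable.
def encode(frame):
    return frame.mean(axis=0)          # "frame" of shape (rows, 4) -> latent (4,)

def predict(z, z1):
    return z1 + (z1 - z)               # linear extrapolation in latent space

def frame_loss(frame, z):
    return float(np.mean((frame.mean(axis=0) - z) ** 2))

x, x1, x2 = np.zeros((3, 4)), np.ones((3, 4)), 2.0 * np.ones((3, 4))
total = net3_loss(x, x1, x2, encode, predict, frame_loss)
```

In the toy case the third frame continues the linear trend of the first two, so the extrapolated latent vector matches it and the total error vanishes; any deviation of `x2` from that trend is charged to the temporal-coherence group only.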

D. NET4: RECURRENT NETWORK
The last network we present is an example of how the results of the previous models can be exploited to perform long-term prediction of driving scenarios. The previous three sections (§IV-A to §IV-C) describe the steps we made towards the design of a model able to learn latent representations that are both conceptually organized and temporally consistent. Net3 is the result of this development. Once trained, Net3 can be deployed in its encoding part to generate a latent representation of any visual driving scenario. Therefore, the long-term prediction can be realized by working entirely in the latent space. Having a compact latent representation allows the recurrent network to have a complex architecture with a limited number of parameters. Fig. 5 shows the proposed recurrent network (Net4), which has a first module composed of multiple levels of stacked recurrent sub-networks, one for each latent vector in the input sequence. A second module is composed of multiple parallel recurrent sub-networks predicting successive latent vectors in the sequence. In the first module, each stacked sub-network sends its entire output sequence to the next sub-network input. In the second module, instead, the parallel sub-networks yield only the last output in the time sequence. All the sub-networks of the model share the same core architecture implemented with Gated Recurrent Units (GRUs) [80], and we will discuss this choice in §V-C.
The overall model can be described by a function $r$ that maps the input sequence $z, z^{(1)}, \cdots, z^{(N_I-1)}$ to the sequence of latent vectors that follow it, where $N_I$ is the length of the input sequence, $N_O$ is the length of the future predicted sequence, and the function has its own set of parameters. In the final version of the model, we choose $N_I = 8$ and $N_O = 4$. We use 2 stacked GRUs and 4 parallel GRUs, as described in Table 10. Lastly, we want to note that this model does not use any odometry or other kind of information for the prediction, just the rich representation learned by the accompanying autoencoder.
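Since Net4 works entirely on latent vectors, its training samples can be thought of as sliding windows over the encoded sequences. The sketch below shows one plausible way to build such windows with $N_I = 8$ and $N_O = 4$; the helper name and the latent size of 128 (96 generic neurons plus 16 for each concept) are assumptions for illustration, not details taken from the released code.

```python
import numpy as np

def make_windows(latents, n_in=8, n_out=4):
    """Split one sequence of latent vectors into (input, target) pairs.

    latents: array of shape (T, D), one D-dimensional latent vector per frame.
    Returns inputs of shape (K, n_in, D) and targets of shape (K, n_out, D),
    with K = T - n_in - n_out + 1.
    """
    count = latents.shape[0] - n_in - n_out + 1
    if count <= 0:
        raise ValueError("sequence too short for the requested window sizes")
    inputs = np.stack([latents[k:k + n_in] for k in range(count)])
    targets = np.stack([latents[k + n_in:k + n_in + n_out] for k in range(count)])
    return inputs, targets

# 20 encoded frames with an ASSUMED latent size of 128.
latents = np.arange(20 * 128, dtype=float).reshape(20, 128)
inputs, targets = make_windows(latents)
```

Each target window starts exactly where its input window ends, so the first target vector is the latent representation of the ninth frame of the window.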

V. RESULTS
In the last section of our paper, we present and discuss the results obtained by our models. We first spend a few words about the dataset adopted in this work. Then, we show qualitative and quantitative results for two of the autoencoder networks we implemented (Net2 and Net3) and for the recurrent network (Net4). Lastly, we show further evaluation on the latent representations learned by the different autoencoder networks.

A. DATASET
The SYNTHIA dataset [81] consists of a large collection of photo-realistic video sequences rendered using the game engine Unity. It comprises about 100,000 images of urban scenarios recorded from a simulated camera placed on the windshield of the ego car. Each video sequence is acquired at 5 FPS and comes with semantic annotations for several classes, including lane markings, which are not commonly found in other datasets.
Despite being artificially generated, this dataset offers a wide variety of reasonably realistic illumination and weather conditions, occasionally resulting even in very adverse driving conditions. The dataset features 5 sets of driving sequences. Each set contains about 10 recordings of the same track rendered under different environmental conditions: traffic, weather, season, and time of the day. Fig. 6 gives an example of the variety of data coming from the same driving sequence with different conditions. Moreover, the tracks are very diverse as well, including freeways, tunnels, congestion, "New York-like cities", and "European towns", as the creators of the dataset describe them.
We randomly allocated 70% of the video sequences to the training set, 25% to the validation set, and 5% to the test set, ensuring no overlap among the three sets. For a more interesting visualization of the results, we further organize the test set into four (overlapping) categories, based on the driving scenarios: urban environments, freeways, sunny conditions, and darkness or adverse weather conditions.

FIGURE 7. Results of Net3 in reconstructing an image and its cars and lanes entities. In the first row, the input frames belonging to different categories of driving conditions. In the center row, the output of the network. In the last row, the same input frames plotted with a colored overlay showing the target cars entities in cyan and the lanes entities in yellow.

B. RESULTS OF NET2 AND NET3
We present the results of the two autoencoders, Net2 and Net3, described in §IV-B and §IV-C. The networks are trained for 200 epochs in their final version. Note that here we omit the results of Net1 since it lacks any conceptual information. However, in §V-D we will include a comparison of all three networks based on their latent representations.
First, we present some quantitative results obtained by the models when reconstructing an image and its cars and lanes entities, measured with the IoU (Intersection over Union) metric. Table 1 displays the scores for the cars and lanes classes grouped into the four driving conditions mentioned above. The table also includes the general scores on the entire test set. We compare the performance of our two autoencoders with two other well-known models for pure semantic segmentation, FCN-8 [37] and U-Net [82] (both using VGG-16 as base model). The scores show how Net3 can learn a more consistent latent representation compared to Net2 and the FCN-8 model, in all the categories of driving sequences. The U-Net model outperforms all other models, although the scores are still comparable. However, for both Net2 and Net3, it is evident that the task of recognizing the cars concept achieves better scores than the lanes concept. A possible explanation of why the latter task is more difficult is the very low ratio of pixels belonging to the class of lanes over the entire image size, and consequently how easily the lane markings get occluded by other elements in the scene.
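For reference, the IoU score of a single predicted mask against its binary target can be computed as in the following sketch. The threshold of 0.5 and the convention for two empty masks are assumptions; the text does not state how the decoder outputs were binarized.

```python
import numpy as np

def iou(pred, target, threshold=0.5):
    """Intersection over Union between a predicted mask and a binary target.

    ASSUMPTION: both arrays are binarized with the same threshold.
    """
    pred_mask = np.asarray(pred) >= threshold
    target_mask = np.asarray(target) >= threshold
    union = np.logical_or(pred_mask, target_mask).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks match perfectly
    return float(np.logical_and(pred_mask, target_mask).sum() / union)

pred = np.array([[0.9, 0.8, 0.1],
                 [0.2, 0.7, 0.0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
score = iou(pred, target)  # 2 pixels in common, 4 pixels in the union
```

Because the denominator counts only pixels marked in either mask, a class that covers very few pixels, such as lane markings, is penalized heavily by every missed or spurious pixel.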
We would like to stress again that the purpose of our networks is not mere segmentation of visual input, as we discussed in §II. The segmentation operation must be considered a supporting task, forcing the model to learn a semantic organization of its internal representations, which is totally missing in the U-Net and FCN-8 models.
Second, we present two different qualitative results. Fig. 7 shows the images produced by Net3 for four different images of the test set, one for each driving condition. Given an input image (shown on the top row of the Figure), the network produces its corresponding latent representation. The latent vector is passed to the three decoders to reconstruct the initial image and to extract the cars and lanes entities in the scene (center row of the Figure). For easy visualization, we show the output of the three decoders as a single image, having as background the reconstruction in the visual space, and as colored overlays the segmented entities of cars (in cyan) and lanes (in yellow). The images on the bottom row of the Figure are displayed as a reference, showing the target images with the colored overlays of the two classes.
Another qualitative result of Net3 comes from interpolating between different latent representations. In Fig. 8, each column shows what happens when taking the latent representation of a first frame (first row in the Figure) and linearly interpolating it with the latent representation of a second frame (last row). We generate 5 intermediate latent vectors, which are passed to the decoders of Net3 to produce novel frames. The images show a smooth and gradual shift from the first input to the second, and they successfully provide new plausible driving scenarios not seen before by the network.
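The interpolation itself is a simple convex combination of the two latent vectors. A minimal sketch follows; the function name and the latent size of 128 are assumptions for illustration, and the decoding step is left out.

```python
import numpy as np

def interpolate_latents(z_start, z_end, n_intermediate=5):
    """Return n_intermediate latent vectors evenly spaced between two endpoints."""
    # Fractions strictly between 0 and 1, so the two endpoints are excluded.
    alphas = np.linspace(0.0, 1.0, n_intermediate + 2)[1:-1]
    return np.stack([(1.0 - a) * z_start + a * z_end for a in alphas])

z_a = np.zeros(128)   # stand-in for the latent vector of the first frame
z_b = np.ones(128)    # stand-in for the latent vector of the second frame
steps = interpolate_latents(z_a, z_b)
```

Each row of `steps` would then be passed to the three decoders to obtain one intermediate frame.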

C. RESULTS OF NET4
Here we show the results of our recursive model Net4, trained for 100 epochs on a corresponding dataset of latent representations computed by our most advanced autoencoder Net3 over the initial SYNTHIA dataset.
Starting with the quantitative results, Table 2 contains the IoU scores obtained by the model in the different categories of driving sequences used before. As described in §IV-D, the network takes as input a sequence of 8 frames and predicts the 4 subsequent frames. Since the SYNTHIA sequences are acquired at 5 FPS, the network is predicting 0.8 seconds in the future. The table shows the scores for the 4 predicted frames, separated as usual in the cars and lanes classes. It is immediate to note the cars scores are always higher than the lanes scores, just like we saw in Table 1. However, the cars predictions worsen more significantly for the distant frames with a decay of 16%, while the lanes scores lose only 9%. This result can be explained by the fact that, generally, in a driving sequence, the lane markings change more smoothly and predictably compared to the cars, which can suddenly change their trajectory.
Another quantitative comparison is presented in Table 3, where we compare different implementations of Net4 based on the type of internal recursive node: basic RNNs [79], GRUs [80] and LSTMs [83]. The results indicate the GRUs are the best choice in our case. While it is not surprising that the basic RNNs obtain the lowest score, the fact that GRUs outperform LSTMs might seem unexpected. We believe the reason is twofold. First, the number of parameters in the overall model increases by more than 30% when switching from GRUs to LSTMs. Second, although it is well known that LSTMs are the most powerful recursive node for long-term prediction because of their ability to keep track of events in the remote past, in the context of driving it is not so crucial to memorize scenarios that occurred several seconds before. While driving, the environment and the surrounding vehicles change continuously. It is often useless to try to draw a connection between the current scenario and, for example, the one seen 10 seconds before (note that the typical timescale of vehicle dynamics is less than one second). This situation is clearly opposite to Natural Language Processing, where LSTMs give their best.
As regards qualitative results, Fig. 9 shows four examples of visual predictions, one for each category of driving conditions. We include in the Figure the 4 predicted frames and their corresponding target frames, but we omit the 8 input frames to keep the Figure easy to read. The results in the "freeway" and "sunny" scenarios demonstrate that the model can predict an overtake maneuver from the left as well as from the right. Another interesting result is the different kind of predictions when facing a crosswalk: in the "city" scenario there is a car moving perpendicularly to the lane of the ego car, so the network correctly predicts to hold still at the crosswalk; in the "dark" scenario cars are driving in the same direction as the ego car, so the model predicts not to stop at the crosswalk and moves forward.
As a final qualitative evaluation, we try to replicate the phenomenon of mental imagery using Net4. To mimic this process, the network is called iteratively, and at each iteration, the output is fed back as the input of the next iteration. In our specific case, we choose to take the 1st of the 4 output vectors and use it as the 8th input vector of the next iteration. Fig. 10 presents the results of 9 iterations of imagery for two different scenarios, along with the corresponding reference frames (the input images are, again, omitted for practical reasons). Note that, while the imagery process must inevitably start with all input frames taken from the original dataset, the results provided in the Figure are obtained from forward iterations, that is when the network computes all input vectors as results of previous iterations. In both driving scenarios, it is possible to appreciate how the model can predict a quite plausible future from just its own representations of the world.
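The iterative procedure can be sketched as a rolling window over latent vectors, where only the first predicted vector is kept at each step. In the sketch below, `predict` is a hypothetical callable standing in for the trained Net4, and the toy predictor exists only to make the example executable.

```python
import numpy as np

def imagine(predict, window, n_iterations=9):
    """Roll the predictor forward on its own outputs.

    predict: hypothetical callable mapping an (n_in, D) window of latent
             vectors to an (n_out, D) array of predicted latent vectors.
    window:  initial (n_in, D) window taken from real data.
    """
    window = np.array(window, dtype=float)
    imagined = []
    for _ in range(n_iterations):
        next_z = predict(window)[0]                # keep only the 1st predicted vector
        imagined.append(next_z)
        window = np.vstack([window[1:], next_z])   # it becomes the last input vector
    return np.stack(imagined)

def toy_predict(window):
    """Toy stand-in: extrapolate the last step linearly, 4 steps ahead."""
    step = window[-1] - window[-2]
    return np.stack([window[-1] + (k + 1) * step for k in range(4)])

start = np.arange(8, dtype=float).reshape(8, 1)    # latent "vectors" 0, 1, ..., 7
future = imagine(toy_predict, start)
```

With a window of 8 vectors, every input is an imagined vector from the ninth iteration onwards, which corresponds to the forward iterations mentioned in the text.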

D. LATENT REPRESENTATIONS
We conclude our paper with a few more words on the latent representations learned by our autoencoders with a quantitative evaluation of their temporal consistency and a qualitative visualization of their conceptual organization.
First of all, let us justify the title of our paper. Table 4 shows the impact of the sizes $N_C$ and $N_L$ on the performance of the model. A compact representation is desirable for two reasons: first, with a lower dimensionality, we force the model to capture the absolutely essential features from the data, discarding the non-relevant information; second, if the representation of a single concept occupies only a small fraction of the entire latent space, we can learn several different concepts at the same time. Here, we decide to assign 16 neurons to each concept with the idea that in the future, we can use the same architecture to learn more than two concepts, adding, for example, pedestrians and bikes. Therefore, the final model adopts the most compact size that does not cause a severe drop in performance, as happens in the cases of $N_C = N_L < 12$.
Then, we present a statistical evaluation of the latent representations, measuring their temporal consistency and their predictability. Table 5 reports the results for all 3 of our autoencoder models. A first indicator $\xi$ evaluates the degree of temporal coherence by measuring the ratio between the difference of two latent vectors that are contiguous in time and the variance over the entire dataset $Z$ of latent vectors. The evaluation is done independently for each component of the latent vector and then averaged. In its definition, $z_i$ is the $i$-th element of $z$, $z^{(1)}_i$ is the $i$-th element of the successor of $z$, $\upsilon_i$ is the $i$-th element of the variance vector of $z$ over $Z$, and $M$ is the cardinality of $Z$.
A second indicator $\rho$ measures the "predictability" of the representations, and it is computed as the mean square of the residual obtained when using two consecutive latent vectors to predict one neuron of a third vector by linear regression. To keep the computation time acceptable, this index is computed on a subset of $Z$ ten times smaller than $Z$.
Calling $\varepsilon(A, b)$ the residual of the least squares approximation of the normal equation $Ax = b$, $\rho$ is obtained from the mean of the squared residuals of these regressions. Table 5 clearly shows how Net1 and Net2 have comparable scores, while Net3 performs significantly better. In fact, it is only with Net3 that we introduce temporal consistency inside the latent representations, and this is nicely reflected in the results.
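The two indicators can be approximated with a few lines of NumPy. The sketch below follows the verbal definitions only: using the squared difference in $\xi$, adding a bias column to the regression, and fitting all target neurons in one call are assumptions, so the numbers it produces are not comparable with those of Table 5.

```python
import numpy as np

def temporal_coherence(Z):
    """Indicator in the spirit of xi for a sequence Z of shape (T, D).

    ASSUMPTION: the 'difference' between contiguous vectors is squared,
    then divided by the per-component variance and averaged.
    """
    variance = Z.var(axis=0)
    squared_diff = (Z[1:] - Z[:-1]) ** 2
    return float(np.mean(squared_diff / variance))

def predictability(Z):
    """Indicator in the spirit of rho: mean squared residual of a linear
    regression predicting each neuron of z(2) from the pair (z, z(1)).

    ASSUMPTION: a bias column is included in the regression matrix.
    """
    A = np.hstack([Z[:-2], Z[1:-1], np.ones((Z.shape[0] - 2, 1))])
    B = Z[2:]
    coefficients, *_ = np.linalg.lstsq(A, B, rcond=None)
    residual = B - A @ coefficients
    return float(np.mean(residual ** 2))

rng = np.random.default_rng(0)
smooth = np.cumsum(0.1 * rng.standard_normal((200, 4)), axis=0)  # slowly drifting latents
noisy = rng.standard_normal((200, 4))                             # no temporal structure
xi_smooth, xi_noisy = temporal_coherence(smooth), temporal_coherence(noisy)
rho_smooth, rho_noisy = predictability(smooth), predictability(noisy)
```

On these synthetic sequences both indicators are lower for the slowly drifting one, which is the behavior expected from a temporally coherent representation.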
Moving to a more qualitative analysis, we present in a figure the latent representations computed by Net2 and by Net3 for four input images; comparing the two, it is immediately clear how the latter learns a more robust representation. In the case of Net3, shown in (b), the variation in the neurons representing the cars and lanes concepts is minimal. The variation in the general 96 neurons is also very localized: the neurons exhibit a similar overall distribution. This fits with the fact that the four images have the same surroundings (the trees, the soil on the right). Conversely, the representations learned by Net2 do not appear as consistent. The cars and lanes neurons visibly change for each input, and even the other 96 visual features do not share any particular pattern in the 4 cases. Therefore, we can conclude that forcing at once a semantic organization and a temporal coherence leads to more robust and disentangled representations.
Lastly, we include the interesting outcome of exchanging parts of latent representations of different images. Fig. 12 shows the imaginary scenarios created by swapping between two input images the neurons corresponding to the cars and lanes concepts. Fig. 12(c) is produced by the decoders of Net3 from a latent vector composed of $z_C$ and $z_L$ taken from the representation of (a), and the remaining part of $z$ coming from the representation of (b). Similarly, Fig. 12(d) is the result of combining $z_C$ and $z_L$ from the representation of (b) together with the remaining part of $z$ from the vector representing (a). This is a nice example of how our model can perform another form of mental imagery, in the sense of creating artificial, although plausible, scenarios.

FIGURE 12. Result of swapping the conceptual parts of the latent spaces between two images using Net3. Image (c) is obtained by combining the cars and lanes neurons of (a) with the rest of the vector of (b). Image (d) is the opposite, combining the cars and lanes neurons of (b) with the rest of the vector of (a).
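Because the concepts occupy fixed segments of the latent vector, the swap reduces to copying slices. The sketch below assumes a 128-dimensional vector with the 96 generic neurons first, followed by 16 cars neurons and 16 lanes neurons; this ordering is an assumption for illustration and may differ from the released code.

```python
import numpy as np

# ASSUMED layout of the latent vector: generic part, then cars, then lanes.
CARS = slice(96, 112)
LANES = slice(112, 128)

def swap_concepts(z_concepts, z_rest):
    """Build a latent vector with the cars and lanes segments of z_concepts
    and every other neuron taken from z_rest."""
    mixed = z_rest.copy()
    mixed[CARS] = z_concepts[CARS]
    mixed[LANES] = z_concepts[LANES]
    return mixed

z_a = np.zeros(128)   # stand-in for the latent vector of image (a)
z_b = np.ones(128)    # stand-in for the latent vector of image (b)
z_c = swap_concepts(z_a, z_b)   # concepts of (a), remaining part of (b)
z_d = swap_concepts(z_b, z_a)   # concepts of (b), remaining part of (a)
```

Decoding `z_c` and `z_d` with the three decoders would produce the two hybrid scenes.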

VI. CONCLUSION AND FUTURE WORKS
This paper presented a novel approach to the perception of driving scenarios loosely inspired by two theories on how the human brain works. We mimic the neurocognitive theories with the tools available within the deep learning framework. Specifically, we choose autoencoders to emulate the theoretical idea of convergence-divergence zones, which code perceptual concepts using low-dimension representations. Then, we follow the theory of the predictive brain by forcing the probabilistic representation learned by variational autoencoders to capture information about the dynamics of the scenario.
We proposed a method to learn to represent visual scenarios into compact vectors that are at once semantically organized and temporally coherent. Our approach differs from other related works precisely in the learning of the representations: first, there is a semantic organization in the sense that distinct parts of the representation are explicitly associated with specific concepts useful in the context of driving; second, the temporal coherence that is achieved through self-supervision allows the representation to be exploited for mental imagery and prediction of plausible future scenarios.
Our work aims to learn compact and informative representations that can be useful for various downstream driving tasks. Here we presented the example of predicting long-term future frames in a video sequence. However, once learned, the representations can be deployed in many possible contexts. For example, we are currently working on using the representations to predict future occupancy grids. Moreover, since we manage to assign only 16 neurons to each concept in the representation, it is possible to include in future works more than two concepts inside the latent representations. It would be interesting, for example, to include concepts of vulnerable road users, such as pedestrians and bikes. One more future development we have planned is the adoption of a dataset of real-world video sequences. One of the reasons we adopted the SYNTHIA dataset at the beginning of our research, besides its large size and variety, was the availability of lane marking annotations, which are very rare among the classical datasets for autonomous driving. Recently, UC Berkeley introduced the Berkeley DeepDrive dataset [84], including several types of lane marking annotations from high-quality real video sequences. Hence, the adoption of this novel dataset could be an interesting future addition to our work.

VARIATIONAL INFERENCE
The variational inference framework takes up the issue of approximating the probability distribution $p(x)$ of a high dimensional random variable $x \in X$. This approximation can be performed by a neural network like the decoder part of Net1. The neural network by itself is deterministic, but its output distribution can be easily computed as follows:

$$p_\Theta(x|z) = N\left(x \,|\, f_\Theta(z), \sigma\right), \qquad (9)$$

where $N(x|\mu, \sigma)$ is the Gaussian function in $x$, with mean $\mu$ and standard deviation $\sigma$, and $f_\Theta(\cdot)$ is the function computed by the network with parameters $\Theta$. Using this last equation it is now possible to express the desired approximation of $p(x)$:

$$p_\Theta(x) = \int p_\Theta(x, z)\,dz = \int p_\Theta(x|z)\,p(z)\,dz. \qquad (10)$$

It is immediate to recognize that the kind of neural network performing the function $f_\Theta(\cdot)$ is exactly the decoder part in the autoencoder, corresponding to the divergence zone in the CDZ neurocognitive concept. In the case when $X$ is the domain of images, $f_\Theta(\cdot)$ comprises a first layer that rearranges the low-dimension variable $z$ in a two dimensional geometry, followed by a stack of deconvolutions, up to the final geometry of the $x$ images.

In equation (10) there is clearly no clue on what the distribution $p(z)$ might be, but the idea behind the variational autoencoder is to introduce an auxiliary distribution $q_\Phi$ from which to sample $z$, implemented by an additional neural network. Ideally, this network should provide the posterior probability $p_\Theta(z|x)$, which is unknown, and should be a network like the encoder part of Net1. Its probability distribution is:

$$q_\Phi(z|x) = N\left(z \,|\, \mu_\Phi(x), \sigma_\Phi(x)\right), \qquad (11)$$

where the mean $\mu_\Phi(x)$ and the spread $\sigma_\Phi(x)$ are both produced by the network $g_\Phi(\cdot)$. While the network $f_\Theta(\cdot)$ behaves as decoder, the network $g_\Phi(\cdot)$ corresponds to the encoder part in the autoencoder, projecting the high-dimensional variable $x$ into the low dimensional space $Z$. It plays the role of the convergence zone in the CDZ idea.

The measure of how well $p_\Theta(x)$ approximates $p(x)$ for a set of $x_i \in D$ sampled in a dataset $D$ is given by the log-likelihood:

$$\ell(\Theta|D) = \sum_{x_i \in D} \log \int p_\Theta(x_i|z)\,p(z)\,dz. \qquad (12)$$

This equation cannot be solved because of the unknown $p(z)$, and here comes the help of the auxiliary probability $q_\Phi(z|x)$. Each term of the summation in equation (12) can be rewritten as follows:

$$\ell(\Theta|x) = \log \int p_\Theta(x, z)\,dz = \log \int \frac{p_\Theta(x, z)}{q_\Phi(z|x)}\,q_\Phi(z|x)\,dz = \log E_{z \sim q_\Phi(z|x)}\left[\frac{p_\Theta(x, z)}{q_\Phi(z|x)}\right], \qquad (13)$$

where in the last passage we used the expectation operator $E[\cdot]$. Since the log function is concave, we can now apply Jensen's inequality:

$$\ell(\Theta|x) \geq E_{z \sim q_\Phi(z|x)}\left[\log p_\Theta(x, z) - \log q_\Phi(z|x)\right] = \ell(\Theta, \Phi|x). \qquad (14)$$

Since the expression derived in the last equation is smaller than or at most equal to $\ell(\Theta|x)$, it is called the variational lower bound, or evidence lower bound (ELBO). Note that now in $\ell(\Theta, \Phi|x)$ there is also the dependency on the parameters $\Phi$ of the second neural network defined in (11).

It is possible to rearrange $\ell(\Theta, \Phi|x)$ further in order to have $p_\Theta(x|z)$ instead of $p_\Theta(x, z)$ in equation (14). Moreover, we can now introduce the loss function $L(\Theta, \Phi|x)$ as the value to be minimized in order to maximize the ELBO:

$$L(\Theta, \Phi|x) = -\ell(\Theta, \Phi|x) = -\int q_\Phi(z|x) \log \frac{p_\Theta(x|z)\,p(z)}{q_\Phi(z|x)}\,dz = \mathrm{KL}\left(q_\Phi(z|x)\,\|\,p(z)\right) - E_{z \sim q_\Phi(z|x)}\left[\log p_\Theta(x|z)\right], \qquad (15)$$

where the last step uses the Kullback-Leibler divergence $\mathrm{KL}$. Still, this formulation seems to be intractable because it contains the term $p(z)$, but there is a simple analytical formulation of the Kullback-Leibler divergence in the Gaussian case (see Appendix B in [31]):

$$\mathrm{KL}\left(q_\Phi(z|x)\,\|\,p(z)\right) = \frac{1}{2} \sum_i \left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right), \qquad (16)$$

where $\mu_i$ and $\sigma_i^2$ are the $i$-th components of the mean and variance of $z$ given by $q_\Phi(z|x)$.
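As a small numerical illustration of the analytical Kullback-Leibler term, the sketch below evaluates the closed form for a diagonal Gaussian against a standard normal prior. The choice of a standard normal prior is the usual assumption behind this closed form; the function name is hypothetical.

```python
import numpy as np

def gaussian_kl(mu, var):
    """KL divergence between N(mu, diag(var)) and the standard normal N(0, I).

    ASSUMPTION: the prior p(z) is a standard normal distribution.
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    return float(0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0))

kl_zero = gaussian_kl(np.zeros(16), np.ones(16))    # posterior equal to the prior
kl_shift = gaussian_kl(np.ones(16), np.ones(16))    # every mean shifted by 1
```

The divergence is zero when the encoder output coincides with the prior, and each unit shift in the mean of one neuron adds one half to it, which is what pulls the latent vectors towards a compact region of the space during training.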

TABLES OF NETWORK PARAMETERS
See Tables 6-10.