STAF: Spatio-Temporal Attention Framework for Understanding Road Agents Behaviors

On-road behavior analysis is a key task for robust autonomous vehicles. Unlike traditional perception tasks, this paper aims at a high-level understanding of road agent activities, which allows better operation in challenging contexts and enables better interaction and decision-making in such complex environments. In this paper, we tackle the problem of discriminating spatio-temporal features that capture the visual instants requiring more attention. We propose a new approach called STAF (Spatio-Temporal Attention Framework), built on Long Short-Term Memory (LSTM) layers, that uses a multi-head attention mechanism over its past cell states to focus on attributes that are relevant over time. Experiments were carried out on two different scenarios over data from the Joint Attention in Autonomous Driving (JAAD) dataset and the Honda Research Institute Driving Dataset (HDD), both devoted to understanding the behavior of road agents. The evaluation and results obtained prove that the proposed STAF outperforms state-of-the-art LSTM-based algorithms (Ramanishka [41], Rasouli [40], and an LSTM baseline); for example, STAF outperforms the LSTM baseline by a mean Average Precision (mAP) margin of 13%.


I. INTRODUCTION
Road crashes kill nearly "1.25 million people each year, on average 3,287 deaths a day, and around 50 million are injured or disabled", reports the Association for Safe International Road Travel (ASIRT). By 2030, fatal traffic crashes are expected to become the fifth leading cause of death worldwide. The Intelligent Transportation Systems (ITS) community is investigating smarter solutions that enable reliable autonomous driving systems to reduce road fatalities. To this end, many vision-based components have been developed, and significant advances have been made in semantic segmentation [1] (mid-level in Figure 2), object detection [2] (mid-level), and mapping and localization [3] (mid-level). However, to avoid road crashes it is crucial to understand the context (high-level in Figure 2): how participants interact with each other and with objects on the road.
The common approach to the problem of predicting road users' behavior is to employ dynamic factors, such as trajectory [3] or velocity [4], or the anticipated final goal of pedestrians [5], [6]. Another behavioral cue is pedestrian head orientation, used to measure the level of awareness at the point of crossing [7]. These studies, however, are limited in scope and consider very few of the contextual elements that are necessary to predict pedestrian behavior. In practice, other factors, in addition to spatio-temporal ones, can influence the crossing behavior of a pedestrian, including the structure of the crosswalk (e.g., signs, delineation), environmental factors (e.g., weather conditions, visibility), and individual characteristics of pedestrians (e.g., demographics).
In Figure 1, we present our description of the components of on-road behavior analysis. By agents, we mean the different road traffic actors: pedestrians, bicycles, motorcycles, vehicles, buses, trucks, etc. Past attempts in the state of the art have made use of common machine learning and computer vision approaches to predict road participants' behaviors, (i) based on elementary cues like velocity [8] and trajectories [3], and (ii) devoted to a single scenario such as maneuver classification [9], driver state recognition [10], or pedestrian activities [11].
In autonomous vehicle and smart mobility applications, one of the main and crucial tasks is road behavior understanding, given its potential impact on the safety of all road agents, such as drivers, pedestrians, and vehicles. To improve road scene understanding, we need to investigate the complementary relationships between the different elementary tasks that we define as the main components of road behavior understanding. For this, we consider five related tasks: situational awareness, driver-road interaction, road scene understanding, trajectory forecasting, and driving status analysis.
To achieve a clearer understanding, we propose a new learning approach to explicitly model how road agents behave and interact with their surroundings. The contribution of this paper is 3-fold:
• Firstly, we propose a spatio-temporal deep network based on an attention mechanism which leverages key visual features over time.
• Secondly, inspired by the transformer model introduced in [12] for translation tasks, we reuse the multi-head self-attention mechanism within a Long Short-Term Memory (LSTM) network, which we call STAF for Spatio-Temporal Attention Framework.
• Finally, to prove the efficiency of our model for road behavior analysis, we conduct extensive experiments over the JAAD and HDD datasets, which are devoted to understanding the interactions between road agents, driver behaviors, and causal reasoning.
As presented in Figure 2, our on-road behavior analysis includes three levels: high-, mid-, and low-level layers. At the lowest layer, data such as appearance, depth, and motion are collected, pre-processed, and labeled to prepare the settings for the higher layers' tasks. One level up, several mid-level understanding scenarios describe and model spatial and/or temporal visual data, such as road agent detection, recognition, activity classification, and maneuver identification. At the highest level, an aggregate of the lower layers' outcomes is used to learn the overall on-road behaviors by including more intuitive paradigms like causality. In this work, we focus on five high-level tasks: situational awareness, driver-road interaction, road scene understanding, trajectory forecasting, and driving status analysis.
The remainder of this paper is organized as follows. In Section II, we review related work on attention mechanisms with LSTM approaches, behavior understanding algorithms, and datasets devoted to agent behavior understanding. In Section III, we describe our new approach in more detail, the architecture of the STAF model, and the training process. Experiments and results are presented in Section IV. Finally, conclusions and future directions are outlined in Section V.

II. RELATED WORK
A. ATTENTION WITH LSTMS
The attention mechanism has transformed the way RNNs/LSTMs work by allowing them to focus on a certain segment of a given spatio-temporal sequence when learning a segment of the output sequence [13], [14]. This allows LSTMs to overcome the curse of spatio-temporal sequence length, which results in higher performance, specifically on longer sequences, by enhancing the learning process with the ability to focus on the most relevant segment of the sequence. Standard attention modules are employed in standard RNNs between the encoder and the decoder, where, for a given time slot, the module takes both the encoder's output vectors and the corresponding decoder states as inputs. The output of the attention module is a sequence of vectors commonly known as the context vector, which enables the following module, the decoder, to focus on certain parts of the input when predicting the output. One of the milestone proposals is the transformer module. Instead of using the attention mechanism in conjunction with RNNs, the transformer, introduced initially in [12] and successfully reused in [15], [16], proves that this sequential nature can be captured by using only the attention mechanism, without any recurrence. The transformer architecture transforms one sequence into another with the help of an encoder and a decoder composed mainly of multi-head attention and feed-forward layers. In [17], the attention mechanism is placed on top of RNNs through the so-called "consciousness prior", which formalizes consciousness recurrently over time by retaining fragments of the input representations in the recurrent cell, as if they were attended to.
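To make the encoder-decoder attention concrete, the following minimal sketch (in PyTorch; the shapes and names are illustrative assumptions, not code from [12]-[14]) computes a context vector as the softmax-weighted sum of encoder outputs, queried by a decoder state:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, keys, values):
    # query: (batch, d); keys, values: (batch, seq_len, d)
    d = query.size(-1)
    scores = torch.einsum('bd,bsd->bs', query, keys) / d ** 0.5  # compatibility of Q with each K
    weights = F.softmax(scores, dim=-1)                          # attention distribution over time
    context = torch.einsum('bs,bsd->bd', weights, values)        # weighted sum of values
    return context, weights

# Example: one decoder state attending over 10 encoder output vectors.
enc_out = torch.randn(2, 10, 64)
dec_state = torch.randn(2, 64)
context, weights = scaled_dot_product_attention(dec_state, enc_out, enc_out)
```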
The previously mentioned attention modules and their extensions have barely been used in video analysis for perception in road environments [18]-[20], i.e., to determine which visual instants in the image sequence are important for the learning task. Moreover, these approaches map inputs into high-level CNN features through spatial pooling layers, which diminishes relevant spatial information (such as the location and class of specific objects or actions) that is important for the learned task. Spatio-temporal learning is, however, well mastered in applications such as video recognition [21], [5] and processing [22]. For instance, [23] employs the attention mechanism through LSTMs to generate spatially weighted representations. Yang proposed in [7] to use attention to detect a region of interest at each instant, but failed to localize it at each frame and required semantic supervision to reach all the relevant regions. Hierarchical attention strategies are elaborated in [24], [25], aiming at extracting weights from spatial attention and their conditioned likelihood with a corresponding temporal attention scheme. Temporal attention in particular has been widely used in recent video captioning work to decide which frame(s) in the video are important for generating the next word in a caption.
Tu et al. [24] and Yu et al. [5] propose hierarchical attention schemes that condition on the current caption word and visual features. They first generate spatial attention weights, conditioned on which a similar attention scheme is adopted temporally; the weighted features are used to generate the word.
While this scheme shares a similar motivation to ours, their attention model must select from a much larger number of features, a harder attention problem that demands larger datasets for training. We avoid this difficulty by attending to spatial and temporal features in stages, reducing the data complexity at each stage. More recently, Aafaq et al. [26] use spatio-temporal feature engineering to improve captioning performance. In [27], object saliency is combined with bidirectional temporal graph reasoning; this is related to our proposed ranked attention model, but our formulation is much simpler.
However, RNNs cannot simply be replaced: they are O(n), while attention mechanisms are O(n²), where n is the length of the temporal axis along which the time series is processed. Being different computational structures, it is not true that attention models can fully replace recurrent models. Even if attention mechanisms were stacked hierarchically (like recursive pooling) with small windows, similar to how a WaveNet uses convolutions with windowing, at best such an architecture would require O(n log n) time to process its input time series. The advantages of O(n²) attention only appear when the sequence length is small enough to be computed at once on a large graphics card with enough memory and cores to process the whole sequence in parallel. All that being said, recurrence remains a fundamental sequential computation model, and claims that attention has made RNNs obsolete overstate the facts; the two should not be conflated.
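The asymptotic argument above can be illustrated by simply counting attention-score computations (the window size k below is an illustrative parameter, in the spirit of the windowed attention used later in STAF):

```python
def attention_score_count(n, k=None):
    """Query-key scores needed: full self-attention is O(n^2); a window of k past states gives O(n*k)."""
    if k is None:
        return n * n          # every step attends to all n steps
    return n * min(k, n)      # each step attends to at most k past states

n = 1024
print(attention_score_count(n))      # 1048576 scores, quadratic in n
print(attention_score_count(n, 16))  # 16384 scores, linear in n for fixed k
```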
Recently, Xu et al. [28] proposed an end-to-end FCN-LSTM network for driving-model learning. They considered 4 discrete actions, defined from CAN signals using heuristic techniques.

B. ON-ROAD BEHAVIOR UNDERSTANDING
Human activity understanding plays an important role in achieving intelligent systems. Different datasets have been proposed, such as the Cambridge-driving Labeled Video dataset (CamVid) [29], AutoRate for predicting driver inattention [30], Trajectory Prediction in Heterogeneous Traffic (TraPHic) [31], and the Deep RObust Goal-Oriented trajectory prediction Network (DROGON) [32], to address the limitations of earlier works. The ROad event Awareness Dataset for Autonomous Driving (ROAD) [33] was designed to test a robot-car's situation awareness capabilities; its annotation process follows a multi-label approach in which road agents (vehicles, pedestrians, etc.), their locations, and their actions are annotated. Note that such data can enable research in learning driver behaviors, as mentioned in the introduction. In particular, recognizing a goal-oriented action is an egocentric activity recognition problem. The Stanford-ECM dataset [34] is related to the HDD dataset in two aspects: 1) it defines egocentric activity classes for humans similar to HDD's goal-oriented classes for drivers, and 2) it provides egocentric videos and signals from a wearable sensor for jointly learning activity recognition and energy expenditure, whereas HDD provides multi-sensor recordings from an instrumented vehicle for learning driver behavior.
In [41], Ramanishka et al. introduce the Honda Research Institute Driving Dataset (HDD) for driver behavior, which includes a novel annotation methodology with a 4-layer representation; a driver behavior detection algorithm was trained and tested for validation. Rasouli et al. [40] introduced a novel dataset devoted to traffic scene understanding that combines information such as pedestrian localization in scenes, pedestrian behavior, and contextual data. In addition to egocentric activities, HDD also annotates how traffic participants interact with the driver.

III. STAF APPROACH
A. OVERVIEW
In Figure 3, we present the architecture of our STAF model, which includes the following steps:
1) Extraction of CNN features from the input frames.
2) A spatial attention LSTM (SA-LSTM), followed by a temporal attention pooling (TAP) step, a temporal attention LSTM (TA-LSTM), and an FC (Fully Connected) layer.
3) A cross-entropy loss layer to measure the performance of the classification.

Inspired by the multi-head attention mechanism [12], we propose here a multi-head-attention-based LSTM network that allows the model to cooperatively handle data from multiple sub-spaces of different representations. Unlike standard self-attention, the cell state c_{t-1} can query its own k past inputs with an attention window of size k (cf. Figure 4).

[Figure 3: Pictorial overview of the proposed STAF framework. After extracting CNN features from the input frame, a spatial LSTM (SA-LSTM) is applied first; a pooling step is then carried out through TAP, followed by a temporal LSTM (TA-LSTM) and a fully connected (FC) layer. All steps are followed by a cross-entropy loss layer for classification evaluation.]

To ease the notation, let a_t denote the concatenation over time of the windowed cell states (Eq. (1)):

a_t = [c_{t-k}, c_{t-k+1}, ..., c_{t-1}]    (1)

Indexing these k states over time defines the Positional Encoding (PE) originally used in [12], which makes use of different sinusoids to encode the positions (Eq. (2)):

PE(i, 2j) = sin(i / 10000^(2j/d_model)),  PE(i, 2j+1) = cos(i / 10000^(2j/d_model))    (2)

where i ∈ {1, ..., L}, L is the length of the input sequence, j = 1, ..., d_model/2, and d_model is the dimension of each layer's output. Contrary to what was proposed in the literature, the positional embedding in our STAF model does not need a random phase in Eq. (2), since it benefits from a fixed window size that removes the dependence of the encoding function on a variable number of features. The windowed cell states are concatenated to the features instead of being added to them, as done in [12].
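As an illustration of Eqs. (1)-(2), the following sketch (window size and dimensions are illustrative assumptions) builds the fixed-window sinusoidal encoding without a random phase and concatenates it to the windowed cell states rather than adding it:

```python
import torch

def positional_encoding(k: int, d_model: int) -> torch.Tensor:
    """Sinusoidal PE over a fixed window of k positions, as in Eq. (2)."""
    pe = torch.zeros(k, d_model)
    pos = torch.arange(k, dtype=torch.float32).unsqueeze(1)      # positions 0..k-1
    j = torch.arange(0, d_model, 2, dtype=torch.float32)
    div = torch.pow(10000.0, j / d_model)                        # 10000^(2j/d_model)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe

k, d_model = 8, 64
past_cells = torch.randn(k, d_model)                             # windowed past cell states a_t
# Concatenation along the feature dimension (instead of addition as in [12]).
encoded = torch.cat([past_cells, positional_encoding(k, d_model)], dim=-1)
```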
Then, we obtain positional information that generalizes the multi-head attention: given a query Q and its corresponding set of keys and values (K, V) generated from the proposed positional embeddings, the multi-head attention can be computed as the weighted sum of V, weighted in proportion to the compatibility of Q with the corresponding K values.
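A minimal sketch of this step, using PyTorch's generic multi-head attention module as a stand-in for our implementation (the shapes are illustrative assumptions):

```python
import torch

d_model, n_heads, k = 64, 4, 8
mha = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)

query = torch.randn(1, 1, d_model)      # current cell state acting as the query Q
kv = torch.randn(1, k, d_model)         # k position-encoded past cell states as K and V
attended, weights = mha(query, kv, kv)  # weighted sum of V, proportional to Q.K
```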
For our LSTM cells, we make use of their Batch Normalized variant (BN-LSTM) introduced in [36] to reduce internal covariate shift [37]. Unlike the batch normalization used in convolutional neural networks, we apply it to the input-to-hidden and the hidden-to-hidden connections, together with the Exponential Linear Unit (ELU) activation function [38].
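The sketch below gives a simplified reading of such a batch-normalized LSTM step (a rough sketch, not the exact BN-LSTM of [36]; the sizes are illustrative, and the ELU placement follows our reading of the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BNLSTMCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.wx = nn.Linear(input_size, 4 * hidden_size, bias=False)
        self.wh = nn.Linear(hidden_size, 4 * hidden_size, bias=True)
        self.bn_x = nn.BatchNorm1d(4 * hidden_size)   # input-to-hidden normalization
        self.bn_h = nn.BatchNorm1d(4 * hidden_size)   # hidden-to-hidden normalization

    def forward(self, x, state):
        h, c = state
        gates = self.bn_x(self.wx(x)) + self.bn_h(self.wh(h))
        i, f, g, o = gates.chunk(4, dim=1)            # input, forget, cell, output gates
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * F.elu(g)   # ELU instead of tanh
        h = torch.sigmoid(o) * F.elu(c)
        return h, c
```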
Given a query q and a set of key-value pairs (K, V), attention can be generalized to compute a weighted sum of the values, dependent on the query and the corresponding keys. The query determines which values to focus on; we can say that the query "attends" to the values.
More specifically, the attention mechanism is employed to enhance the features of key nodes at each time step. Instead of performing a single attention function with d_model-dimensional keys, values, and queries, we found it beneficial to linearly project the queries, keys, and values h times with different learned linear projections to d_k, d_k, and d_v dimensions, respectively. On each of these projected versions of queries, keys, and values, we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected to obtain the final values, as depicted in Figure 4 (Eq. (3)):

MultiHead(Q, K, V) = [head_1, ..., head_h] W,  head_i = Attention(Q U_i^Q, K U_i^K, V U_i^V)    (3)

where the U matrices and W are the learned weight matrices used to compute the multi-head attention of Q, K, and V, and Attention(·) is the scaled dot-product attention of [12].
The value, key, and query of our STAF cell are then built from these windowed states. Equations (4)-(9), following [35], define in turn: the values V as a concatenation along the "features dimension" (denoted [·]); the keys K, which incorporate the positional encoding PE; the query Q; the batch-normalized pre-activations; the attention output a_t, computed with the multi-head softmax function S proposed in the transformer block of [12]; and the updated cell state c_t, where z_t is the hidden layer of a typical sequence-transduction encoder or decoder.

Figure 4 depicts the STAF cell, which uses the multi-head attention of the transformer model to run several attention layers in parallel over its past cell-state values within a limited time slot. Compared to a standard LSTM, STAF applies the multi-head attention function inside each cell so that it can query its own S past values, not only with attention but also with a restricted window over the S most recent previous cell states. This allows the cell to perform complex queries over its previous inner memories, which strengthens the long short-term effect of STAF. To prove the efficiency of our STAF network in high-level road understanding, we conduct experiments over the JAAD and HDD datasets, including the classification of the so-called precondition behaviors prior to crossing: moving fast, moving slow, or standing.
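Putting the pieces together, the following high-level sketch shows one possible reading of a STAF-style cell (our interpretation for illustration, not the authors' released code): the cell keeps a window of its S most recent cell states and queries them with multi-head attention before the LSTM update:

```python
import collections
import torch

class STAFCellSketch(torch.nn.Module):
    """Illustrative STAF-style cell: an LSTM whose cell state queries its own windowed past."""

    def __init__(self, d_model: int, n_heads: int = 4, window: int = 8):
        super().__init__()
        self.mha = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lstm = torch.nn.LSTMCell(2 * d_model, d_model)   # input features + attended memory
        self.past = collections.deque(maxlen=window)          # S most recent cell states

    def forward(self, x, state):
        h, c = state
        if self.past:
            mem = torch.stack(list(self.past), dim=1)         # (batch, <=S, d_model)
            attended, _ = self.mha(c.unsqueeze(1), mem, mem)  # c_{t-1} queries its past
            attended = attended.squeeze(1)
        else:
            attended = torch.zeros_like(c)                    # no memory at the first step
        h, c = self.lstm(torch.cat([x, attended], dim=-1), (h, c))
        self.past.append(c.detach())                          # sketch: no backprop through memory
        return h, c
```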

B. TRAINING DETAILS
The problem of training is equivalent to minimizing the loss function, and many algorithms exist to optimize it. We use gradient-based optimization, starting from Stochastic Gradient Descent (SGD), which is one of the most popular optimization algorithms in deep learning. Hyper-parameters are external to the training process and define its properties (e.g., the momentum coefficient, the learning rate, and the regularization parameter of the weight-decay term). Our objective was to test various combinations, creating as many models as there are possible combinations. The first step was to define a list of candidate values for each hyper-parameter, inspired by the values used in the state-of-the-art models that gave the best results. We then built a dictionary containing all these hyper-parameters, associated our model with every combination we wanted to test, and launched training on both the JAAD and HDD datasets. Once training finished, we retrieved the combination of hyper-parameters that gave the best score. In other words, these parameters were obtained experimentally after several validation tests, and they also represent an average of the values used in the literature. During training, the model refines its parameters according to the chosen loss function. We varied the learning rate over the course of training according to the optimizer schedule given in [12], and also used the Adam optimizer with weight decay for training optimization. For training our methods, we used SGD with a constant-step learning-rate update, with γ set to 0.1. Momentum, µ, and weight decay, ω, were set to
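The grid search described above can be summarized by the following sketch, where `train_and_eval` is a hypothetical placeholder for a full training-and-validation run on JAAD/HDD, and the candidate values are illustrative, not the values actually used:

```python
import itertools
import random

def train_and_eval(params):
    """Hypothetical stub: train STAF with `params` on JAAD/HDD and return a validation mAP."""
    return random.random()   # stand-in score for illustration only

grid = {                     # candidate values, inspired by the literature
    "lr": [1e-2, 1e-3, 1e-4],
    "momentum": [0.9, 0.99],
    "weight_decay": [1e-4, 5e-4],
}

best_score, best_params = float("-inf"), None
for combo in itertools.product(*grid.values()):          # one model per combination
    params = dict(zip(grid.keys(), combo))
    score = train_and_eval(params)
    if score > best_score:
        best_score, best_params = score, params           # keep the best-scoring combination
```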

IV. EXPERIMENTS AND RESULTS
A. DATASETS
We evaluate our STAF framework on two publicly available datasets: the JAAD dataset [39], [40] and the HDD dataset [41]. Each of them is used to assess the performance of STAF on a different road agents' behavior understanding task. JAAD contains around 300 image sequences of up to 15 seconds at 30 FPS, acquired using cameras positioned inside the car at two different locations and under various traffic conditions. The settings of JAAD enable the determination of pedestrians' behaviors in front of the vehicle: specific annotations provide information concerning their gaits (walking, standing, looking towards the traffic, not looking) and their corresponding attention level (e.g., looking, standing at the curb, checking the traffic), which are useful indicators of their moving/crossing/standing intention as studied in [4]. As for the HDD dataset, it includes around 10k image sequences acquired under different weather conditions. Four annotation layers have been proposed for this dataset (please refer to [41] for more details). In our experiments, we make use of the two layers called cause and goal-oriented action, which propose 5 and 11 classes of driver behaviors, respectively (Figure 5). The cause set includes the sign, congestion, traffic light, parked car, and pedestrian classes, which are the immediate causes of the goal-oriented action classes: intersection passing, left turn, right turn, crosswalk passing, left lane change, right lane change, merge, left lane branch, right lane branch, railroad passing, and u-turn (Figure 5).

B. RESULTS ANALYSIS
For the three test sets (JAAD, HDD goal-oriented layer, and HDD cause layer), to present a fair comparison of our STAF framework with the state-of-the-art approaches [40], [41], we propose two kinds of CNN feature extractors: Inception-V3 [6] and MobileNet-V3 [42], both pre-trained on ImageNet [43].
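For illustration, the sketch below loads two such ImageNet-pretrained backbones as frame-level feature extractors (the torchvision calls reflect our assumptions, not necessarily the exact setup used in the experiments):

```python
import torch
import torchvision.models as models

# Inception-V3 pre-trained on ImageNet, classification head removed.
inception = models.inception_v3(weights="IMAGENET1K_V1")
inception.fc = torch.nn.Identity()          # keep the 2048-d pooled features

# MobileNet-V3 pre-trained on ImageNet, classification head removed.
mobilenet = models.mobilenet_v3_large(weights="IMAGENET1K_V1")
mobilenet.classifier = torch.nn.Identity()  # keep the pooled features

frames = torch.randn(4, 3, 299, 299)        # a small batch of video frames (Inception input size)
inception.eval()
with torch.no_grad():
    feats = inception(frames)               # (4, 2048) per-frame feature vectors
```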
HDD Cause Layer. Table 1 reports the classification results for the behaviors of the cause layer, which are the immediate causes of actions like deviating or stopping. We observe that our proposal using the Inception features outperforms the other methods by a large margin of 38.96% in terms of mAP. For the classes 'parked car' and 'pedestrian', the gain of STAF is even more pronounced. For the 'pedestrian' cause, the driver performs a stop action for pedestrians while making frequent turns, and for the 'parked car' cause, the driver deviates to avoid a collision. Visually, for both of these classes, the key action occurs over a small time slot, and the spatio-temporal models employed in the state of the art, based on standard LSTMs, fail to model this motion. In these cases, thanks to its attention mechanism, STAF can focus on the most important frames regardless of their duration relative to the full sequence.
HDD Goal-oriented Layer. Results for the 11 goal-oriented actions on this test set are reported in Table 2. Our proposal improves the state-of-the-art results by a margin of 28.68%. The standard LSTM model proposed in [41] does not succeed in discerning the 'merge' and 'lane branch' categories, since they are quite similar in the temporal domain and the confusion rate between them is thus higher. The improvement of our STAF method proves its ability to selectively pinpoint the most relevant segments of the video. This selectivity also allows better representations for all the other classes.
It is interesting to note that the 'blind model' (without camera input) can successfully guess 'intersection passing', because most such events follow a very specific pattern: deceleration/wait/acceleration. 'Railroad passing' is surprisingly hard for the CNN model, because this behavior type includes not only railroad crossings at designated locations, which have discriminative visual features, but also tram-rail crossings. The confusion of behavior classes with the 'background' class remains the most frequent source of errors for all layers. Table 2 provides the APs for the 11 goal-oriented actions (from 'right turn' to 'u-turn') for our model and its ablated versions; the last column provides the mAP for all methods. First, we describe the baselines used in this experiment. The first baseline ('Random') simply assigns random behavior labels to each frame and serves as the lower bound on model performance; it also reflects the portion of frames in the test data assigned to each class label. The next one ('CNN pool') encodes each frame by extracting convolutional features using an InceptionResnet-V2 network and pooling them spatially into a fixed-length vector; these pooled frame representations are sequentially fed to the LSTM to predict the behavior label. The third baseline ('Sensors') uses only the sensor data as input to the LSTM. The next method ('CNN conv') is a variant of the second: instead of spatially pooling the CNN feature encodings, a small convnet reduces their dimensionality before passing them through the LSTM. Finally, the 'CNN+Sensors' method adds sensor data to the 'CNN conv' method.
JAAD Set. To prove the efficiency of our STAF network in high-level road understanding, we set up experiments over the JAAD dataset [39], [40]. This is considered the first dataset devoted to situational awareness understanding for the AV with regard to pedestrians at crossing points. The performance comparison of our proposed STAF with a generic LSTM network over the JAAD dataset, for the classification of the so-called precondition behaviors before crossing (moving fast, moving slow, or standing), demonstrates the effectiveness of our STAF model, which outperforms the state-of-the-art approaches with a mAP margin of 13%, as shown in Table 3.
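For reference, the per-class Average Precision and the mAP reported in Tables 1-3 can be computed as in the following sketch (using scikit-learn; the labels and scores below are toy values, and the class names are the JAAD precondition behaviors mentioned above):

```python
import numpy as np
from sklearn.metrics import average_precision_score

classes = ["moving fast", "moving slow", "standing"]
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])            # one-hot frame labels
y_score = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1],
                    [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]])                      # model scores

# Per-class AP, then the mean over classes (mAP).
aps = [average_precision_score(y_true[:, i], y_score[:, i]) for i in range(len(classes))]
print({c: round(ap, 3) for c, ap in zip(classes, aps)}, "mAP:", round(float(np.mean(aps)), 3))
```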

V. CONCLUSION AND OUTLOOK
We have presented in this paper a new approach called STAF for a high-level understanding of road agents' activities. We have compared our STAF framework with state-of-the-art approaches using two kinds of CNN features: STAF+MobileNet and STAF+Inception. Classification results on the cause layer, whose behaviors are immediate causes of actions like deviating or stopping, show that STAF with Inception features outperforms the other methods by a large mAP margin of 38.96%, particularly for the classes "parked car" and "pedestrian". STAF is able to focus on the most important frames regardless of their duration relative to the full sequence. The standard LSTM model proposed in the literature does not succeed in discerning the "merge" and "lane branch" categories, since they are quite similar in the temporal domain; the results obtained by our STAF model over the eleven goal-oriented actions improve the state-of-the-art results by an mAP margin of 28.68%. We have also demonstrated that the improvement of our STAF method proves its ability to selectively pinpoint the most relevant segments of the video, which allows better representations for all the other classes. Finally, to prove the effectiveness of our STAF network in high-level road understanding, we also validated our approach on the JAAD dataset, which is well suited to understanding the situational awareness of AVs with regard to pedestrians at crossing points. Our STAF model outperformed the state-of-the-art approaches with an mAP margin of 13%.
Our future work is oriented towards the analysis and understanding of complex road scenes using Multitask Deep Learning (M-TL). The fine-grained multitasking approach under development aims at an innovative convolutional and/or recurrent deep learning model combining several types of problems to be solved by the same model, in order to improve network performance. The tasks are defined so that they address distinct but complementary problems: 3D detection, real-time tracking, semantic description, prediction, etc. The objective is to robustly analyze the spatio-temporal sequences of a road scene. The STAF model will, of course, be part of these tasks in the development of our M-TL.

ABBREVIATIONS
The following abbreviations are used in this manuscript:
STAF: Spatio-Temporal Attention Framework
LSTM: Long Short-Term Memory
RNN: Recurrent Neural Network
CNN: Convolutional Neural Network
JAAD: Joint Attention in Autonomous Driving
HDD: Honda Research Institute Driving Dataset
mAP: mean Average Precision
PE: Positional Encoding
FC: Fully Connected
ELU: Exponential Linear Unit
ITS: Intelligent Transportation Systems
AV: Autonomous Vehicle

APPENDIX A
STAF ARCHITECTURE
The bottleneck architecture used in our STAF model is presented in Figure 2, Figure 3, and Figure 4. STAF+Inception results for the cause layer of the HDD dataset are presented in Table 1.
Performance evaluation on the goal-oriented action set of the HDD dataset is shown in Table 2. Performance evaluation over the JAAD dataset, comparing visual information alone with its combination with Contextual Information (CI), is given in Table 3.

For two years, he was with the University of Strasbourg as an associate teacher-researcher, before moving to industry, where he worked as a project manager at the Jouve, Alten, and Thales companies. Afterwards, he joined IRSEEM/ESIGELEC High Engineering School (Normandy University) as an associate professor in embedded systems and computer vision, within the "Instrumentation, Computer Sciences, and Systems" research team of the IRSEEM laboratory. His research interests include mobile robotics, environment perception, and computer vision dedicated to autonomous vehicles and smart mobility.
B. DECOUX is currently an associate professor with ESIGELEC, Normandy University of Rouen, France. He obtained his Ph.D. from the University of Rouen in 1995. From 2008 to 2012, he worked as a research engineer at the Kelenn Technology company, on industrial applications of optical character recognition. Since 2012, he has worked as an associate professor; his research interests include embedded systems, computer vision, and artificial intelligence.
J-Y. ERTAUD is currently an associate professor with ESIGELEC, Normandy University of Rouen, France. He obtained his engineering degree from the ESIGELEC Engineering High School in 1982, and his Ph.D. from the University of Le Havre in 1989. From 1983 to 2008, he was head of the electrical engineering department of the ESIGELEC Engineering High School. Since 2008, he has been a teacher-researcher in embedded systems and robotics, and is also a stakeholder in computer vision and robotics. His research interests include embedded systems, computer vision, mobile robotics, and articulated robotics.
R. BOUTTEAU received his engineering degree from the IMT Lille Douai and his MSc degree in Computer Science from the University of Lille in 2006. In 2010, he received his Ph.D. degree from the University of Rouen Normandy for works related to Computer Vision (catadioptric sensors, 3D reconstruction, Structure-from-Motion). From 2009 to 2020, he was an Associate Professor at the ESIGELEC engineering school and a researcher in the IRSEEM research institute. In 2018, he obtained the HDR (French Habilitation to supervise research) from the University of Rouen Normandy for his research on autonomous vehicles localization. Since 2020, he has been a Full Professor at the University of Rouen Normandy within the STI team (Intelligent Transportation System) at the LITIS Lab (IT Laboratory, Information Processing and Systems). His research interests are perception, localization, and computer vision dedicated to autonomous vehicles.