An End-To-End Emotion Recognition Framework based on Temporal Aggregation of Multimodal Information

Humans express and perceive emotions in a multimodal manner. The multimodal information is intrinsically fused by the human sensory system in a complex way. Emulating the temporal desynchronisation that naturally occurs between modalities, in this paper we design an end-to-end neural network architecture, called TA-AVN, that aggregates temporal audio and video information in an asynchronous setting in order to determine the emotional state of a subject. The feature descriptors for the audio and video representations are extracted using simple Convolutional Neural Networks (CNNs), leading to real-time processing. Undoubtedly, collecting annotated training data remains an important challenge when training emotion recognition systems, both in terms of effort and of expertise required. The proposed approach addresses this problem by providing a natural augmentation technique that allows achieving a high accuracy rate even when the amount of annotated training data is limited. The framework is tested on three challenging multimodal reference datasets for the emotion recognition task, namely the benchmark datasets CREMA-D and RAVDESS, and one dataset from the FG2020 challenge on emotion recognition. The results prove the effectiveness of our approach, and our end-to-end framework achieves state-of-the-art results on the CREMA-D and RAVDESS datasets.


I. INTRODUCTION
The emergence of artificial intelligence (AI) techniques, more precisely deep learning techniques, has enabled the fast technological development that has recently occurred in the human-computer interaction domain. Many AI-based systems are able to automatically detect the affective states of the users, leading to a personalized experience in terms of human-computer interaction (e.g., social robots [1], monitoring systems for car drivers' condition [2]). These interdisciplinary systems are at the border between Affective Computing (i.e., dealing with the development of systems that can recognize, process and simulate human affects) and Social Signal Processing (i.e., dealing with the analysis of verbal and non-verbal information extracted during social interactions) [3]. Emotion AI-based systems complement standard information in a wide variety of use case scenarios, e.g., the healthcare industry, customer service, marketing, security and fraud detection.
Each person is unique and can express emotions in their own characteristic way, depending on their culture, age, gender or previous life experiences [4]. Nevertheless, there are common characteristics that can be exploited in order to obtain an accurate classification system. Facial expressions represent one of the most important modes of communication through which people express their emotions and intentions. According to the Facial Action Coding System (FACS), each human emotion can be described through a combination of several Facial Action Units (FAUs) that correspond to particular facial muscle movements [5]. Apart from facial expressions, speech also contains relevant cues for the discrimination between emotional states, e.g., speech inflection and vocal intensity are characteristics that contain information about the emotional state of a subject. Due to the increased interest in developing datasets that capture real-world scenarios, as well as the increased computer processing capabilities, recent studies have shown that the integration of multimodal discriminant information (e.g., facial and audio features) in emotion recognition systems enhances their robustness [6]-[10]. While several multimodal fusion approaches for emotion recognition exist, most of the current approaches consider the whole video sequence during the analysis (i.e., entire visual and/or audio sequences). However, most of the previous and subsequent frames in the proximity of an analyzed frame may contain redundant information that only increases the computational burden of the system without bringing additional knowledge in terms of emotion recognition. Moreover, audio cues revealing information about the emotional state of a person might occur both before and after the same information is reflected by means of visual cues.
In this paper, we propose a novel multimodal neural network architecture that combines a limited amount of audio-visual information in windows that are randomly selected within individual temporal segments of the input video. In order to improve the generalization of the model, the audio and visual features are extracted in an asynchronous manner by allowing a small temporal offset between the two modalities. For this reason, although an important number of video frames can be discarded in order to improve the computational speed during the recognition process, for each analysed frame, we consider both pre- and post-frame audio samples within a temporal segment. Our approach towards the selection of frames and corresponding audio samples naturally yields a simple, yet effective, data augmentation technique by using different time windows from videos at each iteration during the training process. Considering that manual data annotation is expensive both in terms of time spent and the level of required expertise, the proposed system is conceived to work in the context of a limited amount of annotated data. This is naturally achieved through the random selection of temporal windows within the individual fragments of the input video. Therefore, the main contributions of this article are three-fold: (i) we propose a novel audio-visual multimodal fusion framework for emotion recognition based on a random selection of analysis windows collected from individual temporal segments of the input video; (ii) the proposed method can be easily adapted to also work when the amount of available annotated data is limited; (iii) due to its reduced computational complexity and an overall processing time shorter than 30 ms per analysed audio-video sequence, the end-to-end framework can be considered a valid candidate for online emotion recognition.
In order to prove the effectiveness of our approach, we tested our solution on two widely used and challenging audio-visual datasets in the field of multimodal emotion recognition, i.e., the CREMA-D dataset [11] and the more recent RAVDESS dataset [12], and achieved state-of-the-art results. Moreover, we also validated our method on a dataset proposed for the Multimodal (Audio, Facial and Gesture) based Emotion Recognition Challenge under FG2020, for which we considered only the audio and visual information.
The rest of the paper is organized as follows. Section II reviews several related works in the domain of audio-visual emotion recognition. Section III presents the proposed method for emotion recognition, which is based on the temporal aggregation of audio-visual modalities. Section IV is dedicated to presenting several implementation details, the databases used for method validation and the experimental results, accompanied by comparisons with existing state-of-the-art approaches. Finally, Section V concludes the paper.

II. RELATED WORK
Emotion recognition systems using only visual information (i.e., video frames) can be mainly classified into static and dynamic methods depending on the feature representations.
In static-based methods, the features are encoded with spatial information from singular frames without taking into consideration the temporal extent, whilst dynamic-based methods consider the temporal relation between continuous frames from the input sequence. In the case of static-based methods, state-of-the-art deep neural network architectures (e.g., VGG [13], ResNet [14]) have been proposed for feature extraction, whilst the classification into emotion categories is performed using a Support Vector Machine (SVM) classifier [15]. Speech emotion recognition has been a highly active research field in the past decade. One of the most successful approaches for audio feature extraction and classification of speech is the openSMILE toolkit, largely deployed in automatic emotion recognition from speech [16]. Similarly, spectrogram representations of emotional speech yield competitive performance for automatic speech emotion recognition [17]. Considering their success in many visual recognition tasks, convolutional neural networks (CNNs) are able to capture high-level representations in the spatial domain and to provide solutions for numerous tasks related to speech processing challenges, e.g., the ResNet architecture used in an x-vector model for emotion and speaker recognition [18]. In speech-based emotion recognition, CNNs were used for the extraction of salient features from spectrograms [19], or in parallel with an attention-based bidirectional Long Short-Term Memory (LSTM) module [17].
Although numerous emotion recognition systems focus on using only the audio information for the task of emotion recognition, the performance of such systems is limited because of the restricted amount of labeled data in the audio domain. One possibility to enhance the performance of audio emotion recognition systems is to transfer knowledge from the labeled video frames (i.e., for the facial emotion recognition task, there is a large amount of publicly available datasets) to the heterogeneous labeled audio domain. In this sense, the method presented in [20] serves as a data augmentation technique using a large labeled visual dataset to increase the amount of audio-based emotion recognition data. In the same vein, facial expressions from videos can be used to boost the awareness and the prediction tracking of emotions in audio data, leading to a cross-modal knowledge transfer between audio and facial modalities within the emotional context [21].
Combining information from both modalities, audio and video, leads to increased emotion recognition performance [6]. Multimodal systems often require fusion mechanisms that efficiently combine the features extracted from different modalities in order to produce one global decision. These fusion strategies for combining audio and visual modalities emphasize the most important frames that reveal the subject's emotion. For example, Zhou et al. consider CNNs to extract features from the speech spectrogram and several relevant video frames, which are highlighted through various intra-modal fusion strategies (e.g., self-attention, relation-attention, perceptron-attention) [22].
In order to include temporal dynamic characteristics between video frames, Beard et al. proposed a recurrent multi-attention (RMA) mechanism with shared external memory that is updated over multiple iterations of analysis [7], allowing the relevant memories to persist over multiple hops. The method achieved an accuracy of 65% on the CREMA-D dataset, comparable to the accuracy obtained through crowdsourced human ratings [11].
In general, attention mechanisms are used to capture the complementary information between visual and audio modalities by weighting time-windows from videos for multimodal learning and fusion. The original video sequence is divided into separate time windows, each window containing a sequence of frames. The system proposed in [23] consists of two encoder networks, one for each modality. For each sequence, audio and visual embeddings are generated using VGG networks and their complementary information is captured through an attention mechanism, which is used to weigh the audio and video representations from different time instances in the original video sequence according to their importance.
Other mechanisms for fusing multimodal information are based on deep neural network (DNN) architectures. The DNN architecture proposed in [24] is regarded as an intermediate level of fusion between multimodal features (e.g., low-level descriptors for acoustic features, bag-of-video-words for video features, and bag-of-words for text features), where the classifiers and the fusion function are globally optimized.
A style extractor model that creates transformations from emotional to neutral faces in the presence of speech is proposed in [25]. The facial movements induced by speech articulation generate noise and degrade the performance of a facial emotion recognition system. In order to solve this issue, in [25], transformed neutral faces are contrasted with the original ones to create a discriminative feature representation which emphasizes the spatial deviations between emotional and neutral faces. The mapping between emotional and neutral faces is done using deep learning techniques by training a model with paired data that contain the same lexical information, but different facial expressions. Apart from the style extractor model, the system proposed in [25] comprises two additional models, namely the feature extractor model and the fusion model. The feature extractor model is responsible for extracting representative features for a given face, whereas the fusion model aggregates the information from the feature extractor and the style extractor models in order to predict the emotion present in the input video sequence.
In an attempt to exploit the complementary information brought by diverse modalities (i.e., audio and video), a Multimodal Emotion Recognition Metric Learning (MERML) approach was defined in [26]. The learned metric was further used by an SVM-based classifier with a Radial Basis Function (RBF) kernel. Similarly, a multimodal system that encodes video sequences (i.e., both audio and visual data) into a metric space was proposed in [8]. The aim is to reduce the representation distance and to explore the additional information that each modality brings. Inspired by the fact that emotions can be expressed with varying degrees of intensity at different times and, thus, the correct perception of emotion requires a broader temporal context to capture this evolution, the system presented in [8] takes into account the temporal evolution of emotion throughout the entire video sequence. Using a gating paradigm that involves presenting a stimulus repeatedly in time sequences of increasing duration, the proposed architecture comprises two 3D CNNs, one for each modality, mapping the input into a common space. The difference in representation is minimized by using a distance function as the error function. The outputs of the two CNNs are connected via LSTM cells that analyze the time dependence between video windows extracted at different times.
Considering the behavioral differences between people and the diverse modes of communicating their feelings, methods addressing person-specific affective understanding have been also developed. In [27], Barros et al. proposed both a neural model that evokes a general representation of emotions and a group of neural networks that behave as personalized emotional memories to learn individual aspects of emotional expression. The proposed architecture for the general understanding of emotions consists of an adversarial autoencoder, which, on one hand, learns representations of facial expressions and, on the other hand, generates new images using conditional emotional information. Once this model is trained, it is used to generate a series of images edited with various expressions for a particular person. This image-generated collection is used to initialize a Grow-When-Required (GWR) neural network that functions as a personalized affective memory when learning individualized aspects of emotions for a particular person.

III. PROPOSED METHOD
A. DEFINITIONS AND NOTIONS
We consider a labeled multimodal dataset D, consisting of m audio-visual pairs and corresponding labels, D = {((X_a^i, X_v^i), Y^i), i = 1, ..., m}, where (X_a^i, X_v^i) is the i-th audio-visual pair and Y^i is the corresponding discrete emotion label. Given a pair of audio-video modalities, we denote by r_a and r_v the corresponding audio sample rate and video frame rate, respectively. The goal is to predict the emotion Y expressed by the subject in each audio-visual test pair (X_a, X_v). The total number of possible discrete emotion labels is denoted by L.

B. TEMPORAL SEGREGATION AND INTEGRATION OF ANALYSED AUDIO-VISUAL DATA
In this paper, we propose to model the temporal audio-visual information through sampling via a CNN-based network architecture, called Temporally Aggregated Audio-Visual Network (TA-AVN). Considering that analyzing a large number of redundant frames does not increase the informational content with respect to emotion recognition, we extract only a limited number of frames from the videos. More precisely, we divide each video stream X_v into N temporal segments of equal length and we randomly select one representative frame, which concentrates the visual information for the corresponding temporal segment. Thus, the video analysis will be conducted only over the N representative frames, sampled from the video stream. We denote by {i_1, i_2, ..., i_N} the set of indices of these representative frames. For each video frame in a temporal segment, we extract a related audio signal of length m seconds (i.e., m · r_a samples) from the entire audio sequence X_a. Considering a small offset between the audio and visual modalities, we allow a certain degree of freedom when choosing the audio signal associated with a particular temporal segment. Thus, the middle sample of the audio signal related to frame index i_j is randomly chosen in the interval comprised between the limits (i_j − o) · r_a/r_v and (i_j + o) · r_a/r_v, where o is a small offset. This allows the integration of multiple modalities even if the frame rates, or sampling rates, differ from one modality to the other.
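To make the sampling procedure concrete, the following Python sketch illustrates how a representative frame and an asynchronous audio window could be drawn for each temporal segment; the function name, the default window length and the offset value are illustrative assumptions rather than a verbatim excerpt of our implementation.

import random

def sample_segment_pairs(num_frames, audio, n_segments,
                         r_v=30, r_a=16000, win_sec=1.28, offset_frames=0.3):
    """For each of the N temporal segments, pick a random representative frame
    index i_j and an audio window whose centre sample is drawn uniformly in
    [(i_j - o) * r_a / r_v, (i_j + o) * r_a / r_v]."""
    seg_len = num_frames // n_segments
    half = int(win_sec * r_a / 2)
    pairs = []
    for j in range(n_segments):
        # random representative frame index inside the j-th segment
        i_j = random.randrange(j * seg_len, (j + 1) * seg_len)
        # o = 0.3 frames corresponds to roughly 10 ms at 30 fps (see Sec. IV-B)
        centre = random.uniform((i_j - offset_frames) * r_a / r_v,
                                (i_j + offset_frames) * r_a / r_v)
        start = max(0, int(centre) - half)
        pairs.append((i_j, audio[start:start + 2 * half]))
    return pairs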
The proposed method of selecting the representative audio and video sequences has several benefits. Firstly, as mentioned, this selection enables the integration of the two modalities, audio and video, even though the rate values are different. Secondly, the proposed approach provides a natural technique for data augmentation. We will come back to this aspect in the next sections. As mentioned already, the task is to predict a discrete emotion label for each audio-visual pair (X_a, X_v) in the test set. We divide this task into N classification subtasks, with N being the number of temporal segments. More precisely, for each temporal segment, a classifier provides scores for all discrete emotions. For the j-th temporal segment, j ∈ {1, 2, ..., N}, the scores are stored in a vector of L components:

y_j = [y_{j,1}, y_{j,2}, ..., y_{j,L}] = g(f_a(x_a^j), f_v(x_v^j)),   (1)

where x_a^j and x_v^j denote the audio window and the representative frame selected for the j-th temporal segment, f_a and f_v represent the feature extraction functions for the audio and the video part, whilst g is a fusion function that integrates the multimodal information provided as input.
The aggregation of the temporal information carried by the audio-visual pair (X_a, X_v) is performed by a simple addition of the scores retrieved for all temporal segments:

y = Σ_{j=1}^{N} y_j.   (2)

The best overall score in y yields the predicted discrete emotion, Y*, for the audio-visual pair (X_a, X_v). The framework for aggregating the audio-visual information is depicted in Fig. 1.
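A minimal PyTorch sketch of the aggregation step in equation (2); the tensor layout and variable names are assumptions for illustration.

import torch

def aggregate_scores(segment_scores):
    """segment_scores: tensor of shape (N, L) holding the per-segment emotion
    scores y_j. Returns the summed scores y and the predicted label index Y*."""
    y = segment_scores.sum(dim=0)      # simple addition over the N temporal segments
    y_star = int(torch.argmax(y))      # index of the best overall score
    return y, y_star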

C. NATURAL DATA AUGMENTATION
One of the challenges when training deep learning architectures is the need for large volumes of annotated data. In the case of emotion recognition tasks, this is not simple to achieve since manual labelling requires time and expertise. When dealing with an insufficient amount of annotated training data, the dataset can be naturally enlarged several times by considering two degrees of freedom. Firstly, for each temporal segment, a representative frame is selected randomly in a temporal window. Secondly, the middle sample of the raw audio data is selected with a random temporal offset of maximum o/r v seconds around a representative frame. A direct consequence of these random choices is the possibility to train the networks with different data in each epoch. Furthermore, from the perspective of the emotion recognition task, considering different temporal shifts between modalities allows accommodating various "performance" speeds, i.e., there might be a lag between the emotion expressed through speech and the one provided by the visual cues.

D. EMOTION RECOGNITION AT TEMPORAL SEGMENT-LEVEL
As described above, we use two modalities to recognize emotions, namely audio and video. For each temporal segment, the video frame and the audio signal are fed into an audio-visual network, with the purpose of extracting meaningful information. We denote by d_a and d_v the dimensionality of the audio and video features. Once these features are extracted, they are concatenated and presented as input to a single fully-connected (FC) layer. Thus, for a temporal segment, the output of the FC layer is a tensor z_j = [z_{j,1}, z_{j,2}, ..., z_{j,L}] of L elements, each element representing a score for one of the L possible emotional states. A Softmax activation function is further used to transform the vector of L scores into a normalized vector of exponential scores, y_j = [y_{j,1}, y_{j,2}, ..., y_{j,L}], where:

y_{j,c} = exp(z_{j,c}) / Σ_{k=1}^{L} exp(z_{j,k}),

for c ∈ {1, 2, ..., L}.
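The segment-level fusion described above could be implemented as in the following sketch, under the stated feature dimensions (d_a = 64, d_v = 256); the class name and layer organisation are our assumptions rather than the exact implementation.

import torch
import torch.nn as nn

class SegmentFusion(nn.Module):
    """Concatenate audio and video features and map them to L emotion scores
    with a single fully-connected layer followed by a Softmax."""
    def __init__(self, d_a=64, d_v=256, num_emotions=6):
        super().__init__()
        self.fc = nn.Linear(d_a + d_v, num_emotions)

    def forward(self, feat_audio, feat_video):
        z = self.fc(torch.cat([feat_audio, feat_video], dim=1))  # raw scores z_j
        return torch.softmax(z, dim=1)                           # normalized scores y_j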

1) Visual Data Analysis
The majority of the state-of-the-art results in computer vision and image recognition involve stacked CNNs, which are able to extract meaningful features at each processing level. For the video modality, we first extract only a limited number of frames, i.e., one frame for each temporal segment will be considered as input for a CNN-based architecture. In order to remove the unnecessary information, for each frame, the face of the subject is delimited. This is a very important step for the success of the emotion recognition algorithm because rapid movements of the head between consecutive frames could degrade the extraction by capturing the face only partially. Face detection and alignment are performed using the deep cascaded multi-task framework based on CNNs (MTCNN), as proposed in [28]. The MTCNN method outperforms state-of-the-art methods in different benchmark setups, while maintaining real-time performance. The MTCNN method involves building an image pyramid that is passed through a three-stage cascaded framework for joint face detection and alignment. The first CNN, called P-Net, proposes several candidate windows. False candidates are removed by the second CNN, called R-Net. The last CNN, called O-Net, outputs the positions of five facial landmarks that better describe the faces.
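As an illustration, face crops can be obtained with the facenet-pytorch implementation of MTCNN (one of several publicly available implementations); the crop size, margin and file name below are placeholders, and the 80 × 98 resize reported in Section IV-B would be applied afterwards.

from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=112, margin=10, post_process=False)  # illustrative values
frame = Image.open("frame_0001.jpg")   # a representative frame (hypothetical file)
face = mtcnn(frame)                    # aligned face crop as a tensor, or None if no face is found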
Considering the success of CNN-based architectures in many image recognition tasks, a simple CNN architecture, consisting of 3 main convolutional blocks and a fully connected layer, is used for the extraction of the video features. The architecture is detailed in Table 1 and the dimensionality of the video feature vector is d_v = 256. At each convolutional layer, the number of feature maps is increased, while the size of the feature maps is downsampled by a factor of 2 through the usage of a MaxPooling layer. A rectified linear unit,

ReLU(x) = max(0, x),

is used as the activation function. The stability when training the CNN architecture is increased by inserting a batch normalization layer after each convolutional layer [29].
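A possible PyTorch realisation of this visual branch is sketched below; since Table 1 is not reproduced here, the channel counts per block and the global pooling used before the FC layer are assumptions.

import torch.nn as nn

def conv_block(c_in, c_out):
    """Conv -> BatchNorm -> ReLU -> MaxPool(2), as described in the text."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2))

class VideoFeatureExtractor(nn.Module):
    """Three convolutional blocks followed by an FC layer producing d_v = 256 features."""
    def __init__(self, d_v=256):
        super().__init__()
        self.features = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                                      conv_block(64, 128))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(128, d_v))

    def forward(self, x):                      # x: (B, 3, H, W) cropped face frames
        return self.head(self.features(x))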

2) Audio Data Analysis
In speech and audio-related research, spectrograms or Log-Mel spectrograms are often used as inputs for CNN-based architectures due to their image-like configuration [30], [31]. One of the most commonly used 2D time-frequency representations of raw audio signals is the discrete Short-Time Fourier Transform (STFT), defined as:

X[k, m] = Σ_n x[n] · w[n − m] · e^{−j·2π·k·n/N},

where x[n] is the discrete input signal, w[n] is a window function, and N is the STFT length [32]. Although the STFT is the most general time-frequency representation, it does not lead to a perceptually-inspired processing since it makes no direct assumptions about the human auditory system [33]. On the contrary, Mel and Log-Mel spectrograms are time-frequency representations inspired by human auditory perception. The Mel scale is, in fact, an approximation of the cochlea's non-linear frequency scaling, obtained from the linear frequency scale f by applying a non-linear transformation [34]:

mel(f) = 2595 · log10(1 + f/700).

The Mel scale is divided into n_mel evenly-spaced frequencies and the energy of the spectrum in each band is determined by applying a filterbank of n_mel triangular filters to the audio signals. Let us denote by H_m the triangular filters in the discrete frequency domain:

H_m[k] = (k − f[m−1]) / (f[m] − f[m−1]), for f[m−1] ≤ k ≤ f[m],
H_m[k] = (f[m+1] − k) / (f[m+1] − f[m]), for f[m] < k ≤ f[m+1],
H_m[k] = 0, otherwise,

where f[m], m ∈ {1, ..., n_mel}, are the discrete frequency bins corresponding to the Mel-spaced band edges. In this paper, we use the Log-Mel spectrogram because convergence is achieved faster and the accuracy is, in general, higher compared to other time-frequency representations. Moreover, Log-Mel spectrograms combined with CNN-based architectures have become state of the art in many speech recognition tasks [30]. Following the approach used for the extraction of the video features, we design a similar CNN-based architecture for audio feature extraction. The architecture for audio feature extraction is shown in Table 2 and the dimensionality of the audio feature vector is d_a = 64. The length of the feature vectors extracted from the audio content is 4 times smaller than the length of the feature vectors extracted from the video content. This choice is motivated by the rich content that FAUs provide in emotion recognition tasks, i.e., the combination of various facial muscle movements translates into a particular emotion.
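The Log-Mel front end can be computed, for instance, with librosa; the parameter values below match those reported in Section IV-B, while the function name is ours.

import numpy as np
import librosa

def log_mel(wave, sr, n_mels=128, n_fft=2048, hop_length=512):
    """Log-Mel spectrogram of a 1.28 s audio window (parameter values from Sec. IV-B)."""
    mel = librosa.feature.melspectrogram(y=wave, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # log compression (dB scale)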

E. TRAINING THE ARCHITECTURE
Cross-entropy loss is used for training the multimodal neural network architecture. Considering that there are L emotion category labels, the output scores for an observation x are stored in y = [y_1, y_2, ..., y_L]. Each of these scores is passed through a normalized exponential function:

p_c = exp(y_c) / Σ_{k=1}^{L} exp(y_k),

to obtain a probability distribution over the predicted output classes. The loss for observation x is computed as:

Loss(x) = − Σ_{c=1}^{L} δ_{x,c} · log(p_c),

where δ_{x,c} = 1 if c is the correct class for observation x, and δ_{x,c} = 0 otherwise.
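In practice, the normalized exponential and the loss above correspond to a standard cross-entropy criterion; a minimal PyTorch sketch follows (batch size and label tensor are placeholders).

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()      # softmax + negative log-likelihood in one call
scores = torch.randn(16, 6)            # raw scores y for a batch of 16 samples, L = 6
labels = torch.randint(0, 6, (16,))    # ground-truth emotion indices
loss = criterion(scores, labels)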

IV. EXPERIMENTS
A. DATABASES
To prove the effectiveness of the proposed approach, we conducted experiments on two benchmark multimodal datasets for emotion recognition, namely the Crowd-Sourced Emotional Multimodal Actors Dataset (CREMA-D) [11] and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [12]. In addition, we also validated our approach on another challenging dataset, which was used during the Joint Challenge on Compound Emotion Recognition and Multimodal (Audio, Facial and Gesture) based Emotion Recognition (CER&MMER) under FG2020 (FG2020-ER) [36]. CREMA-D [11] is a dataset of 7442 original clips from 91 actors (48 male and 43 female), with ages between 20 and 74 years and various ethnic backgrounds (e.g., African American, Asian, Caucasian, Hispanic). Actors were asked to interpret 12 sentences by emphasizing a particular emotion from a list of 6 emotions (i.e., anger, disgust, fear, happy, neutral, sad), at a particular intensity level (e.g., low, medium, high, unspecified). In the case of the CREMA-D dataset, the audio-video sequences have an average length of 2.54 seconds (i.e., the length of the video sequences ranges from 1.27 seconds to 5 seconds).
RAVDESS [12] contains 1440 speech files, along with the corresponding video files. 24 professional actors (12 male and 12 female) were asked to vocalize two lexically-matched statements in a neutral North American accent, while emphasizing 7 categories of emotions (i.e., calm, happy, sad, angry, fearful, surprise, and disgust) at two levels of emotional intensity (i.e., normal and strong). In the case of the RAVDESS dataset, the average length of the audio-video sequences is 3.74 seconds (i.e., the length of the video sequences ranges from 2.99 seconds to 5.31 seconds). Compared to CREMA-D, the RAVDESS dataset, containing only speech and video, is not only smaller, but also has a greater number of emotion category labels.
The FG2020-ER [36] corpus contains recordings registered in studio conditions, acted out by 16 professional actors (8 male and 8 female). The target is to recognize emotions from three modalities, namely facial expressions, body movement / gestures, and speech. The actors expressed 7 types of emotions, i.e., neutral, sad, surprise, fear, anger, disgust, happy. In our setup, we used only the facial expression and speech modalities. The FG2020-ER dataset was already split into training and testing sets, containing 314 and 140 audio-video files, respectively. The FG2020-ER dataset is challenging since it contains sequences with lengths that vary from 1.8 to 11.6 seconds and an average length of 4.65 seconds.

B. IMPLEMENTATION DETAILS
The videos in all three datasets are characterized by a frame rate of 30 fps. However, the audio sample rate differs between datasets, namely 16 kHz for CREMA-D and 48 kHz for the RAVDESS and FG2020-ER datasets. The video frames, containing only the cropped faces of the subjects (i.e., obtained using the MTCNN algorithm [28]), were resized to 80 × 98 pixels. For each temporal segment, we consider a moving window of 1.28 seconds of audio, with the middle sample taken within a small offset o of 10 ms around the representative frame. The Log-Mel spectrograms of the audio signals, computed with n_mel = 128 evenly-spaced frequencies on the Mel scale, 2048 STFT window samples and a hop length of 512 samples, were resized to 192 × 120. We implemented our model using the PyTorch framework [37]. For the end-to-end training, we used a batch size of 16 and stochastic gradient descent (SGD) with Nesterov's Accelerated Gradient method, for which we considered a 0.9 momentum coefficient [38], a learning rate of 1e-4, and 50 epochs. Both the CREMA-D and RAVDESS datasets were randomly split into disjoint training (80%) and test (20%) subsets, with no overlap between subjects in the two subsets.
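For reference, the optimizer configuration reported above can be set up as in the following sketch; `model` stands for the assembled TA-AVN network, assumed to be defined elsewhere.

import torch

def make_optimizer(model):
    """SGD with Nesterov's Accelerated Gradient, matching the reported hyper-parameters."""
    return torch.optim.SGD(model.parameters(), lr=1e-4,
                           momentum=0.9, nesterov=True)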

C. EXPERIMENTAL RESULTS
This subsection presents the performance achieved by our proposed method, called TA-AVN, on the two benchmarks, CREMA-D and RAVDESS. We further extend our analysis to the newly introduced multimodal dataset, FG2020-ER. We provide comparisons with other recent methods that achieved state-of-the-art performance results on these datasets.

1) Performance assessment
For our experiments, we varied the number of temporal segments between 1 and 18, with a step of 3, and successively increased the augmentation factor from 1 to 5. The results are shown in Fig. 2. Increasing the number of temporal segments improves the overall recognition accuracy, i.e., an increase of 14.17 % for CREMA-D and 22.92 % for RAVDESS when comparing the cases of 1 versus 12 temporal segments. However, increasing the number of temporal segments beyond 12 leads to the extraction of redundant information and increases the computational time needed for training the end-to-end recognition system, without improving the performance.
The effect of augmenting the training dataset is best observed when the dataset is not very large, as in the case of RAVDESS. In the case of RAVDESS and 12 temporal segments considered, we observe an increase of 5.83 % in overall accuracy when augmenting the training dataset 5 times. This is also shown in Fig. 3 and Fig. 4, which contain the accuracy and loss curves for 50 epochs, reported for the CREMA-D and RAVDESS datasets. In both cases, the convergence is achieved in a smaller number of epochs when the dataset used for training is augmented, i.e., around 15-20 epochs are sufficient to train the end-to-end recognition system with an augmentation factor of 5. Moreover, comparing the results in Fig. 3 and Fig. 4 in terms of convergence, the benefit of augmenting the training set is greater for the RAVDESS dataset, which is almost 5 times smaller than the CREMA-D dataset. In addition, when augmentation is used during training, an audio-video pair is used several times (i.e., the number of times is controlled by the augmentation factor). Since the selection process is randomized, different representative frames and audio sequences are chosen for each reused audio-video pair. This shows that the augmentation technique, inherently induced by the randomized construction of the end-to-end framework, is very important in cases when the amount of annotated training data is insufficient.
Furthermore, we investigated the influence of the offset value over the performance of the emotion recognition system. In this regard, we varied the value of the offset o ∈ {0, 5, 10, 15} ms around the representative frame, for both training and testing phases. The results are shown in Fig. 6. We noted that using a completely synchronized audio-video pair (i.e., with no offset) results in lower performance when dealing with an insufficient amount of training data, as in the case of RAVDESS. Thus, allowing a certain degree of freedom (i.e., 10 ms) when selecting the audio sequences around the representative frame, combined with the aforementioned augmentation technique, yields a significant improvement in the robustness of the recognition system.
In addition, we considered aggregating the temporal information using a weighted sum instead of the simple addition. Specifically, we replaced equation (2) with:

y = Σ_{j=1}^{N} a_j · y_j,

with a_j being the learned weight corresponding to the j-th temporal segment (a sketch of this variant is given at the end of this subsection). However, the performance of the model was not improved compared to the aggregation using the simple addition. Specifically, we reached an overall accuracy of 78.5 % for CREMA-D (i.e., a decrease of 5.5 %) and 71.5 % for RAVDESS (i.e., a decrease of 7.2 %). Fig. 5 shows the confusion matrices obtained by our approach on the CREMA-D and RAVDESS datasets. All emotion categories have been identified with a high level of per-class accuracy, as indicated by the diagonal elements of both matrices. In both cases, anger, happiness and neutrality are retrieved with a higher level of accuracy compared to sadness and fear. However, we notice that fear and sadness are confused with other emotions. This is in line with the human performance achieved on these datasets [11], [12].
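As referenced above, the weighted aggregation variant could be realised with one learnable weight per temporal segment, as in the following sketch; the class name and tensor layout are assumptions.

import torch
import torch.nn as nn

class WeightedAggregation(nn.Module):
    """Replace the simple sum of equation (2) with a learned weighted sum."""
    def __init__(self, n_segments=12):
        super().__init__()
        self.a = nn.Parameter(torch.ones(n_segments))   # one weight a_j per segment

    def forward(self, segment_scores):                  # segment_scores: (B, N, L)
        return (self.a.view(1, -1, 1) * segment_scores).sum(dim=1)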

2) Comparisons with other methods
We compared our proposed temporal aggregation technique for emotion recognition with human performance and with several other recent approaches that achieved high performance results on the two benchmarks, CREMA-D and RAVDESS. In [7], the authors propose combining facial and audio temporal features using recursive attention. [8] focuses on generating temporal joint audio-video embeddings (via LSTM) in an end-to-end multimodal deep learning metric paradigm, in which the visual features are extracted with a 3D-CNN [39] and the audio features are computed using SoundNet with 8 convolutional layers [40]. In [23], two encoder sub-networks, integrated in a Multi-Head Self-Attention framework, are used to extract features from the video frames (i.e., pretrained VGG-M containing 5 convolutional layers [41]) and audio signals (i.e., pretrained VGGish with 16 convolutional layers [42]). The metric learning paradigm is also employed in [26], with visual representations retrieved via the VGG model [13] and audio features computed using the speech analysis toolkit openSMILE (e.g., energy, spectral and voicing related Low-Level Descriptors).
In Table 3, we show the overall accuracy rates both when learning the model parameters on the training datasets with no augmentation and when using the five times-augmented training datasets. Our approach outperformed previous results and even human performance in the case of CREMA-D. In the case of RAVDESS, TA-AVN achieved the closest result to human perception, i.e., 78.7 % overall accuracy achieved for 12 temporal segments and a five times-augmented dataset. Moreover, we have measured the impact that the random selection of the multimodal information within a temporal offset has on the recognition system. The experiments show that, when the amount of training data is insufficient, the random selection of the multimodal information within a small temporal offset leads to improved performance, e.g., for the RAVDESS dataset, the overall accuracy achieved by TA-AVN decreased by 7.9 % when no temporal offset was considered.
As already mentioned, we tested our TA-AVN method on another challenging dataset, FG2020-ER, characterized by a very small amount of training data. Using the same settings as above, we obtained an accuracy of 75 %, using only the audio and video information and 15 temporal segments. The number of temporal segments was chosen slightly higher than the 12 segments that achieved top-level accuracy for CREMA-D and RAVDESS, since the majority of the video sequences in FG2020-ER are longer than those in CREMA-D and RAVDESS. The accuracy is close to the one reached by the competition's winner [43], 76.43 %, which was achieved by fusing audio, facial and gesture information. As in the case of the CREMA-D and RAVDESS datasets, the results on FG2020-ER further prove the stability of our approach for videos with varying temporal length.

3) Real-time emotion recognition
Compared to our approach, the majority of the top-performing methods employ convolutional neural networks with a higher number of layers (e.g., variants of VGG neural networks [23], [26], 3D-CNN and SoundNet [8]), leading to an increased inference time. Undoubtedly, a key factor for the emotion recognition task is the processing time. For this performance parameter, we report an average of 7.5 ms for face detection and alignment using MTCNN [28], 12.8 ms for feature extraction and 5.6 ms for inference through the neural network architecture, yielding a total of 25.9 ms when 12 temporal segments are considered. Considering the low processing time, the proposed TA-AVN system is a valid candidate for real-time emotion recognition, since the face-to-face communication between humans can be regarded as a real-time process with a time scale of about 40 ms [44]. We mention that the experiments were conducted on an Intel Xeon E5-1680v3, 8 cores at 3.2 GHz, equipped with an Nvidia Quadro M4000 GPU with 8 GB RAM.
The number of parameters of the TA-AVN model depends on the number of temporal segments that are considered for the analysis. As shown in Fig. 2, a low number of temporal segments is sufficient to reach a high level of performance (e.g., 12 temporal segments for the CREMA-D and RAVDESS datasets, containing videos with average lengths below 4 seconds). The TA-AVN models used for the CREMA-D and RAVDESS datasets contain 18.7 million parameters, whereas the TA-AVN model used for the FG2020-ER dataset (i.e., with the majority of the video sequences longer than 4 seconds) comprises 23.3 million parameters. In both cases, the number of parameters is smaller than for other architectures used for real-time execution, e.g., the popular ResNet-50 [45], containing 25.6 million parameters, achieved a small inference time on mobile devices with the DeepRebirth acceleration framework [46] and even a 26 ms inference time on smartphones using a dedicated programming framework called CADNN [47].

V. CONCLUSION
In this paper, we presented a robust end-to-end architecture that incorporates multimodal information for emotion recognition. The proposed TA-AVN architecture is flexible in combining audio and video data with different sampling rates across modalities (i.e., video at 30 fps and audio signals at 16 kHz or 48 kHz). Similar to how our brain combines knowledge gained through direct observation, the proposed architecture allows the aggregation of temporal multimodal information in an asynchronous manner, accommodating different speeds of expressing emotions. Considering different temporal shifts between modalities and a randomized selection of audio-visual content further leads to a natural augmentation technique for the training dataset, yielding improved performance when the amount of available training data is limited. Since the computational complexity of the proposed solution is not very high (i.e., the audio and video networks follow a simple convolutional neural network architecture and the final classification subnetwork of TA-AVN is based on a fully-connected layer), TA-AVN can be considered as a candidate for online multimodal emotion recognition. Compared to other recent approaches in the literature, the proposed technique achieved competitive results on two important and challenging benchmark datasets, i.e., best overall accuracies of 84.0 % for CREMA-D and 78.7 % for RAVDESS.