Toward Vehicle Occupant-Invariant Models for Activity Characterization

With the advent of self-driving cars and the push by large companies into fully driverless transportation services, monitoring passenger behaviour in vehicles is becoming increasingly important for several reasons, such as ensuring safety and comfort. Although several human action recognition (HAR) methods have been proposed, developing a true HAR system remains a very challenging task. If the dataset used to train a model contains a small number of actors, the model can become biased towards these actors and their unique characteristics. This can cause the model to generalise poorly when confronted with new actors performing the same actions. This limitation is particularly acute when developing models to characterise the activities of vehicle occupants, for which datasets are small and scarce. In this study, we describe and evaluate three different methods that aim to address this actor bias and assess their performance in detecting in-vehicle violence. These methods work either by removing actor-specific information from the model’s features during training or by using data that is independent of the actor, such as information about body posture. The experimental results show improvements over the baseline model when evaluated with real data. On the Hanau03 Vito dataset, the accuracy improved from 65.33% to 69.41%. On the Sunnyvale dataset, the accuracy improved from 82.81% to 86.62%.


I. INTRODUCTION
Successfully capturing passenger activity has far-reaching implications for both the user experience and safety features in autonomous vehicles (AVs). Without a driver responsible for the safety and integrity of the vehicle and its occupants, it is incumbent upon automated detection systems to monitor occupant well-being and actions and to detect potentially harmful behaviour or even violence. However, the multitude of possible actions that can be represented, the variability in how different individuals represent the same actions, the heterogeneity of sensors and types of information collected, and the influence of external factors still pose significant challenges to this task [1]. The current state of the art in action recognition is based on deep models, but their use for vehicle occupant action recognition is not without problems.
The associate editor coordinating the review of this manuscript and approving it for publication was Khoa Luu.
Training deep learning models requires significant amounts of data, which escalates with the complexity of the models to avoid overfitting. Moreover, the available datasets are usually split, with one part used for training and another part used for subsequent testing. Dataset availability and size can be critical when dealing with individuals engaged in different activities, both for legal reasons and because of the effort involved in preparation. If the dataset used to train a model contains only a small number of actors, the model can be biased toward them, i.e., to specific individuals and their unique characteristics [2]. This causes the model to perform poorly when confronted with a different set of actors performing the same activities. In the context of this study, an actor is defined as a specific group of people in the vehicle (rather than a specific person). This means that different actors may have some individuals in common. In this work, we focus on the detection of violence and non-violence in the vehicle and describe and evaluate three methods that aim to counteract actor bias by removing actor-specific information from the features of the model, or by providing the model with data that is independent of the actors, such as information about posture, which contributes to a more robust model. The presented research focuses on the monitoring of occupants of autonomous vehicles, more specifically on the detection of violence and non-violence. We believe that this scenario is particularly relevant since: (1) there is a strong focus on people with high heterogeneity; (2) to our knowledge, there are no available datasets for this scenario, leading to a high dependence of the models on the actors.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Our main contributions are:
• Using a dataset that is domain-specific to the target scenario;
• The application of domain generalization methodologies;
• The application of regularization techniques for the specific scenario of the vehicle interior;
• The use of body posture to diversify data sources and consequently reduce actor bias.
In addition to the introduction, this paper contains five other sections. Section II presents the current state of the art for this problem. In Section III the methods used in this study are reviewed. In Section IV the data and experimental settings used in this work are presented. In Section V we analyse the results. Section VI gives the conclusions from this work and presents ideas for future work.

II. RELATED WORK
A. VIDEO ACTION RECOGNITION
Action classification requires models that focus on modelling spatio-temporal information. Early attempts at action recognition used compact video descriptors to extract handcrafted spatio-temporal features [3], [4], [5]. In recent years, the use of these techniques has declined, and deep learning-based methods have pushed the boundaries of the state-of-the-art in action recognition. However, despite their success, these models have raised concerns about their bias towards specific domain characteristics. This problem arises because deep learning methods rely on training data. This dependency leads to the developed models being biased towards, for example, actor features when the dataset used lacks diversity.
One of the most successful deep learning approaches is the two-stream model [6]. In these models, there are two separate deep Convolutional Networks (ConvNets) for processing RGB frames and optical flow, which are then combined by late fusion [7]. Based on this model, several approaches to action recognition have been proposed [8], [9], [10], [11]. Another approach that is proving successful is the use of 3D Convolutional Neural Networks (CNNs). These networks behave similarly to 2D CNNs, but use 3D convolutions to extract features in spatial and temporal dimensions [12]. The C3D [13] network is an example of this type of approach, where the spatio-temporal features are learned using 3D ConvNets trained on large-scale supervised video datasets. The use of large-scale video datasets was possible because C3D can handle video frames as input without any preprocessing. Other variants based on 3D CNNs have been proposed [14], [15]. One worth mentioning is an architecture that combines the two-stream model with 3D ConvNets, called I3D [15], which served as the basis for the model created in this work and is still one of the best performing architectures for action recognition. One of the problems introduced by the use of 3D convolutions is the high computational cost. To solve this problem, several works have proposed to treat the spatial and temporal dimensions differently. Some proposed the decomposition of 3D convolutions into 2D spatial convolutions and 1D temporal convolutions, such as S3D [16], P3D [17], and R(2+1)D [18]. Furthermore, SlowFast [19] shows that space and time should not be treated symmetrically. Therefore, it introduces a two-path structure to handle slow and fast motion separately.
The models mentioned so far use random spatial crops of video frames as inputs during training. Since they do not explicitly focus on the human body, it is easy to overfit the scenes and objects in the video. Therefore, skeleton data was used to focus the action recognition on the human body. This has the advantage of being lightweight and free of scene cues. As for pose-based methods for action recognition, the main differences are the use of CNNs or Graph Convolution Networks (GCN). CNN-based [20], [21], [22], [23] methods represent the skeleton with a pseudo-image and thus recognize actions in the same way as image classification. Nevertheless, skeleton data is essentially a graph in non-Euclidean space with skeleton joints as vertices and bones as edges. Therefore, GCN-based [24], [25], [26] methods were proposed to capture joint interactions on the skeleton graphs, explicitly considering the adjacent relationship between joints in the non-Euclidean space.

B. REPRESENTATION BIASES
Representation bias is a problem in various image and video classification problems, and studies have been conducted to analyse and mitigate it. Li et al. mitigated scene, object and people biases by re-sampling the original video datasets [27], [28]. The re-sampling approach reduces representation bias, but it also reduces the number of training data, which is not desirable for deep learning methods. Another proposed method was to use adversarial losses for different scene types to mitigate scene biases [29]. Similarly, adversarial learning procedures allowed learning signer-invariant latent representations to be highly discriminating for sign recognition [2].
The primary source of representation bias is the dataset used for model training. Some studies have shown that video datasets for action recognition exhibit biases toward objects, scenes, and people [27], [28], [29], [30]. Video datasets for action recognition are mainly divided into generic and fine-grained action recognition datasets. Generic action recognition datasets provide the generic action recognition task, which attempts to classify various action categories in various domains. Such datasets include videos from different domains, such as daily activities, sports, and entertainment. Due to the wider diversity of domains in these datasets, the models trained on them can recognise actions indirectly, i.e., by the presence of an object or a person. The use of such biased models leads to degenerate transferability and incorrectly recognizes novel actions in the same static cues. Popular datasets for generic action recognition are Kinetics [31], Moments in Time [32], ActivityNet [33], UCF-101 [34], and HMDB-51 [35]. Fine-grained action recognition datasets have been recently released, providing videos in a specific domain. Something-Something [36] includes the videos of fine-grained actions of human-object interactions. Jester [37] is a dataset for hand gesture recognition. Diving48 [27] is a video dataset in sports. These datasets overcome some problems associated with representation biases since the same domain's static cues, such as objects and scenes, are similar. Therefore, it is not easy to recognise actions based on only static cues. However, these datasets might face biases in domain static cues related to people. This happens especially in datasets with a small set of actors. In this study, the datasets used are fine-grained action recognition datasets. Nevertheless, the number of actors present is quite small. Therefore, several techniques were implemented to improve the generalisation capability regarding the actors' characteristics.

C. IN-VEHICLE OCCUPANT ACTIVITY RECOGNITION
In-vehicle activity recognition is still relatively unexplored. Some proposals addressing this topic demonstrated that activity recognition might be a possible approach for monitoring the vehicle's occupants. Some proposals focused on violence detection to monitor occupants through anomaly detection of occupant's behaviour [38] or recognition of occupant's interactions [39]. Beyond detecting violence, the recognition of occupant's activity using different modalities has also been explored, specifically using audiovisual features [1]. Audio features appeared from applying this type of feature for the classification of group emotion [40]. This work focused primarily on the available hardware and energy consumption constraints associated with implementing an activity recognition system in a vehicle. Using an audio module and a cascading strategy demonstrated reductions in memory requirements and computational demands. This reduction was most significant when the audio module was the first processing block. This research paper continues these previous studies, focusing on a hypothesis previously identified of the possibility of bias in the results obtained so far.
This hypothesis arose because the past studies used small datasets with a small set of actors.

III. METHODOLOGIES
In this section, the baseline method used in this work and the methods adapted to promote actor generalization are reviewed. The goal of the latter is to accurately predict violence and non-violence in videos of vehicle occupants, regardless of the actors present. Each of these methods aims to achieve this goal in different ways.
The Adversarial Learning method [2] uses an additional network that learns to classify the different actors in the training set and later uses that knowledge to remove actor information from the features of the network, making it less biased toward the actors.
The Bilevel Learning methodology [41] assigns a different weight to each mini-batch of the training set based on how much its gradients agree with the gradients of a validation mini-batch. This reduces the actor bias by giving more weight to training mini-batches that have similar gradients to validation mini-batches, minimizing the error on the validation set and leading to better generalization capabilities.
The use of pose information is also studied, both as a substitute and as a complement to the RGB information. We calculate the pose of each person in the frame and generate the keypoint map that is then given to the network [42]. Since the keypoint map does not contain any direct information about the actors, it helps to reduce actor bias.

A. BASELINE
The baseline model used in this work is based on the model proposed in [40], presented in Figure 1. It uses RGB data for recognizing actions, and it has achieved results comparable to state-of-the-art methodologies. This model is a 3D CNN, which allows the network to use the temporal information more efficiently [43], [44], [45], [46], [47], [48], [49], [50], [51], [52]. The output of 3D convolutions is a video volume that retains the temporal information of the input, making them better suited to video classification than standard 2D convolutions, which lack the extra dimension used for the temporal component.
FIGURE 1. Schema of the network used for activity recognition (adapted from [40]).
The model is composed of the 3D ResNet50's convolutional encoder blocks, an average pooling layer, a dropout layer, and a fully-connected layer with 2048 inputs and 2 outputs (predicting violence or non-violence). The convolutional encoder has one 3D convolutional layer with 64 filters of size 7 × 7 × 7, a batch normalization layer and ReLU activation function. The following layers are three residual blocks of type A, four blocks of type B, six blocks of type C, and three blocks of type D, as in the standard ResNet-50 configuration. Residual blocks contain three convolutional layers, with a filter size of 1 × 1 × 1, 3 × 3 × 3, and 1 × 1 × 1, respectively. The number of filters of the first two convolutional layers depends on the type of block. Type A contains 64 filters, type B contains 128 filters, type C contains 256 filters, and type D contains 512 filters. The first two convolutional layers are each followed by a batch normalisation layer. The last convolutional layer of each block contains four times the number of filters of the first two layers and is followed by a batch normalisation layer and ReLU activation function. More details on the network architecture can be found in the original ResNet paper [53].
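As a quick consistency check on the description above, the following minimal sketch computes the number of output channels of each block type from the stated filter counts and the factor of four applied in the last convolution of a block. The names `BASE_FILTERS`, `EXPANSION` and `block_output_channels` are illustrative and do not come from the original implementation.

```python
# Base filter counts of the first two convolutions in each block type,
# as stated in the text.
BASE_FILTERS = {"A": 64, "B": 128, "C": 256, "D": 512}

# The last 1x1x1 convolution of a block has four times the base filters.
EXPANSION = 4


def block_output_channels(base_filters=BASE_FILTERS, expansion=EXPANSION):
    """Return the number of output channels produced by each block type."""
    return {name: base * expansion for name, base in base_filters.items()}


channels = block_output_channels()
print(channels)       # {'A': 256, 'B': 512, 'C': 1024, 'D': 2048}
print(channels["D"])  # 2048, the input size of the fully-connected layer
```

The output of the type D blocks has 2048 channels, which matches the 2048 inputs of the fully-connected layer once average pooling has collapsed the spatial and temporal dimensions.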
The network was pre-trained with videos of the Moments in Time dataset. During training, the layers of the network were frozen, except for the last 2 layers. Class weights were added to the cross-entropy loss function to combat class imbalances. The Adam optimizer was used with a learning rate of $1 \times 10^{-4}$. The work was developed using PyTorch. In all the experiments, we report the balanced accuracy to account for any class imbalances in the datasets.
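To illustrate why balanced accuracy is preferred under class imbalance, here is a minimal pure-Python sketch that computes it as the mean of the per-class recalls. The toy labels are invented for illustration only and are not taken from the datasets used in this work.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


def plain_accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy labels: 0 = non-violence (majority class), 1 = violence.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(plain_accuracy(y_true, y_pred))     # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

In this toy case the plain accuracy hides the fact that half of the minority-class samples are missed, whereas the balanced accuracy weighs both classes equally.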

B. ADVERSARIAL LEARNING
To accurately predict violence and non-violence in video, independently of the actors that are present, the model should be trained in a way where the latent representations learned by the model preserve the information relative to the action, and discard the information relative to the actors, which may negatively impact the classification. To accomplish this, the methodology proposed in [2] was adapted. It presents a network architecture and an adversarial training objective, which addresses the signer-independent problem (actor-independent problem, in our case). The model consists of a feature extractor, which maps input videos to latent representations, and two classifiers. In our case, the network architecture is the same as our baseline model. The feature extractor is the inflated ResNet-50 network, and the two classifiers are dense layers. The action-classifier predicts violence or non-violence, and the actor-classifier predicts the actor present in the video.
During the learning process, the feature extractor is simultaneously trained to help the action-classifier while trying to fool the actor-classifier. Figure 2 shows an overview of the network architecture and the loss functions.
Let $\mathcal{D} = \{(X_i, y_i, a_i)\}_{i=1}^{N}$ represent a labeled dataset with $N$ samples, where $X_i$ represents the $i$-th video sample (a set of 8 concatenated RGB frames), $y_i$ represents the action label (violence or non-violence), $a_i \in A$ represents the actor label and $A$ represents the set of actor labels.
The feature extractor learns an encoding function $h(X; \theta_h)$, which, using the parameters $\theta_h$, encodes an input video sample $X$ into a latent representation $h$.
The action-classifier receives the latent representation $h$ and learns a function $f(h; \theta_f)$, parameterized by $\theta_f$, that gives the predicted probabilities $p(y \mid h; \theta_f)$ for each action class.
The actor-classifier learns a function $g(h; \theta_g)$, which, using the parameters $\theta_g$, maps the latent representation $h$ to the predicted probabilities $p(a \mid h; \theta_g)$ for each actor.
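The actor-classifier is trained to minimize the negative log-likelihood of correct actor predictions:
$$L_{actor}(\theta_h, \theta_g) = -\frac{1}{N} \sum_{i=1}^{N} \log p(a_i \mid h(X_i; \theta_h); \theta_g) \tag{1}$$
The feature extractor and the action-classifier are trained to minimize the negative log-likelihood of correct action predictions:
$$L_{action}(\theta_h, \theta_f) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid h(X_i; \theta_h); \theta_f) \tag{2}$$
Additionally, the predictions of the actor-classifier should be close to uniform, meaning that it cannot do better than randomly guessing the actor identity. The following loss (eq. 3) adjusts the weights of the feature extractor to make the predictions of the actor-classifier close to uniform, and it is an adversarial loss with respect to the actor classification loss $L_{actor}$:
$$L_{adv}(\theta_h) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{a \in A} \frac{1}{|A|} \log p(a \mid h(X_i; \theta_h); \theta_g) \tag{3}$$
In order to further encourage the actor invariance properties of the latent representation $h$, another loss $L_{transfer}$ was added. It minimizes the distance between the hidden latent representations of different actors at each layer of the feature extraction network. The distance $D^{(m)}$ between the latent representations $h^{(m)}(\cdot; \theta_h)$ of actors $a$ and $t$ at the $m$-th layer is calculated as follows:
$$D^{(m)}(a, t) = \left\| \frac{1}{N_a} \sum_{i : a_i = a} h^{(m)}(X_i; \theta_h) - \frac{1}{N_t} \sum_{j : a_j = t} h^{(m)}(X_j; \theta_h) \right\|_2 \tag{4}$$
where $\| \cdot \|_2$ is the l2-norm, and $N_a$ and $N_t$ represent the number of training examples of actors $a$ and $t$, respectively. This assumes that the dataset is balanced with respect to the action labels for each actor. If this is not the case, each mini-batch used during training could be designed to fulfil this requirement.
To calculate the actor transfer loss at the $m$-th layer, the pairwise distances between all actors are summed:
$$L_{transfer}^{(m)}(\theta_h) = \sum_{a \in A} \sum_{t \in A,\, t \neq a} D^{(m)}(a, t) \tag{5}$$
The final loss $L_{transfer}$ is a weighted sum of the loss calculated at each of the $M$ layers of the feature extraction network, where $\beta^{(m)}$ is the weight attributed to layer $m$, which controls the importance of that layer:
$$L_{transfer}(\theta_h) = \sum_{m=1}^{M} \beta^{(m)} L_{transfer}^{(m)}(\theta_h) \tag{6}$$
Therefore, the final objective can be written as:
$$\min_{\theta_h, \theta_f} \; L_{action}(\theta_h, \theta_f) + \lambda L_{adv}(\theta_h) + \gamma L_{transfer}(\theta_h) \tag{7}$$
where $\lambda$ and $\gamma$ are weights attributed to each loss component to control their relative importance.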
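The two actor-invariance terms described above can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the uniformity term is taken to be the cross-entropy between a uniform target and the predicted actor probabilities, the distance is taken to be the (unsquared) l2-norm between mean latent vectors, and the pairwise sum runs over ordered pairs of distinct actors. All function names are hypothetical.

```python
import math


def uniform_adversarial_loss(actor_probs):
    """Cross-entropy between a uniform target and predicted actor probabilities.

    actor_probs: list of per-sample probability lists over the actors.
    The value is smallest when every prediction is exactly uniform.
    """
    total = 0.0
    for probs in actor_probs:
        total += -sum(math.log(p) for p in probs) / len(probs)
    return total / len(actor_probs)


def mean_vector(vectors):
    """Element-wise mean of a list of equally sized feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def actor_distance(feats_a, feats_t):
    """l2 distance between the mean latent representations of two actors."""
    mu_a, mu_t = mean_vector(feats_a), mean_vector(feats_t)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(mu_a, mu_t)))


def layer_transfer_loss(features_by_actor):
    """Sum of distances over all ordered pairs of distinct actors (one layer)."""
    actors = list(features_by_actor)
    return sum(
        actor_distance(features_by_actor[a], features_by_actor[t])
        for a in actors
        for t in actors
        if a != t
    )


# A uniform prediction gives a lower loss than a confident one.
print(uniform_adversarial_loss([[0.5, 0.5]]) < uniform_adversarial_loss([[0.9, 0.1]]))  # True

# Two actors whose mean features are (1, 0) and (4, 4): distance 5.
features = {"actor_1": [[0.0, 0.0], [2.0, 0.0]], "actor_2": [[4.0, 4.0], [4.0, 4.0]]}
print(actor_distance(features["actor_1"], features["actor_2"]))  # 5.0
```

In a full training loop these quantities would be computed on mini-batch features at each selected layer and combined with the action classification loss using the weights attributed to each component.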

C. DEEP BILEVEL LEARNING
In this study, the Deep Bilevel Learning [41] methodology is explored as another strategy to achieve actor generalization. Deep Bilevel Learning improves the training process by giving different weights to each mini-batch in the training set. These mini-batch weights favour the batches whose gradients match the gradients of the validation mini-batches, minimizing the error on the validation set and resulting in a model with better generalization capabilities. A batch size of 16 was used, and for each weight update, 4 training mini-batches and 1 validation mini-batch with the same class distributions were selected. Figure 3 shows an overview of the methodology.
The weight $\omega_i$ attributed to the $i$-th mini-batch in the training set is calculated, following [41], from the inner products between its gradient and the gradients of the validation mini-batches. In this calculation, $T_t$ represents the collection of training mini-batches used at the $t$-th training iteration, $V_t$ is the collection of validation mini-batches used at the $t$-th training iteration, $\theta_t$ are the model parameters at the $t$-th training iteration, $\nabla l_i(\theta_t)$ is the gradient of the $i$-th mini-batch in the training set, $\nabla l_j(\theta_t)^{T}$ is the transposed gradient of the $j$-th mini-batch in the validation set, $\lambda$ is an adjustable hyperparameter, and $\mu$ is a term added to avoid divisions by zero.
In the resulting gradient descent step, the gradient of each training mini-batch is scaled by its weight, so $\omega_i$ can be interpreted as a learning rate specific to each training mini-batch. When the gradient of a mini-batch in the training set, $\nabla l_i(\theta_t)$, points in the same direction as the gradient of a mini-batch in the validation set, $\nabla l_j(\theta_t)$, their inner product is $\nabla l_j(\theta_t)^{T} \nabla l_i(\theta_t) > 0$, which gives a positive weight; when the gradients point in different directions (do not agree), the inner product is $\nabla l_j(\theta_t)^{T} \nabla l_i(\theta_t) \leq 0$, which gives a weight with a negative value or zero.
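The gradient-agreement idea can be sketched as follows. This is a simplified illustration, not the exact weighting function of [41]: the raw score of each training mini-batch is the sum of its inner products with the validation gradients, and the scores are divided by the sum of their absolute values plus a small constant `mu`, which is an assumed normalisation. The adjustable hyperparameter is omitted, and all names are hypothetical.

```python
def dot(u, v):
    """Inner product of two flat gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def agreement_weights(train_grads, val_grads, mu=1e-8):
    """Weight each training mini-batch by its agreement with validation gradients.

    train_grads, val_grads: lists of flat gradient vectors (lists of floats).
    The normalisation used here is an assumption made for illustration.
    """
    scores = [sum(dot(gv, gt) for gv in val_grads) for gt in train_grads]
    norm = sum(abs(s) for s in scores) + mu  # mu avoids division by zero
    return [s / norm for s in scores]


def weighted_step(theta, train_grads, weights):
    """Apply theta <- theta - sum_i w_i * grad_i and return the new parameters."""
    new_theta = list(theta)
    for w, grad in zip(weights, train_grads):
        for k, g in enumerate(grad):
            new_theta[k] -= w * g
    return new_theta


# Three training mini-batches: one agrees with the validation gradient,
# one opposes it, and one is orthogonal to it.
train_grads = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
val_grads = [[1.0, 0.0]]

weights = agreement_weights(train_grads, val_grads)
print(weights[0] > 0, weights[1] < 0, weights[2] == 0)  # True True True
print(weighted_step([0.0, 0.0], train_grads, weights))
```

The opposing mini-batch receives a negative weight, so its contribution to the update is reversed, while the orthogonal mini-batch is ignored.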

D. USING POSE INFORMATION
The baseline model uses RGB data for action recognition and has achieved results comparable to state-of-the-art methodologies.
A series of experiments were conducted to measure the effects of using pose information as a complement or alternative to RGB data [42]. Using pose information as an input can have some advantages, as the model receives data that is simpler and easier to interpret, and can also help improve robustness to the actors. When using the raw RGB data, however, the model needs to extract and interpret more complex features, which can make the classification process more complex.
To calculate the key points of each person in a frame, a Keypoint R-CNN model with a ResNet-50-FPN backbone [54] is used. The model calculates the position of 17 key points, and whether they are visible or occluded. The key points calculated by the model are presented in Table 1 along with the corresponding assigned label (colour) and further illustrated in Figure 4.  To calculate the pose information for a certain video, we start by extracting each frame of the video. The pose information is extracted for each frame, consisting of the coordinates of the key points and the bounding box of each person in the frame.
For each frame, we generate an image with a black background and circles of a certain colour with a radius of 4 pixels, in the position of each key point. We added a different colour to each key point to label them, to make it possible for the model to distinguish each body part. Table 1 shows the key points and their corresponding colour. Figure 4 shows an example of an image and the corresponding key point map.
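The key point map generation described above can be sketched in pure Python as follows. The image is represented as nested lists to keep the example free of dependencies; the function name, the colour palette, and the choice to skip key points that are not visible are assumptions made for illustration.

```python
def draw_keypoint_map(keypoints, colours, height=224, width=224, radius=4):
    """Return an H x W x 3 image (nested lists) with one filled circle per key point.

    keypoints: list of (x, y, visible) tuples, one per key point.
    colours: list of (r, g, b) tuples, one per key point (hypothetical palette).
    Key points flagged as not visible are skipped (an assumption).
    """
    image = [[[0, 0, 0] for _ in range(width)] for _ in range(height)]
    for (x, y, visible), colour in zip(keypoints, colours):
        if not visible:
            continue
        cx, cy = int(round(x)), int(round(y))
        # Only scan the bounding square of the circle, clipped to the image.
        for py in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for px in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2:
                    image[py][px] = list(colour)
    return image


# Two key points: a visible one drawn in red and a hidden one that is skipped.
keypoints = [(10, 10, True), (50, 50, False)]
colours = [(255, 0, 0), (0, 255, 0)]
keypoint_map = draw_keypoint_map(keypoints, colours)

print(keypoint_map[10][10])  # [255, 0, 0]
print(keypoint_map[50][50])  # [0, 0, 0]
```

Applying this to every frame of a sample yields a stack of key point maps with the same spatial size as the RGB frames.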

IV. DATA AND EXPERIMENTAL SETTINGS
As previously stated, the proposed methodologies are tested in a shared autonomous vehicle scenario. Given the lack of freely available appropriate data, two datasets from Bosch Car Multimedia were used, which will be designated Hanau03 Vito and Sunnyvale. Each of the datasets contains videos of people performing different actions inside the vehicle.
The Hanau03 Vito dataset contains 74 actors, and the videos were captured from above with a fisheye lens. Figure 5 shows a frame extracted from one of the videos. The Sunnyvale dataset has 9 actors, and the videos were recorded from the front without a fisheye lens, making them more similar to the videos in the MMIT dataset, which was used to pre-train the video processing sub-module. Figure 6 shows a frame extracted from one of the videos. In these experiments, we focus on detecting violence and non-violence. Each annotated video segment of the dataset was divided into sub-segments that will be referred to as samples. These samples are then used to train and test the model. Each sample has a duration of 1 second and a frame rate of 8 frames per second (fps), which means that for each prediction, the model receives 8 concatenated frames with a resolution of 224 × 224 × 3 pixels. The number of samples extracted was 32739 for the Hanau03 Vito dataset and 46838 for the Sunnyvale dataset. Table 2 contains the number of samples that were extracted from the datasets, the number of actors, and the average number of samples per actor.
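The figures quoted above can be turned into a short worked calculation of the size of one model input and of the average number of samples per actor. This is plain arithmetic on the numbers given in the text; the variable and function names are illustrative.

```python
FRAMES_PER_SAMPLE = 8            # 1 second at 8 fps
FRAME_SHAPE = (224, 224, 3)      # height, width, colour channels

# (number of samples, number of actors) as reported in the text.
DATASETS = {"Hanau03 Vito": (32739, 74), "Sunnyvale": (46838, 9)}


def values_per_sample(frames=FRAMES_PER_SAMPLE, shape=FRAME_SHAPE):
    """Number of scalar values the model receives for one prediction."""
    height, width, channels = shape
    return frames * height * width * channels


def samples_per_actor(datasets=DATASETS):
    """Average number of samples per actor, rounded to one decimal place."""
    return {name: round(n / actors, 1) for name, (n, actors) in datasets.items()}


print(values_per_sample())  # 1204224
print(samples_per_actor())  # {'Hanau03 Vito': 442.4, 'Sunnyvale': 5204.2}
```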
As previously mentioned, an actor is defined as a specific group of people inside the vehicle (and not a specific person). This means that different actors could have some individuals in common. When splitting the dataset into train set, validation set and test set, it was ensured that each specific person was not in different sets simultaneously, meaning that each of the train, validation, and test sets were completely independent of each other in terms of actors. The datasets were split in a way that kept the number of samples in each class relatively balanced. Table 3 and Table 4 show the dataset splits.  Since the focus of this work is the detection of violence/non-violence the original classes of the datasets were grouped into those two categories. Table 5 and Table 6 show the original classes and their classification as violence or non-violence.
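Because actors are groups of people that may share individuals, keeping every person in a single set requires that actors with a person in common are assigned to the same set. One way to do this, sketched below, is to cluster actors with a union-find structure and then distribute whole clusters among the train, validation and test sets. This is an illustrative approach, not necessarily the procedure used to build the splits in this work, and the identifiers are hypothetical.

```python
def group_actors_by_shared_people(actors):
    """Cluster actors (groups of people) that share at least one individual.

    actors: dict mapping an actor id to the set of person ids it contains.
    Returns a list of sets of actor ids; no person appears in two clusters.
    """
    parent = {actor: actor for actor in actors}

    def find(actor):
        # Follow parent links to the cluster root, compressing the path.
        while parent[actor] != actor:
            parent[actor] = parent[parent[actor]]
            actor = parent[actor]
        return actor

    first_actor_of = {}  # person id -> first actor seen containing that person
    for actor, people in actors.items():
        for person in people:
            if person in first_actor_of:
                parent[find(actor)] = find(first_actor_of[person])
            else:
                first_actor_of[person] = actor

    clusters = {}
    for actor in actors:
        clusters.setdefault(find(actor), set()).add(actor)
    return list(clusters.values())


# g1 and g2 share person p2, so they must end up in the same set.
actors = {"g1": {"p1", "p2"}, "g2": {"p2", "p3"}, "g3": {"p4"}}
clusters = group_actors_by_shared_people(actors)
print(sorted(sorted(cluster) for cluster in clusters))  # [['g1', 'g2'], ['g3']]
```

Each cluster can then be assigned as a whole to one of the three sets, choosing the assignment that keeps the class proportions as balanced as possible.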
In addition to the Hanau03 Vito and Sunnyvale datasets three widely used and publicly available datasets are used: Moments in Time [32]; HMDB51 [35]; Hollywood [55]. When analysing the pose keypoint data it was noticed that the pose keypoints calculated for the Hanau03 Vito dataset were inaccurate in some situations due to the position of the camera (top view). Therefore, to further test the methodology that used the pose information, we decided to add these datasets. The advantages of these datasets are that they are all publicly available, and they contain actions that could be performed inside the vehicle.
The Moments in Time dataset is a large-scale action dataset. It contains one million 3-second videos and 339 classes. The HMDB51 dataset contains 6849 video clips from 51 action classes obtained from movies and web videos. The Hollywood dataset is a human action dataset which contains video samples obtained from 32 movies. Each sample has one or more labels corresponding to 8 action classes.
For each of the publicly available datasets, a subset of action classes was selected, based on the relevance for this study and for the context of in-vehicle action recognition.

V. RESULTS
A. ADVERSARIAL LEARNING
After training on the Hanau03 Vito dataset the baseline had an accuracy of 65.33% and the adversarial learning methodology had an accuracy of 62.42%. The actor classification accuracy during training was 2.04%. The actor accuracy should be close to random, as this indicates that the methodology is correctly removing the actor information from the features of the model. Table 7 shows the results. After training on the Sunnyvale dataset the baseline had an accuracy of 82.81% and the adversarial learning methodology had an accuracy of 58.11%. The actor classification accuracy during training was 35.96%. Table 8 shows the results. The obtained results indicate that the presented methodology did not improve the results of the baseline model. The hyperparameters used for the experiments were the ones proposed in the paper [2], and given that this problem uses a different dataset and a different network architecture, some tuning could be needed. Additionally, the datasets used are very small and contain a low number of samples per actor, which further increases the difficulty of the problem by making the network unable to correctly learn the distribution of each actor, making this methodology less effective. An interesting experiment would be to train the model from scratch, to be able to remove actor information from the first layers of the network, but more data would be required, since training the model from scratch with these datasets would result in an overfit model that would perform poorly when presented with new data.

B. BILEVEL LEARNING
After training on the Hanau03 Vito dataset, the model achieved an accuracy of 65.33% with the baseline and 68.19% with the bilevel learning methodology. Table 9 shows the results.
After training on the Sunnyvale dataset, the model achieved an accuracy of 82.81% with the baseline and 86.62% with the bilevel learning methodology. Table 10 shows the results.
Although the gradients of the validation set are not used directly to train the model, there may be some data leakage during the training process, since more weight is given to the training mini-batches that have gradients similar to the gradients of the validation mini-batches. This is also a reason why the data is split into training, validation and testing.
TABLE 9. Accuracy after training on the Hanau03 Vito dataset using the baseline and the bilevel learning methodology.
On both datasets, the training set accuracy decreased when using bilevel learning, which means that the model is not overfitting as much to the training data. There is also an increase in accuracy in both the validation set and the test set.

C. USING POSE INFORMATION
Previously, the input of the model was the RGB data of the video frames. This section presents two experiments that aim to assess if inputting pose information into the model has any impact on the performance.
The first experiment consists of training the model using only pose information, where instead of giving the model the RGB frames, the model is given the generated key point map. The second experiment consists of training the model with the pose information as a complement to the RGB frames. To accomplish this, the RGB frames are concatenated with the key point maps before giving them to the model, which results in a total of 6 channels (labels in the key point map are colours expressed in RGB). To accommodate this change, the number of input channels of the network is doubled and the pre-trained weights are copied to the new channels. The first layer of the network is also unfrozen, as 3 more channels were added to the filters of this layer. Table 11, Table 12 and Table 13 show the results on the Moments in Time, HMDB51 and Hollywood datasets, respectively.
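The change to the first layer can be illustrated with a small dependency-free sketch. The convolution weight is represented as nested lists indexed by output channel and then input channel; the pre-trained kernels of the 3 RGB input channels are copied to 3 new input channels that will receive the key point map. The function names are hypothetical, and whether the copied weights were rescaled in the original implementation is not stated, so no rescaling is applied here.

```python
import copy


def double_input_channels(conv_weight):
    """Duplicate the input-channel axis of a convolution weight.

    conv_weight: nested list indexed as [out_channel][in_channel] -> kernel.
    Returns a new weight whose input channels are the originals followed by
    independent copies of them.
    """
    return [
        copy.deepcopy(per_output) + copy.deepcopy(per_output)
        for per_output in conv_weight
    ]


def concat_rgb_and_pose(rgb_pixel, pose_pixel):
    """Concatenate an RGB pixel with a key point map pixel into 6 channels."""
    return list(rgb_pixel) + list(pose_pixel)


# Toy weight: 2 output channels, 3 input channels, 1-element kernels.
pretrained = [[[1.0], [2.0], [3.0]], [[4.0], [5.0], [6.0]]]
expanded = double_input_channels(pretrained)

print(len(expanded[0]))                              # 6
print(expanded[0][3:] == pretrained[0])              # True
print(concat_rgb_and_pose((10, 20, 30), (255, 0, 0)))  # [10, 20, 30, 255, 0, 0]
```

Because the new channels hold copies rather than references, they can be updated independently once the first layer is unfrozen.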
Results show that using the pose information as an alternative to the RGB data does not translate into an improvement in performance. The only information that is given to the model is the position of the body parts in each frame, which makes the problem more difficult as there is less available information.
The results on the model trained with the RGB frames and the pose information suggest that there is an improvement for most classes. This could mean that there is an advantage to giving the model the pose information, as it can more easily extract relevant information about movement or the position of people in a given frame, improving the accuracy of the network.
There seems to be no advantage in using the pose information for the ''hugging'', ''kissing'' and ''eating'' classes, when comparing the classes in common between the Moments in Time and HMDB51 datasets. The Hollywood dataset also shared the same results, showing no improvement when using the pose information for the class ''kissing''. This might mean that some classes benefit more from using the pose information than others. Table 14 and Table 15 show the results on the Hanau03 Vito and Sunnyvale datasets, respectively. Since there was the need to unfreeze the first layer of the network for the RGB + Pose experiment, we wanted to test if unfreezing the first layer on the other experiments would have any significant impact on the performance. Therefore, the tables present the results with the first layer frozen and unfrozen.  Once again, the model achieves reasonable results when using the pose information as a complement to the traditional RGB frames. It also seems that due to the more controlled scenario found in the Sunnyvale dataset, using only the pose information was enough to achieve results superior or comparable to the baseline (RGB column). The results on the Hanau03 Vito dataset did not show any improvements over the baseline model when using the pose information. This could be caused by the camera angle of the dataset (top view), since it makes it difficult to detect the keypoints, making them inaccurate and thus affecting the results.
Since the pose information only contains the position of the actors and their body parts, the model is less likely to become biased towards a certain actor. The downside is that the performance of the model depends on the accuracy of the keypoints.
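The need to unfreeze the first layer for the RGB + Pose experiment can be illustrated with a minimal sketch. It assumes that the pose information is supplied as extra input channels next to the three RGB channels, which this section does not state explicitly. In that case the pretrained first convolution has no kernels for the new channels, so they must be initialised and then trained. The function name `expand_first_layer` and the mean-of-RGB initialisation are illustrative choices, not the authors' implementation.

```python
def expand_first_layer(weights, extra_channels):
    """Append `extra_channels` input channels to a first-layer weight tensor.

    `weights` is a nested list shaped [out_channels][in_channels][kh][kw].
    Each new input channel is initialised with the mean of the existing
    input-channel kernels of the same filter (an assumed, common heuristic).
    The input is not modified; a new nested list is returned.
    """
    expanded = []
    for filt in weights:
        n_in = len(filt)
        kh, kw = len(filt[0]), len(filt[0][0])
        # Element-wise mean over the existing input channels.
        mean_kernel = [
            [sum(filt[c][i][j] for c in range(n_in)) / n_in for j in range(kw)]
            for i in range(kh)
        ]
        # Copy the pretrained kernels, then append the new ones.
        new_filt = [[row[:] for row in kernel] for kernel in filt]
        new_filt += [[row[:] for row in mean_kernel] for _ in range(extra_channels)]
        expanded.append(new_filt)
    return expanded


# Toy example: one filter, three RGB input channels, 1x1 kernels.
rgb_weights = [[[[1.0]], [[2.0]], [[3.0]]]]
rgb_pose_weights = expand_first_layer(rgb_weights, extra_channels=1)
```

Because the appended kernels were never part of the pretraining, keeping the first layer frozen would leave them at their initial values, which is one plausible reason the layer has to be trainable in this setting.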

VI. CONCLUSION AND FUTURE WORK
This paper focused on methodologies that counteract actor bias, by removing specific actor information from the features of the model, or by using data that is actor independent, such as pose information.
The first methodology used an adversarial approach to remove actor-related information from the feature vectors of the model. Although the results did not show an improvement, some fine-tuning of the hyperparameters could give better results.
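As a reminder of what such an adversarial scheme typically optimises, one common formulation (in the style of domain-adversarial training, and not necessarily the exact objective used in this work) is given below. The symbols are assumptions introduced for illustration: $\theta_f$ are the feature-extractor parameters, $\theta_y$ the action-classifier parameters, $\theta_a$ the actor-classifier parameters, and $\lambda \ge 0$ a trade-off hyperparameter.

```latex
\min_{\theta_f,\,\theta_y}\;\max_{\theta_a}\quad
\mathcal{L}_{\mathrm{action}}(\theta_f,\theta_y)
\;-\;\lambda\,\mathcal{L}_{\mathrm{actor}}(\theta_f,\theta_a)
```

The actor classifier minimises its own loss $\mathcal{L}_{\mathrm{actor}}$, while the feature extractor is pushed to increase it, so that the features retain as little actor-related information as possible. The value of $\lambda$ is one of the hyperparameters whose tuning could affect the outcome.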
The second methodology assigned a weight to each training mini-batch based on how much its gradients "agreed" with the gradients of the validation mini-batches. Results show an improvement over the baseline. Since the validation set contains different actors, taking its gradients into account gives preference to the training mini-batches that help the model generalise better to unseen actors.
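The idea of weighting a mini-batch by gradient agreement can be sketched as follows. This is a minimal illustration that assumes agreement is measured by the cosine similarity between flattened gradient vectors, clipped at zero. The measure actually used in this work may differ, and the names `agreement_weight` and `weighted_sgd_step` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equally sized flat vectors."""
    return sum(a * b for a, b in zip(u, v))


def agreement_weight(train_grad, val_grad, eps=1e-12):
    """Cosine similarity between two gradients, clipped to [0, 1].

    A training mini-batch whose gradient points away from the validation
    gradient receives weight 0 (assumed design choice).
    """
    norm = math.sqrt(dot(train_grad, train_grad)) * math.sqrt(dot(val_grad, val_grad))
    if norm < eps:
        return 0.0
    return max(0.0, dot(train_grad, val_grad) / norm)


def weighted_sgd_step(params, train_grad, val_grad, lr=0.1):
    """One SGD update in which the training gradient is scaled by its agreement weight."""
    w = agreement_weight(train_grad, val_grad)
    return [p - lr * w * g for p, g in zip(params, train_grad)]


# Toy example: the training gradient is parallel to the validation gradient,
# so the update is applied with full weight.
new_params = weighted_sgd_step([1.0, 1.0], [2.0, 0.0], [1.0, 0.0], lr=0.5)
```

Under this assumption, mini-batches whose gradients conflict with those computed on the validation actors contribute nothing to the update, while fully aligned mini-batches are applied at the normal learning rate.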
The third methodology studied the impact of giving the model the pose information of the actors. Since the keypoints contain little information about the identity of the actor, we wanted to test whether this was a viable strategy to reduce actor bias. Results show some improvements over the baseline on public datasets, and comparable results on the Hanau03 Vito and Sunnyvale datasets.
For future work, it would be interesting to explore the use of these methodologies together and see if there are any significant improvements. Since these methodologies produced results on par with or superior to our baseline model, their combination might achieve even better results.

CAROLINA PINTO has been a Deep Learning Researcher at Bosch since 2020, working on interior vehicle sensing and autonomous driving. Her M.Sc. dissertation was on human interaction recognition using visual information captured from sensors inside the vehicle. Her main work at Bosch has focused on using audiovisual information for occupant emotional monitoring, violence detection, and activity recognition. Her research interests include pattern recognition, biometrics, deep learning, and computer vision.

[...], IEEE) is currently a Senior Researcher at INESC TEC and an Invited Professor at the School of Engineering, Polytechnic of Porto. As part of his activities at INESC TEC, he has participated in, or been a principal investigator of, more than 20 research and development projects, including national and European projects and projects with companies, and has published more than 40 papers in international journals and conferences. His research interests include computer vision, multimedia systems, and decision support systems.

JAIME S. CARDOSO (Senior Member, IEEE) is currently a Full Professor at the Faculty of Engineering, University of Porto (FEUP). From 2012 to 2015, he served as the President of the Portuguese Association for Pattern Recognition (APRP), affiliated to the IAPR. His research interests include computer vision, machine learning, and decision support systems. His image and video processing work focuses on medicine and biometrics. His work on machine learning is mostly concerned with the adaptation of learning to the challenging conditions presented by visual data, with a focus on deep learning and explainable machine learning. In decision support systems, the particular emphasis is on medical applications, always anchored in the automatic analysis of visual data. He has coauthored more than 300 papers, more than 100 of which appeared in international journals, and which have attracted more than 7000 citations according to Google Scholar.

VOLUME 10, 2022