Toward Fast and Accurate Violence Detection for Automated Video Surveillance Applications

Surveillance cameras are increasingly being used worldwide due to the proliferation of digital video capturing, storage, and processing technologies. However, the large volume of video data generated makes it difficult for humans to perform real-time analysis, and even manual approaches can result in delayed detection of events. Automatic violence detection in surveillance footage has therefore gained significant attention in the scientific community as a way to address this challenge. With the advancement of machine learning algorithms, automatic video recognition tasks such as violence detection have become increasingly feasible. In this study, we investigate the use of smart networks that model the dynamic relationships between actors and/or objects using 3D convolutions to capture both the spatial and temporal structure of the data. We also leverage the knowledge learned by a pre-trained action recognition model for efficient and accurate violence detection in surveillance footage. We extend and evaluate several public datasets featuring diverse and challenging video content to assess the effectiveness of our proposed methods. Our results show that our approach outperforms state-of-the-art methods, achieving approximately a 2% improvement in accuracy with fewer model parameters. Additionally, our experiments demonstrate the robustness of our approach under common compression artifacts encountered in remote server processing applications.


I. INTRODUCTION
Today, surveillance and security cameras are deployed in various public places to monitor public events and human activity. Video surveillance improves public safety and plays a crucial preventive role in protecting a specific territory against crimes. The recorded surveillance footage is often used as evidence in criminal prosecutions. To prevent crime and reduce the crime rate, detecting and recognizing anomalies such as violence as soon as possible is a crucial task for the military and law enforcement agencies. However, surveillance cameras generate a large amount of video data every single day and instances of violence occur very rarely compared to other normal activities. Therefore, it is impractical and cumbersome for humans to manually monitor this video data for instances of violence. Human error may also reduce the efficiency of a manual, labour-intensive approach. Therefore, there is a significant need for automatic and efficient methods for detecting abnormal or violent activities, especially in surveillance videos.
The associate editor coordinating the review of this manuscript and approving it for publication was Chuan Li.
Video classification using Human Activity Recognition (HAR) has been a popular research topic in recent years and is analogous to the field of violence detection. In these methods, sensor data is used to provide information on simple or complex physical activities of humans, such as standing, talking and cooking. Earlier techniques for HAR involved detecting and tracking human body parts in consecutive video frames using image-level descriptors, such as Histogram of Oriented Gradients (HOG) or Histogram of Oriented optical Flow (HOF) [1]. Other advanced approaches involved computing spatio-temporal descriptors for motion [2], [3]. However, one of the major drawbacks of these techniques is that they often require good lighting conditions and clear visibility for successful operation. With the development of depth cameras, algorithms have emerged that use depth measurements from sensors such as Microsoft Kinect [4], [5], ASUS Xtion2 [6] or Intel RealSense [7] for HAR. One advantage of depth sensors is that they come with a Software Development Kit (SDK) containing real-time algorithms for detecting skeletons [8]. Specifically, a skeleton joint coordinate can be obtained in three dimensions (3D) in real-time, and a series of these coordinates, when tracked over time, can be used to detect and describe human actions. As a result, several algorithms have been proposed in the literature for using depth sensors to perform HAR [9], [10], [11], [12], [13] or using a combination of color and depth sensors [14]. However, depth sensors, even the modern ones, often have substantial noise in their measurements. Without adequately filtering out this noise, it can be difficult to achieve good detection for HAR. Additionally, integrating depth sensors into use cases such as surveillance can increase the hardware costs and may not always be feasible.
The use of Convolutional Neural Networks (CNNs) has become increasingly common in computer vision due to their exceptional success in image recognition tasks [15], [16]. CNNs are evolving rapidly in many fields of research, and it is expected that future solutions will enhance the adoption of CNNs. With the availability of big data and the exponential growth of computing power, these learning algorithms continue to have significant development potential. Several successful methods have recently been proposed that extend the spatial CNNs, which are used for image recognition tasks, to the temporal domain for HAR in videos [17], [18], [19], [20] [21], [22], [23], [24]. One of the main advantages of using CNNs for HAR is that they can handle challenging cases such as changes in lighting conditions, background changes, camera movement, different dressing styles and varying body shapes of people. They can also handle videos with partially or completely occluded human body parts.
In this paper, we address the problem of violence detection using deep learning with CNNs. Specifically, the following are the contributions of this work:
1) We propose a deep learning-based approach for filtering videos based on their violent or normal content. Our method is computationally efficient, making it practical for real-world applications, and performs better on popular video classification metrics than several state-of-the-art methods for violence detection. Additionally, our method is able to maintain high classification accuracy even in the presence of video compression artifacts.
2) We present a comprehensive video database for the study of violence, comprising both violent and normal videos. Our database combines and extends seven existing video databases, providing a diverse range of violent content in various contexts.
3) We present a fully functional stand-alone system that implements the proposed methods for automated violence detection.
The rest of the paper is organized as follows: In section II, we discuss related work. Section III describes the proposed approaches in detail. In section IV, we present the results and discuss the scope of generalization of our approach. In section V, we conclude the work and derive future directions for our current work.

II. RELATED WORK
In this section, we describe several classes of algorithms that have been proposed in the literature for detecting violence using deep learning. We note that in the literature, there are multiple variants of violence detection that are being studied under different names, such as anomaly detection, abnormal activity detection and fight detection. Our current work focuses on forms of violence that primarily involve humans and human interaction with objects.
Due to the lack of a substantial amount of labeled data containing diverse real-world violence samples, several studies in the literature have used training data containing only a few samples. There are also some large-scale, publicly available datasets for violence detection. However, for these datasets, the exact time and duration of the violence are not available. Algorithms trained on such data often strive to minimize unusual patterns among training samples in order to learn about rare violent activities [25], [26] [27], [28] [29]. These methods are described in the subsections II-A and II-B. We also introduce methods that use labeled training data for violence detection in subsection II-C.

A. MODELLING NORMAL PATTERNS
These techniques learn patterns of normal behavior from training videos that contain no violence. Since only normal videos without violence are used in the training phase, no specific labels are provided. During testing, these methods are expected to find samples that deviate from the learned normal behavior [30]. In [31] and [25], motion trajectories are used to learn about normal patterns. In [31], the authors suggested representing motion patterns using super-trajectories that describe the motion of local groups of similarly moving points (pixels in a video sequence) and clustering these motion patterns hierarchically to derive prototype patterns for normal samples.
In [32] and [33], the authors used auto-encoders to learn regularities in video sequences. As inputs to the autoencoders, they used state-of-the-art spatio-temporal motion features computed using HOG and HOF [34]. In [35], the authors used optical flows along with video sequences and constructed multiple auto-encoders. They used reconstruction loss [36] to detect abnormal or violent events.
Authors in [37], [38], and [39] also incorporated auto-encoders to learn normal behaviors, but without explicitly computing local motion patterns. This one-stage approach is faster in terms of computational speed because it does not require object detection or feature extraction. There were also approaches that augmented memory modules [37] to auto-encoders and used optical flow images [40], [41] to define flows of normal patterns. In [37], the authors augmented the output of an encoder in a variation of an auto-encoder CNN with a memory module that adaptively records prototypical patterns of normal data for more accurate detection of violent cases in a given database.
In [39] and [42], the authors employed a variation of auto-encoders to predict a future video frame from a given set of consecutive video frames. Then, they computed per-pixel differences between the predicted and ground truth frames to make a decision on whether the current video sequence is normal or not. Future frame prediction has gained increasing attention due to its potential applications in unsupervised feature learning for video representation [43]. In [39], the authors also quantized the output of the encoder using a predefined codebook (a concept similar to augmented memory modules) that further narrows the explanation of normal events and aids in better future frame prediction in normal videos. In [44], to generate more realistic and accurate future frames, the authors imposed a loss in the temporal space. In particular, they computed optical flows in video sequences using a pre-trained CNN [45] and formulated a loss function for an auto-encoder that ensures the optical flow of predicted frames is consistent with the ground truth. In addition to the methods that predict future frames, there were also efforts in the literature to predict transformations needed for generating future frames [46], [47], [48], [49].
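To make the decision step of the frame-prediction methods concrete, the following minimal Python sketch scores a predicted frame against the ground truth frame by its mean squared per-pixel difference and flags the sequence as abnormal when the score exceeds a threshold. The use of the mean squared difference, the threshold value and the function names are assumptions made for illustration; the cited works may aggregate per-pixel differences differently.

```python
def mean_squared_difference(predicted, actual):
    """Mean squared per-pixel difference between two grayscale frames.

    Frames are given as equally sized lists of rows with values in [0, 1].
    """
    total = 0.0
    count = 0
    for row_p, row_a in zip(predicted, actual):
        for p, a in zip(row_p, row_a):
            total += (p - a) ** 2
            count += 1
    return total / count


def is_abnormal(predicted, actual, threshold=0.01):
    """Flag a frame as abnormal when the prediction error is large.

    The threshold is a hypothetical value; in practice it would be tuned
    on validation data.
    """
    return mean_squared_difference(predicted, actual) > threshold


# A well-predicted (normal) frame versus a poorly predicted one.
predicted = [[0.0, 0.5], [1.0, 0.5]]
normal_frame = [[0.0, 0.5], [1.0, 0.5]]
unusual_frame = [[0.0, 0.5], [0.0, 0.5]]
print(is_abnormal(predicted, normal_frame), is_abnormal(predicted, unusual_frame))
```

A model trained only on normal videos is expected to predict normal frames well, so a large prediction error serves as an indicator of abnormal content.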

B. MULTIPLE INSTANCE LEARNING
These methods also aim to learn about violent actions using video-level labels that are provided during the training phase. In contrast to methods that model normal patterns, these methods use both normal and violent data to train violence detection models [50], [51], [52], typically using Multiple Instance Learning (MIL) [53]. Sultani et al. [50] divided each video (both normal and violent) into multiple temporal segments to form positive and negative bags that capture instances of the violent and normal events respectively. C3D [54] spatio-temporal features were then extracted from each segment and used to train multiple fully connected layers, which derive scores for the positive and negative bags. Due to the absence of segment-level labels, a novel ranking loss function was proposed that encourages the score indicating violence in the positive bag to be higher than the score in the negative bag. The ranking loss also imposed smoothness and sparsity constraints to reduce false alarms.
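The structure of such a ranking loss can be illustrated with a minimal Python sketch. It assumes a hinge formulation with a margin of 1 applied to the highest-scoring segment of each bag, a squared-difference smoothness term over adjacent segments and a sum-of-scores sparsity term on the positive bag. The function name and the default weights are hypothetical placeholders and are not taken from [50].

```python
def mil_ranking_loss(pos_scores, neg_scores, w_smooth=1e-4, w_sparse=1e-4):
    """Hinge-style MIL ranking loss for one positive and one negative bag.

    pos_scores: segment scores in [0, 1] from a video labeled violent.
    neg_scores: segment scores in [0, 1] from a video labeled normal.
    w_smooth, w_sparse: hypothetical weights of the two constraints.
    """
    # Ranking term: the top-scoring segment of the violent video should
    # exceed the top-scoring segment of the normal video by a margin of 1.
    ranking = max(0.0, 1.0 - max(pos_scores) + max(neg_scores))
    # Smoothness term: scores of adjacent segments should vary gradually.
    smoothness = sum((a - b) ** 2 for a, b in zip(pos_scores, pos_scores[1:]))
    # Sparsity term: violence is assumed to occupy only a few segments.
    sparsity = sum(pos_scores)
    return ranking + w_smooth * smoothness + w_sparse * sparsity


# Example with three segments per bag.
loss = mil_ranking_loss([0.1, 0.9, 0.2], [0.1, 0.2, 0.1])
print(round(loss, 4))
```

Because only the maximum score of each bag enters the ranking term, no segment-level labels are needed; the video-level label alone determines which bag is positive.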
By extending the approach of Sultani et al., Zhu et al. in [51] introduced temporal context information into the MIL ranking loss to compute video-wise scores, rather than segment-wise scores. They proposed a temporal augmented network that captures motion features using pre-computed optical flows, similar to an auto-encoder. The encoded motion patterns were used to train MIL ranking model for better localization of violence instances.
Philippe et al. [55] proposed a two-step approach where they first detect and track humans locally across a given segment of a video to form human tubes (spanning the entire segment) and then use multi-fold Multiple Instance Learning (MIL) with Support Vector Machines (SVM) [56] to learn about human tubes that contain the action described by the video-level labels. In [57], Yan et al. proposed a multi-task ranking model. In their approach, they segmented videos into supervoxels using a graph-based segmentation method to generate action tubes and action-actor tubes. Action tubes were then used as proposals for actions, e.g., walking, adult running, and crawling. Features were extracted from each tube to train the ranking model to select the most characteristic action tubes.
Arnab et al. [58] proposed a probabilistic variant of MIL, in which they estimate the uncertainty of an instance-level prediction. They used a pre-trained person detector trained on a large image dataset to detect persons over consecutive frames of a video to form person tubelets. A bag for MIL consists of all tubelets within a video, and it is annotated with the video-level label. During training, they also model the label noise through the uncertainty of sampling bags that do not contain any tubelets with the labeled action.
Mettes et al. in [59] aimed to find the spatio-temporal locations of actions in videos using pseudo-annotations. They investigated spatio-temporal pseudo-annotations from different sources such as action proposals, object proposals, person detection, motion, and center biases. They later combined the extracted pseudo-annotations using a correlation metric to train a classifier using MIL.

C. SUPERVISED LEARNING
There have been multiple approaches that use deep learning to classify violent videos using labeled data. These methods rely on video datasets with accurate visual information about the relevant class, such as videos in the violence class containing few or no normal events.
In [60], Long et al. proposed a method for classifying violent videos using the Motion SIFT (MoSIFT) algorithm to extract features and then applying Kernel Density Estimation (KDE) to filter out noise. These reduced MoSIFT features were then transformed into a video-level feature vector using sparse coding, and a Support Vector Machine (SVM) was trained on these vectors to classify videos.
In 2012, Hassner et al. [61] proposed a method for real-time detection of violence in crowded scenes using the Violent Flows (ViF) descriptor to capture optical flow information between consecutive video frames and a linear SVM to classify the videos based on the computed ViF descriptors. They demonstrated that their method was effective at classifying videos containing crowd violence and compared it to other existing methods at the time. In a later study, Meng and Serrano [62] proposed a method for violence detection that combined feature extraction with deep learning using Convolutional Neural Networks (CNNs). Their approach involved using a Hough Forests spatio-temporal feature extractor in combination with a 2D CNN.
Sudhakaran and Lanz [63] proposed a method for encoding the difference between two successive frames using a combination of a CNN and a Long Short-Term Memory (LSTM) module and demonstrated that this approach had better performance than a model trained on raw frames. AlDahoul et al. [64] proposed a lightweight model with fewer parameters that used a CNN and an LSTM module to capture spatial features for violent video classification. Fath U Min Ullah et al. [65] proposed a Violence Detection Network (VD-Net) that first used object detection to detect humans and suspicious objects like guns to pre-filter video sequences for violence detection and then applied a combination of a convolutional LSTM and gated recurrent units [66] to the filtered video sequences for violence detection. Romas et al. [67] also proposed a CNN-LSTM-like architecture that was computationally light, using MobileNet V2 to extract spatial features for training an LSTM network.
Chollet et al. [68] used a model based on XceptionNet [69] to extract features from a video and then applied a bi-directional LSTM to analyze the extracted features in both forward and backward temporal directions for classification. Khan et al. [70] proposed a method that uniformly samples a video into segments, selects a representative frame from each segment using computed levels of saliency, and then fine-tunes a MobileNet model [71] using the representative frames to classify the corresponding segment as violent or non-violent. Li et al. [72] proposed a DenseNet-based [73] 3D CNN architecture that directly processes video data without explicitly computing features, and demonstrated good accuracy on standard databases with a relatively lightweight model. Fernando et al. [74] proposed an architecture based on a variant of DenseNet [75] that extracts feature maps and then applies self-attention mechanisms [76] to link different positions in a single sequence and generate a representation that focuses on the most relevant parts of the sequence. This representation is fed into bi-directional LSTM blocks and fully connected layers for classification. They demonstrated good accuracy on four different databases using this method, and also experimented with using both optical flow and pseudo-optical flow computed from adjacent frames as inputs to the DenseNet.
In a 2016 study, Dong et al. proposed a multi-stream deep convolutional neural network consisting of three streams (color, optical flow, and person-to-person acceleration) for violence detection. The acceleration stream aimed to capture the intense information that was hypothesized to be present in violent events, and three LSTMs were trained using the features from the three streams. The outputs from the streams were fused to classify a video. In a later study, Su et al. [77] proposed a method for violence detection that involved computing 3D skeleton point clouds from video and then using interaction learning on these point clouds to capture spatio-temporal features and model interactions between skeleton points. They used multiple Skeleton Points Interaction Learning (SPIL) modules together with a fully connected layer to distinguish violent videos from normal videos. In another study, Mu et al. [78] proposed a method for violence detection that used both visual and audio cues, as it was hypothesized that visual information alone may not be reliable for violence detection and that using audio could improve performance. They extracted audio features using 40-dimensional Mel Filter-Bank (MFB) coefficients and used an SVM to classify audio samples from input videos.
Table 1 compares different approaches for violence detection in videos proposed in the literature and lists their advantages and disadvantages. Successful algorithms for violence detection should be computationally fast, achieve high classification accuracy, and be adaptable to scenarios not present in the training data. Some normal actions involving close physical interaction between humans can mimic violent actions and can mislead deep learning algorithms that are trained solely on normal videos. This suggests that it is important to incorporate both normal and violent behaviors in the training data for better generalizability of the trained models.
MIL using spatio-temporal feature-based methods can be computationally fast but may not achieve high classification accuracy, as they focus on predicting bag-level labels while neglecting the hidden temporal context information in violence and normal patterns.

D. COMPARISON OF THE STATE-OF-THE-ART METHODS
From the results presented in the literature, it is evident that methods using 3D deep learning architectures capture spatio-temporal features in the data, accounting for both the spatial structure of the video frames and the temporal dynamics between frames. This makes such 3D CNNs effective at tasks such as action recognition and violence detection, where the actions being performed and their temporal evolution are important factors to consider. However, it is important to note that the cost of extracting some of the spatio-temporal features is still prohibitive for practical applications. In the current work, a computationally light and accurate 3D deep learning architecture (see section III) is adapted and extended, and labeled datasets are used (refer to section III-A) to develop efficient methods for violence detection.

III. FAST AND ACCURATE VIOLENCE DETECTION
ResNet [79] is a popular base architecture for image and video recognition tasks, known for its effectiveness and state-of-the-art results on benchmarks like ImageNet [80] and COCO [81] datasets. 3D ResNets [23] are an extension of the ResNet architecture, designed for learning spatio-temporal features from video data. They have achieved strong performance on various benchmarks and real-world applications, including the Kinetics-700 action recognition dataset [82] (where a variant called I3D [83] achieved state-of-the-art performance) and the Something-Something V2 action recognition dataset (where a 3D ResNet called R(2+1)D [84] achieved state-of-the-art performance).
3D ResNets have higher accuracy than counterparts like 3D-MobileNet [85] due to factors such as more layers for learning complex spatio-temporal features and skip connections between the input and output of each layer that allow input to bypass intermediate layers. However, they are generally more computationally intensive due to a large number of model parameters. To improve computational efficiency, model complexity can be reduced through techniques such as reducing the number of layers, using fewer filters in convolutional layers, and using smaller input data, though this may decrease accuracy on complex tasks. Christoph et al. [86] experimented with various parameters of the 3D ResNet architecture to understand the effect of reduced model complexity on accuracy. They expanded the architecture along multiple axes to form spatio-temporal models and selected the axis that achieved the best trade-off between computational speed and accuracy, resulting in a series of models ranging from extra small (XS) to extra large (XL) in increasing complexity. Using the Kinetics-400 dataset [87], they showed that their expanded model, X3D-M, had the same accuracy as state-of-the-art video classification networks but with a 10X reduction in model parameters.
The X3D-M model is an appropriate choice for our violence detection task due to its high accuracy and reduced model complexity. As demonstrated by Christoph et al., the X3D-M model achieves similar accuracy to state-of-the-art video classification networks, but with a significantly lower parameter count. This reduction in model complexity makes the X3D-M model more efficient to train and deploy, particularly for resource-constrained systems. Also, the ResNet 3D backbone, which has a proven ability to learn complex spatio-temporal features, is particularly useful for our violence detection task, as it allows the model to capture the dynamic nature of the videos and learn robust representations of the data. The proposed system using the X3D-M model architecture is detailed in section III-B.

A. DATASETS FOR EXPERIMENTS
Due to data protection laws such as GDPR [91], it is not possible to obtain large amounts of real-world footage containing violence for training deep learning models. Recently, the usage of synthetic training data has become more common in computer vision. The use of training data containing pasted object patches on real images has been shown to be effective for tasks such as 2D object detection [92], [93], [94] and human pose estimation [95]. However, for violence detection, we postulate that such fabricated training data may not fully capture the complex and diverse action patterns of violent actions with various nuances. Therefore, preparing and using synthetic training data is not considered in the scope of the current work.
In their study, P. Sernani et al. [96] proposed the AIRTLab dataset, which contains videos showing violence patterns performed by non-professional actors. They studied the use of 2D and 3D deep learning architectures for violence detection using their dataset and found that the studied models adapt well to their setting, where violence is mimicked by nonprofessional actors. However, they also noted that their results cannot be considered general, as their architectures were not validated on other datasets and no cross-validation experiments were performed. Therefore, we do not consider such datasets in our experiments.
In the current work, we have considered seven different datasets that are commonly used in the literature for experimentation with violence detection and to facilitate comparison of our results with other methods. We have also extended some of these datasets with annotations to assist in in-depth cross-validation experiments. These datasets are described in the following:
• Crowd Violence (CV) dataset [61] contains videos involving violence in crowds, collected from YouTube.
• Hockey Fights (HF) dataset [88] is a collection of fights between players in hockey games from the USA's National Hockey League (NHL).
• Movie Fights (MF) dataset [88] consists of a collection of scenes from action movies.
• XD-Violence Selected (XD-V) dataset contains a subset of videos from the XD-Violence dataset [97]. The XD-Violence dataset consists of several untrimmed videos covering six anomalies, including Abuse, Car Accident, Explosion, Fighting, Riot, and Shooting, gathered from action movies and YouTube. Similar to the UCF-Crime dataset, we selected a set of videos belonging to the classes of Abuse, Explosion, Fighting, Road Accident and Shooting from the XD-Violence dataset and manually trimmed these videos to only contain violent parts for training and testing purposes.
All of the datasets also contain normal videos for training and testing that do not involve violence. In the case of the UCFS and XD-V datasets, we trimmed the normal videos to five-second video clips to match the average duration of normal clips in the other datasets. Additionally, in the case of the UCFS and XD-V datasets, we limited the maximum duration of a video clip containing violence to approximately five seconds. Table 2 provides more details about each of the datasets we used in our experiments.

B. MODEL ARCHITECTURE
We note that for accurate violence detection, it is important to have a properly labeled dataset containing a large number of diverse examples for training a deep learning model. Successful action recognition datasets such as Kinetics-400 [87] contain a minimum of 400 videos for each action class, such as standing, sitting and talking. All videos in the Kinetics-400 dataset have a fixed duration of five seconds. The authors obtained clips for each class from YouTube and then used Amazon Mechanical Turk (AMT) workers to decide if a given clip contained the desired action. A clip was accepted if it received three or more confirmations (out of five) [87]. The dataset was also de-duplicated to reduce redundancies in the environment.
In several cases, actions involving violence are more complex than actions such as sitting and talking and the number of example violent videos collected in existing datasets may not be sufficient for training a model that generalizes well and can lead to overfitting. Additionally, as shown in Table 2, different datasets for violence detection contain clips with different durations in seconds and they are not well-organized to check for the validity of a specific action or for redundancies. To address these issues with existing datasets for violence detection, we follow training approaches that are inductive in nature. Specifically, we aim to make use of the knowledge learned using better-calibrated action recognition datasets to solve the efficient violence detection problem. To this end, we propose two different deep learning configurations that are described in the following subsections.

1) FINE-TUNED X3D-M MODEL
In the Fine-Tuned X3D-M (FT) model, we consider the X3D-M model architecture initialized with weights obtained by training on the Kinetics-400 dataset. Note that the original architecture used for training on the Kinetics-400 dataset contains two fully connected layers, with the output of the second fully connected layer representing the classification results for each class (the number of outputs of this layer is equal to the number of classes in the training dataset). Since we aim to predict if a clip contains violence or not (a binary classification), we modify the architecture into a regression model to generate a violence coefficient that indicates the probability of violence in a given video clip. Specifically, we retain the X3D-M model up to the first fully connected layer and replace the second fully connected layer with one that outputs a floating-point variable, which is converted into the range of [0, 1] using a sigmoid function to derive the violence coefficient. Put simply, during training, we label the violence coefficient as 1 for samples of video clips containing violence and as 0 for samples of video clips containing no violence.
The architecture of the X3D-M model follows the fast pathway design of SlowFast networks [98] with down-sampled temporal input. Therefore, we pre-process the input videos as required by the X3D-M model. In particular, for a given video clip, we first extract 16 video frames by uniformly sampling in the temporal domain. Then, we transform the pixel value range of the extracted frames to be within [0, 1] to obtain floating-point images. Next, we normalize the video frames using mean and standard deviation and resize the frames so that the shortest side corresponds to 256 pixels. Finally, we center crop the resized frames to obtain 16 video frames with a spatial resolution of 256 × 256. Batches of pre-processed video frames are supplied to the FT model with corresponding labels for training. Note that the X3D-M model weights obtained by training on the Kinetics-400 dataset are only used for network initialization and these are further optimized during training on datasets for violence detection. The FT architecture is shown in Fig. 1, and Table 3 presents information on the corresponding model parameters.
In the TL model, the pre-trained X3D-M model is used to compute, for each video clip, a feature vector of 2048 elements, which is used to train three additional fully connected layers, as shown in Fig. 2. The output of the additional fully connected layers is a floating-point variable, and similar to the FT model, we transform this variable to be within the range of [0, 1] using a sigmoid function to obtain the violence coefficient. Table 4 presents information on the TL model parameters.
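The geometry of the pre-processing steps described for the FT model can be sketched in plain Python as follows. The sketch only computes frame indices, pixel scaling and resize/crop dimensions; it does not decode video. The function names are hypothetical, and the normalization mean and standard deviation used in the example (0.45 and 0.225) are assumed values, since the exact statistics are not stated in the text.

```python
def uniform_frame_indices(num_frames, num_samples=16):
    """Indices of num_samples frames spread uniformly over a clip.

    Indices repeat when the clip has fewer frames than num_samples.
    """
    if num_frames <= 0:
        raise ValueError("clip must contain at least one frame")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def normalize_pixel(value_uint8, mean, std):
    """Scale an 8-bit pixel value to [0, 1], then standardize it."""
    return (value_uint8 / 255.0 - mean) / std


def short_side_resize(width, height, target=256):
    """Frame size after resizing so that the shortest side equals target."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, size=256):
    """(left, top, right, bottom) of a centered size x size crop."""
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size


# Example: a five-second clip at 30 frames per second, 1280 x 720 pixels.
indices = uniform_frame_indices(150)
resized = short_side_resize(1280, 720)
crop = center_crop_box(*resized)
print(indices, resized, crop, normalize_pixel(255, 0.45, 0.225))
```

For the example clip, the frames are resized to 455 × 256 pixels and the central 256 × 256 region is retained, so content near the left and right borders is discarded.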

C. LEARNING AND OPTIMIZATION
We do not apply data augmentation techniques in the training of the proposed models. We use Adagrad [99] to optimize our models with an initial learning rate of 1e-3. Both models are compiled to minimize the Binary Cross Entropy (BCE) between the estimated and ground truth violence coefficients. For training the TL model, we use a batch size of 30 samples collected from shuffled pre-computed X3D-M feature vectors. Since the FT model takes videos as input, to account for higher memory usage during training, we use a batch size of 4 samples collected from shuffled videos. For regularity, within a training batch, for both models, we concatenate a batch of violent video clips with a batch of non-violent video clips. For ease of access, all our hyperparameters are listed in Table 5.
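The loss and the optimizer update described above can be illustrated with a minimal scalar sketch in Python. It is a toy example and not our training code: a single logit stands in for the model output, the function names are hypothetical, and the clamping constant in the BCE and the epsilon in the Adagrad step are assumed values. Only the learning rate of 1e-3 is taken from the text.

```python
import math


def sigmoid(x):
    """Map a raw model output to a violence coefficient in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-x))


def bce(coefficient, label, eps=1e-7):
    """Binary cross entropy between a predicted coefficient and a 0/1 label."""
    p = min(max(coefficient, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def adagrad_step(param, grad, accum, lr=1e-3, eps=1e-10):
    """One Adagrad update for a scalar parameter.

    accum is the running sum of squared gradients; the effective step
    size shrinks as accum grows.
    """
    accum += grad * grad
    param -= lr * grad / (math.sqrt(accum) + eps)
    return param, accum


# Toy optimization: push a single logit toward the label 1 (violent).
logit, accum, label = 0.0, 0.0, 1
for _ in range(3):
    coefficient = sigmoid(logit)
    grad = coefficient - label  # derivative of BCE w.r.t. the logit
    logit, accum = adagrad_step(logit, grad, accum)
print(logit, bce(sigmoid(logit), label))
```

Each update moves the logit in the direction that raises the violence coefficient for a violent sample, so the BCE decreases over the iterations.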

IV. RESULTS AND DISCUSSION
In this section, we present the results from our various experiments using the proposed models and the various datasets described in Section III-A. Most of the datasets used in the study already have a training and testing data split with 80% of the data as the training set and 20% as the test set. For the other datasets, we preserve this proportion and randomly select 20% of the violent and non-violent samples to create a testing set for fair comparison across datasets. To facilitate fair comparison, all the models are trained for 50 epochs using a given training dataset. We use the PyTorch [100] deep learning library to train and test our models on an Nvidia GeForce GTX 1080 Ti GPU using the CUDA toolbox. We use the Ubuntu Linux operating system on an AMD Ryzen Threadripper 1950X 16-core processor. To evaluate the performance of various methods, we use the following metrics, which are commonly used to evaluate deep learning classification algorithms.
• Accuracy (ACC) [101] is the most popular metric for evaluating deep learning models for video classification. It is the ratio of the number of correct predictions (as violent or non-violent video clips) to the total number of predictions. To compute the accuracy, we used the ground truth binary labels, 0 (for video clips without violence) and 1 (for video clips with violence), that are provided during training. Since we designed our networks to output floating-point violence coefficients, we round the predicted violence coefficients to the nearest integer before calculating the accuracy. In line with other methods in the literature, we report the accuracy score in percentages.
• Area Under Curve (AUC): higher values represent the ability of a model to effectively discern between violence and non-violence cases, while lower values represent the opposite.
We have conducted several experiments, including cross-dataset validation, to evaluate the performance of the proposed approaches using the considered datasets and metrics. The details and results of these experiments are presented in the following subsections.
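The two metrics can be illustrated with a short sketch. The accuracy function rounds each violence coefficient to the nearest integer before comparing it with the label, and the AUC function uses the standard pairwise (Mann-Whitney) formulation of the area under the ROC curve. This is a sketch for illustration only: the function names are hypothetical, and treating a coefficient of exactly 0.5 as violent is an assumption, since the text does not say how ties are rounded.

```python
def accuracy_percent(coefficients, labels):
    """Accuracy in percent after rounding coefficients to 0 or 1."""
    # Assumption: a coefficient of exactly 0.5 is rounded up to 1.
    predictions = [1 if c >= 0.5 else 0 for c in coefficients]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return 100.0 * correct / len(labels)


def auc_score(coefficients, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the probability that a random violent clip receives a higher
    coefficient than a random non-violent clip (ties count as one half).
    """
    positives = [c for c, y in zip(coefficients, labels) if y == 1]
    negatives = [c for c, y in zip(coefficients, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Unlike accuracy, the AUC depends only on how the coefficients rank violent clips against non-violent ones, so it does not depend on the 0.5 rounding threshold.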

A. EXPERIMENTS ON INDIVIDUAL DATASETS
Most datasets already have pre-defined data splits for training and testing, with 80% and 20% of the data respectively. We used these splits without modification for unbiased comparison. For the remaining datasets, we maintained this proportion of training and testing data by randomly selecting 20% of violent and non-violent samples for testing. We trained our models on the training data split and evaluated their performance on the testing data split for each dataset separately. The testing results using the ACC and AUC metrics are presented in Tables 6 & 7 respectively. The tables also show the performance of state-of-the-art methods discussed in Section II on the respective datasets. As mentioned, we created the UCFS and XD-V datasets and we report the results on these datasets using only our methods.
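For datasets without a pre-defined split, the random selection of 20% of the violent and 20% of the non-violent samples for testing can be sketched as a per-class (stratified) split. The function name, the seed, and the rounding of the per-class test size are assumptions made for this illustration.

```python
import random


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (clip, label) pairs into train/test sets, class by class.

    Labels are 1 (violent) and 0 (non-violent); each class contributes
    test_fraction of its samples to the test set.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```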
It is worth noticing that only a few studies in the literature report evaluations using the AUC metric. We argue that in applications such as violence detection, false positives (incorrectly reporting non-violent events as violent) should be explicitly considered when evaluating the performance of a model and the ACC metric does not directly account for false alarms.
The experimental results on individual datasets show that both of our proposed methods perform well on individual datasets. Overall, our FT model outperforms most of the state-of-the-art methods and our TL model also achieved decent performance on all datasets. We postulate that the FT model, which optimizes the parameters of the (trimmed) X3D-M model during learning, is more adaptable to a given dataset. On the MF dataset, the results for both TL and FT models suggest overfitting, which is consistent with the results from most methods in the literature. This suggests that the MF dataset may contain more regular examples with less diversity and may be less challenging for deep learning video classification models. Tables 6 & 7 also show the model parameter count for various models under comparison and our models have fewer parameters than the state-of-the-art methods.
TABLE 6. The ACC(%) scores of our FT and TL models along with the state-of-the-art methods on individual datasets. Based on the ACC metric, our FT method outperforms most of the state-of-the-art methods on all datasets except HF, with relatively fewer model parameters.
Bilinski et al. [3] achieved a higher accuracy than our TL model on the CV dataset. They used improved Fisher vectors for spatio-temporal feature extraction, which can be context-dependent. For example, the CV dataset only contains examples of violence involving a crowd and their results show that their method performs better in such scenarios. It is important to note that statistical feature extraction methods like this can be sensitive to variations in the video capture environment and may result in false alarms. When evaluated using the AUC metric, our TL model performs better than the method of Bilinski et al. [3] on the CV dataset (see Table 7).
Sudhakaran et al. [63] used a pre-trained AlexNet model trained on ImageNet for their method. They used the difference between consecutive video frames as input to capture temporal information. The results show that their method performs better on the CV dataset compared to our TL model. We should note that our TL model extracts features using a pre-trained X3D-M model trained on the Kinetics-400 dataset. This dataset contains a smaller number of examples with several people appearing in individual frames of the videos. In contrast, the ImageNet dataset contains a relatively higher number of examples with several people appearing in one frame. Therefore, we suggest that the extracted X3D-M features might be noisy and result in lower accuracy on datasets involving crowds such as the CV dataset.
Li et al. [72] used a DenseNet 3D-CNN to train and extract spatio-temporal features from videos. Their model was initialized with parameters from a pre-trained model trained on the Kinetics-400 dataset, similar to our FT model. However, their model had more CNN layers and more model parameters, which contributed to its better accuracy on the CV and HF datasets compared to our TL model. It should be noted that DenseNet uses multi-layer feature concatenation for improved feature representation, but this approach requires more GPU memory and longer training times. Choqueluque-Roman et al. [104] followed an approach that used an I3D architecture in combination with a ResNet50 for feature extraction using human action tubes for training a deep learning model based on MIL. According to the accuracy and AUC metrics, our models achieved better performance than their method with relatively fewer model parameters, which suggests that training based on MIL may not achieve high classification accuracy.
Violence-Net [74] also used DenseNet for training and extracting feature maps. According to the ACC metric (see Table 6), their method using optical flow input achieved better scores than our FT model on the HF dataset. However, their architecture contains more model parameters and involves computing optical flow information, making it computationally more complex than ours. When pseudo-OF was used as input in their method, the accuracy decreased compared to our FT model. On the CV dataset, their model, which has more parameters, achieved higher accuracy than our TL model. As previously mentioned, the extracted X3D-M features from videos involving crowds can be noisy and lead to less accurate results.
The method proposed by Romas et al. [67] used the MobileNet V2 architecture for spatial feature extraction and LSTM modules for learning about temporal associations. Although their model has a similar number of parameters to our TL model, our methods achieved higher accuracy. As demonstrated by our results, methods that capture 3D spatio-temporal features directly from the video data, such as our proposed models, represent temporal associations more accurately and are therefore more effective at detecting violence in videos. This is due to the ability of our proposed models to accurately capture the full context and dynamics of the events depicted in the video, leading to improved performance in violence detection tasks.
The SPIL method [77] achieved higher accuracy scores than our TL model on the CV and RWF-2K datasets. However, this method requires significant computational resources due to the need to estimate 3D skeleton point clouds for interaction learning, making it impractical for real-world applications.
The Violence Detection Network (VD-Net) [65] achieved better accuracy on HF and RWF-2K datasets compared to our TL model and has slightly more model parameters. VD-Net first detects humans and suspicious objects such as guns, which requires more computational resources than our TL model. However, the AUC scores for the TL model are comparable to VD-Net.
Finally, the CNN-LSTM-IOT model [64] has fewer parameters than all of the models under comparison, including ours, and it has been demonstrated that it can run on a low-cost Internet of Things (IoT) device like a Raspberry Pi. However, the model relies on spatial features for learning and performs poorly on the RLVS and RWF-2K datasets.
In summary, our experiments on individual datasets demonstrated that our FT model outperformed most of the state-of-the-art methods on most datasets while having fewer model parameters. Our TL model also achieved decent performance on all the datasets, despite having fewer trainable parameters than the FT model, as shown in Tables 3 & 4. This suggests that the TL model is relatively less adaptable to specific scenarios.

B. EXPERIMENTS ON GENERALIZABILITY
To study the adaptability of our proposed approaches to unseen videos, we conducted cross-dataset experiments where we trained a model on one dataset and evaluated its performance on another dataset. Table 8 shows the results from such one-on-one cross-validation tests in the top section (columns 5-8). It should be noted that, among the considered datasets, different datasets have different numbers of videos containing instances of violence and non-violence actions. In general, the number of samples available for training can greatly affect the learning capabilities of a deep learning model. Few and less diverse training samples can lead to model overfitting, where the model fits noise or random fluctuations in the training data very well but cannot generalize to new data. In our case, since we follow an inductive training approach using a pre-trained X3D-M model on the Kinetics-400 data, we suggest that our models are least influenced by the number of training samples, and our cross-validation results essentially show the ability of our models to learn the concept of violence. Both ACC and AUC metrics show that there are several inconsistencies in the results across the considered datasets. To provide deeper insights into our cross-validation results, we plot the ACC and AUC scores obtained by training on a specific dataset and averaging the testing scores on the rest of the datasets in Figures 3 & 4 respectively for both FT and TL models. Each plot also shows the standard deviation of the metric scores obtained from the testing datasets, indicated by the red lines. According to the metric scores, the FT and TL models trained on the CV dataset did not generalize well to other datasets (see bar plots in Figures 3(a) & 4(a)). This is anticipated since the CV dataset contains only examples of mass violence, and the other datasets do not contain many such examples.
Also, the trained FT model on the HF dataset poorly generalized to other datasets, indicating that the HF dataset does not contain diverse examples of violence and contains monotonous fighting videos between hockey players. However, the TL model trained on this dataset showed better generalization than the FT model as indicated by the metric scores.
FT and TL models trained individually on the MF, RLVS, RWF-2K, UCFS, and XD-V datasets performed satisfactorily in our cross-validation tests and generalized well to other datasets with average ACC scores close to or above 80% and average AUC scores close to or above 0.8. When considering both metrics, FT and TL models trained on UCFS and XD-V datasets exhibited the best generalization ability in our cross-validation studies. This suggests that these datasets, which we compiled, contain the most representative and diverse samples for violent and non-violent actions.
For closer examination, we also conducted leave-one-out cross-validation tests where we trained our models on all datasets except one, which was reserved for testing. The results of these tests are presented in the middle section of Table 8. The tests suggest that when the CV dataset was left out of the training, the TL model did not achieve a good ACC score. This is expected because the TL model extracts features from training videos using a pre-trained X3D-M model that was trained on the Kinetics-400 dataset, which does not contain many examples involving crowd participation. However, the FT model achieved decent accuracy, indicating that the datasets other than CV contain a sufficient number of examples for learning about violence involving crowds. In line with the results obtained in the one-on-one cross-validation tests, leaving out the UCFS or XD-V datasets from training resulted in poor performance for the FT model. However, the performance of the TL model did not drop when these datasets were left out of the training, indicating that the TL model generalizes better than the FT model. To confirm this, we collected all instances of the one-on-one cross-validation tests when a specific dataset was being tested for further examination. In Figures 5 and 6, we plot the ACC and AUC scores obtained by averaging the testing accuracy scores on a specific dataset when all other datasets were used individually for training for both the FT and TL models. Each plot also shows the standard deviation of the metric scores obtained during testing, indicated by red lines. Based on these plots, it is evident that overall, the TL model showed better capability to generalize and had lower standard deviation within the testing accuracy scores for individual datasets when compared to the FT model.
TABLE 8. Cross dataset experiment results - One-on-one cross-validation test results are shown in the top section, leave one out cross-validation test results are shown in the middle section, and the bottom section shows the performance of our models on the training/testing folds used in Violence-Net [74]. To compare, ACC scores for Violence-Net using both OF and Pseudo-OF inputs are also provided for relevant datasets.
To the best of our knowledge, results from cross-validation studies are rarely presented in the literature for violence detection algorithms. For comparison, we have also included the cross-validation results from Violence-Net [74] using both OF and pseudo-OF inputs (columns 2-3) in Table 8. Only ACC scores are provided since AUC scores are not presented in their original study. Also, the authors of Violence-Net only used four datasets in their experiments, so results are presented only for these four datasets. The comparison results show that, on average, our TL and FT models consistently outperformed Violence-Net using both OF and pseudo-OF inputs. This suggests that our approaches are more accurate and better able to generalize to unseen scenarios for violence detection when compared to Violence-Net.
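The two cross-dataset protocols discussed in this subsection, and the per-training-set averaging used for the bar plots, can be sketched as follows. The sketch only enumerates the train/test combinations and aggregates scores that are assumed to be already computed; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev

DATASETS = ["CV", "HF", "MF", "RLVS", "RWF-2K", "UCFS", "XD-V"]


def one_on_one_pairs(datasets):
    """All (train, test) pairs with different train and test datasets."""
    return [(tr, te) for tr in datasets for te in datasets if tr != te]


def leave_one_out_splits(datasets):
    """(training datasets, held-out test dataset) for every dataset."""
    return [([d for d in datasets if d != held], held) for held in datasets]


def summarize_by_training_set(scores):
    """Map each training dataset to (mean, std) of its test scores.

    scores: dict mapping (train, test) pairs to a metric value.
    """
    grouped = {}
    for (train, _test), value in scores.items():
        grouped.setdefault(train, []).append(value)
    return {tr: (mean(v), pstdev(v)) for tr, v in grouped.items()}
```

Grouping by the test dataset instead of the training dataset gives the complementary view, in which each dataset is scored by models trained individually on the other datasets.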

C. EXPERIMENTS WITH ALL COMBINED DATASET
In this section, we describe our experiments using the combined dataset and discuss the performance of the FT and TL models on this dataset. To ensure a fair distribution of training samples from each dataset, we selected and grouped the predefined 80% of the data from each dataset for training and the remaining 20% for testing. Figure 7 illustrates the proportion of samples from each dataset. The ROC curves, including the obtained ACC and AUC scores, are presented in Figure 8. Results from both metrics suggest that our models performed satisfactorily on this dataset, with the FT model achieving slightly better performance. It is worth noting again that the TL model has fewer trainable parameters than the FT model.
For further analysis, we present the confusion matrices for both models in Figure 9. The rows of the confusion matrix represent the true labels, or the expected output, for the Violent (V) or Non-Violent (NV) classes, while the columns represent the predicted labels. In our case, the following are the four numbers presented in the confusion matrices:
• True Positives (TP) - the number of videos actually containing violence that were predicted as containing violence. TP are shown in the first row, first column of the confusion matrix.
• False Negatives (FN) - the number of videos actually containing violence that were predicted as not containing violence. FN are shown in the first row, second column of the confusion matrix.
• False Positives (FP) - the number of videos actually not containing violence that were predicted as containing violence. FP are shown in the second row, first column of the confusion matrix.
• True Negatives (TN) - the number of videos actually not containing violence that were predicted as not containing violence. TN are shown in the second row, second column of the confusion matrix.
From the confusion matrices, it is evident that the TL model produced a greater number of combined FP & FN than the FT model. For detailed evaluation, we also studied and presented the metric scores and confusion matrices for individual datasets.
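The layout of the confusion matrix described above can be illustrated with a short sketch that counts the four entries from binary labels and predictions. The function names are hypothetical and the example values in the usage note are made up for illustration; they are not results from our experiments.

```python
def confusion_matrix(labels, predictions):
    """Return [[TP, FN], [FP, TN]].

    Rows are the true classes (Violent, Non-Violent) and columns are the
    predicted classes (Violent, Non-Violent), with 1 = violent.
    """
    tp = fn = fp = tn = 0
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            tp += 1
        elif y == 1 and p == 0:
            fn += 1
        elif y == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return [[tp, fn], [fp, tn]]


def combined_errors(matrix):
    """Number of misclassified clips, i.e. FP + FN."""
    return matrix[0][1] + matrix[1][0]
```

For instance, with toy labels `[1, 1, 1, 0, 0, 0, 0]` and predictions `[1, 1, 0, 1, 0, 0, 0]`, the matrix is `[[2, 1], [1, 3]]` and the combined FP & FN count is 2.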

D. ANALYSIS OF THE DATASETS AND CHALLENGES
Even though the CV dataset has a smaller number of examples, both models trained on the combined dataset performed well on it. However, our leave-one-out cross-validation results indicate that when the CV dataset was excluded from training, the models did not perform well. This suggests that the CV dataset contains diverse and representative examples of crowd violence. However, it should be noted that the dataset only includes examples of violence involving crowds and the models trained on it did not generalize well to other types of datasets.
The HF dataset, on the other hand, contains a relatively larger number of training samples, primarily consisting of monotonous fighting videos between hockey players. Both our FT and TL models trained on the combined dataset performed well on this dataset as well. However, our leave-one-out cross-validation test revealed that excluding this dataset did not significantly decrease the accuracy of our models. Additionally, the model trained solely on the HF dataset did not generalize well to other datasets, as shown in Figure 3. In line with our previous results on generalizability, highly monotonous datasets like the HF dataset are less useful for developing robust deep-learning models for violence detection.
Since our models use pre-processed input containing 16 uniformly sampled temporal frames, the duration of a video and the number of frames per second can affect the model's performance. The MF dataset has significant fluctuations in the FPS values of the training videos (as seen in Table 2), which is not favorable for training our violence detection models. Additionally, this dataset has the fewest training samples compared to the others, and models trained solely on this dataset did not generalize well to other datasets. We hypothesize that these drawbacks of this dataset could be the reason for the decrease in the performance of the FT model (trained on the combined dataset) on this dataset. On the other hand, due to better generalizability, the TL model trained on the combined dataset performed well on this dataset.
In addition, our leave-one-out cross-validation test shows that the MF, RLVS, and RWF-2K datasets do not contribute significantly to model generalizability. The RLVS dataset mainly contains examples of two people fighting, which are also present in other datasets such as UCFS and XD-V. The RWF-2K dataset contains videos that are encoded at 30 frames per second, but we have observed that there are videos captured at very low fps, resulting in repeated frames to create 30 fps videos. Additionally, most examples in this dataset are repetitive in terms of environment and lighting conditions and lack diversity. However, it is important to note that the RLVS and RWF-2K datasets contain the highest number of examples, which can lead the model trained on the combined dataset to better represent scenarios in these datasets. We hypothesize that due to the aforementioned drawbacks specific to each of these two datasets, our models trained on the combined dataset did not perform very well on the RLVS and RWF-2K datasets.
Finally, our results show that models trained on our UCFS and XD-V datasets generalize better to other datasets (as seen in Figure 3). Also, when these datasets were excluded from training, the performance of our models dropped significantly, indicating that these datasets contain well-calibrated, diverse video footage, which is highly relevant for training practical deep learning algorithms for violence detection (as seen in Table 8). However, these datasets contain fewer training examples than RLVS and RWF-2K. Additionally, the UCFS and XD-V datasets contain forms of violence such as explosions and road accidents, which are not distinctly available in other datasets. Due to this, we hypothesize that our models trained on the combined dataset did not perform very well on the UCFS and XD-V datasets. Overall, with fewer false positives and false negatives, our FT model performed better on the combined dataset than the TL model.

E. EXPERIMENTS WITH VIDEO COMPRESSION
Depending on the available hardware resources, it may be necessary to stream the surveillance video to a remote server for actual classification and violence detection. Additionally, depending on the available network resources, there may not be sufficient bandwidth to stream the video in its native resolution and quality. In several fields where video streaming is involved, video compression techniques are commonly applied to reduce the video bit-rate, which can introduce artifacts in the video. To study the effect of such video artifacts on the performance of our TL and FT models, we generated compressed video streams with varying bit-rates: 300, 500, 1000, and 1500 Kbps.
For this experiment, we randomly selected two datasets, RWF-2K and CV, and compressed the testing videos from these two datasets. Multiple videos were generated with the different bit-rates using ffmpeg [105]. We used the models trained on the combined dataset for this experiment and the testing results are presented in Table 9. Our study shows that both TL and FT models did not show significant fluctuations in performance and performed well even under extreme compression (300 Kbps). This suggests that our trained models did not model the noise in the training videos and focused on learning the concept of violence.
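One way to generate such bit-rate-limited copies of a test video with ffmpeg is sketched below. The exact ffmpeg options, codec, and container used in our experiments are not listed in the text, so the command here is an assumption: it only sets the target video bit-rate with the `-b:v` option and leaves all other settings at the ffmpeg defaults. The function names and output naming scheme are hypothetical.

```python
import subprocess

BITRATES_KBPS = (300, 500, 1000, 1500)


def ffmpeg_command(src, dst, bitrate_kbps):
    """Build an ffmpeg call that re-encodes src at a target video bit-rate."""
    # -y overwrites dst if it exists; -b:v sets the target video bit-rate.
    return ["ffmpeg", "-y", "-i", src, "-b:v", "%dk" % bitrate_kbps, dst]


def compress_all(src, stem, run=False):
    """Return (and optionally execute) one command per target bit-rate."""
    commands = [
        ffmpeg_command(src, "%s_%dk.mp4" % (stem, b), b) for b in BITRATES_KBPS
    ]
    if run:  # requires ffmpeg to be installed and src to exist
        for command in commands:
            subprocess.run(command, check=True)
    return commands
```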

F. STANDALONE IMPLEMENTATION AND PERFORMANCE
We have implemented a standalone application for violence detection using the PyTorch deep learning library and using our FT and TL models that are trained on the combined dataset. The application design is outlined in Figure 14 and can be easily extended for usage in surveillance applications. The incoming video stream is divided into non-overlapping video segments of four seconds, from which 16 video frames are extracted per segment using uniform temporal sampling. These 16-frame blocks are pre-processed and then used as input for either the FT model or TL model to determine a violence coefficient for the current segment. The application was implemented on an Ubuntu Linux operating system using an AMD Ryzen Threadripper 1950X 16-core processor and an Nvidia GeForce GTX 1080 Ti GPU with the CUDA toolbox for running our trained PyTorch models.
Our results indicate that, when combined with block extraction and pre-processing, both the FT and TL models require an average of 0.06 seconds to infer a violence coefficient for each four-second video segment. The pre-processing was implemented on the CPU, consuming an average of 0.04 seconds. Therefore, the average time required to run the FT or TL model is 0.02 seconds. It should be noted that the dense or fully connected layers of the models consume minimal computational resources in practice. As a result, even though the TL model has more parameters than the FT model, the average time required to run both models is similar.
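The timing figures above imply a simple budget, which the following sketch works through. The model time is the total time minus the pre-processing time, and dividing the segment duration by the total time gives a rough real-time factor. The constant and function names are hypothetical, and the real-time factor is only a back-of-the-envelope estimate that assumes segments are processed one after another and ignores video decoding, I/O, and any contention for the CPU or GPU.

```python
SEGMENT_SECONDS = 4.0      # duration of one video segment
TOTAL_SECONDS = 0.06       # block extraction + pre-processing + inference
PREPROCESS_SECONDS = 0.04  # CPU pre-processing share


def model_seconds(total, preprocess):
    """Time left for running the FT or TL model on one segment."""
    return total - preprocess


def real_time_factor(segment, total):
    """How many times faster than real time a single stream is processed."""
    return segment / total
```

With the reported averages, the model itself accounts for about 0.02 seconds per segment, and one four-second segment is processed roughly 67 times faster than real time under the assumptions stated above.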
To give a thorough understanding of the performance of our standalone system, we have graphically represented the progression of violence coefficients over time using the FT model, which showed the best results on the combined dataset. In Figures 15 to 19, we illustrate our classification outcomes on selected video samples that exhibit the capability of our system in identifying violence. Each figure comprises three sections: the top row displays the ground truth graph of violence coefficients over time, where the coefficients are set to one during the occurrence of violence. The middle row illustrates the predicted violence coefficients by our FT model on a series of non-overlapping video segments with a duration of four seconds. The bottom row shows key frames extracted from the videos. Figures 15 to 18 demonstrate the performance of our standalone system on video clips from the testing set of the original UCF-Crime dataset. These video clips include different scenarios such as instances of violence amidst normal events (as illustrated in Figure 15), multiple occurrences of violence (as illustrated in Figure 16), a crowd engaging in violence at a metro station (complex and long video sequence as illustrated in Figure 17), and a single, short instance of violence in the form of shooting (as illustrated in Figure 18). The predicted violence coefficients align closely with the ground truth, indicating the algorithm's capability to accurately identify and predict instances of violence in video segments of various lengths and complexities.
To further evaluate our system, we also created a video sequence by combining random video clips from the Smart-City CCTV Violence Detection Dataset [106], which was not used in our study. As shown in Figure 19, our system performs very well on this compiled sequence as well. This illustrates the adaptability of our algorithm and its capability to perform well on new and unseen data. Figures 15 to 19 also demonstrate the areas where our standalone system falls short, which require further improvement in future research. We have noticed certain situations where our tested FT model triggers false alarms in the standalone implementation. For instance, when a person suddenly starts running or crawling (as shown in the keyframes from the 28th second in Figure 15), it is detected as violence, but with a lower violence coefficient. In the original UCF-Crime dataset, activities such as crawling or sudden fleeing are not considered violence. Nevertheless, in real-world surveillance scenarios, such actions may appear suspicious and require more investigation.
In situations where there is occlusion and the individuals or objects engaged in violence are only partially visible, the model may have difficulty identifying the violence. This can be observed in the predicted violence coefficients between the 32nd and 36th second in Figure 16, where a person is holding a gun that is only partly visible because it is hidden by his body.
As previously noted, videos that include people in crowds situated closely together can lead to inaccuracies in our system. Figure 17 between seconds 76 and 80 illustrates this scenario, where the predicted violence coefficient suddenly falls to zero even though violence is happening during this time. As mentioned earlier, our training dataset has limited examples of crowds, and including more such examples in future work is suggested.

V. CONCLUSION AND FUTURE WORK
In this work, we addressed the problem of efficient violence detection for automated surveillance applications by adapting the computationally lightweight X3D-M deep learning architecture for learning and detecting violence patterns from videos. We proposed two architectures, FT and TL, for classifying video clips containing violence, which leverage action recognition features learned from the Kinetics-400 dataset.
In order to perform a detailed analysis and performance evaluation of the proposed approaches, we collected and extended seven different datasets in our study. In the past, several deep learning-based methods for violence detection have focused on datasets involving mostly fighting between two or more people for experiments. However, it is important to note that the spectrum of actions and visual patterns representing violence is far wider. For example, violence happening between a group of people in the form of a fight is visually very different from violence involving the use of objects such as a gun or violence involving explosions.
To also incorporate such cases, we annotated several videos from the UCF and XD-Violence datasets for our experiments.
Using our collected videos, the FT model optimizes the X3D-M parameters learned from the Kinetics-400 dataset, while the TL model extracts spatio-temporal features first, without modifying the X3D-M parameters (trained on the Kinetics-400 dataset), to train multiple fully connected layers. Our experiments with individual datasets show that both models performed well in terms of ACC and AUC scores on the collected datasets. However, the FT model performed better than most of the state-of-the-art methods on popular datasets with relatively fewer model parameters.
In the previous works on violence detection, cross-dataset evaluations have not been thoroughly studied. We argue that these evaluations are crucial for understanding the prominence of various datasets as well as developed deep learning models. In this work, we bridge this gap by providing comprehensive evaluations, including one-on-one cross-dataset validation and leave-one-out cross-validation. Our cross-dataset tests showed that the TL model generalizes better to unseen scenarios than the FT model. However, when tested on the combined dataset, the FT model achieved better performance, while the TL model produced a higher number of combined false positives and false negatives. Further tests on individual datasets show that models trained on the combined dataset did not perform well in several cases when compared to the performance of models trained on individual datasets. This highlights the inconsistencies in the publicly available datasets for violence detection. Additionally, results from comparisons with several methods in literature have shown limitations of both the developed methods and existing datasets.
We note that the existing public datasets for violence detection are inconsistent in terms of the video duration, FPS, the number of videos available for training and testing and the forms of violence depicted. Furthermore, these existing datasets are not particularly representative of surveillance applications. Our results indicate that there is a great need for the development of diverse and meaningful large-scale datasets, including footage from real-world surveillance, to make these technologies practically feasible. In the future, we plan to take steps towards constructing such a large-scale dataset. Once such a dataset is available, we also plan to re-evaluate the models presented in this work for more general results.
We also presented a computationally light and functional standalone system architecture for implementing the proposed models in practical surveillance applications. In this architecture, we extracted and evaluated non-overlapping video segments having a duration of four seconds from the incoming video stream. This strategy may fail in cases where an event of violence begins at the end of a segment and ends before the end of the next segment. In the future, we also plan to develop smart strategies to handle such scenarios, such as reducing the size of video segments adaptively and/or using overlapped segments. The main focus in developing such strategies will be on achieving the best computational speed and accuracy trade-off.
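The boundary problem described above, and the effect of using overlapped segments, can be illustrated with a small sketch. It lists the start times of four-second segments for a given stride and checks whether a short event falls entirely inside at least one segment. This is only an illustration of the idea of overlapping segments, not a strategy that has been implemented or evaluated in this work; the function names, the two-second stride, and the example event times are assumptions.

```python
def segment_starts(duration, segment=4.0, stride=4.0):
    """Start times of all full segments; stride < segment gives overlap."""
    starts = []
    t = 0.0
    while t + segment <= duration + 1e-9:
        starts.append(t)
        t += stride
    return starts


def fully_covered(event_start, event_end, starts, segment=4.0):
    """True if some segment contains the whole event."""
    return any(s <= event_start and event_end <= s + segment for s in starts)
```

For a 12-second stream, an event lasting from 3.5 s to 5.0 s straddles the boundary between the first two non-overlapping segments, so neither segment sees it completely. With a stride of two seconds, the segment starting at 2.0 s contains the whole event, at the cost of evaluating roughly twice as many segments.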