Augmenting the Robustness and Efficiency of Violence Detection Systems for Surveillance and Non-Surveillance Scenarios

Violence detection holds immense significance in ensuring public safety, security, and law enforcement in various domains. With the increasing availability of video data from surveillance cameras and social media platforms, the need for accurate and efficient violence detection algorithms has become paramount. Automated violence detection systems can aid law enforcement agencies in identifying and responding to violent incidents promptly, thereby preventing potential threats and ensuring public protection. This research focuses on violence detection in large video databases, proposing two keyframe-based models named DeepkeyFrm and AreaDiffKey. The keyframe selection process is critical in violence detection systems, as it reduces computational complexity and enhances accuracy. EvoKeyNet and KFCRNet are the proposed classification models that leverage feature extraction from optimal keyframes. EvoKeyNet utilizes an evolutionary algorithm to select optimal feature attributes, while KFCRNet employs an ensemble of LSTM, Bi-LSTM, and GRU models with a voting scheme. Our key contributions include the development of efficient keyframe selection methods and classification models, addressing the challenge of violence detection in dynamic surveillance scenarios. The proposed models outperform existing methods in terms of accuracy and computational efficiency, with accuracy results as follows: 98.98% (Hockey Fight), 99.29% (Violent Flow), 99% (RLVS), 91% (UCF-Crime), and 91% (ShanghaiTech). ANOVA and Tukey tests were performed to validate the statistical significance of the differences among all models. The proposed approaches, supported by the statistical tests, pave the way for more effective violence detection systems, holding immense promise for a safer and more secure future. As violence detection technology continues to evolve, our research stands as a crucial stepping stone towards improved public safety and security in the face of dynamic challenges.


I. INTRODUCTION
Increased criminal activity in the 21st century has resulted in more loss of life and property [1] compared to other human-centered issues. Intelligent surveillance systems are among the most important methods for detecting abnormal human activities at early stages. These systems can automatically detect and generate reports of anomalous human activities, which is vital in maintaining public safety indoors and outdoors. Over the past few decades, many distributed surveillance cameras have been installed in public spaces such as hospitals, prisons, airports, and public parks to guarantee national security [2]. Manually analyzing the large amounts of surveillance footage produced by these cameras is complicated, tedious, error-prone, and costly. Therefore, using computer vision technology to detect abnormal events automatically is effective and efficient. However, it presents many challenges, the most notable of which are different lighting levels, the appearance of the person being photographed, and the distance of the viewpoint from the camera. Therefore, in today's age of technology, intelligent surveillance technology is needed. These technologies can detect abnormal events quickly and accurately and generate a report to alert the concerned authorities [3], [4]. Detecting abnormal scenes in surveillance video recorded by moving or static cameras must contend with target overlap, cluttered images, partial or complete occlusion, fixed-pattern noise, low video pixel density, and appearance changes under poor stability and lighting conditions. Vision sensors that cover many entities, including targets acting abnormally alongside ordinary passers-by, make it more difficult to detect anomalies using visual inputs. For example, a runner is considered normal in one scenario (a soccer field) but ambiguous in another (a shopping mall). This makes it difficult to explicitly collect exceptional samples and define exceptions for an AI model, as exceptional events are unlimited, rarely occur, and are not well defined. Normal events in surveillance videos are easier to manage than collecting data on abnormal events. Detecting data anomalies involves looking for data points that do not conform to typical patterns, and anomaly detection methods must handle a wide range of anomaly types [5], [6], [7]. Many traditional image-descriptor-based approaches [8], [9] have been developed for violent scene detection. However, because these feature descriptors are not invariant to illumination, translation, rotation, and scale, such systems fail to perform well in challenging environments. Moreover, today's systems tend to specialize in particular domains, which limits their ability to classify the variety of surveillance anomalies. Anomaly detection, such as the detection of violence in images and videos, is a hot topic of research in AI and is of interest to people with diverse research backgrounds. Learning-based AI models can be divided into three categories: supervised learning, which requires labeled data; semi-supervised learning, which requires partially labeled data; and unsupervised learning, which requires no labels. Clustering, frame reconstruction, and techniques based on future prediction are used for unsupervised anomaly detection [5], [10], [11] to locate anomalies when labels are not present in the training data [5], [10]. These methods do not work well against complex real-world surveillance video data. In this paper, we present a supervised anomaly detection technique as a way to overcome these limitations.
The technique's training set contains both normal and abnormal data so that both types of events can be identified. Weak supervision techniques, in particular, provide only video-level labels for normal and abnormal events in the training sets, addressing the anomaly detection problem comparatively well because weak supervision requires less annotation information. The researchers demonstrated in [3] that weakly supervised classification is a viable approach to building an anomaly detection system. In a multiple instance learning system, a collection of video frames is often called a bag, and the individual frame features are considered instances. The next step in the system is to learn instance-level exception labels based on the bag-level annotations. The concept of anomalies as events that do not conform to expected normal behavior [12] has been the basis for the successful application of semi-supervised techniques. In streaming data, events that do not follow a typical pattern are called exceptions.
Many previous studies have made semi-supervised anomaly detection their main research topic. The primary goal of such models is the development of a system or representation that can capture an object's normal motion patterns and visual appearance [13]; e.g., the authors of [8], [14], and [15] used the trajectories of movable objects to represent the hidden patterns of a certain object of interest. Patterns that deviate significantly from these learned models are flagged as anomalies, also known as outliers. Because they consider only visual patterns and ignore the objective, trajectory-based techniques perform poorly in challenging environments, such as crowded scenes. Dictionary learning and sparse coding are two other video anomaly detection methods that have gained attention in recent years [16], [17]. In these schemes, normal scenes are encoded into a learned dictionary, and events that cannot be represented well by it are treated as exception events. Upon completion of model training, the model is evaluated on the test data, and the events are classified as normal or abnormal with a minimal false classification rate. The most notable disadvantage of such methods is the time complexity of optimally computing the sparse coefficients, which is aggravated by factors such as weather and lighting.
In this research, two keyframe-based models have been proposed for the task of violence detection in large video databases: DeepkeyFrm and AreaDiffKey. The proposed models aim to address the challenge of selecting keyframes, a critical step in violence detection systems. The keyframe selection process is necessary to reduce the computational complexity of the system and improve the accuracy of violence detection, and the proposed models tackle this challenge in different ways. EvoKeyNet is a proposed deep learning-based model that uses an evolutionary algorithm to select optimal feature attributes from the keyframes' learnable features. The algorithm is based on a fitness function that considers both the quality and diversity of the CNN (Convolutional Neural Network) feature attributes. The selected optimal features are then fed into a deep neural network to perform violence classification.
KFCRNet is another proposed deep learning-based model that uses a multi-model CNN to extract features from the keyframes. The features are passed through the evolutionary search method and then fed into an ensemble of three Recurrent Neural Networks (RNNs), i.e., Long Short-Term Memory (LSTM), Bi-Directional Long Short-Term Memory (Bi-LSTM), and Gated Recurrent Unit (GRU), which are individually trained on the optimal features, followed by a majority voting scheme to classify a video frame into its corresponding class. The proposed models have been extensively evaluated on large video databases, and the results show that they outperform existing violence detection methods in terms of accuracy and computational efficiency. The proposed models have the potential to make a significant impact in the field of violence detection and can be used to develop real-world systems for detecting violence in videos.
One of the primary advantages of our proposed study lies in the selective use of video data for model training. Through AI-driven and image processing approaches, we efficiently identify the optimal data for model training, avoiding the need to incorporate the entire video dataset. Additionally, our models' feature extraction is performed using a CNN feature learning block that generates a 1 × 1000 feature vector, in which not all features are equally important. To resolve this issue, we employ an evolutionary search method to select the most optimal features. As a result, our models are trained on a minimal yet optimal set of features and keyframes, protecting them from overfitting and also enhancing their computational efficiency.
In this research, we present significant contributions in the realm of violence detection through the development of two keyframe selection methods and two novel classification models. The core contributions of our work are as follows:
• This study focuses on enhancing the efficiency of our approach by reducing model training time. To achieve this, we employ a two-fold strategy. Firstly, we minimize computational costs by reducing the number of frames to avoid redundancy and lower the computational burden. Secondly, we utilize an evolutionary search scheme to identify the most relevant and discriminative feature attributes for violence detection while eliminating irrelevant or redundant information.
• We have developed two methods for selecting keyframes from video, namely DeepkeyFrm and AreaDiffKey. These methods are designed to choose the optimal frames while avoiding duplication and redundancy. This ensures that we extract the salient frames; the reduced set of frames not only speeds up the training process but also helps avoid model overfitting.
• Our first approach is the EvoKeyNet classification model, which leverages feature extraction from salient frames using two state-of-the-art CNNs, i.e., EfficientNet and Inception-ResNet-v2. An evolutionary search method is applied to the fused features, and the most relevant optimal feature attributes are selected for efficient and accurate model training.
• The second violence detection approach is KFCRNet, which involves the extraction and fusion of learnable features from the salient frames. An evolutionary search method is applied to select the optimal feature attributes, and the feature vector with the optimal features is passed to an ensemble RNN model consisting of LSTM, Bi-LSTM, and GRU classifiers, followed by a voting scheme to avoid false predictions of violence in videos.

II. LITERATURE REVIEW
Currently, global, local spatiotemporal, and deep learning features are available for anomaly detection in video [3]. Global spatiotemporal features examine only the changes that occur across the entire video image instead of focusing on individual objects in the immediate vicinity. This strategy yields promising results when applied to anomaly detection tasks. In addition, global spatiotemporal features, such as the histogram of optical flow (HOF) and the motion boundary histogram [18], are exceptionally descriptive concerning the extraction and encoding of scene motion features [19]. In the proposed spatiotemporal point-of-interest method, HOF and HOG (histogram of oriented gradients) features are extracted from these points of interest in order to differentiate between different action categories [18]. The authors trained a descriptor combining HOF and HOG to identify anomalous behavioural patterns within the video. SIFT (Scale-Invariant Feature Transform) and HOG have been extended to create the new features SIFT3D [20] and HOG3D [21], respectively. It has been proposed that signal-based descriptors such as the Violent Flow (ViF) descriptor [22] and the Weber Local Descriptor (WLD) [23] can detect violence in videos; however, their performance is limited in scenes with a large number of individuals [24]. The authors propose two HOG and HOF variants, named the local histogram of oriented gradients (LHOG) and the local histogram of optical flow (LHOF). These descriptors encode appearance and motion information, respectively, at the ROI level.
Using the standard BoW (bag of words) method, they trained a fixed-dimensional representation of the video clips and classified violent and non-violent events using a support vector machine. There are multiple patterns (intra-class variations) of violent and non-violent behaviour, which makes this method an unusual choice for anomaly detection tasks. However, it outperforms brute-force detection in a supervised environment by a significant margin. Some older methods [25], [26], [27] used hidden Markov models (HMMs) to discover local patch patterns in cluttered scenes, generating multiple models to discover patch-level motion patterns. This is accomplished using a sliding-window method, which is a computationally intensive process in real time. In the proposed BoEM-based approach, one set of models is trained using video fragments of normal behaviour (fragment-level modelling), and another set is trained using video fragments of abnormal behaviour to predict fragment-level (event) scores. Because we are not particularly concerned with cluttered scenes, the proposed strategy for training multiple models does not involve the use of local patches.
The proposed method, in contrast to training multiple models using local patches, focuses on training multiple models separately for normal and abnormal events. This is done to account for the within-class variation that is characteristic of both normal and abnormal events. The work in [28] suggests that a multi-view parameter-free framework (MPF) can facilitate the identification of consistent groups within a population. Using the L1 and L2 norms, two new iterations of the algorithm for multi-view clustering have been developed.
This method circumvents the current limitations of coherent group detection by employing new structural context descriptors based on the attribute designs of individual targets in complex scenarios. By tracking the extraction of more refined, lower-level information from dynamically crowded scenes, this method demonstrates its superiority over the most advanced group detection methods currently available; this type of information includes direction and context-based similarity [29]. A framework for online anomaly detection is also proposed, in which the support area is managed in response to dynamic changes in the scene. This method identifies features that enhance online performance by concentrating on features within a restricted processing area, which helps narrow the search [30]. Encoding motion dynamics necessitates using low-level characteristics, such as optical flow information. In the final step, bags of words and Gaussian mixture models are employed to identify anomalous events. Local spatiotemporal features provide an abstraction of a scene's behaviour by directly analysing shape, trajectory, and size information at the object level. Trajectory analysis is an effective technique for describing video anomalies: to identify visual anomalies, the trajectory-based method compares the degree of similarity or distance between the generated clusters [31]. Another study proposed a semi-supervised learning technique for detecting violent behaviour, which simultaneously trains a single dictionary and a linear classifier. The objective function of the dictionary learning combines the reconstruction loss and representation constraints of expensive labelled and inexpensive unlabelled data in order to increase the dictionary's discriminatory power [32]. To circumvent the constraints imposed by k-nearest-neighbour classification, the authors propose using a group of prototype objects as representations and employing weighted combinations of different types of similarity. The authors of the research paper [33] propose a multi-dictionary-based method for hyperspectral anomaly detection to circumvent its current limitations, which include dealing with large spectral dimensions and the difficulty of obtaining spectral correlations.
By training sparse representations based on multiple dictionaries and applying this training to different background scenarios for remote sensing, it is possible to acquire discriminative features for anomaly detection [23]. The authors used maximum pooling in conjunction with sparse coding to extract the distinguishing characteristics from the video. A new Motion Weber Local Descriptor was proposed as a possible solution for identifying abnormal motion in video sequences; low-level appearance-based features and kinematic-dynamics-defining components are added to train the Weber Local Descriptor (WLD). Consequently, identifying violent video content using manual feature extraction techniques is no longer problematic. In addition, these results demonstrated that the WLD descriptor accurately captured motion near the camera. In recent years, long short-term memory (LSTM) networks have been used to solve various issues in speech recognition, natural language processing, and motion recognition. LSTM was created to solve the gradient vanishing or explosion problem, which had previously plagued the deep learning research community. The authors of the research paper [34] present multiple autoencoder models in which local spatiotemporal and depth features are investigated.
The first autoencoder acquires knowledge using conventional spatiotemporal features, while the second acquires knowledge end-to-end via a convolutional feedforward architecture. In the research [35], CNNs are trained to use semantic information, extracted directly from the videos, to detect suspicious video events. Meanwhile, the authors of the paper [36] achieved a similar outcome by employing a network pre-trained on the ILSVRC benchmark dataset [37], [38]. The proposed sparse coding method employs an RNN to optimize the parameters and enhance the ability to predict anomalous events. The authors of [39] used generative adversarial networks to reconstruct appearance and motion representations to identify anomalous video events. This technique uses the optical flow map of normal frames to model the network; eventually, deviations from the normal model are determined using measurements of local differences. A brute-force strategy for detecting video activity is proposed in [40] as a bidirectional C-LSTM architecture that takes frame differences as its input. The authors encoded spatial information with the VGG16 architecture [41] of CNNs and derived and encoded temporal dynamics with bidirectional convolutional LSTM. They also propose a new technique for bidirectional temporal coding and maximal feature-by-feature pooling as an alternative to current violence detection models based on spatial-temporal coding. This strategy employs data augmentation because, as stated previously, deep models require a substantial quantity of data. We prioritized applying simpler models over representations based on deep learning in this work, since such representations require voluminous training data to learn differentiated and compact event embeddings.

III. METHODOLOGY
In this section of the article, the proposed keyframe selection approaches and the transfer learning approach are described in detail. Since we are working with CNNs, the frames must be resized to a uniform CNN input size, which varies by network. Separating the video into its individual frames is the initial step of any video analysis. The next step is to divide the available data into three sets: the training set, the validation set, and the test set; 70% of the video data is utilized for the training set, while the remaining 30% is used for model validation and testing. The validation set is used for parameter tuning during the CNN model training phase, while the test set is used to evaluate the performance of the models. An unsupervised learning technique involving a CNN as a feature learning block and a distance function is applied to the video data. This technique examines two consecutive video frames against a predefined threshold to distinguish keyframes with distinctive content from redundant frames that replicate existing video content. The newly developed feature learning block efficiently extracts robust features from images. Only the foreground regions that differ between successive video frames are detected using this proposed method. The second keyframe selection method, which is also novel, calculates the difference between two consecutive video frames; the total number of pixels in the foreground region is used to determine the area, a region property of the image, and a thresholding scheme determines whether a frame should be classified as redundant or as a keyframe. Developing a predictive model for violence recognition requires the application of state-of-the-art CNNs as the final step. When training a pre-trained CNN model, the model's previous learning is overwritten; the model consists of features, weights, labels, and bias values, and its training involves calculating the weights for a user-supplied dataset.
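To make the preprocessing concrete, the following is a minimal Python sketch of the frame extraction and data split described above, assuming OpenCV for decoding; the 224 × 224 frame size and the helper names are illustrative, while the 70/30 ratio follows the text.

```python
import cv2
import random

def extract_frames(video_path, size=(224, 224)):
    """Decode a video and return its frames resized to a uniform CNN input size."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:                       # end of stream
            break
        frames.append(cv2.resize(frame, size))
    cap.release()
    return frames

def split_dataset(items, train_ratio=0.7, seed=42):
    """Shuffle and split items into a 70% training set and a 30% hold-out
    set (to be divided further into validation and test subsets)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]
```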

A. DEEP KEYFRAMES (DEEPKEYFRM)
A keyframe is a frame with distinctive features that does not contain redundant video content. The robust image features are first detected and extracted for keyframe selection using a descriptor. In the proposed scheme, depicted in figure 1, a novel feature learning block is developed, which computes the image features of all frames in a video. Euclidean distance is a similarity measure used to find the distance between two data points; if the distance between two points is small, the points have similar characteristics, whereas a greater distance shows that the frame is less similar and can be selected as a keyframe. A thresholding scheme is applied and tuned according to the nature of the video; in the case of a video with a large crowd, a fast-moving camera, or a high frame rate, the threshold value should be kept small to capture minor changes between video frames. The threshold is tuned to select the most optimal value that will effectively select the keyframe data.
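A minimal sketch of the DeepkeyFrm logic is shown below, assuming a `feature_extractor` callable that stands in for the proposed CNN feature learning block; the comparison of consecutive frames and the tunable threshold follow the description above.

```python
import numpy as np

def deep_keyframes(frames, feature_extractor, threshold):
    """Select keyframes whose CNN features differ from the previous frame
    by more than a Euclidean-distance threshold; near-duplicate frames
    are treated as redundant and skipped."""
    keyframes = [frames[0]]                       # always keep the first frame
    prev_feat = feature_extractor(frames[0])
    for frame in frames[1:]:
        feat = feature_extractor(frame)
        if np.linalg.norm(feat - prev_feat) > threshold:
            keyframes.append(frame)               # distinctive content
        prev_feat = feat                          # compare consecutive frames
    return keyframes
```

For crowded or fast-moving footage, `threshold` would be set smaller, as noted above, so that minor changes between frames are still captured.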

B. FOREGROUND AREA BASED KEY FRAMES (AREADIFFKEY)
A novel method employing a hybrid strategy has been proposed as a means of identifying the key video frames within a video. Figure 2 shows the overall workflow of the foreground-pixel-based keyframe selection method. Background subtraction is one of the most common and straightforward techniques for identifying foreground content in real-time video processing. The background subtraction method uses the first frame as the background frame (A), while the frame that immediately follows is the foreground frame (B). When two video frames are compared using the differencing method, the unique information from each frame is calculated and stored in a third frame (C). The objective of this research work is to create an unsupervised decision system that can classify video frames as either keyframes or redundant frames. To calculate the area of the resulting foreground video frame (C), one must first determine the area, which is a regional geometrical feature. A thresholding scheme is applied to the foreground region of image (C); if the foreground pixel count is greater than the given threshold value, the frame is selected as a keyframe; otherwise, it is discarded from the dataset.
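The following OpenCV-based sketch illustrates the AreaDiffKey decision rule under the stated workflow; the binarization level `bin_thresh` used to produce the foreground mask is an illustrative parameter not specified in the text.

```python
import cv2
import numpy as np

def area_diff_keyframes(frames, area_threshold, bin_thresh=25):
    """Select keyframes by thresholding the foreground area of the
    difference frame C computed from frames A (background) and B."""
    keyframes = [frames[0]]
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)     # frame A
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)     # frame B
        diff = cv2.absdiff(gray, prev)                     # frame C = |B - A|
        _, fg = cv2.threshold(diff, bin_thresh, 255, cv2.THRESH_BINARY)
        area = int(np.count_nonzero(fg))                   # foreground pixel count
        if area > area_threshold:                          # enough new content
            keyframes.append(frame)                        # else: redundant, discarded
        prev = gray
    return keyframes
```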

C. EVOKEY NET
The ''EvoKeyNet'' model is a state-of-the-art multi-modal approach for violence detection in video sequences. Figure 3 shows the basic framework of the proposed EvoKeyNet model. It was developed with the goal of reducing the computational complexity of violence detection while maintaining high accuracy and efficiency. The model's methodology involves keyframe selection, feature extraction, feature selection, and classification. Keyframe selection is the first step in the process, where important frames that capture essential content and information are identified. This reduces computational complexity by processing only the most relevant frames. The ''EvoKeyNet'' model selects keyframes using a pre-trained CNN to extract features from video frames and the Euclidean distance measure to compare their similarity. The frames with the least similarity are chosen as keyframes, as they are more likely to capture unique and relevant information.
After keyframe selection, high-level semantic features are extracted from the keyframes using two state-of-the-art CNN models, EfficientNet and Inception-ResNet-v2. These extracted features then undergo a feature selection process using an evolutionary algorithm, which identifies the most important features among them. This helps to reduce the dimensionality of the features and improve the efficiency of the model. The final step is classification, where a CNN classification model is used to classify video sequences into violence and normal classes. The CNN model is trained using the best features selected in the feature selection process as inputs. It consists of three layers: a fully connected layer, a softmax activation layer, and a classification output layer. The fully connected layer maps input features to a high-dimensional space, the softmax activation layer normalizes the outputs, and the classification output layer predicts the class label of the input video sequence. The ''EvoKeyNet'' model is evaluated on benchmark datasets for violence detection in video sequences, where it achieves high accuracy and efficiency compared to other state-of-the-art models, demonstrating the effectiveness of the keyframe selection, feature extraction, and classification techniques used.
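A compact sketch of the evolutionary feature selection step is given below; the genetic operators (truncation selection, one-point crossover, bit-flip mutation) are common choices assumed here, and `fitness` stands in for the paper's quality-and-diversity fitness function, e.g. the validation accuracy of a classifier trained on the masked feature columns.

```python
import numpy as np

def evolve_feature_mask(fitness, n_features, pop=20, gens=30, mut_rate=0.02, seed=0):
    """Genetic search over binary feature masks; fitness(mask) returns a
    score to maximize for the features selected by the mask."""
    rng = np.random.default_rng(seed)
    population = rng.integers(0, 2, size=(pop, n_features))
    for _ in range(gens):
        scores = np.array([fitness(m) for m in population])
        parents = population[np.argsort(scores)[::-1][:pop // 2]]  # keep fittest half
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)                      # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < mut_rate               # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        population = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(m) for m in population])
    return population[scores.argmax()]                             # best mask found
```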

D. KFCRNET
The KFCRNet model is an architecture that combines CNN and RNN models to detect violence in video sequences. The CNN provides a robust method for feature extraction, while the RNN allows the model to understand the temporal information present in the video. The combination of the two models allows KFCRNet to effectively handle the variability in video frames, leading to improved performance and accuracy. The KFCRNet model is a powerful and effective solution for violence detection that takes into account both the spatial and temporal information present in the video frames.
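A minimal Keras sketch of this CNN-features-into-ensemble-RNN idea follows; the 16-step sequences of 256-dimensional keyframe features and the 64-unit recurrent layers are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def make_branch(cell, timesteps=16, n_feats=256, bidirectional=False):
    """One ensemble branch: a recurrent layer over keyframe feature
    sequences plus a softmax head for violence / non-violence."""
    rnn = cell(64)
    if bidirectional:
        rnn = layers.Bidirectional(rnn)
    return keras.Sequential([
        keras.Input(shape=(timesteps, n_feats)),
        rnn,
        layers.Dense(2, activation="softmax"),
    ])

# three individually trained branches: LSTM, Bi-LSTM, and GRU
branches = [
    make_branch(layers.LSTM),
    make_branch(layers.LSTM, bidirectional=True),
    make_branch(layers.GRU),
]

def majority_vote(models, x):
    """Hard-voting scheme over the branch predictions."""
    votes = np.stack([m.predict(x).argmax(axis=1) for m in models])
    return (votes.sum(axis=0) > len(models) / 2).astype(int)
```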

E. HYPERPARAMETER TUNING
Hyperparameter tuning is the process of optimizing the hyperparameters of a model to achieve the best performance on the task it was designed for. In the case of our violence detection models, the hyperparameters to tune include the number of epochs, batch size, learning rate, and the number of nodes in the RNN layer. The number of epochs determines the number of times the model will be trained on the entire dataset. In our effort to optimize the performance of the violence detection models, thorough hyperparameter tuning was conducted. For EvoKeyNet, which consists of CNNs, we explored various configurations incorporating batch sizes of 16, 32, and 64 with numbers of epochs ranging from 10 to 100. In addition, various optimization methods were evaluated, including RMSProp, ADAM, and SGDM, for training both models. Our base learning rate was set at 0.01, while the impact of various dropout ratios was also assessed; a dropout ratio of 0.5 was found to be the best among 0.3, 0.4, and 0.5. Similarly, for the KFCRNet model, the same hyperparameter configurations were applied as those used for EvoKeyNet. This careful selection allowed us to choose the most optimal hyperparameters for both EvoKeyNet and KFCRNet, contributing to improvements in efficiency and accuracy for the violence detection models.
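The grid implied by this description can be sketched as follows; `train_and_eval` is a hypothetical stand-in for the actual (MATLAB-based) training routine, and the epoch values are samples from the stated 10-100 range.

```python
from itertools import product

def train_and_eval(batch_size, epochs, optimizer, learning_rate, dropout):
    """Hypothetical stand-in: train a model with this configuration and
    return its validation accuracy."""
    return 0.0  # replace with real training + evaluation

batch_sizes = [16, 32, 64]
epoch_counts = [10, 25, 50, 100]            # sampled from the 10-100 range
optimizers = ["rmsprop", "adam", "sgdm"]
dropouts = [0.3, 0.4, 0.5]

best_cfg, best_acc = None, 0.0
for bs, ep, opt, dr in product(batch_sizes, epoch_counts, optimizers, dropouts):
    acc = train_and_eval(bs, ep, opt, learning_rate=0.01, dropout=dr)
    if acc > best_acc:
        best_cfg, best_acc = (bs, ep, opt, dr), acc
print("best configuration:", best_cfg, "accuracy:", best_acc)
```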

F. PERFORMANCE EVALUATION METRICS
Performance metrics are a part of every machine learning pipeline. They tell you whether you are making progress and put a number on it. All machine learning models, whether for classification or regression, need a metric to judge performance. Classification problems are among the most widely researched areas, with use cases in almost all production and industrial environments. The proposed approach is a classification-based model that automatically assigns a class label to each video frame as violent or non-violent. The actual labels of the video frames and the model's predicted labels are compared, and as a result a confusion matrix is created. The confusion matrix has four measures, i.e., the TP, FP, FN, and TN counts; from these measures, various performance metrics are calculated, such as accuracy, precision, recall, and F1-score. To assess the effectiveness of the anomaly detection frameworks, we employed two widely used evaluation metrics: the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve) [3], [6], [26], [27]. The ROC curve visually illustrates how well a classifier performs by plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold values. TPR measures the proportion of positive samples that are correctly identified, while FPR measures the proportion of negative samples that are incorrectly flagged as positive. A higher TPR together with a lower FPR indicates a better model. The AUC score summarizes the ROC curve by measuring how often the classifier ranks a randomly chosen positive case higher than a randomly chosen negative case; a higher AUC score signifies better model performance. Additionally, it is crucial to consider the false alarm rate in real-world surveillance scenarios where most videos are normal; for such scenarios, a robust model should have a low false alarm rate. Apart from the ANOVA and Tukey statistical tests, we also evaluated our proposed models using these metrics.
The confusion matrix in figure 5 visually represents the performance of a violence detection system. It shows how accurately the system predicts instances of violence (Class A) and normal behavior (Class B). The matrix's diagonal elements (TP for violence and TN for normal) reflect correct predictions, while the off-diagonal elements (FP and FN) reveal instances of misclassification, helping assess the system's precision and recall. We conducted the statistical tests to evaluate the significance of the outcomes achieved by the suggested frameworks and to identify the best-performing model for anomaly detection.
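For reference, all of the reported metrics follow directly from the four confusion-matrix counts; a small self-contained sketch:

```python
import math

def metrics_from_confusion(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom             # Matthews correlation coefficient
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}
```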

G. SIMULATION ENVIRONMENT AND HARDWARE SPECIFICATION
The proposed models are developed, trained, and validated using MATLAB 2021. MATLAB has extensive support for image and video processing, and its rich machine learning and deep learning toolboxes make it easy to develop new algorithms and validate them efficiently. The Deep Network Designer app, a GUI-based application, allows fast development of models and setting of model parameters, helping researchers implement their ideas cleanly. All simulations were performed on an HP Z-440 workstation with a 12-core CPU and 48 GB of DDR4 RAM. The workstation is also equipped with an Nvidia RTX 2070 Super GPU with 8 GB of memory to speed up the training and validation process.

IV. EXPERIMENTAL RESULT
In the first phase of the proposed approach, two distinct schemes are used for selecting the keyframes from the video. The aim is to select the optimal frames that do not contain redundant information, reducing the dataset size to make the training/validation scheme efficient and to avoid overfitting issues. Table 1 lists the dataset names, the total frames in each class, and the number of selected keyframes from each class. The proposed classification approaches are KFCRNet and EvoKeyNet, which use two state-of-the-art CNN models, EfficientNet and Inception-ResNet-v2, to extract features. An ensemble classification model with a voting scheme is then applied to make the final decision on whether a frame is violent or normal. KFCRNet is a combination of CNN features and ensemble RNNs (LSTM, Bi-LSTM, and GRU). The other approach, EvoKeyNet, uses an evolutionary algorithm for feature selection, with classification performed by a CNN classification block. The Hockey Fight, Violent Flow, Real Life Violence Situations, ShanghaiTech, and UCF-Crime datasets were used to validate the proposed models. Table 2 gives details about the keyframe selector, the number of keyframes, and the cross-validation split of frames used for model training and performance evaluation. In the evaluation phase, the performance comparison between the proposed models, KFCRNet and EvoKeyNet, was performed using the Receiver Operating Characteristic (ROC) curve and statistical analysis including Analysis of Variance (ANOVA) and Tukey tests. These evaluations were conducted to thoroughly assess the effectiveness of the models and provide insights into their relative performance. The results were used to determine the optimal model for violent frame classification.

A. TRAINING DETAIL
In this research, we employ two classification models trained on keyframes and on optimal feature attributes extracted from learnable features. The first model is a CNN classifier, while the second is an ensemble of RNNs. In total, we train four models, with both the CNN and the RNN ensemble being trained on the two keyframe sub-datasets: DeepKeyFrm and AreaDiffKey. The hyperparameters utilized for training the classification models can be found in Table 3, which presents a comprehensive overview of the specific hyperparameters employed during the training process, ensuring consistency.

B. NON-SURVEILLANCE DATASET
The proposed frameworks for violence detection were evaluated on three non-surveillance benchmark datasets: the Hockey Fight dataset [40], the Violent Flow dataset [41], and the Real Life Violence Situations (RLVS) dataset [42]. The Hockey Fight dataset is a collection of 1000 video clips of violent events that occurred during National Hockey League matches, labeled as fight or no fight and well balanced with 500 clips in each class. The Violent Flow dataset is more challenging, containing 246 video clips of both violent and non-violent scenes divided equally between the two classes; the length of the videos varies from 25 to 200 frames, and the resolution of the clips is 320 × 240 pixels. The videos in these datasets were captured using both static cameras and mobile phones. The results of this evaluation allow us to assess the performance and effectiveness of the proposed framework in detecting violence in non-surveillance settings.
The RLVS dataset consists of 2000 videos, with 1000 labeled as violent and 1000 labeled as non-violent. The videos were collected from various categories and include a high variety of gender, race, and age. This makes the dataset more diverse and realistic, which is important for developing and evaluating violence detection models that can perform well in real-world scenarios.
Table 4 provides a comparative summary of the experiments conducted on the UCF-Crime dataset. It highlights the variances in training times and accuracy percentages across different model configurations, contributing to a comprehensive understanding of the models' performance under different settings.

1) RESULT ON NON-SURVEILLANCE DATASETS
The results of the two proposed models with the two proposed keyframe selection approaches for violence detection on the Hockey Fight, Violent Flow, and RLVS datasets are provided in table 5. The table contains the performance measures of each model, including True Positive (TP), False Positive (FP), False Negative (FN), True Negative (TN), Accuracy, Precision, Recall, F-Measure, and Matthews Correlation Coefficient (MCC). All four models, EvoKeyNet+DeepKeyFrm, EvoKeyNet+AreaDiffKey, KFCRNet+DeepKeyFrm, and KFCRNet+AreaDiffKey, were trained and validated using the benchmark non-surveillance datasets. The datasets consist of two classes, violence and non-violence, and the models were evaluated on their ability to accurately classify the video clips. The results show that all four models achieve high accuracy for violence detection, with values ranging from 98% to 99%, indicating a high level of performance in accurately identifying violence in the video clips. Additionally, the MCC values are all above 0.90, indicating a strong correlation between predicted and actual classifications.
Figure 6 offers a comprehensive analysis of the performance of the proposed deep learning architectures, EvoKeyNet and KFCRNet, along with the two keyframe selection approaches, DeepKeyFrm and AreaDiffKey. The figure explores the mean differences observed between all possible pairwise combinations of these architectures and keyframe selection methods across three distinct datasets: Hockey Fight, Violent Flow, and RLVS. The Tukey method, a robust statistical approach, is employed to perform these multiple comparisons, ensuring a rigorous evaluation of the proposed approaches. The y-axis of the figure represents the mean difference, indicating the degree of variation in performance observed between different combinations of deep learning architectures and keyframe selection methods. The x-axis displays the mean comparison values, which indicate the overall performance level across the different combinations. The four vertical bars for each dataset represent the mean differences for the four combinations, namely EvoKeyNet with DeepKeyFrm, EvoKeyNet with AreaDiffKey, KFCRNet with DeepKeyFrm, and KFCRNet with AreaDiffKey.
Table 6 showcases the results of the model comparison using ANOVA and Tukey multiple comparisons for the proposed deep learning architectures. The ANOVA analysis provides an overall assessment of the models' performances across the datasets. The non-significant results of the ANOVA and Tukey multiple comparisons suggest that, within the scope of this study and the chosen datasets, the proposed deep learning architectures (EvoKeyNet and KFCRNet) perform similarly when combined with either DeepKeyFrm or AreaDiffKey as the keyframe selection approach. These findings indicate that the models' performance may not differ significantly based on the keyframe selection method utilized.
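Such an analysis can be reproduced with SciPy and statsmodels as sketched below; the score values are placeholders for illustration, not the paper's measurements.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# accuracy per model combination (placeholder values)
scores = {
    "EvoKeyNet+DeepKeyFrm":  [0.989, 0.992, 0.990],
    "EvoKeyNet+AreaDiffKey": [0.988, 0.991, 0.989],
    "KFCRNet+DeepKeyFrm":    [0.987, 0.990, 0.988],
    "KFCRNet+AreaDiffKey":   [0.986, 0.989, 0.987],
}

# one-way ANOVA across the four model combinations
f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA: F={f_stat:.3f}, p={p_value:.3f}")

# Tukey HSD for all pairwise comparisons
values = np.concatenate(list(scores.values()))
labels = np.repeat(list(scores.keys()), [len(v) for v in scores.values()])
print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```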
Table 7 illustrates a comparison of the proposed violence detection schemes using the AUC metric. Based on the AUC scores achieved for violence detection, the analysis demonstrates the models' robustness in identifying violent scenes in videos. Moreover, the inclusion of keyframes and evolutionary search-based techniques significantly enhances detection performance.
Figure 7 displays three subplots, each showing a bar chart for the different model combinations trained on a distinct dataset: Hockey Fight, Violent Flow, and RLVS. In figure 7(a), the bars represent the performance of the four model combinations, EvoKeyNet+DeepKeyFrm, EvoKeyNet+AreaDiffKey, KFCRNet+DeepKeyFrm, and KFCRNet+AreaDiffKey, all trained on the Hockey Fight dataset. The high AUC values of approximately 0.99 for all model combinations in this subplot suggest that they exhibit excellent discrimination ability in distinguishing between positive and negative instances in the Hockey Fight dataset.
The bars in figure 7(b) represent the performance of the same four model combinations, EvoKeyNet+DeepKeyFrm, EvoKeyNet+AreaDiffKey, KFCRNet+DeepKeyFrm, and KFCRNet+AreaDiffKey, trained on the Violent Flow dataset. The high AUC values of approximately 0.99 for all models indicate that they maintain their excellent discriminatory power on this dataset as well. In figure 7(c), the bars represent the performance of the same four models trained on the RLVS dataset. The AUC values remain high, approximately 0.99, reaffirming the models' consistent discriminatory power on the RLVS dataset.

C. SURVEILLANCE DATASET
The UCF-Crime Dataset is a comprehensive and widely used video dataset designed for the evaluation and benchmarking of video-based action recognition and anomaly detection algorithms. It consists of 13 distinct classes representing various criminal activities and normal behaviors. Each class encapsulates specific actions that are frequently encountered in real-world crime scenarios. The dataset includes a wide range of criminal activities, such as burglary, assault, robbery, vandalism, and gunfire, among others. It also covers non-criminal activities, serving as a balanced reference for normal behavior recognition. The ShanghaiTech Dataset is a widely used benchmark dataset for crowd counting and crowd density estimation tasks in computer vision. The dataset is particularly valuable for researchers and practitioners working in the fields of computer vision, machine learning, and crowd analysis. The dataset images are collected from various public spaces, such as streets, squares, and public events, in Shanghai, China.
Table 10 presents the performance of the proposed EvoKeyNet and KFCRNet for violence detection using the UCF-Crime and ShanghaiTech datasets. The models exhibit robust performances, with EvoKeyNet+DeepKeyFrm being a prominent performer across various evaluation metrics. Figure 8 displays multiple comparisons between the proposed architectures and keyframe selection approaches on the UCF-Crime and ShanghaiTech datasets using Tukey's method. The x-axis of figure 8 represents the mean comparison values, indicating the average performance measures for each model combination, while the y-axis displays the mean differences, which indicate the variations in performance metrics among the different model combinations. The four vertical bars represent EvoKeyNet+DeepKeyFrm, EvoKeyNet+AreaDiffKey, KFCRNet+DeepKeyFrm, and KFCRNet+AreaDiffKey for each dataset, allowing a direct comparison of their respective performances. Table 9 displays the mean differences, adjusted p-values (p-adj), lower and upper confidence intervals, and a ''Reject'' column indicating whether the null hypothesis is rejected for each pairwise comparison of model combinations.
Further details regarding the AUC scores of the proposed models can be found in Figure 9, which consists of two subplots, each displaying a bar chart for the model combinations trained on two distinct datasets: UCF-Crime and ShanghaiTech. In figure 9(a), the bars represent the performance of the four proposed model combinations trained on the UCF-Crime dataset. EvoKeyNet+DeepKeyFrm and EvoKeyNet+AreaDiffKey achieve AUC values of 0.91 and 0.90, respectively, indicating their strong discriminatory power. KFCRNet+DeepKeyFrm and KFCRNet+AreaDiffKey achieve slightly lower AUC values of 0.88 and 0.84, respectively, implying relatively less discriminatory ability on this dataset. In figure 9(b), the bars represent the performance of the same four model combinations trained on the ShanghaiTech dataset. The high AUC values of approximately 0.99 for EvoKeyNet+DeepKeyFrm, EvoKeyNet+AreaDiffKey, and KFCRNet+DeepKeyFrm illustrate their exceptional ability to distinguish between positive and negative instances in the ShanghaiTech dataset. However, KFCRNet+AreaDiffKey achieves a comparatively lower AUC value of 0.88, indicating relatively weaker discriminatory power on this dataset. The high AUC values of the other model combinations highlight their robustness in handling the complexities and variations present in the ShanghaiTech dataset.

D. QUANTITATIVE RESULT
1) COMPARISON WITH STATE-OF-THE-ART TECHNIQUES
As shown in Tables 8 and 9, exhaustive experiments and comparisons with state-of-the-art (SOTA) methods were conducted to rigorously evaluate the performance of our proposed violence detection frameworks. The results of these comparisons indicate that our proposed framework outperforms recently developed techniques on challenging anomaly datasets. In addition, we evaluated the effectiveness of the framework on the ShanghaiTech dataset. We compared it to various existing approaches, such as normal-frame-prediction-based unsupervised anomaly detection techniques [5], [16], feature patterns [38], [42], and skeletal patterns [24], [43]. It is essential to note that the unsupervised techniques do not use abnormal videos during training, resulting in inferior performance compared to supervised techniques. Our proposed framework outperforms both supervised and unsupervised methods in our experiments. Figure 9 depicts the results, illustrating the prominent position of our framework's bars, which achieved the highest AUC score of 91%. This represents an increase in precision of 8.11 percent compared to [3], [27], and [30].
In addition, we evaluated our proposed framework on the ShanghaiTech dataset and compared its performance to recent SOTA approaches [29], [39], [47]. Our framework's performance was exceptional, with 93.1% accuracy. This corresponds to a significant increase in accuracy of 21.1%, 19.1%, and 17.2% relative to [16], [24], [27], [42], [43], [44], [45], and [46], as shown in Table 11. In addition, compared to the most recent methods [27], [30], [43], [44], our dual-stream CNN framework increased accuracy by 8.66% on the UCF-Crime dataset and 8.93% on the ShanghaiTech dataset. We also evaluated the proposed dual-stream CNN framework on two non-surveillance datasets (Hockey Fight and Violent Flow) and compared it to several existing techniques, as shown in Table 12. The experimental results demonstrated the efficacy of our surveillance framework both indoors and outdoors. Our framework outperformed contemporary methods on the Hockey Fight dataset, obtaining a notable 1.2% increase in accuracy. Similarly, evaluation on the Violent Flow dataset yielded promising results, outperforming the top models among recent approaches with a 0.8% increase in precision. Our proposed framework's experimental results on large-scale anomaly detection datasets demonstrate its efficacy in outperforming existing techniques by substantial margins.

E. QUALITATIVE ANALYSIS
In this section, we provide a comprehensive qualitative analysis of our proposed classification models, EvoKeyNet and KFCRNet, along with the two keyframe selection approaches, DeepKeyFrm and AreaDiffKey. The evaluation is performed on both surveillance and non-surveillance datasets to assess the models' efficacy in crowd analysis and video-based action recognition tasks. EvoKeyNet and KFCRNet are novel deep learning architectures designed for crowd analysis and action recognition tasks. EvoKeyNet leverages evolutionary optimization techniques to enhance feature extraction and classification, while KFCRNet utilizes keyframe-based feature fusion for improved spatial-temporal analysis. DeepKeyFrm and AreaDiffKey are two proposed keyframe selection approaches aimed at identifying informative and discriminative frames to optimize model performance.
The DeepKeyFrm approach effectively selects keyframes that focus on critical crowd movements, leading to improved model interpretability and robustness. KFCRNet exhibits strong crowd density estimation capabilities, enabling accurate crowd counting in high-density scenes. The AreaDiffKey approach enhances the spatial representation of keyframes, facilitating more precise localization and analysis of crowd regions. The combination of KFCRNet with AreaDiffKey results in enhanced crowd density estimation accuracy.
On the non-surveillance datasets, the proposed models demonstrate excellent performance in action recognition tasks. EvoKeyNet achieves state-of-the-art results in recognizing complex human actions, surpassing traditional deep learning architectures. Its ability to capture fine-grained spatial features and long-range temporal dependencies makes it well suited for action recognition in dynamic scenarios. KFCRNet excels in action localization tasks, accurately identifying the temporal extent of actions within video sequences. The AreaDiffKey keyframe selection approach successfully identifies salient action frames, contributing to precise localization and efficient processing.
The qualitative analysis highlights several key advantages of the proposed models and keyframe selection approaches. EvoKeyNet and KFCRNet demonstrate versatility and robustness across surveillance and non-surveillance datasets, indicating their potential for real-world applications. The evolutionary optimization in EvoKeyNet and the keyframe fusion in KFCRNet provide unique advantages in feature representation and spatial-temporal analysis. DeepKeyFrm and AreaDiffKey exhibit complementary strengths, with DeepKeyFrm offering interpretability and AreaDiffKey providing enhanced spatial localization. However, the keyframe selection approaches may require careful parameter tuning and may encounter challenges in handling extremely dynamic scenes.

V. CONCLUSION
The proposed keyframe selection approaches and classification models offer significant advantages. Firstly, we focused on improving the efficiency of our approach by reducing model training time. This was achieved through a two-fold strategy: minimizing computational costs through the selection of optimal keyframes, and leveraging an evolutionary search technique to identify the most relevant and discriminative feature attributes for violence detection. These methods allowed us to extract the most informative features, leading to faster model training and improved accuracy. EvoKeyNet and KFCRNet are the two proposed classification models; they efficiently extract essential features using two CNNs in parallel to ensure a more comprehensive understanding of video content, and they represent a promising advancement in violence detection technology, making them vital tools for improving public safety and security in dynamic environments. The proposed framework is validated on five diverse datasets: Hockey Fight, Violent Flow, Real Life Violence Situations, UCF-Crime, and ShanghaiTech. This research makes important contributions to the field of violence detection, offering efficient keyframe selection methods and classification models that outperform existing methods in terms of accuracy and computational efficiency. In future work, we aim to expand the scope by including additional violent classes and to explore updated and robust neural network architectures to improve feature extraction and violence detection performance. The proposed model can also be integrated and deployed in real-time surveillance systems; assessing its performance in large-scale deployments would further establish its effectiveness for practical applications.

FIGURE 3. Proposed EvoKeyNet model for violence detection in video.

FIGURE 5. Confusion matrix for violence detection system performance analysis.

FIGURE 6. Multiple comparisons using Tukey method for violence detection models trained on hockey fight, violent flow, and RLVS databases.

TABLE 5. Detailed performance comparison of proposed violence detection models trained on hockey fight dataset.

FIGURE 7. Evaluation of classification model performance on hockey fight, violent flow, and RLVS datasets using AUC score.

FIGURE 8. Multiple comparisons using Tukey method for violence detection models trained on UCF-crime and ShanghaiTech datasets.

TABLE 9. Tukey's multiple comparisons results for proposed deep learning architectures and keyframe selection approaches on UCF-crime and ShanghaiTech datasets.

FIGURE 9. Evaluation of classification model performance on UCF-Crime and ShanghaiTech dataset using the AUC score.

TABLE 11. Comparison of proposed violence detection performance with state-of-the-art models trained on the UCF-Crime and ShanghaiTech datasets.

TABLE 12. Comparison of proposed violence detection performance with state-of-the-art models trained on the hockey fight, violent flow, and RLVS datasets.

TABLE 1. Summary of datasets used for model training and validation, original frames, and key frames.

TABLE 2. Summary of datasets, number of key frames, and training-validation split.

TABLE 3. Hyperparameters of the CNN, RNN, and ensemble classification models for violence frame classification.

TABLE 4. Comparative analysis of model performance and efficiency on UCF-crime dataset.

TABLE 7. Proposed violence detection models performance comparison using the AUC metric.

TABLE 8. Performance evaluation of deep learning models on UCF-crime and ShanghaiTech datasets.