VD-Net: An Edge Vision-Based Surveillance System for Violence Detection

The automation of surveillance systems, driven by the rapid development of computer vision technology, has significantly enhanced the analysis of surveillance videos, particularly in human activity recognition, including behavior analysis and violence detection, thereby bolstering public and industrial security. Despite these advancements, detecting and analyzing violent actions remains challenging, especially for real-time surveillance systems with limited computing power. We propose an artificial intelligence-based framework called VD-Net (Violence Detection Network), enabled by the Intelligent Internet-of-Things (IIoT), to detect violent behavior in public and private spaces. The model utilizes lightweight special task temporal convolutional network (ST-TCN) blocks and several bottleneck layers to focus on salient features in the input sequence. The learned features are then passed to a classifier to discriminate between violent and nonviolent actions. Additionally, our system triggers an alert if violence is detected, which is then communicated to the relevant departments. We evaluated the robustness of our system on surveillance and non-surveillance datasets and achieved a 1-4 % improvement over State-of-The-Art (SoTA) accuracy.


I. INTRODUCTION
Technologies exist to automatically detect and flag violent behavior in various digital media formats, such as images, videos, audio recordings, and text [1]. The growing volume of digital content and the need to moderate user-generated content have increased interest in this technology [2]. Violence detection technology aims to promptly identify and eliminate violent content from online platforms, protecting users from exposure to potentially harmful material. Intelligent surveillance systems analyze video patterns to detect violence in heavily populated areas for public safety in smart cities [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Shorif Uddin .
The Intelligent Internet of Things (IIoT) can effectively detect and prevent violence to promote safety. According to a report [3], 48% of deaths in 2015 were attributed to interpersonal violence, and 25% of them involved a sharp instrument, such as a knife or razor. Similarly, there were more than a million violent incidents in the United States in 2019, including fights, aggressive behavior, and mass shootings. Hence, surveillance cameras are now widely used for automatic monitoring in both public and private sectors, but detecting abnormal activities with high accuracy remains a challenge [4]. Replacing labor-intensive and tedious manual surveillance with automatic systems would further increase public and private security and safety. These technologies can help law enforcement agencies detect suspicious actions and differentiate between normal and abnormal behaviors.
Researchers from different fields are carefully studying surveillance videos for new methods to recognize different actions and activities [5]. Surveillance videos can be analyzed using activity recognition, summarization, and individual identification to identify specific activities [6]. However, violence detection is still missing from commercial and industrial monitoring systems due to the use of handcrafted and traditional neural network features. Implementing violence detection techniques in surveillance for real-time applications using the IIoT has remained a challenge.
In this manner, IIoT-based networks linked to vision sensors could aid law enforcement agencies in preventing crime in smart cities by analyzing input for crucial information such as individuals or suspicious objects. In such a system, IoT devices share data upon discovering violent events so that they can be examined and safety concerns addressed. Implementing this arrangement could allow automatic monitoring of activity captured by cameras at multiple locations to manage unusual events. Hence, various algorithms have been developed for violence detection in monitored and unmonitored environments using traditional and deep learning-based methods [7].
However, these SoTA approaches face challenges such as high computational requirements, restricted viewpoints, varying lighting conditions, complex crowd scenarios, and changes in intensity. Further limitations in this domain are false positives and negatives, limited detection scope, context dependency, and potential biases and unfairness, especially when models are trained on imbalanced datasets or datasets that contain stereotypes and prejudiced information. For these reasons, we develop a system that avoids perpetuating harmful stereotypes and biases and improves safety and security in public and private spaces. By providing real-time detection, enhanced situational awareness, improved accuracy, scalability, and integration with other systems, an IoT-based system can help prevent violent incidents and reduce the risk of harm to individuals and communities. The proposed VD-Net detects violence in public, private, and industrial settings by combining practical edge computing with cloud servers through three connected surveillance cameras. Considering the challenges and limitations mentioned above, this system will be valuable. The main contributions are as follows:
• Traditional monitoring systems often suffer wire failures during installation, resulting in slow response times and increased processing requirements for authorities.
To tackle this challenge, we propose an AI-driven framework for violence detection that leverages the powerful capabilities of the Internet-of-Things (IoT) to connect devices for the smooth exchange of information. Moreover, we develop a cloud-based system that enables comprehensive investigation of violent incidents in public and private settings with fast processing.
• To process surveillance data, we need an intelligent edge-based mechanism to extract helpful information during analysis. To tackle this issue, we investigate the use of IoT and introduce a lightweight system that can be implemented on embedded devices. Our system recognizes critical violence frames to process and transmit over the network for detailed investigation in the cloud, instead of all frames. This approach streamlines the process and improves its overall intelligence.
• Traditional IoT methods often use manual features or clustering algorithms, which may not capture long-term dependencies, reducing accuracy. To overcome this issue, we propose a bottleneck layer in VD-Net to encode spatial and temporal correlations and analyze local motion between frames. Additionally, the cloud server acquires feature vectors and sends them to an attention unit to identify salient cues. This is the first time bottleneck layers are utilized for violence detection, significantly improving accuracy with reduced latency for real-time applications.
• To the best of our knowledge, this article represents the first use of a bottleneck layer to learn salient cues of violent activity in an IIoT network. The module extracts information from the input layer and determines whether a scene is violent or nonviolent in a public/private environment. We evaluated the proposed VD-Net on publicly available datasets, demonstrating that it outperforms SoTA approaches. Additionally, the system can be used for indoor and outdoor surveillance in IIoT-based systems.
The article is organized as follows. Section II presents a comprehensive overview of the relevant literature. Section III elucidates the proposed system in detail. Section IV delineates the empirical findings and their analysis, supplemented by a comparative study. Finally, Section V encapsulates the conclusions derived from this research and posits potential avenues for future directions.

II. RELATED WORK
Overall, violence detection is a complex and challenging research area. Researchers are constantly exploring new techniques to advance surveillance systems for public safety. Ethical considerations surrounding privacy and potential biases in violence detection (VD) algorithms must also be considered when developing and deploying such systems. Below, we cover the latest advances in this field, including conventional and deep learning-based techniques that have drawn much interest within the research community.

A. MACHINE LEARNING-BASED VD APPROACHES
This section provides an overview of key research in machine learning for violence detection. One of the early works in this area [8] developed a machine learning-based approach to detect violent movie scenes. The authors used a set of visual and audio features to classify scenes as violent or nonviolent and achieved 85 % accuracy on the movie fight dataset. Similarly, in [9], the authors presented a machine learning-based approach to detect violent events in surveillance videos using handcrafted features, such as motion and texture, to classify the violence. Furthermore, the authors in [10] introduced violence detection for social media and used a set of visual and audio features to classify the actions accurately. Likewise, [11] and [12] endorsed new machine learning-based approaches to detect violent movie events.
Similarly, a conventional method was proposed in [13], utilizing motion cues derived from optical flow on RGB frames and incorporating appearance as low-level features. By eliminating redundant information, the system developed a bag of words (BoW). Likewise, [14] developed a system to identify violence in crowded settings based on background motion correction, appearance, and long-term dependencies. To demonstrate how violent events relate to scene-scale spatial events, they used late fusion and BoW. Another approach [15] developed a new local descriptor to manage and reduce the coefficient reconstruction error, presenting a sparse-based model for classification. Furthermore, [16] incorporated pixel-based analysis and object trajectory results to monitor object speed, direction, and smaller movements. However, the practice of these methods grew tiresome due to handcrafted feature engineering. The subsequent section provides an overview of more advanced techniques.

B. DEEP LEARNING-BASED VD APPROACHES
Recently, deep learning techniques have become more popular for detecting violence. An early work in this area [17] presented a method based on deep learning to identify instances of violent behavior in surveillance videos. The authors used a two-stream convolutional neural network (CNN) architecture to extract spatiotemporal features from the videos and achieved 89 % classification accuracy on the surveillance dataset. Similarly, in [18], the authors developed a deep learning-based model for violence detection in social media, extracting visual features and temporal dependencies with a long short-term memory (LSTM) network, and achieved 94.9 % on the same dataset. Moreover, the authors in [19] and [20] introduced a deep learning approach for detecting violent events in urban surveillance videos. Furthermore, [21] and [22] presented new methods to detect abnormal events in movie datasets.
Recent studies show that computer vision challenges are increasingly being addressed with deep learning. However, there are also concerns that such technology is being used for violence. For instance, a method in [7] represents a frame in a sequence using critical information provided by Hough features. Liu et al. [23] utilized a 3D CNN to identify violent scenes in videos, applying sampling as a pre-processing step. The researchers developed a deep learning-based model for detecting violent scenes utilizing transfer learning techniques, while [24] introduced a Spark framework for detecting violent scenes with a bidirectional LSTM. Similarly, [25] introduced the idea of aggregating ensembles, and [26] employed a combination of a 3-D CNN and a support vector machine (SVM) to identify violent actions in videos. However, a comprehensive literature analysis indicates that many existing methods suffer from several limitations and challenges. These include inadequate integration with state-of-the-art IoT devices, heavy reliance on end-to-end pre-trained models, failure to incorporate cloud-based concepts, and the use of handcrafted features.

C. ATTENTION-BASED VD APPROACHES
There has been growing interest in using attention and transformer techniques for violence detection in various contexts, such as social media and surveillance videos. A recent work [27] presented a model based on the attention mechanism to detect violent movie scenes. The authors used a two-layered LSTM network with an attention mechanism to identify violent scenes in the movie fight dataset. Similarly, in [28], the authors introduced a transformer-based model for violence detection in social media, using a multi-headed self-attention mechanism to capture the temporal relationships between frames and a global temporal encoding layer to aggregate the feature representations.
Another study [29] developed a transformer-based model for detecting violent events in online streaming and used a hierarchical attention mechanism to capture the spatial and temporal relationships of in-game violent scenes on a gaming dataset.
Similarly, [30] introduced a model that uses a multi-headed self-attention mechanism to detect violent content in surveillance videos. Furthermore, [31] introduced a model based on the transformer architecture with a dual-branch structure to detect violent events. They processed the frame-level and shot-level features and achieved high accuracy on surveillance datasets. In conclusion, attention and transformer techniques have shown promising results for violence detection in various contexts, including social media, surveillance videos, and online gaming platforms. In this regard, the proposed model uses attention mechanisms to capture the temporal and spatial relationships between sequences and has achieved high accuracy on several datasets.

III. METHODOLOGY
This section thoroughly explains the proposed violence detection network by examining each module, as illustrated in Figure 1. The authors in [32] developed a violence detection system for industrial settings and covered indoor violence through an advanced IIoT system. Their method has some limitations, such as coverage of indoor activities only, computational complexity, and latency for real-time applications. Motivated by their work, we propose an advanced framework. The first stage deals with data acquisition using vision sensors with limited resources, while the second stage involves a screening process to collect critical information, such as identifying people or suspicious activities in the scene. Suppose the subjects and actions are identified in the second stage as violent or suspicious. In this case, an alert is generated, and the violent frames are sent to the next step for a thorough investigation before the final violence detection phase.
Furthermore, the input dataset D is separated into training data S_tr, testing data S_ts, and validation data S_vl, as shown in the pseudo-code of Algorithm 1. After validation, the trained model D_mt is obtained. A sequence Se_tr of frames from S_tr is fed into the ST-TCN to produce a feature map F_C-m that is forward-propagated into the attention mechanisms for the final classification, yielding the output of the trained model D_mt. The algorithm also reveals the pseudo-code of our system, which operates in real time for violence detection. The input frames are extracted from various sources, such as surveillance cameras and unmanned aerial vehicles (UAVs). They are initially processed as a sequence S_e to screen for significant violence F_vio. If the frames F_vio contain suspicious information, actions, or activities, they are supplied to the system, which conducts a detailed study of the final output to determine whether it is violent or not. The proposed IIoT system transmits information and receives F_vio concerning violence involving relevant humans F_h. Algorithm 1 presents the pseudo-code of the designed system.
We propose the ST-TCN blocks as a feature extractor as an alternative to traditional RNNs, which use feedback loops to propagate information through time. Our TCN instead uses temporal convolutions to capture long-term dependencies in the input sequence, as illustrated in Fig. 2(a). The input sequence is fed into a stack of convolution layers, where each layer applies a filter to a sequence of input values, with the other attributes determining the size of the receptive field. By stacking multiple convolution layers with increasing receptive fields, the TCN can capture dependencies over increasingly long periods. We designed three ST-TCN blocks in a hierarchical manner to learn features efficiently and make them parallelizable for long-sequence processing. Each block is connected hierarchically with the input to capture deeper and hidden features from long sequences.
Each ST-TCN block takes as input a sequence of data points, a tensor of shape (sequence length, input dimension), followed by convolutional layers. The convolution filters have a fixed size and are convolved over the input sequence with a specified receptive field. After each layer, an activation function is applied to the layer's output, with residual connections to improve the flow of gradients during training and alleviate the vanishing gradient problem. Our network includes residual connections that bypass some of the convolutional layers, which allows the network to learn the input sequence's short- and long-term dependencies. Similarly, downsampling and upsampling are used to reduce the dimensionality of the output and extract the key features from the sequence. Through this, we capture long-term dependencies in the input sequence using a stack of layers, enhancing the system's efficiency and scalability for real-time use.
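To make this concrete, the following minimal NumPy sketch (not the authors' implementation; the filter shapes, two-convolution layout, and ReLU activations are illustrative assumptions) shows a residual block built from causal dilated 1-D convolutions, the core operation a TCN stacks to grow its receptive field:

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D convolution.
    x: (T, C_in) input sequence; w: (K, C_in, C_out) filter bank."""
    K, _, C_out = w.shape
    pad = (K - 1) * dilation                  # left-pad so each output sees only the past
    xp = np.pad(x, ((pad, 0), (0, 0)))
    T = x.shape[0]
    out = np.zeros((T, C_out))
    for t in range(T):
        for k in range(K):                    # tap at temporal offset k * dilation
            out[t] += xp[t + pad - k * dilation] @ w[K - 1 - k]
    return out

def tcn_block(x, w1, w2, dilation):
    """One residual TCN block: two dilated causal convs with ReLU,
    plus a skip connection from the block input."""
    h = np.maximum(causal_dilated_conv1d(x, w1, dilation), 0.0)
    h = causal_dilated_conv1d(h, w2, dilation)
    return np.maximum(h + x, 0.0)             # residual connection
```

With kernel size K and dilations 1, 2, 4, …, the receptive field grows geometrically as layers stack, which is how such blocks cover long sequences without recurrence.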

B. BOTTLENECK TRANSFORMER NETWORK (BTNET)
The bottleneck transformer architecture was introduced in [33] as a newer version of the standard transformer. In a standard transformer, the self-attention mechanism is applied to all the input tokens, which can be computationally expensive for large input sequences. The visual workflow of the modified BTNet and the baseline is shown in Fig. 3. In a bottleneck transformer, a subset of input tokens is randomly selected and processed by a smaller number of attention layers before being combined with the remaining tokens and passed to the next layer. This reduces the number of attention layers needed, resulting in a more computationally efficient architecture. We used the same strategy, fine-tuned the traditional bottleneck transformer network (BTNet) for violence detection tasks, and trained it like a standard transformer using backpropagation and gradient descent optimization. We encoded the content position according to the height and width of the input tensor and connected it in parallel with the original content to focus on salient cues. The proposed BTNet performed better than the standard version while requiring fewer computational resources.
In the learning strategy of the proposed BTNet, the input sequence is split into two parts: a "core" and a "context" sequence. The context sequence is processed separately by a set of attention layers. The core and context sequences are then combined and fed to the forward layer. The combination is performed by simply concatenating the two sequences and passing them through a linear projection layer. The process can be repeated across multiple layers while reducing the number of attention layers applied to the core sequence. Due to this strategy, the proposed BTNet is more computationally efficient without losing performance. Additionally, we employed different forms of pattern dropout, which helps to regularize the network and prevent overfitting. This involves randomly dropping out entire patterns of attention weights during training rather than individual weights. The strategy encourages the network to learn more robust representations and improves its generalization to new data.
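The core/context flow described above can be sketched in NumPy as follows (a minimal illustration of the data layout only; the split rule, single-head attention, and projection shapes are assumptions, not the trained BTNet):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Standard scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def btnet_layer(x, n_core, w_proj):
    """Split the sequence into core/context tokens, attend only over the
    context, then concatenate and apply a linear projection layer."""
    core, context = x[:n_core], x[n_core:]
    context = attention(context, context, context)   # attention on context tokens only
    merged = np.concatenate([core, context], axis=0) # recombine the two streams
    return merged @ w_proj                           # shared linear projection
```

Because attention runs only over the context tokens, the quadratic attention cost applies to a shorter sequence, which is the source of the efficiency gain described above.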

C. BOTTLENECK ATTENTION
The bottleneck attention mechanism [34] is a technique used in deep learning to improve the efficiency of attention-based models. Attention mechanisms selectively focus on essential parts of the input data when dealing with long sequences. Bottleneck attention reduces the input dimensionality, cutting costs while preserving crucial information. Our proposed bottleneck includes a sequential and a spatial layer, with an attention mechanism operating on the reduced-dimensional representation using standard dot-product attention. Our module calculates the final feature map by

M_s(F) = BN(FC(AP(F))) = BN(W_1(W_0 · AP(F))),

where AP denotes average pooling, FC a fully connected layer, and BN batch normalization. The notation M_s(F) represents the spatial attention features, M_c(F) indicates the temporal attention cues, and M(F) denotes the final attention weights. The procedure outlined above demonstrates the direct correlation between gradients and attention values: higher attention values require higher gradient values, and vice versa. The symbol θ denotes the parameters employed in feature extraction. As a result, we use a channel-wise attention process to observe, detect, and selectively concentrate on salient cues. Instead of choosing an object region, as channel attention does, spatial attention removes clutter and chooses important spatial places via dense layers. Because this information is complementary, combining the two attention mechanisms is essential for violence-related tasks.
Our implementation of attention maps has resulted in a highly effective approach, allowing us to concentrate on each branch's specific goals in the input tensor with precision and accuracy. Our attention module consists of an input sequence and a layer that reduces its dimensionality, such as a linear projection. This layer is often called the "bottleneck layer" because it restricts the amount of information passed to the attention mechanism. Once the input sequence is transformed into a lower-dimensional representation, it is fed into an attention mechanism that evaluates the importance of each element in the sequence by assigning it a weight proportional to its relative significance. The resulting attention weights are then used to weight the input sequence, giving more weight to important elements and less to unimportant ones. This produces a weighted sequence that emphasizes the most relevant information, which is then fed into the rest of the model. Fig. 2(b) illustrates the visual representation of the proposed bottleneck attention module.
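As an illustration of the pooling-and-reduction idea behind M_s, the sketch below implements a channel-gating branch in NumPy. It is a hedged approximation: a sigmoid gate stands in for the paper's batch normalization (BN statistics require a batch), and the reduction ratio r implied by the weight shapes is an assumed hyperparameter:

```python
import numpy as np

def bottleneck_channel_attention(F, W0, W1):
    """Channel branch of a bottleneck attention module (illustrative sketch):
    global average pooling -> two FC layers forming a reduction bottleneck
    -> a gate that reweights the input channels.
    F: (H, W, C) feature map; W0: (C, C//r); W1: (C//r, C)."""
    ap = F.mean(axis=(0, 1))                 # AP(F): global average pool, shape (C,)
    z = np.maximum(ap @ W0, 0.0)             # W0: reduce to C//r dimensions, ReLU
    m = 1.0 / (1.0 + np.exp(-(z @ W1)))      # W1: expand back; sigmoid gate in (0, 1)
    return F * m                             # broadcast channel-wise reweighting
```

The bottleneck (C → C//r → C) is what keeps the attention branch cheap: its parameter count scales with C²/r rather than C².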

IV. EXPERIMENTAL SETUP AND RESULTS
This section presents the experimental evaluations, comparisons, and discussion of the proposed VD-Net system. We also summarize the dataset information, hardware configuration, and qualitative and quantitative results. The proposed VD-Net is evaluated via various metrics, including the receiver operating characteristic (ROC) curve, F-measure, precision/recall, and the confusion matrix. More detailed information on these evaluations is presented in subsequent sections.

A. DATASET
The rapid advancement of technology has led various sectors to actively engage in violence detection, aiming to address data challenges in surveillance for safety and security purposes. One key challenge is the need to limit surveillance data to specific indoor and outdoor activities. To overcome these limitations, there is growing interest in leveraging different datasets to generalize the model's capability. In this study, four datasets, including surveillance fight [35], violent flow [36], and hockey and movie fight [37], are employed to develop the violence detection model. Each dataset is divided into violent and nonviolent classes and split into training, validation, and testing sets following standard procedures. The training set comprises 70 % of the data, while the validation and testing sets account for 20 % and 10 %, respectively.
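A simple helper like the following (illustrative only; the authors' exact split tooling is not specified) reproduces the 70:20:10 protocol described above:

```python
import random

def split_dataset(samples, ratios=(0.70, 0.20, 0.10), seed=42):
    """Shuffle and split samples into train/validation/test sets
    with a 70:20:10 ratio, mirroring the protocol in the text."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)     # fixed seed for a reproducible split
    n = len(samples)
    n_tr = int(ratios[0] * n)
    n_vl = int(ratios[1] * n)
    return samples[:n_tr], samples[n_tr:n_tr + n_vl], samples[n_tr + n_vl:]
```

For the 1000-clip hockey dataset, for example, this yields 700 training, 200 validation, and 100 testing clips.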
The surveillance fight dataset [35] includes videos of violence captured by surveillance cameras in various locations and is mostly used to develop and evaluate algorithms for detecting violent behavior in real time. Likewise, violent flow [36] includes videos shot in factories, offices, and other settings, indoors and outdoors, during the day and at night; this dataset consists of unaltered surveillance videos. Moreover, [37] released the hockey and movie fight datasets; the hockey clips were collected from National Hockey League games. The movie dataset comprises 200 video clips featuring fight scenes from action movies, while the non-fight videos were sourced from public action datasets. The hockey dataset includes 1000 clips, 500 each for fight and non-fight. In contrast to the hockey dataset, where all sequences were recorded in the same format and size, the movie dataset used various resolutions and formats but was more homogeneous in content. Fig. 4 visually presents a few samples from each dataset, while Table 1 provides statistical details of each dataset.

B. SYSTEM CONFIGURATION
The proposed architecture uses a Jetson device as an edge server to gather data streams from devices connected to an IoT network.For inference tasks, the Jetson AGX Orin 64GB module is specifically used on the edge device for screening purposes.It incorporates the advanced NVIDIA Orin system-on-chip (SoC), seamlessly integrating multiple ARM cores, cutting-edge GPU architecture, and dedicated AI accelerators.The AGX Orin has a generous 64GB of onboard memory and ample storage capacity for AI models and efficient data processing.Moreover, it offers extensive connectivity options, including Ethernet, PCIe, USB, and MIPI CSI interfaces, ensuring effortless integration with a wide range of sensors and peripherals.
The VD-Net is implemented using the widely used deep learning framework TensorFlow (version 2+), with Adam as the optimizer. The training platform uses GeForce RTX 3080-Ti GPUs. The model is trained with the early stopping technique, a batch size of 16, and other supporting functions that expedite model training and performance. The data were split via the standard procedure, a 70:20:10 percent ratio for training, validation, and testing.
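The early stopping referred to above can be sketched framework-agnostically as a plain Python loop (not the TensorFlow callback actually used; the `patience` value and function names are assumptions for illustration):

```python
def train_with_early_stopping(train_step, val_loss_fn, max_epochs=100, patience=5):
    """Generic early-stopping loop: stop once the validation loss
    has not improved for `patience` consecutive epochs."""
    best, wait, history = float("inf"), 0, []
    for epoch in range(max_epochs):
        train_step(epoch)                  # one pass over the training set
        loss = val_loss_fn(epoch)          # evaluate on the validation split
        history.append(loss)
        if loss < best:
            best, wait = loss, 0           # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:           # no improvement for `patience` epochs
                break
    return best, history
```

In a Keras setup this behaviour corresponds to an early-stopping callback monitoring the validation loss.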

C. ABLATION STUDY
Initially, as an ablation study, we used four publicly available datasets of violent behaviors and evaluated the accuracy of the baseline model, the bottleneck transformer [29], and further evaluated the model after adding temporal convolutional networks (TCNs) [38] (Baseline + TCNs). Taking these as baselines, we propose a lightweight VD-Net model using modified bottleneck transformers (MBaseline) with the special task TCN (MBaseline + ST-TCN). The quantitative experimental evaluation of the proposed VD-Net and the baseline methods is presented in Table 2.
The proposed technique achieved 98.50 % accuracy on hockey fight, 97.00 % on violent flow, 92.50 % on surveillance camera fight, and 99.00 % on the movie fight dataset, surpassing the baseline model. Our method improved the violence recognition rate over the baseline with reduced computational cost, as shown in Table 6. Adding TCNs improved the baseline's accuracy, and replacing them with the ST-TCN blocks improved performance further. The experiments demonstrated that the modified ST-TCN and bottleneck transformer modules significantly enhance the baseline's ability to extract spatiotemporal information. The bottleneck connection made the proposed model more sensitive to the salient information of violent behaviors through cross-channel interaction and increased the model's precision. Furthermore, we conducted an extensive ablation study on various frame sequences to choose an appropriate sequence length for better recognizing violent/nonviolent actions. The hockey fight dataset is used for sequence selection, as illustrated in Table 3.

D. RESULTS AND EVALUATIONS
This section provides a comprehensive evaluation of the proposed VD-Net system based on the testing results obtained from each dataset. We trained and evaluated conventional bottleneck transformers and TCNs on each dataset for comparison with the proposed VD-Net, as shown in Table 2, to highlight the strengths and weaknesses of each method and enable a better understanding of the proposed approach.
The modern industrialized world and smart cities have established surveillance systems that enable activity tracking to combat and monitor violent events. Although most current approaches use non-surveillance datasets, automating this system presents quite a few challenges, as mentioned earlier. To support this technology, we primarily concentrate on the VD-based surveillance setup, assess our method's effectiveness using newly published datasets [35], [36], and compare it with SoTA. The endorsed surveillance camera dataset serves as a baseline for security surveillance in industrial and commercial areas. Furthermore, we evaluated our system on non-surveillance datasets such as hockey and movie fights [37] for more generalization.
We thoroughly review SoTA capabilities and compare them with the proposed VD-Net. We test the proposed system using four datasets; the results are presented in Table 4, with the confusion matrix depicted in Fig. 5. The proposed system achieves high precision in detecting violence in both indoor and outdoor datasets. For the convenience of readers and researchers, we produced the receiver operating characteristic (ROC) curves with accuracy in Fig. 6 to display the proposed model's performance on surveillance and non-surveillance datasets, respectively. It is common for VD techniques to label any ongoing activity as violent in sports where athletes collide or hit one another, e.g., in a hockey fight. As a result, one method of detecting aggression is to watch how players approach one another.

TABLE 4. The thorough assessment outcomes of the proposed VD-Net based on precision, accuracy, F1-score, and recall using public benchmarks.

TABLE 5. A comparative analysis of the proposed and state-of-the-art (SoTA) methods on benchmark datasets in terms of accuracy (%) and their learning strategies.

However, there is a real risk that viewers will mistake a player's hug during a winning celebration for a violent gesture. Our system avoids such mistakes and differentiates violent frames by encoding spatial and temporal information. Overall, the testing outcomes show that our system is convenient and generalizes to indoor and outdoor actions, easily distinguishing violent from nonviolent actions.

E. DISCUSSION AND COMPARISON WITH SOTA
This section presents a detailed discussion and analysis of our system. Our proposed model leverages a hierarchical integration of bottleneck layers and a specialized temporal convolutional network to achieve superior violence recognition/detection results. The comparison is presented in Table 5, highlighting the efficiency of the designed system.
The proposed AI-based method achieves a 97 % recognition rate on violent flow, outperforming most SoTA algorithms. The proposed method has just 0.351 % lower accuracy than ViT large [52] on the movie fight dataset but is higher on hockey fight and violent flow, and the computational complexity of our model is lower than that of ViT large [52], as shown in the subsequent section. Our recognition rate on all four datasets was marginally higher than the pre-trained ResNet50 [39], I3D [40], AR-Net [41], the temporal shift module, and the temporal adaptation encoder [42], [43]. All these pre-trained weights are used and trained on violent datasets, and their results are reported in Table 5 for comparative analysis.
The proposed model produces the best accuracy, as shown in Table 5. Prior articles on violence detection have claimed lightweight models by combining different modules. We used depth-wise separable convolutions and a bottleneck learning strategy to enhance VD-Net's performance. Furthermore, our method hierarchically uses ST-TCN blocks as the primary component for learning spatiotemporal cues, as seen in the main framework. Although the accuracy does not increase significantly in some cases, our model has fewer parameters and requires less computation, so it can easily be implemented on edge devices. Keeping the model's computational complexity low enough for installation on edge devices in an IoT environment is a main objective of this research, as explained in the subsequent section.
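To make the lightweight design concrete: a depth-wise separable convolution factorizes a standard convolution into a per-channel (depthwise) filter followed by a 1x1 pointwise channel mixer, sharply cutting the weight count. The plain-Python sketch below uses hypothetical layer sizes, not VD-Net's actual configuration:

```python
def conv1d_params(c_in, c_out, k):
    """Weight count of a standard 1-D convolution (bias omitted)."""
    return c_in * c_out * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise conv (one k-tap filter per input channel) followed by a
    pointwise 1x1 conv that mixes channels."""
    return c_in * k + c_in * c_out

# Hypothetical block: 128 input channels, 256 output channels, kernel size 3.
standard = conv1d_params(128, 256, 3)                 # 98,304 weights
separable = depthwise_separable_params(128, 256, 3)   # 33,152 weights
reduction = standard / separable                      # roughly 3x fewer parameters
```

The saving grows with kernel size and channel width, which is why the substitution pays off most in the deeper, wider layers of a network.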

1) COMPLEXITY ANALYSIS
We conducted experiments on the computational cost of the proposed VD-Net, which clearly illustrate the significant benefits of our approach, as shown in Table 6. We compared the computational costs of the proposed VD-Net with the baselines in terms of parameters, model size, and FLOPs (floating-point operations). We used a torch model summary to obtain these computational complexity measures without requiring manual calculation. Additionally, TensorBoard was used to record pertinent data, including the total time spent on training and testing across all epochs. Our suggested methodology significantly reduces the number of parameters used for model training compared to SoTA algorithms. In addition, the training time and model size decreased significantly, creating ideal conditions for deployment on IoT-based edge devices.
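As an illustration of how such figures can be tallied, the self-contained sketch below estimates parameters and multiply-accumulate operations for a small stack of 1-D convolutions; the layer shapes are illustrative only and do not correspond to VD-Net's actual configuration:

```python
def conv1d_cost(c_in, c_out, k, seq_len):
    """Parameters and rough multiply-accumulate count of one 1-D conv layer
    ('same' padding, stride 1, bias included): every parameter is applied
    once per output position."""
    params = c_in * c_out * k + c_out
    macs = params * seq_len
    return params, macs

# Illustrative three-layer stack over a 64-frame feature sequence.
layers = [(32, 64, 3), (64, 128, 3), (128, 128, 3)]
total_params = total_macs = 0
for c_in, c_out, k in layers:
    p, m = conv1d_cost(c_in, c_out, k, seq_len=64)
    total_params += p
    total_macs += m
```

Tools such as a torch model summary automate exactly this bookkeeping over every layer of a network, which is why no manual calculation was needed for Table 6.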

V. CONCLUSION
Technology and surveillance infrastructure have advanced to improve security, protect assets, and ensure public safety. Manual monitoring is tedious and time-consuming, especially when analyzing potential threats or violence. To overcome this challenge, several existing methods have attempted to address the issue using distinct algorithms that process frames locally without utilizing IoT settings. However, these methods still have limitations that must be overcome to achieve optimal performance and efficiency in surveillance and security systems.
VD-Net offers an innovative and integrated approach to surveillance, providing more efficient and effective security for individuals and organizations by addressing the limitations of manual methods through IoT integration. VD-Net analyzes input frames, extracts crucial information such as humans and violent objects, and shares relevant data within the IIoT network to make a final decision and alert the concerned parties in case of violent incidents. Through evaluation, our method achieved significant improvements in accuracy compared to existing methods in the literature. These results highlight the suitability of our approach for deployment in security systems on edge devices and represent a significant step toward enhancing security and safety in various settings.
Moreover, we plan to further improve the proposed VD-Net framework by exploring real-time data processing techniques and edge computing to reduce processing delays and enable faster detection of violent scenes. To achieve this, we may consider deploying our VD-Net model on more powerful edge devices or cloud servers with GPUs to enable more complex computations in real time.
To further validate the effectiveness and suitability of our approach for real-world deployment, we plan to test our method on more diverse datasets with a broader range of violent and nonviolent activities. This will allow us to assess the model's performance under different scenarios and conditions, identify potential limitations, and make necessary adjustments. Overall, our proposed approach has significant potential for enhancing security and safety in various settings, and we aim to continue improving and refining it to achieve optimal performance and efficiency.

FIGURE 1. Overview of the proposed violence detection system with four fundamental modules: integrated network operation, model training procedure, cloud setup for data processing, and IoT-based communication devices.

FIGURE 2. Overview of the proposed ST-TCN (a) and bottleneck attention (b) modules. All blocks are hierarchically interconnected in the TCN and connected in parallel in the bottleneck attention.

TABLE 1.

FIGURE 4. Visual representation of the utilized datasets: (a) the surveillance dataset, (b) the hockey dataset, (c) the movie dataset, and (d) the violent flows dataset.

TABLE 2. Ablation study of the designed violence detection network with the baseline on violence databases, where B indicates the baseline and MB indicates the modified baseline module.

TABLE 3. Ablation study of the designed violence detection network on the hockey dataset with different sequence lengths (SL), where B represents the baseline and MB represents the modified baseline model.

FIGURE 5. Confusion matrices of each dataset's actual and predicted values to better understand the model's performance.

TABLE 6. Computational complexity of the proposed VD-Net compared with the baselines.