YOLO-Fall: A Novel Convolutional Neural Network Model for Fall Detection in Open Spaces

Incidents involving personal injury occur frequently in industrial settings, and falls are among the most common safety hazards, which makes fall detection a research area of significant importance. Responding to fall events promptly can greatly reduce the harm they cause. However, most available fall detection models suffer from insufficient detection accuracy or from high parameter counts and computational requirements, making them difficult to deploy on local devices. In response, this paper introduces an enhanced convolutional neural network model, YOLO-fall, aimed at promptly detecting fall incidents. First, we propose a novel attention module, SDI, built on Coordinate Attention and Shuffle Attention, which strengthens feature extraction for the targets to be detected. Second, GSConv and VoV-GSCSP modules are introduced into the model's head to reduce parameters and computational complexity, making the model more suitable for deployment. Third, replacing the conventional $3\times 3$ convolution in the final ELAN (Efficient Layer Aggregation Network) module of the Backbone with the DBB (Diverse Branch Block) allows the model to capture features of different levels and types, increasing the network's diversity. Experimental results demonstrate that YOLO-fall improves mAP by 2.7% compared to YOLOv7-tiny while reducing model parameters by 3.5% and computation by 5.4%. Compared with existing detection algorithms under similar conditions, YOLO-fall is both more accurate and more lightweight.


I. INTRODUCTION
Industrial production is fraught with uncertainty and unexpected incidents. From construction sites to production workshops, workers across the industrial sector constantly face potential health risks [1]. However, one seemingly ordinary yet very common type of accident is often underestimated or overlooked: falling.
A fall occurs when an individual loses balance, leading to abrupt contact between the body and the ground or another object. Although such incidents are common in industrial settings, the consequences of a fall are far from negligible. Falls can result in fractures, soft tissue injuries, and severe head trauma, posing a significant threat to an individual's physical and mental well-being [2]. Furthermore, from a personal-safety perspective, failure to promptly detect and treat a fall may allow a simple injury to escalate into a more severe condition; for instance, untreated fractures or internal bleeding can trigger severe complications. From the standpoint of industrial production safety, if an equipment operator falls due to unforeseen circumstances and the equipment is left unattended, there is a potential hazard that could lead to serious safety incidents [3]. In the past, traditional fall detection methods relied largely on sensors: Hayashida [4] proposed the use of infrared sensors, Kaudki [5] suggested RFID-based detection, and Wu [6] introduced methods involving multiple sensors. Although these approaches have shown promising results, they have not been widely adopted because of limitations related to the detection environment and their scope of application.
In recent years, deep learning has been widely applied to pose and human behavior recognition [7]. Compared with traditional manual and sensor-based detection, deep learning offers a more accurate and reliable way to handle issues such as occlusion and crowded scenes; for example, when foot traffic is heavy and falls are hard to spot promptly at the scene, deep learning provides a solution. In 2015, Muzaffer Aslan [8] introduced a depth-based fall detection system that used a shape-based fall representation and an SVM (Support Vector Machine) classifier; however, it struggled with multi-class problems and large-scale training samples. In 2018, Li [9] proposed training a CNN with two convolutional layers. Nonetheless, as the network becomes deeper, a CNN that uses backpropagation to adjust internal parameters adapts slowly to changes near the input layer, and optimization via gradient descent can converge to local optima instead of the global optimum. In 2020, Chiang [10] studied object detection and recognition with deep neural networks, applied it to fall detection, and achieved an accuracy of 63.33% using YOLOv3. Even so, if deployed on embedded devices for local detection, the parameter count and detection speed of YOLOv3 are inadequate, and the achieved accuracy is not acceptable where life safety is concerned. Beyond challenges of detection accuracy and speed, existing fall detection methods are often limited to specific environments and cannot easily be applied in open, complex, and diverse scenarios, making it difficult to find a fall detection method suitable for general use in common settings.
The fall detection method we propose offers unique advantages and high accuracy in open environments. Through the YOLO-fall detection model, we can conduct accurate and reliable fall detection in various settings, such as crowded places [11] and indoor environments [12]. This approach takes into account the variations and complexities of different environments, providing strong support for early warning of fall events.
The contributions of this paper can be summarized as follows: The YOLO-fall architecture is fundamentally similar to YOLOv7-tiny. Compared to YOLOv7-tiny, YOLO-fall adds the SDI (Shuffle Dimensionally Integrated) attention module to the ELAN in the Backbone and to the SPPCSPC in the Head, enabling the model to focus on key regions and typical features, thus improving object detection accuracy.
YOLO-fall employs improvements based on the GSConv and VoV-GSCSP modules to reduce unnecessary parameters and computational complexity without compromising target accuracy.
YOLO-fall replaces the 3×3 convolution in the last ELAN module of the Backbone with the DBB module, further enhancing model accuracy.
The combined impact of these three improvements on the YOLO-fall model structure is depicted in Fig. 1.
Specific innovations in our work include the following: a) We introduce the SDI attention module, which can be embedded in both the Backbone's ELAN and the Head's SPPCSPC modules. SDI partitions the input feature tensor into multiple groups and directs them to different attention branches through shuffling, enhancing feature detection and fusion across the spatial, channel, width, and height dimensions; the processed feature tensors from each branch are then concatenated, improving the perception of objects. b) We reduce the model's parameter count and computational complexity by introducing VoV-GSCSP and GSConv in the Head. c) For feature extraction, we propose a new DBL module, which consists of the DBB, a Batch Normalization layer, and a LeakyReLU activation function. d) We curated a multi-pose fall detection dataset from the network and compared YOLO-fall with YOLOv7-tiny, demonstrating the advantages of YOLO-fall.
The remainder of this paper is organized as follows: Section II describes the improved modules used in YOLO-fall. Section III covers the dataset and experimental environment, Section IV presents the experiments and analysis, and Section V provides the conclusion and future prospects.

II. RELEVANT THEORETICAL KNOWLEDGE

A. ATTENTIONAL MECHANISMS
Over the past decade of deep learning development, attention mechanisms have played a pivotal role. They first gained wide attention with the 2014 paper published by the Google DeepMind team [13]. Subsequently, with the advancement of computer vision, attention mechanisms were introduced to emulate the way humans allocate attention during information processing. This allows models to selectively focus on different parts of the input data, thereby enhancing the model's performance and effectiveness.

1) SE
SE (Squeeze-and-Excitation) Attention [14] utilizes the methods of Squeeze, Excitation, and Scale to dynamically adjust the importance of different channels in the feature map, enhancing the network's focus on specific features.
Squeeze: This step compresses the 2D feature map (H × W) of each channel into a single real number using global average pooling. It aims to obtain channel-level global features, transforming the feature map from [h, w, c] to [1, 1, c].
Excitation: This operation establishes correlations between feature channels using two fully connected layers, generating a weight value for each feature channel. It outputs as many weight values as there are channels in the input feature map, transforming the feature map from [1, 1, c] to another [1, 1, c] tensor in which each channel corresponds to a weight value.
Scale: In this step, the normalized weights obtained from the Excitation operation are applied to the features of each channel, weighting the features channel by channel. This is done by multiplying each channel's feature values by its weight coefficient, producing the re-weighted feature map.

2) CBAM
CBAM (Convolutional Block Attention Module) [15] consists of two sub-modules: CAM (Channel Attention Module) and SAM (Spatial Attention Module). CAM helps identify critical channels and emphasize their importance, while SAM helps determine important spatial regions. Together, these attention mechanisms enable the model to selectively focus on channels and spatial locations, enhancing feature representation and improving the model's ability to understand and interpret images effectively. The structure of CBAM is shown in Fig. 3.
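To make these two ideas concrete, the following is a minimal PyTorch sketch (not the authors' code) of an SE-style channel attention block and a CBAM-style spatial attention block. The module names, the reduction ratio of 16, and the 7 × 7 kernel are illustrative choices, not values taken from this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze (global average pooling),
    excitation (two fully connected layers), scale (channel-wise re-weighting)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: [b, c, h, w] -> [b, c]
        w = self.fc(w).view(b, c, 1, 1)  # excitation: one weight per channel
        return x * w                     # scale

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention over channel-pooled maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)      # [b, 1, h, w]
        mx, _ = x.max(dim=1, keepdim=True)     # [b, 1, h, w]
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn
```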
3) CA
CA (Coordinate Attention) [16] applies pooling operations independently along the height and width dimensions of a feature map of shape C × H × W. It uses pooling kernels of size (H, 1) and (1, W) to perform average pooling on the feature map, as shown in (1) and (2).
This process generates feature maps with shapes of C × H × 1 and C × 1 × W.
The generated feature maps undergo a transformation, and then a concatenation operation is performed. Equation (3) shows the transformation and concatenation process.
By concatenating z^h and z^w and then applying a 1 × 1 convolution for dimension reduction and activation, we obtain a feature map f ∈ R^(C/r×(H+W)×1). Splitting f along the spatial dimension yields f^h ∈ R^(C/r×H×1) and f^w ∈ R^(C/r×1×W). Next, the dimension of each is increased using 1 × 1 convolutions, and applying a sigmoid activation function yields the final attention maps g^h ∈ R^(C×H×1) and g^w ∈ R^(C×1×W). The output of the Coordinate Attention block can then be expressed as (4). The CA structure is shown in Fig. 4.
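The following is a minimal PyTorch sketch of the CA computation just described: pooling along H and W, a shared 1 × 1 bottleneck, and per-direction sigmoid attention maps. The module name and reduction ratio are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of Coordinate Attention: pool along H and W separately,
    share a 1x1 bottleneck, then produce per-direction attention maps."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # [b, c, h, 1]
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # [b, c, w, 1]
        y = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(y_h))                       # [b, c, h, 1]
        g_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # [b, c, 1, w]
        return x * g_h * g_w
```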

4) SDI
Shuffle Attention groups the channel dimension into multiple sub-features and processes them in parallel. For each sub-feature, SA (Shuffle Attention) [17] uses the Shuffle Unit to capture feature dependencies in both the spatial and channel dimensions. All sub-features are then aggregated, and a ''channel shuffle'' operator enables information exchange between the different sub-features. However, SA focuses primarily on the spatial and channel dimensions and overlooks the width and height dimensions. We therefore propose SDI (Shuffle Dimensionally Integrated), which incorporates the advantages of all the aforementioned attention methods. SDI achieves higher detection accuracy with fewer layers and less data. The module's structure is depicted in Figure 5. Specifically, SDI consists of three sub-modules: Feature Grouping, Multi-branch Attention, and Aggregation.
Feature Grouping: For an input feature map, C, H, and W denote the number of channels, the spatial height, and the width, respectively. SDI divides the channel dimension into G groups, denoted X = [x_1, ..., x_G] with x_k ∈ R^(C/G×H×W), where each sub-feature x_k gradually captures a specific semantic response during training. An attention module is then employed to generate the corresponding importance coefficients for each sub-feature. Specifically, at the start of each attention unit, the input x_k is split into three branches along the channel dimension, namely x_k1, x_k2, and x_k3.
Multi-branch Attention: The Multi-branch Attention module consists of spatial attention, channel attention, height attention, and width attention. As illustrated in Figure 5, we use three branches for the attention mechanism. In the first branch, x_k1, we generate a channel attention map by exploiting inter-channel relationships. Channel attention is obtained by applying the SE method to x_k1, performing global average pooling over the spatial dimensions H × W to obtain the result z, as shown in (5).
Equation (6) shows the final output of the channel attention.
The second branch achieves horizontal and vertical attention by applying Weight-Adjustable Strip Pooling [18] to x_k2. The x_k2 tensor is fed into parallel pathways consisting of horizontal and vertical pooling. Each pathway includes the respective horizontal or vertical pooling layer, followed by a 2D convolutional layer with a 3 × 3 kernel to extract additional contextual information. This yields x^H_k2 ∈ R^(C/2G×H×1) and x^W_k2 ∈ R^(C/2G×1×W). Next, the pooled results from the 2D convolutions are replicated and expanded (horizontal pooling results are copied horizontally, and vertical pooling results are copied vertically) to match the original input dimensions of the current branch. Finally, the expanded horizontal and vertical pooling results are fused; Equation (7) shows the fusion process.
Considering diverse target shapes and varying aspect ratios, we introduce adjustable weights for x′_k2 so the module can adapt accordingly. To address variations in object shape across different datasets, Weight-Adjustable Strip Pooling adjusts the weights assigned to the height and width attention to achieve the optimal detection state, as shown in (8).
The fused feature tensor x′_k2 ∈ R^(C/2G×H×W), which contains enriched global contextual information, is then processed with the Scale method, as shown in (9). Meanwhile, the final branch, x_k3, produces a spatial attention map by leveraging inter-spatial relationships among features. This multi-branch design ensures that the model can effectively emphasize significant features across different dimensions.
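As a rough illustration only, the sketch below approximates the strip-pooling branch described above: horizontal and vertical pooling, 3 × 3 convolutions, expansion back to H × W, and a weighted fusion. The learnable alpha and beta weights are an assumption standing in for the paper's weight-adjustable fusion in (8), whose exact form is not reproduced here.

```python
import torch
import torch.nn as nn

class StripPoolingBranch(nn.Module):
    """Rough sketch of the strip-pooling branch; not the paper's exact module."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_w = nn.Conv2d(channels, channels, 3, padding=1)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # assumed learnable fusion weights
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.conv_h(x.mean(dim=3, keepdim=True))  # horizontal strip: [b, c, h, 1]
        x_w = self.conv_w(x.mean(dim=2, keepdim=True))  # vertical strip:   [b, c, 1, w]
        x_h = x_h.expand(-1, -1, -1, w)                 # copy across the width
        x_w = x_w.expand(-1, -1, h, -1)                 # copy across the height
        fused = self.alpha * x_h + self.beta * x_w      # weight-adjustable fusion
        return x * torch.sigmoid(fused)                 # scale the input feature map
```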
Aggregation: Finally, all sub-features are combined. Inspired by ShuffleNet v2 and SA, we apply a ''channel shuffle'' operation to aggregate all sub-features effectively; this operator promotes cross-group information exchange. The resulting output preserves the original size of the input feature map X, eliminating the need for additional convolutions to adjust the channel count.
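The ''channel shuffle'' operator itself is standard (ShuffleNet-style); a compact PyTorch version is shown below for reference.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """ShuffleNet-style channel shuffle that mixes information across groups."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap the group and per-group dims
    return x.view(b, c, h, w)                 # flatten back to [b, c, h, w]
```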

B. DBB
The DBB [19] combines several branches into a single convolution. Compared with a regular convolution, DBB employs a multi-branch structure during training, with each branch responsible for extracting different types of features: it enhances the original K × K layer by replacing a plain K × K convolution with parallel 1 × 1, 1 × 1 − K × K, 1 × 1 − AVG, and K × K branches for feature extraction. Different branches typically focus on features of different levels or types, increasing network diversity. After features are extracted from the multiple branches, they are fused to obtain the final comprehensive feature representation. During inference, this complex DBB structure can be equivalently transformed into a single convolutional layer for deployment. The DBB structure is illustrated in Figure 7.
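To illustrate the branch-merging idea behind DBB's inference-time transformation, the sketch below folds a parallel 1 × 1 branch into an equivalent K × K convolution. It assumes stride 1, matching channel counts, and bias-free convolutions, and it is only one of the several transforms DBB uses.

```python
import torch.nn as nn
import torch.nn.functional as F

def merge_1x1_into_kxk(conv_kxk: nn.Conv2d, conv_1x1: nn.Conv2d) -> nn.Conv2d:
    """Fold a parallel 1x1 branch (summed with a KxK branch) into one KxK conv.
    Assumes stride 1, bias-free convs, and identical in/out channel counts."""
    k = conv_kxk.kernel_size[0]
    fused = nn.Conv2d(conv_kxk.in_channels, conv_kxk.out_channels, k,
                      padding=k // 2, bias=False)
    pad = (k - 1) // 2
    # Zero-pad the 1x1 kernel to KxK so the two branches can be summed kernel-wise.
    fused.weight.data = conv_kxk.weight.data + F.pad(conv_1x1.weight.data,
                                                     [pad, pad, pad, pad])
    return fused
```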

C. GSCONV
GSConv [21] is composed of DWConv [20], group convolution, and shuffle operations. This combination significantly reduces the model's computational complexity and parameter count. Compared with regular convolutions, GSConv offers relatively higher accuracy while speeding up model convergence and detection. Although GSConv could replace regular convolutions anywhere in the model, using it in the Backbone would deepen the network, create resistance in the data flow, and significantly increase inference time. It is therefore used in the Head, where feature maps have already become slender and no longer require transformation. GSConv handles such feature maps well, with a computational cost of roughly 60% to 70% of standard convolutions, less redundant and repeated information, no need for compression, and better effectiveness of the attention modules.
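A minimal PyTorch sketch of a GSConv-style block is shown below; the exact kernel sizes, normalization, and activation are assumptions based on common implementations rather than details confirmed by this paper.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of GSConv: a standard conv producing half the output channels,
    a depthwise conv on that result, concatenation, then a channel shuffle
    that interleaves the dense- and depthwise-generated channels."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        x1 = self.conv(x)
        x2 = self.dwconv(x1)
        y = torch.cat([x1, x2], dim=1)
        b, c, h, w = y.shape
        # shuffle: interleave channels from the two halves
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```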
Building upon GSConv, GSbottleneck is introduced, and Figure 9 illustrates the structure of the GSbottleneck module.
Lastly, a cross-stage local network module, VoV-GSCSP, is designed using an aggregation approach, as depicted in Figure 10. The VoV-GSCSP module reduces computational and network structural complexity while maintaining sufficient accuracy.

D. YOLOV7-TINY
YOLOv7 [22] is the next-generation YOLO object detection network released in 2022 by the authors of YOLOv4. The YOLOv7 model is an anchor-based object detection algorithm that achieves rapid detection while maintaining high accuracy. It comprises seven versions tailored to different application scenarios and computing resources: YOLOv7, YOLOv7-d6, YOLOv7-e6, YOLOv7-e6e, YOLOv7-tiny, YOLOv7x, and YOLOv7-w6.
YOLOv7-tiny, derived from YOLOv7, follows a cascade-based model scaling strategy, achieving strong detection accuracy with fewer parameters and faster detection speed, which allows it to yield excellent results with minimal computational resources on various public datasets.

The YOLOv7-tiny framework comprises two main components, the Backbone and the Head, as shown in Figure 11.
This network features a novel backbone and introduces the ELAN structure, which reduces computational complexity while maintaining performance. It enhances the feature pyramid with a new fusion mechanism, SPPCSPC, which enables tighter integration of features from different hierarchical levels. The model's output uses an IDetect detection head, similar to YOLOR [23], but introduces an implicit representation strategy that refines predictions based on fused feature values, enhancing model performance.

III. DATASET AND EXPERIMENTAL ENVIRONMENT
The dataset used in this study consists of custom data built from images collected online.
During experimentation, we observed a certain ambiguity in distinguishing between ''up'' and ''fall'' poses when assessing squat positions; we therefore introduced a ''bending'' class [24], [25]. These three states, ''fall,'' ''bending,'' and ''up,'' constitute the dataset used for object detection. The dataset was annotated with the Make Sense labeling platform and comprises 4016 images, divided into training, validation, and testing sets at a 7:2:1 ratio.
Since most datasets available online are labeled only for the fall and not-fall states and lack labels for the intermediate bending state, we gathered images from multiple datasets and re-labeled them. We selected 3463 images from the Multiple Cameras Fall dataset [26], the Fall Detection Dataset [27], and the UR Fall Detection Dataset [28] for relabeling; the Fall Detection Dataset in particular uses data augmentation to build richer training data. To enhance the multi-object detection capability for industrial production activities, we also collected 553 photos of rescue and accident scenes from the Internet for annotation. Some of the dataset images are shown in Figure 12. Furthermore, the model's mosaic method was employed for data augmentation before training.
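For reference, a simple way to produce such a 7:2:1 split is sketched below; the directory layout and file names are hypothetical.

```python
import random
from pathlib import Path

# Hypothetical layout: every image in images/ has a YOLO-format label in labels/.
random.seed(0)
images = sorted(Path("images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.7 * n), int(0.2 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],   # remaining ~10%
}
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```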
All experiments were conducted in a Windows 10 environment using VSCode with the following configuration: CUDA 11.7, Python 3.8.17, and PyTorch 2.0.1. The hardware environment and model training parameters are outlined in Tables 1 and 2, respectively.

IV. EXPERIMENT AND ANALYSIS

A. EVALUATION METRICS
In this study, we used several evaluation metrics commonly applied in object detection, including Precision, Recall, mAP (mean Average Precision), GFLOPs, and Parameters. Among these, TP (True Positive) denotes cases where the classifier predicts a positive sample and the ground truth is also positive. TN (True Negative) denotes cases where the prediction is negative and the ground truth is indeed negative, i.e., the accurate identification of negative samples. FP (False Positive) and FN (False Negative) refer to the numbers of false alarms and missed detections, respectively.
Precision characterizes the accuracy of predicted positive samples: it is calculated as the number of correctly predicted positive samples divided by all predicted positive samples. A higher Precision indicates fewer false alarms, while a lower Precision indicates more. Equation (10) gives the Precision metric.

P = TP / (TP + FP)    (10)

Recall represents the coverage of correctly predicted positive samples. It is calculated as the number of correctly predicted positive samples divided by the total number of actual positive samples, which is the sum of TP and FN. A higher Recall indicates fewer missed detections, while a lower Recall indicates more. Equation (11) gives the Recall metric.

R = TP / (TP + FN)    (11)
AP (Average Precision) evaluates the detection performance for a single class. In object detection tasks, the model predicts the positions and categories of objects within an image, and AP quantifies performance by computing the area under the Precision-Recall curve. Precision is the proportion of correctly predicted positive objects among those detected as positive, while Recall is the proportion of all truly positive objects that were correctly detected. A higher AP value indicates better model performance.
mAP (mean Average Precision) provides a comprehensive evaluation of a multi-class or multi-object detection model. In multi-class object detection tasks, each class has its own AP, and mAP is the average of all class APs, measuring the average detection accuracy across classes. In particular, mAP@0.5 denotes the mAP at an IoU (Intersection over Union) threshold of 0.5, which is commonly used in object detection evaluations, as shown in (12).

mAP = (1/N) Σ_{i=1}^{N} AP_i    (12)

Here N is the total number of classes or object categories, and AP_i is the Average Precision for class i.
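The metric definitions in (10)-(12) translate directly into code; the snippet below is a small illustration with made-up numbers, not results from the paper.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0     # Eq. (10)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0     # Eq. (11)

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)  # Eq. (12): mAP = (1/N) * sum(AP_i)

# Illustrative numbers only (not from the paper):
print(precision(90, 10), recall(90, 5), mean_average_precision([0.95, 0.93, 0.97]))
```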
GFLOPs stands for giga floating-point operations (billions of FLOPs) and is a commonly used measure of a deep learning model's computational complexity. It gauges the computational demand of a model and is typically used to assess requirements for computational resources, speed, and efficiency. A higher GFLOPs value implies that the model needs more computational resources, such as CPUs, GPUs, or TPUs, for training or inference, and may require more time to complete its computations.
Parameters are a crucial component of neural network models.These parameters are variables that the model learns and adjusts autonomously, and they can be categorized into two main types: weight parameters and bias parameters.
Weight Parameters: Weight parameters determine the strength of connections between layers in the network.They define how input data propagates and transforms within the network.During the training process, these parameters are continuously adjusted, allowing the model to gradually learn features and patterns in the data to optimize its performance.
Bias Parameters: Bias parameters adjust the input to each neuron's activation function and play a significant role in the network's fitting capability.They enable the model to adapt to more complex data distributions.
The adjustment of parameters directly affects the model's performance and effectiveness.By tuning parameters appropriately, the model can better fit the training data and improve its generalization to new, unseen data.It's essential to find the right balance in parameter tuning to achieve optimal model performance.
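For reference, the parameter count of a PyTorch model can be obtained by summing the sizes of its weight and bias tensors, as sketched below with a stand-in torchvision model; GFLOPs are typically measured with third-party profilers such as thop or fvcore rather than computed by hand.

```python
from torchvision.models import mobilenet_v3_small  # stand-in model for illustration

model = mobilenet_v3_small()
n_params = sum(p.numel() for p in model.parameters())          # weights + biases
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Parameters: {n_params / 1e6:.2f} M (trainable: {n_trainable / 1e6:.2f} M)")
```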
Training loss: Training loss is a metric used to assess how well the model fits the training data. It is calculated by summing the error over each example in the training set and is visualized by plotting a training loss curve.
Validation loss: Validation loss evaluates the model's performance on the validation set. It is computed from the total error over each example in the validation set and is measured after each epoch. The learning curve of validation loss is typically visualized by plotting the results, and adjustments to the model may be considered based on the observed trends.
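A minimal sketch of plotting these two curves is shown below; the loss values are placeholders, not the paper's measurements.

```python
import matplotlib.pyplot as plt

# Per-epoch averages collected during training (hypothetical values).
train_losses = [2.1, 1.4, 1.0, 0.8, 0.7]
val_losses = [2.3, 1.6, 1.2, 1.0, 0.95]

plt.plot(train_losses, label="training loss")
plt.plot(val_losses, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curves.png")
```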

1) IMPROVEMENTS TO THE YOLOV7-TINY BASED ON SDI
To pinpoint key regions, suppress irrelevant information, and enhance the model's sensitivity and accuracy in perceiving objects, especially in complex backgrounds or multi-object scenarios, six different attention mechanisms were embedded into the Bottleneck of the ELAN module in the Backbone of YOLOv7-tiny for experiments. A comprehensive analysis and comparison of these attention mechanisms in YOLOv7-tiny was conducted to determine the most effective one; the results are presented in Table 3. By comparing the effects of the attention mechanisms added to the ELAN and SPPCSPC modules, we conclude that incorporating the SDI attention mechanism yields the highest mAP, reaching 94.5%, an improvement of 2.2% over the original YOLOv7-tiny. This indicates that the SDI method performs best in the object detection task, outperforming the other methods on this metric. The mAP metric is particularly valuable because it better reflects detection accuracy across object classes and carries higher practical significance in real-world applications. The SDI heatmap is shown in Figure 13.

2) IMPROVEMENTS TO THE YOLOV7-TINY BASED ON DBB
The DBB module introduces a multi-branch structure during training to enhance detection performance at the cost of increased computational complexity. However, at the testing stage it can be equivalently transformed into a single standard convolution, achieving an mAP improvement without increasing parameter count or computational load. Compared to YOLOv7-tiny, this integration resulted in a 1.6% increase in mAP. The performance with the DBB module is shown in Table 4.

3) IMPROVEMENTS TO THE YOLOV7-TINY BASED ON GSCONV
By modifying the structure of the model's head, replacing standard convolutions with the GSConv module, and replacing ELAN with VoV-GSCSP, this method successfully reduced the model's parameter count and computational complexity, lowering both GFLOPs and Parameters. Compared to YOLOv7-tiny, the model's parameters and computation were reduced by 3.5% and 6.2%, respectively, after the GSConv and VoV-GSCSP modules were installed. This indicates that the head improvements, namely the GSConv + VoV-GSCSP replacement, achieve lightweighting and reduce computational overhead.

5) ABLATION EXPERIMENTS
The SDI attention mechanism is integrated into the last ELAN of the Backbone, where we also replaced the standard 3 × 3 convolution with the DBB module. SDI attention is additionally introduced into the SPPCSPC module. In the Head, some standard convolutions are replaced with GSConv, and VoV-GSCSP replaces all ELANs. These modifications increase the focus on features that are critical for detecting targets in various regions while reducing computational complexity and parameter count, alleviating the detection load. Combining the results in Table 5 and Figure 15, we observe that the proposed model improves mAP for every class to varying degrees, with significant improvement in the ''fall'' and ''bending'' categories. By introducing the SDI attention mechanism, YOLO-fall gains a better ability to capture complex features. Furthermore, incorporating GSConv and VoV-GSCSP noticeably improves the trade-off between Parameters and GFLOPs; we found that adding GSConv and VoV-GSCSP on top of SDI led to mAP improvements, while using the GSConv and VoV-GSCSP modules alone caused a decrease in mAP. We attribute this to the enhancement of the SDI attention's effectiveness and of the model's perceptual capabilities. The addition of the DBB module led to an mAP improvement without increasing parameters or computational complexity.
Together, these improvements give YOLO-fall a 3.5% reduction in Parameters and a 5.4% reduction in GFLOPs compared with the original YOLOv7-tiny, while maintaining a high mAP of up to 95%. Model 8 was therefore selected as the best model and demonstrates strong capabilities in recognizing and detecting human falls in complex scenarios.

6) COMPARISON EXPERIMENTS
The YOLO-fall algorithm presented in this article is compared with widely used algorithms from recent years, including YOLOv3, YOLOv4, YOLOv5, and YOLOv7-tiny; the comparative results are shown in Table 6. Compared to YOLOv3 [29], YOLO-fall shows a 4.7% improvement in mAP with no significant change in precision and a 7.4% increase in recall, while computational complexity and the number of parameters are reduced by an order of magnitude. In comparison to YOLOv4, YOLO-fall demonstrates a 3.5% increase in mAP, with improvements in both precision and recall. Compared to YOLOv5 [31], YOLO-fall achieves a 3.2% increase in mAP while reducing computational complexity by 63.2% and the number of parameters by 17.1%. An intuitive comparison between YOLO-fall and other mainstream models is shown in Figure 16: YOLO-fall outperforms the other models in precision, recall, and mAP, as well as in complexity-related measures such as GFLOPs and Parameters.
To further assess the model's generalization capability, we conducted validation on the CAUCAFall [32] dataset. The results are presented in Table 7. On this indoor dataset, our model achieves a precision of 96%, a recall of 97.7%, and an mAP of 99%, affirming its strong generalization performance.

C. ANALYSIS OF DETECTION RESULT IMAGES
To validate the improved model's performance, images containing ''fall,'' ''up,'' and ''bending'' were selected for detection under the same hardware conditions. In the presented detection results, the left image shows detection with the original YOLOv7-tiny algorithm and the right image shows detection with our improved YOLO-fall algorithm. A comparison reveals that the original YOLOv7-tiny algorithm yields lower confidence values. In contrast, the proposed YOLO-fall algorithm, which integrates the three key improvements, significantly enhances feature extraction while reducing parameter count and computational load, resulting in noticeably more reliable detections.

V. CONCLUSION AND PROSPECT
This work tackles the challenges of high parameter count, heavy computational load, and low detection accuracy in human pose detection, specifically within industrial production monitoring. We propose a fall detection model, YOLO-fall, which improves on YOLOv7-tiny. Empirical results indicate that the improved YOLO-fall model maintains a high level of detection accuracy while notably decreasing computational demands. Nevertheless, given the variability of fall detection environments, false negatives remain possible in complex scenarios such as smoke or fire incidents.
In the future, we will work to enhance our model's capability to meet the requirements of fall detection projects in the following aspects: 1) Constructing a more suitable and dedicated mixed-scenario dataset that better reflects real-world scenarios and target features.
2) Improving detection accuracy while maintaining the model's lightweight nature to enhance its detection capabilities.
3) In future projects, designing an embedded system to deploy the YOLO-fall model together with inductive sensors, exploring possibilities for practical application in industrial safety.

FIGURE 5. The structure of SDI.

FIGURE 6. Specific structure of the second branch.

FIGURE 14. The model loss curves of YOLO-fall.

Figure 14 shows the training loss and validation loss curves. During the first 100 epochs, both training and validation loss decline rapidly, indicating that the model is actively learning and adapting to the training data; this initial decrease reflects the model's ability to capture relevant patterns and features of the dataset. After roughly 100 epochs, the loss values gradually stabilize, indicating that the model has converged. This stabilization suggests that the model has learned effectively from the training set and is not overfitting. The convergence of the training and validation losses indicates that the model is well fitted and able to generalize, reinforcing our confidence in its performance and its potential for robust predictions on new data.

FIGURE 15. The mAP changes for each class in the ablation experiments.

TABLE 3. Comparison of six attention mechanisms.
FIGURE 13. The heatmap of the SDI mechanism.

TABLE 4. Performance comparison after adding DBB modules.

TABLE 5. Comparison of ablation experimental results.
FIGURE 16. Comparison of YOLO-fall with other mainstream models.

TABLE 6. Comparison with other object detection algorithms.

TABLE 7. YOLO-fall performance evaluation metrics on the CAUCAFall dataset.