Human Fall Detection Based on Re-Parameterization and Feature Enhancement

Falls occur easily in crowded stairs, subway stations, bus stations, and factories, and real-time detection of human falls enables timely assistance. In this paper, we propose ED-YOLO, an efficient real-time detection network for human fall detection. First, a re-parameterization backbone is proposed. The shallow convolution (conv) modules in the backbone are replaced by DBBConv and DBBC3 modules, where DBBConv replaces the ordinary conv in the DBBC3 module. The deep conv modules in the backbone are replaced by E-DBBConv and E-DBBC3 modules, in which 1*3 and 3*1 convs replace the pooling branch of DBBConv, improving the ability to extract detailed features. Then, a novel feature enhancement module (FEM) is proposed to enhance the feature representation of the region of interest and the fusion of features; FEM is added to the feature pyramid network (FPN) to improve detection accuracy. Finally, the CIoU loss is replaced by the Gradient Smoothing-SIoU loss (GS-SIoU loss), in which gradient smoothing is introduced to improve the regression speed and accuracy of the prediction box. To further reduce the inference overhead of the model, the proposed network is pruned. The mAP of the proposed network reaches 96.25%, while the model has only 6.34M parameters, and the detection speed reaches 31 FPS on an RTX 2080 Ti. The proposed network and other mainstream lightweight networks are evaluated on the test set. The experimental results show that the human fall detection performance of the proposed network is superior to that of the other networks; in particular, its mAP is 2.42% higher than YOLOv5s and its detection speed is 14.8% faster.


I. INTRODUCTION
Research on human fall detection in open, highly crowded settings with large machines faces significant difficulties. In such scenarios, there are always overlaps between people, or between people and machines, which easily leads to missed and false detections. Moreover, the targets are mostly small and have few detailed textures and features, making it hard to extract enough features during feature extraction to meet the needs of real-time, high-precision detection. In public places, if a fall occurs without timely detection and treatment, dangerous accidents will happen and people's lives will be threatened.
Thus, it is of great significance to study the detection of human falls in scenes with severe overlapping and small targets.
In recent years, deep learning based target detection models have been widely used in many areas. Target detection models such as SSD [1], R-CNN [2], Fast RCNN [3], Faster RCNN [4], and the YOLO series [5], [6], [7], [8] are widely used in unmanned vehicles, automatic navigation, and pose detection. Target detection algorithms are divided into one-stage and two-stage algorithms based on the detection steps. Two-stage algorithms, typically represented by R-CNN, Fast RCNN, and Faster RCNN, divide detection into two steps with a clear division of labor: classification and regression. Although this can improve detection accuracy, the detection speed of two-stage algorithms is much slower than that of one-stage algorithms. One-stage algorithms perform target detection directly by regression, and their typical representatives are the YOLO series and the SSD algorithm. Matrix classification networks have also been explored: Haiyang et al. [9] propose a multi-class fuzzy support matrix machine (MFSMM) that establishes nonparallel hyperplane objective functions and integrates fuzzy attributes, which reduces the influence of interference on classification results; however, it does not work well with unbalanced samples. In addition, in order to improve the robustness of the classification model and reduce its sensitivity to outlier samples, Haiyang et al. [10] propose the twin robust matrix machine (TRMM). Google proposed the lightweight network MobileNet for mobile devices in 2017, whose biggest innovation is the Depthwise Separable Conv (DWConv); three versions have been released, MobileNetV1 through MobileNetV3 [11], [12], [13]. Alexey et al. propose the Vision Transformer (ViT), a visual model built only from Transformer structures [14], which pioneered the visual Transformer and achieves results comparable to state-of-the-art (SOTA) CNN (Convolutional Neural Network) models. In 2021, Ze et al. propose the Swin Transformer [15], a visual model based on window attention; it achieves higher efficiency by limiting the computation of self-attention to a local window through a shifted-window operation. Ze et al. then propose Swin Transformer V2 [16], which improves on the Swin Transformer to make the model larger and more adaptable to different image resolutions and window sizes. Since Transformer models are too large for real-time detection, YOLOv5s [17] is used as our baseline after weighing accuracy and speed. In this work, we propose a real-time network, Efficient Diverse Branch Block-YOLO (ED-YOLO).
Our contributions are as follows:
• A new re-parameterization backbone is proposed, which introduces conv branches of multiple shapes and improves the ability to extract detailed features.
• An Efficient-Coordinate Attention module (E-CA) and a novel feature enhancement module (FEM) based on E-CA are proposed. They can be used in the FPN to enhance the feature representation of the region of interest and the fusion of features.
• A Gradient Smoothing Scylla-IoU (SIoU) loss (GS-SIoU loss) is proposed in place of the Complete IoU (CIoU) loss. Gradient smoothing is introduced to improve the regression speed and accuracy of the prediction box by automatically assigning appropriate gradient values according to the size of the IoU.
• An efficient, real-time detection network, ED-YOLO, is proposed for human fall detection. It achieves fast, high-accuracy detection in complex scenes.
The rest of this paper is organized as follows: details of the related work are discussed in Section II. The architecture and implementation details of our network are presented in Section III. Experimental results are given in Section IV and conclusions in Section V.

II. RELATED WORK

A. RELATED WORK IN HUMAN FALL DETECTION
The human fall detection task in this paper is specific: it is one of the many detection tasks concerning pedestrian targets and pose estimation. Many studies have been done on the detection of pedestrian targets. Wang et al. [18] propose a high-quality feature generation pedestrian detection algorithm to improve detection performance. They hold that humans can better predict the presence of a target by considering the mutual cues of all available instances in an image, and that the fusion of multimodal features can express this process. In order to utilize and fuse multimodal features, Xue et al. [19] propose MAF-YOLO, a novel multimodal attention fusion network for real-time pedestrian detection, to improve the accuracy of nighttime detection. Cao et al. [20] present a novel multi-spectral pedestrian detector performing locality-guided cross-modal feature aggregation and pixel-level detection fusion. Sweta et al. [21] propose an improved lightweight MS-ML-SNYOLOv3 network that extracts hierarchical feature representations and adds a larger perceptual field in the extension to improve detection accuracy. Di et al. [22] propose a new pixel-by-pixel prediction scheme for an infrared pedestrian detection network to address the difficulty of infrared pedestrian detection. For pedestrian detection, the result of label assignment directly determines detection performance, and the loss function is also key to updating the parameters of a detection model. Zheng et al. [23] propose loss-aware label assignment (LLA) to improve the performance of pedestrian detectors in crowded scenes. Gao et al. [24] propose a network using keypoint monitoring and grouped feature fusion to handle local occlusion. In addition, many scholars have studied pose estimation tasks. Dong et al. [25] propose a human pose estimation framework characterized by the joint usage of global and local attention modules in an hourglass backbone network; the global attention module aims to reduce the negative impact of the background, while the local attention module helps refine each joint. Luo et al. [26] propose FastNet, an efficient high-resolution network for human pose estimation, which has higher accuracy than other popular lightweight models. Wang et al. [27] design a lightweight bottleneck block with a re-parameterized structure, and their model achieves higher accuracy at the same computational cost. Yu et al. [28] propose a scale-aware heat map generator to solve the problem of multiscale pose estimation. Zhou et al. [29] perform mutual learning inside a compositional model for human pose estimation, which facilitates the interaction of information between limbs and joints and successfully invokes the overall information. Taufeeque et al. [30] propose an LSTM-based approach to human pose estimation, which supports multi-camera and multi-target tracking. By studying potential fall recognition processes, Zhang et al. [31] reveal the patterns of network learning and the factors that seriously affect model detection performance. Zhang et al. [32] propose a new image-based fall detection method for the bus compartment scene; this method combines object detection and pose estimation, introduces a fall identification network, and achieves 90% accuracy in the bus scene. Fan et al. [33] propose an improved YOLOX algorithm that enhances feature expression by introducing SIoU loss and recursive gated convolution modules, achieving 78.45% mAP in complex substation scenes. Danilenka et al. [34] propose a lightweight long short-term memory model for fall detection, capable of operating in an IoT environment with limited network bandwidth and hardware resources; the accuracy and F1-score of the model on their collected dataset exceed 0.95 and 0.9, respectively. Although existing detection algorithms have made some progress on the human fall detection problem, detailed feature extraction for small targets and multiple occlusions is still not well addressed. In addition, in public scenes, human fall detection algorithms must take both accuracy and real-time performance into account.

B. RELATED WORK IN ATTENTIONAL MECHANISMS
In recent years, attention mechanisms have been widely used in computer vision. Two common types are channel and spatial attention. Hu et al. propose the channel attention mechanism Squeeze-and-Excitation (SE) [35], which addresses the fact that different channels of the feature map carry different importance during pooling. In the traditional pooling process, each channel of the feature map is treated as equally important, but experiments show that the importance of different channels differs. Woo et al. propose the Convolutional Block Attention Module (CBAM), which combines channel attention with spatial attention. Hou et al. propose Coordinate Attention (CA) [38], a mechanism that embeds location information into channel attention. It captures long-range dependencies along one spatial direction, while the other spatial direction retains accurate location information to capture regions of interest on the feature map. Inspired by CA, a feature enhancement fusion module is designed in this paper, and the results obtained after enhancement fusion are used to capture the targets of interest with CA, as a way to model the long-range dependencies between multi-scale features.

C. RELATED WORK IN STRUCTURAL RE-PARAMETERIZATION
The key problem in the field of structural re-parameterization is to find a method that improves network performance without increasing the computational overhead at inference.
The essence of structural re-parameterization is that the structure used in training corresponds to one set of parameters, while the structure desired at inference corresponds to another set. As long as the parameters of the former can be equivalently converted into the latter, the structure of the former can be equivalently converted into the latter. Ding et al. propose a new convolutional structure, Asymmetric Convolution Net (ACNet) [39], which expands convolutions during training: each 3*3 conv in the existing network is replaced with three parallel conv layers, a 1*3 conv, a 3*1 conv, and a 3*3 conv. The computational results of these three conv layers are then fused to obtain the output of the conv layer. The approach is named asymmetric conv because of the asymmetry of the two introduced convs. The schematic of asymmetric conv is shown in Figure 1.
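To make the fusion concrete, the following minimal PyTorch sketch (our illustration, not the authors' code; the BN layers that ACNet also fuses are omitted for brevity) folds the three branches into a single 3*3 conv and verifies the equivalence numerically:

```python
import torch
import torch.nn.functional as F

def fuse_acnet_branches(w3x3, b3x3, w1x3, b1x3, w3x1, b3x1):
    """Fold parallel 3*3, 1*3, and 3*1 conv branches into one 3*3 conv.
    Weight shapes: w3x3 (O,I,3,3), w1x3 (O,I,1,3), w3x1 (O,I,3,1)."""
    w = w3x3.clone()
    w[:, :, 1:2, :] += w1x3   # the 1*3 kernel sits on the middle row
    w[:, :, :, 1:2] += w3x1   # the 3*1 kernel sits on the middle column
    return w, b3x3 + b1x3 + b3x1

# Sanity check: the fused conv reproduces the sum of the three branches.
x = torch.randn(1, 8, 16, 16)
w3x3, b3x3 = torch.randn(16, 8, 3, 3), torch.randn(16)
w1x3, b1x3 = torch.randn(16, 8, 1, 3), torch.randn(16)
w3x1, b3x1 = torch.randn(16, 8, 3, 1), torch.randn(16)
y_branches = (F.conv2d(x, w3x3, b3x3, padding=(1, 1))
              + F.conv2d(x, w1x3, b1x3, padding=(0, 1))
              + F.conv2d(x, w3x1, b3x1, padding=(1, 0)))
w, b = fuse_acnet_branches(w3x3, b3x3, w1x3, b1x3, w3x1, b3x1)
assert torch.allclose(y_branches, F.conv2d(x, w, b, padding=1), atol=1e-4)
```

Because convolution is linear in its weights, zero-padding the asymmetric kernels onto the 3*3 grid and summing is exact, which is why the three-branch training structure costs nothing at inference.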
In 2021, Ding et al. propose the Diverse Branch Block (DBB) [40], a general module that improves CNN performance at no inference-time cost. It has a diverse branching structure similar to ACNet and adds average pooling to increase the expressiveness of the model, as shown in Figure 2.
In addition, Ding et al. [40] propose RepVGG, which uses structural re-parameterization technology without attention, without novel activation functions, and even without branch structures, using only 3*3 convs and ReLU to improve the inference speed and accuracy of the model. Later, Ding et al. [41] propose RepMLP, which combines MLPs and convs, not only retaining the global modeling ability and positional priors of the MLP but also integrating the local priors of the conv.

III. PROPOSED NETWORK
The network proposed in this paper is described in this section. The network structure of ED-YOLO is introduced in Section III-A, while the improvement of the IoU loss and model pruning are introduced in Sections III-B and III-C.

A. NETWORK STRUCTURE OF ED-YOLO
The overall network structure proposed in this paper consists of three parts: backbone, neck (FPN), and head, as shown in Figure 3. To address the difficulty of detection in occluded scenes and improve detection accuracy, we borrow the idea of re-parameterization and improve the DBB module by introducing a 1*3 conv and a 3*1 conv on top of the original re-parameterization, which increases the re-parameterization branches and improves the model's ability to extract fall information. In the backbone, the structure of DarkNet53 is used as the feature extraction network. The Focus module is used in the first layer; the DBBConv (Figure 2) and DBBC3 modules are used in layers 2-5, where DBBC3 is composed by replacing the last conv block of the Bottleneck in the C3 module with DBBConv; the E-DBBConv and E-DBBC3 modules are used in layers 6-9; and the SPPF module is used in the last layer. In DarkNet53, the role of the Focus module in the first layer is to slice the image before it enters the backbone. The backbone otherwise uses an ordinary conv module with a kernel size of 3*3 and a stride of 2, so the structurally re-parameterized DBB is used to replace this 3*3 conv. Since the last layer of the backbone is a pooling layer, continuing pooling operations in the deep network would lose a lot of semantic information. Thus, the DBB is improved for the deep network: the improved DBBConv is called Efficient-DBBConv (E-DBBConv), and the improved DBBC3 is called Efficient-DBBC3 (E-DBBC3). In the neck, the network adopts the feature fusion structure of PANet [42], a bi-directional fusion network with top-down and bottom-up paths. In addition, since fall information in heavily occluded scenes is often not adequately expressed, a Feature Enhance Module (FEM) is designed in this paper, composed of a Receptive Field Block (RFB) [43], an Efficient-Coordinate Attention (E-CA) module, and a cascaded conv module with a 1*1 kernel. We improve the CA module, design the FEM around it, and add it to the network layers with smaller channel counts in the neck so that the network pays more attention to the fall area. It is worth mentioning that the RFB module in the FEM is composed of a set of dilated convs, which compute by sampling points at intervals; under occlusion, this computation can obtain valuable pose information while ignoring the occlusion, complementing the feature extraction ability of the re-parameterized backbone. Finally, the FEM is added behind the detection layers with feature map sizes of 40*40 and 80*80; if the FEM were added before the 20*20 feature layer, many more parameters would be introduced because of the numerous channels in that layer.

1) E-DBBCONV AND E-DBBC3
The structure of Efficient-DBBConv is shown in Figure 4 (a), and Figure 4 (b) shows the overall structure of the E-DBBC3 module, which consists of a tandem channel and a 1*1 conv channel joined by a Concat operation, followed by a conv layer. In the main branch, there is a bottleneck layer composed of N bottleneck modules, where N can be 3, 6, or 9. Inside the bottleneck module, there are also two fusion branches: one branch is a shortcut, and the other is a tandem branch of a 1*1 conv and E-DBBConv. It is worth noting that the fusion here is performed by Add rather than Concat: the former is a direct addition of values, while the latter is a splicing along the channel dimension, as illustrated in the sketch below. The DBBC3 module is obtained by replacing the E-DBBConv module in E-DBBC3 with the DBBConv module (Figure 2).
As shown in Figure 4 (b), the four parameters of the conv module and the E-DBBConv module represent input channels, output channels, kernel size, and stride, respectively. The two parameters of the bottleneck are input channels and output channels, and the parameter of Concat indicates output channels.
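A minimal sketch of this Add/Concat distinction (our illustration, with a plain 3*3 conv standing in for E-DBBConv): Add keeps the channel count, while Concat stacks channels.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Shortcut + (1*1 conv -> 3*3 conv) branch, fused by element-wise Add."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = nn.Conv2d(c, c, 1, 1, 0)   # 1*1 conv at the branch head
        self.cv2 = nn.Conv2d(c, c, 3, 1, 1)   # stand-in for E-DBBConv
        self.shortcut = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.shortcut else y  # Add: direct value addition

x = torch.randn(1, 64, 40, 40)
print(Bottleneck(64)(x).shape)         # Add keeps 64 channels
print(torch.cat([x, x], dim=1).shape)  # Concat splices to 128 channels
```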

2) FEATURE ENHANCE MODULE
In order to address the problem that fall information in heavily occluded scenes is often not adequately expressed, a feature enhancement module is designed in this paper, mainly composed of the RFB, the Efficient-Coordinate Attention (E-CA) module, and a 1*1 conv. The structure is shown in Figure 5. The Concat module fuses the feature maps of the upper layer with the corresponding feature maps of the same size in the backbone along the channel dimension. The RFB enhances the fused input information and passes it into the 1*1 conv layer to squeeze the number of output channels and the semantic information; the result is finally passed into the E-CA module to strengthen the feature representation of the target of interest.
The RFB is mainly composed of dilated convs with different dilated rates, where the dilated rate indicates the number of non-convolutional pixels between adjacent convolutional pixels. Specifically, the receptive field of an ordinary conv with a kernel size of 3 and a stride of 1 is 3*3, while a dilated conv with a kernel size of 3, a stride of 1, and a dilated rate of 1 has a receptive field of 5*5. Under the same kernel size and stride, the dilated conv has a larger receptive field than the ordinary conv. The RFB in this paper is composed of three parallel 3*3 dilated convs with dilated rates of 1, 3, and 5, respectively. If the dilated rate is too small, there is no large receptive field; conversely, if the dilated rate is too large, much useful information is skipped, making the information unavailable.
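The following minimal sketch (our reconstruction, not the authors' exact module) realizes this parallel composition. Note that under the paper's definition of the dilated rate as the number of skipped pixels, a rate of r corresponds to a PyTorch dilation of r + 1; setting padding equal to the dilation keeps the spatial size unchanged. The 1*1 fusion conv at the end is our assumption.

```python
import torch
import torch.nn as nn

class RFB(nn.Module):
    def __init__(self, c_in, c_out, rates=(1, 3, 5)):
        super().__init__()
        # Three parallel 3*3 dilated convs with the paper's rates 1, 3, 5.
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, stride=1,
                      padding=r + 1, dilation=r + 1)
            for r in rates
        )
        self.fuse = nn.Conv2d(len(rates) * c_out, c_out, 1)  # merge branches

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 128, 80, 80)
print(RFB(128, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```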
Coordinate Attention (CA) is a lightweight attention mechanism that embeds location information into channels, which greatly reduces computation while sensing location. The CA module uses a conv to squeeze the number of channels after the pooling operation, and another conv after the BN layer to restore the original number of channels, reducing the parameters of the CA module with a compression factor of r. Although this approach reduces the number of parameters, too much semantic information is inevitably lost during the squeeze. Therefore, the CA module is improved by removing the two conv layers used to compress the number of channels, and an Efficient-Coordinate Attention (E-CA) module is proposed in this paper. The schematic comparison of the Coordinate Attention block and the Efficient-Coordinate Attention is shown in Figure 6.
For a given input X, each channel is encoded along the horizontal and vertical directions respectively. The output of the c-th channel at height h can be expressed as:

$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i) \tag{1}$$

where W denotes the horizontal spatial extent of the pooling kernel. Similarly, the output of the c-th channel at width w can be expressed as:

$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \tag{2}$$

where H denotes the vertical spatial extent of the pooling kernel.
The above outputs are sliced by E-CA into two separate tensors along the spatial dimension: $f^h \in \mathbb{R}^{C \times H}$ and $f^w \in \mathbb{R}^{C \times W}$. They then pass through the BN layer and the h-swish activation function, and the output values are finally mapped into (0, 1) by a sigmoid. The h-swish activation function can be expressed as:

$$\text{h-swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6} \tag{3}$$

The output Y of the whole E-CA can be expressed as:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{4}$$

where $g^h$ and $g^w$ denote the attention weights of E-CA.
According to Eqs. (1)-(4), the feature representation of the key regions of a fall is enhanced.
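A minimal sketch of E-CA following Eqs. (1)-(4) (our reconstruction, not the authors' code), with the two channel-compressing conv layers of the original CA removed entirely:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def h_swish(x):
    return x * F.relu6(x + 3) / 6                      # Eq. (3)

class ECA(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)             # no compression convs

    def forward(self, x):
        n, c, h, w = x.shape
        f_h = x.mean(dim=3, keepdim=True)              # Eq. (1): (n, c, h, 1)
        f_w = x.mean(dim=2, keepdim=True)              # Eq. (2): (n, c, 1, w)
        y = torch.cat([f_h, f_w.permute(0, 1, 3, 2)], dim=2)
        y = h_swish(self.bn(y))                        # shared BN + h-swish
        f_h, f_w = torch.split(y, [h, w], dim=2)       # slice back into two
        g_h = torch.sigmoid(f_h)                       # weights along height
        g_w = torch.sigmoid(f_w.permute(0, 1, 3, 2))   # weights along width
        return x * g_h * g_w                           # Eq. (4)

x = torch.randn(2, 64, 40, 40)
print(ECA(64)(x).shape)  # torch.Size([2, 64, 40, 40])
```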

B. IMPROVEMENT OF IOU LOSS
Although GIoU [44], CIoU, and DIoU [45] have had a positive impact on training time and final results, there is great uncertainty in the regression direction, so the prediction box cannot quickly return to the vicinity of the ground-truth box. Zhora Gevorgyan proposes a new loss function, SIoU [46], in which the penalty metric is redefined by considering the vector angle between the desired regressions. The vector angle is thereby introduced into the IoU loss for the first time, which reduces the degrees of freedom of the loss and allows it to converge faster, helping the prediction box approach the ground-truth (GT) box more quickly.
The SIoU loss function consists of four cost terms: the angle, distance, shape, and IoU costs.

1) ANGLE COST
The model first brings the prediction to the X or Y axis and then gradually approaches the ground truth along the relevant axis. When α ≤ π/4, the convergence process focuses on minimizing α; otherwise it minimizes β. To achieve this, the angle cost is defined by Eq. (5):

$$\Lambda = 1 - 2\sin^2\left(\arcsin(x) - \frac{\pi}{4}\right) \tag{5}$$

where $x = c_h / \sigma = \sin(\alpha)$, $\sigma$ is the distance between the center points of the GT box and the prediction box, and $c_h$ is the height difference between the two center points.

2) DISTANCE COST
The distance cost is redefined according to the angle cost:

$$\Delta = \sum_{t=x,y} \left(1 - e^{-\gamma \rho_t}\right), \quad \gamma = 2 - \Lambda \tag{7}$$

where $\rho_x = \left(\frac{b_{c_x}^{gt} - b_{c_x}}{c_w}\right)^2$ and $\rho_y = \left(\frac{b_{c_y}^{gt} - b_{c_y}}{c_h}\right)^2$, with $(c_w, c_h)$ here denoting the width and height of the smallest enclosing box of the GT and prediction boxes. According to Eq. (7), the contribution of the distance cost decreases as α → 0.

3) SHAPE COST
The attention paid to the shape cost should differ across datasets, so a parameter is needed to control how much the loss function attends to the shape cost. The shape cost is defined as:

$$\Omega = \sum_{t=w,h} \left(1 - e^{-\omega_t}\right)^{\theta} \tag{9}$$

where $\omega_w = \frac{|w - w^{gt}|}{\max(w, w^{gt})}$ and $\omega_h = \frac{|h - h^{gt}|}{\max(h, h^{gt})}$. For the parameter θ, the range of definition is from 2 to 6.

4) IOU COST
The IoU cost can be defined as:

$$IoU = \frac{|B \cap B^{GT}|}{|B \cup B^{GT}|}$$

5) SIOU LOSS
In summary, the final regression loss function is defined as:

$$L_{IoUcost} = 1 - IoU + \frac{\Delta + \Omega}{2}$$

The SIoU loss considers the matching direction of the prediction box: the vector angles of the prediction box and the GT box are added to the loss function, which makes the prediction box move to the nearest axis very rapidly.
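For reference, the following sketch assembles the full SIoU computation of Eqs. (5)-(9) and the IoU cost in PyTorch (our reading of Gevorgyan's formulation; the variable names are ours). Boxes are (x1, y1, x2, y2) tensors of shape (N, 4).

```python
import torch

def siou_loss(pred, gt, theta=4.0, eps=1e-7):
    # IoU cost
    inter_w = (torch.min(pred[:, 2], gt[:, 2]) - torch.max(pred[:, 0], gt[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], gt[:, 3]) - torch.max(pred[:, 1], gt[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # Box centers and the smallest enclosing box
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_g, cy_g = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])

    # Angle cost, Eq. (5)
    sigma = torch.sqrt((cx_g - cx_p) ** 2 + (cy_g - cy_p) ** 2) + eps
    sin_alpha = (cy_g - cy_p).abs() / sigma
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha.clamp(-1, 1)) - torch.pi / 4) ** 2

    # Distance cost, Eq. (7)
    gamma = 2 - angle
    rho_x = ((cx_g - cx_p) / (cw + eps)) ** 2
    rho_y = ((cy_g - cy_p) / (ch + eps)) ** 2
    dist = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # Shape cost, Eq. (9)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_g, h_g = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    omega_w = (w_p - w_g).abs() / torch.max(w_p, w_g)
    omega_h = (h_p - h_g).abs() / torch.max(h_p, h_g)
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape) / 2
```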

6) GRADIENT SMOOTHING SIOU LOSS
IoU is the intersection-over-union of the prediction box and the ground-truth box, which can be used not only to determine positive and negative samples but also to evaluate the distance between the prediction box and the ground-truth box. From the above equations, the expression of the SIoU loss is $L_{IoUcost} = 1 - IoU + \frac{\Delta + \Omega}{2}$, which is too simple to distinguish targets with high IoU from those with low IoU. This inability is reflected in the gradient: both high-IoU and low-IoU targets show a gradient of −1 at the IoU term, which lowers the accuracy of the prediction box during regression. In particular, in the original YOLOv5, a target is detected by default as long as the IoU of the prediction box and the ground-truth box is higher than a certain threshold, which further decreases the accuracy of the prediction box. To solve this problem, a Gradient Smoothing-SIoU loss (GS-SIoU loss) is designed in this paper, which increases the gradient of the loss at low IoU to speed up the convergence of the prediction box, and decreases the gradient at high IoU so that the prediction box can be fine-tuned sufficiently. The improved IoU loss expression is as follows.
In the equation above, α denotes the exponential offset coefficient, β denotes the exponential scaling factor, and δ denotes the correction offset. IoU takes values between 0 and 1, and α and β together determine the gradient change of the SIoU loss over this interval; δ corrects the range of the function so that it falls within the original range, without changing the range of the original loss.
Since the optimization objective is the IoU, whose range is (0, 1), in order to ensure that the range of the loss does not change, the range of the regularization term in the IoU cost is expressed with a variable c: when the IoU is 1, the value of c is 0, so the range is (0, 1+c). Substituting this range into Eq. (13), removing the exponent e, and applying Eq. (14) yields the constraints on the parameters. Because the range of $L_{IoUcost}$ is (c, 1+c), the boundary conditions at the two endpoints must be equal. This gives a system of three first-order equations with only two constraints, so there are infinitely many solutions. In addition, the gradients at the two endpoints, IoU = 0 and IoU = 1, must also satisfy Eq. (20). Combining Eqs. (19) and (20), the solution space of this system is the value space of α, β, and δ. To ensure that the gradient improvement is significant, the farther the endpoint gradients are from −1 the better; we recommend a margin of more than 50%, as in Eq. (22). The three parameters must satisfy the above equations.
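A small numerical check of the motivation above: the gradient of the 1 − IoU term is −1 regardless of the IoU value, whereas an exponential reweighting steepens the low-IoU gradient and flattens the high-IoU one. The exponential form and the values of α, β, and δ below are purely hypothetical stand-ins for the parameterization of Eqs. (13)-(22), chosen only to illustrate the behavior.

```python
import torch

# The plain IoU term: gradient is -1 at both low and high IoU.
for v in (0.2, 0.9):
    iou = torch.tensor(v, requires_grad=True)
    (1 - iou).backward()
    print(f"IoU={v}: d(1-IoU)/dIoU = {iou.grad.item():+.2f}")  # always -1.00

# Hypothetical exponential reweighting (alpha, beta, delta illustrative only).
alpha, beta, delta = 1.0, 2.0, 0.5
for v in (0.2, 0.9):
    iou = torch.tensor(v, requires_grad=True)
    (torch.exp(beta * (alpha - iou)) - delta).backward()
    print(f"IoU={v}: reweighted gradient = {iou.grad.item():+.2f}")
# The low-IoU gradient is now much steeper than the high-IoU one, which is
# the qualitative behavior the GS-SIoU loss is designed to achieve.
```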

C. MODEL PRUNING
Model pruning removes redundant, non-critical weights (channels). Through pruning, the model size can be significantly reduced and inference can be accelerated without a significant reduction in detection accuracy. The dataset in this paper is a human fall dataset with merely a single category, which necessarily leaves many redundant channels. Therefore, in order to solve the problem of slow detection speed, it is necessary to prune the model and remove the unimportant channels. Batch Normalization (BN) [47] is a standardization method proposed by the Google team, which counteracts the shifting distribution of intermediate-layer data during training, preventing gradients from vanishing or exploding and speeding up training. Chiliang et al. [48] propose Channel Threshold-Weighting (T-Weighting) modules to choose and prune unimportant feature channels at the inference phase; as the pruning is done dynamically, it is called Dynamic Channel Pruning (DCP). Liu et al. [49] propose a Conditional Automated Channel Pruning (CACP) method that produces compressed models under different compression rates through a single channel pruning process. Yu et al. [50] propose a one-shot global pruning approach called Gate Trimming (GT), which compresses CNNs more efficiently.
The BN is defined as:

$$Z_{out} = \gamma \cdot \frac{Z_{in} - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} + \beta \tag{23}$$

In Eq. (23), $Z_{out}$ denotes the output of the BN layer, γ denotes the weight coefficient of the channel, β denotes the translation coefficient, $\mu_B$ and $\sigma_B^2$ denote the mean and variance, and ε denotes a small constant.
According to Eq. (23), the corresponding BN activation values will also be small when γ is small; therefore, channels with very small γ should be removed. However, few model weights lie around 0 when γ follows a normal distribution, so an L1 regularization constraint must be added to sparsify the parameters before pruning the channels with very small γ. The regularized objective is:

$$L = \sum_{(x, y)} l\big(f(x, W), y\big) + \lambda \sum_{\gamma \in \Gamma} |\gamma| \tag{24}$$

In Eq. (24), λ denotes the regularization coefficient. The sparsity coefficient is very important for the pruning performance of the entire network. If λ is too small, the BN-layer coefficients will not tend strongly enough toward 0, and high-intensity pruning cannot be carried out; conversely, if λ is too large, it will harm the detection performance of the entire network and reduce mAP. In this paper, the value of λ is 0.01.
Considering that the model backbone has a re-parameterized structure, we choose to prune the neck of the network, which has many layers, numerous modules, and a large number of parameters, so it benefits most from pruning. In order to obtain light model weights, we adopt a pruning strategy suited to the characteristics of the network in this paper. Since re-parameterization is used in the backbone to obtain higher accuracy, we do not prune that part at all; the pruning focuses on the neck. First, we freeze the weights of the backbone and perform sparsity training, so the sparsified weights belong only to the neck. Then the channels whose BN coefficients are close to 0 are pruned with a pruning rate of 50%. To meet front-end acceleration requirements, we add logic to keep the channel counts of the pruned model as multiples of 4. Since the pruned model cannot otherwise be aligned on the channels, and the C3 module in the neck contains an Add operation, meaning the results of two convs are added, we fuse the two convs containing the Add operation and then prune them to obtain the pruned model. Finally, since the pruned model loses some accuracy, we retrain it and obtain the final model weights after 100 training iterations.
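The channel-selection step can be sketched as follows (our assumption about implementation details, not the authors' code): collect |γ| from the neck's BN layers, take a global threshold at the 50% pruning rate, and round each layer's kept-channel count down to a multiple of 4.

```python
import torch
import torch.nn as nn

def select_channels(neck_bn_layers, prune_rate=0.5, align=4):
    # Global threshold over all |gamma| values in the neck's BN layers.
    gammas = torch.cat([bn.weight.abs().detach() for bn in neck_bn_layers])
    threshold = torch.quantile(gammas, prune_rate)
    keep_masks = []
    for bn in neck_bn_layers:
        g = bn.weight.abs().detach()
        kept = int((g > threshold).sum())
        kept = max(align, (kept // align) * align)   # multiple of 4, >= 4
        idx = g.argsort(descending=True)[:kept]      # top channels by |gamma|
        mask = torch.zeros(g.numel(), dtype=torch.bool)
        mask[idx] = True
        keep_masks.append(mask)
    return keep_masks

# Toy usage with random sparse-ish gammas for three neck BN layers.
bns = [nn.BatchNorm2d(c) for c in (64, 128, 256)]
for bn in bns:
    bn.weight.data = torch.randn(bn.weight.shape).abs()
print([int(m.sum()) for m in select_channels(bns)])
```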

IV. EXPERIMENTAL RESULTS AND ANALYSES
The deep learning environment and framework used for the experiments are shown in Table 1; the same configuration is applied to the other networks.

A. DATASET AND EXPERIMENTAL ENVIRONMENTS
The dataset labels used in the experiments were produced by ourselves, with 4697 images coming from a public fall detection dataset [56] and 792 images from video frames. There are 5489 images in total: 4500 images are used for the training set, 489 for the validation set, and 500 for the test set, as shown in Figure 8. In addition, 3124 pure background images are added to the fall dataset and split in the same proportion. Mosaic data augmentation is used before training. The input image size of the network is uniformly resized to 640*640, and the batch size is set to 16.
Besides, we add an additional wild dataset of 1500 images for field testing to verify the generalization ability of the model.
The hyperparameters of the network are set as shown in Table 2.

B. PERFORMANCE EVALUATION CRITERIA
In this paper, the Precision (P), Recall (R), Average Precision (AP), and Mean Average Precision (mAP0.5) are used to evaluate detection accuracy, as shown in Eqs. (25)-(26):

$$P = \frac{TP}{TP + FP} \tag{25}$$

$$R = \frac{TP}{TP + FN} \tag{26}$$

In order to evaluate the performance of the proposed network, the ED-YOLO algorithm is compared with other existing algorithms on the validation set.
In the above equations, TP (True Positives) denotes the number of positive samples correctly identified as positive, FP (False Positives) denotes the number of negative samples incorrectly identified as positive, and FN (False Negatives) denotes the number of positive samples incorrectly identified as negative. AP is the area enveloped by the P-R curve. Since there is only one category, Fall, in the dataset of this paper, N = 1, and AP is equal to mAP when N = 1. Therefore, the larger the mAP, the better the network performance.
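As a worked example of Eqs. (25)-(26), with illustrative counts that are not taken from our experiments:

```python
# Hypothetical counts for the single Fall class.
tp, fp, fn = 180, 12, 20
precision = tp / (tp + fp)  # Eq. (25): 180 / 192 = 0.9375
recall = tp / (tp + fn)     # Eq. (26): 180 / 200 = 0.9000
print(f"P = {precision:.4f}, R = {recall:.4f}")
```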

C. ABLATION EXPERIMENT
The ablation experiment is conducted on the backbone (DBBConv, DBBC3, E-DBBConv, and E-DBBC3), the neck (FEM), and the IoU loss, as shown in Table 3.
From Table 3, it is visible that the original YOLOv5s achieves 94.03% mAP on the validation set. Comparing experiment 1 with the baseline, the improvement of the backbone raises the mAP by 1.16%, but it also adds 3.51M parameters and 10 GFLOPs of computation at training time. However, at inference time, as the parallel branches are fused, the number of parameters and the computation decrease and inference accelerates. In addition, comparing experiment 3 with the baseline, SIoU brings no additional parameters or computation to the network and improves the model by 0.34%. The FEM contains the E-CA module, the RFB, and a 1*1 conv module, where E-CA is the modified version of CA. In order to investigate the mechanism of action and the influencing factors of the FEM module, an ablation analysis is done on the CA module, the E-CA module, and the RFB, as shown in Table 4.
Table 4 shows that CA does not improve the network, according to the comparison between the baseline and experiment 1. Comparing the baseline with experiment 2, E-CA boosts the network model, which confirms our view that CA reduces the number of parameters by compressing the middle-layer channels at the cost of losing some semantic information.
As shown in Figure 9, the network in this paper achieves better results than other lightweight networks. YOLOv5s achieves 94.03% mAP, while ED-YOLO (no pruning, with re-parameterization) achieves 96.53% mAP, 2.5% higher than the original YOLOv5s. In order to obtain a faster detection model, the trained ED-YOLO network is pruned. The metrics of the network after 50% pruning are compared with those before pruning, and the metrics with and without re-parameterization ("reparam") are compared, as shown in Table 5.
In Table 5, the pruned ED-YOLO has fewer parameters than the original network, with almost no loss in detection accuracy and a detection speed of 31 FPS on an NVIDIA RTX 2080 Ti. To compare the metrics of ED-YOLO against other lightweight models, separate tests are done on YOLOv3-tiny, YOLOv4-tiny, YOLOv5s, GhostNet-YOLOv5s, ShuffleNetv2-YOLOv5s, MobileNetv3stem-YOLOv5s, EfficientNetv2-YOLOv5s, YOLOv7-tiny, SAMNet, FastestDet, FasterNet-T2, and ED-YOLO (pruning and re-parameterization). The test results are shown in Table 6.
Firstly, Table 6 shows that the parameters of the proposed ED-YOLO network are reduced by 9.7% compared with YOLOv5s. On the validation set, the mAP0.5 of the proposed ED-YOLO network is improved by 2.58%; on the test set, it is improved by 2.42%, and the mAP0.6 is improved by 2.68%. Meanwhile, the detection speed of the proposed ED-YOLO network increases by 14.8%. Then, compared with YOLOv5s variants with other lightweight backbones, as well as YOLOv3-tiny, YOLOv4-tiny, and YOLOv7-tiny, ED-YOLO achieves higher accuracy. It is worth mentioning that the performance of YOLOv7-tiny on the fall dataset is weaker than that of YOLOv5s in both accuracy and speed, and ED-YOLO has a 3.58% mAP increase and a 29.2% FPS increase compared with the state-of-the-art YOLOv7-tiny. Finally, the number of parameters in this model is reduced compared to YOLOv5s, and the theoretical computation is reduced by 0.6 GFLOPs. With respect to GhostNet-YOLOv5s, ShuffleNetv2-YOLOv5s, MobileNetv3stem-YOLOv5s, EfficientNetv2-YOLOv5s, and other smaller models, their parameters and GFLOPs are small, but their detection accuracy is not high and their network structures are not conducive to GPU computation. For example, ShuffleNetv2-YOLOv5s has only 0.44M parameters and 1.3 GFLOPs, but its FPS is far lower than that of ED-YOLO, which has a much larger number of parameters.
In terms of the number of network layers, the network in this paper has more layers than YOLOv5s and YOLOv7-tiny. But after reasonable structural design and pruning, the number of parameters is reduced and inference is accelerated. We also run comparison experiments on non-YOLO networks such as SAMNet, FastestDet, and FasterNet-T2; through experimental comparison, ED-YOLO's mAP is higher than theirs. Overall, the network in this paper achieves a balance between detection speed and detection accuracy, most of its metrics are significantly improved, and it can achieve the purpose of real-time detection.
In order to demonstrate the generalization ability of the model in the wild scene, we conduct experiments on an additional 1500 wild samples and compare ED-YOLO with YOLOv5s (baseline), YOLOv3-tiny, YOLOv4-tiny, YOLOv7-tiny, SAMNet, FastestDet, and FasterNet-T2. The experimental environment is consistent with the above, and the results are shown in Table 7. Through these experiments, the mAP of the proposed model on the wild dataset is 95.17%; compared with the result on the test set, the mAP decreases by only 1.08%, so the proposed model has strong generalization ability in different scenes.

D. PERFORMANCE ANALYSIS
In order to verify the detection performance of the proposed ED-YOLO network, it is used to detect scenes with multiple occlusions. The detection results are compared with EfficientNetv2-YOLOv5s, GhostNet-YOLOv5s, MobileNetv3stem-YOLOv5s, ShuffleNetv2-YOLOv5s, YOLOv5s, YOLOv3-tiny, YOLOv4-tiny, YOLOv7-tiny, and some non-YOLO networks such as SAMNet, FastestDet, and FasterNet-T2, as shown in Figure 10.
From Figure 10, it is obvious that the results of EfficientNetv2-YOLOv5s, GhostNet-YOLOv5s, MobileNetv3stem-YOLOv5s, and ShuffleNetv2-YOLOv5s still contain many missed and false detections. It is worth noting that the detection results of EfficientNetv2-YOLOv5s, MobileNetv3stem-YOLOv5s, ShuffleNetv2-YOLOv5s, SAMNet, FastestDet, and FasterNet-T2 are poor, but they can still detect a few targets, as shown in Figures 10 (a), (c), (d), (i), (j), and (k). However, the detection effect of GhostNet-YOLOv5s is very poor, and no fallen person is detected, as shown in Figure 10 (b). This indicates that the GhostNet-YOLOv5s network performs poorly in heavily occluded scenes, even worse than the more lightweight EfficientNetv2-YOLOv5s, MobileNetv3stem-YOLOv5s, and ShuffleNetv2-YOLOv5s. The reason is that GhostNet loses local information when extracting features, so it detects poorly in scenes with severe occlusion. The detection accuracy of YOLOv5s is similar to that of YOLOv3-tiny, as shown in Figures 10 (e) and (f): both networks miss detections, and the target on the right side of the image is not detected. Although YOLOv4-tiny and YOLOv7-tiny can detect the falling person on the right side of the image, their detection accuracy is far lower than that of the proposed ED-YOLO network; besides, YOLOv4-tiny produces a false detection in the middle, while YOLOv7-tiny misses a detection, as shown in Figures 10 (g) and (h). From Figure 10 (l), it is visible that the detection performance of the proposed ED-YOLO network is superior to the other networks, indicating that ED-YOLO can cope with detection in heavily occluded and small-target scenes.
From Figures 11 (b) and (d), it is obvious that GhostNet-YOLOv5s and ShuffleNetv2-YOLOv5s cannot detect the fallen person; the feature extraction ability of these two methods is not strong, and feature information is easily lost. The detection accuracy of EfficientNetv2-YOLOv5s is almost the same as that of YOLOv4-tiny, as shown in Figures 11 (a) and (g). Although the detection accuracy of MobileNetv3stem-YOLOv5s equals that of EfficientNetv2-YOLOv5s, it falsely detects multiple small targets, as shown in Figure 11 (c). Even though FastestDet is able to detect the fall target, its confidence is low, as shown in Figure 11 (j). The detection accuracy of YOLOv3-tiny is slightly higher than that of YOLOv5s, but false detections occur, as shown in Figures 11 (f) and (i). In the small-target scene, the detection effects of YOLOv7-tiny and FasterNet-T2 are better than that of YOLOv5s, as shown in Figures 11 (c), (h), and (k). The detection accuracy of the proposed ED-YOLO for small targets exceeds 80%, and its detection performance is better than that of the other networks, as shown in Figure 11 (l). This illustrates that the model in this paper achieves the best detection results in the small-target scene.
Figures 12 (a), (b), (c), (d), (j), and (k) show that EfficientNetv2-YOLOv5s, GhostNet-YOLOv5s, MobileNetv3stem-YOLOv5s, ShuffleNetv2-YOLOv5s, FastestDet, and FasterNet-T2 have obvious detection errors, such as missed detections and repeated detections (severe overlapping). Although YOLOv5s, YOLOv3-tiny, YOLOv4-tiny, and SAMNet do not have obvious false or missed detections, they do not detect the little girl falling on the left side of the picture (severe occlusion and a small target), as shown in Figures 12 (e), (f), (g), and (i). The detection performance of YOLOv7-tiny in this scene is poor, with many missed detections, as shown in Figure 12 (h). Through a large number of experiments, it is concluded that YOLOv7-tiny performs well in the small-target scene but poorly in the multiple-occlusion scene. It can be seen from Figure 12 (l) that the proposed ED-YOLO network not only detects the occluded falling person but also detects the small target, indicating that the model proposed in this paper has a good detection effect in severely occluded and small-target scenes.
As shown in Figure 13, all the methods above can detect the fallen person in the confusing scene, but here the similarity between a fallen person and an electric car is very high. EfficientNetv2-YOLOv5s and YOLOv3-tiny misdetect two electric cars as fallen people, as shown in Figures 13 (a) and (f). Figures 13 (b), (c), (e), (g), (h), (i), and (j) show that GhostNet-YOLOv5s, MobileNetv3stem-YOLOv5s, YOLOv5s, YOLOv4-tiny, YOLOv7-tiny, SAMNet, and FastestDet misdetect an electric car on the left as a fallen person. ShuffleNetv2-YOLOv5s, FasterNet-T2, and ED-YOLO correctly detect the two fallen people without misdetecting a fallen electric car as a fallen person, as shown in Figures 13 (d), (k), and (l). It can be seen that the network in this paper maintains high-accuracy detection in scenes where detection errors are very likely to occur.
Comparing the detection results across all scenes (Figures 10-13), the proposed network achieves better detection results than other mainstream lightweight networks and better detection ability in complex scenarios, and its detection speed reaches 31 FPS, achieving the goal of real-time fall detection.
As shown in Figure 14, ED-YOLO achieves a high-precision detection effect in complex situations and accurately locates people who fall.
V. CONCLUSION
In this paper, a lightweight model, ED-YOLO, for human fall detection is proposed. First, the DBBC3 module is proposed: in the backbone C3 module, the re-parameterized DBBConv module replaces the 3*3 conv. Second, the pooling branch of DBBConv is removed and a new conv group is added to design the efficient DBBConv (E-DBBConv) and DBBC3 (E-DBBC3) modules. The shallow conv and C3 modules in the backbone are replaced by DBBConv and DBBC3 modules, and the deeper modules are replaced by E-DBBConv and E-DBBC3 modules, which improve the ability to extract detailed features. Third, a novel Feature Enhance Module (FEM) is proposed in the neck, consisting of an RFB, a 1*1 conv, and the E-CA module; FEM enhances the feature representation of the region of interest and the fusion of features, and it is added to the FPN to improve the detection accuracy of the network. Additionally, the E-CA module inside the FEM, an efficient attention mechanism, is proposed. Finally, we use the GS-SIoU loss instead of the CIoU loss, introducing gradient smoothing to improve the regression speed and accuracy of the prediction box. According to the ablation experiments, E-DBBConv and DBBConv are the main factors improving the detection performance of the network. Experimental results show that the proposed network achieves better results than other mainstream lightweight networks on the human fall detection dataset: its final mAP reaches 96.25%, and its detection speed reaches 31 FPS on an RTX 2080 Ti. Simultaneously, the proposed network is suitable not only for simple scenes but also for complex scenes in the field of human fall detection.
In the future, our research will proceed in two directions. First, expand the dataset by collecting more fall images in more scenes, so that the trained model has stronger generalization ability. Second, integrate a segmentation model with the detection model to improve the practicality of human fall detection.

FIGURE 6. Schematic comparison of the Coordinate Attention block (a) and the Efficient-Coordinate Attention (b). Avg Pool is global average pooling.

FIGURE 7. The scheme for the calculation of the angle cost contribution to the loss function.

FIGURE 9. The curves of mAP. ((a) precision; (b) recall; (c) mAP0.5.)

FIGURE 14. The detection result of ED-YOLO.

TABLE 2. The hyperparameters of the network.

TABLE 3. Indicators for different parts of the model.

TABLE 5. Comparison of model re-parameterization and pruning.

TABLE 6. Comparison of different lightweight models.

TABLE 7. Comparison on the wild dataset.