Automatic Abdominal Hernia Mesh Detection Based on YOLOM

As a new 3-D ultrasound imaging method, automated breast ultrasound (ABUS) has been widely used in breast abnormality examinations. Because of its excellent 3-D visualization, ABUS is also well suited to the detection of an abdominal wall hernia mesh. Due to the inherently low signal-to-noise ratio of ultrasound imaging and the large amount of data generated during ABUS scanning, mesh detection based on subjective observation is extremely time-consuming and prone to missed detections. Therefore, to speed up the ABUS reading process, we proposed a novel abdominal wall hernia mesh detection method based on the you only look once version 3 (YOLOv3) method, named the YOLOv3 for mesh (YOLOM) method. To give the YOLOM method good detection efficiency, we utilized a lightweight cross stage partial attention network (CSPA-Net) as the backbone and applied a feature enhancement network (FEP-Net) to boost the mesh detection accuracy. An improved loss function with complete intersection-over-union (CIoU) and the Swish activation function were also employed to optimize the proposed YOLOM method. We designed an ablation study to verify the validity of the proposed method. The average mesh detection precision reached 98.36%, which was 12.51 and 2.35 percentage points higher than that of the YOLOv3 and you only look once version 4 (YOLOv4) methods, respectively. The experimental results and comparisons demonstrated that the proposed YOLOM detector is efficient for abdominal wall hernia mesh detection.


I. INTRODUCTION
An abdominal wall hernia is one of the most common complications of abdominal surgery. According to clinical statistics, the probability of an abdominal wall hernia occurring is 2%-11%, and it increases to 23% if infection occurs [1]-[3]. Abdominal wall hernias do not heal on their own and gradually expand outward from the infected area; therefore, prompt surgery is the only treatment option [4], [5]. Mesh repair surgery has become the standard procedure for repairing abdominal wall hernias worldwide. However, a variety of mesh-related complications, such as mesh infection, migration, hematoma and intestinal adhesion, are possible [6], [7]. Therefore, accurate preoperative detection of an abdominal wall hernia mesh can help surgeons predict the difficulty of the surgery, adjust the surgical plan accordingly, and reduce the incidence of related complications or the removal of a previous mesh.
The associate editor coordinating the review of this manuscript and approving it for publication was Ravibabu Mulaveesala.
With the rapid development of mesh materials, meshes have become increasingly lightweight, and lightweight (LW) mesh is now the first choice for abdominal hernia surgery. However, because the detection range of a handheld ultrasound (HHUS) probe is relatively narrow, it cannot completely and reliably capture and identify a LW mesh in a single view. Because the LW mesh is light and thin, it is also difficult to detect in either the axial or sagittal plane using 2-D ultrasound. As a new 3-D ultrasound imaging method, ABUS provides more comprehensive diagnostic information through the coronal plane [8], [9] and has attracted the attention of sonographers and scholars [10]-[14]. Compared with the narrow detection range of HHUS, the ABUS scanning range is much larger, but in practical applications ABUS may need to scan repeatedly to image a large area, so the amount of data generated for each patient is tremendous. This causes the following two problems in the detection of a LW mesh: 1) It is time-consuming and labor-intensive to manually examine the ABUS images. 2) The detection accuracy is heavily dependent on the experience of sonographers, which easily leads to missed diagnoses and misdiagnoses. Therefore, imaging studies have become important for the detection and identification of abdominal wall hernia meshes, which can provide effective guidance for follow-up surgery and treatment.
Two-stage detectors, such as the region-based convolutional neural network (R-CNN) [15], Fast R-CNN [16] and Faster R-CNN [17], operate in two stages. In the first stage, a region proposal network processes the image and generates box proposals where objects may exist. In the second stage, these box proposals are used to extract features from the intermediate feature maps, and these features are fed to the final layers to localize and classify the object in each box proposal. However, two-stage detectors usually use more proposal regions, which help to obtain local optimal solutions and improve detection accuracy at the cost of longer computational time. In contrast, single-stage detectors, such as the You Only Look Once (YOLO) [18]-[21] and Single Shot Multibox Detector (SSD) [22] algorithms, are usually faster but yield less accurate results than two-stage detectors. The YOLO series is a typical single-stage detection method. Compared with the two-stage algorithms, the YOLO series does not require the region proposal stage and directly predicts the category probability and location information of the target. It transforms an object detection problem into a regression problem, which achieves fully end-to-end detection. However, the limitations of the YOLOv3 [20] algorithm for object detection on ultrasound images are as follows: (1) The performance in detecting small objects is often poor when the image noise is large, the resolution is low and the background is complex. (2) The YOLOv3 method fuses all low-level features directly with high-level features, although not all low-level detail features are beneficial to detection. (3) The aspect ratio of the predicted bounding box differs from that of the ground truth, which is not conducive to postoperative evaluation of the implanted mesh.
Recently, the YOLOv4 [21] algorithm, which uses the cross-stage partial network as the backbone network to extract features and applies a large number of data augmentation techniques, has received widespread attention. However, the number of parameters and the model storage size of the YOLOv4 algorithm are much larger than those of the YOLOv3 method, which increases deployment costs and reduces training and inference speed.
To solve these problems, we proposed a novel real-time detector for abdominal wall hernia mesh based on the YOLOv3 algorithm, named the YOLOM method, to improve the detection efficiency. Our main contributions are as follows:
1) In the feature extraction process, we introduce a cross-stage partial network to design a more lightweight feature extraction network, which enhances the feature extraction capability of the backbone, improves the detection of small targets, and reduces the amount of calculation and the model storage size. At the same time, we introduce the channel attention mechanism SE-Net to make the network pay more attention to the features of the mesh target during the feature extraction process.
2) Multiscale spatial pyramid pooling (SPP) is added to the convolutional layer at the end of the proposed backbone. The SPP block performs pooling operations on the input feature map at different scales and connects the three pooled feature maps and the input feature map to increase the receptive field of the feature map in such a way that the YOLOM method can detect the object more comprehensively.
3) To ensure the consistency of the bounding box aspect ratio and accelerate the network convergence, we introduce Complete-IoU (CIoU) to optimize the YOLOv3 loss function. The shape of the postoperative mesh is an important guide for doctors to evaluate the condition of the implanted mesh and the recovery of the hernia area.
The rest of this paper is organized as follows. Section II introduces the related work. Section III presents an overview of the proposed YOLOM detection method, including its motivation and structure. Section IV gives the experimental results and discussion, where the billion floating point operations per second (BFLOP/S), the model size, the mean average precision (mAP) and the real-time performance in frames per second (FPS) are compared; the different subscripts of mAP denote the IoU conditions under which mAP is calculated. Section IV also compares the YOLOM method with other state-of-the-art detection algorithms. Finally, conclusions are drawn in Section V.

II. RELATED WORK
A. MESH DETECTION DATASET
An ABUS database was collected from three types of experiments: gelatin phantom experiments, animal (porcine peritoneal) ex vivo experiments and patient experiments. There were three sets of gelatin phantom experiments, five sets of animal ex vivo experiments, and data from 97 patients. Signed informed consent from patients was waived because it was a retrospective study. The study was approved by the Ethics Research Committee of our institute. All the images were taken from the ABUS scanner. The built-in linear array probe model is 14L5BV, the frequency range is 5 MHz to 15 MHz, and the maximum scanning depth is 6 cm. Each scan produces 318 frames of axial images, 730 frames of sagittal images and 573 frames of coronal images. As shown in Fig. 1, 3-D ultrasound shows the side of the LW mesh in the axial and sagittal planes. Because the LW mesh is very thin (the thickness of the LW mesh is only 0.5 mm), doctors usually choose the coronal plane to detect and evaluate the mesh in clinical diagnosis. In this paper, we also used multilayer coronal images for mesh recognition and detection.

B. THE INTRODUCTION OF YOLOV3
The YOLOv3 method is composed of the newly designed backbone network Darknet-53 and a multiscale detection network, as shown in the Appendix part A.
The main idea of Darknet-53 is to use five consecutive downsampling modules to reduce the resolution of the input image from 416 × 416 to 13 × 13. To solve the gradient problem caused by network deepening, the residual module in ResNet is introduced in each downsampling module. To preserve more image information, the YOLOv3 method uses a convolution with a stride of 2 instead of the traditional max-pooling layer. Each downsampling module contains a different number of stacked residual units. Multiscale prediction based on feature pyramid networks (FPNs) is used in the YOLOv3 algorithm to predict targets of multiple sizes, especially small objects. Darknet-53 generates three feature maps of different sizes. The prediction results of YOLOv3 are the relative positional relationships between the anchors and the bounding boxes. The anchors are a set of boxes with only width and height parameters, obtained by k-means clustering on the ground truth (GT) of the dataset. We can convert the prediction results of YOLO into bounding boxes through (1) to (4):
$$b_x = \sigma(t_x) + c_x \tag{1}$$
$$b_y = \sigma(t_y) + c_y \tag{2}$$
$$b_w = p_w e^{t_w} \tag{3}$$
$$b_h = p_h e^{t_h} \tag{4}$$
where $b_x$, $b_y$ are the center coordinates and $b_w$, $b_h$ are the width and height of the bounding box, $t_x$, $t_y$ are the offsets of the coordinates, and $t_w$, $t_h$ are the ratio coefficients of the width and height of the bounding box relative to the anchor, respectively. $p_w$, $p_h$ are the width and height of the anchor. The image is divided into N × N grid cells according to the size of the feature map, and $c_x$, $c_y$ are the coordinates of the upper left corner of the grid cell in which the prediction is made. $\sigma$ is the sigmoid activation function, which normalizes the coordinate offset to between 0 and 1.
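The decoding step described by (1) to (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `decode_box`, the pixel-unit anchors and the conversion from grid units to pixels through a stride of `image_size / grid_size` are all assumptions made for the example.

```python
import math


def sigmoid(x):
    """Logistic function used to squash the center offsets into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor, grid_size, image_size=416):
    """Decode one raw prediction (tx, ty, tw, th) into a box in pixels.

    cell   -- (cx, cy), upper-left corner of the grid cell, in grid units
    anchor -- (pw, ph), anchor width and height, assumed to be in pixels
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    stride = image_size / grid_size   # pixels per grid cell (assumption)
    bx = (sigmoid(tx) + cx) * stride  # box center x
    by = (sigmoid(ty) + cy) * stride  # box center y
    bw = pw * math.exp(tw)            # box width scales the anchor width
    bh = ph * math.exp(th)            # box height scales the anchor height
    return bx, by, bw, bh


# A zero prediction lands in the middle of its cell and keeps the anchor shape.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor=(100.0, 50.0), grid_size=13))
```

Because the sigmoid keeps each center offset inside its own grid cell, a prediction can never drift into a neighboring cell, which is the reason the offsets are normalized to between 0 and 1.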

C. SQUEEZE AND EXCITATION NETWORK
For ultrasound images, the selection of a suitable attention mechanism in the CNN is important for disregarding invalid image information efficiently and guiding the network toward the regions it should attend to. Hu et al. designed a lightweight gating mechanism named the Squeeze-and-Excitation (SE) network to improve the expression ability of the whole network [23]; it establishes the relationship between channels using efficient fully connected layers. The structure of the SE module is shown in Fig. 2 and can be roughly divided into three stages: squeeze, excitation and combination. The details are given in Appendix part B.

D. SWISH ACTIVATION FUNCTION
The choice of activation functions in deep networks has a significant effect on the training dynamics and detection performance [24]. In 2017, Ramachandran et al. proposed the Swish activation function to speed up network convergence and improve classification accuracy. In this study, we used Swish in the residual module. Swish is defined as
$$f(x) = x \cdot \sigma(\beta x) \tag{5}$$
which means that the result is obtained by multiplying the input value by the sigmoid activation function. $\beta$ is either a trainable parameter or a constant. A good activation function should generally be smooth and robust to negative values. The ReLU function avoids the vanishing gradient problem when the input x > 0, but it sets all negative inputs to 0. If an outlier appears, the biases of the neural network are likely to become very large, making all subsequent normal inputs negative, so that the parameters are no longer updated. However, as an activation function between the linear function and the ReLU function, Swish can effectively alleviate this problem. If $\beta = 0$, Swish becomes the linear function f(x) = x/2, and as $\beta \to \infty$, the sigmoid component approaches a 0-1 step function, so Swish approaches the ReLU function.
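The two limiting cases of Swish can be checked numerically. The short sketch below is illustrative only; the function name `swish` and the chosen values of β are assumptions for the example.

```python
import math


def swish(x, beta=1.0):
    """Swish activation: x multiplied by sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))


def relu(x):
    return max(0.0, x)


for x in (-2.0, -0.5, 0.5, 2.0):
    # beta = 0 gives the linear function x / 2; a large beta approaches ReLU.
    print(x, swish(x, beta=0.0), swish(x, beta=50.0), relu(x))
```

Unlike ReLU, Swish with a moderate β still passes a small, non-zero signal for negative inputs, which is the property the paragraph above relies on.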

III. PROPOSED METHOD
A. YOLOM NETWORK
The entire YOLOM detector architecture is shown in Fig. 3.
The detector mainly consists of three modules: feature extraction, feature enhancement and multiscale detection. First, we replaced the traditional Darknet-53 network with a cross-stage partial attention network (CSPA-Net). To extract deeper semantic information, we designed a feature-enhanced pyramid network (FEP-Net). The images are input to the CSPA-Net to extract two feature maps with different sizes. Then, in the feature enhancement stage, the feature map obtained from the last residual block in the feature extraction module is input to the FEP-Net module to obtain a more effective feature. In the multiscale detection stage, the two feature maps with different sizes obtained from the residual blocks in the feature extraction module are up-sampled and concatenated to obtain feature maps with different receptive field sizes. The sizes of the two feature maps are 13 × 13 and 26 × 26, respectively. Finally, we used the YOLO head of the YOLOv3 method to generate the bounding boxes. In this paper, the number of categories is 1, and the channel number of each feature map is 18. As a result, the YOLOM method generated 2,382 proposals on each abdominal wall hernia mesh image, which is 8,265 fewer than the YOLOv3 method.
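The channel number of 18 and the YOLOv3 baseline proposal count can be reproduced with simple arithmetic. The sketch below assumes the usual YOLOv3 conventions, which this section does not state explicitly: three anchors per scale, each predicting four box coordinates, one objectness score and one score per class, and YOLOv3 grids of 13, 26 and 52 for a 416 × 416 input. The function names are hypothetical.

```python
def head_channels(num_classes, anchors_per_scale=3):
    """Output channels per scale: anchors * (4 box terms + 1 objectness + classes)."""
    return anchors_per_scale * (4 + 1 + num_classes)


def num_predictions(grid_sizes, anchors_per_scale=3):
    """Total number of box predictions over all detection scales."""
    return sum(g * g * anchors_per_scale for g in grid_sizes)


print(head_channels(num_classes=1))   # 18
print(num_predictions([13, 26, 52]))  # 10647
```

Under these assumptions the YOLOv3 total of 10,647 equals the sum of the two counts quoted above (2,382 + 8,265).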

B. CSPA-NET: FEATURE EXTRACTION NETWORK
CNNs often require too much computation in the training and detection processes, which directly slows down model training and detection. The cross-stage partial network (CSP-Net) [25] restricts the variability of the gradients by integrating feature maps from the beginning and the end of a network stage, which reduces computation by 20% with equivalent or even superior accuracy. It is also easy to implement and general enough to cope with architectures based on ResNet. To reduce the amount of calculation while keeping or even improving accuracy, and because Darknet-53 is computationally heavy and complex, we proposed a novel lightweight backbone network based on CSP-Net, named CSPA-Net. We divide the input into two parts along the channel dimension: one part is processed by a residual network (ResNet), and then the two parts of the feature map are fused to achieve a lightweight backbone. CSPA-Net is composed of CBS modules and SE_Res modules. A CBS module consists of a 3 × 3 convolution, batch normalization and a Swish activation function. A SE_Res module is shown in Fig. 4. At the shortcut connection, we divide the input into two parts along the channel dimension, and one half goes through the downsampling operation of the CBS modules, followed by feature fusion through concatenation. The shortcut operation does not add any parameters or computational complexity. Before the feature fusion of the shortcut, the SE module models the interdependence between the channels and adaptively recalibrates the response strength of the features in each channel through the global loss function of the network. Here, the max-pooling window size is 2.

C. FEP-NET: FEATURE ENHANCED PYRAMID NETWORK
The YOLOv3 algorithm utilizes the global features of different convolutional layers of the network but does not make full use of the multiscale local region features of the convolutional layer. SPPNet [26] is able to fuse the receptive fields of different sizes and improve the scale invariance of the network, so that the detector has better robustness to mesh targets of different sizes. To effectively make use of the local region features of the backbone, we proposed a feature-enhanced pyramid network (FEP-Net) based on a SPPNet to fuse the multiscale local and global features, as shown in Fig. 5. Here, the multiscale SPP block is composed of three max-pooling layers, and the size of the pooling window can be computed from (6).
where $Size_p$ represents the size of the pooling window, $Size_f$ represents the size of the feature map, and $n_i = 1, 2, 3$. The pooling window sizes obtained from (6) are 1, 5, 9 and 13, respectively (a window of size 1 leaves the input feature map unchanged). The strides of the pooling windows are all 1, and the input feature maps are padded with 0 to ensure that the output feature maps after pooling are the same size as the input. Because the large number of convolution parameters in a deep neural network reduces its inference speed, we also introduced the depthwise separable convolution (DSconv) [27] into this module.
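The size-preserving pooling described above can be sketched in pure Python on a single-channel map. This is only an illustration: the function names are hypothetical, the map is a nested list rather than a tensor, and the zero padding follows the text (deep learning frameworks often pad max pooling with negative infinity instead, which differs only for negative activations).

```python
def max_pool_same(fmap, k):
    """Max-pool a 2-D map with an odd k x k window, stride 1 and zero padding."""
    h, w = len(fmap), len(fmap[0])
    pad = k // 2
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for i in range(h):
        for j in range(w):
            padded[i + pad][j + pad] = fmap[i][j]
    return [
        [max(padded[i + di][j + dj] for di in range(k) for dj in range(k))
         for j in range(w)]
        for i in range(h)
    ]


def spp_block(fmap, windows=(1, 5, 9, 13)):
    """Return one pooled map per window; stacking them mimics channel concatenation."""
    return [max_pool_same(fmap, k) for k in windows]


# A 13 x 13 map with a single activation at the center.
fmap = [[0.0] * 13 for _ in range(13)]
fmap[6][6] = 1.0
pooled = spp_block(fmap)
print([sum(map(sum, p)) for p in pooled])  # area reached by each window
```

Every pooled map keeps the 13 × 13 size, so the maps can be concatenated channel-wise, and larger windows spread the same activation over a larger neighborhood, which is how the block enlarges the receptive field.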
A DSconv can greatly decrease the number of parameters, so the computation time and model size are reduced. A normal convolution can be factorized into a DSconv, which consists of a depthwise layer and a pointwise layer, as shown in the dotted box in Fig. 5. The depthwise layer carries out a single-channel convolution on each channel of the input, but such an operation does not make use of the information shared between the channels of the input feature map. The pointwise layer therefore carries out a convolution on the depthwise results along the depth direction. In this way, the network can be deepened and its performance improved while the amount of convolution computation is reduced.
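The parameter saving of a DSconv can be made concrete with a small count. The sketch below ignores biases, and the layer sizes (3 × 3 kernel, 256 input channels, 512 output channels) are arbitrary example values, not taken from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a normal k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def dsconv_params(k, c_in, c_out):
    """Weights of a depthwise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


normal = standard_conv_params(3, 256, 512)
separable = dsconv_params(3, 256, 512)
print(normal, separable, separable / normal)
```

The ratio of the two counts is 1/c_out + 1/k², so for a 3 × 3 kernel the separable form needs roughly one-ninth of the weights when the number of output channels is large.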
To make the network have a better detection effect on mesh targets of different sizes, we fused feature maps of different scales. We used five DBL modules to enhance the feature map output by FEP-Net, and used up-sampling to change its size from 13 × 13 to 26 × 26, and concatenated it to the output of the last SE_Res block. The result also went through five DBL modules that were used for feature fusion, and finally, we obtained two feature maps with sizes of 13 × 13 × 18 and 26 × 26 × 18.

D. IMPROVING YOLOV3 LOSS FUNCTION WITH CIoU
The YOLOv3 loss function is a linear sum of three parts: the coordinate loss, the classification loss and the confidence loss. The loss function can be denoted by (7):
$$Loss = Loss_{coord} + Loss_{conf} + Loss_{class} \tag{7}$$
where $Loss_{coord}$ denotes the coordinate loss, $Loss_{conf}$ denotes the confidence loss, and $Loss_{class}$ denotes the classification loss. The $Loss_{coord}$ of the YOLOv3 method regards (w, h) and (x, y) in (1) to (4) as independent variables for loss calculation. In fact, there is a spatial constraint between the center point coordinates and the width and height of the bounding box and the GT. If a traditional IoU is used to improve the loss function, the loss cannot provide a usable gradient when the bounding box and the GT do not overlap or when the bounding box includes the GT, so network training may fail to converge.
To overcome these disadvantages, we introduced complete intersection over union (CIoU) [28], which considers three geometric measures, namely the overlap area, the central point distance and the aspect ratio, and therefore better describes the regression of the bounding box. In our research, CIoU is used to improve and modify $Loss_{coord}$. As shown in Appendix part C, the purple, gray and yellow rectangles represent the bounding box, the GT and the smallest enclosing box covering the two boxes, respectively. The improved $Loss_{coord}$ using CIoU is as in (8), where C is the diagonal length of the smallest enclosing box covering the two boxes, and $d = \rho(b, b^{gt})$ is the Euclidean distance between the central points of the two boxes. IoU is the intersection-over-union between the bounding box and the GT, which constrains the overlapping area between them. $R_{CIoU}$ uses d and C to address the problem that the loss may not be able to update the gradient when the bounding box and the GT do not overlap. $\alpha$ is the scale factor, and v ensures the consistency of the aspect ratio by calculating the diagonal slopes of the bounding box and the GT.
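As an illustration of how the three geometric terms interact, the following sketch implements the standard CIoU loss of [28], 1 − IoU + d²/C² + αv with v = (4/π²)(arctan(w_gt/h_gt) − arctan(w/h))² and α = v/((1 − IoU) + v). That this standard form is exactly what (8) denotes is an assumption, as are the function names and the (cx, cy, w, h) box format.

```python
import math


def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def ciou_loss(box, gt):
    """Standard CIoU loss for two boxes given as (cx, cy, w, h), w and h > 0."""
    bx1, by1, bx2, by2 = to_corners(box)
    gx1, gy1, gx2, gy2 = to_corners(gt)
    inter_w = max(0.0, min(bx2, gx2) - max(bx1, gx1))
    inter_h = max(0.0, min(by2, gy2) - max(by1, gy1))
    inter = inter_w * inter_h
    union = box[2] * box[3] + gt[2] * gt[3] - inter
    iou = inter / union
    # Squared center distance d^2 and squared enclosing-box diagonal C^2.
    d2 = (box[0] - gt[0]) ** 2 + (box[1] - gt[1]) ** 2
    c2 = (max(bx2, gx2) - min(bx1, gx1)) ** 2 + (max(by2, gy2) - min(by1, gy1)) ** 2
    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(gt[2] / gt[3]) - math.atan(box[2] / box[3])) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    return 1 - iou + d2 / c2 + alpha * v


gt = (0.0, 0.0, 2.0, 2.0)
print(ciou_loss(gt, gt))                     # identical boxes
print(ciou_loss((10.0, 0.0, 2.0, 2.0), gt))  # disjoint, near
print(ciou_loss((20.0, 0.0, 2.0, 2.0), gt))  # disjoint, far
```

The last two calls have the same IoU of zero, yet the loss still grows with the center distance, which is the property that lets training proceed when the boxes do not overlap.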

IV. EXPERIMENTAL ANALYSIS
In this paper, all experiments are conducted on a Windows 10 (64-bit) Dell workstation with 64 GB of memory, an Intel(R) Xeon(R) E5-2650 V3 2.30 GHz CPU and an NVIDIA Titan XP GPU with 12.0 GB of video memory. The deep learning framework was PyTorch, and the mAP and the precision-recall curve were used to evaluate the proposed method. Based on a priori knowledge of the abdominal thickness range, we removed a large number of frames unlikely to contain mesh and collected 2100 original coronal images. We divided the images into a model building set and an independent testing set using a ratio of 9:1, and further divided the model building set into a training set and a validation set using a ratio of 9:1.
To ensure that the trained model generalizes well, we adopted data augmentation techniques, including rotation, flipping, scaling and random cropping, on the training set to obtain 13,608 images for model training. In addition, Adam was used for gradient optimization, and the initial learning rate was set to 0.0001.
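The split sizes implied by the two 9:1 ratios can be worked out directly. The sketch below assumes integer splits with no remainder handling beyond floor division; the resulting eightfold augmentation factor is derived here from the quoted totals and is not stated in the text.

```python
def split_9_to_1(total):
    """Split a count 9:1 using integer arithmetic."""
    larger = total * 9 // 10
    return larger, total - larger


build, test = split_9_to_1(2100)   # model building set and independent test set
train, val = split_9_to_1(build)   # training set and validation set
print(build, test, train, val)
print(13608 / train)               # augmented images per original training image
```

Under these assumptions the 13,608 augmented training images correspond to exactly eight images per original training image.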

A. ALGORITHM COMPLEXITY
Operational efficiency is critical to the implementation of an algorithm, so we compare the algorithms in terms of time complexity and space complexity. The formulas are shown in (9), where M is the width or height of the output feature map of each kernel, K is the width or height of each kernel, D is the depth of the network, and $C_l$ is the number of channels of the kernels in the l-th layer. Time complexity is evaluated by BFLOP/S, and space complexity is evaluated by the model size and memory read and write (MemR+W). Networks with higher space complexity have a large number of parameters, and a large amount of data is required to train them. However, a real dataset is usually not very large, which makes such models prone to overfitting. Table 1 shows the algorithmic complexity comparison between the YOLOM and other state-of-the-art (SOTA) methods. The algorithm complexity of the two-stage detector Faster R-CNN far exceeds that of the one-stage detectors in both time and space. Among the one-stage detectors, our proposed method reduces the time complexity by 80% and greatly reduces the space complexity.
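A commonly used estimate of convolutional complexity sums M² · K² · C_{l−1} · C_l over the D layers for time, and K² · C_{l−1} · C_l plus M² · C_l for space (weights plus output feature maps). The sketch below implements that estimate; whether it matches (9) term by term is an assumption, and the two-layer network in the example is hypothetical.

```python
def conv_time_complexity(layers):
    """Multiply-accumulate count; layers is a list of (M, K, C_in, C_out)."""
    return sum(m * m * k * k * c_in * c_out for m, k, c_in, c_out in layers)


def conv_space_complexity(layers):
    """Weights plus output feature-map elements for the same layer list."""
    weights = sum(k * k * c_in * c_out for _, k, c_in, c_out in layers)
    feature_maps = sum(m * m * c_out for m, _, _, c_out in layers)
    return weights + feature_maps


# Hypothetical two-layer example: output size M, kernel size K, channels in/out.
layers = [(26, 3, 128, 256), (13, 3, 256, 512)]
print(conv_time_complexity(layers))
print(conv_space_complexity(layers))
```

Both sums grow with the product of input and output channels, which is why channel splitting and depthwise separable convolutions reduce time and space complexity at the same time.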

B. VISUALIZATION OF THE MULTISCALE FEATURE MAPS
To illustrate the effectiveness of the multiscale feature maps, we visualize the classification information of the feature maps in two layers using Gradient-weighted Class Activation Mapping (Grad-CAM) [29]. As shown in Fig. 6, the first column shows the detection results and the ground truth of the meshes (the red boxes denote the detection results, and the green boxes denote the ground truth), and the last two columns show the activated regions that our method finds in the feature maps of sizes 13 and 26. The multiscale features clearly improve the detection of meshes of different sizes. The 13 × 13 feature map has a large receptive field, so it detects large meshes more efficiently, whereas the 26 × 26 feature map detects tiny meshes well. Finally, we use a non-maximum suppression (NMS) algorithm to filter the detection results at different scales to obtain the final mesh detection results. NMS selects the bounding box with the highest confidence, calculates its IoU with each of the other bounding boxes, filters out the bounding boxes whose IoU is greater than the threshold, and repeats this procedure on the remaining boxes until none are left to process.
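The NMS procedure just described can be written as a short greedy loop. This is a minimal sketch rather than the authors' code: the corner box format (x1, y1, x2, y2), the default threshold of 0.5 and the function names are assumptions for the example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-confidence box among those remaining
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box is suppressed by the first
```

In the example, the first two boxes overlap with an IoU of about 0.68, so only the higher-scoring one survives, while the distant third box is kept.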

C. ABLATION STUDY
To verify the effectiveness of each module, we conducted an ablation study. Table 2 reports the effects of CSPA-Net, FEP-Net, CIoU and the Swish activation function on the YOLOM method. According to this table, when the YOLOv3 backbone is replaced with CSPA-Net, the mean average precision (mAP) increases from 85.85% to 89.59%, which shows that introducing the cross-stage partial network and the SE module into the backbone helps the network learn the characteristics of a mesh. The mAP increased from 89.59% to 95.47% when we used FEP-Net in our method, which indicates that FEP-Net can expand the receptive field of the feature map and is effective for multiscale detection. Moreover, the mAP increased by 1.87 percentage points when the YOLOM method used CIoU to improve the loss function. CIoU loss takes three geometric properties into account, namely the overlap area, the central point distance and the aspect ratio, and leads to faster convergence and better performance. The Swish activation function improved the mAP by a further 1.02 percentage points, so using Swish instead of ReLU improves the detection accuracy without changing the network structure. The final mAP obtained on the mesh dataset is 98.36%, which is 12.51 percentage points higher than that of the original YOLOv3 method. Each of the components we used therefore contributes to the increase in mAP.

D. DETECTION SPEED
Table 3 shows the detection time of the proposed YOLOM detector and the other SOTA detectors. According to this table, the average detection time of the YOLOM detector on an NVIDIA Titan XP is 21.4 ms for each test image, whereas the detection time of the original YOLOv3 method on the same GPU is 50.5 ms. The two-stage detector Faster R-CNN has the longest detection time. In addition, because of the small number of parameters of the proposed detector, it is capable of running on a CPU, where its detection time is also very short. Therefore, when a GPU is used for detection, the detection speed of YOLOM is approximately 2.4 times that of YOLOv3, which meets the requirements of real-time ultrasound image detection.

E. VISUAL RESULTS
The detection results of the YOLOM and other SOTA detectors for three test images are visualized in Fig. 7. This figure shows that the proposed YOLOM detector has a strong ability to detect mesh objects in ABUS images. As seen in Fig. 7(a), the SSD recognizes the more curved mesh target as two targets. As shown in Fig. 7(c), the YOLOv3 method did not completely detect the mesh target, and the SSD detected the mesh target with a clear mesh structure as a single mesh target, while the YOLOv4 method only detected the mesh with a clear mesh structure. At the same time, the detections of our method are closer to the ground truth in shape and position, and our method performs better on small target objects with subtle differences between the foreground and background.
VOLUME 10, 2022

F. COMPARISON WITH OTHER SOTA METHODS
The performance comparison between the proposed YOLOM detector and the SOTA detectors is shown in Table 4. All experiments in this paper are conducted on the same test dataset. The table shows that the proposed method achieves strong detection performance on the mesh dataset, which supports its efficiency and practicability for clinical auxiliary diagnosis. Compared with the original YOLOv3 algorithm and the latest YOLOv4 algorithm, the $mAP_{50}$ of our proposed method is 12.51 and 2.35 percentage points higher, respectively, and the detection speed is also greatly improved. The Precision-Recall curve (PRC) is an important indicator for evaluating an object detection model. The PRC represents the precision and recall under different confidence thresholds. In the dataset, we define samples labeled as mesh as True (T) and samples labeled as non-mesh as False (F); in the prediction result, a sample whose confidence is higher than the threshold is classified as positive (P), and otherwise it is negative (N). As shown in (10),
$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN} \tag{10}$$
precision is the percentage of samples with a label of true among the samples predicted to be positive, and recall is the percentage of samples predicted to be positive among the samples with a label of true. The mesh PRCs of the compared methods are illustrated in Fig. 8. According to this figure, the detection precision is high for most test images. From the PRCs, we can see that the YOLOM method detects mesh objects best because its curve encloses the PRCs of all the other SOTA algorithms. The performance of the Faster R-CNN, SSD and YOLOv4 methods is almost the same, but their intersections with the black dotted line show that the performance of the YOLOv4 method is slightly higher than that of the other two. The detection performance of the YOLOv3 method for mesh targets is not as good as that of the other SOTA algorithms.
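One point of a PRC can be computed from labels and confidences as follows. The sketch is illustrative: the function name, the toy labels and confidences, and the convention of returning a precision of 1.0 when nothing is predicted positive are assumptions for the example.

```python
def precision_recall(labels, confidences, threshold):
    """Precision and recall at one confidence threshold.

    labels      -- True for mesh (T), False for non-mesh (F)
    confidences -- predicted confidence for each sample
    """
    pairs = list(zip(labels, confidences))
    tp = sum(1 for y, c in pairs if y and c > threshold)       # true positives
    fp = sum(1 for y, c in pairs if not y and c > threshold)   # false positives
    fn = sum(1 for y, c in pairs if y and c <= threshold)      # false negatives
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = [True, True, False, True]
confidences = [0.9, 0.6, 0.7, 0.2]
for threshold in (0.5, 0.8):
    print(threshold, precision_recall(labels, confidences, threshold))
```

Sweeping the threshold from high to low traces the curve: raising it trades recall for precision, which is why a curve that encloses another indicates a better detector at every operating point.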

G. LIMITATIONS OF THE PROPOSED METHOD
After testing, we found that the detection results for a small number of test images are not ideal. We list them in Fig. 9 to analyze the limitations of the YOLOM method. In the first column, due to the sharp deformation of the mesh, YOLOM detects only part of the mesh target. In the second column, the mesh object has a large degree of curvature and was not detected. From these two failure cases, we infer the following reason for the detection failures: when a mesh target is completely perpendicular to the coronal plane, the target may not be detected accurately. Therefore, in practical applications, doctors should pay attention to the angle of the mesh scan and perform multi-angle scans if necessary.

V. CONCLUSION
In this research, a mesh detection method based on the YOLOv3 method is proposed that utilizes CSPA-Net, FEP-Net, the Swish activation function and CIoU. The results of the experiments and comparisons demonstrated that the proposed YOLOM detector was more efficient than other existing methods for abdominal wall hernia mesh detection in ABUS images. In this study, the backbone we used efficiently reduced the number of parameters of the YOLOM detector.
Since the calculation amount is only one-eleventh of that of the original method, a modest GPU is sufficient for training. In addition, the proposed YOLOM method is a flexible detector because its backbone can be changed from CSPA-Net to other backbones, such as MobileNet or EfficientNet, for different datasets without programming difficulties. Because activation functions are highly important and directly affect model behavior, the proposed method employs the Swish activation function. The results of the experiments show that Swish improves the efficiency compared with other functions such as LeakyReLU. In addition, the CIoU method was applied for bounding box regression and for improving the loss function. This method directly minimizes the normalized distance between the central points of two bounding boxes, which leads to much faster convergence than other methods such as IoU. Moreover, we found that the coronal mesh texture of an abdominal wall hernia mesh was particularly effective for detection. Automated 3-D ultrasound can offer significant evidence for clinical diagnosis and surgical repair procedures and is a promising detection method for abdominal wall hernia mesh imaging.

APPENDIX
A. THE STRUCTURE OF THE YOLOV3 METHOD
The whole structure of the YOLOv3 method is shown in Fig. 10.

B. SQUEEZE AND EXCITATION NETWORK
In the squeeze stage, the feature map is compressed into a (1 × 1 × N) tensor by a global average pooling layer; its N entries represent the global information of each channel. The feature extraction for each channel is given in (11):
$$z_N = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mu_N(i, j) \tag{11}$$
where $\mu_N$ is the feature map of the N-th channel, and H and W are the height and width of the feature map, respectively. In the excitation stage, to obtain the weight of each channel, the correlation of the channels is established through two fully connected layers, as shown in (12):
$$s = \sigma(W_2 \, \delta(W_1 z)) \tag{12}$$
where z is the result of the squeeze stage, and $W_1$ and $W_2$ are the two fully connected layers. The number of channels for $W_1$ is N/r and the number of channels for $W_2$ is N, where r is a scaling factor that reduces the number of parameters. $\sigma$ and $\delta$ are the sigmoid and ReLU activation functions, respectively.
In the combination stage, the channel feature is merged with the original feature map, as shown in (13):
$$\tilde{\mu}_N = s_N \cdot \mu_N \tag{13}$$
where $s_N$ is the weight for each channel.
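The three SE stages can be traced on a tiny example in pure Python. This is a sketch under stated assumptions: the feature map is a nested list of N channels, biases are omitted, and the weight values in the example are arbitrary rather than learned.

```python
import math


def se_block(fmap, w1, w2):
    """Squeeze, excitation and combination on a list of N channels.

    fmap -- N channels, each an H x W nested list
    w1   -- (N / r) x N weights of the first fully connected layer
    w2   -- N x (N / r) weights of the second fully connected layer
    """
    # Squeeze: global average pooling gives one value per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Combination: rescale every channel by its weight.
    out = [[[s_n * v for v in row] for row in ch] for ch, s_n in zip(fmap, s)]
    return out, s


# Two 2 x 2 channels, reduction factor r = 2, so the hidden layer has one unit.
fmap = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
out, s = se_block(fmap, w1=[[1.0, 1.0]], w2=[[0.0], [0.0]])
print(s)    # zero output weights give sigmoid(0) = 0.5 for both channels
print(out)
```

Because the excitation output is a single scalar per channel, the block adds only the two small fully connected layers to the parameter count, which is why it suits a lightweight backbone.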

C. CIOU
The detailed description of CIoU is shown in Fig. 11.