Lung Nodule Detection in Medical Images Based on Improved YOLOv5s

Lung cancer has the highest morbidity and mortality rate worldwide. The early detection of pulmonary nodules can help reduce the incidence of lung cancer. However, due to the great variance in shape, size, and location of pulmonary nodules, the detection of small nodules in medical images is very challenging. This paper proposes a novel YOLOv5-CASP model, based on YOLOv5s with the following proposed improvements: 1) incorporating improved Convolutional Block Attention Modules (CBAM) to suppress the interference features of the medical images through a channel dimension and a spatial dimension, and to improve the detection performance of the model; 2) substituting the Spatial Pyramid Pooling – Fast (SPPF) module of YOLOv5s with an improved Atrous Spatial Pyramid Pooling (ASPP) module so as to increase the model's receptive field for images of different sizes and extract multi-scale contextual information for improving its performance on detecting small lung nodules; and 3) introducing a Contextual Transformer (CoT) module to optimize part of the CSPDarknet53 module of YOLOv5s in order to enhance the feature extraction capacity of the model while removing redundant operations. Experimental results conducted on two public datasets confirm that the proposed YOLOv5-CASP model outperforms the original YOLOv5s model and five other state-of-the-art models (Faster R-CNN, SSD, YOLOv4-Tiny, DETR-R50, Deformable DETR-R50), in terms of the mean average precision (mAP) and F1 score, by achieving corresponding values of 0.720 and 0.740 on the LUNA16 dataset, and 0.794 and 0.766 on the X-Nodule dataset.


I. INTRODUCTION
Lung cancer has become one of the most common cancers worldwide, with a high mortality rate, [1]. The first symptom of lung cancer is the abnormal growth of cells in the lungs, which form small round or oval lung nodules. In clinical medicine, the detection of lung nodules is the first step in lung cancer screening, allowing for timely relevant treatment which can effectively reduce the mortality rate, [2]. The traditional lung nodule detection method is for doctors to observe lung medical images (obtained by computed tomography, X-ray methods, etc.) with the naked eye, judge whether there are pulmonary nodules in the patient's lung images, and then conduct further diagnosis and treatment. But as the number of medical image scans increases, the workload of radiologists increases significantly. In addition, pulmonary nodules in general medical images are characterized by small size and various shapes, which further increases the difficulty of identifying them, leading to misdiagnosis or, more dangerously, missed diagnosis.
Object detection is an important computer vision task. It was developed from the image classification task; the difference is that object detection needs to complete both the classification and the coordinate positioning of the object in the image, [3]. Traditional object detection models are generally based on artificially designed feature operators to describe images, such as the Scale-Invariant Feature Transform (SIFT) [4], the Histogram of Oriented Gradients (HOG) [5], etc. These feature operators are designed based on low-level visual features, so it is difficult for them to obtain semantic information in complex images. AlexNet [6], which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, demonstrated the powerful feature expression ability of convolutional neural networks (CNNs). Since then, research based on deep learning has developed rapidly, and VGGNet [7], GoogLeNet [8], ResNet [9], and DenseNet [10] have come out one after another. Because they all have powerful feature extraction capabilities, in addition to completing image classification tasks, they are also commonly used as backbone networks for more complex computer vision tasks, including object detection. The R-CNN model, proposed by Girshick et al. in 2014 [11], applied CNNs to the field of object detection for the first time, and decisively outperformed the traditional Deformable Part Model (DPM) [12] on the PASCAL VOC detection dataset [13]. Although R-CNN does not require manual feature selection, it needs to generate a large number of candidate regions to improve accuracy, and many candidate regions overlap with each other. Thus, it is necessary to repeatedly extract features for different candidate regions, which results in slow training and unsatisfactory detection performance. In order to solve this problem, the authors of R-CNN absorbed the idea of SPP-Net [14] and proposed Fast R-CNN [15] in 2015.
This network simplifies the SPP layer into a single-level version, called a Region of Interest (RoI) pooling layer. In addition, Fast R-CNN replaces the Support Vector Machine (SVM), used in R-CNN, with a Softmax function, and introduces a Singular Value Decomposition (SVD) to combine classification with regression. Later, in view of the fact that the generation of Fast R-CNN candidate boxes is completely independent and cannot be learned according to specific datasets, Ren et al. improved Fast R-CNN again and proposed Faster R-CNN [16]. Its biggest innovation is the design of a candidate box generation network called a Region Proposal Network (RPN), which allowed it to greatly improve the detection speed and break the record of the PASCAL VOC dataset. Afterwards, other models based on Faster R-CNN, such as R-FCN [17], Mask R-CNN [18], and ThunderNet [19], were successively proposed.
Although the two-stage models, starting with Faster R-CNN, have realized the complete process of end-to-end training, there is still a considerable gap in satisfying real-time requirements. In 2016, a regression-based object detection model was proposed, referred to as YOLO (You Only Look Once) [20]. Its essence is still a CNN which, however, differs from an ordinary CNN in that it can detect multiple border positions and object categories in one run, and implements end-to-end object detection and recognition. The first YOLO version, YOLOv1, divides the input image into 7 × 7 grids, whereby each grid predicts two bounding boxes (BBoxes), so there are 7 × 7 × 2 BBoxes and up to 49 objects can be identified. Therefore, YOLOv1 is not well suited to identifying dense objects and small objects. Building on it, YOLOv2 [21] draws on the VGG [7] network to construct a new backbone network (Darknet-19) with improved convergence performance. YOLOv2 significantly improves the detection accuracy while ensuring high detection speed, but it still has the disadvantage of low detection accuracy for small-sized objects. In the third YOLO version, YOLOv3 [22], the basic network is Darknet-53, which draws on the residual structure of ResNet [9], deepens the network structure, and prevents the convergence difficulties caused by exploding gradients. YOLOv3 predicts more than 10 times as many bounding boxes as YOLOv2. In addition, the detection is carried out on different scales, so the overall detection accuracy of the model, along with its detection accuracy for small objects, has been greatly improved. Therefore, YOLOv3 has become a milestone in the area of one-stage detection. YOLOv4 [23] chose CSPDarknet-53 as its backbone network. In addition, its overall structure is similar to that of YOLOv3, but with improved substructures.
At the same time, YOLOv4 got rid of the final pooling layer, fully connected layer, and Softmax layer, which allowed it to improve the detection speed while also maintaining good detection accuracy. The basic structure of the YOLOv5 model is similar to that of YOLOv4. The biggest difference is that it is scaled according to different channels. So far, five YOLOv5 models have been produced, namely YOLOv5-n/s/m/l/x, ranging from small to large. The YOLOv5 network architecture has the advantages of high detection accuracy and fast operation, whereby the maximum detection speed can reach 140 frames per second (fps). On the other hand, the weight file of its network model is around 90% smaller than that of YOLOv4, which makes the YOLOv5 model more suitable for deployment on medical embedded devices for real-time detection of lung nodules. Since the detection accuracy, real-time performance, and light weight of the model are directly related to the accuracy and efficiency of medical equipment for detecting pulmonary nodules in medical images, this study is focused on improving the YOLOv5s model for the purposes of detecting pulmonary nodules in medical images. As a result, a new YOLOv5-CASP model is proposed, with superior detection performance compared to the original YOLOv5s and five other state-of-the-art models. The main contributions of this paper are the following:

1) An incorporation of improved Convolutional Block Attention Modules (CBAM) into YOLOv5s is proposed so as to suppress the interference features of the medical images from both the channel dimension and the spatial dimension, and improve the detection performance of the model;
2) A replacement of the Spatial Pyramid Pooling – Fast (SPPF) module of YOLOv5s with an improved Atrous Spatial Pyramid Pooling (ASPP) module is proposed so as to increase the model's receptive field for lung nodule images of different sizes and extract multi-scale contextual information for enhancing the detection performance on small lung nodules;
3) An optimization of the CSPDarknet53 (C3) module, used at the end of the backbone and head of the original YOLOv5s model, by replacing its 3 × 3 convolution with a Contextual Transformer (CoT) module [26], resulting in a new module, named CoT3, so as to make the model more efficient in obtaining more detailed information;
4) Based on the above YOLOv5s improvements, a novel YOLOv5-CASP model is elaborated and proposed for detecting lung nodules in medical images, which outperforms the original YOLOv5s model and five other state-of-the-art models in this regard, based on the results of experiments carried out on the open datasets LUNA16 [27] and X-Nodule, as demonstrated further in the paper.
The rest of the paper is structured as follows. Section II introduces relevant background information, including attention mechanisms, dilated convolution, and transformers. Section III presents the main representatives of the two-stage and one-stage object detection models. Section IV explains the proposed YOLOv5-CASP model. Section V describes the conducted experiments and discusses the obtained results. Finally, section VI concludes the paper.

A. ATTENTION MECHANISMS
Attention mechanisms mimic the way human vision and nerves process information. People generally first determine the local areas they need to focus on by observing the panorama of the entire picture, in order to obtain more detailed information about the object. Attention mechanisms were first applied to natural language processing (NLP) tasks. By introducing long-distance context information, they solved the phenomenon of information forgetting in long sequences. In vision tasks, attention mechanisms are also used to establish spatial long-distance dependencies to solve the problem of the limited receptive fields of convolutional kernels, [29], [30]. At present, the Squeeze-and-Excitation (SE) attention [31], coordinate attention (CA) [32], and CBAM attention [33] are widely used. SE only considers the internal channel information and ignores the importance of position information. CA needs to perform weighted fusion on the information of each location, so the computation time and resource consumption of the model are too high. CBAM emphasizes the learning of important features in the channel and spatial directions; therefore, the weight of important features is greater and can be transmitted to deeper layers for accurate identification of pulmonary nodules. This paper improves on the original convolutional attention by replacing the original Rectified Linear Unit (ReLU) activation function, which allows better accuracy to be achieved in the detection of lung nodules.

B. DILATED CONVOLUTION
In the traditional CNN model, down-sampling is used to expand the receptive field. Frequent down-sampling, however, may lead to losing some location information and decreasing the image resolution. That is why it is difficult for the network to accurately obtain the location information of objects, which hampers locating both big and small objects. Therefore, in the DeepLab series of models, atrous convolution, also known as dilated convolution, was proposed to solve the above problem, [34]. Compared with standard convolution, a new parameter, the dilation rate, was introduced for defining the distance between the values sampled when the convolution kernel processes data. When the convolution kernel's parameters remain unchanged, the receptive field can be expanded by increasing the value of the dilation rate. Atrous convolution can thus expand the receptive field without decreasing the image resolution. However, a higher dilation rate of the atrous convolution is not always better. When the dilation rate is higher, the sampling of the atrous convolution becomes sparse. If there is no correlation between the information obtained from the long-distance convolution, the information that can be given by the spatial continuity, such as the marginal information, will be lost. In TridentNet [35], the authors studied the relationship between the receptive field and object size, and found that a big receptive field could help the detection of a big object, whereas a small receptive field could help the detection of a small object. Therefore, the deeper the network and the more down-sampling it performs, the worse the detection of small objects becomes. Because high-resolution features should be used for the detection of small objects, this paper expands the receptive field by introducing atrous convolution and further improves the detection effect on big objects without reducing the detection effect on small objects by selecting a suitable dilation rate.
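The receptive-field arithmetic behind this choice can be illustrated with a short sketch (the helper name is ours, not from the paper): a k × k kernel with dilation rate d covers an effective extent of k + (k − 1)(d − 1) pixels while keeping the same k² parameters.

```python
def effective_kernel_size(k: int, d: int) -> int:
    """Effective spatial extent of a k x k convolution kernel with dilation rate d.

    The kernel still has k * k parameters; dilation only spreads its taps apart.
    """
    return k + (k - 1) * (d - 1)

# The same 3 x 3 kernel (9 parameters) covers progressively wider areas:
for d in (1, 2, 4):
    print(f"dilation {d}: covers {effective_kernel_size(3, d)} pixels per side")
```

With d = 1 this reduces to an ordinary 3 × 3 convolution; d = 2 already matches the extent of a 5 × 5 kernel at a fraction of the parameter cost, which is why a suitable dilation rate can enlarge the receptive field for big nodules without sacrificing resolution for small ones.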

C. TRANSFORMERS
In 2017, Google used the Self-Attention model [36] to replace the Recurrent Neural Network (RNN) model [37], which was the most widely used in NLP tasks until then, causing a huge shock in the NLP field. RNN contains a recurrent layer, whereby the output at the next moment comes from the input at the previous multiple moments and the network's current state; that is, the network memorizes the previous information and acts on the output, so it can store the correlation between features (Figure 1).
However, RNN can only perform sequential calculations, which brings two problems: 1) The calculation at the current moment depends on the calculation result at the previous moment, which limits the parallelism ability of the model. 2) During the calculation process, information with too long intervals will be lost, and the long-term dependency of the context cannot be established. The Transformer is a neural network model based on the Self-Attention mechanism, which is used to calculate the attention distribution between elements in the input sequence. The Transformer effectively solves the two problems above as follows: 1) Parallelization between modules improves model training efficiency and conforms to modern distributed Graphics Processing Unit (GPU) frameworks. 2) By using the self-attention mechanism, the distance between any two positions of the given data is established to retain long-distance information. The Transformer model consists of an encoder and a decoder, which both consist of stacked self-attention layers and fully connected layers. It is a model structure that avoids recurrence. As shown in Figure 2, after passing through six layers of encoders to calculate attention, the data is output to the decoder of each layer. By extracting the relationship between different regions through multi-head attention, the Transformer effectively enhances the network's ability to extract global features, so that high-level semantic information can better integrate the extracted local information as a whole while maintaining high extraction speed. The Contextual Transformer (CoT) module is a neural network module that combines the advantages of CNNs and Transformers. It is often used in various computer vision tasks, such as object detection, semantic segmentation, image classification, etc.
Therefore, we optimize the C3 module, used at the end of the backbone and head of the original YOLOv5s model, by replacing its 3 × 3 convolution with a CoT module [26], resulting in a new module named CoT3. The structure of the CoT and CoT3 modules is described in detail in Section IV.
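The self-attention operation at the heart of the Transformer and CoT modules can be sketched in a few lines of NumPy. This minimal version omits the learned query/key/value projections and the multi-head splitting of a real Transformer, keeping only the scaled dot-product weighting over all positions.

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention over a (positions, features) matrix.

    Every output row is a weighted mix of ALL input rows, so long-distance
    dependencies are captured in a single, fully parallel step (no recurrence).
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ X

X = np.random.randn(5, 8)    # 5 sequence positions, 8-dim features
Y = self_attention(X)        # same shape; each position now mixes global context
```

Unlike an RNN step, nothing here depends on the previous position's result, which is exactly the parallelism advantage described above.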

III. RELATED WORK
The task of object detection is to determine whether an image contains an object and, if so, to show its position in the input image, usually represented by a rectangle. The current mainstream object detection models can be divided into two categories: 1) Two-stage object detection models, such as Fast R-CNN [15], Faster R-CNN [16], etc., which first extract the candidate regions, then classify the candidate regions, and finally accurately localize them (Figure 3). 2) One-stage object detection models, such as the Single Shot MultiBox Detector (SSD) [38] and YOLO [20], which do not need to extract the candidate regions, but rather directly generate the class probability and position coordinate values of objects. So, only one stage is required to obtain the final detection result, which increases the detection speed. The following subsections briefly introduce the main representative models of these two categories.

A. TWO-STAGE OBJECT DETECTION MODELS
Two-stage object detection originates from the R-CNN model, proposed by Girshick et al. in 2014 [11]. It innovatively uses Selective Search [39] to replace the sliding window, which solves the problem of window redundancy and reduces the time complexity of the model. At the same time, a CNN is used to replace the traditional manual feature extraction part, so as to more effectively extract the features of the image and improve the network's anti-interference ability.
The performance of the R-CNN model greatly surpasses that of the previous best traditional object detection models, but there are also some problems: 1) Generating about 2000 candidate boxes using the Selective Search method is time-consuming. 2) R-CNN performs a scaling operation on the input image, which may destroy some of the information contained in the image and reduce the detection accuracy. 3) Duplicate regions existing between candidate boxes increase the computational complexity.
In view of the above problems, the following optimized models were subsequently elaborated.

1) SPP-NET
The SPP-Net model [14] only needs to run the convolution layers once over the entire image to obtain the feature map, which greatly reduces the time spent on feature extraction. In addition, SPP-Net adds Spatial Pyramid Pooling (SPP) after the last convolution layer of the network, producing a fixed-length feature vector as an input to the first fully connected layer. Furthermore, SPP-Net does not require a fixed-size image input, which reduces the image information loss caused by image cropping or stretching and avoids the repeated calculation of convolution features, thus effectively improving the precision and speed of image processing.
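The key property of SPP, a fixed-length output from a variable-sized input, is easy to see in a small NumPy sketch (the level choice of (1, 2, 4) is illustrative, not necessarily the original configuration):

```python
import numpy as np

def spp(feature: np.ndarray, levels=(1, 2, 4)) -> np.ndarray:
    """Spatial pyramid pooling sketch: max-pool a C x H x W map into an
    n x n grid for each pyramid level and concatenate, giving a vector of
    length C * sum(n * n) that does not depend on H or W
    (assumes H, W >= max(levels))."""
    C, H, W = feature.shape
    pooled = []
    for n in levels:
        hs = np.linspace(0, H, n + 1).astype(int)   # bin edges along height
        ws = np.linspace(0, W, n + 1).astype(int)   # bin edges along width
        for i in range(n):
            for j in range(n):
                cell = feature[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)

# Two feature maps of different spatial sizes yield identically sized vectors:
v1 = spp(np.random.randn(3, 13, 17))
v2 = spp(np.random.randn(3, 32, 32))
```

Both vectors have length 3 × (1 + 4 + 16) = 63, so they can feed the same fully connected layer, which is exactly what frees SPP-Net from fixed-size inputs.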

2) FAST R-CNN
Based on the SPP-Net idea, in 2015 Girshick et al. proposed the Fast R-CNN model, whose network uses a simplified SPP layer, called a RoI pooling layer (Figure 4). In addition, Fast R-CNN uses a Softmax function to replace the SVM used in R-CNN, and introduces a SVD decomposition to combine classification with regression. As a result, Fast R-CNN shows not only better accuracy and test speed than SPP-Net, but also brings a substantial improvement in the training speed, which is 8 times faster than that of R-CNN under the same conditions. Such a large performance improvement is mainly due to the use of a multi-task loss that jointly trains classification and BBox regression. However, since the network still uses the Selective Search method for the generation of candidate regions, it still has certain shortcomings, such as low detection speed and a lack of suitability for real-time detection.

3) FASTER R-CNN
In 2015, Ren et al. proposed the Faster R-CNN model [16] that integrates feature extraction, candidate region selection, object box fine-tuning, and classification, which greatly improves the comprehensive detection performance and truly achieves end-to-end detection. Faster R-CNN consists of two components - a Region Proposal Network (RPN), used instead of Selective Search, and Fast R-CNN. The network that produces the suggestion boxes and the object detection network share the convolution features. First, the images (of any size) are inputted into the CNN for feature extraction through the convolution layers, and the RPN is used to generate high-quality suggestion boxes, which are then mapped onto the CNN feature maps (Figure 5).
In the last layer of the convolution feature map, the RoI pooling layer is used to fix the size of each suggestion box, and finally the classification layer and the border regression layer are used to perform the specific category judgment and accurate border regression on the suggested regions.
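RoI pooling is effectively a single-level SPP applied once per proposal. A minimal NumPy sketch (function and argument names are ours, for illustration) shows how every suggestion box, whatever its size, is reduced to the same fixed grid:

```python
import numpy as np

def roi_pool(feature: np.ndarray, roi, out_size: int = 2) -> np.ndarray:
    """Max-pool one RoI of a C x H x W feature map into a fixed
    C x out_size x out_size block, regardless of the RoI's own size
    (real detectors typically use out_size = 7)."""
    x1, y1, x2, y2 = roi                      # RoI in feature-map coordinates
    crop = feature[:, y1:y2, x1:x2]
    C, h, w = crop.shape
    hs = np.linspace(0, h, out_size + 1).astype(int)
    ws = np.linspace(0, w, out_size + 1).astype(int)
    out = np.zeros((C, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[:, i, j] = crop[:, hs[i]:hs[i + 1],
                                   ws[j]:ws[j + 1]].max(axis=(1, 2))
    return out
```

Because every proposal emerges with the same shape, all proposals can share one classification head and one border regression head, which is what makes the shared-feature design of Faster R-CNN efficient.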
Compared with Fast R-CNN, Faster R-CNN shows better performance and a faster speed in generating suggestion boxes, and in addition, its overall detection efficiency is higher. Therefore, it is used as a representative model of the R-CNN series for performance comparison with the proposed YOLOv5-CASP model, as shown further in this paper.

B. ONE-STAGE OBJECT DETECTION MODELS
The YOLO model [20], proposed by Redmon et al. in 2016, used the idea of regression to greatly speed up object detection and reduce the background false detection rate. However, its positioning accuracy and recall rate are still not ideal, and the detection accuracy for small objects is not high. By combining YOLO with Faster R-CNN, Liu et al. [38] proposed the SSD model for solving the problem of positioning accuracy. In 2019, Tan et al. [40] proposed the EfficientDet model to quickly perform feature fusion through a Bi-Directional Feature Pyramid Network (BiFPN). This model introduces learnable weights to learn the importance of different input features, while repeatedly applying top-down and bottom-up multi-scale feature fusion. The following subsections briefly present the above-mentioned one-stage object detection models.

1) YOLO
Aiming at the common problem of poor real-time performance of two-stage neural networks, in 2016, Redmon et al. proposed the first one-stage network, known now as YOLOv1 [20], i.e., the first version of YOLO. By treating the object detection task as a regression task, the one-stage network can get the BBoxes and predict class probabilities. First, after adjusting the image size, the image is inputted into a CNN to predict the coordinates of the bounding box (BBox), and the category and confidence of the object in the box. Finally, the Non-Maximum Suppression (NMS) algorithm is used to remove overlapping boxes, obtain the final prediction box, and achieve the detection goal. Since YOLO does not have a candidate region generation stage, its detection speed is very high, i.e., reaching 45 fps, thus greatly outperforming the two-stage object detection models on this criterion.
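The NMS step mentioned above can be sketched as follows (a plain NumPy version for clarity; production detectors use optimized implementations):

```python
import numpy as np

def iou(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and discard the
    remaining boxes that overlap it by more than iou_thr."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thr]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
# The second box overlaps the first heavily (IoU = 0.81) and is suppressed.
```

The greedy loop is why duplicate predictions of the same nodule collapse into a single final box.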
After several years of development, YOLO has undergone several versions. Among them, the fifth version, i.e., YOLOv5 [24], is the optimal model that takes into account both accuracy and speed, and as such is widely used in industrial domains. The network architecture of its 's' subversion, i.e., YOLOv5s, is shown in Figure 6. The backbone of YOLOv5s adopts both Focus and CSPDarknet53 structures [41]. The most important part of the Focus structure is the slicing operation, which splits high-resolution features into multiple low-resolution feature maps. The function of CSPDarknet53 is to perform feature extraction on images, which also includes Convolutional Block Linear (CBL) and SPP operations. CBL consists of three parts, namely convolution, Batch Normalization (BN), and an activation function (Leaky ReLU). SPP is the spatial pyramid pooling layer, where max-pooling with three different kernel sizes is used to greatly increase the receptive field. In addition, a Cross-Stage Partial connection (CSP) structure is utilized by the YOLOv5s backbone to roughly integrate the changes of gradients into the feature map from the beginning to the end, thus reducing the number of network parameters and the computational amount, which not only ensures precision and accuracy, but also reduces the size of the model. CSP is also used in the YOLOv5s neck to strengthen the ability of network feature fusion. The head obtains the probability of the object class and the final position of the border by calculating the loss according to the preset anchor boxes in the grid. The Generalized Intersection over Union (GIoU) loss is used as a loss function for the BBoxes, as it introduces a minimum circumscribed rectangle between the prediction box and the ground-truth box, which allows solving the problem that the distance between them cannot be measured when they do not intersect.
The smaller the value of the GIoU loss, the better the model, meaning that the smaller the gap between the prediction box and the ground-truth box, the better the object detection performance, [42].
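The GIoU computation described above can be written directly from its definition (a sketch; 1 − GIoU is the quantity the training actually minimizes):

```python
def giou(a, b):
    """Generalized IoU between two (x1, y1, x2, y2) boxes; the loss is 1 - giou.

    Unlike plain IoU, the enclosing-box term keeps providing a signal
    (a negative value) even when the two boxes do not intersect at all.
    """
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    iou = inter / union
    # smallest rectangle enclosing both boxes (the "minimum circumscribed rectangle")
    c = ((max(a[2], b[2]) - min(a[0], b[0]))
         * (max(a[3], b[3]) - min(a[1], b[1])))
    return iou - (c - union) / c
```

For identical boxes giou returns 1.0; for two disjoint boxes it goes negative, so the loss 1 − giou still tells the optimizer how far apart the prediction and the ground truth are.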

2) SSD
The SSD model is a one-stage object detection model based on a CNN. The VGG16 network is used as a backbone, the fully connected layers at the end of the VGG16 network are converted into convolutional layers, and additional convolutional layers are added on this basis to obtain more feature maps. The VGG16 network and the feature maps of different resolutions in the newly added convolutional layers are used for independent predictions (Figure 7). First, a normalized processing is performed on the original input image, which is scaled to a fixed size to serve as an input to the model. Second, the features of the input image are extracted with the SSD network, 6 feature layers with different sizes are obtained, and each feature layer focuses on the extraction of the feature information of objects of a specific size. Third, the prior BBoxes on feature maps of different sizes are combined, and redundant boxes are deleted with Non-Maximum Suppression (NMS) to obtain the final locating BBox, [43].

3) EfficientDet
In the field of object detection, the detection precision of a model is as important as its detection speed. Although the detection speed of a regression-based model is relatively high, its precision is generally low. On this basis, in 2019, Tan et al. proposed EfficientDet [40], an object detection model with both high precision and high detection speed. EfficientDet uses EfficientNet [44] as its backbone network, and BiFPN, a weighted bidirectional feature pyramid network, as its feature network (Figure 8). The P3-P7 features from the backbone network are accepted, and the top-down and bottom-up bidirectional feature fusion is repeatedly applied. Then, the fused features are inputted into the category and BBox prediction networks, and the prediction results of the object category and BBox are output, respectively. A hybrid scaling method is used, whereby the resolution, depth, and width of the backbone network, the feature network, and the category and BBox prediction networks can be uniformly scaled at the same time, resulting in a model scale which is only 1/4 of that of the previous optimal model, i.e., YOLOv3 [22], and a detection speed which is 3.2 times higher than that of the previous optimal model, [40].
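The learnable-weight fusion used by BiFPN can be sketched as the "fast normalized fusion" of the EfficientDet paper: each input feature map gets a scalar weight, the weights are kept non-negative with ReLU, and they are normalized to sum to one (the function name here is ours).

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps with learnable non-negative weights.

    ReLU keeps each weight >= 0 and the normalization makes the output a
    convex combination, so the network learns how important each input is.
    """
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # ReLU on the weights
    w = w / (w.sum() + eps)                                # normalize to ~1
    return sum(wi * f for wi, f in zip(w, features))
```

With weights (2, 0) the fused map is essentially the first input; during training the weights shift smoothly as the network learns which scale matters most for each fusion node.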

IV. PROPOSED MODEL
Due to the complex structure, small size, and varied pathological characteristics of pulmonary nodules in medical images, it is difficult for the original YOLOv5s model to accurately identify them. To this end, several YOLOv5s improvements are proposed here, resulting in a novel model, called YOLOv5-CASP. Firstly, multiple CBAM attention modules are added to the backbone network and head network. Secondly, the Atrous Spatial Pyramid Pooling (ASPP) module [25] is used to replace the original SPPF module in order to increase the receptive field of the convolution output and strengthen the model in extracting deep features. Thirdly, a Contextual Transformer (CoT) module [26] is utilized, which uses the input context information to guide the learning of the dynamic attention matrix, thereby improving the detection performance of the
model. The structure of the proposed YOLOv5-CASP model is shown in Figure 9.

A. CBAM
In CBAM, the learning of important features in the channel direction and spatial direction is emphasized. Therefore, the weight of important features is higher and can be conveyed to a deeper level for precise identification. As shown in Figure 10, CBAM is composed of two modules - a channel attention module (CAM) and a spatial attention module (SAM). In CAM, the input feature map F ∈ R^(C×H×W) is subjected to average pooling and maximum pooling at the same time.
Then, the collected features are passed to a shared multi-layer perceptron with one hidden layer. The resulting features are added element-wise and activated with the Sigmoid activation function to generate the channel attention map M_c ∈ R^(C×1×1):

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))

where F denotes the input feature map, M_c denotes the channel compression weight matrix, σ denotes the Sigmoid activation function, MLP represents the shared multi-layer perceptron, AvgPool represents the mean-pooling operation, and MaxPool represents the max-pooling operation. The improved channel attention F′ is obtained by assigning the channel weights learned by M_c to the different channels of F:

F′ = M_c(F) ⊗ F

where F′ is the feature map selected through the channel attention and ⊗ is the element-wise multiplication. In SAM, mean-pooling and max-pooling are performed on F′ along the channel dimension, a 7 × 7 convolution is applied to the collected features, and then the σ activation is performed. The computational formula of the spatial attention map M_s ∈ R^(1×H×W) is as follows:

M_s(F′) = σ(f^(7×7)([AvgPool(F′); MaxPool(F′)]))

where f^(7×7) denotes the 7 × 7 convolution and M_s is the spatial compression weight matrix. Finally, the spatial attention M_s is multiplied by F′ to obtain the feature F″ processed by the CBAM module:

F″ = M_s(F′) ⊗ F′

The global max-pooling and global mean-pooling used in CAM complement each other, so the compressed information can be effectively extracted. In SAM, a 7 × 7 convolution is used rather than the traditional multiple 3 × 3 convolutions, because the former can effectively expand the receptive field and better obtain spatial information.
In the model proposed in this paper, the original ReLU activation function of the CBAM module is replaced with a Leaky ReLU activation function in order to solve the following problem: when the input value of the ReLU activation function is zero or negative, the gradient becomes zero and the network cannot perform back-propagation and learning. By using such an improved CBAM module, the network can detect objects more accurately and faster. In addition, this attention mechanism can alleviate the problems of complicated backgrounds and small objects in pulmonary nodule images, which helps improve the model's detection performance.
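As a concrete illustration of the CAM branch with the Leaky ReLU replacement described above, here is a NumPy forward-pass sketch (the weight matrices W1 and W2 of the shared MLP are hypothetical stand-ins for learned parameters):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: small non-zero gradient for negative inputs."""
    return np.where(x > 0, x, slope * x)

def channel_attention(F, W1, W2):
    """Channel attention forward pass: global average- and max-pooling over
    H x W, a shared two-layer MLP, element-wise sum, Sigmoid, then channel
    reweighting of F.  F is C x H x W; W1 is C x C/r, W2 is C/r x C."""
    avg = F.mean(axis=(1, 2))                           # (C,) average-pooled
    mx = F.max(axis=(1, 2))                             # (C,) max-pooled
    mlp = lambda v: leaky_relu(v @ W1) @ W2             # shared MLP, Leaky ReLU variant
    m_c = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))   # Sigmoid -> weights in (0, 1)
    return F * m_c[:, None, None]                       # reweight every channel of F
```

Because each channel weight lies in (0, 1), uninformative channels are damped while informative ones pass through largely unchanged; the SAM branch then repeats the same idea along the spatial dimensions.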

B. ATROUS SPATIAL PYRAMID POOLING (ASPP)
In the original YOLOv5s model, the size of the image becomes smaller as it passes through the convolution and pooling layers, and some data are lost when the image is restored through up-sampling. For images with low resolution, the negative effect of losing this internal information on image recognition is particularly considerable. To alleviate this problem, an SPPF module is used at the end of the YOLOv5s backbone. SPPF is an improved version of the Spatial Pyramid Pooling (SPP) module. SPP first halves the input channels through a standard convolution module and then performs max-pooling in parallel with kernels of size 5 × 5, 9 × 9, and 13 × 13, with self-adaptive padding for the different kernel sizes. The results of the three max-pooling operations are concatenated with the data that have not been pooled and, finally, these are merged; the number of channels after merging is twice the original number. In general, the SPP module realizes the fusion of local and global features and improves the expressive ability of the feature map. The principle of SPPF is basically the same as that of SPP, but the pooling design is different: instead of three parallel pooling kernels, SPPF applies a single 5 × 5 kernel three times in sequence, which yields equivalent receptive fields while performing feature extraction faster. Inspired by SPP, an Atrous Spatial Pyramid Pooling (ASPP) module [25] was proposed for the semantic segmentation model DeepLab [34]. This module uses multiple parallel atrous convolution layers with different sampling rates. The features extracted at each sampling rate are further processed in separate branches and fused to generate the final result. ASPP constructs convolution kernels with different receptive fields through different atrous convolution rates, which are used to obtain object information at multiple scales.
Atrous convolution can expand the receptive field without increasing the amount of computation or decreasing the resolution of the feature map, so more image features can be obtained. Based on atrous convolution, ASPP obtains multiple receptive fields of different sizes and classifies different objects through parallel sampling performed by five groups of atrous convolution modules with different dilation rates (Figure 11). A suitable combination of dilation rates can be selected for target objects of different sizes. ASPP captures long-span contextual semantic information through atrous convolutions with different dilation rates, which can not only improve the accuracy of pulmonary nodule detection in medical images but also maintain a high detection speed.
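A minimal PyTorch sketch of the parallel atrous sampling described above is given below. The dilation rates and channel widths are illustrative assumptions, not the exact configuration used in the paper.

```python
# ASPP sketch: a 1x1 branch plus parallel atrous (dilated) 3x3 convolutions,
# concatenated and projected back to the target channel count. With four
# dilation rates this gives the five parallel groups mentioned in the text.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, dilations=(1, 6, 12, 18)):  # assumed rates
        super().__init__()
        branches = [nn.Conv2d(in_ch, out_ch, 1)]  # 1x1 branch
        for d in dilations:
            # padding=d keeps the spatial size unchanged for a 3x3 kernel,
            # while dilation=d enlarges the receptive field at no extra cost
            branches.append(nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d))
        self.branches = nn.ModuleList(branches)
        self.project = nn.Conv2d(out_ch * len(branches), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]       # parallel multi-rate sampling
        return self.project(torch.cat(feats, dim=1))  # fuse multi-scale context
```

Since every branch preserves the spatial resolution, the fused output keeps the input's height and width, which is exactly what makes ASPP a drop-in replacement for SPPF.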

C. CONTEXTUAL TRANSFORMER (CoT)
Transformer with self-attention has caused a revolution in the NLP area. In recent years, many transformer-style architecture designs have emerged, demonstrating competitive results in multiple computer vision tasks. However, in most existing designs, self-attention is applied directly on the 2D feature map to obtain the attention matrix from independent query-key pairs at each spatial location, whereby the rich context of adjacent keys is not fully utilized. On this basis, Li et al. designed the CoT module [26], a novel transformer-style block for visual recognition. In the CoT design, the contextual information between the input keys is fully used to guide the learning of the dynamic attention matrix, thus improving the visual representation ability. CoT absorbs the advantages of both the traditional CNN and the Transformer, which allows it to effectively enhance the feature representation of the input information: the CNN captures the raw local information of the input features, while the Transformer obtains their global information. The structure of the CoT module is shown in Figure 12, where H, W, and C represent the height, width, and number of channels of the input data X, respectively; Q, K, and V represent queries, keys, and values; D represents the changed number of channels; C_h represents the number of heads; θ and δ represent 1 × 1 convolutional operations; ⊗ represents matrix multiplication; and W_v represents the embedding matrix.
First, for the input feature X, three variables are defined, namely Q = X, K = X, and V = X W_v. Then, a k × k convolution is performed on K to obtain the keys with local context information, denoted K^1. After that, K^1 and Q are concatenated and two consecutive 1 × 1 convolutional operations are performed (W_θ with a ReLU activation function and W_δ without an activation function) to obtain the attention matrix A:

A = [K^1, Q] W_θ W_δ.

Since A is obtained from the interactive operation of the query information and the local context information, a stronger feature representation can be obtained. A and V are multiplied to obtain the dynamic context model K^2, as follows:

K^2 = V ⊗ A.

Finally, the local static context model K^1 and the global dynamic context model K^2 are fused to obtain the final output result.
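The steps above can be sketched as a simplified, single-head CoT block in PyTorch. This is a sketch under stated assumptions: the head count, the channel-wise softmax normalization of A, and the additive fusion of K^1 and K^2 are simplifications, not the exact design of [26].

```python
# Simplified single-head CoT sketch: K1 = static local context (k x k conv),
# A learned from [K1, Q] via two 1x1 convs (ReLU after the first only),
# K2 = dynamic context from V and A, output = K1 + K2.
import torch
import torch.nn as nn

class CoT(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        self.key_conv = nn.Conv2d(channels, channels, k, padding=k // 2)  # K -> K1
        self.value_conv = nn.Conv2d(channels, channels, 1)                # V = X W_v
        self.attn = nn.Sequential(                                        # W_theta, W_delta
            nn.Conv2d(2 * channels, channels, 1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        k1 = self.key_conv(x)                       # static local context K1
        v = self.value_conv(x)                      # values V
        a = self.attn(torch.cat([k1, x], dim=1))    # A from [K1, Q], with Q = X
        k2 = v * torch.softmax(a, dim=1)            # dynamic context K2 (assumed norm.)
        return k1 + k2                              # fuse static and dynamic context
```

As with CBAM, the block preserves the input shape, which is what allows it to replace the 3 × 3 convolution inside the C3 module without altering the surrounding architecture.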
In the proposed YOLOv5-CASP model, the 3 × 3 convolution in the original YOLOv5's C3 module is replaced with a CoT module, as shown in Figure 13, resulting in a new module named CoT3.
In general, the CoT module first contextually encodes the input keys through convolution, resulting in a static contextual representation of the input. The encoded keys are then concatenated with the input query, and the dynamic multi-head attention matrix is learned through two consecutive 1 × 1 convolutions. The learned attention matrix is multiplied by the input values to realize the dynamic contextual representation of the input, and the fusion of the static and dynamic contextual representations is used as the final output.
In the detection of pulmonary nodules in medical images, detecting small nodules has always been one of the main difficulties. YOLOv5 uses a backbone network (CSPDarknet53) to extract feature representations. This network performs multiple pooling and downsampling operations, which decrease the resolution of the feature map. For small or low-resolution nodules, the low resolution of the feature map may result in the loss or blurring of target information, making it difficult for detectors to accurately locate and identify these nodules. The CoT module enhances contextual information to capture long-range dependencies, which is very beneficial for small nodule detection because it captures more global context and better models the relationship between nodules and their surrounding environment. In the proposed YOLOv5-CASP model, the last C3 module at the end of the backbone and head of the original YOLOv5s model is replaced with the newly designed CoT3 module, which uses the CoT module instead of the 3 × 3 convolution originally used in C3. In addition, CoT3 has fewer parameters and demonstrates higher performance than C3, which enables the network to detect pulmonary nodules in medical images faster and more accurately.

A. DATASETS
In the conducted experiments, two public datasets were utilized, namely the Lung Nodule Analysis 16 (LUNA16) dataset [27] and the X-Nodule dataset [28], both widely used for lung nodule detection.
The LUNA16 dataset is a subset of the LIDC-IDRI dataset [45], which contains 888 cases of low-dose chest Computed Tomography (CT) thin-slice plain scan images, divided into 10 subsets, each containing multiple axial slices of the thoracic cavity. In the 888 CT scans, a total of 36,378 nodules are marked, with diameters ranging from 3 to 30 mm and an average diameter of 8.31 mm. In LUNA16, only nodules with a diameter greater than 3 mm are used, and all other nodules and non-nodules are excluded. The CT images containing 1186 nodules, marked by at least 3 experts, were used as the basic dataset in the conducted experiments.
The X-Nodule dataset is a public pulmonary nodule dataset available on the Roboflow website (https://universe.roboflow.com). The dataset contains a total of 2015 lung X-ray images, annotated by professional radiologists.
Sample images of the two datasets are shown in Figure 14. The splitting of datasets into training, validation, and test subsets for conducting the experiments is shown in Table 1.

B. EXPERIMENTAL METHODS
K-fold cross-validation is a commonly used method for the evaluation of deep learning models, as it can effectively assess the performance of a model and help understand its behavior on different datasets. One of the advantages of K-fold cross-validation is its ability to utilize data more fully, reducing the bias caused by uneven data partitioning. Additionally, it enables evaluating the model's generalization ability, i.e., its adaptability to unseen data. Through multiple experiments and evaluations on different test sets, one can gain a comprehensive understanding of the model's performance and accurately assess its behavior in real-world applications.
In the experiments, we used a 5-fold cross-validation, whereby we divided each of the two lung nodule datasets into 5 non-overlapping subsets. Then, we conducted 5 experiments with each compared model, whereby in each experiment we used one subset as a validation set, another subset as a test set, and the remaining 3 subsets as a training set. This way, each subset took turns being the validation set and the test set, while the training set covered the rest of the dataset.
In each experiment, we trained each model on the training set, then evaluated its performance and adjusted its hyperparameter values using the validation set. After completing the model development, we applied the model to the test set to obtain the performance metric values. Finally, we averaged the results of the 5 experiments to obtain the final result as a comprehensive assessment of the model's performance.
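The fold rotation described above can be sketched in a few lines of Python. The specific pairing of validation and test folds in each round (subset i for validation, subset i+1 for testing) is an assumption for illustration; the paper only states that each subset takes turns in both roles.

```python
# Sketch of the 5-fold rotation: in round i, one subset is held out for
# validation, another for testing, and the remaining three form the training set.
def five_fold_splits(n_subsets=5):
    splits = []
    for i in range(n_subsets):
        val, test = i, (i + 1) % n_subsets               # assumed pairing
        train = [j for j in range(n_subsets) if j not in (val, test)]
        splits.append({"train": train, "val": val, "test": test})
    return splits

for s in five_fold_splits():
    print(s)
```

Each subset appears exactly once as the validation set and once as the test set across the five rounds, so averaging the five test-set results uses every sample for evaluation exactly once.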

C. EXPERIMENTAL ENVIRONMENT
During the model training process, the initial learning rate was set to 0.01 and was then gradually decreased using a cosine annealing strategy. The number of epochs was set to 500 and the batch size to 32. Experiments were performed on an Intel(R) Core(TM) i5-12400 @ 2.50 GHz PC running Windows 11, with 12 GB of video memory, using CUDA 11.3 for training acceleration, PyTorch 1.12.1 as the deep learning framework, and an input image size of 640 × 640 pixels, as shown in Table 2.
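The cosine annealing schedule mentioned above can be sketched as follows, starting from the paper's initial rate of 0.01 over 500 epochs. The minimum learning rate lr_min is an assumed hyperparameter not stated in the paper.

```python
# Cosine-annealed learning rate: decays smoothly from lr_max at epoch 0
# to lr_min at the final epoch along half a cosine period.
import math

def cosine_annealed_lr(epoch, total_epochs=500, lr_max=0.01, lr_min=1e-4):
    cos = (1 + math.cos(math.pi * epoch / total_epochs)) / 2  # 1 -> 0
    return lr_min + (lr_max - lr_min) * cos

# The schedule starts at lr_max, passes through the midpoint of the range
# at half the total epochs, and ends at lr_min.
for e in (0, 250, 500):
    print(e, cosine_annealed_lr(e))
```

Compared to step decay, this gives a smooth, monotonic decrease that avoids abrupt learning-rate drops late in training.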

D. EVALUATION METRICS
In the conducted experiments, the proposed YOLOv5-CASP model is compared with six state-of-the-art models, namely Faster R-CNN, SSD, YOLOv4-Tiny, YOLOv5s, DETR-R50, and Deformable DETR-R50, based on precision and recall. Precision indicates the proportion of true positives (TP) among the predicted positive results, while recall indicates the proportion of correct predictions among all positive samples, as follows:

Precision = TP / (TP + FP),
Recall = TP / (TP + FN),

where FP and FN denote the numbers of false positives and false negatives, respectively. The mean average precision (mAP) and the F1 score are used as the main evaluation metrics in the conducted experiments, which are defined as follows:

mAP = (1/N) Σ_{i=1}^{N} AP_i,
F1 = 2 × Precision × Recall / (Precision + Recall),

where N represents the total number of classes and AP_i represents the average precision of class i, obtained as the area under the precision-recall curve:

AP = ∫_0^1 p(r) dr,

where p and r represent the precision rate and recall rate, respectively.

Tables 3-9 show the mAP and F1 score values achieved by each model in the five experiments performed on the LUNA16 dataset. The use of the CBAM attention mechanism improved the mAP by 0.07 and 0.02 points, and the F1 score by 0.04 and 0.03 points, on the LUNA16 and X-Nodule datasets, respectively. While the individual use of each of the other proposed modules did not bring a further performance improvement compared to the individual use of the CBAM module alone, both evaluation metrics were improved in each individual case compared to the baseline model. For instance, replacing the SPPF module with the ASPP module improved the mAP by 0.04 and 0.02 points, and the F1 score by 0.02 and 0.02 points, on the LUNA16 and X-Nodule datasets, respectively. Applying the CoT3 module increased the mAP by 0.04 and 0.01 points, and the F1 score by 0.01 and 0.01 points, compared to the baseline, on the LUNA16 and X-Nodule datasets, respectively. However, the combined use of all three improvements raised the model performance even further, compared to the individual use of the CBAM module, by 0.02 and 0.03 points for mAP, and by 0.02 and 0.03 points for F1 score, on the LUNA16 and X-Nodule datasets, respectively.
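The metric definitions above can be sketched in plain Python. This is an illustrative sketch: the integral for AP is approximated here with the trapezoidal rule over a given set of (recall, precision) points, whereas detection benchmarks often use interpolated variants of AP.

```python
# Precision, recall, and F1 from confusion counts, plus AP as the area under
# the precision-recall curve and mAP as the mean of per-class APs.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def average_precision(recalls, precisions):
    """Trapezoidal area under the PR curve; points sorted by increasing recall."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)
```

For a single-class task such as lung nodule detection, N = 1 and mAP reduces to the AP of the nodule class.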
Figures 15 and 16 demonstrate the superior ability of the proposed model (YOLOv5-CASP), compared to the baseline model (YOLOv5s), in detecting pulmonary nodules in medical images of the two datasets used.

G. COMPUTATIONAL COMPLEXITY STUDY
Compared to using standard CNN convolution modules, the use of CBAM modules in YOLOv5s increases the model's computational complexity. To study this issue, we conducted two additional experiments on the LUNA16 dataset, which was randomly divided into three subsets used for model training, validation, and testing, respectively. In the first experiment, we added several 1 × 1 standard convolution modules to the key information extraction parts of the YOLOv5s backbone and neck, using SiLU as the activation function. In the second experiment, we replaced these standard convolution modules with CBAM modules, while still using SiLU as the activation function. Both experiments ran for 200 epochs with a batch size of 32. The obtained results are shown in Table 21.
The presented results reveal that, although the use of CBAM modules increases the number of model parameters, compared to the baseline model with or without standard CNN modules, this increase is only 5.7% and 10.4%, respectively, which also corresponds to the increase in GPU memory utilization. In terms of the model training time, the increase is greater: 14.0% and 17.4%, respectively. However, this extra computational complexity due to the use of CBAM modules is acceptable, given that the mAP was improved by 0.04 and 0.05 points, respectively.
A deeper analysis revealed that it was mainly the spatial attention module (SAM) that increased the computational complexity, while the additional complexity due to the channel attention module (CAM) could be ignored. Possible strategies to alleviate this problem include reducing the size of the convolution kernel (especially in SAM), lowering the dimensionality of the attention maps, and adjusting the module structure, aiming at an optimal balance between detection performance and computational complexity. Considering that detection equipment for medical images may be embedded in precision devices with high resource utilization requirements, in the future, we plan to further investigate this aspect and explore more potential strategies for reducing the model's computational complexity.

VI. CONCLUSION
The first step in the analysis of lung cancer screening results is lung nodule detection in medical images. In order to assist the early screening of lung cancer, a novel YOLOv5-CASP model has been proposed in this paper, based on the YOLOv5s model with the following improvements: (i) integrating improved CBAM modules into the YOLOv5s backbone and detection head in order to improve the detection of pulmonary nodules; (ii) substituting the original YOLOv5s SPPF module with an improved ASPP module so as to increase the receptive field and obtain multi-scale contextual information; and (iii) replacing the last C3 module at the end of the backbone and head of the original YOLOv5s model with a newly designed CoT3 module, which improves the self-attention learning ability and achieves higher detection accuracy by exploiting the contextual information between neighboring keys. A series of experiments conducted on two public datasets, LUNA16 and X-Nodule, confirmed the superiority of the proposed YOLOv5-CASP model over six state-of-the-art models (Faster R-CNN, SSD, YOLOv4-Tiny, YOLOv5s, DETR-R50, Deformable DETR-R50), in terms of the mean average precision (mAP) and F1 score.
When it comes to detecting lung nodules with complex or irregular shapes, the baseline model (YOLOv5s) has some limitations. This is because it adopts a grid-based approach and relies on bounding boxes around the detected objects of interest. For objects with complex or irregular shapes, it may not be able to accurately capture detailed information about their precise shape. In addition, when lung nodules are partially or completely occluded by other nodules, YOLOv5s may face detection difficulties that could lead to missed detections or false positives. Therefore, in future research, we will focus on addressing this challenge by attempting to improve the performance of the proposed model in detecting complex objects through the introduction of multi-scale feature fusion mechanisms, adding multiple feature layers to the network and fusing features from different levels to enhance the model's perception of objects at different scales. Then, we will try to apply more advanced shape modeling methods to capture the shape information of complex or irregular objects more accurately. In addition, we will try to use geometry-based or contour-based models combined with image segmentation techniques to better describe object contours and shapes. Finally, we will try to improve the occlusion handling capabilities to increase the detection accuracy for complex objects. In this regard, we will attempt to use an occlusion-aware module to reduce missed detections or false positives caused by occlusions through learning and reasoning about occlusion relationships.
We plan to experimentally explore these improvements, based on YOLOv5s, and expect that these research efforts will allow further enhancing the model's performance in detection tasks involving complex or irregularly shaped objects.
YUN WU was born in 1996. He received the B.S. degree from the Anhui Institute of Information Technology, in 2021. He is currently pursuing the master's degree with the North China University of Science and Technology. His research interests include machine vision and graphic image processing.
XINYI ZENG was born in 1999. She received the bachelor's degree from the Yancheng Institute of Technology, in 2021. She is currently pursuing the master's degree with the North China University of Science and Technology. Her research interests include machine vision and intelligent medical image processing.
YONGLI AN received the Ph.D. degree in information science from Beijing Jiaotong University, Beijing, China, in 2015. She is currently a Professor with the North China University of Science and Technology, China. Her current research interests include wireless network security, interference cancellation technology, and large-scale MIMO technology.
LI ZHAO received the B.S. and Ph.D. degrees from Tsinghua University, Beijing, China, in 1997 and 2002, respectively. He is currently an Associate Professor with the Research Institute of Information Technology, Tsinghua University. His current research interests include mobile computing, the Internet of Things (IoT), e-health systems, intelligent transportation systems (ITS), home networking, machine learning, and digital multimedia.
ZHIWU WANG received the Ph.D. degree from Tianjin Medical University, in 2014. He is currently a Chief Physician with the Department of Radiotherapy and Chemotherapy, Tangshan People's Hospital. He is engaged in the comprehensive medical treatment of lung cancer and digestive system tumors. His current research interests include medical data processing, screening tumor immunotherapy effect prediction markers based on artificial intelligence.
IVAN GANCHEV (Senior Member, IEEE) received the Engineering and Ph.D. degrees (summa cum laude) from the Saint Petersburg University of Telecommunications, in 1989 and 1995, respectively. He is currently an International Telecommunications Union (ITU-T) Invited Expert and an Institution of Engineering and Technology (IET) Invited Lecturer associated with the University of Limerick, Ireland, the University of Plovdiv ''Paisii Hilendarski,'' Bulgaria, and the Institute of Mathematics and Informatics-Bulgarian Academy of Sciences, Bulgaria. He was involved in more than 40 international and national research projects and has authored/coauthored one monograph, three textbooks, four edited books, and more than 300 research papers in refereed international journals, books, and conference proceedings. He has served on the TPC of more than 380 prestigious international conferences/symposia/workshops. He is on the editorial board and has served as the guest editor for multiple prestigious international journals.