Single-Shot Detection Based on Cyclic Attention

The shallow feature maps of the single-shot detector (SSD) are not always conducive to improving recognition precision for small objects because they lack contextual information. In this research, a single-shot detection algorithm based on cyclic attention (CA-SSD) is proposed to construct a fast and accurate detector that efficiently obtains full-image contextual information. Our network is constructed by integrating ResNet-34 with the proposed novel cyclic attention blocks. This type of building block aggregates different transformations, one of which includes an attention module that uses a long but narrow pooling kernel to acquire horizontal and vertical contextual information for each pixel. Through a further cyclic operation, each pixel eventually captures full-image dependencies. Our design considers the variability of the gradient, which not only improves the reliability of the cyclic attention block but also reduces the number of parameters to be computed. Additionally, by exploring the effects of the stem block and its stride on the performance of ResNet-based SSD algorithms, our network retains more detailed information. For an input size of 300 $\times $ 300, CA-SSD attains 82.5% mAP on the PASCAL VOC 2007 test set, 78.4% mAP on the PASCAL VOC 2012 test set, and 32.7% mAP on MS COCO. The experimental results achieved with CA-SSD surpass the best results achieved with the traditional SSD and other advanced object detection algorithms while real-time speed is maintained.


I. INTRODUCTION
Object detection is a fundamental task in the field of computer vision and is critical in many recently developed applications, such as autonomous driving [1], [2], fault detection [3], [4], and medical decision-making [5], [6]. The goal is to find several targets in the complex background of the image and to estimate the position and category information of each object at the same time.
Object detection has been extensively and actively studied with continuous breakthroughs in technology. Most early detectors directly extract histogram-of-oriented-gradients or deformable-part-model features [7]-[9] and send them to a well-trained classifier for classification. Unfortunately, these hand-crafted features have weak generalization ability and struggle to meet the needs of applications. In recent years, benefiting from the comprehensive progress of deep convolutional neural networks (CNNs) [10]-[14] and graphical processing unit (GPU) computing capabilities and the accumulation of prior work on well-annotated data sets [15], [16], hand-crafted features have gradually been replaced by complex image characteristics extracted by CNNs, and the performance of object detectors has improved. Among object detectors, Faster R-CNN [17] and R-FCN [18] select the top feature map in predicting candidate region proposals of various aspect ratios and scales. However, because the receptive field in the top feature map is fixed, there is a conflict between objects of different scales in natural images and the fixed receptive field, which limits the prediction of objects that are too small or too large and may decrease detection performance. Meanwhile, excessively deep backbone networks have been used to extract features and improve detection accuracy, but this greatly reduces detection efficiency [13], [18]. To accelerate detection, the one-stage detection frameworks ''you only look once'' (YOLO) [19] and the single-shot detector (SSD) [20], which abandon the object proposal generation phase, were proposed. YOLO [19] is similar to Faster R-CNN [17] in that it uses a fully connected layer with a fixed receptive field to predict the detection results of multi-scale objects.
Although YOLO is fast enough for real-time detection, its accuracy is unsatisfactory. The SSD [20] achieves a better trade-off between accuracy and speed. It utilizes feature maps in various layers to predict objects at distinct scales and is one of the first attempts to implement convolutional pyramid feature representation for object detection.
In the feature pyramid of the SSD, the high-level feature maps with large receptive fields, used to detect large targets, provide stronger discrimination for the classification subtask because of their semantic information, and the low-level feature maps with small receptive fields contain detailed information for the regression subtask, which is conducive to predicting small objects. However, the shallow features contain less semantic information, which may result in small objects being poorly detected. Therefore, studies [21]-[24] developed ways to further improve accuracy through ingenious design. The FPN [21] adopts a top-down structure to construct new feature pyramids for prediction through fusing features of different levels. The DSSD [22] employs an hourglass structure to transfer contextual information and obtain better results. However, these improved features basically come from the very deep ResNet-101 [13] with heavy computational costs, which lowers the inference speed. Nevertheless, the RFB [23] and FSSD [24] achieved competitive detection performance on the VGG [12] backbone with carefully designed structures. The deep features of lightweight CNNs have been enhanced by designing modules that simulate the human visual system [23]. A lightweight feature fusion module [24] was proposed by Li et al. to fully utilize the features. Obviously, in addition to having a better feature network, adding more contextual information improves the performance of the detector.
On the basis of the discussion above, in building a detector that considers both speed and accuracy, a reasonable choice is to introduce an ingenious design that improves the feature representation capabilities and avoids using deeper models that slow the calculation. In this research, to enhance the expressive capability of shallow feature maps, a single-shot detection algorithm based on cyclic attention (CA-SSD) is proposed and a novel module, namely the cyclic attention block (CA block), is designed. Specifically, we replace the basic network of the SSD with ResNet-34 and residual blocks. By analyzing the network structure of VGGNet and ResNet, we redesign the stem block to reserve abundant detailed information for object detection and greatly improve the detection accuracy without affecting the detection speed. At the same time, the CA block that enhances the variability of the learned features within different layers while learning the full-image contextual information is used. The CA block has fewer parameters than the residual block, which contributes to the construction of a fast and accurate detector. In achieving the goal of collecting full-image contextual information, we simply perform repeated operations on the cyclic attention module (CAM) to capture the information of the remaining positions indirectly. Among them, the CAM uses a long but narrow pooling kernel to expand the receptive fields of network and maintains the portability of the module by sharing CAM parameters.
Adopting the proposed architecture, our CA-SSD achieves excellent performance with a slight reduction in speed compared with the conventional SSD. We report extensive experiments on the PASCAL VOC 2007 and PASCAL VOC 2012 datasets, demonstrating that our CA-SSD has greatly improved detection of small objects compared with the conventional SSD and outperforms many detectors based on ResNet-101, such as R-FCN [18] and Faster R-CNN [13]. In conclusion, our major contributions are as follows.
1) We investigate factors that influence the detection performance of the ResNet-based SSD algorithm. Using the newly designed structure of the stem block retains much detailed information for small-object detection.
2) We design the CA block, a novel and lightweight module. The computational cost is kept under control by simply replacing some components of ResNet with CA blocks without training them from scratch.
3) Quantitative and qualitative experiments demonstrate that our CA-SSD is better than the conventional SSD algorithm. Especially for small objects, our CA-SSD has competitive performance on common datasets (PASCAL VOC and MS COCO), at only a slightly reduced speed.

II. RELATED WORK

A. TWO-STAGE DETECTOR
The object detection method based on region proposals can be called the two-stage algorithm, where the detection problem is divided into two stages. A two-stage detector called the region-based CNN (R-CNN) was first introduced by Girshick et al. [26] by integrating AlexNet [11] and selective search [27]. The accuracy of this approach is far superior to that of traditional approaches, and the technique opened a new era of object detection in deep learning. Nevertheless, its low detection speed remained an urgent problem because R-CNN [26] needs to perform redundant feature calculations on a large number of overlapping proposals. He et al. subsequently suggested SPPNet [28] by incorporating spatial pyramid pooling (SPP) in the R-CNN [26] architecture, so that the classification module can reuse the convolutional network features regardless of the resolution of the input image, improving the detection speed. Fast R-CNN [29] is a further improvement of SPPNet [28], allowing the classification and regression tasks to be trained at the same time, but it still relies on external region proposals. Soon after, Faster R-CNN [17] emerged as the first detector that allows end-to-end training. Through the implementation of an effective and precise region proposal network (RPN) to produce region proposals, the various parts of object detection are integrated into a unified deep learning framework. In addition, effective extension methods have been proposed to further enhance prediction performance. ROI Align, proposed in Mask R-CNN [30], solves the problem of misalignment between the feature map and the original image caused by ROI pooling and improves object detection and instance segmentation. Libra R-CNN [31] integrates features of different levels more effectively through a balanced feature pyramid structure and uses an improved balanced L1 loss to make the detector achieve optimal convergence.
D2Det [32] introduced a novel dense local regression strategy to achieve precise localization of objects.

B. ONE-STAGE DETECTOR
Different from the two-stage algorithm, which has higher precision but a lower detection speed, the regression-based one-stage algorithm generates bounding boxes without external region proposals and predicts the probabilities of the object categories at the same time, which gives it an advantage in terms of speed. YOLO [19], [33] and the SSD [20] are the most representative one-stage detectors. YOLOv1 [19] divides an image into several grids to predict the bounding box at points near the center of the object instead of using anchor boxes. However, the recall of YOLOv1 is low, and YOLOv2 [33] thus employs anchor boxes and combines low-resolution and high-resolution feature maps via a passthrough layer to achieve recognition at different scales. The SSD [20] uses multiple feature maps of different sizes and makes each layer concentrate on detecting targets of a specific size. It uses two 3 × 3 convolutional layers to calculate the category confidences and position offsets of the default bounding boxes and then adopts non-maximum suppression (NMS) to filter redundant boxes from the final prediction results. The DSSD [22] adopts deconvolution to aggregate contextual information and enhances the high-level semantics of shallow features. RetinaNet [34] proposes a focal loss through the reconstruction of the standard cross-entropy loss, so that the detector pays more attention to samples that are hard to classify during training. The original lightweight backbones were replaced with the deeper ResNet-101 [13], and the performance became comparable to or better than that of the two-stage methods. However, the non-negligible increase in model complexity comes at the cost of their speed advantage.
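The greedy NMS filtering step mentioned above can be illustrated with a minimal pure-Python sketch. The 0.45 IoU threshold below is only an illustrative default (a value commonly used with SSD), not one stated in this paper:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop overlapping boxes, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

For example, two heavily overlapping predictions of the same object collapse to the single higher-scoring box, while a distant box survives.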
Recently, more and more scholars have paid attention to improving the performance of the one-stage detector because of its convenience and efficiency [35]- [37]. Many works have proved that it is possible to balance accuracy and speed through some ingenious designs, such as STDN [38], RFB [23], and EfficientDet [37]. In addition, [39], [40] have confirmed the importance of contextual information in visual tasks such as object detection. Here, we propose to improve the detection performance by optimizing the network structure and capturing contextual information.

III. METHOD
To improve the performance of conventional SSD algorithms, a single-shot detection algorithm based on CA is proposed. We first introduce the backbone network in this section, which is part of our feature extraction network. In using ResNet-34 as the backbone network to obtain information, we change the elements in ResNet-34 that affect the performance of object detection. We then describe a new CA block comprising CA modules that extract features containing rich contextual information, which are more favorable for identifying and positioning small objects. We finally describe the loss function and the network training details.

A. BACKBONE NETWORK
We use ResNet-34 to replace the VGG [12] of the original SSD paper [20] as our backbone network and explore the factors affecting the detection performance. The purpose is to overcome shortcomings in object detection while retaining powerful classification capabilities. Inspired by the DSOD [41], we replace the input 7 × 7 convolutional layer with a stack of three 3 × 3 convolutional layers. All three convolutions have 64 output channels and a stride of 1. Next is the max pooling layer, with a kernel size of 2 × 2 and a stride of 2. Different from the convolutional layers in the stem block of the DSOD, we remove the downsampling operation of the first convolutional layer, considering that downsampling in the first convolutional layer is likely to strongly influence the detection accuracy, especially for small targets; see Table 2. Compared with the original ResNet design, a convolutional layer with a kernel size of 7 × 7 and a stride of 2 followed by a max pooling layer with a kernel size of 3 × 3 and a stride of 2, this simple substitution not only reduces parameter calculations but also reduces the information loss of the original input image. Fig. 1 describes the architectural details of the CA-SSD model. To select feature maps with the same fixed spatial sizes as the SSD instead of using a larger feature map for detection, we apply the outputs of the last two CA blocks as two of the feature maps with which to predict locations and confidences, and add three additional residual blocks at the end of ResNet to extract 10 × 10, 5 × 5, and 3 × 3 feature maps. The residual block comprises two branches: one is a 3 × 3 convolutional layer with a stride of 2 followed by a 3 × 3 convolutional layer with a stride of 1, and the other is a 1 × 1 convolutional layer with a stride of 2.
Finally, the last feature map is extracted through a 1 × 1 convolutional layer and a 3 × 3 convolutional layer with a stride of 1. Each of these convolutional layers has 128 output channels.
Experimental results show that removing the downsampling operation of the stem block improves the performance of object detection (see Table 2).
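The effect of removing the stem's first downsampling can be checked with simple output-size arithmetic. The sketch below compares the spatial resolution after the original ResNet stem and after our modified stem for a 300 × 300 input; the padding values are assumptions following the standard ResNet and DSOD settings:

```python
def out_size(size, kernel, stride, padding):
    # spatial size after one conv/pool layer (floor convention)
    return (size + 2 * padding - kernel) // stride + 1

def original_resnet_stem(size):
    size = out_size(size, 7, 2, 3)   # 7x7 conv, stride 2 (padding 3 assumed)
    return out_size(size, 3, 2, 1)   # 3x3 max pool, stride 2 (padding 1 assumed)

def modified_stem(size):
    for _ in range(3):               # three 3x3 convs, stride 1 (padding 1 assumed)
        size = out_size(size, 3, 1, 1)
    return out_size(size, 2, 2, 0)   # 2x2 max pool, stride 2

print(original_resnet_stem(300))  # 75
print(modified_stem(300))         # 150
```

The modified stem thus hands the first residual stage a 150 × 150 map instead of a 75 × 75 one, preserving spatial detail for small objects.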

B. CA BLOCK
Previous work has demonstrated that contextual information is crucial in object detection. We design a CA block to effectively capture the global contextual information. In this part, the specific components of the CA block, improved on the basis of the residual block, are first introduced. We then introduce the CA module that acquires horizontal and vertical contextual information. We finally introduce how to adopt a cyclic operation in a CA block so that each pixel captures full-image dependencies.

FIGURE 1. The architecture of the proposed CA-SSD. We replaced the SSD backbone network VGG-16 with ResNet-34. The yellow module is the stem block that we modified, removing the downsampling of the first convolutional layer. The green module is the CA block that we designed. The specific structure is shown in Fig. 2.

The overall structure of the CA block is shown in Fig. 2. First, we split the input feature map I ∈ R^{C×H×W} into two parts x_1 and x_2 through two 1 × 1 convolutional layers, where C is the number of channels of the input feature map. A feature map Q is acquired by passing x_2 through a 3 × 3 convolutional layer. We input Q into the CAM to obtain a new attention map Q′ ∈ R^{Ĉ×H×W}. The number of channels of Q and Q′ is C/2, which we denote Ĉ in the following. Each pixel of Q′ aggregates the contextual information on the crossing path where it is located. However, the attention map Q′ only acquires sparse information. Therefore, Q′ is input into the CAM again, so that the output feature Q″ obtains richer contextual information; that is, each position on the feature map Q″ aggregates global contextual information. At this time, the output obtained by passing x_1 through a 3 × 3 convolutional layer is concatenated with Q″. This allows us to collect both contextual information and short-range information. The concatenated feature then passes through a convolution operation with a kernel size of 3 × 3. Finally, the final feature map is output through the shortcut connection. We name this structure the CA block.
To make calculations and memory more portable and use local features to establish full-image dependencies, inspired by the scene parsing method [42], we introduce a CA module that acquires horizontal and vertical contextual information. As shown in Fig. 3, we suppose the input tensor of the CAM is Q. After the input feature map passes through a horizontal or vertical pooling layer, where the average value of the elements in the pooling kernel is used as the output value, two feature maps q_h ∈ R^{Ĉ×H} and q_w ∈ R^{Ĉ×W} are generated. The module then applies convolutional layers with kernel sizes of 3 × 1 and 1 × 3 on the two output feature maps q_h and q_w respectively and expands the two output feature maps in the horizontal and vertical directions. After the expansion, the two feature maps are of the same size, and their element-wise sum is q ∈ R^{Ĉ×H×W}:

q = T(φ(q_h)) + T(φ(q_w)), (1)

where T(·) represents expanding the feature maps to the same size and φ(·) represents a convolution operation with a kernel size of 3 × 1 or 1 × 3. The output of the CAM is then described as

Q′ = θ(Q, σ(g(q))), (2)

where θ(·, ·) is element-wise multiplication, σ(·) represents the sigmoid function, and g(·) denotes a 1 × 1 convolution. However, a single CA module only acquires contextual information on the crossing path where each pixel is located and has no connection with the remaining locations, so some key details are lost. We therefore loop the CA module twice to obtain a new feature map with rich and useful contextual information. As shown in Fig. 4, any point t on Q″ is associated with the feature points on Q′ corresponding to its row and column, and each feature point on Q′ is associated with Q in the same way. The two CA modules share the same parameters. In this way, the point t on Q″ establishes a relation with the global information on Q.
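A simplified sketch of the CAM and its cyclic repetition follows. For readability, the learned layers (the 3 × 1 / 1 × 3 convolutions φ and the 1 × 1 convolution g) and the channel reduction are omitted, so only the strip pooling, expansion, sigmoid gating, and twice-looped application are illustrated:

```python
import math

def cam(Q):
    """Simplified cyclic attention module (sketch).

    Q is a channel-first feature map given as a list of C matrices of
    shape H x W. The learned convolutions of the full module are omitted.
    """
    out = []
    for ch in Q:
        H, W = len(ch), len(ch[0])
        # horizontal pooling: a long 1 x W average kernel -> one value per row
        q_h = [sum(row) / W for row in ch]
        # vertical pooling: an H x 1 average kernel -> one value per column
        q_w = [sum(ch[i][j] for i in range(H)) / H for j in range(W)]
        # expand both strips back to H x W, sum element-wise, then gate the
        # input with the sigmoid: theta(Q, sigma(q))
        gated = [[ch[i][j] / (1.0 + math.exp(-(q_h[i] + q_w[j])))
                  for j in range(W)] for i in range(H)]
        out.append(gated)
    return out

def cyclic_attention(Q):
    # looping the parameter-sharing CAM twice lets every pixel aggregate
    # information beyond its own row/column cross
    return cam(cam(Q))
```

Because the two passes share parameters in the actual block, the loop adds no parameters; here the same function is simply applied twice.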
Compared with a non-local module [43] having similar functions, our CA block requires far fewer calculations and parameters to establish relationships between each pair of locations. The CA block can replace building blocks in ResNet without training them from scratch. We replaced the last residual block of each stage in ResNet-34 with a CA block.
The experimental data demonstrate that our CA-SSD has great performance in terms of both the speed and accuracy of object recognition (see Table 5).

C. TRAINING

1) LOSS FUNCTION
The loss function of our CA-SSD is a combination of the classification loss function and the localization loss function, and is expressed as

L(y, p, b, g) = (1/N_Pos) (L_cls(y, p) + λ L_loc(y, b, g)), (y = y^k_ij ∈ {0, 1}), (3)

where L_cls and L_loc are the classification loss function and localization loss function respectively, and the number of positive samples is N_Pos. If N_Pos = 0, the loss value defaults to zero. When y^k_ij = 1, the i-th default box matches the j-th ground truth box of category k. p represents the predicted category scores. The predicted location of the bounding box corresponding to a default box is defined as b, and the ground truth position is represented by g.
The classification loss function for multiple categories is calculated as

L_cls(y, p) = − Σ_{i∈Pos} y^k_ij log(p̂^k_i) − Σ_{i∈Neg} log(p̂^0_i), (4)

where p̂^k_i = exp(p^k_i) / Σ_k exp(p^k_i). Similar to Faster R-CNN [17], the localization loss function L_loc is a smooth L1 loss between the predicted position and the ground truth position. The center point coordinates of the two sets of boxes are calculated, and then bounding box regression is applied.
The localization loss function is expressed as

L_loc(y, b, g) = Σ_{i∈Pos} Σ_{m∈{cx,cy,w,h}} y^k_ij smooth_L1(b^m_i − ĝ^m_j), (5)

where ĝ denotes the encoded ground truth box. The weight coefficient λ is set to 1 through cross validation.
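The combination above can be sketched numerically. The helpers below implement the smooth L1 kernel and the normalized weighted sum of the classification and localization terms; the matching, encoding, and per-box summations are omitted for brevity:

```python
def smooth_l1(x):
    # smooth L1 kernel: quadratic near zero, linear elsewhere
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def multibox_loss(cls_loss, loc_loss, n_pos, lam=1.0):
    # combined objective; defaults to zero when no default box matches
    if n_pos == 0:
        return 0.0
    return (cls_loss + lam * loc_loss) / n_pos
```

With λ = 1, the two loss terms are simply averaged over the positive matches.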

2) TRAINING SETTINGS
A single NVidia TITAN Xp GPU with 12 GB of memory is used to train all models. Most of the training strategies of the improved algorithm proposed in this paper follow the SSD [20]; e.g., adopting a matching strategy, setting the scale for default boxes, and encoding the bounding boxes by using the center code type. Notably, in detecting small objects, the random expansion augmentation included in the latest SSD has been verified to be exceedingly helpful. We adopt the same data augmentation strategy as the SSD during training to make the model more stable against different shapes and sizes of the input objects and to compare with the conventional SSD fairly. A hard negative mining strategy is applied to ensure fast optimization and more stable training. More details can be found in [20]. The experimental section below provides more parameter settings, such as the batch size and learning rate.

IV. EXPERIMENTS
We compare our approach proposed in this research with advanced detectors to objectively appraise whether our approach realistically enhances the precision of small-object detection. After introducing the experimental settings and implementation details, we evaluate quantitative indicators on the PASCAL VOC and MS COCO datasets.
In the experiment, we use stochastic gradient descent with a weight decay of 0.0005 and momentum of 0.9 as the optimizer. Different datasets have different learning rate decay strategies, which we introduce separately later. For a 300 × 300 input, the batch size is set to 16 considering the GPU specifications. It is worth noting that we apply the additional aspect ratios [2.0, 3.0] in the 38 × 38 feature map, and the number of anchor boxes thus increases to 11,620. We remove the L2 normalization [44]. The ''xavier'' method [45] is used to initialize all new convolutional layers, and the mean average precision (mAP) is used as the metric for evaluating the detection performance of the model. In the following sections, the experimental process and evaluation outcomes are described in detail.
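The figure of 11,620 anchor boxes can be reproduced with per-map arithmetic. In the sketch below, the box counts of the five deeper maps are assumed to follow the original SSD300 configuration; only the 38 × 38 map is changed from 4 to 6 boxes per location by the added aspect ratios:

```python
# (feature-map size, default boxes per location) for a 300 x 300 input;
# the deeper-map counts are assumed from the original SSD300 design
feature_maps = [(38, 6), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total_anchors = sum(size * size * boxes for size, boxes in feature_maps)
print(total_anchors)  # 11620
```

With 4 boxes per location on the 38 × 38 map instead, the same sum gives the original SSD300 total of 8732, so the added ratios contribute 2888 extra boxes.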

A. PASCAL VOC 2007
The VOC 2007 and 2012 datasets cover a total of 20 object categories. They contain 9963 and 22,531 images respectively, and each image is annotated with the ground truth positions and corresponding category information. In this experiment, we train the models on the union of VOC 2007 trainval and VOC 2012 trainval and test on the VOC 2007 test set. We use a ''warm-up'' strategy in the first five epochs to gradually increase the learning rate from 10^−6 to 0.004 and then use a fixed learning rate of 0.004 until epoch 150. From epoch 150 to epoch 200, the learning rate is reduced to 0.0004, and in the last 50 epochs, it is 0.00004.
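The schedule above can be written as a small helper. The ramp shape during warm-up is an assumption (the text states only the endpoints); the step decays follow the stated epochs:

```python
def voc_learning_rate(epoch, base_lr=0.004, warmup_epochs=5):
    # linear warm-up from 1e-6 to base_lr over the first five epochs
    # (linearity assumed; only the endpoints are stated)
    if epoch < warmup_epochs:
        return 1e-6 + (base_lr - 1e-6) * epoch / (warmup_epochs - 1)
    if epoch < 150:
        return base_lr           # fixed at 0.004 until epoch 150
    if epoch < 200:
        return base_lr * 0.1     # 0.0004 for epochs 150-199
    return base_lr * 0.01        # 0.00004 for the last 50 epochs
```

Such a warm-up avoids the divergence that a large initial learning rate can cause early in training.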
The evaluation results of competitive detectors and our method, measured in terms of the prediction performance for each object category on the VOC 2007 test set, are shown in Table 1. All object detection approaches are trained and tested on consistent datasets. SSD300* and SSD512* refer to the latest SSD results obtained by applying the new expansion data augmentation step, where the numbers 300 and 512 represent the input size. These mAP results are respectively 1.1% and 3.1% higher than those of Faster R-CNN using ResNet-101 as the base network (77.5% vs. 76.4% and 79.5% vs. 76.4%). However, they are lower than those of other algorithms that use ResNet-101 as the base network; e.g., DSSD321 (78.6% mAP) and DSSD513 (81.5% mAP), which have input sizes similar to those of the SSD, and R-FCN (80.5% mAP). Although both the DSSD and R-FCN have improved accuracy compared with the SSD, the excessively deep backbone network undoubtedly limits the detection speed. The STDN [38] enhances the efficiency of object detection by embedding a scale-transfer module into DenseNet with almost no increase in running time, but its mAP is 0.6% lower than that of DSSD513 (80.9% vs. 81.5%). The RFB, through a multi-branch convolutional block that enhances the feature representation of a lightweight network, can even reach the accuracy of the advanced two-stage model R-FCN (80.5% mAP), and it is much superior to other methods that try to include contextual information, such as ION [25] and MR-CNN [46]. However, when our CA-SSD300 uses ResNet-34-CA as the backbone network, it attains 82.5% mAP, which is 5%, 3.9%, and 2% higher than the mAP results of SSD300, DSSD321, and RFB300, respectively. The result is even better than those of models with larger input sizes, such as SSD512, STDN513, DSSD513, and the Retina640 implemented by us; the mAP of our CA-SSD300 is respectively 3%, 1.6%, 1%, and 3.4% higher than the mAP results of these methods.
Moreover, CA-SSD300 shows great improvement in small-object detection. It achieves the highest ranking in the categories of aero, bird, boat, chair, cow, person, plant, and sheep. Compared with the conventional SSD in particular, the performance is improved for 19 categories, the exception being the train category. We show the precision-recall curves of CA-SSD300 on the PASCAL VOC 2007 test set in Fig. 5. Experimental results show that the mAP of our CA-SSD512 is only 0.7% higher than that of CA-SSD300. We attribute this partly to hardware limitations: when training CA-SSD512, we can use a batch size of at most 8. In addition, with more multi-scale features extracted, it is difficult for the loss function to reach the optimum, which greatly affects the detection performance. Nevertheless, our algorithm still achieves the best performance. These results indicate that the weaknesses of the SSD in small-object detection are effectively addressed by our proposed CA-SSD model.

B. ABLATION STUDY ON VOC 2007
The ablation experiments on VOC 2007 were designed and their evaluations recorded in Table 2 to prove the effectiveness of each constituent of CA-SSD. The term ''w/o downsampling'' means deleting the downsampling operation in the first convolutional layer of the stem block. The CA block is as described in Section III.

FIGURE 6. Comparison of detection examples of the SSD300 and our designed CA-SSD300 on the PASCAL VOC 2012 test set. We display the results using a score threshold of 0.6. Different box colors represent different categories. For each pair of pictures, the left side shows the recognition of the SSD, and the right side shows the recognition of our CA-SSD300.
In the second row of Table 2, the original backbone network and additional layers in the SSD algorithm with an input size of 300 × 300 are replaced with ResNet-34 and residual blocks. This method has 75.0% mAP, which is worse than the result obtained using VGG16.
To improve the prediction results, we replace the first convolutional layer and max-pooling layer in ResNet-34 with a stem block. The first convolutional layer applies a relatively large kernel size of 7 × 7 with a stride of 2. The results in the second and third rows of Table 2 reveal that the stem block improves the detection accuracy from 75.0% to 76.5%. We believe that this improvement arises because the input undergoes two consecutive downsampling operations (a convolution and max pooling, both with a stride of 2) in ResNet, which loses too much information and thereby impairs the performance of the detector. Although the results are improved, the mAP of ResNet-34 is still 1.0% lower (76.5% vs. 77.5%) than that of the SSD algorithm based on VGG16 at this point.
Comparing with the structure of VGG16, we find that ResNet-34 performs a downsampling operation in the first convolutional layer, which leads to a considerable loss of detailed information and greatly affects the detection performance, especially the recognition of small objects. The fourth row of Table 2 shows that after eliminating the downsampling operation of the first convolution, the performance of our detector reaches 81.5% mAP, an astonishing 5% mAP improvement over the previous modification (76.5% mAP). The experiments show that the downsampling step of the first convolution hinders better detection performance. When training the ResNet-based SSD, we need to delete this operation to avoid losing much important information from the original input image.
In the fifth row of Table 2, we replace the last building block of each stage in ResNet-34 with CA blocks to construct the backbone network, namely ResNet-34-CA, and ResNet-34-CA is used as the basic network for feature extraction. When other components are the same, the mAP of CA-SSD is 82.5%, which is 1% higher than that in the fourth row of Table 2 and 5% higher than that for the SSD. At the same time, fewer parameters are required for calculation.
In the last experiment listed in Table 2, we compare the performance of the CAM in our proposed CA block with that of recurrent criss-cross attention (RCCA) [48], which has a similar function and superior performance in semantic segmentation. We replace the two CA modules in Fig. 2 with two criss-cross attention modules. Reference [48] pointed out that RCCA with two loops (R = 2) better balances efficiency and resource usage. The experimental results show that although RCCA improves the performance of the algorithm, the improvement is only 0.5%, and RCCA greatly slows the detection speed (17.6 vs. 33.5 FPS). This verifies the effectiveness of our CA module design.

C. PASCAL VOC 2012

Table 3 shows the accuracy of CA-SSD and the experimental results of advanced detection methods. CA-SSD300 attains 78.4% mAP, which exceeds the precision of some one-stage methods with similar input sizes, such as SSD300-VGG16 (75.8%; 2.6% higher mAP) and DSSD321-ResNet101 (76.3%; 2.1% higher mAP). CA-SSD512 is 0.5% higher than DSSD513-ResNet101. Simultaneously, CA-SSD300 maintains detection performance superior to that of the two-stage detector R-FCN, although R-FCN has an input size of approximately 1000 × 600 (77.6%; 0.8% higher mAP).
To evaluate our method further in an intuitive manner, we qualitatively compare it with the conventional SSD. For an input size of 300 × 300, recognition examples for the SSD and CA-SSD models are displayed in Fig. 6. The detection results clearly show that our CA-SSD model is more capable of detecting and identifying small-scale targets and some classes in different contexts.

D. MS COCO
To further verify the effectiveness of our algorithm, we train the model on the more challenging MS COCO dataset, which contains 80 categories. We use the trainval35k set for training and report results on test-dev with the standard COCO metrics.
Compared with PASCAL VOC, COCO contains more small objects and more objects per image, which makes detection more difficult.
Therefore, during training we retain the SSD strategy of reducing the default box sizes when training on COCO. At the beginning of training, we again apply the ''warm-up'' strategy, increasing the learning rate from 10⁻⁶ to 0.002 over the first 500 iterations; we then divide it by 10 at epochs 80 and 110 and end training at epoch 125.
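The schedule just described can be written as a small helper. This is a sketch assuming a per-iteration linear warm-up and epoch-level milestones; the function name and argument layout are illustrative, not taken from the paper's code.

```python
def learning_rate(it, epoch, warmup_iters=500,
                  base_lr=2e-3, warmup_start=1e-6):
    """Learning-rate schedule described in the text: linear warm-up
    from 1e-6 to 0.002 over the first 500 iterations, then divide
    by 10 at epochs 80 and 110 (training ends at epoch 125)."""
    if it < warmup_iters:
        # linear interpolation between warmup_start and base_lr
        frac = it / warmup_iters
        return warmup_start + frac * (base_lr - warmup_start)
    lr = base_lr
    for milestone in (80, 110):
        if epoch >= milestone:
            lr /= 10.0
    return lr
```

For example, the rate is 10⁻⁶ at iteration 0, reaches 0.002 once warm-up ends, and drops to 2 × 10⁻⁴ and 2 × 10⁻⁵ after the two milestones.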
The test results on MS COCO are shown in Table 4. SSD300* already outperforms the two-stage detectors Faster R-CNN and ION by 3.2% and 1.5%, respectively. Although SSD512* improves on the original result by 3.7%, it still trails R-FCN. Our CA-SSD300 attains 32.7% mAP on test-dev, surpassing not only the traditional SSD but also R-FCN. We can also see that CA-SSD300 achieves the best performance at every IoU threshold, which means our algorithm has stronger detection and recognition capabilities than the other algorithms, even those with larger input sizes. Similarly, for small-object detection, CA-SSD300 is 6.6%, 5.3%, 5.8%, and 3.8% higher than SSD300, STDN321, DSSD321, and DSOD300, respectively, an advantage that cannot be ignored. These results prove that the small-object detection performance of our algorithm surpasses that of algorithms such as DSSD321.

FIGURE 7. Comparison of detection examples from SSD300 and our CA-SSD300 on COCO test-dev, shown with a score threshold of 0.6. Box colors denote categories. In each pair of pictures, the left side shows the SSD result and the right side shows the CA-SSD300 result.
We show examples of object detection results on test-dev in Fig. 7. As the figure shows, benefiting from the CA blocks, CA-SSD still identifies more small objects even on the COCO dataset.
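The COCO metric cited above averages AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05, so a detection counts as correct only if its overlap with a ground-truth box clears the threshold under evaluation. A minimal sketch of the IoU test that underlies each threshold (box format and function name are illustrative):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x1, y1, x2, y2). Under the COCO protocol, a detection matches
    a ground-truth box at threshold t only when iou(...) >= t."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# The ten COCO evaluation thresholds: 0.50, 0.55, ..., 0.95.
COCO_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]
```

A detection that is "good enough" at IoU 0.5 may fail at 0.75, which is why leading at every threshold is a stronger claim than leading at 0.5 alone.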

E. COMPARISON OF SPEED AND ACCURACY
A batch size of 1 was used on a single Titan Xp to test the speed of CA-SSD. CA-SSD300 and CA-SSD512 run at 33.5 and 19.6 FPS, respectively. The inference speed of the various models is shown in the fifth column of Table 5. The FPS and mAP values in Table 5 are plotted in Fig. 8 as the abscissa and ordinate, respectively, so that the speed-accuracy trade-offs of the various algorithms can be compared more clearly. The figure shows that our CA-SSD300 is faster than most object detection methods while achieving the best performance.
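A timing protocol consistent with the setup above (batch size 1, average over repeated forward passes) can be sketched as follows. The helper name, warm-up count, and run count are illustrative assumptions, not the paper's benchmarking code; `run_inference` stands in for one forward pass of any model.

```python
import time

def measure_fps(run_inference, n_warmup=10, n_runs=100):
    """Estimate single-image (batch size 1) inference speed in FPS.

    `run_inference` is any zero-argument callable performing one
    forward pass. Warm-up iterations are executed first and excluded
    from timing, so one-off setup costs (allocation, caching) do not
    distort the average.
    """
    for _ in range(n_warmup):
        run_inference()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_inference()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed  # frames per second
```

For GPU models, the callable would also need to synchronize the device before the timer stops so that asynchronous kernels are fully counted.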

V. CONCLUSION
We proposed a single-shot detection algorithm based on cyclic attention (CA-SSD). Instead of the common practice of deepening the backbone network to improve detection performance, we proposed a novel, lightweight module, the CA block, and combined ResNet-34 with CA blocks to effectively capture contextual and detailed information. We also explored factors that affect the recognition accuracy of ResNet-based SSD algorithms and redesigned the stem block, which greatly improved detection accuracy. We compared the proposed CA-SSD with many competitive methods through a series of quantitative and qualitative experiments. The results strongly show that CA-SSD300 outperforms the conventional SSD framework, achieving 82.5% mAP on VOC 2007 test and 32.7% mAP on COCO test-dev, with excellent detection results for small objects and context-specific objects. At the same time, CA-SSD300 runs at 33.5 frames per second on a single Titan Xp GPU, a speed comparable to that of other real-time detectors.