Transformed Dynamic Feature Pyramid for Small Object Detection

The low resolution and limited feature information of small targets make them difficult to recognize and locate, which greatly hinders the improvement of object detection accuracy. In this paper, an object detection model (TDFP) based on CNN and transformer is established, which combines local and global context to establish the connection between features. In the proposed transformed dynamic feature pyramid network, a transformer module is designed to dynamically transform and fuse the multi-scale features generated by the backbone, producing a transformed feature pyramid with richer multi-scale features and context information. In this transformation process, a gate block dynamically selects single-scale or cross-scale transformation to achieve the best form of transformation and fusion of multi-scale features. The experimental results show that the model improves the small-target detection accuracy of detectors based on CNN and transformer. With ResNeXt-101 as the backbone, TDFP achieves 46.2% AP and 26.3% AP_S on MS COCO, and it takes the amount of computation as a loss constraint to achieve a better balance between detection accuracy and computational complexity.


I. INTRODUCTION
In recent decades, object detection methods based on convolutional neural networks (CNNs) [1]-[4] have made great achievements. However, the low detection accuracy of small targets remains a difficult problem that hinders further improvement of object detection accuracy. Researchers have therefore proposed various solutions, such as better multi-scale feature fusion methods [6]-[10], richer context information [11]-[14], appropriate training methods [15], and denser anchor sampling and matching strategies [16]-[20]. Most of these methods depend on CNNs and preset anchor boxes. However, the long-distance dependence between objects in an image is very important in visual tasks. For image data, a CNN can only capture the long-distance dependence between targets through the large receptive field generated by repeated convolution operations [21], [22], which leads to complex computation.
The associate editor coordinating the review of this manuscript and approving it for publication was Mehul S. Raval .
In recent years, it has been found that self-attention [23] and non-local [24] operations can capture the interaction between targets. Compared with CNN, self-attention in the transformer can mine long-distance dependence between targets, is not limited by the inductive bias of local interaction, and has strong expressive ability. Therefore, the transformer has been extended to various computer vision tasks, such as classification [25]-[27], object detection [28]-[33] and segmentation [34], [35], where it obtains global information through self-attention. However, compared with CNN-based two-stage and one-stage detectors, transformer-based methods are at a slight disadvantage in detection accuracy. Convolution has translation invariance and local sensitivity, but it lacks overall perception and macro understanding of the image. The transformer can be used in a convolutional network to learn the global features of images. However, for high-resolution input the self-attention layer is computationally expensive, so it is better suited to inputs with smaller spatial dimensions. It is therefore worth further research to optimize the network based on the advantages of both CNN and transformer.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this study, we use convolution to learn the visual parts with rich local context efficiently, and then use the transformer to learn global context information. Based on CNN and transformer, a new detection model (Transformed Dynamic Feature Pyramid, TDFP) is proposed, whose core is a transformed dynamic feature pyramid network. In this network, a transformer module is designed. After the backbone generates multi-scale features, a better multi-scale feature fusion mode is realized by dynamically selecting between cross-scale and single-scale transformation via a gate block, and by capturing the local and global context information to establish the relationship between targets. A transformed feature pyramid with richer multi-scale features and context information is thus generated to alleviate the small-target problem. In addition, to reduce computation, we take the computation cost as a constraint loss to achieve the optimal balance between detection accuracy and computation.
The detection method proposed in this paper has the following advantages:
(1) Compared with previous CNN-based and transformer-based detection methods, richer multi-scale features and context information can be obtained.
(2) Through dynamic feature transformation and fusion, our model obtains richer multi-scale features and context information, and the detection accuracy of small targets based on CNN and transformer is greatly improved.
(3) The computation cost is used as a loss constraint to achieve the optimal balance between detection accuracy and computational complexity.

II. RELATED WORK
A. CNN-BASED DETECTORS
Two-stage detectors based on CNN, namely RCNN [36] and its variants [18], [37], [38], solved the problems of traditional detectors with hand-designed features, such as many steps, high time complexity, window redundancy and poor detection accuracy [39]. They achieve high detection accuracy but lack real-time performance.
YOLO [40] and its variants [41]-[43], and SSD [44] and its variants [16], [45], [46], avoid the use of RPN and realize real end-to-end detection. Some of these networks can achieve real-time detection while maintaining high detection accuracy, but the detection accuracy of most one-stage detectors is lower than that of two-stage detectors. Scaled-YOLOv4 [41] proposed a network scaling approach that modifies not only the depth, width and resolution but also the structure of the network. YOLOR [42] proposed a unified network that encodes implicit and explicit knowledge together, generating a unified representation that serves various tasks simultaneously and benefits the performance of all of them. RetinaNet [47] proposed the focal loss to address class imbalance and improve detection accuracy. Lu et al. [48] proposed a novel and effective framework, MimicDet, which has a shared backbone for one-stage and two-stage detectors and then branches into two heads designed to have compatible features for mimicking, so that a detector is trained by directly imitating the two-stage features. However, most of the above detectors rely on manually set anchor boxes to achieve the detection task. The anchor setting involves many parameters and complex computation. The final performance of the model is sensitive to the anchor boxes, so the robustness of the model is poor.
In recent years, center-based methods [49]-[51] and keypoint-based methods [52]-[54] have eliminated the use of anchors, but their detection accuracy is low. ATSS [55] showed that the essential difference between anchor-based and anchor-free detectors is how positive and negative training samples are defined, and proposed an adaptive training sample selection approach that automatically selects positive and negative training samples according to the statistical characteristics of the targets, which improves the performance of detectors.
Recently, SpineNet was proposed in [3], a backbone with scale-permuted intermediate features and cross-scale connections that was learned on an object detection task by Neural Architecture Search (NAS). The learned scale-permuted model outperforms ResNet-50-FPN by 2.9% AP in the object detection task. The efficiency can be further improved (−10% FLOPs) by adding search options to adjust the scale and type of each candidate feature block. Cascade RCNN-RS [37] provided simple scaling strategies to generate a family of models that form two Pareto curves, named RetinaNet-RS and Cascade RCNN-RS. These simple rescaled detectors explore the speed-accuracy trade-off between the one-stage RetinaNet detectors and two-stage RCNN detectors. The authors identified the key architectural changes, training methods and inference methods that significantly improve object detection and instance segmentation systems in speed and accuracy. Zhou et al. [56] developed a probabilistic interpretation of two-stage object detection, which motivates a number of common empirical training practices. They presented a simple modification of standard two-stage detector training that optimizes a lower bound to a joint probabilistic objective over both stages. The resulting detectors are faster and more accurate than both their one-stage and two-stage precursors.

B. TRANSFORMER-BASED DETECTORS
In research on applying the transformer to computer vision tasks, Cordonnier et al. [57] proposed that the self-attention layer can achieve the same effect as the convolution layer while reducing the computational complexity, and can therefore replace the convolution layer. Transformer-based methods can be divided into two groups [57]:
(1) The vanilla transformer replaces the convolutional neural network to achieve visual tasks [26], [27]. Beal et al. [61] used ViT [26] as the backbone network, combined with a prediction head to achieve the final detection. The detection of large targets is good, but that of small targets is poor. Therefore, the use of the vanilla transformer still needs further research.
(2) The transformer is combined with CNN. DETR [28] was the first to combine the transformer with CNN for object detection and achieved SOTA performance. It simplifies the detection pipeline, regards object detection as an unordered set prediction problem, and enforces unique predictions through bipartite matching. However, the bipartite matching between the transformer decoder and the Hungarian loss is unstable, which leads to slow convergence and poor detection of small targets. The FPT proposed by Tong et al. [39] applies the idea of the transformer to the transformation of the feature pyramid [60]. Three specially designed transformers are used to transform any feature pyramid into another feature pyramid of the same size but with richer context, in a top-down and bottom-up interactive way, so as to alleviate the small-target problem.
To solve the convergence problem of DETR, Deformable DETR [30] uses a deformable attention module instead of the original multi-head attention to focus on a small group of key positions around a reference point. Sun et al. [32] proposed an encoder-only version of DETR, designed a new bipartite matching scheme to achieve more stable training and faster convergence, and proposed two transformer-based set prediction models, TSP-FCOS and TSP-RCNN, which perform better than the original DETR model and greatly improve the detection accuracy and training convergence.
To address the high computational complexity of DETR, Srinivas et al. [33] proposed an adaptive clustering transformer (ACT) to reduce the computational cost of a pre-trained DETR without any training process. LeCun et al. [22] use global self-attention only to replace the last three bottlenecks of ResNet [1], which significantly improves on the baseline in instance segmentation and object detection while reducing the number of parameters and minimizing latency.
Recently, a hierarchical transformer was constructed in [62], which introduces the idea of locality and calculates self-attention [23] within non-overlapping windows, greatly reducing the computational complexity and improving the detection accuracy. Yang et al. [63] presented focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions. They also proposed a new variant of Vision Transformer models with focal self-attention, called Focal Transformer, which achieves superior performance over the state-of-the-art vision transformers [27] on a range of public image classification and object detection benchmarks. Meanwhile, Dai et al. [64] presented a novel dynamic head framework to unify object detection heads with attention. The proposed approach significantly improves the representation ability of object detection heads without any computational overhead by coherently combining multiple self-attention mechanisms between feature levels for scale-awareness, among spatial locations for spatial-awareness, and within output channels for task-awareness.

C. MULTI-SCALE FEATURE FUSION
For the fusion of multi-scale features, the most direct method is to add multi-scale features [60], [65]. FPN [60] was the first to propose a feature pyramid with top-down and horizontal connections to solve multi-scale problems, especially the small-target problem. PANet [65] adds a bottom-up path to FPN. Different multi-scale feature fusion methods are used to generate a better feature pyramid [6]-[8]. PFPNet [6] constructs the feature pyramid by widening the network instead of increasing the network depth. AugFPN [7] considers the differences between features of different scales and uses an adaptive feature fusion method to add multi-scale features with weights. A new FPN reconfiguration architecture [8] was proposed, which can aggregate task-oriented features across different spatial locations and scales.
Another method is to connect multi-scale features along the channel direction [10], [66]. In addition, some studies [23], [24], [67] consider the information interaction within the same scale features.

III. METHODOLOGY
A. OVERALL ARCHITECTURE
The network architecture proposed in this paper mainly includes three parts:
(1) The backbone. The pre-trained ResNet is used as the backbone to extract the multi-scale feature maps {a, b, c} of the input image. The size of the input image is 800 × 1000. The sizes and down-sampling rates of {a, b, c} are 25 × 32 at rate 32, 50 × 63 at rate 16 and 100 × 125 at rate 8, respectively.
(2) The transformed dynamic feature pyramid network. {a, b, c} are transformed, via the designed transformer module, into a transformed feature pyramid with richer multi-scale features and context information to alleviate the small-target problem. The details are given in Sections 3.2-3.4.
(3) The head network. On top of the transformed feature pyramid, Fast RCNN [38] is used as the head network to implement the detection task. To enhance the generalization ability of the model and avoid overfitting, DropBlock [68] is applied to each output feature map. The block size is 5 and the feature retention probability is 0.9.

B. TRANSFORMED DYNAMIC FEATURE PYRAMID NETWORK
The transformed dynamic feature pyramid network is the core part of the detection model proposed in this paper, as shown in Figure 2. It contains three parts:
(1) The multi-scale features {a, b, c}, which are generated by the backbone.
(2) The transformer module, which uses a gate block to dynamically change the feature transformation and fusion methods to achieve better feature fusion.
(3) The transformed feature pyramid with local and global information, in which more abundant multi-scale features and context information are aggregated via the transformer module.
The transformer module consists of single-scale transformation, cross-scale transformation and the gate block. There is a semantic gap between multi-scale features, so it is difficult to fuse the features well and establish the relationship between them if the horizontal connection and top-down fusion are carried out directly. At the same time, after multiple down-sampling steps the features lose low-level spatial location information, which is harmful to small-target detection. Therefore, the idea of the transformer is applied to the above multi-scale features, and the gate block dynamically selects the transformation and fusion methods, so as to capture the global context information of multi-scale features, establish the relationship between features, and build a feature pyramid with richer multi-scale feature information. The details of the transformation are described in Section 3.3.
To achieve better multi-scale feature fusion, the gate block selects whether to carry out cross-scale transformation and fusion when computing the features b_t, so as to capture the relationship between features of different scales. To explain the process more clearly, the transformation of one layer of features is described, taking the transformation and fusion that generate b_t as an example. The transformation between features of the same color is a same-scale transformation, and the transformation between features of different colors is a transformation across different sizes. The feature maps obtained from the above connection are processed by a 1 × 1 convolution, which reduces the feature dimension to 256, and the transformed feature b_t is obtained by adding the result to the 2× up-sampled a_t.
The above operations are performed on each of the features {a, b, c}, and a top-down path is added between the resulting features to obtain the transformed feature pyramid {a_t, b_t, c_t}. Relative to the input image, the down-sampling rates of the pyramid levels from bottom to top are 8, 16 and 32, and every level has 256 channels.
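The fusion step described above can be illustrated with a small numerical sketch. It is only an illustration under stated assumptions: the input to the 1 × 1 convolution is assumed to have 512 channels, nearest-neighbour interpolation is assumed for the 2× up-sampling, the spatial sizes are toy values, and the function names are hypothetical rather than taken from the paper.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution on a (C_in, H, W) map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) map (assumed interpolation)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_level(connected_feat, coarser_t, weight):
    """Reduce the connected features to 256 channels, then add the 2x up-sampled coarser level."""
    reduced = conv1x1(connected_feat, weight)
    return reduced + upsample2x(coarser_t)

rng = np.random.default_rng(0)
a_t = rng.standard_normal((256, 4, 4))       # coarser transformed level (toy size)
b_in = rng.standard_normal((512, 8, 8))      # connected features for level b (assumed 512 channels)
w = rng.standard_normal((256, 512)) * 0.01   # hypothetical 1x1 convolution weights
b_t = fuse_level(b_in, a_t, w)               # transformed level with 256 channels
```

With real feature maps whose sizes are not exact multiples of each other, the up-sampled map would additionally have to be cropped or interpolated to the target size.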

2) SELF-ATTENTION
Self-attention is the core idea of the transformer. The input of a self-attention layer is a feature map, and the attention weight between each pair of features is calculated to obtain an updated feature map, in which each position contains information about every other location in the same image. If each position in the feature map is treated as a random variable, the similarity between any two positions is calculated, and the value of each predicted pixel is enhanced or weakened according to its similarity to the other pixels in the image. Similar pixels are used in training and prediction, and dissimilar pixels are ignored. A self-attention layer can cover a larger receptive field than conventional convolution, so these models can capture the dependence between features separated by long distances in space.
Self-attention usually takes the scaled dot-product form [23]. Given the query, key and value matrices, the correlation between queries and keys is first calculated by multiplying them and dividing by a scaling factor, and the output is then the sum of the value vectors weighted by this correlation.
Here Q, K and V are the transformation matrices, and X_{i,j} and X_{a,b} represent different input feature maps.
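As a concrete reference for the scaled dot-product form, the following minimal sketch implements the standard expression softmax(QK^T / √d_k)V from [23]. The matrix sizes and the helper names are illustrative assumptions, and the optional `scale` argument is a hypothetical convenience for fixing the scaling factor by hand.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, scale=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (output, weights)."""
    if scale is None:
        scale = 1.0 / np.sqrt(Q.shape[-1])   # standard 1/sqrt(d_k) factor
    scores = (Q @ K.T) * scale               # query-key correlation
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((7, 8))
V = rng.standard_normal((7, 16))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value vectors, which is why the output keeps the number of query positions and the depth of the values.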

C. SINGLE-SCALE TRANSFORMATION AND CROSS-SCALE TRANSFORMATION
1) SINGLE-SCALE TRANSFORMATION
The process of single-scale transformation is shown in Figure 3; it mainly considers the relationship between pixels within the same feature map. The multi-scale features {a, b, c} are transformed by a single-scale transformation to obtain the features {a_1, b_1, c_1}, in a manner similar to a self-attention layer. The specific process is as follows, taking the single-scale transformation of a feature map over a local spatial range as an example (as shown in Figure 4). Given a pixel in the feature map, we first extract a local region centered on that pixel; the region covers the pixel positions within the given spatial range. After a single-head attention layer, the output at the pixel is a weighted sum in which the queries, keys and values are linear transformations of the pixel at that position and of its neighboring pixels: the attention weights are applied to the values of the neighboring positions, which are then summed. Local self-attention thus gathers spatial information over a neighborhood in a way similar to convolution, but the aggregation is a convex combination of value vectors with mixing weights that are parameterized by content interactions. Repeating this calculation for every pixel gives the updated feature map, which has the same scale as the input feature map.
In most cases, multiple attention heads are used to learn several different representations of the input. The pixel feature depth is divided into groups, and the attention of each group is calculated separately as described above. Each head uses a different transformation, and the output representations are then concatenated to obtain the final output.
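A minimal sketch of such a local, multi-head single-scale transformation is given below. It assumes a k × k neighborhood (k = 3, as in the example of Figure 4), zero padding at the borders, no positional encoding and unscaled dot products; the function and weight names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def local_self_attention(x, Wq, Wk, Wv, k=3, heads=1):
    """x: (H, W, d_in); Wq, Wk, Wv: (d_in, d_out). Returns an (H, W, d_out) map."""
    H, W, _ = x.shape
    d_out = Wq.shape[1]
    assert d_out % heads == 0, "feature depth must split evenly across heads"
    d_head = d_out // heads
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))   # zero padding (assumption)
    q_all, k_all, v_all = x @ Wq, xp @ Wk, xp @ Wv      # linear transformations
    out = np.zeros((H, W, d_out))
    for i in range(H):
        for j in range(W):
            # k x k neighborhood centered on pixel (i, j) in the padded map
            keys = k_all[i:i + k, j:j + k].reshape(k * k, d_out)
            vals = v_all[i:i + k, j:j + k].reshape(k * k, d_out)
            for h in range(heads):
                s = slice(h * d_head, (h + 1) * d_head)   # depth group of this head
                w = softmax(keys[:, s] @ q_all[i, j, s])  # content-based mixing weights
                out[i, j, s] = w @ vals[:, s]             # convex combination of values
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
y = local_self_attention(x, Wq, Wk, Wv, k=3, heads=2)
```

The output keeps the spatial size of the input, and with k = 1 the layer reduces to the value projection of each pixel, which is a convenient sanity check.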

2) CROSS-SCALE TRANSFORMATION
There is a semantic gap between multi-scale features. To better realize the interaction between multi-scale features, cross-scale transformation is computed between two features of different scales to obtain the transformed feature map. The features are first transformed by single-scale and cross-scale transformation separately, and the two transformed feature maps of the same scale are then added. Take the cross-scale transformation of one feature as an example. Given a feature map, the output feature map has the same size as that feature map. Euclidean distance is used as the similarity function.
Here q_i = f_q(χ^b_i) and k_j = f_q(χ^a_j), where χ^b_i is the i-th position of χ^b and χ^a_j is the j-th position of χ^a. q_i and k_j are each divided into N parts, and the cross-scale transformation then proceeds in three steps: a similarity score s^n_{i,j} is computed for the n-th part of each pair (q_i, k_j), the scores are converted into weights, and the output is obtained by F_mul, a dot product between the weights and the values v_j = f_v(χ^a_j). The closer a pair is, the greater the weight it receives. The cross-scale transformation of the other scale features is performed in the same way.
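The three steps of the cross-scale transformation (similarity, weight, output) can be sketched as follows. This is a hedged illustration rather than the paper's exact formulation: the similarity is assumed to be the negative squared Euclidean distance, the weights are assumed to come from a plain softmax over the positions j for each of the N parts, separate projection matrices are used for queries and keys, and all names are hypothetical.

```python
import numpy as np

def cross_scale_transform(xb, xa, Wq, Wk, Wv, N=4):
    """xb: (n_b, d_in) positions of one scale (queries);
    xa: (n_a, d_in) positions of another scale (keys and values).
    Returns the transformed features (n_b, d) and the weights (n_b, n_a, N)."""
    q, k, v = xb @ Wq, xa @ Wk, xa @ Wv
    d = q.shape[1]
    assert d % N == 0, "feature depth must split evenly into N parts"
    qn = q.reshape(len(q), N, d // N)                # (n_b, N, d/N)
    kn = k.reshape(len(k), N, d // N)                # (n_a, N, d/N)
    vn = v.reshape(len(v), N, d // N)                # (n_a, N, d/N)
    # Similarity: negative squared Euclidean distance, so closer pairs score higher.
    diff = qn[:, None, :, :] - kn[None, :, :, :]     # (n_b, n_a, N, d/N)
    s = -(diff ** 2).sum(axis=-1)                    # (n_b, n_a, N)
    # Weight: softmax over the positions j (assumed normalization).
    s = s - s.max(axis=1, keepdims=True)
    w = np.exp(s)
    w = w / w.sum(axis=1, keepdims=True)
    # Output: weights multiplied with the values, part by part.
    out = np.einsum("ijn,jnc->inc", w, vn)           # (n_b, N, d/N)
    return out.reshape(len(q), d), w

rng = np.random.default_rng(0)
xb = rng.standard_normal((12, 8))    # toy positions of the feature being transformed
xa = rng.standard_normal((5, 8))     # toy positions of the other scale
Wq, Wk, Wv = (rng.standard_normal((8, 8)) * 0.3 for _ in range(3))
out, w = cross_scale_transform(xb, xa, Wq, Wk, Wv, N=4)
```

The output has one row per query position, so the transformed map keeps the size of the feature that supplies the queries.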

D. GATE BLOCK
In this paper, a gate block (shown in Figure 5) is used to dynamically change the feature transformation and fusion method to obtain better feature fusion. Single-scale transformation is a branch that always participates in the transformation, whereas the participation of cross-scale transformation is decided by the gate block. If both branches participate, the transformed features are added to generate each layer of the transformed feature pyramid. The CNNGate block [67] is used as the gate block. It includes an average pooling layer, two fully connected layers, a ReLU activation function and Gumbel-Softmax [69]. The transformed features {a_1, b_1, c_1} are passed through the CNNGate block. The input features, with shape (C, H, W), are first compressed by the average pooling operation, and the feature dimension is reduced to 1/4 of the original dimension, where C, H and W are the number of channels, the height and the width of the features. Then two fully connected layers, a ReLU nonlinear activation function and a Gumbel-Softmax function are used to generate a one-hot gate vector β_l for the dynamic blocks.
FIGURE 4. An example of a local attention layer in a k = 3 spatial range.
Here α_l = g_l(F_l) is the gate signal generated from F_l by the nonlinear function g_l(·), β^i_l ∈ {0, 1}, n^i_l ∼ Gumbel(0, 1) is a random sample from the Gumbel distribution, and τ is a temperature parameter of the Gumbel-Softmax function.
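For illustration, a gate of this kind can be sketched with the standard Gumbel-Softmax relaxation [69], in which the i-th soft gate value is exp((log α^i + n^i)/τ) normalized over all i. The sketch assumes that the output of the second fully connected layer plays the role of log α, that the gate has two outcomes (skip or apply the cross-scale branch), and that the one-hot vector is obtained by an argmax over the soft sample; the layer sizes and names are hypothetical.

```python
import numpy as np

def gumbel_softmax_gate(log_alpha, tau=0.1, rng=None):
    """Return (one-hot gate, soft gate) sampled with Gumbel-Softmax at temperature tau."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.uniform(low=1e-10, high=1.0, size=log_alpha.shape)
    n = -np.log(-np.log(u))                  # n ~ Gumbel(0, 1)
    y = (log_alpha + n) / tau
    y = np.exp(y - y.max())                  # numerically stable softmax
    soft = y / y.sum()
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0              # one-hot gate vector
    return hard, soft

def cnn_gate(feat, W1, W2, tau=0.1, rng=None):
    """feat: (C, H, W). Average pooling, FC, ReLU, FC, then Gumbel-Softmax."""
    pooled = feat.mean(axis=(1, 2))          # global average pooling -> (C,)
    hidden = np.maximum(W1 @ pooled, 0.0)    # first FC reduces to C/4, then ReLU
    log_alpha = W2 @ hidden                  # second FC -> one value per gate outcome
    return gumbel_softmax_gate(log_alpha, tau, rng)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))       # toy feature with C = 16
W1 = rng.standard_normal((4, 16))            # hypothetical weights, C -> C/4
W2 = rng.standard_normal((2, 4))             # hypothetical weights, C/4 -> 2 outcomes
hard, soft = cnn_gate(feat, W1, W2, tau=0.1, rng=rng)
```

At a low temperature the soft sample is already close to one-hot, which matches the role of τ described above.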

A. SETUP
1) EXPERIMENTAL HARDWARE SPECIFICATION AND IMPLEMENTATION DETAILS
The experiments in this paper were conducted on MS COCO 2017 [70], which contains 80 categories. The COCO trainval35k split (118K images) was used for training, and the minival set (5K images) was used for validation. Standard average precision (AP), AP_50, AP_75, AP_S, AP_M and AP_L are used to evaluate the model performance. Our work is based on Faster RCNN and the ideas of the transformer, whose backbone is mainly ResNet. To compare fairly with more general models and SOTA methods (whose backbones are mainly ResNet and ResNeXt [71]), we chose ResNet and ResNeXt as the backbones. The backbones are pre-trained on ImageNet [72]; the whole networks were then fine-tuned on the training set with the backbones' parameters frozen. For fair comparison, the input images are resized to 800 and 1000 pixels for the shorter and longer edges, respectively.
For all experiments, we use the SGD optimizer to train our models end-to-end for 12 epochs on a machine with an Intel i7-9700K CPU, 32 GB of RAM and 4 NVIDIA GeForce GTX TITAN X GPUs, using SBN [73] and CUDA 10.1. The deep learning framework is PyTorch 1.7.1. A linear warm-up strategy over 500 iterations is used at the beginning of training. Each mini-batch contains 2 images per GPU and 512 regions of interest (ROIs) per image, with a positive-to-negative ratio of 1:3. We initialize the learning rate to 0.01 and decrease it to 0.001 and 0.0001 at the 8th and 11th epochs. The momentum is set to 0.9 and the weight decay to 0.0001. An end-to-end region proposal network (RPN) [43] is used to generate region proposals. To make the model more robust, data augmentation methods such as geometric distortion and color jitter are used.
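The learning-rate schedule described above can be written as a small function. This is a sketch under assumptions: epochs are counted from 1, the decay takes effect from the start of the 8th and 11th epochs, and the warm-up starts from a factor of 0.001 of the base rate (the starting factor is not given in the text and is purely illustrative).

```python
def learning_rate(epoch, iteration, base_lr=0.01, warmup_iters=500,
                  warmup_start=0.001, milestones=(8, 11), gamma=0.1):
    """epoch is 1-based; iteration is the global 0-based iteration count."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma                      # step decay at each milestone epoch
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters     # goes from 0 to 1 during warm-up
        lr *= warmup_start * (1 - alpha) + alpha
    return lr

# Example values at a few points of a 12-epoch run.
lr_start = learning_rate(epoch=1, iteration=0)
lr_after_warmup = learning_rate(epoch=1, iteration=500)
lr_epoch8 = learning_rate(epoch=8, iteration=50000)
lr_epoch11 = learning_rate(epoch=11, iteration=80000)
```

After warm-up the function returns 0.01 for epochs 1 to 7, 0.001 for epochs 8 to 10 and 0.0001 for epochs 11 and 12, up to floating-point rounding.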

2) HYPER-PARAMETERS
As for the hyper-parameters of the transformer module, 1/√d_k in Equation 1 was set to 0.1, N in Equation 5 was set to 4, and τ in Equation 6 was a learned parameter, so that the Gumbel-Softmax distribution can adaptively adjust the "confidence" of the proposed samples during training. We set τ to 0.1 initially, because it should approach 0 while remaining positive; at higher temperatures, Gumbel-Softmax samples are no longer one-hot, and they become uniform as τ → ∞.
We apply DropBlock [68] to each transformed feature map to alleviate over-fitting. Following [59], we set the block size to 5 and the keep probability to 0.9.
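To make the two DropBlock settings concrete, the following sketch drops contiguous 5 × 5 regions from a (C, H, W) feature map. It follows the general recipe of DropBlock [68] (sample block centers with rate γ, zero a block around each, rescale the remaining activations), but details such as sampling a separate mask per channel are assumptions, and the function name is hypothetical.

```python
import numpy as np

def drop_block(x, block_size=5, keep_prob=0.9, rng=None):
    """x: (C, H, W). Returns (x with square blocks zeroed and rescaled, binary mask)."""
    assert block_size % 2 == 1, "this sketch assumes an odd block size"
    if rng is None:
        rng = np.random.default_rng()
    C, H, W = x.shape
    # Rate of block centers, chosen so that roughly (1 - keep_prob) of units are dropped.
    gamma = ((1.0 - keep_prob) / block_size ** 2
             * (H * W) / ((H - block_size + 1) * (W - block_size + 1)))
    half = block_size // 2
    seeds = np.zeros((C, H, W), dtype=bool)
    # Centers are sampled only where a full block fits inside the map.
    seeds[:, half:H - half, half:W - half] = (
        rng.random((C, H - 2 * half, W - 2 * half)) < gamma)
    mask = np.ones((C, H, W))
    for c, i, j in zip(*np.nonzero(seeds)):
        mask[c, i - half:i + half + 1, j - half:j + half + 1] = 0.0
    kept = mask.sum()
    scale = mask.size / kept if kept > 0 else 0.0   # keep the expected activation level
    return x * mask * scale, mask

rng = np.random.default_rng(0)
feat = np.ones((4, 20, 20))                         # toy feature map
dropped, mask = drop_block(feat, block_size=5, keep_prob=0.9, rng=rng)
```

Dropping whole blocks rather than independent units removes spatially correlated information, which is the motivation for using it on convolutional feature maps.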

3) LOSS FUNCTION
To reduce the computational complexity of the model and save resources, the loss function contains not only the classification and regression losses but also the computation cost as a loss constraint [72], so as to achieve the optimal balance between detection accuracy and the amount of computation. C_max and C_min represent the computation cost of the highest and lowest configurations, respectively, and C_R represents the actual computation cost. C_target is controlled by the hyper-parameter α. In the final loss function, i is the index of an anchor in a mini-batch, and p_i is the predicted probability that anchor i is a target. If the anchor is positive, p*_i = 1; otherwise p*_i = 0. The 4-dimensional vector t_i represents the four coordinates of the predicted box, and t*_i is the coordinate vector of the ground-truth bounding box. L_cls is the log loss over two categories (target and non-target). The regression loss L_reg is a smooth L1 function. The term p*_i L_reg indicates that the regression loss is only activated when p*_i = 1. The terms L_cls, L_reg and L_C are balanced by the balance parameters λ.
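The structure of such a loss can be sketched as follows. The classification and regression terms follow the description above (log loss and smooth L1, with regression active only for positive anchors). The computation term is only one plausible form and is an assumption: C_target is taken to be C_min + α(C_max − C_min), the penalty is the squared normalized gap between C_R and C_target, and both sums are normalized by the number of anchors. All function names are hypothetical.

```python
import math

def smooth_l1(pred, target):
    """Smooth L1 loss summed over the box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def log_loss(p, label):
    """Binary log loss for predicted probability p and label in {0, 1}."""
    eps = 1e-12
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

def computation_penalty(c_real, c_min, c_max, alpha):
    """Assumed budget term: squared normalized gap between C_R and C_target."""
    c_target = c_min + alpha * (c_max - c_min)     # assumed definition of C_target
    return ((c_real - c_target) / (c_max - c_min)) ** 2

def total_loss(probs, labels, boxes, gt_boxes, c_real, c_min, c_max,
               alpha=0.5, lam_reg=1.0, lam_c=1.0):
    """Classification + regression (positives only) + computation constraint."""
    n = len(probs)
    cls = sum(log_loss(p, y) for p, y in zip(probs, labels)) / n
    reg = sum(y * smooth_l1(t, g) for y, t, g in zip(labels, boxes, gt_boxes)) / n
    return cls + lam_reg * reg + lam_c * computation_penalty(c_real, c_min, c_max, alpha)

loss = total_loss(probs=[0.9, 0.1], labels=[1, 0],
                  boxes=[[0.0] * 4, [5.0] * 4], gt_boxes=[[0.0] * 4, [0.0] * 4],
                  c_real=5.0, c_min=0.0, c_max=10.0, alpha=0.5)
```

In this toy call the negative anchor contributes nothing to the regression term even though its box is far from the target, and the computation penalty vanishes because C_R equals the assumed C_target.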

B. COMPARISON
We compared TDFP with state-of-the-art object detectors on the MS COCO [70] test-dev 2017 benchmark. In these experiments, the images are randomly scaled between 640 and 800 pixels during training, and the number of iterations is increased to 200K. For TDFP we used the same settings and hyper-parameters (e.g., learning rate, NMS threshold) as FPT [39] and DyFPN [67]. Table 1 lists the comparison with several detectors. R50, R101, RXt50 and RXt101 indicate ResNet50, ResNet101, ResNeXt50 and ResNeXt101.
Using ResNeXt-101 as the backbone, TDFP reaches an AP of 46.2% and an AP_S of 26.3%. With the same backbone network, compared with the transformer-based DETR [28], TDFP is slightly inferior in overall accuracy and in the accuracy on large targets, but its AP_S is 2.6%-2.9% higher than that of DETR and UP-DETR [29]. More notably, our method surpasses the transformer-based ViT-FRCNN [61] by 6.0%-6.9%. We also found a surprising result: compared with the CNN-based two-stage detectors SpineNet [3], Faster RCNN [38], AugFPN [7] and DyFPN [67] and the one-stage detector RetinaNet [47], TDFP shows a large improvement in AP, AP_50, AP_75, AP_M and AP_L on COCO.

C. ABLATION STUDY
The ablation study was performed on MS COCO 2017 val set, and the main backbone network was ResNet-50. The purpose of this study is as follows.

1) COMPARISON OF TRANSFORMER METHODS
In this section, we evaluate the importance of the transformer module (TS module), single-scale transformation (SS TS) and cross-scale transformation (CS TS). As shown in Table 2, when the TS module is not added, the network fuses the features through a convolution layer, and the detection accuracy is the worst. The TS module detects better than convolution. The AP of CS TS is 0.5% higher than that of SS TS, and using both CS TS and SS TS gives the best detection result, with the AP of small targets 2.6% higher than without the TS module. The transformed features therefore carry more abundant local and global context information for establishing the relationship between features. This is illustrated in Figure 6, which gives a visual comparison of the features obtained through the convolution layer, single-scale transformation and cross-scale transformation. Columns a, b, c, d and e show the original image, the convolution layer, the single-scale transformation, the cross-scale transformation, and the fused feature maps after single-scale and cross-scale transformation, respectively. As can be seen from Figure 6, compared with the convolution layer, the self-attention layer obtains more abundant global context information. Cross-scale transformation captures more context information of multi-scale features than single-scale transformation and realizes the interaction between multi-scale features. Single-scale and cross-scale transformation capture the relationship between features over longer distances and are more sensitive to the features of small targets.

2) THE NECESSITY OF COMPUTATION LOSS
The influence of the CC loss (Table 3) and of the resource limitation coefficient is studied. Without the CC loss, the amount of computation is the largest. Although the CC loss leads to a small decrease in detection accuracy, it greatly reduces the amount of computation and achieves a better balance between accuracy and computation.

3) THE NECESSITY OF TRAINING STRATEGY
The effect of applying SBN [73] and DropBlock [68] in TDFP is explored in Table 4. Both SBN and DropBlock improve the performance of TDFP, and their combination achieves better results, improving the bounding box AP by 1.6%-2.1%.

4) FPS AND GFLOPs
FLOPs measure the computational cost of a model through theoretical calculation. FPS (frames per second) generally refers to the number of images displayed or recorded per second; here it is the number of images the detector processes per second. We use TorchScript models to measure FLOPs and FPS on an NVIDIA GeForce RTX 2080 Ti GPU.
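As a simple illustration of how an FPS figure of this kind can be obtained, the sketch below times a generic callable over a list of inputs. It is not the measurement script used in the paper: `model_fn` is a hypothetical stand-in for the detector, the number of warm-up calls is an assumption, and on a GPU the device would additionally have to be synchronized before the clock is read.

```python
import time

def measure_fps(model_fn, inputs, warmup=2):
    """Return the number of inputs processed per second by model_fn."""
    for x in inputs[:warmup]:
        model_fn(x)                           # warm-up calls are not timed
    start = time.perf_counter()
    for x in inputs:
        model_fn(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / max(elapsed, 1e-9)   # guard against a zero reading

# Toy usage with a trivial stand-in for the detector.
fps = measure_fps(lambda x: sum(v * v for v in x), [list(range(1000))] * 20)
```

Averaging over many images and discarding the first few calls reduces the influence of one-off start-up costs on the reported figure.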
Under the same backbone, ResNet-50, the comparison of TDFP with RetinaNet [47], Fast RCNN [38], DETR [28] and recent SOTA methods in FPS and GFLOPs is reported in Table 5 (V100 represents NVIDIA TensorRT on a V100 GPU). When the GFLOPs are close, the TDFP model achieves the same result as the Fast RCNN baseline. Without a significant reduction in efficiency, its FPS is higher than that of RetinaNet, which has a lower AP_S, and AP_L is greatly improved. Compared with DETR [28], which is based on transformer and CNN, TDFP is slightly inferior in FPS, GFLOPs and overall accuracy, but its detection accuracy on small targets is greatly improved. Both its FPS and GFLOPs are higher than those of FCOS [49].

D. VISUALIZATION OF RESULTS AND DISCUSSION
From the COCO test set, this paper selects some images that are difficult to detect, and the detection results are shown in Figure 7. On the whole, the detector in this paper correctly detects the multi-scale targets in the images, and the detection results for small targets are also good.
Our model aims to improve the detection accuracy of small targets by combining CNN with transformer methods, each of which has advantages and disadvantages. The transformer-based method gives better detection results for large targets than for small targets, while CNN is the opposite. Our method applies the idea of the transformer in the process of constructing the feature pyramid. Compared with Faster RCNN, it may not be able to obtain better features of small targets. Compared with DETR, we propose a better way to fuse and construct features, and the resulting rich feature pyramid combines local and global contextual information. Therefore, our method reaches a compromise between Faster RCNN and DETR.
The experimental results show that the proposed model can effectively improve the accuracy of object detection while keeping the computation low. It achieves 44.4% and 46.2% AP on ResNet-101 and ResNeXt-101, respectively. These results surpass the previous two-stage detector Fast RCNN [38] and the one-stage detector RetinaNet [47]. Compared with DETR [28], the overall detection accuracy with the same backbone network is lower, but the small-target detection accuracy is greatly improved. At the same time, richer global information is beneficial to large targets. Dynamic selection of the optimal multi-scale feature fusion method can obtain and aggregate more abundant multi-scale features and context information, which better solves multi-scale problems, especially for small targets. In addition, training with the amount of computation as a loss constraint reduces the computation without causing a significant decline in accuracy, and the accuracy can be further improved by suitable training strategies.

V. CONCLUSION AND FUTURE WORK
To mitigate the low detection accuracy of small targets caused by their limited feature information and by the limitations of CNN, a novel detection model based on CNN and transformer was proposed. In this model, a transformer module was designed to combine the local context and the global context information obtained by feature transformation. In this module, dynamic multi-scale feature transformation and fusion, determined by a gate block, was used to obtain optimal feature fusion and a feature pyramid with richer multi-scale feature and context information. With these methods, the detection accuracy of small targets based on CNN and transformer is improved, and the detection results for large targets are better than those of CNN-based methods. In addition, the proposed detection model takes the amount of computation as part of the loss function, reducing the computation cost without significantly reducing the accuracy. However, this paper only applies the transformer idea to a two-stage detector, which leaves considerable room for optimization in terms of accuracy and speed. In future work, we will consider combining the transformer with one-stage or anchor-free detectors.