Pyramid Attention Upsampling Module for Object Detection

The core task of object detection is to extract features of various sizes by hierarchically stacking multi-scale feature maps. However, it is difficult to transmit semantic information to the low layers while reducing the loss of semantic information in the high-level features. In this paper, we present a novel method that reduces the loss of semantic information and, at the same time, improves object detection performance by applying an attention mechanism to the high-level layer of the feature pyramid network. The proposed method focuses on sparse spatial information using deformable convolution v2 (DCNv2) on the lateral connections of the feature pyramid network. Specifically, the upsampling process is divided into two branches: the first attends to the global context information of the high-level features, and the other rescales the feature map by interpolation. By multiplying the outputs of the two branches, we obtain an upsampling result that attends to the semantic information of the high-level layer. The proposed pyramid attention upsampling module makes three contributions. First, it can be easily applied to any model that uses a feature pyramid network. Second, it reduces the loss of semantic information in the high-level feature map by performing context attention on the high-level layer. Third, it improves detection performance by stacking layers down to the low layer. We used the MS-COCO 2017 detection dataset to evaluate the performance of the proposed method. Experimental results show that the proposed method provides better detection performance than existing feature pyramid network-based methods.


I. INTRODUCTION
Object detection is a fundamental but still challenging problem in computer vision. Deep learning-based object detection methods are widely used in various fields such as face detection [1], [2], object tracking [3], [4], pedestrian detection [5], [6], autonomous driving, and medical imaging, to name a few. Convolutional neural networks (CNNs) have developed rapidly in recent years, starting with AlexNet [7], and have shown significantly improved performance in the field of object detection. Since learning a backbone network from scratch requires large amounts of data and processing time, a backbone pre-trained on a large dataset such as ImageNet is widely used for efficient feature extraction [8]. To realize a lightweight backbone, MobileNetV2 [9] used inverted residuals and a linear bottleneck, ShuffleNet [10] used pointwise group convolution and channel shuffle, and Xception [11] applied depth-wise separable convolution to the inception model, which separates channels and spatial dimensions. On the other hand, to increase accuracy by constructing a deeper network, ResNet [12] built deeper neural networks through residual learning, and ResNeXt [13] improved performance by increasing cardinality using grouped convolution based on ResNet. To detect multi-scale objects, the image pyramid [14] method used multi-scale feature maps from high to low layers at the cost of increased computational load and memory space. To solve this problem, research on effectively integrating the feature maps of the high and low layers is being actively conducted. Figure 1 shows a representative model used for object detection with the structure of the feature pyramid network (FPN) [15]. The deep layers in the backbone extract high-level features such as textures or parts of objects, whereas the shallow layers extract low-level features such as edges and curves.
The FPN uses both high- and low-level features effectively, as shown in Figure 1, by hierarchically stacking feature maps through a top-down pathway with lateral connections to detect objects of various sizes. Through this method, the performance of detecting multi-scale objects was improved, and the computational cost and memory problems were solved. PANet [16] proposed a stronger FPN structure by adding a bottom-up path, NAS-FPN [17] proposed a new FPN structure through neural architecture search, and BiFPN [18] proposed a fusion method for multi-scale feature maps by building a weighted bi-directional feature pyramid network. Existing FPN-based structures use interpolation [19], [20] or deconvolution [21], [22] to fuse high- and low-level feature maps. More specifically, interpolation usually means nearest-neighbor or bilinear interpolation. Nearest-neighbor interpolation is fast because it fills in values using the pixel value at the nearest location, but as shown in Figure 2b, edges become ambiguous due to aliasing. Bilinear interpolation resizes the original image by N times using weighted combinations of the four neighboring pixel values, which causes blur, as shown in Figure 2c. Deconvolution expands the size of the feature map by inverting the convolution computation, and checkerboard artifacts occur due to uneven overlap depending on the stride and kernel size.
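The contrasting behaviors of the two interpolation methods can be sketched with a toy example (our own illustration, not the paper's code): nearest-neighbor repetition keeps a binary edge binary but blocky, while bilinear weighting smears the edge across several pixels.

```python
import numpy as np

def upsample_nearest(x, s=2):
    """Nearest-neighbor: each source pixel becomes an s x s block (sharp but aliased)."""
    return x.repeat(s, axis=0).repeat(s, axis=1)

def upsample_bilinear(x, s=2):
    """Bilinear (align_corners=True convention): weighted average of the
    four nearest source pixels, which blurs sharp edges."""
    h, w = x.shape
    ys = np.linspace(0, h - 1, h * s)
    xs = np.linspace(0, w - 1, w * s)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    return (x[np.ix_(y0, x0)] * (1 - wy) * (1 - wx)
            + x[np.ix_(y0, x1)] * (1 - wy) * wx
            + x[np.ix_(y1, x0)] * wy * (1 - wx)
            + x[np.ix_(y1, x1)] * wy * wx)

step = np.array([[0., 0., 1., 1.]] * 4)   # a vertical edge
print(upsample_nearest(step)[0])          # values stay exactly 0 or 1 (blocky edge)
print(upsample_bilinear(step)[0])         # fractional values appear (smeared edge)
```

The same contrast appears at feature-map scale, which is why the paper attributes aliasing to nearest-neighbor and blur to bilinear upsampling.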
To solve this problem, Zhao et al. proposed a pyramid pooling module (PPM) that applies pooling at different sizes [23], and Chen et al. proposed atrous spatial pyramid pooling (ASPP), which stacks dilated convolutions with various rates on an atrous pooling layer [24]. However, PPM has the problem that pixel-location information is lost in the process of applying pooling at various sizes to the feature map. In addition, since ASPP uses dilated convolution, a wide receptive field can be utilized at the cost of losing local context information.
In this paper, we propose an upsampling method that reduces the loss of semantic information in the higher-level feature maps of the FPN structure and attends to local and global context information. The proposed method consists of three steps. 1) We apply deformable convolution v2 [25] to the feature map generated at each stage of the backbone to produce more prominent context information, and then use lateral connections to reduce the loss of semantic information in the higher-level feature map. 2) By applying global average pooling to the feature map generated through deformable convolution, we extract high-level features using an attention map of global context information. 3) In the other branch, the high-level feature map is interpolated and multiplied by the attention map of global context information to obtain the upsampled feature map.

II. RELATED WORK
A. OBJECT DETECTION
Object detection is usually classified into one-stage and two-stage approaches. Two-stage methods achieve high accuracy by sequentially applying region proposals. Regions with CNN features (R-CNN) is the first two-stage method that uses CNNs; it proposes candidate regions and then extracts features from each region [26]. There are many improved variants of R-CNN, including fast R-CNN [27], faster R-CNN [28], Mask R-CNN [29], Cascade R-CNN [30], and Libra R-CNN [31], to name a few. On the other hand, one-stage detection methods are faster than two-stage methods at the cost of lower accuracy. One-stage methods include YOLO [32], SSD [33], RetinaNet [34], and EfficientDet [18]. In addition to one- and two-stage methods, keypoint-based methods such as CenterNet [35] and CornerNet [36], as well as center-based ATSS [37] and FCOS [38], use an anchor-free approach. Recently, object detection methods that eliminate the non-maximum suppression (NMS) or anchor generation processes using transformers [39] have significantly improved detection performance [40], [41].

B. FEATURE PYRAMID NETWORK
Multi-scale object detection is a crucial problem in the object detection field. The image pyramid was a commonly used method in object detection to increase the accuracy of multi-scale object detection [14]. However, independently extracting features at each level of the image pyramid requires redundant computation. To reduce the computational load and improve detection accuracy, the feature pyramid network (FPN) compensates for the semantic information lost in the forward process using a top-down pathway and lateral connections [15]. Recently, more robust pyramidal structures based on FPN have been proposed. PANet added bottom-up path augmentation to deliver low-level information to the high levels once more, enabling more precise localization [16]. The neural architecture search-feature pyramid network (NAS-FPN) utilized neural architecture search (NAS) to design a new feature pyramid structure and improve performance at the cost of increased memory space [17]. Tan et al. proposed a weighted bi-directional feature pyramid network (BiFPN) that solves the computational cost problem and fuses multi-scale feature maps more quickly and effectively [18].
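The top-down pathway with lateral connections described above can be sketched as follows (a minimal NumPy illustration with random weights; the shapes follow a ResNet-style backbone, and the 3 × 3 smoothing convolutions and training are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels: (C_in,H,W) -> (C_out,H,W)
    return np.einsum('oc,chw->ohw', w, x)

def upsample2x(x):
    # Nearest-neighbor 2x upsampling on the spatial axes
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Backbone stages C2..C5: channels grow while resolution shrinks
channels = [256, 512, 1024, 2048]
sizes = [64, 32, 16, 8]
C = [rng.standard_normal((c, s, s)) for c, s in zip(channels, sizes)]

d = 256  # FPN fixes every pyramid level to 256 channels
lateral_w = [rng.standard_normal((d, c)) * 0.01 for c in channels]

# Top-down pathway: start at the highest level, upsample, and add the lateral map
P = [None] * 4
P[3] = conv1x1(C[3], lateral_w[3])
for i in (2, 1, 0):
    P[i] = conv1x1(C[i], lateral_w[i]) + upsample2x(P[i + 1])

for p in P:
    print(p.shape)   # every pyramid level now has 256 channels
```

The loop makes explicit why the upsampling operator matters: every lower pyramid level inherits whatever aliasing or blur the chosen upsampling introduces.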

FIGURE 4: Transformer architecture
C. ATTENTION MECHANISM
Attention mechanisms have been actively studied as essential elements of many neural networks in natural language processing. Recently, attention mechanisms in computer vision have been used in many fields to improve performance by focusing on regions of interest in images and capturing long-range dependencies. Their applications in computer vision include image classification [42]-[44], image generation [45], [46], and segmentation [47], [48]. The effectiveness of attention mechanisms for object detection has been demonstrated in previous studies, since they can help to locate and recognize objects in images and improve detection performance [49], [50]. Li et al. proposed a MAD unit that finds neuron activations in high and low streams through an aggressive search [49], and Zhu et al. proposed CoupleNet, a new structure that integrates the global and local information of objects to enhance detection performance [50]. Figure 4 shows the structure of the transformer [39]. Recently, transformer-based object detection methods were proposed and provided improved detection accuracy [40], [51].

III. PROPOSED METHOD
A. IMPROVED LATERAL CONNECTIONS
The proposed method can be applied to the original feature pyramid network using the attention upsampling module, as shown in Figure 5.
Recently, deep learning-based object detection methods use the FPN to build a model robust to scale change by hierarchically stacking the feature map of each stage using lateral connections and top-down paths. The feature map of each stage extracted through the backbone network carries detailed characteristics of the object. FPN fixes the number of channels to 256 through point-wise convolution on the feature map of each stage extracted from the backbone and proceeds to the decoder stage. The decoder combines a high-level feature map with abundant category information and a low-level feature map to provide a model that is stronger for multi-scale objects. However, in the process of channel reduction using point-wise convolution to reduce computation, semantic information about the object in each stage's feature map is lost.

B. ATTENTION UPSAMPLING MODULE
The proposed method changes the number of channels to 256 by using deformable convolution v2 (DCNv2) when passing the feature map from each stage to the lateral connection, to solve the above problem. DCNv2 is expressed as

y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k,

where p_k and w_k respectively represent the offset and weight for the k-th position, and y(p) and x(p) respectively represent the features at position p in the output and input feature maps. Δp_k and Δm_k are the learnable offset and modulation scalar for the k-th position. DCNv2 improves deformable convolution by finely adjusting the spatial support region with a modulation scalar in the range [0, 1] applied to real-valued offsets of unrestricted range. DCNv2 uses a flexible kernel that can be easily deformed, and the receptive field can be finely adjusted through the modulation scalar. Therefore, DCNv2 can focus more powerfully and efficiently on sparse spatial locations than fixed kernels. When applied to the lateral connection, the representation ability of the features generated at each stage can be improved, and object detection performance can be improved by extracting high-level features as a feature map that is more robust to geometric deformation.
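The equation above can be made concrete for a single output position of a 3 × 3 kernel (our own pedagogical sketch, not an efficient implementation; fractional offsets are resolved by bilinear sampling, as in the DCN papers):

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Sample x at a fractional location (py, px) by bilinear interpolation,
    treating positions outside the map as zero (offsets are real-valued)."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for yy, wy in ((y0, 1 - (py - y0)), (y0 + 1, py - y0)):
        for xx, wx in ((x0, 1 - (px - x0)), (x0 + 1, px - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                val += wy * wx * x[yy, xx]
    return val

def dcnv2_output(x, p, weights, offsets, masks):
    """y(p) = sum_k w_k * x(p + p_k + dp_k) * dm_k for a 3x3 kernel:
    p_k enumerates the regular grid, dp_k are learned offsets, dm_k in [0, 1]."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]   # the p_k
    y = 0.0
    for (dy, dx), w_k, (ofy, ofx), m_k in zip(grid, weights, offsets, masks):
        y += w_k * bilinear_sample(x, p[0] + dy + ofy, p[1] + dx + ofx) * m_k
    return y

x = np.arange(25, dtype=float).reshape(5, 5)
# With zero offsets and unit masks, DCNv2 reduces to a plain 3x3 correlation
print(dcnv2_output(x, (2, 2), [1.0] * 9, [(0.0, 0.0)] * 9, [1.0] * 9))  # → 108.0
```

Setting any Δm_k to zero gates that sample out entirely, which is how the modulation scalar restricts the spatial support region.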
In this section, we propose the attention upsampling method, which divides the upsampling process into two branches to focus on semantic information by applying global average pooling to the high-layer feature map and to reduce the information loss that occurs during upsampling, as shown in Figure 6.
In general, FPN is used in object detection to extract multi-scale features. FPN upsamples the high-level feature map and fuses it with the low-level feature maps. Usually, interpolation is used to resize the feature map, but interpolation loses semantic information due to aliasing and blur. To solve this problem, an alternative upsampling method uses the transposed convolution. However, it cannot avoid checkerboard artifacts due to overlapping, which results in performance degradation. To compensate for the information loss that occurs during upsampling, the size of the feature map can also be changed using interpolation combined with a convolution operation. However, the above-mentioned methods are computationally expensive and use complex convolution structures. Furthermore, because of unidirectional upsampling, the semantic information of the high-level layer is lost due to interpolation and deconvolution.
The proposed method focuses on global context information by applying global average pooling to the higher-level feature maps and fusing them with the low layers to reduce the loss of semantic information in the higher-level feature maps. Average pooling can extract globally important features by summarizing spatial information over the entire image. Global average pooling is applied to the high-level feature maps extracted through the lateral connection to exploit high-level features containing abundant category information. The proposed attention upsampling module sends the feature map of the high layer to two branches. The first branch extracts global context information of size 1 × 1 × C through global average pooling and aggregates the information through a 1 × 1 convolution followed by batch normalization and ReLU. The other branch resizes the higher-level feature map by applying nearest-neighbor interpolation. By multiplying the attention map of global context information from the first branch with the upsampled feature map from the second, the upsampled result attends to the semantic information of the high-level feature map. Finally, a modified pyramid network is constructed by the element-wise sum of the attention feature map of the high layer and the feature map of the low layer.
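The two-branch module can be sketched as follows (a NumPy illustration under our own simplifications: batch normalization is omitted, the weights are random, and the function and variable names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_upsample(high, low, w_attn):
    """Two-branch attention upsampling sketch.
    high: (C, H, W) high-level map; low: (C, 2H, 2W) lateral low-level map."""
    # Branch 1: global average pooling -> 1x1 conv (BN omitted) -> ReLU
    gap = high.mean(axis=(1, 2))              # (C,) global context vector
    attn = np.maximum(w_attn @ gap, 0.0)      # a 1x1 conv on a 1x1 map is a matmul
    # Branch 2: nearest-neighbor 2x upsampling of the high-level map
    up = high.repeat(2, axis=1).repeat(2, axis=2)
    # Multiply by the per-channel attention, then fuse by element-wise sum
    return up * attn[:, None, None] + low

high = rng.standard_normal((256, 8, 8))
low = rng.standard_normal((256, 16, 16))
w_attn = rng.standard_normal((256, 256)) * 0.05
out = attention_upsample(high, low, w_attn)
print(out.shape)   # (256, 16, 16)
```

The attention vector rescales each channel of the upsampled map before the fusion sum, which is how the module emphasizes the high-level semantic channels that plain interpolation would dilute.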

IV. EXPERIMENTS
A. EXPERIMENT SETUP
The MS-COCO 2017 dataset was used for evaluation. The dataset consists of 118K training images and 5K validation images. MMDetection, implemented in PyTorch, was used for the experiments [52]. The input training images are resized so that the longer and shorter sides are 1333 and 800 pixels while maintaining the aspect ratio. The initial learning rate is 0.005; for stability during the initial 500 iterations, the warmup ratio is 0.01, and the learning rate is multiplied by 0.1 at epochs 8 and 11. We used the SGD optimizer with weight decay 0.0001 and momentum 0.9, and the batch size is set to 4. We used a single RTX 2080 Ti GPU.
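These hyperparameters can be summarized as an MMDetection-style configuration fragment (a hedged sketch: the field names follow common MMDetection v2 conventions, and the exact layout of the authors' config files is assumed, not taken from the paper):

```python
# Training schedule matching the text (MMDetection v2-style fields, assumed layout)
optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001)
lr_config = dict(
    policy='step',
    warmup='linear',       # linear warmup over the first 500 iterations
    warmup_iters=500,
    warmup_ratio=0.01,
    step=[8, 11],          # multiply the learning rate by 0.1 at epochs 8 and 11
)
data = dict(samples_per_gpu=4)   # batch size 4 on a single GPU

print(optimizer['lr'], lr_config['step'], data['samples_per_gpu'])
```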

B. EVALUATION METHOD
For the performance evaluation of the proposed method, precision, recall, and average precision were used. Precision and recall can be respectively expressed as

Precision = TP / (TP + FP),   Recall = TP / (TP + FN),

where TP means true positive, FN false negative, and FP false positive. To calculate precision and recall, the intersection over union (IoU) between the ground truth box and the prediction box is thresholded to determine whether a detection is correct. The IoU can be expressed as

IoU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt),

where B_p is the predicted box and B_gt is the ground truth box. The mAP of MS-COCO is computed over 80 classes by increasing the IoU threshold by 0.05 per step from IoU = 0.5 and averaging the AP up to 0.95:

mAP = (mAP_0.50 + mAP_0.55 + · · · + mAP_0.95) / 10,

where mAP_50 and mAP_75 are calculated at fixed values of IoU = 0.5 and IoU = 0.75, respectively.
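The IoU and mAP definitions above can be illustrated directly (the per-threshold AP values below are made-up placeholders for the averaging formula, not results from the paper):

```python
import numpy as np

def iou(bp, bgt):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(bp[0], bgt[0]), max(bp[1], bgt[1])
    ix2, iy2 = min(bp[2], bgt[2]), min(bp[3], bgt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(bp) + area(bgt) - inter)

# COCO mAP: average AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95
thresholds = [0.50 + 0.05 * i for i in range(10)]
ap_per_iou = np.linspace(0.60, 0.15, 10)  # placeholder APs; AP drops as IoU tightens
mAP = float(ap_per_iou.mean())

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))    # partially overlapping boxes
print(len(thresholds), mAP)
```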

C. RESULTS
The performance of the proposed method was compared with FCOS [38], PANet [16], faster R-CNN using FPN [15], Cascade R-CNN [30], and Libra R-CNN [31]. For the quantitative comparison, the 5K validation images (mini-val) of the MS-COCO detection dataset were used. The evaluation uses AP, AP_50, AP_75, AP_S, AP_M, and AP_L as the evaluation criteria for MS-COCO. Tables 1 and 2 show the quantitative evaluation of mAP, mAP_50, and mAP_75 for the existing methods and the proposed method. The proposed method showed the best performance in most cases. For some of the small, medium, and large evaluations according to object size, accuracy was not improved, but mAP improved in all cases. Figure 7 shows the qualitative comparison between the proposed method and FPN. The qualitative evaluation compares images in which objects were detected with a score of 0.5 or higher. The figures in the first and second rows compare images containing objects of various sizes. When the proposed method was applied to the test images, even small objects were successfully detected. The figures in the third and fourth rows compare images containing large objects. FPN erroneously detected multiple objects within a single large object. On the other hand, the proposed method successfully detected the large object because feature fusion is performed with attention to global context information. Figure 8 compares the last pyramid layer (P2) feature maps of the proposed attention upsampling module and of FPN. If FPN uses only nearest-neighbor interpolation, aliasing is unavoidable, which makes the context information insufficient to distinguish objects. However, the proposed method shows better results because it enhances the global context information from the high-level feature map through attention upsampling.
The proposed method improves the lateral connection and applies attention upsampling with global context information to fuse with the low-level feature maps. The experimental results show that it compensates for the information loss caused by conventional methods and achieves better results in detecting objects of various sizes. Therefore, the quantitative evaluation proves that the attention upsampling module proposed in this paper yields a model that is more robust to multi-scale objects than conventional methods.

D. ABLATION STUDY
The effect of the pooling method on the proposed attention upsampling module was tested. Table 3 shows the performance of the proposed method when max pooling, average pooling, or both are used. ResNet-50 was used as the backbone, and the performance was lowest when both max pooling and average pooling were used together. Although the performance difference between max pooling and average pooling is small, average pooling shows the better performance.

V. CONCLUSIONS
In this paper, we proposed a pyramid attention upsampling module for object detection that can be used in feature pyramid-based neural networks. Our method focuses on sparse spatial information by applying DCNv2 to improve the lateral connection. In addition, the loss of semantic information that occurs in the process of upsampling high-level feature maps was reduced. We presented a technique that performs the upsampling process in two branches to reduce the loss of semantic information in the high-level feature map and to attend to the global context information. The proposed method can be easily applied to object detection models using the FPN structure, and it shows better performance on the MS-COCO benchmark when applied to various object detection models. Since the one-stage method showed slightly better performance than the two-stage method, further research is needed to improve the attention upsampling module.
HYEOKJIN PARK was born in Incheon, Korea, in 1994. He received a bachelor's degree in Electrical Engineering from Korea National University of Transportation, Korea, in 2020. Currently, he is pursuing a Master of Science degree in Digital Imaging Engineering at Chung-Ang University. His research interests include object detection, object recognition, artificial intelligence, and unsupervised learning.