Adaptively Dense Feature Pyramid Network for Object Detection

We propose a novel one-stage object detection network, called adaptively dense feature pyramid network (ADFPNet), to detect objects across various scales. The proposed network is developed on the single shot multibox detector (SSD) framework with a newly proposed ADFP module, which consists of two components: a dense multi scales and receptive fields block (DMSRB) and an adaptively feature calibration block (AFCB). Specifically, the DMSRB extracts rich semantic information in a dense way through atrous convolutions with different atrous rates, producing dense features across multiple scales and receptive fields; the AFCB calibrates the dense features to retain features contributing more and depress features contributing less. Extensive experiments have been conducted on the VOC 2007, VOC 2012, and MS COCO datasets to evaluate our method. In particular, we achieve new state-of-the-art accuracy with an mAP of 82.5% on the VOC 2007 test set and an mAP of 36.4% on the COCO test-dev set using a simple VGG-16 backbone. When testing with a lower resolution (300 × 300), we achieve an mAP of 81.1% on the VOC 2007 test set at 62.5 FPS on an NVIDIA 1080ti GPU, which meets the requirement for real-time detection.


I. INTRODUCTION
In recent years, deep convolutional neural networks (CNNs) have fostered the development of computer vision tasks, such as classification [1]-[4], semantic segmentation [5]-[7], and object detection [8]-[11], through learning better feature representations. For example, to extract high-level information, VGG [1] uses very small (3 × 3) convolutions to deepen the network. For the same goal, GoogLeNet [2] proposes an inception module to increase the depth and width of the network. The introduction of shortcuts by ResNet [3] makes the backward propagation of the gradient easier and enables deeper networks to be effectively trained. DenseNet [4] connects every layer in a feed-forward approach to strengthen feature transmission, encourage feature reuse, and improve feature expression.
As for object detection, the purpose is not only to identify the class of objects, but also to localize each object within a bounding box. At present, CNN features have better robustness and stronger characterization ability than traditional hand-crafted features. Traditional image processing, which is characterized by hand-crafted features, relieves the problem of multiple object sizes by constructing an image pyramid [12]. The pyramid of an image is a set of images derived from the same original image with progressively reduced resolutions. Due to its effectiveness in analyzing images at different scales, the image pyramid has also been introduced into deep CNN-based object detectors.
Nevertheless, features must be computed separately for each image in the pyramid, which greatly consumes computing resources. Therefore, the Single Shot MultiBox Detector (SSD) [13] designs a pyramidal feature hierarchy to reuse the multi-scale feature maps and detects objects of different sizes at each feature layer. This method greatly reduces the waste of resources compared to the image pyramid. However, SSD uses features from shallow to deep, resulting in high-resolution feature maps without sufficient semantic information. Furthermore, the Feature Pyramid Network (FPN) [14] constructs a top-down framework with lateral connections to produce feature maps with strong semantic information at all scales.
Recently, it has been shown that the receptive field plays a key role in object detection at various scales [15]-[17]. For example, inspired by Atrous Spatial Pyramid Pooling (ASPP) [18], aggregating features through a series of atrous convolutions with different atrous rates is introduced in [15], [16] for object detection. Unlike increasing the field of view through traditional convolution, the atrous convolution alleviates the contradiction between the field of view and the feature resolution, which is beneficial for object localization and detection.
In order to better promote the development of multi-scale object detection, we propose a novel network structure, named adaptively dense feature pyramid (ADFP), which enhances the feature representation capabilities of CNN-based network structures. The network structure is mainly composed of a dense multi scales and receptive fields block and an adaptively feature calibration block. The dense multi scales and receptive fields block mainly consists of a cascade of densely connected atrous convolution layers, resulting in dense multi-scale features from multiple receptive fields. Then the adaptively feature calibration block is used to calibrate the produced feature maps based on the feature dependencies, retaining features that contribute more to the detection and depressing features that contribute less. We then construct a novel one-stage object detector based on the SSD [13] framework. By introducing the proposed structure, the new object detector not only achieves state-of-the-art performance, but also maintains a fast detection speed. Our work is most related to the work in [15], while the difference lies in that we use dense connections to extract features and the features are calibrated by a following SENet module. Our experimental results confirm our approach by keeping high-level semantic information and fine details simultaneously for the object detection task. In summary, the contributions of this paper are listed as follows:

1.
A novel module called adaptively dense feature pyramid (ADFP) to densely aggregate information at multi scales and receptive fields is proposed.
directly predict bounding boxes and class probabilities. Although there is a slight lack of precision, YOLO is extremely fast. After that, YOLOv2 [30] and YOLOv3 [31] were proposed to further improve the accuracy in various aspects. Among these one-stage detectors, SSD [13] detects objects of a certain scale through a series of pre-defined anchor boxes on corresponding layers. The anchor boxes over different aspect ratios and scales are set on each layer of a feature pyramid. DSSD [32] replaces the backbone with Residual-101 and adds a large amount of high-level semantic information by deconvolution to improve the accuracy. To inherit the merits of both one-stage and two-stage detectors, the Single-Shot Refinement Neural Network (RefineDet) [33] designs two inter-connected modules, the anchor refinement module and the object detection module, to coarsely refine the anchors and further improve the regression and classification, respectively. Kong et al. [34] reformulate the feature pyramid structure to combine deeper features and shallower features with global attention and local reconfigurations. To enrich the semantic information of features, Zhang et al. introduce a semantic segmentation branch and a global activation to build Detection with Enriched Semantics (DES) [35]. The parallel feature pyramid network (PFPNet) [17] increases the width of the network instead of deepening it to avoid integration between features from different layers.

B. ATROUS CONVOLUTION
Traditional CNNs increase semantic information or receptive fields through a series of convolutional filters as well as pooling layers. However, this reduces the image feature resolution, which is important for accurate object localization and detection. To alleviate the contradiction between a sufficient receptive field and the image feature resolution, atrous convolution [36], or dilated convolution, was proposed for the task of semantic image segmentation and later developed by [18], [37]-[39] for semantic segmentation. Recently, in the field of object detection, Liu et al. [15] build a lightweight detector, Receptive Field Block Net (RFBNet), using an RF Block (RFB) module to enhance feature discriminability and robustness. In contrast to this method using atrous convolution, we propose to use a dense connection of atrous convolutions, which produces features that are dense in both scales and receptive fields. The generated dense features are then calibrated by an adaptively feature calibration block to retain the features contributing most to the detection task and depress the features contributing less. The details of our proposed module are described in the following section.

III. METHOD
We propose an adaptively dense feature pyramid network (ADFPNet), which is based on the SSD framework with a novel adaptively dense feature pyramid (ADFP) module. The ADFP module is composed of two components, the dense multi scales and receptive fields block and the adaptively feature calibration block, as shown in Fig. 1. We describe the details in the following sections.

A. DENSE MULTI SCALES AND RECEPTIVE FIELDS BLOCK (DMSRB)
Incorporating different scales and receptive fields has been proven to improve detection accuracy [15], [17]. Thus, the purpose of this block is to generate dense features with multiple scales and different receptive fields. Inspired by the Densely connected Atrous Spatial Pyramid Pooling (DenseASPP), we design a dense multi scales and receptive fields block consisting of atrous convolution layers with different atrous rates to take full advantage of the multiple receptive field sizes and multi-scale features. Compared to DenseNet, we add atrous convolutions into the proposed module, which have been shown to be effective for feature extraction, as they are widely used in the task of semantic segmentation. In the two-dimensional case, such as images, the response of the atrous convolution operator at each element i of the output z can be expressed as follows:

z[i] = Σ_k x[i + a · k] · w[k],  (1)

where x denotes the input feature maps, a represents the atrous rate, and w[k] corresponds to the k-th parameter of the filter w. The atrous rate is the stride used for sampling information from the input feature maps. The atrous convolution operation can be interpreted as inserting a − 1 zeros between two sequential filter elements along each spatial dimension to expand the filter by sampling the input x. One example of an atrous convolution processing two-dimensional signals with an atrous rate of 2 is visualized in Fig. 2(a). However, sampling the input feature x using a large atrous rate causes sparse information, as shown in Fig. 2(b), where the feature is extracted by an atrous convolution with an atrous rate of 5 in one dimension; only 3 pixels contribute to the convolution calculation, leading to loss of information. This problem can be alleviated by stacking larger atrous rates after smaller ones to gather information more densely from more contributing pixels. As shown in Fig.
2(c), through the stacking of atrous convolutions with atrous rates of 1 and 5, 9 pixels in the one-dimensional input participate in the convolution calculation, which is 3 times the number in the single atrous convolution calculation. When a two-dimensional signal is used as input, a single 3 × 3 atrous convolution with an atrous rate of 5 aggregates 9 pixels, while the stacked 3 × 3 atrous convolutions with atrous rates of 1 and 5 aggregate 81 pixels. On the other hand, the stacked atrous convolutions with different atrous rates produce image features at different scales, which helps object detection. Specifically, to extract features in a dense mode, we stack the atrous convolutions with dense connections, where each layer has access to the outputs of all preceding layers and feeds its own output to all subsequent layers, yielding dense receptive fields. The atrous rates increase as the layers in the module get deeper. The computation of the dense atrous convolution layers can be formulated as:

z_d = H_a([z_0; z_1; …; z_{d−1}]),  (2)

where z_d denotes the output of the d-th layer, [z_0; z_1; …; z_{d−1}] corresponds to the concatenation of the outputs of the 0-th, 1-st, …, (d − 1)-th layers, and H_a denotes the d-th atrous convolution with atrous rate a. Obviously, the dense multi scales and receptive fields block can create a much denser feature pyramid because the atrous convolutions take all the previous layers' outputs as input. Compared to a single atrous convolution, the atrous convolutions in dense mode with the same receptive field can sample more information from the input. In other words, through the dense connections, more pixels are involved in the feature extraction process.
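As a quick check of the pixel counts above, the following plain-Python sketch (illustrative only) enumerates the input positions that contribute to a single output position when 3-tap atrous convolutions are stacked:

```python
def taps(rate, k=3):
    # centered tap offsets of a k-tap atrous convolution with the given rate
    half = k // 2
    return [rate * i for i in range(-half, half + 1)]

def contributing_pixels(rates, k=3):
    # input positions that influence one output position after stacking layers
    positions = {0}
    for rate in rates:
        positions = {p + t for p in positions for t in taps(rate, k)}
    return len(positions)

print(contributing_pixels([5]))     # 3: a single rate-5 convolution
print(contributing_pixels([1, 5]))  # 9: stacking rates 1 and 5
# in two dimensions the counts are squared: 9 vs. 81
```

The counts match the figures discussed above: a single large rate samples sparsely, while stacking a small rate before it fills in the gaps.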
The details of our dense multi scales and receptive fields block are illustrated in Fig. 1. A feature transformation is used to aggregate and fuse the information in the input feature maps. The produced feature maps not only contain rich multi-scale semantic information about categories of objects, but also keep the fine details about the shape and location of objects.

B. ADAPTIVELY FEATURE CALIBRATION BLOCK
The features extracted by the DMSRB contain extensive dense features from different scales and receptive fields. However, under the assumption that not every channel in the dense features contributes equally [40], some of the feature channels might be redundant. Such redundant features will obstruct the learning as well as the back-propagation. Thus, finding the useful features and depressing the redundant ones becomes necessary. The adaptively feature calibration block (AFCB) is therefore proposed to exploit the feature channel dependencies to calibrate the aggregated dense features, retaining relatively important features and weakening relatively unrelated ones using the Squeeze-and-Excitation block [40]. Mathematically, let x ∈ ℝ^{C×H×W} be a feature map passed from the DMSRB block and F_c be a transformation producing a channel attention vector c ∈ ℝ^{C×1×1} from the input. The whole adaptively feature calibration block can be expressed as follows:

x̃ = c ⊗ x,  (3)

where x̃ refers to the calibrated features and ⊗ is the element-wise multiplication. The channel vector c is created to explicitly model the inter-channel dependency of the features. Each value in c multiplies the corresponding feature channel through the broadcast mechanism along the channel dimension. In order to create c more efficiently, we first compress the input feature map x in the spatial dimension H × W by using average pooling, generating a channel descriptor p ∈ ℝ^{C×1×1}. Then the channel descriptor p is sent into a two-layer perceptron neural network. To limit model complexity and increase computational efficiency, we first condense p to size C/r × 1 × 1, activate it in the hidden layer with the ReLU activation function, then restore it to size C × 1 × 1 and activate it with the Sigmoid function. Here r is the reduction rate used to flexibly adjust the channels of the descriptor p.
The process of producing the channel attention vector c by this multi-layer perceptron (MLP) can be formulated as follows:

c = σ(W_2 δ(W_1 p)),  (4)

where W_1 and W_2 are the weights of the two layers, δ denotes the ReLU activation function, and σ refers to the Sigmoid activation function. The Sigmoid activation function projects the values into the range [0, 1], with lower values depressing a feature and higher values retaining it.
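The squeeze, two-layer MLP, and channel-wise re-scaling described above can be sketched in numpy as follows; the random weights and the reduction rate r = 2 are illustrative assumptions (in the network, the weights are learned):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def afcb(x, w1, w2):
    # x: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) MLP weights
    p = x.mean(axis=(1, 2))           # squeeze: global average pooling -> (C,)
    hidden = np.maximum(w1 @ p, 0.0)  # condense to C/r channels, ReLU
    c = sigmoid(w2 @ hidden)          # restore to C channels, squash to [0, 1]
    return x * c[:, None, None]       # broadcast channel weights over H x W

C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))
y = afcb(x, w1, w2)
print(y.shape)  # (8, 4, 4): same shape, channels re-weighted
```

Each channel of the output is the corresponding input channel scaled by a single weight in (0, 1), which is exactly the calibration behavior the block is designed for.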

C. TRAINING LOSS
The overall objective loss function is a weighted sum of the classification loss (clc) and the localization loss (loc), as shown:

L(x, c, l, g) = (1/N) (L_clc(x, c) + L_loc(x, l, g)),  (5)

where N is the number of matched default boxes. If N = 0, we set the loss to 0.
The classification loss is the softmax loss over multiple classes confidences (c).
The localization loss is a Smooth L1 loss between the predicted box (l) and the ground truth box (g) parameters. Similar to Faster R-CNN, we regress to offsets for the center (cx, cy) of the default bounding box (d) and for its width (w) and height (h).
in which

ĝ_j^cx = (g_j^cx − d_i^cx) / d_i^w,   ĝ_j^cy = (g_j^cy − d_i^cy) / d_i^h,  (6)

and

ĝ_j^w = log(g_j^w / d_i^w),   ĝ_j^h = log(g_j^h / d_i^h).  (7)
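For concreteness, the offset encoding and the Smooth L1 penalty can be sketched in numpy with hypothetical box values (boxes given as (cx, cy, w, h); the numbers are illustrative only):

```python
import numpy as np

def smooth_l1(x):
    # Smooth L1: quadratic near zero, linear in the tails
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def encode_offsets(g, d):
    # regression targets of a ground-truth box g w.r.t. a matched default box d
    gcx, gcy, gw, gh = g
    dcx, dcy, dw, dh = d
    return np.array([(gcx - dcx) / dw, (gcy - dcy) / dh,
                     np.log(gw / dw), np.log(gh / dh)])

g = (0.55, 0.50, 0.20, 0.40)  # hypothetical ground-truth box
d = (0.50, 0.50, 0.20, 0.20)  # hypothetical matched default box
t = encode_offsets(g, d)      # targets: 0.25, 0, 0, log(2) ~ 0.693
loss = smooth_l1(t).sum()     # localization penalty for this single match
print(t, loss)
```

The center offsets are normalized by the default box size and the width/height offsets are log-ratios, so all four targets live on comparable scales for the Smooth L1 loss.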

D. DETAILS OF NETWORK ARCHITECTURE
The motivation of the proposed adaptively dense feature pyramid network (ADFPNet) is to remedy the scale variation of object instances in the object detection task with different receptive field sizes. To inherit the merits of SSD in accuracy and speed, we construct a feed-forward convolutional network that reuses the pyramidal feature hierarchy with the ADFP module to produce category scores and box offsets for a fixed-size set of pre-set bounding boxes. Non-maximum suppression (NMS) is then applied to filter out most boxes and obtain the final detection results. The whole structure of ADFPNet is shown in Fig. 1.

1) BACKBONE-To compare fairly with the original SSD, we choose the VGG-16 network, pre-trained on the ILSVRC dataset [41] for high-quality image classification, as the backbone. Note that other backbones such as ResNet-50 or ResNet-101 could also be alternative candidates. Due to the differences between classification tasks and object detection tasks, we remove the final classification layers and add corresponding convolutional layers with sub-sampling parameters to VGG-16 to meet our needs.

2) PYRAMIDAL FEATURE HIERARCHY-
The original SSD uses multi-scale feature maps with different resolutions from different layers, including conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2, to predict both the locations and confidences of objects at vastly different scales, which is called a pyramidal feature hierarchy. In our network, we keep the pyramidal feature hierarchy but with different configurations using the proposed ADFP module, as in Fig. 1. Firstly, we place the ADFP module after the conv4_3 and conv7 layers. Features from these two layers are first processed and then sent to the prediction layer and the successive layers. Secondly, we replace the conv8_x and conv9_x layers in the original SSD with an ADFP module each to produce denser and semantically richer information. All the ADFP modules consist of a cascade of atrous convolution layers with atrous rates of 1, 2, 3, 4, and 5, except the one after conv4_3, where the atrous rates are 1, 3, 5, 7, 9, and 11. The reason is that conv4_3 has a larger feature map resolution and needs larger atrous rates to capture a large receptive field. We denote it as the ADFP_L module, as shown in Fig. 1.
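As a sanity check on these rate choices, the one-side receptive field of a serial stack of 3 × 3 atrous convolutions can be computed directly; each layer with rate a adds (k − 1)·a pixels to the field of view, and the dense connections additionally expose every intermediate receptive field:

```python
def stacked_rf(rates, k=3):
    # receptive field side length after serially stacking k x k atrous convs
    rf = 1
    for a in rates:
        rf += (k - 1) * a
    return rf

print(stacked_rf([1, 2, 3, 4, 5]))      # 31 for the ADFP cascade
print(stacked_rf([1, 3, 5, 7, 9, 11]))  # 73 for the ADFP_L cascade
```

The much larger field of the ADFP_L cascade matches its placement after the high-resolution conv4_3 layer.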

IV. DATA AND EXPERIMENTS
We have conducted extensive experiments on three widely used benchmarks, namely the Pascal VOC 2007, VOC 2012 [42], and MS COCO [43] datasets. The Pascal VOC datasets have 20 object categories, which are a subset of the 80 object categories in MS COCO. VOC 2007 consists of 5,011 images as the trainval set and 4,952 images as the test set, with all annotations available. In VOC 2012, the researchers annotate the trainval set (11,540 images) and leave the test set (10,991 images) annotations unavailable. The COCO dataset, which is much larger than the Pascal VOC datasets, is split into a train set (118k images), a val set (5k images), and a test set (41k images). The details on each dataset are described below.

A. PASCAL VOC 2007
In this experiment, all the methods are trained on the VOC 07+12 trainval set, the union of the VOC 2007 trainval set and the VOC 2012 trainval set, and tested on the VOC 2007 test set. In VOC 2007, a predicted bounding box is considered positive, and used to produce the final results, if its Intersection over Union (IoU) with the ground truth is higher than 0.5. We train our method for 350 epochs using SGD with a "warm-up" strategy: we ramp up the learning rate from 10^−6 to 4 × 10^−3 during the first 5 epochs, and then multiply it by 0.1 at epochs 200, 250, and 300. Referring to [13], we set the default batch size to 32, the weight decay to 5 × 10^−4, and the momentum to 0.9 in training. Due to the memory constraint, we halve the batch size and learning rate when training with 512 × 512 input, and keep the other settings unchanged.
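The schedule above can be written as a simple function of the epoch index; the milestones and rates are those stated in the text, while the linear shape of the warm-up ramp is our assumption:

```python
def learning_rate(epoch, base_lr=4e-3, warmup_start=1e-6, warmup_epochs=5,
                  milestones=(200, 250, 300), gamma=0.1):
    # linear warm-up to base_lr, then multiply by gamma at each milestone
    if epoch < warmup_epochs:
        return warmup_start + (epoch / warmup_epochs) * (base_lr - warmup_start)
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

print(learning_rate(0))    # 1e-06 at the start of the warm-up
print(learning_rate(5))    # 0.004 once the warm-up finishes
print(learning_rate(220))  # ~4e-4 after the first decay step
```

The warm-up avoids the divergence a large initial learning rate can cause, while the step decays stabilize the final epochs.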

B. PASCAL VOC 2012
In this experiment, we train our ADFPNet on the union of the VOC 2007 trainval and test sets and the VOC 2012 trainval set (VOC 07++12), then submit the prediction results to the public evaluation server. Considering the increase of the training set, we adjust the total number of training epochs to 400. We set the learning rate to 4 × 10^−3 after the same "warm-up" strategy as for VOC 2007, and divide it by 10 at epochs 250, 300, and 350. The other training settings used in VOC 2007 are kept.

C. MS COCO
To further validate our method, we conduct experiments on the MS COCO dataset, which is larger and more challenging, and submit the prediction results on test-dev (20k images), a subset of the test set, to the official evaluation server to produce the mean Average Precision (mAP). MS COCO uses an evaluation metric different from VOC: the average mAP over 10 different IoU thresholds from 0.5 to 0.95 is applied to evaluate the performance of detection methods more comprehensively. APs at IoU thresholds of 0.5 and 0.75 are two other important evaluation indicators in COCO. In addition, COCO divides object instances into large (area > 96²), medium (32² < area < 96²), and small (area < 32²) according to the number of pixels in the segmentation mask to produce the corresponding APs. The training is conducted on the 2017 train set, which is exactly the same as the original public trainval35k set as reported on the official website. We set the batch size to 32 in training and still apply the "warm-up" strategy, increasing the learning rate from 10^−6 to 2 × 10^−3 during the first 5 epochs. We continue to train the method with a 2 × 10^−3 learning rate for 95 epochs, then decay it to 2 × 10^−4, 2 × 10^−5, and 2 × 10^−6 for another 50, 30, and 20 epochs, respectively. Referring to SSD, we reduce the size of the default anchor boxes while keeping the other settings the same as in VOC, since the size of object instances is smaller than in VOC. Similarly, due to the memory issue, we halve the batch size and learning rate for 512 × 512 input, adding 20 epochs at a learning rate of 1 × 10^−3.

V. RESULT
A. PASCAL VOC 2007

1) QUANTITATIVE RESULT-Table 1 shows the performance comparison of ADFPNet with the state-of-the-art methods. The results of SSD300 and SSD512 are enhanced by using a "zoom in" operation to produce random crops as training examples. Our ADFPNet fed with a low-resolution 300 × 300 input achieves 81.1% mAP without any bells and whistles, which outperforms SSD300 (77.2%) by a large margin and even exceeds SSD512 (79.8%). To the best of our knowledge, our ADFPNet is the first method to obtain above 81% mAP with such a low-resolution input. By increasing the input size to 512 × 512, the performance of our method is further improved to 82.5% mAP, which is the best mAP among the most advanced VGG-16 based methods (e.g., RefineDet, RFBNet, and PFPNet-R). Our ADFPNet512 surpasses most of the two-stage object detectors, including the ResNet-101 based Faster R-CNN and R-FCN, and shows results similar to those of CoupleNet [48], which designs different coupling strategies and normalization methods to couple the global structure with local parts for object detection. Note that two-stage object detectors typically use high-resolution images (i.e., ~600 × 1000) as input and ResNet-101 as the base network, which yields higher detection performance but greatly increases the inference time. Compared to real-time methods such as SSD, YOLOv2, RefineDet, and RFBNet, ADFPNet not only exceeds them in performance, but is also on par with them in inference speed. In order to make our training process more intuitive, the loss and mAP curves of ADFPNet300 during training are shown in Fig. 3 and Fig. 4, respectively. The mAP is evaluated on the VOC 2007 test set every 10 epochs. Moreover, the precision-recall curve of ADFPNet300 tested on the VOC 2007 test set is shown in Fig. 5.
2) QUALITATIVE RESULT-The detection results across multiple objects and different scales on the VOC 2007 test set, compared with SSD300 [13], are shown in Fig. 6. They suggest that SSD misses objects at very small scales while the proposed method can capture and detect them successfully, which is attributed to the proposed module.

3) FEATURE MAP BEFORE AND AFTER CALIBRATION-
We also show the qualitative results of ADFPNet512 on the feature maps before and after self-calibration in Fig. 7 and 8. They suggest that the feature calibration block does depress the features offering less or sparse information by learning a lower weight (shown in the green dotted rectangles in Fig. 7 and 8(c)) and assigns a higher weight to the features containing useful information (shown in the red dotted rectangles in Fig. 7 and 8(c)).

B. PASCAL VOC 2012

Table 2 shows the detection accuracy of the proposed ADFPNet compared with the other state-of-the-art frameworks. To better demonstrate the effectiveness of our ADFPNet, we separately report the results of each category on the VOC 2012 test set. Compared with the frameworks using a similar input size, ADFPNet300 produces the best mAP of 79.0%, which even surpasses most two-stage frameworks using a much deeper base network (i.e., ResNet-101 [3]) and a larger input size of around 1000 × 600. When the input size is increased to 512 × 512, ADFPNet512 achieves the best mAP of 81.9%, outperforming the most recently proposed frameworks aiming at multi-scale objects by a large margin (e.g., the 80.3% mAP of DES512 [35] and the 80.0% mAP of DFPR512 [34]). To the best of our knowledge, ADFPNet is the first framework to obtain above 81% mAP on VOC 2012 without any bells and whistles.

Table 3 shows the comparison of our method with the other state-of-the-art methods on MS COCO. ADFPNet300 produces 31.8% mAP, which outperforms the other VGG-16 based detectors with the same input size of 300 × 300. It is also noticeable that the accuracy of the proposed ADFPNet300 is higher by 2.4% than that of RefineDet320, which designs an anchor refinement module (ARM) to filter out negative anchors and coarsely adjust positive anchors with slightly larger input images. The accuracy of ADFPNet300 even exceeds that of R-FCN based on a ResNet-101 backbone and is similar to that of RetinaNet400, which uses ResNet-101 as the backbone and a 400 × 400 input size.
It should be noted that our method is much better than recent advanced one-stage detectors that try to include multi-scale context information, such as DFPR [34], RFBNet [15], and PFPNet-R [17]. Furthermore, when testing with an input image size of 512 × 512, the performance of ADFPNet512 further improves to 36.4%, which outperforms most one-stage methods except the ResNet-101-FPN based RetinaNet800*, which adopted scale jitter, used an 800 × 800 input image, and was trained 1.5× longer than RetinaNet500. Compared with the two-stage methods, ADFPNet512 surpasses most of them except Faster R-CNN w/ TDM and Deformable R-FCN, which use complex backbones and a large input size (i.e., 1000 × 600).

C. MS COCO
Our proposed ADFPNet also shows excellent performance on small object detection on the COCO dataset. In COCO, approximately 41% of objects are small while only 24% are large, and small object detection remains a fundamental problem in computer vision. As shown in Table 3, ADFPNet300 and ADFPNet512 achieve 12.6% and 19.2% mAP, respectively, on small objects, which demonstrates the efficiency and advantage of the proposed method. Moreover, ADFPNet512 achieves the best AP on small objects among the VGG-based detectors and is even better than most ResNet backbone based detectors. We show the detection results of ADFPNet512 on the MS COCO test-dev set in Fig. 9.

VI. ABLATION STUDY

A. ARCHITECTURE ABLATION
We conduct experiments on the union of the VOC 2007 and VOC 2012 trainval sets to explore the influence of the ADFP module, the ADFP_L module, the adaptively feature calibration block, and more default boxes. The accuracy is evaluated on the VOC 2007 test set, as shown in Table 4. In all the experiments, the input image size is set to 300 × 300 and all the other hyperparameters are kept the same.

1) ADAPTIVELY FEATURE CALIBRATION BLOCK (ROW 2)-
To verify the effectiveness of the adaptively feature calibration block, we construct a variant network by removing it. As listed in Table 4, this variant already increases the performance by 3.5% mAP compared to the baseline. With the adaptively feature calibration block, the mAP is further improved from 80.7% to 81.1%.

2) MORE DEFAULT BOXES (ROW 3)-
In the original SSD, the feature map of conv4_3 contains fine details, which are critical for localization, but lacks strong semantic information, which is needed for classification. Therefore, only 4 default anchor boxes are associated with each location of conv4_3, conv10_2, and conv11_2, while 6 default anchor boxes are associated with each location of the other layers. We send the feature maps from conv4_3 into our ADFP_L module to produce a feature map containing rich details and semantic information, which is necessary for detecting small object instances. Thus, in order to improve the performance, especially for small instances, we set 6 default boxes, adding aspect ratios of 1/3 and 3, on the feature map from the ADFP_L module, which has no effect on the original SSD as mentioned in [15]. As shown in the third and fifth rows of Table 4, adding more default boxes increases the mAP from 80.3% to 81.1%.
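A sketch of how the per-location default boxes grow from 4 to 6 when the aspect ratios 3 and 1/3 are added; the scale values here are hypothetical, and we follow the SSD convention of an extra square box of scale sqrt(s · s'):

```python
import math

def default_boxes(scale, next_scale, aspect_ratios):
    # boxes at one location: a square box at the layer scale, an extra square
    # box at sqrt(scale * next_scale), and one box per non-unit aspect ratio
    extra = math.sqrt(scale * next_scale)
    boxes = [(scale, scale), (extra, extra)]
    for ar in aspect_ratios:
        boxes.append((scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes

print(len(default_boxes(0.1, 0.2, [2, 1 / 2])))            # 4 boxes
print(len(default_boxes(0.1, 0.2, [2, 1 / 2, 3, 1 / 3])))  # 6 boxes
```

Adding the two extreme aspect ratios gives the ADFP_L feature map the same 6-box configuration as the middle layers of the pyramid.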
3) ADFP MODULE-To demonstrate the effectiveness of the ADFP module, we redesign a simple network with only the ADFP module and use the original SSD with the new data augmentation as a baseline. SSD obtains a detection performance of 77.2%, as shown in the first row of Table 4. With the simple introduction of our ADFP module, this performance is improved to 80.2%. The 3% gain fully demonstrates that our module, which extracts features with different receptive fields in a dense way, can significantly boost detection performance.
Because the feature map produced from conv4_3 is much larger than the others, we correspondingly adjust the atrous rates to constitute a new module, denoted ADFP_L. As can be seen in the fourth and fifth rows of Table 4, adding the ADFP_L module further increases the performance by 0.9% mAP compared to the network with only the ADFP module. This is probably attributable to the sufficient contextual field obtained from conv4_3 when using larger atrous rates.

B. INFERENCE TIME STUDY
To quantitatively evaluate the inference time, we test SSD and ADFPNet with batch size 1 on our machine with an NVIDIA 1080ti, CUDA 9.0, and cuDNN v7 for a fair comparison. All the methods are trained on the VOC 07+12 trainval set and evaluated on the VOC 2007 test set with a 300 × 300 input size. We report all the results in Table 4. ADFPNet300 without the ADFP_L module outperforms the original SSD300 by a large margin (80.2% vs 77.2%), although it spends a little extra time (15 ms/img vs 8 ms/img). The addition of the ADFP_L module consumes almost no extra time but improves the performance by 0.9%. Finally, our framework has a 3.9% accuracy gain compared to SSD with an FPS of 62.5. This strongly demonstrates that our proposed ADFP and ADFP_L modules significantly promote the detection performance while meeting the needs of real-time detection (30 frames per second or better), as mentioned in [57] and [29].

VII. CONCLUSION
We present a novel adaptively dense feature pyramid network (ADFPNet) for object detection under the Single Shot MultiBox Detector (SSD) framework. The proposed network is able to detect objects across different scales by extracting feature maps with dense multi scales and receptive fields. Extensive experiments have been conducted on several public benchmarks, Pascal VOC 2007, Pascal VOC 2012, and MS COCO, to demonstrate the efficiency of our method, which achieves state-of-the-art performance without any bells and whistles. Moreover, the proposed method also achieves a good balance between detection accuracy and inference speed.

FIGURE 1. Architecture of the proposed adaptively dense feature pyramid network (ADFPNet). The proposed ADFP module first produces dense features across multiple scales and receptive fields; the dense features are then re-calibrated according to their contribution to the detection task. The proposed module is seamlessly connected to the conv4_3 and conv7 layers. We use larger atrous rates for the ADFP module after conv4_3 because of its larger feature resolution (denoted as ADFP_L).

FIGURE 3. Training loss of ADFPNet300 on the VOC 07+12 trainval set. The conf loss curve signifies the confidence loss, the loc loss curve signifies the localization loss, and the loss curve denotes their total. The horizontal axis represents the training epochs.

FIGURE 4. The mAP curve of ADFPNet300 trained on the VOC 07+12 trainval set and tested on the VOC 2007 test set. The horizontal axis represents the training epochs.

FIGURES 7 AND 8. The feature maps of ADFPNet512 before and after self-feature calibration. (a) shows a detection result; (b) shows the feature maps of channels 288 to 303 from the conv4_3 layer generated by SSD; (c) shows the feature maps generated by our method before feature calibration; and (d) shows the corresponding features of (c) after calibration. The numbers in (c) show the relative feature weights calculated by the adaptively feature calibration block.
The red dotted rectangles mark features that are weighted more, while the green dotted rectangles mark features that are depressed. Best viewed in color.

FIGURE 9. Detection results on the COCO test-dev set.

Note that train2017 signifies the COCO 2017 train set, which consists of exactly the same images as trainval35k. RetinaNet800* adopted scale jitter and was trained 1.5× longer than RetinaNet500 using an input image size of 800 × 800.