Single Shot Multibox Detector With Deconvolutional Region Magnification Procedure

In this paper, we aim to improve the accuracy of SSD (Single Shot Multibox Detector) on small and medium objects. To this end, we introduce a deconvolutional region magnification procedure in which the existing SSD layers act as a region proposal network and the proposed regions are magnified for recognition. In addition, features are extracted from a shallow layer, and a new feature pyramid is constructed on top of these structures. The features are then concatenated and fed into classification and regression modules as in SSD. The weights of the present model are obtained via a pre-training and re-training strategy. Evaluated on a test set assembled from samples in the PASCAL VOC and MS COCO datasets, the present model achieves mAPs (mean average precisions) of 42.4% and 74.7% for small and medium object detection, respectively, which are 27.1% and 15.6% better than SSD. These results demonstrate the effectiveness of the proposed method.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Kok-Lim Alvin Yau.
Recognizing objects and regions in images, together with their locations and classes, is a central topic of computer vision that has attracted enormous attention for decades. Recently, object detection has improved significantly owing to the emergence of deep learning techniques [1], [2], which are powerful methods for learning feature representations automatically from raw input data. Based on deep convolutional neural networks (CNNs), an increasing number of models and applications are devoted to designing object detection systems, the so-called object detectors. The Overfeat Network [3] is the first deep learning object detector that employs CNNs after a sliding window segmentation. It segments each image into several parts and classifies each part using an individual CNN. Subsequently, the final location and classification predictions are generated by combining the outputs of the previous two processes. The highly influential successors built on this two-step pipeline include the Region Convolutional Network (R-CNN) [4], Fast-RCNN [5], Faster-RCNN [6], as well as the extended Faster-RCNN with position-sensitive ROI (region of interest) pooling [7]. Although these models have achieved better accuracy, there are also models in which the predictions of class probabilities and object locations are combined into a single step. The single-step models offer real-time speed and memory savings while maintaining competitive accuracy. The most popular single-step object detectors are the Single Shot Multibox Detector (SSD) [8] and You Only Look Once (YOLO) [9], [10].
The former is the first model to propose training on a feature pyramid in which default boxes are generated from each grid cell on each feature map, and the latter is constructed in the same vein but uses only one feature map for classification and generates two default boxes for each grid cell, cropped directly on the input image. On the PASCAL VOC2007 test, SSD achieves 74.3% mAP (mean average precision) at 59 FPS (frames per second) on an Nvidia Titan X for 300 × 300 input, outperforming state-of-the-art methods [8]. However, since only one early layer is assigned to collect low-level features, the semantic information is insufficient, resulting in poor performance on small and medium object detection. A solution to this issue is highly desirable. Many improvements and refinements have been proposed in succession. Examples include DSSD (Deconvolutional SSD), in which the backbone network VGG16 is replaced by a more powerful one (i.e., ResNet-101) and followed by an hourglass network structure [11]; RSSD (Rainbow SSD), with a rainbow concatenation module [12]; DSOD (Deeply Supervised Object Detector), which is trained from scratch and uses a DenseNet architecture to improve parameter efficiency [13]; and FSSD (Feature Fusion SSD), with an elaborately designed feature fusion module [14]. For extensive reviews, see [15], [16]. Inspired by these advances, in this paper we focus on improving the performance of SSD on small and medium objects. To achieve this, we propose a new method that combines the essence of the region proposal network (RPN) in Faster-RCNN and the deconvolution operation in DSSD.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To enhance small and medium object detection, low-level features are extracted via a deconvolutional region magnification procedure in which the boxes predicted by the SSD layers are regarded as region proposals and a deconvolutional layer is introduced to magnify them. Our intuition is that larger feature maps can lead to higher classification accuracy. In addition, the feature extraction is reinforced with an existing convolutional layer in the backbone network VGG16. This layer is shallow, so more information about small and medium objects may be preserved. Features obtained from the newly added layers are concatenated with features from a newly created feature pyramid, which is inspired by the feature pyramid network (FPN) [17], and fed to classification and regression. The weights of the present model are obtained by a training strategy consisting of a pre-training stage followed by a re-training stage. We show that this strategy is effective: the mAPs for small and medium objects on the test set are nearly 27.1% and 15.6% higher than those of SSD.
The paper is organized as follows. In Section II, we begin by reviewing the structure of SSD and then describe the newly added layers with the underlying motivation. In Section III, training and testing are carried out, and the results are reported. The performance of the present model is compared, in both accuracy and speed, not only with SSD but also with other state-of-the-art SSD-based models designed for object detection across different scales. The conclusion is given in Section IV.

II. MODEL
In this section, the architecture and salient properties of SSD are briefly outlined. Then the main ingredients of our design strategies are supplied and explained in detail.

A. SSD
In SSD, the idea of anchor boxes, as in the RPN, and multiscale feature maps, as in the FPN, are combined to achieve a fast detection speed while still retaining a high detection quality. A sketch of the SSD architecture is shown in Fig. 1. It consists of a feature pyramid placed on top of a backbone convolutional base (e.g., VGG16), followed by non-maximum suppression (NMS) to produce the final detections. In the feature pyramid, each layer is responsible for detecting objects at a specific scale.
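The NMS step that closes the SSD pipeline can be illustrated with a minimal greedy sketch. The function name `nms`, the box layout `[x_min, y_min, x_max, y_max]`, and the IoU threshold of 0.45 are assumptions made for illustration; the text above does not state the threshold used, and the sketch is not taken from the authors' Caffe code.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of [x_min, y_min, x_max, y_max] with positive area.
    scores: (N,) confidences. Returns indices of the kept boxes.
    The threshold 0.45 is an assumed value, not one given in the text.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]  # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop every remaining box that overlaps the kept box too much.
        order = rest[iou <= iou_threshold]
    return keep
```

In a multi-class detector such a routine would typically be run once per class on the boxes that pass the confidence threshold.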
SSD generates the detection results directly from feature maps at different levels. The low-level feature maps may contain essential location information, but their semantic and context information may be insufficient. Besides, small objects may lose their information when passing through the backbone network, resulting in missed detections. In SSD, only one low-level layer is allotted to detect small objects, i.e., conv4_3 in the existing VGG16. The detailed information may not be enough to recognize small objects correctly. Moreover, there are five layers above conv4_3 with decreasing size and resolution. These layers give enough information for large object detection but still not enough for medium object detection. For example, in Fig. 2, we visualize feature maps with detection results from layers conv4_3, fc7, and conv8_2 in SSD. The confidence for the category cat ranges from 0.1 to 0.36, and in certain cases the cat is incorrectly classified as a person or a dog. The detection results from the low-level feature map conv4_3 are even worse: only one cat is correctly recognized, with a small confidence of 0.1. If we set a confidence threshold higher than that value, such as 0.3, there will be no recognition at all. For the other feature maps, the confidences increase but the accuracies are still unsatisfactory. As a result, there is considerable room for improving small and medium object detection within the framework of SSD.
[Figure caption: Here, an enlarged feature map is obtained from the deconvolutional layer Deconv2 that follows the convolutional layer conv4_3, and features are also extracted from an existing layer conv3_3, which is shallower than conv4_3.]

B. IMPROVEMENTS FOR SMALL AND MEDIUM OBJECT DETECTIONS
The strategy to improve the detection of small and medium objects is twofold, as sketched in Fig. 3. First, additional features are extracted from an existing layer conv3_3 in the backbone network. The intuition is that small and medium objects may not retain any information at the very top layers, whereas a shallow layer may preserve more information about these objects. If the chosen layer is too shallow, however, it will not carry enough semantic information. Thus, we choose the lower layer next to conv4_3 to reconcile this trade-off. Second, the deconvolutional region magnification procedure is applied to the feature maps extracted by the layer conv4_3. This procedure consists of a series of operations, which we elaborate in the following. The deconvolution operation is used to increase the resolution of the feature maps. After the deconvolution operation, the size of an output feature map d is increased as

$$ d = s(i - 1) + k - 2p \tag{1} $$

where s is the stride; k is the size of the deconvolution filter; i is the size of the input feature map; p is the amount of zero padding. Since the original size of the feature maps corresponding to conv4_3 is 38 × 38, we introduce a deconvolutional layer Deconv2 with s = 8, k = 6, and p = 1, giving rise to output feature maps of size 8 × 37 + 6 − 2 = 300, i.e., 300 × 300. This size is equal to that of the input image, so features of small and medium objects may be easily captured. The low-level feature map is large and may contain essential location information, but its semantic and context information may be insufficient. Therefore, small objects are mainly detected by the low-level feature map, and medium objects are detected by the high-level feature map. As with the RPN in Faster-RCNN, we introduce the concept of region proposals, which here are simply the boxes predicted by the SSD architecture retained in our model.
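The stated Deconv2 hyper-parameters can be checked against the standard transposed-convolution size relation d = s(i − 1) + k − 2p (no output padding, no dilation). The short sketch below assumes that this standard relation is the one intended in the text; the function name `deconv_output_size` is hypothetical.

```python
def deconv_output_size(i, k, s, p):
    """Output side length of a transposed convolution.

    i: input size, k: filter size, s: stride, p: zero padding.
    Assumes no output padding and no dilation.
    """
    return s * (i - 1) + k - 2 * p

# conv4_3 feature maps are 38 x 38; Deconv2 uses s = 8, k = 6, p = 1.
print(deconv_output_size(i=38, k=6, s=8, p=1))
```

With these values the result equals the 300 × 300 input resolution, which is consistent with the statement that the enlarged feature map has the same size as the input image.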
Then, the region proposals are mapped to the enlarged feature maps according to

$$ r_w = \frac{d_w}{f_w} \times img_w, \qquad r_h = \frac{d_h}{f_h} \times img_h \tag{2} $$

where $r_{w/h}$ is the width/height of the region proposal on the enlarged feature map; $d_{w/h}$ is the width/height of the region proposal on the input feature map; $f_{w/h}$ is the width/height of the input feature map; $img_{w/h}$ denotes the size of the input image. Once all region proposals are mapped, the proposed regions are cropped from the enlarged feature map. Max pooling is applied to resize each proposed region to 38 × 38. The feature maps in all other channels are processed by the same procedure. Although the final output of the deconvolutional region magnification procedure has the same size as the original conv4_3, it may contain more information about the target object and thus makes the detection of small and medium objects easier. An example is shown in Fig. 4. After the backbone network, an input image of size 300 × 300 becomes a set of feature maps. We visualize one of those associated with conv4_3 in Fig. 4(b). The deconvolutional layer Deconv2 enlarges this feature map to the size 300 × 300. After cropping and max pooling, the proposed region returns to the size of 38 × 38. Notice that the activated regions in the feature maps of Fig. 4(b) and Fig. 4(d) have the same shape, but the feature map in Fig. 4(d) contains more information, which makes the detection easier. In fact, the deconvolutional region magnification procedure can be regarded as a ''zoom in'' operation. In the detection pipeline, the feature maps corresponding to conv4_3 are all replaced by those zoomed feature maps.
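The crop-and-pool part of the ''zoom in'' operation can be sketched on a single channel as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the proposal is assumed to be given in 38 × 38 feature-map coordinates, it is scaled by the ratio of the enlarged map size to the feature-map size, and an adaptive max pooling (bin boundaries chosen here for illustration) resizes the crop to 38 × 38. The names `map_proposal`, `adaptive_max_pool`, and `magnify` are hypothetical.

```python
import numpy as np

def map_proposal(box, f_size, img_size):
    """Scale (x_min, y_min, x_max, y_max) from an f_size grid to an img_size grid."""
    scale = img_size / f_size
    x0, y0, x1, y1 = box
    x0 = max(int(np.floor(x0 * scale)), 0)
    y0 = max(int(np.floor(y0 * scale)), 0)
    x1 = min(int(np.ceil(x1 * scale)), img_size)
    y1 = min(int(np.ceil(y1 * scale)), img_size)
    return x0, y0, x1, y1

def adaptive_max_pool(region, out_size):
    """Max-pool a non-empty 2-D region to out_size x out_size with adaptive bins."""
    h, w = region.shape
    out = np.empty((out_size, out_size), dtype=region.dtype)
    for r in range(out_size):
        r0 = (r * h) // out_size
        r1 = ((r + 1) * h + out_size - 1) // out_size  # ceiling division
        for c in range(out_size):
            c0 = (c * w) // out_size
            c1 = ((c + 1) * w + out_size - 1) // out_size
            out[r, c] = region[r0:r1, c0:c1].max()
    return out

def magnify(enlarged_map, box, f_size=38, out_size=38):
    """Crop the mapped proposal from the enlarged map and pool it back to out_size."""
    img_size = enlarged_map.shape[0]
    x0, y0, x1, y1 = map_proposal(box, f_size, img_size)
    return adaptive_max_pool(enlarged_map[y0:y1, x0:x1], out_size)

# Usage on a random stand-in for one channel of the Deconv2 output.
rng = np.random.default_rng(0)
enlarged = rng.random((300, 300))
zoomed = magnify(enlarged, box=(10, 10, 18, 18))
print(zoomed.shape)
```

In the full model the same crop and pooling would be repeated for every channel and every region proposal.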

C. ARCHITECTURE
The higher-level layers may contain more semantic information and correspond to larger receptive fields. Therefore, they are responsible for the detection of large objects. However, in the present model the layers in SSD only play the role of an RPN, so we need to design a new feature pyramid. To this end, we gather all feature maps from the existing layers Fc7, conv8_2, conv9_2, conv10_2, and conv11_2 and assemble them into a new feature pyramid. To distinguish them from those of SSD, the layers in this new feature pyramid are labeled conv3, conv4, conv5, conv6, and conv7. As shown in Fig. 5, these newly added layers not only have the same hyper-parameters, such as the kernel size of the filters, but also share the same weights and feature maps as their counterparts in the existing SSD. For each grid cell on each feature map, default boxes are generated in the same way as in SSD, and scores for each category and offsets for the bounding boxes are predicted.
The feature maps corresponding to the deconvolutional layer Deconv2 should also be assigned default boxes. Since these feature maps are cropped from an enlarged one, the mapping between a default box on a feature map and the bounding box on the input image should be modified. This mapping in SSD is expressed by Eq. 3, where $c_{x/y}$ is the center coordinate of the default box on the feature map; $img_{cx/cy}$ is the center coordinate of the bounding box; $w_k/h_k$ is the width/height of the bounding box, with $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ being its top-left and bottom-right coordinates. For a feature map after the deconvolutional region magnification procedure, Eq. 3 should be modified, with $f_{center\,x/y}$ denoting the center coordinate of the bounding box and $(\bar{x}_{\min}, \bar{y}_{\min}, \bar{x}_{\max}, \bar{y}_{\max})$ being its top-left and bottom-right coordinates. Since additional layers are stacked on the building block of SSD, we add seven convolutional layers for classification and another seven convolutional layers for bounding box regression. In each of these layers, we use 3 × 3 filters with $L_n = k \times 4$ and $C_n = k \times c$ channels for location and classification predictions, respectively, where c is the number of classes and k is the number of default boxes on each grid cell. The predicted scores and bounding box offsets of the default boxes from all layers are concatenated and fed into a combined loss function in the same way as in SSD. With these improvements, the overall architecture of our model is plotted in Fig. 6. Here, the layers in SSD are depicted in the blue box, the layers that improve small and medium object detection are depicted in green boxes, the new feature pyramid is depicted in the red box, and the classification and regression modules are depicted in the yellow box.
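The channel counts of the prediction layers follow directly from k and c. The sketch below only evaluates the two products given in the text; the function name `head_channels` and the example values (k = 6 default boxes, c = 8 classes, i.e., seven object classes plus an assumed background class) are illustrative assumptions rather than settings confirmed by the paper.

```python
def head_channels(k, c):
    """Output channels of the 3 x 3 prediction filters for one feature map.

    k: number of default boxes per grid cell, c: number of classes.
    Returns the location (L_n = k * 4) and classification (C_n = k * c) widths.
    """
    return {"loc": k * 4, "conf": k * c}

# Hypothetical example: 6 default boxes per cell, 7 classes plus background.
print(head_channels(k=6, c=8))
```

Each location channel group holds the four box offsets of one default box, and each classification channel group holds its c class scores.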

D. LOSS FUNCTION
In this paper, a multi-task loss function is used [8]. The overall loss function is the weighted sum of the confidence loss (conf) and the localization loss (loc):

$$ L(x, c, l, g) = \frac{1}{N}\left( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \right) \tag{5} $$

where N is the number of matched default boxes (if N = 0, we set the loss to 0), and α is the weight of the localization loss, which is set to 1. The confidence loss is the softmax loss over the multiple class confidences (c).
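A compact numerical sketch of this multi-task loss is given below. The softmax confidence loss, the weight α = 1, and the N = 0 rule follow the text; the Smooth L1 form of the localization loss is taken from the original SSD formulation [8] and is an assumption here, and hard negative mining is omitted for brevity (all boxes contribute to the confidence term). The function names are hypothetical.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 penalty applied element-wise (assumed localization loss)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def softmax_cross_entropy(logits, labels):
    """Per-box softmax loss; logits: (B, C), labels: (B,) integer classes."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def multibox_loss(logits, labels, loc_pred, loc_target, alpha=1.0):
    """(L_conf + alpha * L_loc) / N, where N counts matched (non-background) boxes.

    labels == 0 marks background. Hard negative mining is omitted in this sketch.
    """
    labels = np.asarray(labels)
    pos = labels > 0
    n = int(pos.sum())
    if n == 0:  # no matched default boxes: the loss is defined as 0
        return 0.0
    l_conf = softmax_cross_entropy(logits, labels).sum()
    l_loc = smooth_l1(loc_pred[pos] - loc_target[pos]).sum()
    return float((l_conf + alpha * l_loc) / n)
```

Only matched boxes contribute to the localization term, which is why the normalization uses N rather than the total number of default boxes.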

III. EXPERIMENTAL RESULTS AND DISCUSSIONS
In this section, we train the present model and examine its performance.

A. EXPERIMENTAL SETTING AND TRAINING STRATEGY
The code is built on Caffe [18]. We train the present model on a computer with Ubuntu 16.04, an Intel Xeon E5-2640 v4 CPU, and eight Nvidia Titan Xp GPUs, each with 12 GB of graphics memory. To evaluate our model in the experiments, the MS COCO evaluation metrics [19] are adopted, which divide the objects into three scales according to their areas: small (area < $32^2$), medium ($32^2$ < area < $96^2$), and large (area > $96^2$). According to these evaluation metrics, we select seven classes (i.e., bicycle, bus, car, cat, dog, motorbike, and person) from the PASCAL VOC and MS COCO datasets, all of which contain objects that meet the definition of small and medium objects. Moreover, the number of small and medium objects within these classes is larger than in other classes. From these pictures, we assemble a pre-training dataset on which the layers within SSD are pre-trained using the same training policy as in Ref. [8]. In the following, the pre-trained SSD is also called SSD, although its weights are different from those in Ref. [8]. Then, we pick 3376 pictures from PASCAL VOC and MS COCO to assemble a re-training dataset. The present model, with weights from pre-training, is re-trained on this dataset. The parameters of all the newly added convolutional layers are initialized with the Xavier method [20]. During the pre-training, we minimize the joint localization and confidence loss. We apply the same matching strategy, hard negative mining strategy, and data augmentation as described in Ref. [8]. Using SGD (stochastic gradient descent) with an initial learning rate of $10^{-4}$, 0.9 momentum, 0.0005 weight decay, and a batch size of 20, the optimization is completed after 120000 iterations. During the re-training, no data augmentation is involved, and the optimization is completed after 47600 iterations. For testing, we select 423 images with small objects and 456 images with medium objects from PASCAL VOC and MS COCO.
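The scale split used by the evaluation can be written as a small helper. The thresholds 32² and 96² come from the text; how an area exactly equal to a threshold is assigned is not specified there, so the boundary handling below is an assumption, as is the function name `coco_scale`.

```python
def coco_scale(area):
    """Classify an object by pixel area into the three MS COCO scales.

    Boundary handling (an area exactly equal to 32**2 or 96**2) is an
    assumption; the text only gives strict inequalities.
    """
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

# Hypothetical boxes given as (width, height) in pixels.
for w, h in [(20, 20), (50, 50), (100, 100)]:
    print((w, h), coco_scale(w * h))
```

A test set of small or medium objects, like the one described above, could be assembled by filtering annotations with such a helper.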
The present model is evaluated on a computer with Ubuntu 16.04, an Intel Core i5-7400 CPU, and an Nvidia GeForce GTX1060 GPU with 6 GB of graphics memory. In the rest of this section, we visualize the detection results of our model for small and medium objects and compare the results with SSD. Furthermore, the overall timing and accuracy of the present model are compared with those of SSD and other methods.

B. ABLATION STUDY
We investigate the effectiveness of the different components of our model through an ablation study. The experimental results are shown in Table 1 and Table 2. ''Deconvolution'' refers to introducing a deconvolutional layer to magnify the low-resolution feature map. The deconvolution improves the mAP of SSD for small and medium object detection by 14.4% and 6.6%, respectively, because larger feature maps may lead to higher classification accuracy for small and medium objects. ''Region proposal'' means that the existing layers in SSD play the role of the region proposal network and the proposed regions are magnified for recognition. By adding ''Deconvolution'' and ''Region proposal'', the model performance increases from 15.3% mAP to 34.2% mAP for small objects, as shown in Table 2. Moreover, our model increases the performance by 8.2% for medium objects, as shown in Table 1. ''Feature pyramid'' represents a new feature pyramid that is constructed from the shallow layers. Since the feature map generated by the newly created feature pyramid not only preserves the features of small objects but also improves the accuracy of object classification, the mAP is improved by 20.8% and 10.4% for small and medium objects, respectively.

C. RESULTS AND DISCUSSIONS
We illustrate some detection examples of specific layers in Fig. 7 and Fig. 8. For a partially covered car in Fig. 7, the detection result corresponding to layer conv4_3 in SSD is plotted in the right panel, and the left panel shows the detection result corresponding to layer Deconv2 in the present model. The bounding box given by SSD does not match the object size exactly, and the corresponding confidence is 0.02. In contrast, the output of the present model not only matches the correct object size but also gives a confidence nearly 6× higher. For dense objects in Fig. 8, i.e., the people in the picture, the detection results of SSD are compared with those of the present model. Here, panels (a), (b), and (c) correspond to layers Fc7, conv8_2, and conv9_2 in SSD, and panels (d), (e), and (f) correspond to layers conv3, conv4, and conv5 in the present model. Notice that the detection of dense medium objects by the layers Fc7, conv8_2, and conv9_2 is not satisfactory: numerous people are not recognized by the detector.
Examples of the detection output of SSD and the present model with scores higher than 0.3 are shown in Fig. 9. Here, green bounding boxes represent objects that are recognized by the present model but missed by SSD, yellow bounding boxes represent false recognitions by SSD, and purple bounding boxes represent recognitions for which the present model has higher confidence than SSD. Fig. 9 illustrates that the improvements we made indeed result in better discrimination by the detector.
The average precision (AP) for a specific category and the mAP over all categories are effective metrics for evaluating object recognition models. We calculate the AP and mAP of the present model and other models, and the results are shown in Table 3 and Table 4. Using the code from Ref. [14], we reproduce a recently developed variant of SSD, i.e., FSSD. This model introduces an elaborately designed feature fusion module in which features from different layers in the feature pyramid are concatenated. A new feature pyramid is then established by pooling the concatenated feature maps to various sizes. Each of these feature maps may include vital location and semantic information. Using the same training strategy and testing data, we also obtain the detection results of several other state-of-the-art SSD-based object detectors designed for object detection across different scales; these are also provided in Table 3 and Table 4. Here, the outputs for medium objects are provided in Table 3 with confidence higher than 0.3, and Table 4 displays small object detection with confidence higher than 0.1. From Table 3 we note an obvious increase in the values of mAP. In particular, the mAP for medium object detection of the present model increases by 15.6% compared to SSD. In addition, the results of our model are also higher than those of the other methods. A similar tendency can be found for small object detection, as shown in Table 4, where the mAP increases by 27.1% compared to SSD. Even compared with the recent state-of-the-art model TDFSSD, the mAP of our model is 3.1% higher.

TABLE 4. Average precision of small object detection for every category and mean average precision (mAP) for all categories. Here, the confidence threshold is taken as 0.1.
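For reference, the AP and mAP metrics used in this section can be computed along the following lines. The paper does not state which AP variant is used, so the all-point interpolation of the precision-recall curve shown here (in the style of later PASCAL VOC evaluations) is an assumption; the function names and the example numbers are hypothetical.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Area under the interpolated precision-recall curve for one category.

    scores: detection confidences; is_true_positive: 1/0 flag per detection
    (after matching to ground truth); num_gt: number of ground-truth objects.
    """
    order = np.argsort(scores)[::-1]  # rank detections by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Pad the curve and take the monotonically decreasing precision envelope.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

def mean_average_precision(ap_by_class):
    """mAP as the unweighted mean of per-category APs."""
    return float(np.mean(list(ap_by_class.values())))

# Hypothetical example: three ranked detections, two ground-truth objects.
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2)
print(ap)
```

Only detections above the chosen confidence threshold (0.3 for medium objects and 0.1 for small objects in the tables) would be passed to such a routine.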
To examine the sensitivity for a specific category such as pedestrians, we pick 1100 pictures from MS COCO that include persons of various sizes and positions. Using these pictures, we calculate the true positive rate (TPR), which is defined as

$$ \mathrm{TPR} = \frac{TP}{TP + FN} \tag{8} $$

where TP is the number of true positive recognitions and FN is the number of false negative recognitions. The result is listed in Table 5. Notice that the TPR of our model increases by 10.0% compared to SSD. Among the state-of-the-art methods compared, our model ranks first except for AugFPN. The detection speed is another critical criterion for an object detector with potential applications in real-time systems, and it is often measured in FPS. Table 6 compares the detection speed of SSD, our model, and the other models in our testing environment. FSSD runs at 48 FPS, a little slower than SSD, which runs at 55 FPS. Our model, however, is slower than the previous models: it runs at 24 FPS, so for a single image it consumes roughly twice as much time as the previous models. As a matter of fact, speed versus accuracy is the main trade-off of object detectors. The newly added layers that help to improve the accuracy also deepen the network. With these layers, the model needs more computation during forward and backward propagation, which slows down both training and inference. Since a typical video stream usually runs at 25 FPS, the present model could still satisfy the requirement of real-time detection.
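The TPR definition and the speed comparison amount to simple arithmetic, sketched below. The frame rates of 55 FPS and 24 FPS are the values reported in the text; the TP and FN counts are hypothetical placeholders, since the per-model counts are not given here, and the function names are illustrative.

```python
def true_positive_rate(tp, fn):
    """TPR = TP / (TP + FN)."""
    return tp / (tp + fn)

def ms_per_frame(fps):
    """Average processing time per image in milliseconds."""
    return 1000.0 / fps

# Hypothetical counts, for illustration only.
print(true_positive_rate(tp=80, fn=20))

# Reported speeds: SSD at 55 FPS, the present model at 24 FPS.
ssd_ms = ms_per_frame(55)
ours_ms = ms_per_frame(24)
print(round(ssd_ms, 1), round(ours_ms, 1), round(ours_ms / ssd_ms, 2))
```

The per-image time ratio equals 55/24, which is consistent with the statement that the present model needs roughly twice as much time per image as SSD.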

IV. CONCLUSION
In summary, we have added a series of improvements to the existing SSD architecture. These improvements significantly increase the accuracy of small and medium object detection over previous attempts. They include using an extra shallow layer in the backbone network, using a deconvolutional region magnification procedure to magnify low-level feature maps, and constructing a new feature pyramid on top of the existing SSD structure. With these modifications, we achieve a performance much better than that of SSD, even when training on a dataset with a small number of samples. However, the newly added layers are time-consuming and reduce the speed by about half, although the model is still fast enough for real-time applications. In the future, it would be worthwhile to enhance our model with much stronger backbone networks such as ResNet [23] and DenseNet [24], and to design a lightweight version of the model that is more appropriate for embedded systems.