Dilated Convolution and Feature Fusion SSD Network for Small Object Detection in Remote Sensing Images

Noting the shortcomings of current methods in detecting small objects in image-based remote sensing applications, we propose a novel implementation of the single shot multibox detector (SSD) network based on dilated convolution and feature fusion, which we call the dilated convolution and feature fusion single shot multibox detector (DFSSD). The algorithm removes the random clipping step of the data preprocessing layer in the conventional SSD network and uses the feature pyramid network (FPN) structure to fuse the high-resolution low-level feature map with the semantically rich high-level feature map. It also enlarges the receptive field of the third-level feature map of the DFSSD network by using dilated convolution. In the data processing step, we segment the images based on feature point region proposals to increase the training sample size. When tested on remote sensing datasets, the proposed DFSSD network achieves a mean average precision (mAP) of 76.51%, which is significantly higher than that of the SSD model (69.81%).


I. INTRODUCTION
Object detection has always been a research hotspot in the field of computer vision [1]. Detecting objects from general classes is the supporting technology for a large number of applications, including intelligent monitoring [2] and intelligent robotics [3]. For instance, several mature methods have been developed for face detection [4] and pedestrian detection [5] in surveillance systems and self-driving cars, and they achieve reasonable performance. However, the detection accuracy for a general class of objects with heterogeneous shapes, sizes, patterns, colors, and morphology is far from satisfactory. The source of difficulty is developing a unified method that captures the object-specific features of objects with diverse sizes, shapes, and colors. It is difficult to find common features, especially for traditional machine learning methods that rely on manually designed feature extraction. Recently, more and more researchers have turned their attention to deep learning (DL) methods [6]. There exist many excellent object detection methods based on deep learning architectures and platforms such as AlexNet [7], ZFNet [8], VGGNet [9], GoogLeNet [10], R-CNN [11], Faster R-CNN [12], and SSD [13]. Among them, the single shot multibox detector (SSD) model is a network architecture based on convolutional neural networks (CNN) with relatively high accuracy and near real-time performance. In this paper, the images in the remote sensing dataset range from 800 × 800 to 4000 × 4000 pixels; small objects occupy between 10 × 10 and 50 × 50 pixels, and medium objects between 50 × 50 and 300 × 300 pixels. However, the feature maps that the SSD model uses for prediction are not reused and lack sufficient semantic information, so the detection of overlapping objects and small objects is poor.
The associate editor coordinating the review of this manuscript and approving it for publication was Alicia Fornés.
Many scholars have carried out research on improving the small object detection capability of the SSD model. Li et al. [14] proposed the feature fusion single shot multibox detector (FSSD) model, which reconstructs the multi-scale features of the model through feature fusion and down-sampling operations, and enriches the feature details to improve the detection performance on small objects. Liu et al. [15] proposed the DeepSat classification framework based on "hand-made" features and a deep belief network (DBN). The framework augments a CNN with handcrafted features (instead of using a DBN-based architecture) for classification. This method achieves superior performance on the SAT-4 and SAT-6 datasets, with accuracies of 99.90% and 99.84%, respectively. Zhou et al. [16] proposed a multi-level feature extraction method to solve the problem of object loss in discontinuous object tracking in image-based remote sensing systems. Chen et al. [17] proposed an improved semantic segmentation neural network based on DeepLabv3, which adopts dilated convolution, a fully connected (FC) fusion path, and a pre-trained encoder for the semantic segmentation of HRRS imagery, reaching a classification accuracy of 91%. Duarte et al. [18] proposed three multi-resolution CNN feature fusion methods to improve the classification accuracy of building damage in remote sensing images, reaching an accuracy of 88.7% on the satellite and aerial (unmanned) cases. Ni et al. [19] proposed a learnable CNN framework based on multilayer energized locality constrained affine subspace coding (MELASC), which improved the accuracy of scene classification for image-based remote sensing applications.
In this paper, we propose an image double segmentation method based on feature point regions, which segments the original remote sensing images, retains as much of the image information as possible, and reduces the adverse effects of segmentation. More specifically, we propose the dilated convolution and feature fusion single shot multibox detector (DFSSD) network, which combines the high-level feature map and the low-level feature map to improve the spatial semantic information of the low-level feature map. At the same time, dilated convolution [20] is used to allow the third-layer features to participate directly in the prediction, further enriching the detailed information of the network features. On remote sensing datasets including 700 aircraft and 938 car remote sensing images, the performance of the proposed DFSSD network is not inferior to that of networks of the same type, while the mAP is increased by 4% compared with the original SSD network model.

II. RELATED WORK
Compared with traditional image-based object detection, object detection using aerial images encounters problems such as the difficulty of detecting small objects and the lack of sufficiently representative features for objects. Traditional methods generally use the scale-invariant feature transform (SIFT) [21], speeded up robust features (SURF) [22], features from accelerated segment test (FAST) [23], binary robust invariant scalable keypoints (BRISK) [24], and so forth to detect objects. Although these methods yield reasonable performance for object detection under plain backgrounds, they do not achieve good results under complex backgrounds. Therefore, deep learning methods, with their higher capability of capturing intricate patterns, have received increased attention from the research community and have been applied to various tasks in image-based remote sensing, such as military guidance, object tracking, and urban planning. The most popular and successful deep learning methods are based on the CNN architecture due to its enhanced performance in modeling visual information. At present, DL-based object detection methods can be divided into two main categories: object detection algorithms based on candidate regions and object detection algorithms based on regression models.
The calculation process of an object detection algorithm based on candidate regions includes the following steps. Firstly, n regions of interest (ROI) are extracted from the input image by a region selection algorithm. The commonly used selection algorithms are selective search, edge boxes, the region proposal network (RPN), and so on. Then, a multi-layer convolutional neural network is used to extract features from these regions of interest and to classify the extracted features. Finally, bounding box regression is used to correct the output window and provide the final result. Implementations of object detection algorithms based on candidate regions include R-CNN [11], Fast R-CNN [25], Faster R-CNN [12], and R-FCN [26].
Although the above-mentioned object detection algorithms based on candidate regions can provide high accuracy, they cannot detect moving objects in videos in real time. Indeed, object detection methods without RPN networks have an advantage in terms of operation speed. Deep learning object detection algorithms based on regression models can identify objects at multiple locations of the original image or the feature map, and directly obtain the type and the location of each object. Some successful implementations include YOLO [27], SSD [13], YOLO9000 [28], DSSD [14], RSSD [16], and FSSD [14].
More and more CNN-based methods have been used in the field of image-based remote sensing [29], [30]. Zhang et al. [31] constructed an iterative weakly supervised learning framework, which can automatically mine and augment the training datasets from the original images. This method combines the framework with the candidate RPN to locate aircraft in large-scale and extremely high-resolution images. Cai et al. [32] designed an end-to-end convolutional neural network to detect airport objects. The authors proposed a method of mining difficult samples and integrated it into the network architecture to train the end-to-end deep convolutional neural network for airport detection in complex situations, reaching an accuracy of 83.02% on an optical remote sensing dataset acquired from Google Earth. Pang et al. [33] proposed a unified and self-reinforced network called the remote sensing region-based convolutional neural network (R2-CNN) to detect small and medium objects in remote sensing images. The network is composed of the backbone Tiny-Net, an intermediate global attention block, and a final classifier and detector, and achieves high recall and precision on GF-1 and GF-2 images. Zhao et al. [34] proposed an aircraft detection method for remote sensing images based on the block-level F-CNN, combining the image block-level fully convolutional neural network model with a multi-scale structure for object detection and reaching an accuracy of 83.02% on an aircraft dataset from the Beijing Capital International Airport. Liu et al. [2] proposed an end-to-end multi-component fusion network (MCFN) for detecting small airport objects in remote sensing images, composed of a dual pyramid fusion network (DPFN), a relative region proposal network (RRPN), and a contextual information network (CIN).
VOLUME 8, 2020
In short, existing object detection methods have some limitations on remote sensing images. First, because of the limitations of CNNs, the semantic information of the low-level feature map is relatively scarce, although it presents the object location accurately. In contrast, the semantic information of high-level features is rich, but they present the object location imprecisely. In addition, previous methods cannot adequately extract the features of small objects. Finally, when the object is in a complex scene, the accuracy of previous algorithms decreases.

III. PROPOSED WORK
A. IMAGE SEGMENTATION OF FEATURE POINT REGION PROPOSALS
When facing complex scenes, the human visual system can quickly perceive different objects of interest and give preference to them. This perception ability of the human visual attention mechanism is independent of the detection environment and operates solely on the contrast between the desired objects and the background [35]. In the field of computer vision, using this biologically inspired feature, we can reduce the redundancy of information and quickly detect object information under various kinds of environmental interference [36], [37].
Since remote sensing images are typically very large, directly processing the original image with convolutional networks may cause the training of the network to diverge. Even if it converges, the accuracy of the subsequent object detection and the generalization ability of the model may not be ideal. Therefore, inspired by the characteristics of the human vision system, we propose an image double segmentation method based on feature point regions, which is used to generate the feature point map of the remote sensing images, as shown in Fig. 1.
In particular, the human visual attention mechanism is based on quickly perceiving objects with large color changes. In this paper, we use feature points to represent the objects in the pictures, and automatically select the regions with a relatively large number of feature points as the training dataset. The original size of the images in the remote sensing dataset ranges from 800 × 800 to 4000 × 4000. After the image size test step, as shown in Table 1, the size of the segmented image is reduced to 512 × 512. To find the feature points, we use the binary robust invariant scalable keypoints (BRISK) method [24], which has desirable properties such as rotation invariance, scale invariance, robustness, and relatively fast speed. Finally, we apply double segmentation to the remote sensing image with feature points to obtain segmented images of different regions, count the number of feature points in each small image, keep the small images that contain at least 30 to 60 feature points, and generate an XML file labeled with the object data in each small image. In this way, the number of pictures is increased from 1638 to 31560.
The double segmentation method cuts the original image twice, from two directions. The first pass starts from the lower left corner of the original image and proceeds to the upper right corner, with a segmentation size of 512. If the length of the original image is less than 512, it is not segmented in the horizontal direction; similarly, if the width of the original image is less than 512, it is not segmented in the vertical direction. If the part that remains when the segmentation reaches the rightmost and topmost sides is smaller than 512 in length and width, this part of the image is discarded. The second pass starts from the upper left corner and proceeds to the lower right corner, and the segmentation method is the same as in the first pass. If a segmented image contains no object, it is not used as training data and is removed directly.
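The two cutting passes described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, image coordinates are assumed to have y growing downward, and removing tiles that both passes produce identically is an assumption made here for convenience.

```python
def tile_origins(length, tile=512):
    """Start offsets of full tiles along one axis; a short remainder is dropped."""
    if length < tile:
        return [0]  # axis shorter than one tile: do not cut along it
    return list(range(0, length - tile + 1, tile))


def double_segmentation(width, height, tile=512):
    """Return crop boxes (x0, y0, x1, y1) produced by the two passes."""
    xs = tile_origins(width, tile)
    ys_top = tile_origins(height, tile)  # second pass: anchored at the top edge
    if height < tile:
        ys_bottom = [0]
    else:
        # first pass: anchored at the bottom edge, stepping upward
        ys_bottom = [height - tile - y for y in ys_top]
    boxes = []
    for ys in (ys_bottom, ys_top):
        for y0 in ys:
            for x0 in xs:
                box = (x0, y0, min(x0 + tile, width), min(y0 + tile, height))
                if box not in boxes:  # assumption: identical tiles are kept once
                    boxes.append(box)
    return boxes
```

For a 1200 × 1100 image, the two passes give eight distinct 512 × 512 tiles, because the bottom-anchored and top-anchored rows do not coincide; for a 1024 × 1024 image both passes give the same four tiles.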
According to the annotation mapping formula shown in formula (1), the annotation XML file of each segmented sub-image is generated, where x_min, y_min, x_max, y_max are the coordinates of the annotation box, l_s is the starting position of the abscissa of the cut image on the original image, l_end is the ending position of the abscissa of the cut image on the original image, w_s is the starting position of the ordinate of the cut image on the original image, w_end is the ending position of the ordinate of the cut image on the original image, C is the size of the cut image (C = 512 in this paper), and n is the number of feature points in the cut area.
If n ≥ 50, the annotation information of this object is kept completely, and the difficult item in the XML file is set to 0. If 30 ≤ n < 50, the annotation box is kept, but the difficult item in the XML file is set to 1. If n < 30, the annotation information of this object is removed. In this way, we not only keep as much of the segmented object information as possible, but also avoid the increase in false detections caused by too little object information and too much background information.
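The thresholds on n can be combined with a coordinate mapping in a short sketch. The clip-and-shift mapping below is an assumption chosen for illustration and is not necessarily identical to formula (1); the function name and argument layout are hypothetical, while the thresholds 30 and 50 come from the text.

```python
def map_annotation(box, tile, n_points):
    """Map one annotation box into the coordinates of a cut image.

    box      = (x_min, y_min, x_max, y_max) in original-image pixels
    tile     = (l_s, w_s, l_end, w_end), the cut window in original-image pixels
    n_points = n, the number of feature points counted for this object
    Returns (new_box, difficult), or None when the annotation is dropped.
    """
    x_min, y_min, x_max, y_max = box
    l_s, w_s, l_end, w_end = tile
    # Assumed mapping: clip the box to the cut window, then shift to its origin.
    nx_min = max(x_min, l_s) - l_s
    ny_min = max(y_min, w_s) - w_s
    nx_max = min(x_max, l_end) - l_s
    ny_max = min(y_max, w_end) - w_s
    if nx_max <= nx_min or ny_max <= ny_min:
        return None  # the box does not overlap this cut image
    if n_points < 30:
        return None  # too little information: remove the annotation
    difficult = 0 if n_points >= 50 else 1
    return (nx_min, ny_min, nx_max, ny_max), difficult
```

An object that straddles the border of a cut image keeps only the part inside it, and its difficult flag depends on how many feature points remain.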
In this paper, we propose an image segmentation method based on feature points. This method has two advantages. (1) It retains the useful information of the original image. In a remote sensing image, the background accounts for most of the image, and the object information accounts for only a small portion. Using this method, we can remove as much background information as possible, keep the useful information, and reduce redundancy. (2) It improves the diversity and quantity of the dataset. The result of image segmentation using this method is shown in Fig. 2.

B. NETWORK INFRASTRUCTURES
The proposed DFSSD network can be viewed as an improved version of the SSD network. The benchmark SSD network uses VGG-16 as the basic feature extraction layer, replaces the fully connected layers FC6 and FC7 of the VGG-16 network structure with two convolution layers, removes the dropout layer and the classification layer of VGG-16, and adds four additional groups of convolution layers. Each group uses a 3 × 3 convolution kernel and a 1 × 1 convolution kernel to reduce the number of channels of the feature map. Feature maps of different levels are used to predict the border offsets of differently scaled objects as well as the class scores [38]. Finally, the detection results are obtained by applying non-maximum suppression (NMS) [39] to the output of the prediction layer. The feature pyramid network (FPN) exploits multi-scale feature maps: it detects small objects with low-level, high-resolution feature maps and large objects with high-level feature maps that have a large receptive field, so that objects of different scales can be detected.

C. DFSSD NETWORK STRUCTURE DESIGN
In this paper, we take into account that objects in remote sensing images are small. The data augmentation of the SSD network consists mainly of random clipping, which makes small objects that already carry little information randomly lose part of it and potentially reduces the ability of the trained model to detect small objects. To avoid this issue, we propose an image segmentation method based on feature point region proposals. This method not only compensates for the lack of sample richness, but also improves the detection accuracy.
When using the SSD network, we find that its prediction layer does not make full use of the local and global semantic features of the lower layers, which leads to poor detection of small objects. For remote sensing images, the conv4 layer in the SSD network has undergone three down-sampling operations, and the resolution of the resulting feature map is not enough to detect small objects. Therefore, we consider using the Conv3_3 layer feature map. If we convolve the Conv3_3 layer directly with a large convolution kernel or with several small convolution kernels, the semantic information of the feature map increases, but at the cost of a higher computational load during training. In order to reduce the computational complexity and speed up the training phase, we propose to apply dilated convolution to the Conv3_3 layer and to combine it with the FPN network structure. We call this method the dilated convolution and feature fusion single shot multibox detector (DFSSD); it enlarges the receptive field of the feature layer and increases the semantic information. The detailed network structure of the proposed DFSSD method is shown in Fig. 3.
The proposed DFSSD network model comprises an improved SSD layer, a horizontal base layer, an up-sampling layer, a fusion layer, and a prediction layer. The improved SSD layer is based on the original SSD model, and its detailed parameters follow prior work in the literature, including [13]. In this study, we apply two dilated convolutions with dilation rates of 12 and 18 to the features of the Conv3_3 layer, and then fuse the feature maps of different receptive fields obtained by these convolutions. At the same time, a 3 × 3 convolution kernel is used to eliminate the aliasing effect caused by the fusion of different feature maps. The horizontal base layer, the up-sampling layer, the fusion layer, and the prediction layer are improved according to the FPN network structure. Table 2 shows the size of the convolution kernels, the number of convolution kernels, the strides and the padding of the convolution layers, and the size of the convolved feature maps. The purpose of this layer is to reduce the number of channels and prepare them to be fused with the latter feature map. The fused feature map not only retains the high resolution of the lower-level feature map, but also carries better semantic information.
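To make the effect of the dilation rates 12 and 18 concrete, the short sketch below computes the span covered by a dilated kernel along one axis, k + (k - 1)(d - 1), and the padding that keeps the feature map size unchanged at stride 1. The 3 × 3 kernel size is an assumption for illustration, since the kernel size of the dilated convolutions is not stated above; the helper names are hypothetical.

```python
def effective_kernel(kernel, dilation):
    """Span covered by a dilated kernel along one axis: k + (k - 1)(d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def same_padding(kernel, dilation):
    """Padding that preserves the spatial size at stride 1 (odd kernel sizes)."""
    return (effective_kernel(kernel, dilation) - 1) // 2


for rate in (1, 12, 18):
    span = effective_kernel(3, rate)
    # The number of weights per channel pair stays 3 * 3 whatever the rate.
    print(f"rate={rate}: span={span}, padding={same_padding(3, rate)}, weights={3 * 3}")
```

Under this assumption, a rate of 12 covers a 25 × 25 window and a rate of 18 covers a 37 × 37 window with the same nine weights as an ordinary 3 × 3 kernel, which is why the receptive field grows without the cost of a larger kernel.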
The up-sampling layer enlarges the feature map to twice its original size. In the process of enlarging the feature map, many positions are left without pixel values; these vacancies are filled by bilinear interpolation. The number of channels of the feature map is 256. The sizes of the output feature maps in the up-sampling layer are 64 × 64, 32 × 32, 16 × 16, 8 × 8, and 4 × 4. The purpose of the up-sampling layer is to obtain a feature map of the size needed by the fusion layer. The prediction box scale of the DFSSD network prediction layer is calculated as shown in formula (2).
Table 4 shows the parameters of the prediction layer. The prediction layer is obtained by the normal convolution of the fusion layer and the dilated convolution of the features of the Conv3_3 layer. The purpose of using a convolution kernel of size 3 × 3 is to deblur the feature map of the fusion layer. In the feature map enlargement step, bilinear interpolation is used to fill in the vacancies, which may make the pixel values within a block similar. This effect may blur the contours around the target objects and make the objects appear fuzzy, which is why this operation is needed.
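For orientation, the sketch below reproduces the linear scale rule of the original SSD formulation [13], in which the scale grows evenly from s_min on the lowest prediction map to s_max on the highest. Whether DFSSD uses exactly this rule and these values is an assumption; s_min = 0.2, s_max = 0.9, and the use of six maps are the SSD defaults, not values taken from this paper, and the function names are hypothetical.

```python
import math


def ssd_scales(num_maps, s_min=0.2, s_max=0.9):
    """Scale of the default boxes on each prediction map, relative to image size."""
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def box_shape(scale, aspect_ratio):
    """Relative (width, height) of a default box with the given aspect ratio."""
    root = math.sqrt(aspect_ratio)
    return scale * root, scale / root


scales = ssd_scales(6)  # assumed number of prediction maps
shapes = [box_shape(s, 2.0) for s in scales]
```

Lower maps receive smaller scales, so the high-resolution maps are responsible for the small objects that dominate remote sensing images.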

D. LOSS FUNCTION
When training the detection network, it is necessary to save the true value box information for each vehicle position marked in the input image. For each candidate box, the offset of its center point from the center point of the truth box and the confidence of the object enclosed by the candidate box are calculated at the same time. In the training phase, all candidate boxes are first matched with the truth value boxes according to the Jaccard matching algorithm [40]. Candidate boxes whose matching coefficients with the truth value boxes are greater than 0.5 are regarded as matching boxes and marked as positive samples, denoted by c_1; the other candidate boxes, which do not satisfy the minimum matching rate, are considered negative samples, denoted by c_0.
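The matching rule can be illustrated with a small sketch in which the Jaccard coefficient is the intersection over union of two boxes. The function names are hypothetical, and the sketch covers only the threshold rule stated above; it omits any additional step that forces every truth box to receive at least one match.

```python
def jaccard(a, b):
    """Jaccard overlap of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_candidates(candidates, truths, threshold=0.5):
    """Return one (label, truth_index) pair per candidate box.

    label 1 marks a positive sample (c_1) and label 0 a negative sample (c_0);
    truth_index is None for negatives.
    """
    labels = []
    for cand in candidates:
        overlaps = [jaccard(cand, t) for t in truths]
        best = max(range(len(truths)), key=lambda j: overlaps[j]) if truths else None
        if best is not None and overlaps[best] > threshold:
            labels.append((1, best))
        else:
            labels.append((0, None))
    return labels
```

A candidate whose overlap is exactly 0.5 stays negative here, because the text requires the coefficient to be greater than 0.5.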
In the process of network training, the total loss function combines the classification loss and the location regression loss, where p represents the confidence of the category, d represents the candidate box, g represents the true value box, n represents the number of positive samples, L_cls represents the classification loss function, and L_loc represents the position regression loss function. The resulting total loss function for the image comparison is shown in Fig. 4. Note that the large fluctuations in the loss function occur only at the beginning of the training phase, for epochs below 20. This behavior is normal for deep learning methods and is consistently observed across all methods, including SSD, Faster RCNN, and DFSSD. Therefore, it is sufficient to train for more than 50 epochs to achieve stable results.
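One form of the total loss that is consistent with the symbols defined above and with the SSD objective [13] is given below. The weighting factor \alpha is an assumption borrowed from SSD, where it is set to 1; the exact expression used for DFSSD may differ.

```latex
L(p, d, g) = \frac{1}{n} \Bigl( L_{cls}(p) + \alpha \, L_{loc}(d, g) \Bigr)
```

Dividing by n, the number of positive samples, keeps the loss comparable between images that contain different numbers of matched boxes.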
The classification loss L_cls is based on the two-class softmax loss. When classifying, the confidence of belonging to the automobile category is denoted by p_1, and the confidence of belonging to the background category is denoted by p_0; ln(p) denotes the natural logarithm. The position regression loss function L_loc(d, g) is the smooth L1 loss of the matching between candidate box d and truth value box g [21]. It ensures that the gradient is not too large when the difference between the prediction box and the ground truth is large, and that the gradient is small enough when the difference is very small. Following the position regression algorithm in the benchmark SSD, we compute the regression offsets of the center-point coordinates and of the width and height between each matching candidate box and truth value box, where i indexes the i-th matched candidate box and j indexes the j-th true value box.
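The pieces of the loss can be sketched in a few lines. The smooth L1 definition and the center/size offset encoding below are the ones used in the SSD formulation [13] and are assumed to carry over to DFSSD; the function names are hypothetical, boxes are given as (cx, cy, w, h), and the code handles a single matched pair rather than a full batch.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere, so the gradient magnitude is at most 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def encode(truth, cand):
    """Regression targets of a truth box relative to a candidate box."""
    g_cx, g_cy, g_w, g_h = truth
    d_cx, d_cy, d_w, d_h = cand
    return ((g_cx - d_cx) / d_w, (g_cy - d_cy) / d_h,
            math.log(g_w / d_w), math.log(g_h / d_h))


def loc_loss(predicted_offsets, truth, cand):
    """Smooth L1 loss between predicted offsets and the encoded targets."""
    target = encode(truth, cand)
    return sum(smooth_l1(p - t) for p, t in zip(predicted_offsets, target))


def cls_loss(p1, is_positive):
    """Two-class softmax loss: -ln(p1) for positives, -ln(p0) with p0 = 1 - p1 otherwise."""
    return -math.log(p1 if is_positive else 1.0 - p1)
```

For a residual of 0.5 the smooth L1 value is 0.125, while for a residual of 2.0 it is 1.5 rather than the 2.0 a purely quadratic loss would give, which is the bounded-gradient behavior described above.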

IV. EXPERIMENTAL RESULTS AND ANALYSIS
A. INTRODUCTION OF EXPERIMENTAL ENVIRONMENT
In this study, the experimental environment is a CentOS 7 system. The processor is an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30 GHz × 12, the graphics card is an NVIDIA GeForce GTX 1080 Ti with 11 GB of video memory, the system memory is 32 GB, and the experimental framework is the PyTorch deep learning framework. The learning rate is 0.001 and the weight decay is 0.0005. We use mini-batch gradient descent with momentum to optimize the parameters, with a mini-batch size of 16. The number of epochs is 400, the number of steps in each epoch is 1000, and the momentum factor is 0.9. The loss function of the DFSSD is basically similar to that of the SSD: the location information of a category is obtained by the regression function, and the classification confidence is predicted by the softmax function.
In this work, we have built a remote sensing image dataset of cars. Following the format of the PASCAL VOC2007 and VOC2012 datasets, we use an unmanned aerial vehicle (UAV) remote sensing system to collect vehicle objects in different environments. The dataset contains a total of 2045 pictures, of which 1138, 500, and 407 images are used for the training, verification, and test phases, respectively. The actual size of the object in the remote sensing image is shown in Fig. 5.
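The optimizer settings listed in this subsection can be collected in one place, together with a plain-Python sketch of a single momentum update. The update rule is an assumption: it follows the common convention, also used by PyTorch's SGD optimizer, of adding the weight decay term to the gradient before the momentum step. The names HYPERPARAMS and sgd_momentum_step are hypothetical.

```python
HYPERPARAMS = {
    "learning_rate": 0.001,
    "weight_decay": 0.0005,
    "momentum": 0.9,
    "batch_size": 16,
    "epochs": 400,
    "steps_per_epoch": 1000,
}


def sgd_momentum_step(weights, grads, velocity, lr, momentum, weight_decay):
    """One update of mini-batch gradient descent with momentum and weight decay."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # L2 penalty folded into the gradient
        v = momentum * v + g       # running velocity
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


total_steps = HYPERPARAMS["epochs"] * HYPERPARAMS["steps_per_epoch"]
```

With these values a full run performs 400 × 1000 = 400000 parameter updates, each computed on a mini-batch of 16 images.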

B. EXPERIMENTAL RESULTS
In order to verify the efficacy of the proposed object detection method, we carry out different experiments. We compare the performance of the proposed DFSSD method with the benchmark SSD and Faster RCNN methods. To realize a fair comparison, we train and test the three networks using the same dataset. The achieved results are shown in Fig. 5, Fig. 6, and Table 5.
From the mAP comparison between the different models in Table 5 and Fig. 6, it can be seen that the mAP of the DFSSD model is higher than that of the other methods: it is 2% higher than that of the Faster RCNN method and about 7% higher than that of the baseline SSD method. The superior detection rate of the DFSSD method is mainly due to the use of the deconvolution step, which enriches the low-level features. At the same time, the fusion of feature maps from different levels effectively improves the detection accuracy. Fig. 7 illustrates the enhanced detection accuracy of the DFSSD model compared with the SSD and Faster RCNN methods. Since the SSD and Faster RCNN methods are primarily designed for object detection in natural scenes, they cannot accommodate the requirements of small-scale vehicle detection in remote sensing images. The DFSSD model addresses the requirements of remote sensing images by fusing features from different levels and enabling different candidate frame scales; it is therefore effective for vehicle detection in remote sensing images.

C. COMPARATIVE EXPERIMENT ON OTHER REMOTE SENSING DATASETS
In order to further verify the usability of the proposed method, we have carried out comparative experiments for the DFSSD and SSD methods using the NWPU VHR-10 dataset [41] and the RSOD dataset [42]. The two networks use the same hyperparameters for learning. The initial learning rate is set to 0.001 and the number of epochs is set to 400. The SSD method does not use our proposed data processing method, whereas the DFSSD method uses the proposed preprocessing method for training and testing. The experimental results are shown in Table 6.

D. SPEED TEST
In order to confirm the feasibility of using the DFSSD method in real-time applications, we carry out comparative experiments on different networks. The input size of the SSD network, which uses the VGG-16 base network, is 512, and the input size of the Faster RCNN network, which also uses the VGG-16 base network, is 1000 × 600. Because the scale of remote sensing images varies widely, we test images in different scale ranges, with resolutions of 0-1000, 1000-2000, 2000-3000, and 3000-4000. The test results are shown in Fig. 8.
As can be seen in Fig. 9, the single forward inference time of the DFSSD method is 59 ms, which is 11 ms longer than that of the SSD. The main reason for the slightly decreased speed is the more complex network structure used in the DFSSD method. The DFSSD method shows a big advantage over the Faster RCNN in terms of operation speed. Overall, we conclude that the DFSSD method does
It can be seen that the proposed DFSSD method obtains sufficient spatial structural information about small objects. The richer semantic information obtained by the DFSSD method enhances the feature representation of small objects. The proposed method provides more small samples for training, so the model can learn more local information and more information about small objects. The experimental results in different scenes confirm the effectiveness of the proposed method.
Compared with other methods, the proposed method is more accurate, especially in the detection of small objects. It is also very effective for objects in complex scenes and under occlusion. However, in some scenes the objects are very close to each other, so that many parts of the objects are connected together, and the proposed method cannot detect them correctly.

V. CONCLUSION
In this paper, we proposed a novel deep learning method, named DFSSD, for small-scale object detection with applications to aerial remote sensing images. The proposed method significantly improves upon an already successful method, SSD. The key idea is an enhanced image segmentation approach based on the extracted feature points of remote sensing images, so that the resulting image segments retain maximal information about small-scale objects after scaling. The proposed method replaces the random clipping step of the SSD network, and hence alleviates the adverse effects of random clipping on small objects in the training phase. For small object detection in remote sensing images, DFSSD uses two convolution kernels with different dilation rates to perform multi-scale fusion on the Conv3_3 layer, which expands the receptive field of the feature map. At the same time, based on the FPN network structure, a DFSSD network prediction layer is designed, and feature maps of different layers are integrated to capture multi-scale context information, which improves the ability of the network to detect small objects. In the prediction phase, overlapping object frames are removed by non-maximum suppression on the original image. While keeping the detection speed of the DFSSD network as close to real time as possible, the proposed method improves the object detection accuracy on remote sensing images by 4% compared with the benchmark SSD network.
ZHIWEI ZHANG received the B.S. degree in communication engineering from Xi'an Shiyou University, Xi'an, China, in 2017. He is currently pursuing the master's degree with the Xi'an University of Posts and Telecommunications. His current main research topic is the deep learning and computer vision. He has a strong interest in this direction.
ABOLFAZL RAZI received the B.Sc. degree in electrical engineering from Sharif University, the M.Sc. degree from Tehran Polytechnic, and the Ph.D. degree in electrical engineering from the University of Maine. He is currently an Assistant Professor of electrical engineering with the School of Informatics, Computing and Cyber Systems (SICCS) and the Director of Wireless Networking and Smart Health (WiNeSH) Research Laboratory, Northern Arizona University. Prior to joining NAU, he held postdoctoral position with the Electrical and Computer Engineering Department, Duke University, where he developed novel information-theoretic methods for dictionary learning, compressive sensing, and inverse problems. He also held a postdoctoral associate position with Case Western Reserve University, where he developed computational methods based on Bayesian inference for integrative analysis of cancer omics data. In addition to his academic service, he served about seven years in wireless industry holding several positions including the Project Manager of value added services, the Research and Development Researcher, the Network Optimization and Integration Engineer, and the Smart Card Design, and Test Engineer.