Multi-Scale Aerial Target Detection Based on Densely Connected Inception ResNet



I. INTRODUCTION
In recent years, the rapid development of unmanned aerial vehicles (UAVs) has attracted growing interest in a wide range of applications. UAV-based remote sensing has become more modular, miniaturized, and intelligent, and is now widely used in various fields. In the field of security in particular, the role of UAVs is increasingly important [1]. If targets are searched for by manual inspection of images, considerable false and missed detections can occur owing to subjective human factors. Therefore, high-precision and robust detection algorithms are urgently needed.
Detecting objects in aerial images is challenging for the following reasons: (1) The long distance between the target and the UAV inevitably leads to low image resolution. (2) Targets vary greatly in appearance, color, and orientation, which increases the inter-class similarity between the desired targets and the complex backgrounds. The resolution of images taken by different devices also varies.
These factors make it difficult to recognize targets in aerial images effectively. To address these problems, a large number of detection methods have been proposed.
The associate editor coordinating the review of this manuscript and approving it for publication was Ruqiang Yan.
The majority of existing object detection methods applied to aerial images are designed to distinguish objects from the background at a large scale, such as ships at sea [2] and planes at airports [3]. Traditional detection methods have several problems: handcrafted features are designed for specific targets only, so they generalize poorly and lack robustness. In addition, traditional algorithms based on region search are slow [3].
More recently, deep learning-based algorithms have dominated the accuracy benchmarks for various visual detection tasks. Existing deep learning-based target detection models can be divided into two categories: those based on region proposals [4]-[9] and those based on regression [10]-[13]. Region proposal-based networks first generate candidate regions and then classify them, while regression-based networks use a single convolutional network to predict bounding boxes and class probabilities simultaneously from an input image.
To improve the detection results for aerial objects, a lightweight aerial target detection network is proposed, with fewer model parameters, a faster detection speed, and higher detection accuracy. The main contributions are as follows:
(1) The RI-Dense model is introduced to replace the original feature extraction network. This model integrates the ideas of InceptionNet and DenseNet, connecting every feature layer in series along the channel dimension to ensure information transmission. The proposed structure alleviates the vanishing gradient problem, retains more features, and reduces the number of network parameters.
(2) A multi-scale feature fusion structure RI-Deconv is proposed, motivated by the idea of ResNet. RI-Deconv uses the deconvolution operation to perform multilayer feature fusion and constructs advanced feature maps with a high resolution and semantic information.
(3) A target stitching method is designed to combine the cropped targets. Large images need to be cropped before being sent to the network. Consequently, some objects are inevitably split into different parts. The proposed method stitches these divided parts together and reduces the miss rate.
(4) The proposed model is evaluated on the NWPU VHR-10 [14] and DOTA [15] datasets to test its performance. The experiment shows that the proposed network is effective in improving the detection accuracy.
The rest of this paper is organized as follows: Section II introduces related work. Section III explains the details of the proposed method. Section IV describes the materials and analyzes the experimental results. Section V concludes the paper.

II. RELATED WORK
In recent years, many target detection methods based on deep learning have been developed, immensely promoting the advancement of target detection. These methods are briefly reviewed below.
Over the past decade, considerable effort has been devoted to the problem of small object detection in aerial videos [16]-[19]. One widely applied strategy is to enlarge images to different scales directly, which yields more detailed information about small targets. For example, Chen et al. [20] presented an approach in which the input is magnified to enhance the resolution of small objects. Building on this principle, Cao et al. [21] fused feature modules with additional contextual information to deliver better detection performance. By generating multiple feature maps with different resolutions, such methods can naturally handle objects of various sizes, including small ones. Other approaches are based on deep neural networks in which multi-scale feature layers represent the characteristics of each small target [22], [23].
YOLO (You Only Look Once) [10] is the first one-stage detector in object detection and a milestone in the history of one-stage detection models. YOLOV3 [13], an enhanced version of YOLO, achieves outstanding detection accuracy. SSD (Single Shot MultiBox Detector) [12] is a one-stage detector proposed by W. Liu et al. Its main contribution is the introduction of multi-reference and multi-resolution detection techniques, which significantly improve the detection accuracy of one-stage detectors. Based on SSD and similar to FPN, DSSD [24] employs top-down pyramid CNN layers to improve accuracy, but at the cost of computational efficiency. FSSD [25] inserts a fusion module at the bottom of the feature pyramid to enhance the accuracy of SSD. While remaining fast, FSSD achieves marginal accuracy improvements over SSD. S3FD [26] is a highly accurate real-time face detector based on the anchor model initially used for object detection. To overcome the limitations on small objects, S3FD introduces a scale compensation anchor matching strategy to improve the recall rate and a max-out background label to reduce false positive detections.
Building on these works, researchers have put forward detection algorithms aimed at smaller targets. Cheng et al. [14] train an RICNN model by optimizing a new objective function that imposes a regularization constraint. This explicitly enforces the feature representations of the training samples before and after rotation to be mapped close to each other, hence achieving rotation invariance. CISPNet [27] applies a context information scene perception (CISP) module to obtain contextual information for targets of different scales and uses k-means clustering to set the aspect ratios and sizes of the default boxes. Cheng et al. [28] propose a method to learn a rotation-invariant and Fisher discriminative CNN (RIFD-CNN) model by introducing and learning a rotation-invariant layer and a Fisher discriminative layer, respectively, on the basis of existing high-capacity CNN architectures. REMSNet [29] combines a dense connectivity pattern and parallel multi-kernel convolution to build a lightweight model with varied receptive field sizes. In addition, it uses a parallel multi-kernel deconvolution module and a spatial path to further aggregate information at different scales. WFCNN [30] is a weight feature value convolutional neural network consisting of one encoder and one classifier. The encoder uses a linear fusion method to hierarchically fuse semantic features. RADC-Net [31] is a residual attention based densely connected convolutional neural network, with a novel residual attention block designed to highlight local semantics relevant to aerial scenes. Zhou et al. [32] suggest an effective framework for weakly supervised target detection in remote sensing images (RSIs) based on transferred deep features and negative bootstrapping. Li et al. [33] put forward a new FPN with multiangle anchors. A double-channel feature fusion network is also proposed to learn local and contextual properties along two independent pathways.
Zhang et al. [34], after analyzing the bottlenecks and development directions of deep learning in remote sensing target detection, provide guidance for research in this field. FaceBoxes [35] is a lightweight CNN for face detection whose network structure consists of Rapidly Digested Convolutional Layers (RDCL) and Multiple Scale Convolutional Layers (MSCL). The MSCL, which aims to enrich the receptive fields and discretize anchors over different layers, is capable of handling faces of various scales. SVDNet [36] is designed based on a singular value decomposition algorithm, achieving high detection robustness and desirable time performance. Diao et al. [37] combine the strengths of unsupervised feature learning with deep belief networks (DBNs) and visual saliency, which avoids an exhaustive search across the image and generates a small number of bounding boxes to locate objects quickly and precisely.

III. METHODS
SSD uses several feature layers to make predictions, which effectively improves the target detection accuracy. However, the feature layers in SSD detect targets independently, which degrades the detection of small targets. A new one-stage detection model that inherits the idea of SSD is therefore designed. The details are introduced in this section.

A. THE RIDNet
The proposed network is illustrated in FIGURE 2. RIDNet consists of two parts: a feature extractor and an object detector. The RI-Dense structure, which serves as the feature extractor, consists of several RI-Dense modules, in which the input of each layer is the output of all previous modules. The dense connections between feature layers help the network learn intra-class semantic features thoroughly. Therefore, the detection speed of the network is accelerated. Unlike SSD, the feature pyramid is built from the fused outputs of the RI-Deconv modules rather than from plain convolution layers. After each up-sampling step of the RI-Deconv structure, the interference from feature layers carrying less information is reduced and the feature recovery accuracy of RIDNet is enhanced, which improves the expressive power of the model.

B. FEATURE EXTRACTION NETWORK
The feature extractor consists of three modules: a root module for preliminary feature extraction and feature size reduction, an RI-Dense module for multi-feature layer connection, and a bottleneck layer for dimension reduction. TABLE 1 describes the structure of the feature extraction network.

1) ROOT MODEL
In general, because the input image is large, adjacent pixels carry similar information, so the image contains considerable redundant information. Dilated convolution [38] inserts holes, which do not participate in the calculation, into a standard convolution kernel. In this way, the receptive field becomes larger than that of the standard kernel, and the redundant information is effectively reduced. The root module contains three dilated convolution layers to eliminate redundant information and parameters. The details are shown in TABLE 1.
After the three dilated convolution layers, the receptive field is 13 × 13, whereas three standard convolution layers give 7 × 7. Compared with the standard convolution, the receptive-field area of the dilated convolution increases by about 2.4 times, since (13² − 7²)/7² ≈ 2.4. By replacing the pooling operation with dilated convolution, the receptive field is increased without sacrificing the size of the feature map, and the redundant information in the image is filtered out at the same time.
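The 13 × 13 and 7 × 7 figures can be checked with a short calculation. The sketch below assumes three stride-1 layers with 3 × 3 kernels and a dilation rate of 2; these settings are not stated in the text (TABLE 1 holds the actual configuration) and were chosen only because they reproduce the reported receptive fields. The function name is hypothetical.

```python
def receptive_field(num_layers, kernel_size=3, dilation=1):
    """Receptive-field side length of a stack of identical stride-1 convolutions."""
    rf = 1
    for _ in range(num_layers):
        # Each stride-1 layer widens the field by dilation * (kernel_size - 1).
        rf += dilation * (kernel_size - 1)
    return rf


# Assumed settings: 3 layers, 3x3 kernels, dilation 2 (dilated) vs. 1 (standard).
standard = receptive_field(3, kernel_size=3, dilation=1)
dilated = receptive_field(3, kernel_size=3, dilation=2)

# Relative growth of the receptive-field area.
area_increase = (dilated ** 2 - standard ** 2) / standard ** 2
print(standard, dilated, round(area_increase, 2))
```

Under these assumptions the side length grows from 7 to 13, and the area grows by a factor of 120/49 relative to the standard stack, which matches the "about 2.4 times" stated above.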

2) RI-DENSE MODEL
This is a densely connected residual framework absorbing the idea of Inception ResNet to improve DenseNet. The framework is shown in FIGURE 3.
RI-Dense has three branches, all of which contain a 1 × 1 convolution kernel to change dimensions. Depending on the shape of the kernels, different branches extract features at different scales. 3 × 3 convolution kernels are used to extract the details of small targets. Furthermore, two stacked 3 × 3 convolution kernels are used in place of a 5 × 5 convolution kernel to handle large targets. To further reduce the model parameters, the 3 × 3 convolution kernel in each branch is factorized into a 1 × 3 and a 3 × 1 kernel. The dense connections between the RI-Dense modules ensure information transmission. Such an implementation benefits network training without consuming much computational resources.
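The parameter savings from these two substitutions can be illustrated with a weight count. This is a minimal sketch: the channel width `C = 64` is a hypothetical value, input and output channels are assumed equal, and bias terms are ignored.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one convolution layer (bias terms ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels


C = 64  # hypothetical channel width, not taken from the paper

# Two stacked 3x3 convolutions in place of one 5x5 convolution.
full_5x5 = conv_weights(5, 5, C, C)
two_3x3 = 2 * conv_weights(3, 3, C, C)

# A 1x3 followed by a 3x1 convolution in place of one 3x3 convolution.
full_3x3 = conv_weights(3, 3, C, C)
factorized_3x3 = conv_weights(1, 3, C, C) + conv_weights(3, 1, C, C)

print(two_3x3 / full_5x5)         # fraction of the 5x5 weights that remain
print(factorized_3x3 / full_3x3)  # fraction of the 3x3 weights that remain
```

With equal channel counts, the stacked 3 × 3 pair keeps 18/25 of the 5 × 5 weights, and the 1 × 3 plus 3 × 1 pair keeps 2/3 of the 3 × 3 weights, independent of the chosen width.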

3) BOTTLENECK LAYER
The bottleneck layer controls the dimension and scale of the RI-Dense module's output. Such a layer encourages the network to compress feature representations to best fit the available space, in order to obtain the best loss during training. FIGURE 4 shows the structure of the bottleneck layer. It consists of a 1 × 1 convolution and a 2 × 2 average pooling. These are added to reduce the number of feature-map channels in the network, which otherwise tends to increase with each layer. The dimension reduction is achieved by using 1 × 1 kernels that have fewer output channels than input channels. The 1 × 1 convolution layer compresses the channel dimension, and the spatial size of the feature map is then reduced by the 2 × 2 average pooling layer. Bottleneck layers thus reduce the number of parameters in the network while allowing it to go deep and represent many feature maps.
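The effect of the bottleneck layer on tensor shape can be sketched as follows. The compression factor of 0.5 and the pooling stride of 2 are assumptions made for illustration (they follow common DenseNet practice and are not stated in the text), and the function name is hypothetical.

```python
def bottleneck_output_shape(channels, height, width, compression=0.5):
    """Shape after a 1x1 convolution (channel compression) and 2x2 average pooling."""
    # 1x1 convolution with fewer output channels than input channels.
    out_channels = int(channels * compression)
    # 2x2 average pooling with stride 2 (assumed) halves each spatial side.
    return out_channels, height // 2, width // 2


shape = bottleneck_output_shape(256, 64, 64)
print(shape)
```

Under these assumptions, a 256-channel 64 × 64 input leaves the bottleneck as a 128-channel 32 × 32 feature map, so the activation volume shrinks by a factor of eight.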

C. RI-DECONV MODEL
One of the main contributions of SSD is that multiple feature layers are used for prediction. Nevertheless, this advantage is limited because each feature map makes its prediction separately. Therefore, fusing multiple features can effectively improve the detection performance of the network.
A feature fusion module named RI-Deconv is designed and used to construct the feature pyramid of the proposed model. The RI-Deconv feature pyramid contains five fusion feature layers, with scales of 2 × 2, 4 × 4, 8 × 8, 16 × 16, and 32 × 32, respectively. The structure is shown in FIGURE 5. In this phase, the feature map is restored step by step to the original image size by up-sampling. In order to acquire deep-level semantic information and shallow-level position information simultaneously, RI-Deconv directly connects the feature map of the corresponding size from the convolution stage to the deconvolution output during up-sampling.
As shown in FIGURE 5, the RI-Deconv module takes two inputs with different scales. The side length of the large feature map is twice that of the small feature map, and both have the same number of channels. The large feature map has two branches: a 1 × 1 shortcut and a 3 × 3 convolution. These two paths compose a residual structure to address the vanishing gradient problem. The 3 × 3 convolution kernels provide a larger receptive field and increase the feature extraction ability of the network. The smaller feature map is enlarged to four times its area (twice its side length) by deconvolution. The expanded feature map is then fused with the large feature map.
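The doubling of the side length at each up-sampling step follows from the standard output-size formula of a transposed convolution. The sketch below assumes a 2 × 2 kernel with stride 2 and no padding; the text does not give the deconvolution hyperparameters, so these are illustrative choices that happen to reproduce the five pyramid scales.

```python
def deconv_output_size(size, kernel_size=2, stride=2, padding=0, output_padding=0):
    """Output side length of a transposed convolution with dilation 1."""
    return (size - 1) * stride - 2 * padding + kernel_size + output_padding


# Start from the smallest pyramid level and up-sample until 32 x 32 is reached.
scales = [2]
while scales[-1] < 32:
    scales.append(deconv_output_size(scales[-1]))

print(scales)
```

Each step doubles the side length, so the area of the small feature map grows by a factor of four before it is fused with the larger map.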

D. TARGET STITCHING
Because the images in the dataset are large, a series of 1024 × 1024 patches are cropped from the original images with a stride of 512. RIDNet takes the cropped patches as its input. Since the large images are cropped into several patches, objects are inevitably divided into several parts. A target stitching method is designed to prevent targets from being missed or detected repeatedly. Detection is performed on each fused feature layer, and the coordinates of the results are then mapped back to the original image. Whether to fuse the prediction boxes is determined by their relative position.
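The cropping step and the mapping of detections back to the original image can be sketched as below. The patch size and stride come from the text; the border handling (adding a final patch flush with the image edge) and all function names are assumptions made for illustration.

```python
def patch_origins(length, patch=1024, stride=512):
    """Top-left offsets of the patches along one image axis."""
    if length <= patch:
        return [0]  # the image is no larger than one patch along this axis
    origins = list(range(0, length - patch + 1, stride))
    # Assumption: add a last patch flush with the border so no pixels are skipped.
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins


def crop_boxes(width, height, patch=1024, stride=512):
    """All patch rectangles as (x1, y1, x2, y2) in original-image coordinates."""
    return [
        (x, y, min(x + patch, width), min(y + patch, height))
        for y in patch_origins(height, patch, stride)
        for x in patch_origins(width, patch, stride)
    ]


def to_original(box, patch_box):
    """Shift a detection box from patch coordinates to original-image coordinates."""
    ox, oy = patch_box[0], patch_box[1]
    x1, y1, x2, y2 = box
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


boxes = crop_boxes(4000, 4000)
print(len(boxes), boxes[0], boxes[-1])
```

Because the stride is half the patch size, neighbouring patches overlap by 512 pixels, which is why the same target can appear, whole or in part, in more than one patch and has to be stitched afterwards.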
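Every pixel of a feature map in SSD generates anchor boxes. Following the linear scale rule of SSD [12], the side length of the default anchors is:
$$S_k = S_{\min} + \frac{S_{\max} - S_{\min}}{m - 1}(k - 1),$$
where $k \in [1, m]$ and $m$ is the number of layers, i.e., $m = 5$. $S_{\min} = 0.2$, $S_{\max} = 0.3$, and in layer $k$, $\text{min\_size} = S_k$ and $\text{max\_size} = S_{k+1}$.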
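The anchor side lengths per layer can be tabulated with a few lines of code. This sketch assumes the standard linear scale rule of SSD, in which the scale grows in equal steps from S_min at the first layer to S_max at the last; the function name is hypothetical, and the values of S_min, S_max, and m are those given above.

```python
def anchor_scales(s_min=0.2, s_max=0.3, m=5):
    """Return [S_1, ..., S_{m+1}] under the assumed linear SSD scale rule."""
    step = (s_max - s_min) / (m - 1)
    # One extra entry so that max_size = S_{k+1} is defined for the last layer.
    return [s_min + step * (k - 1) for k in range(1, m + 2)]


scales = anchor_scales()
for k in range(1, 6):
    min_size, max_size = scales[k - 1], scales[k]
    print(k, round(min_size, 3), round(max_size, 3))
```

With these values the scale increases by 0.025 per layer, so the anchors of all five layers stay within a narrow range of relative sizes.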
In order to enhance the model's identification ability, a series of aspect ratios are set for the anchor boxes.
$R_n$ denotes the aspect ratio of the anchors, with $R_n \in \{1, 2, 1/2, 3, 1/3\}$. An additional side length is added when $R_n = 1$. The center of each default anchor is
$$\left(\frac{a + 0.5}{|f_k|}, \frac{b + 0.5}{|f_k|}\right),$$
where $|f_k|$ is the size of the $k$-th feature map and $a, b \in \{0, 1, 2, \cdots, |f_k| - 1\}$. The anchor coordinates on the feature map are then mapped to coordinates in the original image, where $(c_x, c_y)$ denotes the center of the anchor on the feature map.
An indicator $I_{iou}$ is introduced to determine whether to fuse adjacent prediction boxes. When $I_{iou}$ is higher than the threshold, the two prediction boxes are fused. The threshold set in this paper is 0.4, and the class of the fused prediction box, denoted $Class_{fusion}$, is
$$Class_{fusion} = \begin{cases} Class_A, & Score_a > Score_b \\ Class_B, & Score_a < Score_b \end{cases}$$
FIGURE 6 illustrates the definitions of $I_{overlap}$ and $I_{sum}$.
Firstly, the image patches are stitched in the horizontal direction. After this operation, the patches become transverse strips, which are then merged vertically. The divided targets are stitched together through this process.
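The fusion decision for two prediction boxes from neighbouring patches can be sketched as follows. Several points here are assumptions rather than the paper's definitions: the overlap indicator is taken to be the usual intersection-over-union ratio, the fused box is taken to be the smallest box enclosing both inputs, and ties in score fall to the second box because the text only defines the strict cases. All names are hypothetical.

```python
def overlap_ratio(box_a, box_b):
    """Intersection area over union area of two (x1, y1, x2, y2) boxes (assumed indicator)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    overlap = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    total = area_a + area_b - overlap
    return overlap / total if total > 0 else 0.0


def fuse(det_a, det_b, threshold=0.4):
    """Fuse two detections (box, class_name, score); return None if they stay separate."""
    box_a, class_a, score_a = det_a
    box_b, class_b, score_b = det_b
    if overlap_ratio(box_a, box_b) <= threshold:
        return None
    # Assumption: the fused box is the smallest box enclosing both inputs.
    box = (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
           max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))
    # The class follows the higher score; a tie falls to det_b (not defined in the text).
    fused_class = class_a if score_a > score_b else class_b
    return box, fused_class, max(score_a, score_b)


left = ((0, 0, 100, 100), "ship", 0.9)
right = ((20, 0, 120, 100), "harbor", 0.6)
far = ((200, 200, 300, 300), "ship", 0.8)
print(fuse(left, right))
print(fuse(left, far))
```

All boxes are expected in original-image coordinates, so the function would be applied after the per-patch detections have been mapped back, first along each row of patches and then between the resulting strips.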

IV. EXPERIMENTS AND ANALYSIS

A. MATERIALS
Experiments are implemented in Python with the PyTorch 1.0 framework on a 64-bit computer running Ubuntu 18.04, with an Intel i9-7900X CPU @ 3.3 GHz and an NVIDIA Titan X (12 GB) GPU with CUDA 9.2 and cuDNN 7.5. The maximum number of training iterations is 120k. All parameters are randomly initialized with the Xavier method. The model is optimized using SGD with a momentum of 0.9 and a weight decay of 0.0001. The initial learning rate is set to 0.001 and is decayed by cosine annealing at each batch. The batch size is set to 16.
The experiments are carried out on two public datasets: NWPU VHR-10 and DOTA. NWPU VHR-10 contains 800 aerial images covering 10 kinds of targets, of which 650 images contain targets and 150 are background images. The samples of each category are shown in FIGURE 7. In the NWPU VHR-10 dataset, large-scale targets account for more than 15% of the image area, and small-scale targets account for less than 5%. Because the target scales vary over an extensive range, the dataset can test the network's ability to detect multi-scale targets.
The images in the DOTA dataset come from different platforms, with sizes ranging from 600 × 600 to 4000 × 4000. The DOTA dataset covers 15 categories and contains more than 180,000 instances, far more than the NWPU VHR-10 dataset. DOTA includes targets of different scales with a high spatial resolution, which can better test the generalization performance and robustness of the model. FIGURE 9 shows the data samples.
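The per-batch cosine annealing of the learning rate described in the training setup can be written as a small function. This is a sketch under assumptions: a single cosine cycle over all 120k iterations, decay towards a minimum learning rate of zero, and no warm-up or restarts, none of which is specified in the text.

```python
import math


def cosine_lr(step, total_steps=120_000, base_lr=0.001, min_lr=0.0):
    """Learning rate at a given iteration under single-cycle cosine annealing."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 30_000, 60_000, 90_000, 120_000):
    print(step, cosine_lr(step))
```

The rate starts at the initial value of 0.001, passes through half of it at the midpoint of training, and approaches the assumed minimum at the final iteration.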

B. COMPARISON WITH OTHER METHODS
The proposed method is compared with several popular models to verify the effectiveness of RIDNet, including FRCN [7], YOLOV3 [13], SSD [12], FSSD [25], S3FD [26], and RFCN [8]. mAP (mean Average Precision) is used as the performance measure. The results on NWPU VHR-10 are shown in TABLE 3 and FIGURE 10. As displayed in TABLE 3, our method achieves satisfactory mAP values on the NWPU VHR-10 dataset. FIGURE 10 shows how the detection accuracy of the different methods varies with the number of iterations. The mAP is recorded every 5000 iterations. FIGURE 12 displays the detection results of the various methods.
The results of the above-mentioned methods on the DOTA dataset are shown in TABLE 4 and FIGURE 11. The mAP of RIDNet is slightly lower than that of YOLOV3 on NWPU VHR-10 but higher than that of YOLOV3 on DOTA. The reason is that small objects account for a large proportion of DOTA, which increases the difficulty of detection [39]. FIGURE 13 displays the detection results of RIDNet on DOTA. Furthermore, the detection time of RIDNet is 47.4 ms. The proposed method thus keeps the computational cost acceptable while achieving better detection accuracy with a small model size.

C. ABLATION EXPERIMENT
Ablation experiments are conducted to verify the effectiveness of each technique proposed in this paper. The results are shown in TABLE 5 and TABLE 6.
The results show that the fusion of different feature layers has a corresponding effect on the detection ability. The detection result is best when all five feature layers participate in the fusion. However, it is worth noting that when the other four feature maps are fused without the 2 × 2 feature map, the mAP is only slightly lower than with all five. This is because, after multiple convolutions and down-samplings, almost no features are left on the 2 × 2 feature map. Thus, the 2 × 2 feature map contributes little to the detection accuracy.
A further ablation experiment is conducted on NWPU VHR-10 to test the effectiveness of the proposed models. The results are shown in TABLE 6. When DenseNet is used as the feature extraction network, RI-Deconv achieves the highest detection accuracy. When the RI-Dense structure is used as the feature extraction network, the detection accuracy is further improved. The RI-Deconv structure maintains the highest detection accuracy in each experiment, which shows that the proposed models are effective and efficient for detection.

V. CONCLUSION
In this paper, the RI-Dense model and the RI-Deconv model are proposed for the detection of small and multi-scale targets. Based on these models, a lightweight detection network, RIDNet, is designed, absorbing the ideas of deconvolution, DenseNet, InceptionNet, and ResNet. The RI-Dense model improves the efficiency of feature extraction, addresses the vanishing gradient problem, and achieves high detection accuracy with fewer parameters. The RI-Deconv module adds semantic information from deep layers and detailed information from shallow layers to the fusion layers, which improves the performance of multi-scale detection. It fully utilizes the information extracted from multiple feature layers to improve detection accuracy. With a dense residual structure, the network is able to deal with objects of different sizes and improves the detection accuracy for small and weak objects. The network handles large images by cropping the original input images, and the target stitching method stitches divided targets back together.
Experiments on two public datasets show that the proposed algorithm, RIDNet, performs better in detection than other popular detection algorithms. Moreover, RIDNet is lightweight enough to be deployed on UAVs. Ablation experiments also show that the proposed models effectively improve detection accuracy.
MIAOHUI ZHANG received the B.S. degree in control theory and control engineering from Northeastern University, in 2002, the master's degree from the Graduate University of the Chinese Academy of Sciences, and the Ph.D. degree from the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, China, in 2013. His current research interests include pedestrian detection and re-identification, abnormal behavior analysis, and video content understanding.
KANGNING PANG received the B.S. degree in automation from the Henan University of Science and Technology, China, in 2017. He is currently pursuing the M.S. degree with the School of Computer and Information Engineering, Henan University, China. His current interests include pattern recognition and computer vision.
CHENGCHENG GAO is currently pursuing the M.S. degree with the School of Computer and Information Engineering, Henan University. His current interests include pattern recognition and computer vision.
MING XIN received the B.S. degree in information management and information system from Southwest University, in 2002, and the M.S. degree in applied mathematics from Henan University, in 2008. She is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Beihang University, China. She joined the School of Computer and Information Engineering, Henan University, in 2002, where she has been an Associate Professor, since 2013. Her current research interests include moving object detection and tracking and object recognition.