Vehicle Detection Method for Remote Sensing Images Based on Feature Anti-Interference and Adaptive Residual Attention

Vehicle detection in remote sensing images is of great significance to urban traffic intelligence. Although existing vehicle detection methods for remote sensing images, such as the fully convolutional regression network, the spatial density building net, and the pretraining and random-initialized fusion network, have made much progress on network structure optimization, they remain weak in feature anti-interference and contextual information utilization, and they neglect the loss of feature information during down-sampling. In this article, we propose FICLAR-Net, a remote sensing image object detection algorithm based on feature anti-interference and adaptive residual attention. First, a feature interference module is constructed: fed with shallow features and random noise, it generates interference in the detection process, so that adversarial training improves the detector's robustness against disturbance. Second, a novel adaptive residual attention module is introduced into the network to adaptively extract contextual features and enhance weak features. Finally, a cross level fusion module is designed to strengthen the collaboration between multiscale feature layers and reduce the loss of small-target feature information. The effectiveness of the method is verified by comparing it with other mainstream methods on the UCAS-AOD, CARPK, and OVDS datasets.


I. INTRODUCTION
In recent years, vehicle detection has been attracting more and more attention for its application potential in traffic intelligence and military defense [1], [2], [3], [4], [5]. In this field, deep learning has become the research hotspot and the mainstream. Most researchers focus on optimizing features and network structures, including feature richness enhancement, spatial context information extraction, and cross-layer fusion. However, due to complex backgrounds and small targets, existing methods still fall short of practical needs, and more efficient feature extraction and structural optimization methods are expected.

The authors are with the Department of Electrical Engineering, Guangxi University, Nanning 530004, China (e-mail: 2012301053@st.gxu.edu.cn; 20140043@gxu.edu.cn).
The code is freely available at: https://github.com/hel2020/FICLAR-Net. Digital Object Identifier 10.1109/JSTARS.2022.3206036

Up to the present, vehicle detection algorithms based on deep learning can be roughly divided into two categories: anchor-based methods and shape-regression methods. In the first category, Ren et al. [6] proposed faster-RCNN based on fast-RCNN [7], which used region proposal networks to generate anchor boxes and adjust their coordinates, replacing the rule-based anchor generation of the earlier R-CNN [7], [8] algorithms. Nan and Li [9] proposed an oriented vehicle detection framework for aerial images based on an improved faster-RCNN, which used oversampling and stitching data augmentation to reduce the negative effect of category imbalance. Liu et al. [10] proposed the one-stage single shot multibox detector (SSD). Tang et al. [11] proposed oriented_SSD to generate arbitrarily oriented detection results, using default boxes with various scales at each feature map location to produce detection bounding boxes. Anchor-based methods cannot realize pixel-wise detection and are thus unable to distinguish densely aligned objects under their nonmaximum suppression algorithms.
In the second category, shape-regression methods detect objects pixel-wisely, without any nonmaximum suppression, and are suitable for densely aligned and small object detection. Duan et al. [12] proposed the coarse-grained density map network, where a coarse-grained density map is predicted by an estimation network and cluster regions are generated based on density maps; the network achieved accuracy improvements. Li et al. [13] proposed a density-map-guided object detection network, which generates a density map and learns scale information based on density intensities to form cropping regions, improving detection accuracy in high-resolution aerial images. Tayara et al. [14] proposed the fully convolutional regression network (FCRN), which maps detected vehicles in aerial images onto spatial density elliptic kernels and significantly improved detection accuracy. Chen et al. [15] presented a two-stage spatial density building net (SDBN) to obtain accurate vehicle geometric parameters based on the spatial density map. Liu et al. [16] proposed a pretraining and random-initialized fusion network (PRFN), which fuses a random-initialized model and a pretrained model to improve the richness and robustness of the model's features. In addition, Makantasis et al. [17], [18] proposed Rank-R FNN, a tensor-based nonlinear classifier that fully exploits the structural information along every data dimension; the model has fewer trainable parameters and achieves high classification accuracy on small datasets. Protopapadakis et al. [19] proposed a stack-autoencoder-driven, semisupervised deep neural network (DNN): less than 0.08% of the data were labeled to train the deep model, and semisupervised techniques were used to estimate soft labels for the large amount of unlabeled data, significantly reducing the time spent on manual annotation.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
However, the existing shape-regression methods share the following common shortcomings.
1) Lack of anti-interference ability, vulnerable to the complex background in the detection process.
2) Detection in remote sensing images relies heavily on local context information, but existing vehicle detection methods pay little attention to extracting spatial context information.
3) The loss of small-target feature information caused by the down-sampling process is not fully compensated.

In order to improve the anti-interference ability of the network and reduce the influence of complex backgrounds on the detection process, Ding et al. [20] proposed an anti-interference road detection network, which generates pseudofeatures with a generator and superposes them onto the multilayer features of the encoder. To enhance spatial context feature extraction, Woo et al. [21] proposed a lightweight general module that sequentially infers attention maps along the channel and spatial dimensions. Wang et al. [22] proposed the nonlocal module, which obtains global context information through a self-attention mechanism that captures long-distance feature dependencies.
Motivated by the abovementioned research works, in this article, a vehicle detection network based on feature anti-interference and adaptive residual attention (FICLAR-Net) is proposed. Instead of simply stacking the pseudofeatures onto the encoder, the feature interference module (FIM) takes shallow features and random noise as the input of the generator, and the resulting interference features are connected to the trunk network through dedicated modules: we design four connection modules (CMs) and add interference at four different locations. In addition, we design a new adaptive residual attention module (ARAM), which enhances weak features through deformable convolution residual blocks and obtains context information through a pixel attention mechanism. Finally, we use atrous convolution to construct the cross level fusion module (CLFM) and fuse multilayer feature information to increase feature richness. The contributions of this article are summarized as follows.
1) A FIM is constructed, which is fed with shallow features and random noise and generates interference in the detection process, so that adversarial training improves the detector's robustness against disturbance.
2) A novel ARAM is introduced into the network to adaptively extract contextual features and enhance weak features.
3) A CLFM is designed to reduce the loss of small-target feature information through the collaboration between multiscale feature layers.

II. RELATED WORK

A. Generative Adversarial Network
Goodfellow et al. [23] proposed the generative adversarial network based on game theory, which trains a generator and a discriminator adversarially. The model has attracted wide attention since it was put forward, and many improvements have been proposed [24], [25], [26], [27]. The training objective is

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]    (1)

where the input of the generator G is a random variable z drawn from the hidden space p_z, and the input of the discriminator D is either a real sample x or a generated sample x′ = G(z). The discriminator is trained to distinguish real samples from generated samples.
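As a quick sanity check on formula (1), the value function can be estimated by Monte Carlo with stand-in networks. The sketch below uses a hypothetical constant discriminator and a toy generator (not the paper's models) to show that an undecided discriminator D ≡ 0.5 yields the equilibrium value 2·log(0.5):

```python
import numpy as np

# Toy evaluation of the GAN value function (1):
#   V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
# G and D here are illustrative stand-ins, not trained networks.

def gan_value(D, G, real, z):
    """Monte-Carlo estimate of the minimax objective V(D, G)."""
    real_term = np.mean(np.log(D(real)))         # discriminator on real samples
    fake_term = np.mean(np.log(1.0 - D(G(z))))   # discriminator on generated samples
    return real_term + fake_term

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 2))    # samples x ~ p_data
z = rng.normal(size=(1000, 2))       # latent codes z ~ p_z

G = lambda z: 0.5 * z                       # a trivial "generator"
D = lambda x: np.full(len(x), 0.5)          # an undecided discriminator

# At D = 0.5 everywhere the objective equals 2*log(0.5).
print(round(gan_value(D, G, real, z), 3))   # -1.386
```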

B. Deformable Convolution
Unlike traditional convolution, deformable convolution [28] can adaptively learn two-dimensional (2-D) position offsets to obtain a kernel with adaptive shape, adapting to the geometric deformation of target size and scale in complex scenes. Furthermore, DCNv2 [29] introduced a modulation mechanism into the standard deformable module to enhance its ability to manipulate the spatial support region. The modulated deformable convolution is expressed as

y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k    (2)

where K is the number of sampling positions, Δm_k is the learnable modulation scalar, and w_k, p_k, and Δp_k represent the weight, the default offset, and the learnable offset of the k-th sampling position, respectively.
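To make the sampling formula concrete, the following NumPy sketch evaluates a single output position of a 3×3 modulated deformable kernel with bilinear interpolation. The feature map, weights, offsets, and modulation scalars are all illustrative values, not learned ones:

```python
import numpy as np

# Sketch of modulated deformable sampling (DCNv2) at one output position p:
#   y(p) = sum_k w_k * x(p + p_k + dp_k) * dm_k
# where fractional locations are read with bilinear interpolation.

def bilinear(x, py, px):
    """Bilinearly sample feature map x at fractional location (py, px)."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = y0 + 1, x0 + 1
    dy, dx = py - y0, px - x0
    def at(yy, xx):  # zero padding outside the map
        return x[yy, xx] if 0 <= yy < h and 0 <= xx < w else 0.0
    return ((1 - dy) * (1 - dx) * at(y0, x0) + (1 - dy) * dx * at(y0, x1)
            + dy * (1 - dx) * at(y1, x0) + dy * dx * at(y1, x1))

def modulated_deform_sample(x, p, w, default_offsets, dp, dm):
    """y(p) for one 3x3 modulated deformable kernel (K = 9 positions)."""
    y = 0.0
    for k in range(len(default_offsets)):
        oy, ox = default_offsets[k]
        sy = p[0] + oy + dp[k][0]   # p + p_k + dp_k (rows)
        sx = p[1] + ox + dp[k][1]   # p + p_k + dp_k (cols)
        y += w[k] * bilinear(x, sy, sx) * dm[k]
    return y

x = np.arange(25, dtype=float).reshape(5, 5)
grid = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]   # default 3x3 offsets p_k
w = np.full(9, 1.0 / 9)                                    # kernel weights w_k
dp = np.zeros((9, 2))                                      # learned offsets dp_k
dm = np.ones(9)                                            # modulation scalars dm_k

# With zero offsets and unit modulation this reduces to a plain 3x3 average.
print(round(modulated_deform_sample(x, (2, 2), w, grid, dp, dm), 3))  # 12.0
```

Setting all modulation scalars to zero suppresses the output entirely, which is the mechanism DCNv2 uses to down-weight irrelevant support pixels.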

III. METHODOLOGY
The model proposed in this article is shown in Fig. 1. The model structure includes backbone, FIM, CM, ARAM, CLFM, and decoder with the skip-convolution. The backbone is based on the DenseNet [30]. The clustering algorithm and the loss functions are also presented in this section.

A. Feature Interference Module and Connection Module
As shown in Fig. 1, we construct a FIM to improve the robustness of the network. The shallow features of the input image are extracted by the first layer of the backbone, fused with random noise of the same size, and sent to the generator to obtain the interfering features, which are then injected into the trunk network to form the mixed features. A discriminator is used to distinguish the mixed feature distribution from the clean feature distribution. Both the generator and the discriminator consist of several stacked convolutional layers. Sending the mixed features to the detector introduces interference into the detection process and facilitates the learning of diverse network parameters, which enhances the anti-interference capability of the network and improves its generalization performance.

In the training process, for the same input, the network alternates between two training steps: no noise added (a = 0) and noise added (a = 1). The input image is sent into the backbone and then through the CLFM and the ARAM. The shallow features and random noise (a = 1 or a = 0) are concatenated and fed into the generator, and the resulting interfering features are fed into the trunk network. The discriminator judges whether interference has been added to the trunk features. Finally, the decoder outputs the spatial density map, and the loss against the label is computed. The process can be expressed as follows:

F_noise = Gen(Concat(F_1, Noise))
F_mix = CM(F_origin, a · F_noise)

where F_noise is the interference feature generated by the generator Gen, Noise is the random noise, F_1 denotes the shallow features, F_origin is the feature without interference, F_mix is the mixed feature after interference, CM denotes the connection module, Dis is the discriminator applied to distinguish F_mix from F_origin, De is the decoder, and F_end is the input of the decoder, so that the output density map is De(F_end).

In order to analyze the impact of different interference methods on the network, we designed four interference locations, as shown in Fig. 1: location 1 adds the interferential features after the backbone; location 2, after the cross level fusion module; location 3, after the deformable convolution block inside the adaptive residual attention module; and location 4, after the adaptive residual attention module. In addition, we designed four connection modules, as shown in Fig. 2: module (a) follows a design similar to that of Ding et al. [20]; modules (b) and (c) automatically adjust the interference ratio through learnable parameters; and module (d) adds interference twice. We analyzed the impact of the different interference locations and connection modules on network performance.
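The mixing step can be sketched as follows. This is a minimal illustration only: it assumes, as in the learnable-ratio connection modules, a scalar mixing weight (here called `beta`), and it replaces the trained generator with a placeholder mapping, so nothing here reproduces the paper's actual FIM:

```python
import numpy as np

# Hedged sketch of interference injection: shallow features and noise feed
# a generator to give F_noise, which a connection module mixes into the
# clean trunk feature. `beta` stands in for a learnable mixing ratio; the
# "generator" below is a placeholder, not a trained network.

def connection_module(f_origin, f_noise, a, beta):
    """F_mix = F_origin + a * beta * F_noise, with a in {0, 1} per training step."""
    return f_origin + a * beta * f_noise

rng = np.random.default_rng(1)
f1 = rng.normal(size=(4, 8, 8))          # shallow features F_1
noise = rng.normal(size=(4, 8, 8))       # random noise of the same size
f_noise = 0.5 * (f1 + noise)             # placeholder "generator" output
f_origin = rng.normal(size=(4, 8, 8))    # clean trunk feature F_origin

clean = connection_module(f_origin, f_noise, a=0, beta=0.3)   # step without noise
mixed = connection_module(f_origin, f_noise, a=1, beta=0.3)   # step with noise

print(np.allclose(clean, f_origin))   # True: a = 0 leaves the trunk untouched
```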

B. Adaptive Residual Attention Module
There are partial occlusions in remote sensing images, and the shapes of occluded vehicles are incomplete. The kernel of a deformable convolution can adaptively extract features through offset learning; therefore, we use deformable convolution residual blocks to improve the detection network's attention to occluded objects. In addition, the feature information of occluded and inconspicuous targets is weak, and context information effectively enhances such insignificant features, so we introduce a new attention mechanism for feature enhancement. As shown in Fig. 3(a), the front end of the module is composed of deformable convolution residual blocks, and the back end is an attention mechanism.

C. Cross Level Fusion Module
Due to the small sizes of vehicles in remote sensing images, few pixels are available, and after multiple down-samplings, some feature information is lost. In the network, the low-level layers contain spatial features and the high-level layers contain semantic features, so we combine spatial information with semantic information to obtain richer high-level features. The structure of the cross level fusion module is shown in Fig. 4. First, we adjust the receptive field of the feature layers by using atrous convolution to improve the feature fit between layers, and then aggregate them with the high-level layers. Since the output of the first layer contains too much background information, which is not conducive to detection, only the outputs of layers 2, 3, and 4 are aggregated with the high-level features.
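The receptive-field arithmetic behind atrous (dilated) convolution is standard and not specific to this paper: a k×k kernel with dilation rate d covers k + (k − 1)(d − 1) pixels per side without adding parameters. A quick check of that arithmetic:

```python
# Effective spatial extent of a dilated convolution kernel:
# a k x k kernel with dilation rate d spans k + (k - 1)(d - 1) pixels per side.

def effective_kernel(k: int, d: int) -> int:
    """Side length covered by a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

for d in (1, 2, 4):
    print(d, effective_kernel(3, d))   # a 3x3 kernel spans 3, 5, 9 pixels
```

This is why stacking a few dilation rates lets the CLFM match the receptive fields of feature layers at different depths cheaply.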

D. Decoder With the Skip Convolution
Skip connections are used in many networks to combine high-resolution shallow features with up-sampled deep feature layers, guiding local pixel classification with global contextual information. However, due to the large span between the connected feature layers, the information is discontinuous, resulting in the misclassification of pixels. To this end, we increase the coherence of features by applying skip convolutions on the skip connections. By combining local information with high-level semantic information, we jointly guide the generation of density maps. The skip convolution used in the decoder is shown in Fig. 5.

E. Clustering Algorithm
According to the spatial density map output by the neural network, we design a clustering algorithm to obtain the positions of the vehicle targets. Because the density is high at each vehicle location, each target appears as a convex distribution in the density map. We set an empirical threshold to obtain a binary map, judge the connectivity of regions with a connectivity algorithm, and then filter the binary image to obtain accurate target areas.
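A minimal sketch of this post-processing, assuming an illustrative threshold and minimum-area filter (the paper's tuned values are not given): threshold the density map, label 4-connected regions by flood fill, drop tiny regions, and report each surviving region's centroid as a detection:

```python
import numpy as np
from collections import deque

# Threshold -> binary map -> connected components -> area filter -> centroids.
# thresh and min_area are illustrative, not the paper's empirical values.

def cluster_density_map(density, thresh=0.5, min_area=3):
    binary = density > thresh
    seen = np.zeros_like(binary, dtype=bool)
    centers = []
    h, w = binary.shape
    for i in range(h):
        for j in range(w):
            if binary[i, j] and not seen[i, j]:
                # BFS flood fill over one 4-connected region
                region, q = [], deque([(i, j)])
                seen[i, j] = True
                while q:
                    y, x = q.popleft()
                    region.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(region) >= min_area:        # filter small noise blobs
                    ys, xs = zip(*region)
                    centers.append((sum(ys) / len(ys), sum(xs) / len(xs)))
    return centers

density = np.zeros((8, 8))
density[1:4, 1:4] = 0.9    # one vehicle-like blob
density[6, 6] = 0.9        # an isolated noisy pixel, filtered out
print(cluster_density_map(density))   # [(2.0, 2.0)]
```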

F. Loss Function
The trunk network uses the BCE loss for the first 150 epochs (out of 600 in total) and then the combined BCE and Dice loss. The L1 loss is used for the discriminator, and the generator loss is included in the trunk loss. The loss functions are as follows:

Loss_BCE = −[y log ŷ + (1 − y) log(1 − ŷ)]
Loss_Dice = 1 − 2Σ(y · ŷ) / (Σy + Σŷ)
Loss_trunk = Loss_BCE + α · Loss_Dice    (10)
Loss_Dis = Loss_L1(Dis(F_mix_no_noise), 0.9)
Loss_all = Loss_trunk + Loss_Dis    (13)

where y is the label value, ŷ is the output value, and α is a constant.
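The combined trunk loss can be sketched in NumPy as follows; the smoothing constant `eps` and the value of α are illustrative choices, not the paper's settings:

```python
import numpy as np

# Sketch of Loss_trunk = Loss_BCE + alpha * Loss_Dice on a flattened
# density map; eps and alpha are illustrative values.

def bce_loss(y, y_hat, eps=1e-7):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def dice_loss(y, y_hat, eps=1e-7):
    inter = np.sum(y * y_hat)
    return 1.0 - (2 * inter + eps) / (np.sum(y) + np.sum(y_hat) + eps)

def trunk_loss(y, y_hat, alpha=1.0):
    return bce_loss(y, y_hat) + alpha * dice_loss(y, y_hat)

y = np.array([1.0, 1.0, 0.0, 0.0])
perfect = trunk_loss(y, y)                              # near zero on a perfect match
worse = trunk_loss(y, np.array([0.6, 0.6, 0.4, 0.4]))   # larger for a blurry prediction
print(perfect < worse)                                  # True
```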

IV. EXPERIMENTS AND DETAILS
In this section, we introduce the datasets used for the experiments, density map generation, training details, evaluation metrics, and experimental results. All experiments are performed on an NVIDIA GeForce RTX 3090 (24 GB) GPU using the PyTorch framework.

A. Datasets
The model proposed in this article is validated on three open datasets. The first is the UCAS-AOD [31] dataset, whose image data is divided into three parts: CAR, PLANE, and NEG; this article uses the CAR part, which contains 510 remote sensing images of 1280×659 pixels with a total of 7114 vehicles. The second is the CARPK [32] dataset, which contains 1448 aerial images with a resolution of 1280×720 and a total of 89777 vehicles. The third is the OVDS [33] dataset, which contains 111 satellite images of the San Francisco area from Google Earth with a resolution of 1368×972. These datasets contain many disturbances, such as tree occlusion and street elements similar to vehicles, which increase the difficulty of vehicle detection. Some dataset samples are shown in Fig. 6.

B. Density Map
We use a 2-D Gaussian ellipse kernel function to generate density maps, where the value of each pixel represents the probability of a vehicle target. The main formulas are as follows:

f(x, y) = A · exp(−(a(x − x_0)² + 2b(x − x_0)(y − y_0) + c(y − y_0)²))
a = cos²θ / (2σ_x²) + sin²θ / (2σ_y²)
b = −sin(2θ) / (4σ_x²) + sin(2θ) / (4σ_y²)
c = sin²θ / (2σ_x²) + cos²θ / (2σ_y²)

where A is set to 255, (x_0, y_0) is the center of the vehicle, σ_x and σ_y correspond to the length and width of the vehicle, and θ is the orientation of the target vehicle. Some generated samples are shown in Fig. 7.
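A hedged sketch of such a kernel, written with an explicit coordinate rotation instead of the expanded a/b/c coefficients; the parameter mapping and lack of normalisation are assumptions for illustration:

```python
import numpy as np

# Rotated 2-D Gaussian ellipse kernel for a density-map label: amplitude A
# at the vehicle centre, spreads sigma_x / sigma_y along the vehicle's
# length and width, rotated by orientation theta. Illustrative only.

def gaussian_ellipse(shape, center, sigma_x, sigma_y, theta, A=255.0):
    H, W = shape
    yy, xx = np.mgrid[0:H, 0:W].astype(float)
    x0, y0 = center
    # rotate image coordinates into the vehicle's frame
    xr = (xx - x0) * np.cos(theta) + (yy - y0) * np.sin(theta)
    yr = -(xx - x0) * np.sin(theta) + (yy - y0) * np.cos(theta)
    return A * np.exp(-(xr ** 2 / (2 * sigma_x ** 2) + yr ** 2 / (2 * sigma_y ** 2)))

kernel = gaussian_ellipse((32, 32), center=(16, 16), sigma_x=6, sigma_y=2,
                          theta=np.pi / 4)
print(kernel.max())   # 255.0, attained at the vehicle centre
```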

C. Training Details
The network weights are initialized with the Kaiming method, and the generative adversarial components are initialized with PyTorch's default settings. In the training phase, we use the Adam optimizer with an initial learning rate of 1e−4. During the first 150 epochs, no noise is added at all (a = 0) and the learning rate is unchanged; after 150 epochs, noise is added to the network (a = 1 or a = 0), and the learning rate decays by a factor of 10 every 150 epochs (600 epochs in total). The input to the network is a three-channel RGB image of size 256×256; preprocessing operations such as scaling and random rotation are performed before the images are fed into the network.
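The schedule described above can be expressed as a small helper; `schedule` is a hypothetical name, and the decay is modeled as division by 10 every 150 epochs:

```python
# Training schedule sketch: noise disabled for the first 150 epochs, then
# enabled; learning rate 1e-4, divided by 10 every 150 epochs (600 total).

def schedule(epoch: int, base_lr: float = 1e-4):
    """Return (noise_enabled, learning_rate) for a given epoch (0-indexed)."""
    noise = epoch >= 150
    lr = base_lr / (10 ** (epoch // 150))
    return noise, lr

for e in (0, 149, 150, 300, 599):
    print(e, schedule(e))
```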

D. Evaluation Metrics
In this article, we use precision, recall, and F1-score to evaluate the experimental results. The formulas are as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)

where true positive (TP) is the number of positive samples correctly judged as positive, false positive (FP) is the number of negative samples incorrectly judged as positive, and false negative (FN) is the number of positive samples incorrectly judged as negative.
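These metrics follow directly from the confusion counts; a small self-contained check:

```python
# Precision, recall, and F1 computed from TP/FP/FN counts.

def prf(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) for the given confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = prf(tp=80, fp=20, fn=20)
print(round(p, 3), round(r, 3), round(f, 3))   # 0.8 0.8 0.8
```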

E. Experimental Results
In order to verify the effectiveness of our proposed modules, we conducted ablation experiments. The experimental results are shown in Table I. The F1 score of the network increased by 0.43% after adding the CLFM, the F1 score of the network increased by 0.51% after adding the ARAM, and the F1 score of the network increased by 0.82% after adding the FIM. In addition, we list the number of parameters and computations for each network structure in Table I.
We analyzed the impact of different interference locations and connection modules on the network performance. We found that the best results are obtained by adding interference at location 2 using connection module (b); the experimental results are shown in Tables II and III.

TABLE I: RESULTS OF ABLATION STUDIES ON DIFFERENT COMPONENTS ON UCAS-AOD

TABLE II: INFLUENCES

To study the robustness of our method, we follow the experimental protocol of Makantasis et al. [17], [18], observing how the accuracy of the network varies with the number of training samples and comparing it with other methods. We set the proportion of the training set to 25%, 50%, 75%, and 100%, keeping the test set unchanged. The experimental results are shown in Table VIII. When the proportion of the training set is 50% or more, our network still achieves better accuracy than the other methods; however, when the proportion drops to 25%, our network is less accurate than the other networks.
In addition, we compare with some semisupervised networks, setting the training-set label ratio to 50%; the results are shown in Table IX. Due to the reduced labeling information, the accuracy of our network is slightly lower than that of DNN-WeiAve but higher than that of BAS4Net.

F. Visualization
To demonstrate the effectiveness of our proposed model, we visualize part of the experimental results. Fig. 8 shows the detection results under different numbers of network parameters: as the number of parameters increases, the detection accuracy of the network increases. As shown in Fig. 9, compared with the other three methods, our network reduces missed and false detections to a certain extent, thereby improving detection accuracy. In addition, as shown in Fig. 10, in areas where targets are heavily occluded, our network still produces some missed detections.

V. CONCLUSION
In this article, we propose FICLAR-Net, a vehicle detection network for remote sensing images based on feature anti-interference and adaptive residual attention. We use the FIM to obtain interferential features and inject them into the main network to enhance its anti-interference ability. We construct an ARAM based on deformable convolution, which utilizes contextual information to enhance weak features and realizes adaptive feature extraction. In view of the scarcity of available features for vehicle targets, we propose a CLFM to obtain richer features and improve the detection performance of the network. We conduct experiments on three open datasets to verify the effectiveness of our method. The model also has shortcomings: it is a supervised network requiring manual annotation and a certain amount of memory. In future work, we will try to combine the method with self-supervised methods to improve the learning ability of the network and reduce the reliance on annotated data.