Introduction
Synthetic aperture radar (SAR) is an active microwave imaging sensor, which realizes high-resolution imaging and recognition of the region of interest (ROI) and plays an irreplaceable role in military surveillance and civilian remote sensing [1], [2], [3], [4], [5]. As one of the key applications of SAR, ground moving target indication (GMTI) has attracted the interest of many researchers, aiming to quickly find and locate the ROI in noisy and complex backgrounds [6], [7].
Traditional SAR-GMTI algorithms mainly operate on the radar echo, such as the constant false alarm rate (CFAR) algorithm [8]. Since the CFAR detector relies on the statistical characteristics of clutter, its detection performance is easily disturbed by background clutter, and the false alarm rate increases significantly, especially under complex urban backgrounds. Guo et al. [9] proposed a modified SAR moving target detection method based on robust principal component analysis, which separates the sparse matrix of moving targets from the low-rank matrix of the static background. In [10], based on a clutter suppression and multitarget tracking algorithm, the target trajectory was reconstructed by a parameter estimation method built on Doppler features and circular SAR geometry. In [11], moving targets with different parameters were classified according to the relative positions of the moving target spectrum and the clutter spectrum. On this basis, a two-stage GMTI method was proposed to detect multiple moving targets, especially those submerged by clutter. However, the abovementioned traditional SAR-GMTI algorithms rely heavily on the output signal-to-clutter-and-noise ratio [12] and signal processing technology. Clutter disturbance under complex backgrounds can easily affect the detection results, and the computational complexity is rather high.
In recent years, intelligent SAR-GMTI algorithms based on deep learning have been continuously improved and developed, which realize the detection of ground moving targets by learning the morphological characteristics of 2-D SAR images. In [13], a new anchor-free moving target detection algorithm was proposed to improve the performance of the model by enhancing the strong scattering properties of the targets, and the effectiveness of the algorithm was verified on published relevant datasets. Wang et al. [15] treated the SAR-GMTI task as a blind inverse problem and tackled it with a deep complex-valued convolutional network, which significantly improved the detection and refocusing accuracy. Aiming at the problems of insufficient feature extraction and high false alarm rates in current SAR-GMTI algorithms based on the convolutional neural network (CNN), Mu et al. [14] proposed a network model based on the improved GoogLeNet architecture to enhance the dependencies between channels, obtain more context information, and improve the detection performance under complex backgrounds. Inspired by transfer learning, Zhang et al. [15] proposed a subdomain adaptive residual network for moving target detection in the multichannel SAR system, which was verified on three-channel SAR data.
However, these algorithms mainly focus on optimizing the detection performance of moving targets under the same background, but most of them lack flexibility and universality for the targets under varying backgrounds. In practical applications, the background where the moving target is located often changes, so it is of great research significance to realize the robust detection of the moving targets under new backgrounds. The performance of the intelligent detection algorithm depends largely on the quantity and quality of sample datasets, but it is difficult to obtain the measured SAR data due to the operating cost and complexity of the SAR system. The performance and generalization capability of the neural network will be extremely constrained when only a small number of SAR image samples are used for training.
In this article, the SAR imaging algorithm is adopted to construct the SAR sample datasets for network training and testing. On this basis, we propose a new method that enables the network to realize the robust detection of ground moving targets under different backgrounds. The contributions of this article are given as follows.
To tackle the problem of insufficient SAR image datasets when applying intelligent detection algorithms to SAR-GMTI, a new method of experimental data construction is proposed in this article to simulate multiple moving targets under varying backgrounds combined with the SAR imaging algorithm, taking the backprojection (BP) algorithm [16] as an example. Particularly, the characteristics of SAR moving targets and scenes are fully considered in the experiment to ensure proper and reliable simulation results. By traversing the location parameters of the moving targets (azimuth–range coordinates) within a reasonable range, it is easy to obtain a large number of SAR image datasets required for network training and testing accordingly.
In this article, the adaptive spatial location extraction network based on the deformable module (ASLE-DM) is proposed to adaptively model input targets with different shapes, which realizes robust detection of polymorphous moving targets in the SAR scene. The network can adjust the positions of the spatial sampling points and the size of the receptive field according to the geometric transformation, thus optimizing the feature extraction mode of moving targets and improving the detection performance of the model. Especially for SAR systems in which the imaging scene rotates due to the change of observation angle (such as circular spotlight SAR), ground moving targets can still be detected correctly with no need for image registration [17].
The proposed method possesses flexibility and universality for the moving targets under varying backgrounds in practical applications. For the intelligent detection of moving targets under the complex background, the interference of artificial buildings, rocks, and vegetation in the scene will bring about many false alarms in the detection results. In this regard, the multichannel clutter suppression algorithm of airborne SAR is applied to significantly suppress the background clutter, while the moving target information is fully reserved to further mitigate the false alarms in the detection results. Even when there exists channel error in practical applications, the network can still realize the robust detection of moving targets in different SAR scenes.
Construction of Multiple Moving Target Datasets
At present, most of the SAR-GMTI algorithms based on deep learning focus on the optimization of moving target detection performance under a single background. However, in practical applications, the scene where the target is located often varies, so it is of great significance to explore how to robustly detect the moving targets under a new background.
For the target detection algorithm based on deep learning, whether the network can realize the flexible and robust detection of ROI depends largely on the quality and quantity of the training sample datasets. However, it is extremely difficult to obtain the measured SAR data for network training and testing, and it is even impossible to obtain the datasets of moving targets under different backgrounds required in the experiment. Therefore, the construction of multiple high-fidelity simulation sample sets is the key to improving the performance of the algorithm.
A. Signal Model
Assuming that the airborne SAR works in the spotlight imaging mode, the linear frequency-modulated continuous wave (LFMCW) transmitted by the SAR can be expressed as
\begin{equation*}
{S_{0}}({\tau })=A\text{rect}\left(\frac{\tau }{T_{r}}\right)\exp (j2{\pi }{f_{0}}{\tau })\exp (j{\pi }K_{r}{{\tau }^{2}}) \tag{1}
\end{equation*}
where $A$ denotes the signal amplitude, ${\tau }$ is the fast time, $T_{r}$ is the pulse duration, ${f_{0}}$ is the carrier frequency, and $K_{r}$ is the chirp rate.
For a point target, the received echo is a replica of the transmitted signal delayed by the two-way propagation time ${\tau }_{i}$
\begin{equation*}
\begin{split} r_{0}({\tau },t)=&A\text{rect}\left(\frac{{\tau }-{{\tau }_{i}}}{T_{r}}\right)\exp (j2{\pi }{f_{0}}({\tau }-{\tau }_{i}))\\
&\exp (j{\pi }K_{r}({{{\tau }-{\tau }_{i}})^{2}}) \end{split} \tag{2}
\end{equation*}
Substituting ${\tau }_{i}=2R(t)/c$, where $R(t)$ denotes the instantaneous slant range between the radar and the target, the demodulated echo can be expressed as
\begin{equation*}
\begin{split} S({\tau },t)=&{A_{0}}\text{rect}\left(\frac{{\tau }-\frac{2R(t)}{c}}{T_{r}}\right)\exp \left({-j4{\pi }{f_{0}}{\frac{R(t)}{c}}}\right)\\
&\exp \left(j{\pi }K_{r}({{{\tau }-{\frac{2R(t)}{c}}})^{2}}\right) \end{split} \tag{3}
\end{equation*}
After range pulse compression, the echo signal can be expressed as
\begin{equation*}
S_{rc}({\tau },t)={A^{\prime}}{{\rho }_{r}}\left({\tau }-\frac{2R(t)}{c}\right)\exp \left({-j4{\pi }{f_{0}}\frac{R(t)}{c}}\right) \tag{4}
\end{equation*}
where ${A^{\prime}}$ is the amplitude after compression and ${{\rho }_{r}}(\cdot)$ denotes the compressed pulse envelope.
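The point-target model in (1)–(4) can be sketched numerically. The following minimal example is an illustration, not the paper's implementation: all parameter values ($f_0$, $T_r$, $B$, $R$) are assumed placeholders, and rect$(\cdot)$ is taken as a pulse supported on $[0, T_{r})$.

```python
import numpy as np

# Hedged sketch of the point-target model in (1)-(4); all numbers are
# illustrative assumptions, not the paper's system parameters.
c = 3e8
f0 = 10e9            # carrier frequency (Hz)
Tr = 10e-6           # pulse duration (s)
B = 100e6            # bandwidth (Hz)
Kr = B / Tr          # chirp rate (Hz/s)
fs = 2 * B           # fast-time sampling rate
tau = np.arange(0, 30e-6, 1 / fs)          # fast-time axis

R = 300.0                                   # slant range of the scatterer (m)
td = 2 * R / c                              # two-way propagation delay, cf. (2)

def chirp(t):
    """Baseband LFM pulse: rect(t/Tr) * exp(j*pi*Kr*t^2), cf. (1)."""
    return ((t >= 0) & (t < Tr)) * np.exp(1j * np.pi * Kr * t ** 2)

# Demodulated echo: delayed chirp times the range phase term, cf. (3).
echo = chirp(tau - td) * np.exp(-1j * 4 * np.pi * f0 * R / c)

# Range compression = matched filtering with the reference chirp, cf. (4):
# the compressed envelope rho_r peaks at the two-way delay td.
ref = chirp(tau)
compressed = np.fft.ifft(np.fft.fft(echo) * np.conj(np.fft.fft(ref)))
peak_idx = int(np.argmax(np.abs(compressed)))
print(peak_idx, round(td * fs))   # compressed peak lands at the delay in samples
```

After compression, the chirp energy collapses to a narrow peak at the delay $2R/c$, which is the property the BP algorithm exploits below.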
B. BP Algorithm
With the continuous development of SAR technology, high-resolution imaging algorithms [18], [19] have emerged one after another. Taking the BP imaging algorithm as an example, the construction of diverse SAR image sample sets for neural network training and testing is completed in this article. The BP algorithm is a typical time-domain imaging algorithm. Compared with the imaging algorithms conducted in the frequency domain, such as polar format algorithm, range–Doppler, and chirp scaling, it can accurately eliminate the coupling between range and azimuth directions without any assumption of slant range approximation, and it is applicable to the radar system in any working mode [20].
The imaging diagram of the BP algorithm is shown in Fig. 1. The imaging scene is divided into a grid of pixels.
For each pixel in the imaging grid, the instantaneous slant range between the pixel and the platform at the current azimuth moment is calculated to acquire the two-way propagation delay and the corresponding phase compensation factor. By interpolating the range-compressed echo signal, the amplitude value at that delay is assigned to the pixel. The pulses at all azimuth moments are then coherently accumulated to obtain the imaging result of each pixel
\begin{equation*}
I(x,y)=\sum _{t={t_{1}}}^{t={t_{2}}}S_{rc}({\tau }_{xy},t)\exp (j2{\pi }{f_{0}}{{\tau }_{xy}}) \tag{5}
\end{equation*}
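The accumulation in (5) can be sketched as below. This is a hedged illustration rather than the paper's implementation: `S_rc` is assumed to be a 2-D array of range-compressed echoes (fast time × slow time), `platform_pos` is a hypothetical function returning the platform position at each slow time, and linear interpolation stands in for whatever interpolation kernel is used in practice.

```python
import numpy as np

def backprojection(S_rc, tau_axis, t_axis, platform_pos, grid_x, grid_y, f0, c=3e8):
    """Coherently accumulate each pulse onto every pixel of the grid, cf. (5)."""
    image = np.zeros((len(grid_y), len(grid_x)), dtype=complex)
    for k, t in enumerate(t_axis):                 # one pulse per azimuth moment
        px, py, pz = platform_pos(t)               # platform position at slow time t
        for iy, y in enumerate(grid_y):
            for ix, x in enumerate(grid_x):
                R = np.sqrt((x - px) ** 2 + (y - py) ** 2 + pz ** 2)
                tau_xy = 2.0 * R / c               # two-way delay to this pixel
                # interpolate the range-compressed echo at the pixel delay
                # (np.interp is real-valued, so real/imag parts are split)
                s = np.interp(tau_xy, tau_axis, S_rc[:, k].real) \
                    + 1j * np.interp(tau_xy, tau_axis, S_rc[:, k].imag)
                # phase compensation exp(j*2*pi*f0*tau_xy), then accumulate
                image[iy, ix] += s * np.exp(1j * 2 * np.pi * f0 * tau_xy)
    return image
```

For a scatterer inside the grid, the phase-compensated pulses add in phase only at its own pixel, so the image peaks there; the loops are written for clarity and would be vectorized in any production code.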
C. Construction of Multiple Moving Target Datasets Under Different Backgrounds
As the SAR scene and position state where the moving target is located always vary in practical applications, the simulation dataset should possibly comprise the moving targets under different backgrounds and at different positions. In the process of simulation, the matching between the moving target and the scene (such as scale ratio and contrast) also needs to be considered reasonably.
After setting parameters such as the position coordinates of the moving targets according to the characteristics of the scene, the moving target simulation is conducted, and the obtained echo data are stored in a matrix for subsequent processing. For the echo simulation of the SAR scene, as SAR images are grayscale images, the pixel values need to be preprocessed first, that is, normalized to 0–255 pixel by pixel. Then, the obtained echo data of each SAR scene and the moving targets are adaptively added, and a noise component is added to make the simulation results closer to practical applications. The SAR images of moving targets under different backgrounds can be obtained by processing the integrated echo data with the BP algorithm described in Section II-B. In addition, the noise component is superimposed on the echo of the moving targets alone so as to obtain the SAR datasets for network training. The main steps of the simulation dataset construction are shown in Fig. 2.
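The normalization and echo-fusion steps above can be sketched as follows; the helper names and the definition of the SNR relative to the mixed signal power are assumptions for illustration.

```python
import numpy as np

def normalize_to_gray(scene):
    """Normalize scene reflectivity to 0-255 gray levels, pixel by pixel."""
    s = scene.astype(float)
    return 255.0 * (s - s.min()) / (s.max() - s.min() + 1e-12)

def fuse_with_noise(scene_echo, target_echo, snr_db=15, rng=None):
    """Add the target echo to the scene echo, then superimpose complex
    Gaussian noise scaled to a given SNR (here defined against the
    mixed-signal power -- an assumption for this sketch)."""
    rng = np.random.default_rng(rng)
    mixed = scene_echo + target_echo
    p_sig = np.mean(np.abs(mixed) ** 2)
    p_noise = p_sig / (10 ** (snr_db / 10))
    noise = np.sqrt(p_noise / 2) * (rng.standard_normal(mixed.shape)
                                    + 1j * rng.standard_normal(mixed.shape))
    return mixed + noise
```

The noisy fused echo would then be passed through the BP algorithm to produce one training or test image.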
As the position of a moving target can be set arbitrarily within a reasonable range, it is easy to obtain a large number of SAR image sample sets by traversing the position parameters (azimuth–range coordinates) of the targets. In this article, typical SAR scenes (desert, complex urban backgrounds, forest, etc.) are selected for simulation, and the imaging results are shown in Fig. 3. Under the desert background in Fig. 3(a), some rocks show strong scattering properties, and the features of gullies and target trajectories are obvious. In addition, the overall characteristic of the desert is similar to the noise background. However, there is considerable interference from ground objects such as vegetation, roads, and artificial buildings in the other backgrounds shown in Fig. 3(b)–(f), which makes it difficult for the network to accurately detect the moving targets in the scene.
Simulation results of SAR scenes. (a) Scene 1 (desert). (b) Scene 2 (urban road). (c) Scene 3 (artificial buildings). (d) Scene 4 (apron). (e) Scene 5 (forest). (f) Scene 6 (baseball diamond).
Combined with the constructed SAR image sample sets, it is proved that the proposed method can realize robust detection of moving targets under different backgrounds. The overall framework of this article is shown in Fig. 4. In particular, the simulation dataset of moving targets under the noise background is used for network training, and the trained model is verified by using the simulation test sets of moving targets under different backgrounds.
Overall framework of realizing robust detection of ground moving targets under different backgrounds based on the ASLE-DM network.
Adaptive Feature Extraction of Polymorphic Ground Moving Targets Using ASLE-DM
The CNN [22], [23] has a limited ability to adaptively model geometric transformations such as variations of target shape and scale due to its fixed model structure, which greatly limits the detection performance of the network. The ASLE-DM network based on the spatial deformable module is proposed in this article, allowing the network to adjust the spatial sampling positions and the ROI feature mapping mode according to the geometric characteristics of the input feature map [24], [25].
A. Spatial Deformable Module
The receptive field plays an important role in feature extraction, and it should be neither too big nor too small. In the traditional CNN network structure, all the activation units in the same CNN layer have the same receptive field when processing different input feature maps, which is obviously undesirable for the deep CNN layer to encode high-level semantic information. The deformable convolution and deformable ROI pooling module can adjust the position of spatial sampling points according to the shapes of input feature maps, and the receptive field is transformed accordingly.
1) Deformable Convolution
Based on standard convolution, a 2-D offset is additionally designed for the spatial sampling positions, which provides the network with the capacity to model freely deformed grid points. Deformable convolution is conducted in two steps: 1) the input feature map $I_{fm}$ is sampled over a regular grid $G$; and 2) the sampled values are weighted by the kernel $w$ and summed. For standard convolution, the output feature at position $g_{0}$ is
\begin{equation*}
O_{fm}(g_{0})=\sum _{{g_{n}}\in G}w(g_{n})\cdot I_{fm}(g_{0}+g_{n}) \tag{6}
\end{equation*}
In the deformable convolution, the regular grid $G$ is augmented with an extra learnable offset $\triangle {p_{n}}$ for each sampling point, and (6) becomes
\begin{equation*}
O_{fm}(g_{0})=\sum _{{g_{n}}\in G}w(g_{n})\cdot I_{fm}(g_{0}+g_{n}+\triangle {p_{n}}). \tag{7}
\end{equation*}
Then, the current sampling point lies at the irregular offset position $g_{0}+g_{n}+\triangle {p_{n}}$. As the offset is typically fractional, the feature value at this position is computed by bilinear interpolation.
The feature map is first calculated on the input image by the standard convolution operation, and the position offsets are then obtained by another convolution layer. The generated offset field has $2N$ channels, where $N$ is the number of sampling points in the grid, since each sampling point requires a 2-D offset.
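A scalar sketch of the sampling rule in (6) and (7) may clarify the mechanism. The function names are placeholders; fractional offsets are resolved by bilinear interpolation, as is standard in the deformable convolution literature.

```python
import numpy as np

def bilinear(I, y, x):
    """Bilinearly interpolate feature map I at the fractional position (y, x)."""
    H, W = I.shape
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * I[y0, x0] + (1 - wy) * wx * I[y0, x1]
            + wy * (1 - wx) * I[y1, x0] + wy * wx * I[y1, x1])

def deformable_conv_at(I, w, g0, offsets):
    """Compute O_fm(g0) of (7): sample I at g0 + g_n + dp_n, weight by w(g_n).
    `offsets` holds one (dy, dx) pair per grid point; all zeros reduces
    the operation to the standard convolution of (6)."""
    k = w.shape[0]
    r = k // 2
    out = 0.0
    for i in range(k):
        for j in range(k):
            gy, gx = i - r, j - r                  # regular grid point g_n
            dy, dx = offsets[i, j]                 # learned offset dp_n
            out += w[i, j] * bilinear(I, g0[0] + gy + dy, g0[1] + gx + dx)
    return out
```

With all offsets set to zero the result equals the fixed-grid sum in (6), which is a convenient sanity check for any deformable layer.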
2) Deformable ROI Pooling
ROI pooling is commonly applied in the object detection network with region proposal structure to convert the input feature map with arbitrary size to the fixed-size output so as to realize feature dimension reduction and data compression. Deformable pooling adds an offset for each box position and learns the offset from the prior feature map and ROI to implement the adaptive positioning of input with different shapes.
Given the input feature map $X$, an ROI with its top-left corner at $p_{0}$ is divided into $k\times k$ bins. Standard ROI pooling averages the $n_{ij}$ pixels within the $(i,j)$th bin of the ROI
\begin{equation*}
y(i,j)=\sum _{p \in \text{box}(i,j)}X(p_{0}+p)/n_{ij} \tag{8}
\end{equation*}
Deformable ROI pooling adds a learnable offset $\triangle p_{ij}$ to each bin, and (8) becomes
\begin{equation*}
y(i,j)=\sum _{p \in \text{box}(i,j)}X(p_{0}+p+\triangle p_{ij})/n_{ij}. \tag{9}
\end{equation*}
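The bin-wise rule in (8) and (9) can be sketched as below. Rounding the bin borders to integer pixels is a simplification of the bilinear sampling used in practice, and the function name is a placeholder.

```python
import numpy as np

def deformable_roi_pool(X, p0, roi_w, roi_h, k, offsets):
    """k x k deformable ROI pooling, cf. (9): average each bin of the ROI
    (top-left corner p0) after shifting the bin by its learned offset
    dp_ij, given in pixels; zero offsets reproduce standard pooling (8)."""
    bw, bh = roi_w / k, roi_h / k                 # bin size
    out = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            dy, dx = offsets[i, j]                # dp_ij for this bin
            ys = p0[0] + i * bh + dy
            xs = p0[1] + j * bw + dx
            y0, y1 = int(round(ys)), int(round(ys + bh))
            x0, x1 = int(round(xs)), int(round(xs + bw))
            y0 = max(y0, 0)
            x0 = max(x0, 0)
            y1 = min(max(y1, y0 + 1), X.shape[0])
            x1 = min(max(x1, x0 + 1), X.shape[1])
            out[i, j] = X[y0:y1, x0:x1].mean()    # average over n_ij pixels
    return out
```

Each bin can thus drift independently toward the informative part of the target, which is what gives the module its adaptability to irregular shapes.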
B. Basic Structure of ASLE-DM
The network is required to adaptively adjust the size of receptive field to learn more effective feature information when confronted with input targets with different shapes. As a consequence, the ASLE-DM network model structure based on the spatial deformable module is proposed in this article to offer a more flexible and effective ROI feature extraction mode and improve the adaptive modeling capability of geometric transformations. The network structure of ASLE-DM is shown in Fig. 5. The ASLE-DM network structure mainly consists of three modules.
Multireceptive field feature extraction: The original fixed-size convolution and pooling operations are replaced by the deformable modules described in Section III-A. The size of the receptive field is adaptively adjusted according to the geometric transformations of the input. The spatial sampling positions of the convolution and pooling layers are enriched by additional offsets to obtain more diverse feature information. This module extracts the high-level semantic information of the input feature map (downsampling) and generates a feature map that is more suitable for subsequent dense prediction. It has a strong ability to adaptively model input targets with different shapes.
Shallow feature fusion: Inspired by the feature pyramid network [26], shallow feature information (such as edges and textures) and high-level features are fully fused on multiscale feature maps through splicing and addition modes. In this way, the feature map has not only rich semantic information but also accurate position information of the target, which helps the network understand the position and shape of the monitored targets in subsequent prediction better.
Multiobject prediction: The prediction head is stacked by a series of convolution layers and fully connected layers. Based on the preset prior anchors, multidimensional arrays that record the coordinates, width, height, category, and confidence of the detection boxes on the input feature maps are calculated by the convolution layers. Afterward, each bounding box is further classified via fully connected layers and the softmax function, and the coordinates, width, and height of the detection box are simultaneously regressed by linear transformation, fully connected layers, and the sigmoid function. Ultimately, the parameters of the detection boxes are spliced and output as arrays after being filtered by several postprocessing steps to obtain the final detection results.
C. Effective Receptive Field
Due to the change of observation angle, the imaging results of SAR scenes may rotate at different angles, as shown in Fig. 6. For the neural network, the shapes of the moving targets under surveillance are also transformed, and the morphological characteristics of such targets are not completely the same as those of standard rectangular targets. Whether the network can adjust to the geometry transformations is the key to realizing robust detection of the moving targets under different backgrounds.
Taking two stacked convolution layers as an example, the receptive fields and sampling positions of the standard and deformable modules are compared in Fig. 7. The standard convolution samples a fixed regular grid, whereas the deformable module adjusts its sampling positions according to the shape of the input target, so the effective receptive field adapts to the geometric transformation of the input.
Receptive field and sampling position of standard and deformable module. (a) Standard convolution. (b) Deformable module.
Robust Detection of Moving Targets Under Different Backgrounds Based on Clutter Suppression
The intelligent detection of moving targets under a complex background is easily disturbed by background clutter. The scattering coefficients of artificial buildings, rocks, vegetation, and other ground objects are not much different from those of the monitored vehicles, and they show similar characteristics in imaging results, which will bring a lot of false alarms to the detection results. In order to improve the performance of the network under different backgrounds, the imaging results need to be further processed by the clutter suppression technology to alleviate the interference from background clutter [27], [28], [29].
Compared with traditional single-channel moving target detection methods, the multichannel SAR-GMTI system can effectively suppress background clutter, reserve the information of moving targets, and effectively detect ground moving targets. At present, the multichannel SAR-GMTI methods, such as displaced phase center antenna (DPCA) [30], along-track interferometry, and space-time adaptive processing [31], [32], can be flexibly selected for different SAR systems. Based on clutter suppression technology, the false alarms in the detection results can be significantly mitigated to realize the robust detection of ground moving targets under different backgrounds.
The geometric observation model of the multichannel airborne SAR system is shown in Fig. 8 (number of channels: $N$).
In this article, the dual-channel DPCA technology [30] is taken as an example to suppress the background clutter, so the number of channels is $N=2$. The instantaneous slant ranges between the moving target and the two channels can be expressed as
\begin{align*}
R_{1}(t)=&\sqrt{(x_{0}+v_{a}t-vt)^{2}+(y_{0}-v_{y}t)^{2}+H^{2}} \tag{10}\\
R_{2}(t)=&\sqrt{(x_{0}+v_{a}t-vt+d)^{2}+(y_{0}-v_{y}t)^{2}+H^{2}} \tag{11}
\end{align*}
After demodulation, the echo signal received by channel $i$ ($i=1,2$) can be expressed as
\begin{equation*}
\begin{split} S_{i}({\tau },t)=&{A_{0}}\text{rect}\left(\frac{{\tau }-\frac{2R_{i}(t)}{c}}{T_{r}}\right)\exp \left({-j4{\pi }{f_{0}}{\frac{R_{i}(t)}{c}}}\right)\\
&\exp \left(j{\pi }K_{r}({{{\tau }-{\frac{2R_{i}(t)}{c}}})^{2}}\right). \end{split} \tag{12}
\end{equation*}
The obtained echo signal after range pulse compression can be expressed as
\begin{equation*}
S_{rc,i}({\tau },t)=A^{\prime}{\rho }_{r}\left(\tau -\frac{2R_{i}(t)}{c}\right)\exp \left(-j4{\pi }f_{0}\frac{R_{i}(t)}{c}\right) \tag{13}
\end{equation*}
In the actual SAR system, the inconsistency of the radar antenna patterns, the error of the antenna phase centers, the instability of the platform velocity, noise, and environmental interference will cause amplitude and phase errors between the channels [33], which greatly limit the performance of clutter suppression. It is therefore particularly important to study how to realize the robust detection of moving targets even when channel errors exist. To this end, an amplitude error $A_{e}$ and a phase error $P_{h}$ are further added to channel 2, and the range-compressed echo signal of channel 2 becomes
\begin{equation*}
{S^{\prime}_{{rc},2}}{({\tau },t)}=A_{e} \cdot S_{rc,2}(\tau,t) \cdot \exp (j{P_{h}}) \tag{14}
\end{equation*}
Then, the range-compressed echo signals of channels 1 and 2 are, respectively, simulated according to the steps described in Section II-B. The echo signal after phase compensation is coherently accumulated to obtain the imaging result of each pixel in the scene
\begin{equation*}
I_{i}{(x,y)}=\sum _{t=t_{1}}^{t=t_{2}}S_{rc,i}({\tau }_{xy},t)\exp (j2{\pi }{f_{0}}{\tau }_{xy}) \tag{15}
\end{equation*}
Finally, the echo data of the two channels after imaging are further processed by the DPCA technology in the complex image domain, which can be expressed as
\begin{equation*}
I_{1,2}(x,y)=I_{1}(x,y)-I_{2}(x,y). \tag{16}
\end{equation*}
Consequently, most of the background clutter is significantly suppressed, and the moving target information is sufficiently reserved.
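A toy numerical sketch of the image-domain DPCA cancellation in (16): after co-registration, static clutter is identical in the two channel images and cancels, while a mover carries an extra interferometric phase and survives the subtraction. The scene size, target amplitude, and phase value below are illustrative assumptions.

```python
import numpy as np

# Toy image-domain DPCA, cf. (16). Static clutter appears identically in
# both co-registered channel images; a moving target acquires an extra
# interchannel phase proportional to its radial velocity (value assumed).
rng = np.random.default_rng(0)
clutter = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))

target = np.zeros((64, 64), dtype=complex)
target[32, 32] = 20.0           # one strong moving scatterer (assumed amplitude)
dpca_phase = np.pi / 3          # interchannel phase of the mover (assumed)

I1 = clutter + target                               # channel-1 image
I2 = clutter + target * np.exp(1j * dpca_phase)     # channel-2 image

I_dpca = I1 - I2                                    # (16)
# clutter cancels exactly; the mover remains weighted by |1 - e^{j*phase}|
```

In this idealized case the clutter residue is exactly zero; with the amplitude and phase errors of (14), the cancellation degrades, which is why robustness to channel error is examined in the experiments.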
Experiment
Before applying the ASLE-DM network to detect the ground moving targets, it is necessary to complete the construction of moving target simulation sample sets. In this article, the BP imaging algorithm is taken as an example to simulate moving targets and diverse SAR scenes. The simulation parameters of the airborne SAR system are set in Table I.
A. Construction Results of Ground Moving Target Simulation Sample Sets Under Different Backgrounds
The shape of the vehicle to be monitored can be approximately considered as an irregular rectangle, and the number of moving targets in the scene is set within 5–10 according to the characteristics of each scene. In the process of moving target simulation, such parameters as size, shape, number, and position of moving target can be set arbitrarily, so it is easy to obtain a large number of simulation sample sets for the experiment, and some imaging results are shown in Fig. 9(a)–(d). In order to make the imaging results closer to practical applications, noise component is added to the echo simulation of moving targets, and the signal-to-noise ratio is 15 dB, as shown in Fig. 9(e)–(h). The obtained moving target sample sets under the noise background are used for network training.
Simulation results of ground moving targets. (a)–(d) Imaging results of moving targets. (e)–(h) Imaging results of moving targets under the noise background.
The azimuth position coordinate of moving target is traversed within a reasonable range (step size: 1). Accordingly, 350 SAR image sample sets under the noise background are obtained, and then, the quantity is expanded to 4000 by image enhancement (image rotation, cropping, translation, etc.). The obtained moving target image sample sets under the noise background are fed into neural network for training. Furthermore, the echo signal of the moving target is separately fused with that of each SAR scene shown in Fig. 3 to obtain the test sample sets. Some of the imaging results are shown in Fig. 10.
Simulation sample sets of moving targets under different backgrounds. (a) Scene 1 (desert). (b) Scene 2 (urban road). (c) Scene 3 (urban building). (d) Scene 4 (apron). (e) Scene 5 (forest). (f) Scene 6 (baseball diamond).
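The expansion of the training set by image enhancement can be sketched as below; the exact transforms and parameter ranges of the pipeline are not specified in the text, so the rotation, translation, and cropping settings here are assumptions.

```python
import numpy as np

# Hedged sketch of the augmentation used to expand the training set
# (rotation, cropping, translation); ranges and counts are assumptions.
def augment(img, rng):
    ops = []
    ops.append(np.rot90(img, k=rng.integers(1, 4)))            # rotation
    dy, dx = rng.integers(-8, 9, size=2)
    ops.append(np.roll(img, (dy, dx), axis=(0, 1)))            # translation
    h, w = img.shape
    y0, x0 = rng.integers(0, h // 4), rng.integers(0, w // 4)
    crop = img[y0:y0 + 3 * h // 4, x0:x0 + 3 * w // 4]         # cropping
    ops.append(np.pad(crop, ((0, h - crop.shape[0]), (0, w - crop.shape[1]))))
    return ops

rng = np.random.default_rng(0)
base = [rng.standard_normal((64, 64)) for _ in range(10)]      # stand-in samples
dataset = [im for img in base for im in [img] + augment(img, rng)]
```

With three transforms per sample, each base image yields four training samples, which is the kind of multiplication that expands a few hundred simulated images into several thousand.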
In particular, the variation of the observation angle of some SAR systems (such as circular spotlight SAR) will lead to the rotation of the imaging scene. Therefore, the simulation of moving targets at a certain rotation angle (30°) is also conducted so that the test set covers the rotated imaging scenes.
B. Results of Moving Target Detection Using ASLE-DM
The simulation sample set of moving targets under the noise background constructed in Section V-A is input into the ASLE-DM network for training, and the trained model is used to detect the SAR images under different backgrounds in the test set. The partial detection results of moving targets under different backgrounds obtained by using the ASLE-DM network are shown in Fig. 11(s)–(x).
(a)–(x) Detection results of moving targets under multiple backgrounds based on different methods. (Red rectangle represents correctly detected moving target, green ellipse represents missing alarm, and yellow ellipse represents false alarm.) Each row represents different detection methods (rows 1–4 are Faster-RCNN, Detr, Yolov8, and ASLE-DM, respectively), and each column represents different scenes (columns 1–6 are scenes 1–6, respectively).
Except that some sandstones show strong scattering properties, the overall characteristics of scene 1 (desert) are similar to the noise background, and most of the moving targets in this scene can be correctly detected. However, some moving targets are submerged in background clutter, which produces a certain number of missing alarms, as shown in Fig. 11(s). In the other scenes, the imaging results of vegetation, artificial buildings, and some strong scattering points resemble the shapes of the targets and bring some missing and false alarms, but the ASLE-DM network can still robustly detect most of the targets under different backgrounds, as shown in Fig. 11(s)–(x). In particular, for the moving targets in scene 3, the SAR imaging result is rotated due to the change of observation angle, while most of the moving targets can still be accurately detected by the ASLE-DM network owing to its strong capability of adaptively modeling input targets with different shapes. Generally, the ASLE-DM network realizes robust detection of the moving targets under different backgrounds with only a certain number of missing and false alarms.
C. Comparative Experiment of Different Methods
In order to further verify the effectiveness of the proposed method, the experimental results of Faster-RCNN [34], Detr [35], Yolov8 [36], and ASLE-DM are quantitatively compared in Table II. Note that frames per second (FPS) is used to measure the processing speed of model prediction. As shown in Table II, the mean average precision (mAP) and F1-score of ASLE-DM are superior to those of the other methods, and its prediction speed is also optimized. By comparison, the ASLE-DM network comprehensively shows better performance than the other methods.
In addition, although Table II reflects the better performance of the ASLE-DM model, further validation is needed to determine whether the proposed method can realize robust detection of moving targets under different backgrounds. Therefore, the moving target simulation dataset under the noise background is fed into the Faster-RCNN, Detr, and Yolov8 networks for training, respectively, and each trained model is used to detect the moving targets in the test sample set. The quantitative comparison of the detection results obtained by the different methods is shown in Table III. Apart from the intelligent target detection algorithms, the echo-based CA-CFAR algorithm [37], [38] is also implemented for comparison. The 2-D cell-averaging CFAR (CA-CFAR) method adopts a sliding window mechanism, which consists of the cell under test (CUT), a guard band, and a training band. When the power of the CUT exceeds the detection threshold derived from the average power of the training band, the CUT is judged as a target.
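A minimal 2-D CA-CFAR sketch along the lines described above; the window sizes and the threshold scaling factor `alpha` are assumptions, not the values used in [37] and [38].

```python
import numpy as np

def ca_cfar_2d(power, guard=2, train=4, alpha=5.0):
    """2-D cell-averaging CFAR: estimate the clutter level from the
    training band around each CUT (excluding the guard band), scale it
    by alpha to form the threshold, and flag cells exceeding it."""
    H, W = power.shape
    r = guard + train
    det = np.zeros_like(power, dtype=bool)
    for i in range(r, H - r):
        for j in range(r, W - r):
            win = power[i - r:i + r + 1, j - r:j + r + 1]
            gsum = power[i - guard:i + guard + 1, j - guard:j + guard + 1].sum()
            n = win.size - (2 * guard + 1) ** 2
            mean_clutter = (win.sum() - gsum) / n     # training-band average
            det[i, j] = power[i, j] > alpha * mean_clutter
    return det
```

The adaptive threshold keeps the false alarm rate roughly constant in homogeneous clutter, but strong inhomogeneous clutter inflates the training-band average and masks nearby targets, which matches the behavior reported for the complex scenes below.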
The simulation test set contains SAR image samples of moving targets under six different backgrounds, and the total number of moving targets to be detected is 5525. The detection threshold (confidence) of each neural network model is set as the same to make the results comparable.
In particular, the recall rate and precision indicators are further calculated to display the comparison results more intuitively. Denoting the numbers of true positives, false negatives, and false positives by TP, FN, and FP, respectively, the recall rate and precision are defined as
\begin{align*}
\text{Recall rate}=&\frac{\text{TP}}{\text{TP}+\text{FN}} \tag{17}\\
\text{Precision}=&\frac{\text{TP}}{\text{TP}+\text{FP}} \tag{18}
\end{align*}
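The two indicators in (17) and (18) follow directly from the detection counts; the TP/FN/FP values below are hypothetical, used only to exercise the formulas.

```python
# Recall rate and precision from (17) and (18).
def recall_precision(tp, fn, fp):
    """Compute recall rate and precision from detection counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return recall, precision

# Hypothetical counts: 90 targets found, 10 missed, 30 false alarms.
r, p = recall_precision(tp=90, fn=10, fp=30)
```

Recall thus penalizes missing alarms while precision penalizes false alarms, which is why both are reported in Table III.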
Faster-RCNN is a typical two-stage object detection algorithm whose detection accuracy is improved by introducing the region proposal network structure. It can correctly detect most of the moving targets in the test set with the highest recall rate, but its detection precision is unacceptably low, which would hinder its operation in practical applications. Detr is an end-to-end target detection model based on the transformer architecture, and it leverages a CNN together with an encoding and decoding structure for feature extraction and inference. However, owing to its relatively complex training process and insufficient feature extraction of small targets, the detection precision of Detr is relatively poor. Yolov8 is a typical one-stage object detection algorithm, which regards target detection as a regression problem and feeds the whole image into the network directly for inference. According to Table III, although the detection performance of Yolov8 is relatively satisfactory with fewer false alarms, it is still inferior to that of the ASLE-DM network. Compared with the ASLE-DM network, the detection performance of CA-CFAR is easily disturbed by complex background clutter, which brings about several missing and false alarms and makes it perform poorly in both recall rate and precision.
Moreover, some detection results of moving targets under different backgrounds based on the neural network models and CA-CFAR are shown in Figs. 11 and 12, respectively. For the detection of moving targets in the desert, the echo signal of some targets is merged in background clutter, which brings several missing alarms to the detection results, as shown in Fig. 11(g), (m), and (s). There still exist a few false alarms in the detection results of Faster-RCNN and Detr, as shown in Fig. 11(a) and (g). For the imaging scene rotated by the change of observation angle, some targets can still be correctly detected by the ASLE-DM network by virtue of its adaptive modeling capability, as shown in Fig. 11(u). In relatively complex backgrounds, the number of missing and false alarms increases in the detection results of the other three neural networks, while better detection results are obtained by the ASLE-DM network, as shown in Fig. 11(b), (h), (n), and (t).
Detection results by CA-CFAR. (Red rectangle represents correctly detected moving target, green ellipse represents missing alarm, and yellow ellipse represents false alarm.) (a)–(f) Scenes 1–6.
It can be seen from Fig. 12 that there are many isolated scattering points in the detection results of CA-CFAR. These can be ignored, since a moving target to be monitored occupies a certain area rather than isolated points, and only a region composed of several connected pixels is judged as a target. Combined with Table III, CA-CFAR can correctly detect some of the moving targets despite the noise interference. However, the detection performance of the CA-CFAR algorithm deteriorates sharply under the disturbance of strong background clutter, and a large number of false alarms and missing alarms are generated, especially under relatively complex backgrounds, as shown in Fig. 12(b)–(f). By comparison, the ASLE-DM network realizes robust detection of the moving targets under different backgrounds overall and shows better detection performance than the other methods, with fewer false and missing alarms.
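To make the baseline concrete, the cell-averaging principle behind CA-CFAR can be sketched as follows. This is a minimal, illustrative 2-D implementation rather than the exact detector used in the experiments; the window sizes, the design false alarm rate, and the toy scene are all assumptions.

```python
import numpy as np

def ca_cfar_2d(img, guard=1, train=2, pfa=1e-3):
    """Minimal 2-D cell-averaging CFAR (illustrative parameters).

    For each cell under test, the noise level is estimated as the mean
    of the training cells in a square window, excluding the guard cells
    around the test cell. The threshold factor follows the classical
    CA-CFAR relation alpha = N * (pfa**(-1/N) - 1) for square-law
    (exponential) clutter.
    """
    h, w = img.shape
    r = guard + train                                   # half-size of the full window
    n_train = (2 * r + 1) ** 2 - (2 * guard + 1) ** 2   # number of training cells
    alpha = n_train * (pfa ** (-1.0 / n_train) - 1.0)
    det = np.zeros_like(img, dtype=bool)
    for i in range(r, h - r):
        for j in range(r, w - r):
            win = img[i - r:i + r + 1, j - r:j + r + 1].sum()
            inner = img[i - guard:i + guard + 1, j - guard:j + guard + 1].sum()
            noise = (win - inner) / n_train             # clutter-level estimate
            det[i, j] = img[i, j] > alpha * noise
    return det

# Toy scene: homogeneous exponential clutter with one strong scatterer.
rng = np.random.default_rng(0)
scene = rng.exponential(1.0, (32, 32))
scene[16, 16] += 50.0
mask = ca_cfar_2d(scene)
```

In homogeneous clutter the strong scatterer is detected, while heterogeneous clutter edges would inflate the noise estimate and cause exactly the missing and false alarms discussed above.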
D. Robust Moving Target Detection Based on Clutter Suppression
Although the ASLE-DM network can correctly detect most of the moving targets under different backgrounds according to the comparative experimental results in Section V-C, a certain number of false alarms and missing alarms still exist in the detection results due to the interference of complex background clutter. In order to further improve the detection performance of the ASLE-DM network, the multichannel clutter suppression technology is introduced to suppress the background clutter significantly. Taking dual-channel DPCA technology (
(a)–(x) Detection results of the targets after clutter suppression by different methods (in ideal conditions). Each row represents different detection methods (rows 1–4 are Faster-RCNN, Detr, Yolov8, and ASLE-DM, respectively), and each column represents different scenes (columns 1–6 are scenes 1–6, respectively).
However, in the actual SAR system, channel errors (such as antenna phase center error, inconsistent responses of the channel patterns, and receiver thermal noise) exist inevitably, which restrict the performance of the clutter suppression technology and degrade the detection accuracy of ground moving targets. In the case of amplitude and phase errors between channels (
Moreover, the corresponding detection results after clutter suppression by the different methods are partially shown in Figs. 14 and 15. Due to the existence of channel error, the background clutter cannot be completely suppressed, and some residual clutter remains in the SAR imaging results. Compared with the detection results in Fig. 11, the false alarms and missing alarms have been significantly mitigated, as shown in Fig. 14(d)–(v). From Fig. 15, it can also be seen that the detection performance of CA-CFAR is improved when compared with Fig. 12. Meanwhile, even with residual clutter in the background, the ASLE-DM network shows better detection performance, with fewer false alarms and missing alarms generated in the detection results, and can effectively realize robust detection of moving targets under different backgrounds.
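The sensitivity of DPCA cancellation to channel error can be illustrated numerically. The following sketch assumes a deliberately simplified signal model (identical clutter in the two time-aligned channels, a single mover with an assumed interferometric phase, and an assumed 5% gain and 2° phase mismatch); it is not the error model of the actual system.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 256
# Stationary clutter, identical in both channels after DPCA time alignment.
clutter = rng.normal(0, 1, n) + 1j * rng.normal(0, 1, n)
# Moving target with an assumed along-track interferometric phase phi.
phi = np.pi / 3
target = 0.5 * np.exp(1j * 2 * np.pi * 0.1 * np.arange(n) / n)

ch1 = clutter + target
ch2 = clutter + target * np.exp(-1j * phi)   # the mover picks up phase phi

# Ideal DPCA: channel subtraction cancels the stationary clutter exactly,
# while the mover survives with gain |1 - exp(-j*phi)|.
ideal = ch1 - ch2
clutter_residual_ideal = np.mean(np.abs(clutter - clutter) ** 2)

# With channel error (5% gain mismatch, 2 deg phase error), clutter leaks
# through with power |1 - err|^2 times the clutter power.
err = 1.05 * np.exp(1j * np.deg2rad(2.0))
with_error = ch1 - err * ch2
clutter_residual_err = np.mean(np.abs(clutter - err * clutter) ** 2)
```

Even this few-percent mismatch leaves a nonzero clutter floor after subtraction, which is why the detection results with channel error still contain residual clutter.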
(a)–(x) Detection results of the targets after clutter suppression by different methods (with channel error). Each row represents different detection methods (rows 1–4 are Faster-RCNN, Detr, Yolov8, and ASLE-DM, respectively), and each column represents different scenes (columns 1–6 are scenes 1–6, respectively).
Detection results of CA-CFAR based on clutter suppression. (Red rectangle represents correctly detected moving target, green ellipse represents missing alarm, and yellow ellipse represents false alarm.) (a)–(f) Scenes 1–6.
As to the intelligent target detection algorithms, a further study of the detection results at different confidence levels is conducted as a supplement to demonstrate the impact of the detection threshold on the number of missing and false alarms. The detection threshold (confidence) is set to 0.65, 0.7, 0.75, 0.8, and 0.85, respectively, and the relevant experimental results obtained by the different detection methods are calculated, as shown in Fig. 16.
Calculation of detection results at different confidence levels. (a) Number of missing alarms. (b) Number of false alarms. (The blue, brown, red, and yellow lines correspond to ASLE-DM, Faster-RCNN, Detr, and Yolov8, respectively.)
The total numbers of missing and false alarms in the detection results obtained by the different methods are compared in Fig. 16(a) and (b), respectively. As the detection threshold increases, the number of missing alarms shows an upward trend, while the number of false alarms declines as a whole. From Fig. 16(a), it is not difficult to see that the number of missing alarms obtained by ASLE-DM is smaller than those of the other methods. In particular, although Faster-RCNN shows slightly better performance than the proposed method in missing alarms, many more false alarms are produced by Faster-RCNN according to Fig. 16(b), which is unacceptable in practical applications. Moreover, from Fig. 16(b), the false alarm results of ASLE-DM are mostly better than those of the other methods. Although Yolov8 shows similar performance to the proposed method in false alarms, its number of missing alarms increases sharply in Fig. 16(a). Therefore, it is well verified that the proposed method shows detection performance superior to the other methods mentioned in this article.
E. Ablation Study of the Spatial Deformable Module
In order to further verify the effectiveness of the spatial deformable module, comparative experiments with different network architectures are conducted, and the relevant results are shown in Table V. All of the experiments are conducted on an NVIDIA GeForce RTX 3090 GPU, the deep learning framework is PyTorch (version 1.10.0), and the training process consists of 200 epochs.
The ablation study of the deformable module is conducted by using Yolov5s and Faster-RCNN as baseline networks, in which CSPDarkNet53 and ResNet-50 are used as the backbone architectures, respectively. On this basis, a further experiment applying each baseline architecture with three deformable layers to the simulation datasets is conducted for comparison.
From Table V, it is verified that the proposed architecture is superior to the other methods and thus shows better detection performance on the test sample sets. The deformable module can effectively increase the mAP value of the model, and the prediction speed and training time of the model are also optimized. As described in Section III, the spatial deformable module can adaptively adjust the sampling positions of the feature maps and fully preserve the high-level semantic information of the ROI in SAR images during feature extraction, which significantly improves the final detection results of the moving targets.
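The offset-based sampling at the heart of a deformable layer can be sketched in a few lines. This is a single-channel, single-position illustration of deformable sampling with bilinear interpolation, not the actual module of the ASLE-DM network; the image, kernel, and offsets are toy assumptions. With zero offsets it reduces to an ordinary convolution tap.

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinearly sample img at a fractional location (y, x), zero padding."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < h and 0 <= xx < w:
                val += (1 - abs(y - yy)) * (1 - abs(x - xx)) * img[yy, xx]
    return val

def deformable_response(img, kernel, offsets, cy, cx):
    """Correlate a k*k kernel at (cy, cx) with a per-tap offset.

    offsets[tap] = (dy, dx) shifts each tap of the regular sampling grid,
    as in deformable convolution; all-zero offsets give the plain response.
    """
    k = kernel.shape[0]
    r = k // 2
    out, idx = 0.0, 0
    for i in range(-r, r + 1):
        for j in range(-r, r + 1):
            dy, dx = offsets[idx]
            out += kernel[i + r, j + r] * bilinear(img, cy + i + dy, cx + j + dx)
            idx += 1
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0
# With zero offsets this equals the plain 3x3 average around (2, 2).
plain = deformable_response(img, kernel, [(0.0, 0.0)] * 9, 2, 2)
```

In an actual deformable layer the offsets are predicted per position by a small convolution and learned end to end, which is what lets the sampling grid adapt to rotated or deformed target signatures.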
In addition, further comparative experiments of the network architectures are performed on the detection results of the simulation test sample sets, and the relevant results are recorded in Table VI.
As given in Table VI, when the trained network models are used to detect the SAR imaging results after clutter suppression in the test sample sets, the recall rate and precision of the detection results are effectively increased due to the application of the spatial deformable module. By comparison, it is verified that the proposed method can effectively improve the detection performance of the model by virtue of its more powerful feature representation ability and can further realize robust detection of the moving targets under different backgrounds.
Moreover, it is noted that although the introduction of the deformable module can effectively optimize the detection performance of the model, with more correctly detected targets and fewer false alarms, it also brings more parameters and raises the computational complexity of the model, which significantly increases the training time. In particular, during the experiments, it is found that the training results and detection performance of the model no longer improve remarkably when additional network layers beyond three are substituted by the deformable module, while the training time increases significantly instead. Therefore, the number of network layers substituted by the deformable module is a compromise between the computational complexity and the detection performance of the model.
Conclusion
In this article, a new method of SAR ground moving target detection is proposed by associating the ASLE-DM network with traditional clutter suppression technology, focusing on optimizing the detection performance for moving targets in varying SAR scenes toward practical applications. Combined with the SAR imaging algorithm, the simulation sample set is constructed by traversing the position parameters of the moving targets, which tackles the problem of insufficient measured data in deep-learning-based SAR-GMTI. In order to enhance the detection performance for polymorphous input targets, the spatial deformable module is applied to make the network capable of modeling various geometric transformations. Moreover, the multichannel clutter suppression technology is used to suppress the background clutter and thereby realize robust detection of the moving targets under different backgrounds. The effectiveness and robustness of the proposed method are further verified by comparison with other detection methods. Finally, it is acknowledged that although we have pursued diversity and representativeness in the simulation data, the experimental sample data cannot cover all SAR scenes, which means that the proposed method still needs to be replenished and improved continuously as diverse SAR sample data accumulate.