Prior-Guided Data Augmentation for Infrared Small Target Detection

Recently, many deep learning (DL) methods have been proposed for infrared small target detection (ISTD). A DL-based model for the ISTD task needs large amounts of samples. However, the diversity of existing ISTD datasets is not sufficient to train a DL model with good generalization. To solve this issue, a data augmentation method called prior-guided data augmentation (PGDA) is proposed to expand the diversity of training samples indirectly, without additional training data. Specifically, it decouples the target description and localization abilities by preserving the scale distribution and physical characteristics of targets. Furthermore, a multiscene infrared small target dataset (MSISTD), consisting of 1077 images with 1343 instances, is constructed. The numbers of images and instances in MSISTD are 2.4 times and 2.5 times those of the existing largest real ISTD dataset, the single-frame infrared small target (SIRST) benchmark, respectively. Extensive experiments on the SIRST dataset and the constructed MSISTD dataset illustrate that the proposed PGDA improves the performance of existing DL-based ISTD methods without extra model complexity burdens. In comparison with SIRST, MSISTD has been evaluated as a more comprehensive and accurate benchmark for ISTD tasks.


I. INTRODUCTION
In the past few years, single-frame infrared small target detection has received increasing attention. Compared with other sensing modalities (active imaging radar, visible imaging), infrared imaging has the following characteristics: anti-interference, excellent portability [1], strong cloud penetration, blind spot detection, etc. For these reasons, infrared small target detection (ISTD) has been widely used in military applications such as early warning, missile tracking systems [2], precision guidance, and maritime surveillance. However, infrared sensors usually operate at a long distance from the target, so the target usually occupies only a small part of the image; in extreme cases, targets are merely Gaussian point-like. Other factors, such as complex backgrounds, flash noise, atmospheric scattering, and optical defocus, also lead to an ultralow signal-to-noise ratio (SNR) [3] and weak texture features in infrared images. In addition, the infrared imaging platform and the small target move at high speed, which makes it difficult to acquire a sufficient number of sequential frames. All of the above factors greatly reduce the performance of temporal recognition.
Single-frame ISTD methods are roughly classified into two categories: data-driven and model-driven methods. Model-driven algorithms are divided into four categories: filter-based methods [4], [5], [6], local contrast-based methods [7], [8], [9], [10], low-information-based methods [11], [12], [13], and data structure-based methods [14], [15], [16], [17]. The filter-based methods are divided into spatial-domain and frequency-domain approaches; they are widely used because of their small number of hyperparameters. The data structure-based approaches model the context and the target separately with different data structures, mainly including subspace structure-based, dictionary-based, and tensor representation-based approaches. However, the effectiveness of model-driven approaches depends heavily on the specific scenario and manually tuned hyperparameters. When the domain of the dataset varies too much, the performance of the above algorithms drops dramatically.
The current DL-based approaches are divided into five categories: optimization approaches for specific criteria, GAN-based methods, transformer-based methods, global and local feature fusion methods, and feature fusion methods based on specific attention modules. McIntosh et al. [18] developed a target-to-clutter ratio detection network (TCRNet) by optimizing the TCR criterion to prioritize the representation of the infrared small object. GAN-based methods [2], [19] were proposed to focus on essential features of ISTD and predict the intensity of targets. Targets are identified and reconstructed by the generator according to the data distribution, while discriminators are constructed to improve the data-fitting ability of the generator. The Vision Transformer [20] and Swin Transformer [21] have been applied to many downstream tasks [22] in the vision domain. Liu et al. [23] used only a simple ViT architecture to model foreground and background. Global and local feature fusion-based methods usually address the extreme class imbalance between target and background pixels. They typically use a global attention module and a local attention module [24], [25] to extract global and local information, respectively. Coarse-grained features extracted by the large-scale branch and fine-grained features extracted by the local branches are fused through a bottom-up structure. Feature-fusion-based methods [26], [27], [28], [29], [30] usually design various feature enhancement modules manually. These plug-and-play components make the features of small targets more prominent while effectively suppressing the background, but they greatly increase model complexity.
On the one hand, most of these models are evaluated on multiframe small target datasets or single-frame datasets with prominent targets. There is no comprehensive benchmark that uses these datasets to evaluate the performance of individual models, and there is no apparent performance gap between model-driven and data-driven methods on them. On the other hand, existing models improve certain evaluation metrics by increasing the parameters and complexity of the models, while few of them consider enhancing models through data augmentation (DA). In addition, generic DA algorithms are designed for mainstream datasets and are not optimized specifically for ISTD. Moreover, the existing DA algorithms change the scale distribution and grayscale intensity distribution of small targets. Representative DA methods such as mosaic augmentation [31], cut-mix [32], and mixup [33] mix the information contained in different images with appropriate changes to the ground truth labels. In practice, these methods often leave the performance of the model unchanged or even degrade it. Therefore, a dataset is constructed to properly evaluate the gap between model-driven and data-driven methods and to allow a more accurate assessment of the performance of existing DL methods. A DA strategy is also proposed to address the limitations of generic DA algorithms; it improves the performance of DL-based methods without increasing the algorithmic complexity of the model or introducing additional data. To this end, the contributions of this work are as follows.
1) A prior-guided data augmentation (PGDA) method is proposed to generate more training samples for infrared small target detection. It introduces the prior scale distribution and physical representation of targets to keep the statistics of the original and generated samples consistent. In addition, PGDA decouples the description ability and the localization ability of the target, which enhances the performance of DL methods.

2) A multiple scenes infrared small target dataset (MSISTD) is constructed to improve the diversity of ISTD datasets and the generalization of DL methods. It extends the original 427-image SIRST dataset to 1077 images and adds high-quality annotations. It involves more diverse backgrounds, richer target shapes, and less prominent samples.

3) Extensive experiments on the existing SIRST [30] and the constructed MSISTD illustrate that the proposed PGDA allows DL methods to obtain improvements in both IoU and nIoU metrics without extra model complexity burdens. The strategy outperforms other methods. In addition, it is sufficiently robust and effective for the ISTD problem when compared with generic data augmentation methods and small-sample training.

The rest of this article is organized as follows. In Section II, we describe the framework of PGDA as a whole and its two components. In Section III, the MSISTD dataset is constructed and elaborated. In Section IV, extensive comparison experiments and ablation experiments are presented. Section V concludes this article.

A. Data Augmentation for ISTD
Previous work [34] concentrates on GAN-based synthetic data augmentation. It achieves image-to-image translation to produce synthetic background images from visible spectrum images. By means of the intensity modulation of the GAN, targets are rendered on the synthetic images, which improves most DL methods. However, it requires a large number of visible and infrared image pairs to train the modality transfer network. In addition, the training of GAN networks incurs a large amount of computational complexity and model burden. Moreover, the synthesized background images require manual selection and extensive hyperparameter settings.
In fact, there are large sets of IR background data that do not require synthesis, so modality transfer from visible to infrared images is not necessary. Furthermore, if the intensity modulation is removed, the method degenerates into a regular copy-paste approach. Copy-paste methods are simple ways to combine information from multiple images in an object-aware manner by copying instances of objects from one image and pasting them onto another. Depending on the mainstream task, they range from copying the exact pixels of the object in semantic segmentation to copying the pixels within the bounding box in object detection. Google's copy-paste [35] did not model the surrounding visual context, instead combining large-scale jittering and self-training. Chen et al. [36] introduced AdaResampling, which pretrained a semantic segmentation network and used median filtering to select the position of the object. Instaboost [37] simply introduced random jittering to objects, which could improve the performance of Mask R-CNN [38]. Cut-paste-and-learn [39] proposed a simple way to generate large datasets by cutting object instances and pasting them onto real backgrounds, and models were trained on the mixed dataset. Small target augmentation [40] used oversampling and enhancement techniques to solve the problem of sample imbalance.
Most of the abovementioned methods focus on performance on generic datasets and are not specifically designed for small IR targets. In practice, copy-paste augmentation works poorly on ISTD problems because the task is sensitive to the scale information and physical representation of targets. The proposed PGDA follows the above paradigm with some differences. 1) We dynamically adjust the grayscale intensity and area ratio of the pasted instances after introducing prior knowledge of the training set. 2) We perform multiscale random cropping on the original image to obtain subfigures, which replace the original image as the background where the instances are pasted. 3) We employ nonrigid transforms to make the shapes of targets more diverse. On the one hand, the proposed PGDA makes more detailed optimizations for the above problems so that the performance of copy-paste is sufficiently improved. On the other hand, compared with the GAN-based methods, the proposed method improves the performance of existing DL methods without extra model complexity burdens or large numbers of visible-infrared image pairs.

B. Datasets for ISTD
Model-driven algorithms and most DL algorithms train their networks on self-built datasets. These datasets are not sufficient for training a network with good generalization. Only a small minority of them are available, such as SIRST [30] and MFSIRST [19]. These datasets facilitate the ISTD task, where correlated algorithms can be compared. However, the following limitations prevent them from being used as a comprehensive and accurate benchmark.
The current ISTD datasets have few real images, and most of their images are synthetic. Except for SIRST [30], the existing ISTD datasets use image synthesis methods to expand the number of samples. Zhang et al. [29] rescaled the SIRST dataset to a fixed size and then cropped out four subfigures from each image. Through affine transformation, a training set including 8525 images and a testing set with 545 images were finally generated. Wang et al. [19] collected shape, texture, and background information to generate 10 000 figures. However, a great many generated images suffered from missing pixels after the affine transformation, so the physical representation of the small targets already differs from that of real ones. These low-resolution and pixel-missing problems lead to application bottlenecks in many ISTD tasks.

III. PROPOSED PGDA STRATEGY
The proposed PGDA is shown in Fig. 2, and mainly consists of two parts: the target enhancement module and the subfigure-split module. In this section, we first introduce the overall framework. Then, we present the details of the subfigure-split module and the target enhancement module.

A. PGDA Framework
The proposed PGDA generally obeys the following paradigm: 1) selecting a set of instance images and background images; 2) cropping the instances and performing a series of data augmentation processes on them; 3) selecting the locations and background where the instances are pasted; and 4) fusing the instances and the background to generate a new image. In the proposed strategy, the target enhancement consists of Gaussian paste and signal-to-clutter ratio (SCR) perception. SCR perception is applied after Gaussian weighting to make the pasted targets match the physical representation of real targets. Subfigure-split performs multiple multiscale random crops of the background and adjusts the resolution of the background according to the scale of the target. PGDA decouples the ability to describe the target from the ability to localize the target, which overcomes the limitations of general-purpose DA algorithms for the ISTD problem.
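As an illustration, the four steps of this paradigm can be sketched as a simple generation loop (a minimal Python sketch; the function and variable names are ours, not from the original implementation):

```python
import random

def pgda_generate(targets, backgrounds, augment, choose_location, fuse, n_images):
    """Sketch of the PGDA paradigm: select, augment, place, and fuse."""
    generated = []
    for _ in range(n_images):
        t = random.choice(targets)            # 1) select an instance image
        t = augment(t)                        # 2) augment the cropped instance
        bg = random.choice(backgrounds)       # 3) select a background ...
        y, x = choose_location(bg, t)         #    ... and a paste location
        generated.append(fuse(bg, t, y, x))   # 4) fuse into a new image
    return generated
```

In the actual strategy, `augment` would contain the target enhancement steps (Gaussian paste and SCR perception) and `backgrounds` would be the subfigure set produced by subfigure-split.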
Before starting the whole algorithm, we calculate the target scale distribution and the SCR distribution of the training set and integrate them into the algorithm as prior knowledge. The area ratio is used to measure the scale characteristic of the target. It is defined as the ratio between the total number of pixels occupied by a small target and the total number of pixels of the whole background:

AR = Pixel_T / Pixel_B    (1)

where Pixel_T represents the total number of pixels of the target, and Pixel_B represents the total number of pixels of the whole background. We calculate the area ratio values for all small targets in the training set and collect all values below 0.15%. This set of area ratios is denoted as A.
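The area ratio prior in (1) is straightforward to compute from binary target masks. The sketch below (our own naming, assuming NumPy masks with one target per mask) collects the prior set A from a training set:

```python
import numpy as np

def area_ratio(mask):
    """Area ratio of Eq. (1): Pixel_T / Pixel_B."""
    pixel_t = int(np.count_nonzero(mask))   # Pixel_T: pixels belonging to the target
    pixel_b = mask.size                     # Pixel_B: all pixels of the image
    return pixel_t / pixel_b

def prior_scale_distribution(masks, threshold=0.0015):
    """Collect the prior set A: all area ratios in the training set below 0.15%."""
    return [r for r in (area_ratio(m) for m in masks) if r < threshold]
```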
The signal-to-clutter ratio (SCR) is used to describe how difficult it is to detect small targets in infrared images [8], [41], [42]. The larger the SCR value, the easier the target is to detect. SCR is mathematically defined as

SCR = |μ_t − μ_b| / σ_b    (2)

where μ_t represents the mean value of the target pixels, and μ_b and σ_b represent the mean pixel value and the standard deviation of the neighboring region, respectively. We set the radius d of the small target neighborhood to 20. The visualization of the SCR calculation process is shown in Fig. 1. We calculate the SCR values for all small targets. To eliminate outliers that are too large or too small, only the SCR values between 20% and 80% are retained after applying a data binning algorithm. We denote this set of SCR values as S. After collecting the above two distributions, we crop out all the small targets according to the labeled masks and define this set of small target images as T. Then, we utilize the subfigure-split module to generate a candidate background set, which is defined as B. Next, we randomly select a target and a subfigure from T and B, respectively. The paste location of the target is also selected randomly. Combining the SCR distribution S and the scale distribution A, PGDA dynamically adjusts the relative grayscale intensity difference and the scale difference between the target and the background image. Finally, we paste the target at the selected position in the background image to achieve spatial reorganization.
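Equation (2) can be computed per target from an image and its mask. The sketch below is our own implementation, which approximates the neighborhood by dilating the target's bounding box by d pixels on each side and excluding the target pixels:

```python
import numpy as np

def scr(image, mask, d=20):
    """SCR of Eq. (2): |mu_t - mu_b| / sigma_b, with background statistics
    taken from a neighborhood of radius d around the target."""
    ys, xs = np.nonzero(mask)
    mu_t = image[ys, xs].mean()                                   # target mean
    y0, y1 = max(ys.min() - d, 0), min(ys.max() + d + 1, image.shape[0])
    x0, x1 = max(xs.min() - d, 0), min(xs.max() + d + 1, image.shape[1])
    neigh = np.zeros(image.shape, dtype=bool)
    neigh[y0:y1, x0:x1] = True
    neigh &= ~mask.astype(bool)                                   # exclude target pixels
    mu_b = image[neigh].mean()                                    # background mean
    sigma_b = image[neigh].std()                                  # background std
    return abs(mu_t - mu_b) / (sigma_b + 1e-8)
```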

B. Subfigure-Split
By convention, targets occupying less than 0.15% of the image are defined as small targets in the ISTD field. However, generic DA algorithms do not take into account the scale distribution properties of small targets in ISTD. In addition, the contradiction between the low signal-to-noise ratio of low-resolution images and small-scale targets hinders the effectiveness of generic DA algorithms. Compared with copy-paste [35], the existing data are insufficient to train a supervised model with PGDA to generate pseudolabels on unlabeled data. To enhance the diversity of the dataset, subfigure-split is designed to depend solely on the training set rather than introduce additional data, and to solve the above drawbacks.
Subfigure-split is employed in two ways. The first is to use the subfigure-cutting algorithm from the remote sensing field to obtain different images by controlling the resolution of subfigures and the overlapping area between adjacent subfigures. The second is to perform multiple multiscale random crops on a single image to obtain multiple images. The second way is illustrated in Fig. 2. All the images and masks in the training set are processed by the subfigure segmentation algorithm to obtain a large number of background images. The set of background images is defined as B, where B is subdivided into background images with targets present and pure background images, represented as B_T and B_O, respectively. In the subsequent processing, we increase the resolution of all subfigures using interpolation for the later pasting steps. After selecting a pair of target and background, we calculate the corresponding area ratio and check whether the value lies within the scale distribution. If it does not, we randomly assign a new area ratio value β according to A. We then jitter this area ratio by changing the object size by ±10%. According to this value, we resize the background to the corresponding resolution.
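A minimal sketch of the two core operations, multiscale random cropping and resolution adjustment for a newly assigned area ratio β, could look as follows (the names and the square-background simplification are our own assumptions, not the paper's exact procedure):

```python
import random

def multiscale_random_crop(image, n_crops=4, scale_range=(0.3, 0.8)):
    """Sample several subfigures of random scale from one background image."""
    h, w = image.shape[:2]
    subfigures = []
    for _ in range(n_crops):
        s = random.uniform(*scale_range)                 # random crop scale
        ch, cw = max(1, int(h * s)), max(1, int(w * s))  # crop height and width
        y, x = random.randint(0, h - ch), random.randint(0, w - cw)
        subfigures.append(image[y:y + ch, x:x + cw])
    return subfigures

def background_side_for_ratio(pixel_t, beta):
    """Side of a square background such that pixel_t / side^2 is close to the
    newly assigned area ratio beta."""
    return int(round((pixel_t / beta) ** 0.5))
```

For example, a 16-pixel target assigned β = 0.15% calls for a background of roughly 103 × 103 pixels, which is the resolution the subfigure would be resized to before pasting.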
To verify whether the algorithm preserves the scale distribution, subfigure-split is employed to generate small target images at different synthesis ratios. In Fig. 3, the histogram clearly indicates that the scale distribution of the images generated by the augmentation algorithm is consistent with the distribution of the original dataset. Also, the distribution of the number of targets per image is constant, since the small targets are pasted onto the subfigures. Two components play a key role in achieving this outcome. First, in the subfigure-split module, we give preference to pure background images B_O among all the cropped images, because random cropping actually enlarges subregions of the image; if there were already small targets in the background, this would change the scale distribution of the generated dataset. Second, the area ratio of each target is dynamically allocated before fusing the target and background to ensure the approximate invariance of the scale distribution.
Subfigure-split greatly improves attention to subregions through multiscale random cropping. Moreover, the decoupling and reorganization of target and background space enhances the diversity of the data and effectively alleviates the overfitting problem of DL methods.

C. Target Enhancement
The ISTD problem differs from the detection of common objects in two important properties. The first is that the target is characterized as Gaussian point-like or has pronounced Gaussian edge attenuation. The second is that, due to the infrared imaging mechanism, the temperature of the target is higher than the temperature of the surrounding environment. That is why ISTD is widely used in areas such as missile warning, camouflage identification, and ultralong-range target tracking. However, most of the existing methods for fusing backgrounds and infrared small targets fail to consider these characteristics of ISTD, so the representations of the generated images differ from those of real images. In Fig. 4, we list two typical problems of current small target image fusion algorithms.
The first problem is the incorrect Gaussian grayscale representation of the target. Due to the low-resolution characteristics of the infrared target, some of the images do not clearly characterize the full Gaussian characteristics. Also, the cropping algorithm loses a portion of the target's edges, where the Gaussian attenuation is most apparent. For these two reasons, the generated target will have only one grayscale intensity. In Fig. 4, the specific phenomenon is also illustrated.
The second problem is the background migration of the target. Since the infrared imaging mechanism is based on the temperature differences between objects, the gray intensity of the cropped target is determined by its original background. When it is pasted onto another image, the grayscale intensity of the target may be weaker than that of the new background. In Fig. 5, we list two typical examples. In the first, the overall intensity of the target is much lower than the background intensity, which results in exactly opposite temperature representations of the target and the background. In the second, the central gray intensity of the target matches the gray intensity of the background, but the Gaussian edge attenuation makes the outermost ring of the small target darker than the background. These cases become very prominent when the background contains clouds, buildings, forests, etc.

Fig. 2. Overview of the proposed PGDA. We crop all the instances in the training set and randomly select several small targets. Then, we perform a series of preprocessing operations, including Gaussian paste and SCR perception, on them. We also randomly select a subfigure and place the enhanced targets at the selected positions on it. Finally, the targets and background images are fused.
The target enhancement module is designed mainly for the above two cases and for the data augmentation of the cropped small targets T. It consists of three main parts: geometric transformations, nonrigid transformations, and intensity adjustment. Geometric transformations include common horizontal and vertical flips, mirror images, affine transformations, etc. Elastic transform [43], grid distortion [44], and piecewise affine [45] are applied in the nonrigid transformation part; they are employed to make the shapes of small targets more diverse. The intensity adjustment consists of two modules: Gaussian paste and SCR perception. Gaussian paste obtains a weight coefficient map by applying 2-D Gaussian blur to the mask; the pixels in the original small target image are attenuated according to the weights of this map. We explain why the Gaussian paste method is needed later in this article. SCR perception is an algorithm that uses prior knowledge to adjust the grayscale intensity of a target; its specific process is described in Algorithm 1. Before pasting the small target, the SCR value of the neighborhood of the paste position is calculated to sense the grayscale difference between foreground and background and dynamically adjust the grayscale intensity of the target. When the SCR value falls within the range of S, the background and the target are fused directly. Otherwise, the foreground is enhanced or diminished by randomly sampling an SCR value from S and solving backward for the required mean gray intensity of the foreground according to (2).
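The two intensity adjustment steps can be sketched as follows: the Gaussian-blurred mask serves as the weight map for Gaussian paste, and SCR perception rescales the target so that its mean satisfies (2) for a sampled SCR value (a NumPy-only sketch with our own naming; Algorithm 1 in the paper may differ in detail):

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_paste_weights(mask, sigma=1.5):
    """Gaussian paste: blur the binary mask into a weight-coefficient map so
    the pasted target decays toward its edges like a 2-D Gaussian."""
    k = gaussian_kernel1d(sigma, int(3 * sigma))
    m = mask.astype(np.float32)
    # separable 2-D blur via two 1-D convolutions
    m = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, m)
    m = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, m)
    return m / (m.max() + 1e-8)

def scr_perception(target_patch, weights, mu_b, sigma_b, scr_target):
    """SCR perception: solve (2) backward for the foreground mean,
    mu_t = mu_b + scr_target * sigma_b, and rescale the target accordingly,
    so the target always stays brighter than the background."""
    cur = target_patch[weights > 0].mean()
    scaled = target_patch * ((mu_b + scr_target * sigma_b) / (cur + 1e-8))
    return np.clip(scaled, 0, 255)
```

Fusion then alpha-blends the rescaled target onto the subfigure using the weight map: `weights * target + (1 - weights) * background`.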
Gaussian paste overcomes the problems caused by low resolution and cropping algorithms by forcing the generated target to obey a Gaussian distribution, which satisfies the first property of ISTD. SCR perception calculates the intensity difference and stability difference before the target and background are fused. By introducing prior knowledge of the training set and random intensity variations, the small target retains some flexibility while maintaining the SCR distribution of the original dataset. More importantly, it ensures that the gray intensity of the target is always higher than that of the background, which satisfies the second property of ISTD. In Fig. 6, we list some representative generated images and visualize the corresponding heatmaps to observe the Gaussian representations and grayscale intensities of small targets.

A. Image Collection and Overview
The constructed MSISTD dataset has 1077 images, including a total of 1343 instances. As shown in Table I, it expands the number of images of the original SIRST by 2.4 times and the number of instances by 2.5 times. In addition, it greatly expands the context of single-frame small target detection from the detection of aerial targets only to the detection of multiscene, multiscale, and lower SCR targets. It includes pure sky backgrounds, dense cloud sky backgrounds, low signal-to-noise ratio sky backgrounds, sea-land interface backgrounds, woodland sky backgrounds, urban sky backgrounds, urban building backgrounds, forest-to-ground backgrounds, inland river scenes, and other scenes. Several representative figures are shown in Fig. 7. The dataset includes 427 images from the original SIRST dataset [30], 100 single-frame real images from the dataset of [19], and 650 high-resolution single-frame images with high-quality annotations.
The MSISTD dataset is collected by acquiring a large number of infrared sequence images and then screening those that meet the definition of small targets for annotation and processing. For this purpose, we process over 1200 sequences of video data [46], [47], [48]. A large portion of the images were also obtained by actual filming. Because small targets usually remain at a long distance from the infrared camera, and because of several limitations of infrared imaging, small targets are characterized by two important properties: extremely small scale and extremely low SCR. As a result, human annotators cannot perceive and distinguish the types of targets, which leaves the MSISTD dataset with only two categories: foreground and background. We provide annotations in semantic segmentation form, together with transformation scripts for bounding box and key point annotations. The MSISTD is currently publicly available at https://github.com/Crescent-Ao/MSISTD.

Fig. 8. The number of small targets whose area ratio is above 0.15% in the SIRST and MSISTD datasets is too small to show in the distribution; therefore, the maximum value of the x-axis of the histogram is limited to 0.15%.

B. Dataset Statistics and Comparison
We perform the corresponding visualization analysis of the scale and SCR characteristics of the dataset in Fig. 8. The average area ratio of SIRST is 0.048%, while that of the proposed MSISTD is 0.069%. The constructed MSISTD contains more small targets with different shapes to increase the diversity of the dataset, which leads to an increase in the average area ratio. In the two distribution plots, the MSISTD scale distribution trend is in line with the original SIRST distribution trend, while the overall number of small IR targets has increased substantially. In terms of SCR metrics, the mean SCR in the MSISTD dataset is 8.94, while the corresponding value in the original dataset is 12.03. This can increase the detection ability of DL methods under extreme conditions. The visualization of the distribution shows that MSISTD greatly increases the number of small targets with low SCR values while maintaining the original scale distribution. This also indicates the diversity of scenarios in the dataset, which greatly increases the detection difficulty. We similarly count the resolution distribution of the dataset. In the MSISTD dataset, the average length of images is 377 pixels and the average width is 286 pixels, while those of the SIRST dataset are 303 and 220, respectively. The higher resolution allows DL methods to extract more effective semantic information from the dataset.
In addition, the current ISTD datasets have limited diversity of scenarios and targets. They mostly focus on two kinds of scenes: dense clouds and different sky backgrounds. The pixel intensities in these scenes vary smoothly without flash noise, so DL methods easily overfit on these datasets. Moreover, the targets in these datasets mostly appear as salient Gaussian point-like targets. Monotonous backgrounds and salient targets make DL methods prone to overfitting; when scenes and target types are switched, DL methods perform worse than model-driven methods. In this article, a multiple scenes infrared small target dataset (MSISTD) is constructed to improve the diversity and quantity of the ISTD dataset. It covers many different scenes and backgrounds and contains different types of targets, including drones, people, vessels, helicopters, and so on.
In summary, the MSISTD dataset significantly increases the number of instances while maintaining the scale distribution of the original dataset. Moreover, the SCR distribution of the dataset leans more toward low SCR because of its background diversity, which means greater detection difficulty. The MSISTD dataset properly evaluates the gap between model-driven and data-driven approaches and allows a more accurate assessment of the performance of existing deep learning methods. More importantly, the overall image resolution of the dataset is substantially improved, which allows it to be widely used by deep learning algorithms.

A. Dataset and Evaluation Metrics
The proposed DA strategy targets both DL methods and model-driven methods, so we evaluate it on two datasets, i.e., the SIRST dataset and the MSISTD dataset, using different individual models. Since too few test images may cause low confidence in the model predictions, we adjust the ratio of training and testing samples on the SIRST dataset for the experiments. All single-frame small target datasets are split into 50% for training and 50% for testing. Note that, since MSISTD includes SIRST, most of the comparison experiments are conducted on MSISTD.
The MSISTD uses ten metrics to evaluate deep learning models. We use intersection over union (IoU) and normalized intersection over union (nIoU) to evaluate shape description ability:

IoU = TP / (T + P − TP),    nIoU = (1/N) Σ_{i=1}^{N} TP_i / (T_i + P_i − TP_i)

where T_i, P_i, TP_i, and N stand for the label pixel numbers, predicted pixel numbers, true positive pixel numbers, and the number of samples, respectively. In addition, we use three common metrics, Precision, Recall, and F1-measure, to evaluate the overall performance of the model. We utilize the AUC metric instead of the ROC curve used by traditional algorithms, since most deep learning algorithms tend to have lower threshold points and exhibit little variance in the ROC curve. Finally, P_d and F_a are used to evaluate localization ability:

P_d = T_correct / T_All,    F_a = P_false / P_All

where T_correct and T_All represent the numbers of correctly predicted targets and all targets, and P_false and P_All represent the numbers of falsely predicted pixels and all image pixels, respectively. If the centroid deviation of the target is less than three pixels, the target is considered correctly predicted. Otherwise, it is counted as a false alarm.
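The IoU, nIoU, P_d, and F_a metrics described above can be sketched as follows (our own simplified implementation; the benchmark code may differ, e.g., in how predicted targets are matched to labels):

```python
import numpy as np

def iou(pred, label):
    """Pixel-level intersection over union between binary masks."""
    tp = np.logical_and(pred, label).sum()       # TP
    union = np.logical_or(pred, label).sum()     # T + P - TP
    return tp / max(union, 1)

def niou(preds, labels):
    """nIoU: the per-sample IoU averaged over the N test samples."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, labels)]))

def pd_fa(pred_centroids, label_centroids, total_pixels, tol=3.0):
    """Pd and Fa at the target level: a prediction whose centroid lies within
    `tol` pixels of a label centroid counts as correct; the pixels of unmatched
    predictions count as false alarms."""
    correct, false_pixels = 0, 0
    for (py, px, n_pix) in pred_centroids:       # (centroid_y, centroid_x, #pixels)
        hit = any(((py - ly) ** 2 + (px - lx) ** 2) ** 0.5 < tol
                  for (ly, lx) in label_centroids)
        if hit:
            correct += 1
        else:
            false_pixels += n_pix
    pd = correct / max(len(label_centroids), 1)  # T_correct / T_All
    fa = false_pixels / total_pixels             # P_false / P_All
    return pd, fa
```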

B. Implementation Details and Comparison Methods
The models chosen for the experiments are mainly DL models. Comparative experiments and ablation experiments are conducted on the SIRST and MSISTD datasets, and small-sample training experiments are conducted as well. The former verify the effectiveness and robustness of the proposed DA strategy, and the latter demonstrate that PGDA also works in extreme cases.
Different public models use different optimizers and learning rate schedules involving many hyperparameters, so we adopt the settings reported in the corresponding papers. The experiments cover both data-driven and model-driven models: the former verify the effectiveness of the proposed PGDA, and the latter demonstrate the detection difficulty of the MSISTD dataset.

C. Experimental Results
We conduct comparisons on the SIRST and MSISTD datasets, respectively. Our metrics on SIRST are consistent with the original papers; only two metrics (IoU and nIoU) are selected. On the SIRST dataset, as reported in Table II, all models show corresponding improvements in their respective metrics.

[TABLE II: Comparative experimental results of five detectors on the MSISTD dataset]

On the IoU metric, the models show an average increase of 1.53 and a maximum increase of 2.60 after using PGDA. On the nIoU metric, they show an average increase of 1.14 and a maximum increase of 2.05. On the MSISTD dataset, the PGDA strategy increases the IoU metric by an average of 1.68 and the nIoU metric by an average of 1.60 across all models, with a maximum IoU improvement of 2.97 and a maximum nIoU improvement of 6.21. The visualization is shown in Fig. 9. The results on the two datasets confirm the effectiveness and robustness of the PGDA strategy.
Since the MSISTD dataset is more than twice as large as SIRST, more evaluation metrics are used to fully evaluate PGDA. In Table III, we present the results of the different models in detail, which indicate PGDA's performance. Because the number of small targets is forcibly increased, the distribution of the training dataset differs slightly from that of the original dataset, so some models score lower on the Precision metric. However, because the strategy is highly effective at boosting recall, the F1 metric of all models improves accordingly.
The experimental assertions are likewise verified on the Pd and Fa metrics. The subfigure-split algorithm enhances the model's attention to subfigures, and the randomness of the target paste position lets the model cope with diverse image cases; together, these two aspects substantially improve the model's localization ability. We conduct comparison experiments with each of the seven different models. On the existing SIRST, all algorithms obtain performance improvements on both metrics. On MSISTD, the PGDA strategy yields performance improvements for most of the individual models. Although some metrics decrease, we group IoU/nIoU, Prec/Recall, and Pd/Fa into three pairs for analysis. The problem of positive and negative sample imbalance is solved to some extent by FSDNet due to its focal loss, so the performance of FSDNet improves significantly. We also compare some of the current mainstream DA algorithms, including oversampling [40], mixup [33], and cut-paste-learn [39]. Due to the specificity of the ISTD problem, Mosaic changes the scale distribution of the targets, so we do not compare against it. The oversampling method increases the ratio of positive to negative samples but destroys the per-image sample count distribution, even when the original dataset is retained; the metric declines in Table IV are consistent with this. Mixup is also an effective DA algorithm: it does not destroy the original scale distribution, and fusing two images substantially raises the detection difficulty, which improves the generalization of the algorithm to some extent. However, because of the SCR characteristics of infrared small targets, detection becomes extremely difficult, and image fusion drowns small targets in noise.
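The subfigure-split idea described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name, the crop size parameter, and the max-compositing paste are our assumptions. The key point it demonstrates is that the target patch is pasted at a random position inside a randomly cropped sub-window without any rescaling, so the target scale distribution is preserved while target/background context is recombined.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility


def subfigure_split_paste(image, target_patch, crop_size):
    """Crop a random sub-window from a background image, then paste a
    small target patch at a random position inside it.

    The target keeps its original size (preserving the scale
    distribution); compositing with np.maximum keeps the bright target
    visible against the infrared background."""
    H, W = image.shape
    ch, cw = crop_size
    y0 = rng.integers(0, H - ch + 1)
    x0 = rng.integers(0, W - cw + 1)
    sub = image[y0:y0 + ch, x0:x0 + cw].copy()

    th, tw = target_patch.shape
    ty = rng.integers(0, ch - th + 1)
    tx = rng.integers(0, cw - tw + 1)
    region = sub[ty:ty + th, tx:tx + tw]
    sub[ty:ty + th, tx:tx + tw] = np.maximum(region, target_patch)

    # Segmentation label for the synthesized sample.
    mask = np.zeros_like(sub, dtype=np.uint8)
    mask[ty:ty + th, tx:tx + tw] = (target_patch > 0).astype(np.uint8)
    return sub, mask
```

Each call produces a new background crop and a new paste location, which is what drives the localization-ability gains attributed to random paste positions.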
To summarize, the former destroys the scale and number distribution of small targets, and the latter destroys their SCR distribution. The two components of PGDA excel at preserving the scale and SCR distributions of the dataset, making it more applicable to the ISTD problem than generic DA algorithms. We also explore the performance of the strategy with a very low number of images in Fig. 10. In this experiment, we use two models (ACM and FSDNet), and the experimental settings are the same as those in Table III. Although the focal loss [54] can address part of the sample imbalance problem, PGDA still brings a large improvement under four different training set proportions. In summary, the PGDA strategy recombines context and target to increase the diversity of extremely small-sample data, which enhances the performance of DL methods.

D. Ablation Study
We set up four groups of ablation experiments in Table V to demonstrate the effectiveness of the proposed PGDA.
To illustrate the difference between PGDA and copy-paste algorithms, we separate out the Gaussian Blur module of the target enhancement component. When only the Gaussian Blur module works, the proposed PGDA degenerates into the copy-paste [40] algorithm. When the SCR perception module dynamically assigns grayscale values, the synthesized image conforms to the physical representation of the target and thus increases the descriptive power for the target; this improves the copy-paste algorithm's IoU by one point. When only the subfigure split module works, the original background images are replaced with multiscale cropped subfigures, which achieves a large improvement on both the IoU and nIoU indicators. The multiscale subfigures not only realize spatial recombination of target and background but also provide the DA effect of random cropping. When the two components act simultaneously, the strategy aligns infrared small targets in terms of both SCR and scale, achieving significant improvements in both shape description capability and target localization. The ablation experiments demonstrate that the proposed PGDA strategy improves performance by optimizing scale information and the physical representation.
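The target enhancement component (SCR perception plus Gaussian blur) can be sketched as below. This is a hedged illustration, not the paper's code: the function name, the fixed blur sigma, and the use of local background mean/std are our assumptions. It shows the core idea of the SCR perception module, inverting the definition SCR = (mu_t − mu_b) / sigma_b to pick a target grayscale that achieves a desired SCR against the local background, followed by Gaussian blur to mimic the point-spread-like appearance of real infrared targets.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def scr_aware_target(shape_mask, background_patch, target_scr, sigma=0.7):
    """Synthesize a target patch whose grayscale is chosen from local
    background statistics so the pasted target achieves a desired SCR,
    then soften it with Gaussian blur.

    SCR = (mu_t - mu_b) / sigma_b, so mu_t = mu_b + SCR * sigma_b."""
    mu_b = background_patch.mean()
    sigma_b = background_patch.std() + 1e-6  # avoid division issues on flat patches
    gray = mu_b + target_scr * sigma_b       # invert the SCR definition
    target = shape_mask.astype(float) * gray
    return gaussian_filter(target, sigma)    # PSF-like softening of hard edges
```

With the Gaussian blur disabled (sigma → 0) and a fixed grayscale instead of the SCR-derived one, this reduces to plain copy-paste, which is exactly the degenerate case examined in the ablation.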

VI. CONCLUSION
In this work, a PGDA strategy has been proposed to expand the diversity of training samples indirectly without introducing additional data. The two components of PGDA (the target enhancement component and the subfigure split module) successfully decouple target description capability and target localization capability by introducing prior knowledge of the training set. The strategy effectively preserves the scale distribution of the dataset and the actual representation of small targets. A new MSISTD dataset consisting of 1077 images with 1343 instances has been constructed; the numbers of images and instances in MSISTD are 2.4 times and 2.5 times those of the existing largest ISTD dataset, SIRST, respectively. Moreover, MSISTD measures the performance gap between current data-driven and model-driven methods while accurately evaluating DL methods. Ablation experiments, small-sample training, and comparison experiments on the MSISTD and SIRST datasets demonstrate that the proposed PGDA is robust and effective across dataset sizes and model types.
Despite the demonstrated benefits, existing algorithms can only locate, not identify, the target. Strategies for improving infrared image quality and algorithms for extracting target classification features in ISTD remain to be studied in the future.