FTC-Net: Fusion of Transformer and CNN Features for Infrared Small Target Detection

Single-frame infrared small target detection remains challenging because of complex backgrounds and the indistinct structural characteristics of small targets. Recently, convolutional neural networks (CNNs) have been introduced to infrared small target detection and are widely used for their excellent performance. However, existing CNN-based methods mainly focus on local spatial features while ignoring the long-range contextual dependencies between small targets and backgrounds. To capture global context-aware information, we propose a fusion network architecture of transformer and CNN (FTC-Net), which consists of two branches. The CNN-based branch uses a U-Net with skip connections to obtain low-level local details of small targets. The transformer-based branch applies hierarchical self-attention mechanisms to learn long-range contextual dependencies. Specifically, the transformer branch can suppress background interference and enhance target features. To obtain both local and global feature representations, we design a feature fusion module that concatenates the features of the two branches. We conduct ablation and comparative experiments on the publicly available SIRST dataset. Experimental results show that the transformer-based branch is effective and suggest the superiority of the proposed FTC-Net over other state-of-the-art methods.


I. INTRODUCTION
INFRARED small target detection is increasingly applied in practical fields, including maritime surveillance [1], infrared warning, infrared guidance, and infrared search and tracking, and has made remarkable achievements. However, due to the lack of discriminative features such as color and texture, small size (less than 80 pixels in a 256 × 256 image [2]), long imaging distance, and low signal-to-noise ratio, infrared small targets are easily submerged by noise and by complex, changing backgrounds such as clouds and sea waves, making them difficult to detect. As a result, improving the detection rate of infrared small target detection is still a necessary demand for practical applications. To solve this challenging task, many methods have been proposed, including filtering-based methods [3], [4], [5], vision-based methods [6], [7], [8], [9], and low-rank-based methods [10], [11], [12], [13]. The filtering-based method can suppress a uniform background. Nevertheless, its detection performance decreases when the background is complex, which means poor robustness. The vision-based method is mainly applicable to scenes where the target is relatively bright and differs from the surrounding background. The low-rank-based method is time-consuming and has a high false alarm rate for infrared images with dark targets. All of the above methods rely on prior expert knowledge to extract handcrafted features and are sensitive to varied scenarios.
Benefiting from the development of computer vision in many applications, the performance of infrared small target detection methods based on convolutional neural networks (CNN) is gradually improving. Liu et al. [14] proposed an end-to-end network based on the multilayer perceptron to localize small targets. Fan et al. [15] enhanced infrared image contrast by applying a modified convolutional neural network and thus improved detection performance. Wang et al. [16] presented a conditional generative adversarial network that uses one discriminator and two generators to achieve a suitable balance between false alarm and miss detection. In particular, segmentation-based methods for small target detection have begun to receive attention. Dai et al. [17] proposed an attentional local contrast network to capture long-range contextual interactions and applied a cross-layer fusion module to segment infrared small targets.
Most current image segmentation-based methods for detecting infrared small targets rely on the design of the CNN architecture. However, CNN-based methods ignore the long-range dependencies in infrared images. More specifically, convolutional networks tend to focus on the local information of an image, thus weakening the importance of the overall connection. Different from standard CNN-based methods that process images pixel-by-pixel, a vision transformer (ViT) [18] treats an image as a series of patch tokens (i.e., smaller parts of the image consisting of multiple pixels). At each layer of the network, the ViT uses multihead self-attention to process patch tokens based on the relationship between each pair of tokens. As a result, the ViT can build a global representation of an entire image. Existing methods have achieved good performance, but the task of infrared small target segmentation is still worth exploring, and further enhancements can be made in modeling long-range dependencies.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
We make a comparison to illustrate that long-range dependency is significant for the infrared small target segmentation task. To obtain the correct segmentation, the network needs to accurately distinguish whether each pixel of the image belongs to the background or the target. Due to the low signal-to-noise ratio of infrared images, background clutter can easily be mistaken for the target. Learning the long-range dependencies between targets and background interference in infrared images is therefore critical: it helps to prevent the network from misclassifying a background pixel as a target pixel and reduces false positives. As shown in Fig. 1, because the cloud background has fuzzy and fractal structures on the cloud margins, the CNN-based method classifies a part of the background similar to the target (the region highlighted by the red circle) as the target. In contrast, our method avoids this error and predicts a more accurate result. This false positive is avoided because the proposed fusion network architecture of transformer and CNN (FTC-Net) learns the long-range dependencies between the target pixel region and the background.
To this end, we combine the sequence-model transformer with the CNN model to enhance the ability to capture long-range dependencies. Specifically, in the CNN-based branch, the modified U-Net with skip connections obtains the feature representation of local details to retain as many small targets as possible. The designed transformer branch is flexible and can be scaled to extract high-level global context information. Moreover, a feature fusion module (FFM) is designed to fuse local details with global contextual features. The experimental results show that our proposed method achieves IoU and nIoU gains of nearly 2% and 4.2% on the SIRST dataset, respectively.
Our contributions can be summarized as follows:
- We design the FTC-Net to construct long-range dependencies and fully explore the global context between infrared small targets and background.
- We present an FFM to concatenate features extracted from the CNN branch and the transformer branch, which can obtain global context information while retaining location details as much as possible for detecting small targets.
- Experimental results on the SIRST dataset show the superior performance of the proposed FTC-Net, which is robust to cluttered backgrounds and to various target sizes and shapes.
The organization of this article is as follows: In Section II, we present the related work. In Section III, we illustrate the composition architecture of our FTC-Net. In Section IV, we show the experimental details and final results. In Section V, we draw the conclusion.

II. RELATED WORK
A. Single-Frame Infrared Small Target Detection
In the field of computer vision, single-frame infrared small target detection has long been an important topic and research hotspot. Typical traditional methods include filtering-based methods [19], vision-based methods [20], [21], [22], [23], and low-rank-based methods [24]. The filtering-based detection method highlights small targets by differencing the original image from the filtered background image. Alternatively, the frequency difference between the target, background, and clutter is used to filter out the background and the clutter by designing the corresponding filter in the frequency domain. The vision-based approach mainly utilizes the saliency map. According to the visual perceptual properties of the human eye, the presence of small targets leads to significant changes in local texture rather than global texture. As a result, algorithms based on local difference or variation excel in different small target detection tasks. The low-rank-based method has excellent stability for ordinary infrared images. However, for infrared images with dark targets, some strong clutter signals may be as sparse as the target signals, leading to higher false alarm rates.
Recently, CNN-based methods have performed better than traditional methods. Liu et al. [14] proposed an end-to-end network based on the multilayer perceptron to localize small targets. Wang et al. [25] proposed a feature extraction network with an attention mechanism while incorporating the YOLO [26] algorithm for target detection. In particular, segmentation-based small target detection methods have begun to receive attention. To achieve a balance between miss detection and false alarm, Wang et al. [16] proposed a novel generative adversarial framework consisting of one discriminator and two generators. The segmentation results were obtained by averaging the outputs of the two generators. Dai et al. [18] designed a segmentation-based infrared small target detection network, which proposed an asymmetric contextual modulation (ACM) framework for information exchange between high-dimensional features and low-dimensional features. In addition, Dai et al. [17] designed an attentional local contrast network, using a local attention module and a cross-layer fusion procedure to preserve local spatial features and enhance the segmentation performance of small targets.
Although the detection capacity of CNN-based approaches is gradually improving, effectively capturing long-range dependencies in infrared small target images remains a challenge.

B. Vision Transformers
Our work is motivated by the vision transformer (ViT) [18], which has achieved impressive performance in many computer vision tasks [27], [28], [29], [30], [31] and has been applied to target detection. The segmentation transformer [32] integrates a transformer framework into a fully convolutional network and designs effective decoders to improve segmentation performance. The feature pyramid network in SOTR [33] can effectively distinguish low-level feature information, and its twin transformer can capture associations between distant contexts. In recent years, frameworks combining CNN and transformer have been used for medical images. In a two-dimensional (2D) medical image segmentation task [42], transformer and CNN are combined to form an enhanced encoder. In TransFuse [43], the authors discovered the complementarity of transformer and CNN in image segmentation tasks. Notably, an efficient hierarchical ViT architecture called the Swin transformer [34], whose representation is computed with shifted windows, has achieved good performance on many vision tasks. In infrared small target detection, learning the long-range contextual dependencies of targets and background is critical [44], and transformer-based methods remain to be studied. While convolutional neural networks are still the main framework for all kinds of vision tasks, the potential of transformer-based architectures in this area cannot be ignored. In this work, we attempt to combine a hierarchical transformer and a CNN model to detect infrared small targets.

III. METHOD
In this section, we introduce our FTC-Net in four parts. The model architecture of our method is shown in Fig. 2. We design an infrared small target detection network with two branches. Specifically, one branch uses a powerful hierarchical transformer to capture long-range dependencies, and the other branch uses a variant of U-Net to extract local details. Finally, a feature fusion module fuses the features extracted from the hierarchical transformer branch and the modified U-Net branch.

A. Architecture Overview
As shown in Fig. 2, for the transformer branch, we use a hierarchical construction with 4×, 8×, 16×, and 32× downsampling operations to obtain image feature maps of different sizes. To transform the inputs into sequence embeddings, the infrared input images are first split into nonoverlapping patches of size 4 × 4, each of which is treated as a "token." For the CNN branch, we apply the U-Net with skip connections. Because of the contextual gap between the encoder and the decoder, feature representations with fewer local details may degrade the final segmentation results. The skip connections in the U-Net help to recover fine-grained details in the segmentation task.
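The hierarchy described above can be made concrete with a small sketch. It only computes the token grid at each stage, assuming a square 384 × 384 input (the size reported in the implementation details); the function name `stage_grids` is hypothetical and not part of the authors' code.

```python
def stage_grids(image_size=384, factors=(4, 8, 16, 32)):
    """Return (factor, grid side, number of tokens) for each hierarchical stage.

    Assumes a square input whose side is divisible by every downsampling factor.
    """
    grids = []
    for f in factors:
        if image_size % f != 0:
            raise ValueError(f"{image_size} is not divisible by {f}")
        side = image_size // f
        grids.append((f, side, side * side))
    return grids


for factor, side, tokens in stage_grids():
    print(f"{factor:>2}x downsampling -> {side} x {side} grid, {tokens} tokens")
```

With 4 × 4 patches, the first stage therefore works on a 96 × 96 grid of tokens, and each later stage halves the grid side.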
To better fuse the output features of the two branches and reduce the semantic gap, we design an FFM after the CNN branch and the transformer branch in the FTC-Net framework.

B. Swin Transformer Block
Multihead self-attention (MSA) is used to obtain several kinds of information from the input. Each attention head focuses on a different aspect of the feature map, and the heads are combined to capture relevant information in different subspaces. The Swin transformer block replaces the standard MSA module with a module based on shifted windows. Fig. 3 shows the specific structure of the Swin transformer block, which consists of a LayerNorm (LN) layer, a shifted window-based multihead self-attention module (SW-MSA), a residual connection, and a multilayer perceptron (MLP) with the GELU nonlinearity. The MLP contains two fully connected layers. The idea of layer normalization is similar to batch normalization. Whereas batch normalization computes statistics for each neuron over the samples of a mini-batch, LN normalizes the inputs of all neurons in a given layer of the deep network. Layer normalization calculates the mean $\mu^{l}$ and standard deviation $\sigma^{l}$ over all neurons separately for each sample as follows:
$$\mu^{l} = \frac{1}{H}\sum_{i=1}^{H} a_{i}^{l}, \qquad \sigma^{l} = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_{i}^{l}-\mu^{l}\right)^{2}}$$
where $l$, $a_{i}^{l}$, and $H$ represent the $l$th hidden layer of the feedforward neural network, the $i$th element of the input vector (the weighted sum of the outputs of the preceding layer), and the number of hidden units, respectively. In two successive transformer blocks, the regular window-based multihead self-attention (W-MSA) module and the SW-MSA module appear alternately. W-MSA divides the input image into nonoverlapping windows and then performs self-attention within each window. Unlike a regular window, the shifted window is offset and remerged on the image. This design enables information interaction across windows while keeping the computational complexity low. To strengthen connections across windows, the regular and shifted window partitioning schemes alternate in the hierarchical Swin transformer block.
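To illustrate why alternating regular and shifted windows lets information cross window borders, the following sketch assigns each token of a square grid to a window. It is only an illustration under stated assumptions: the shift is realized as a cyclic displacement of the grid, and a shift of half the window size is used in the usage below, following the Swin transformer design [34]; the attention masking that Swin applies to wrapped-around tokens is not modeled. The function name `window_of` is hypothetical.

```python
def window_of(i, j, size, window, shift=0):
    """Window (row, col) that token (i, j) falls into on a size x size grid.

    shift=0 gives the regular partition. shift>0 cyclically moves every token
    up and left by `shift` positions before the regular partition is applied.
    """
    return (((i - shift) % size) // window, ((j - shift) % size) // window)


# Two neighbouring tokens on an 8 x 8 grid with 4 x 4 windows.
a, b = (3, 3), (4, 4)
print(window_of(*a, 8, 4), window_of(*b, 8, 4))        # regular partition
print(window_of(*a, 8, 4, 2), window_of(*b, 8, 4, 2))  # shifted partition
```

Tokens (3, 3) and (4, 4) sit in different regular windows but in the same shifted window, so the second block of a pair can relate them.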
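The computation in two successive blocks is as follows:
$$\hat{z}^{l} = \mathrm{W\text{-}MSA}\left(\mathrm{LN}\left(z^{l-1}\right)\right) + z^{l-1}$$
$$z^{l} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \mathrm{SW\text{-}MSA}\left(\mathrm{LN}\left(z^{l}\right)\right) + z^{l}$$
$$z^{l+1} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}$$
where $\hat{z}^{l}$ and $z^{l}$ represent the outputs of the (S)W-MSA module and the MLP module of the $l$th block, respectively. In addition, we add a relative position bias for each head when computing similarity:
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V$$
where $Q, K, V \in \mathbb{R}^{M^{2}\times d}$ denote the query, key, and value matrices, $d$ is the dimension of the query or key, and $M^{2}$ is the number of patches in each window. We calculate the offset between the position of each pixel and every other position in the window, obtain the relative position index, and look up the learnable bias matrix $\hat{B}$ according to the index value. The bias matrix $B$ is added directly to the attention matrix, and its values are taken from the bias matrix $\hat{B} \in \mathbb{R}^{(2M-1)\times(2M-1)}$, since the relative position along each axis lies in the range $[-M+1, M-1]$.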
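The lookup of the relative position bias and its use inside window attention can be sketched in plain Python. This is a minimal single-head illustration, not the authors' implementation: it assumes the bias is stored as a flat table with (2M − 1) × (2M − 1) entries, one per possible relative offset, and the function names are hypothetical.

```python
import math


def relative_position_index(M):
    """index[p][q] is the bias-table entry for query token p and key token q,
    both taken in row-major order from an M x M window."""
    coords = [(r, c) for r in range(M) for c in range(M)]
    index = []
    for r1, c1 in coords:
        row = []
        for r2, c2 in coords:
            dr = r1 - r2 + (M - 1)  # map offsets [-(M-1), M-1] to [0, 2M-2]
            dc = c1 - c2 + (M - 1)
            row.append(dr * (2 * M - 1) + dc)
        index.append(row)
    return index


def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]


def window_attention(Q, K, V, bias_table, M):
    """SoftMax(Q K^T / sqrt(d) + B) V for one window and one head.

    Q, K, V are lists of M*M vectors; bias_table has (2M-1)**2 entries.
    """
    d = len(Q[0])
    idx = relative_position_index(M)
    out = []
    for p, q_vec in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(d)
            + bias_table[idx[p][q]]
            for q, k_vec in enumerate(K)
        ]
        w = softmax(scores)
        out.append([sum(w[q] * V[q][c] for q in range(len(V)))
                    for c in range(len(V[0]))])
    return out
```

Because the bias depends only on the offset between two tokens, every pair with the same offset shares one learned value, which keeps the table small compared with an M² × M² matrix.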

C. Feature Fusion Module
To address the feature and semantic inconsistencies between the hierarchical transformer and CNN decoder outputs, we use an FFM to eliminate discrepancies.
As shown in Fig. 4, the FFM has two inputs: the output $T \in \mathbb{R}^{C\times H\times W}$ of the hierarchical transformer and the output feature map $D \in \mathbb{R}^{C\times H\times W}$ of the variant U-Net. For an input feature map $X \in \mathbb{R}^{C\times H\times W}$, spatial squeeze generates a vector $G(X) \in \mathbb{R}^{C\times 1\times 1}$ through the global average pooling layer. Each element in $G$ is calculated as follows:
$$G(X)_{k} = \frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{k}(i, j)$$
This formula adds up all the elements of the $k$th channel of $X$ and then divides the sum by the spatial size $H\times W$. In other words, it compresses all the spatial information on the $k$th channel into a single value, and $G(X)$ is the tensor obtained after compressing the spatial information channel by channel. The attention mask $M$ is computed from the obtained tensor and the spatial information through two linear layers with weights $L_{1} \in \mathbb{R}^{C\times C}$ and $L_{2} \in \mathbb{R}^{C\times C}$. The activation value $\sigma(M)$ is then obtained through the sigmoid layer so that it lies in $[0, 1]$. The activation value allows us to distinguish the importance of channels, and it is multiplied channel-wise with the input $T$, i.e., $\sigma(M)\otimes T$, to obtain a recalibrated feature map.
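The two steps that are fully specified in the text, global average pooling and sigmoid gating of T, can be sketched as follows. This is an illustration only: how the mask M is produced by the two linear layers is not reproduced here, so the mask is simply passed in as a per-channel list, and the function names are hypothetical.

```python
import math


def global_average_pool(X):
    """Squeeze a C x H x W nested list into a length-C list of channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in X]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def recalibrate(T, mask):
    """Scale every element of channel k of T by sigmoid(mask[k])."""
    return [[[sigmoid(m) * v for v in row] for row in ch]
            for ch, m in zip(T, mask)]


X = [[[1, 2], [3, 4]], [[0, 0], [0, 8]]]  # two 2 x 2 channels
print(global_average_pool(X))             # one value per channel
```

A mask entry of 0 halves its channel, large positive entries leave the channel almost unchanged, and large negative entries suppress it.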

D. Loss Function
Two loss functions are applied in the proposed model, including the binary cross-entropy (BCE) loss and the Dice loss.
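The segmentation-based infrared target detection method can be regarded as a binary classification task (target versus background). As a result, the BCE loss function is applied:
$$L_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log p_{i} + \left(1-y_{i}\right)\log\left(1-p_{i}\right)\right]$$
where $N$ is the sample number, and $y_{i}$ and $p_{i}$ denote the ground-truth label and the predicted probability of the $i$th sample, respectively. Dice loss is commonly used in medical image segmentation tasks [45], which are similar to infrared small target detection. It comes from a similarity function in statistics that measures the similarity of two samples and takes a value between 0 and 1. The formula of the Dice loss is as follows:
$$L_{\mathrm{Dice}} = 1 - \frac{2\left|X\cap Y\right|}{\left|X\right| + \left|Y\right|}$$
where $X$ and $Y$ represent the predicted pixels and the ground truth pixels, respectively. The overall optimization objective of the proposed model is:
$$L = \alpha L_{\mathrm{BCE}} + \beta L_{\mathrm{Dice}}$$
where $\alpha$ and $\beta$ are the balance coefficients.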
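A minimal sketch of the combined objective on flattened per-pixel predictions is given below. It uses the standard BCE and soft Dice formulations; the clipping constant `eps`, the smoothing term `smooth`, and the default weights `alpha = beta = 1.0` are assumptions made for illustration, since the values of the balance coefficients are not stated here.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; pred holds probabilities, target holds 0/1."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(pred)


def dice_loss(pred, target, smooth=1e-6):
    """1 - 2|X ∩ Y| / (|X| + |Y|), with a small smoothing term."""
    inter = sum(p * y for p, y in zip(pred, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(pred) + sum(target) + smooth)


def total_loss(pred, target, alpha=1.0, beta=1.0):
    """Weighted sum of the two losses (alpha, beta are assumed values)."""
    return alpha * bce_loss(pred, target) + beta * dice_loss(pred, target)


print(total_loss([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0]))
```

The Dice term is driven by the overlap between prediction and label rather than by the per-pixel average, which matters when target pixels are a tiny fraction of the image.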

IV. EXPERIMENTS
A. Dataset
We evaluate our method using the SIRST dataset proposed by Dai et al. [41]. This open dataset is constructed with high-quality images and labels. It contains 427 single-frame infrared images captured at short, medium, and 950 nm wavelengths, which are roughly divided into 70% for training, 10% for validation, and 20% for testing. In this dataset, the targets mainly appear in complex backgrounds such as sky, ocean, and city. These small infrared targets occupy relatively faint pixels, and some are embedded in the background without distinguishable characteristics. Considering this challenging detection task, we perform data augmentation by filtering, cropping, and horizontal and vertical flipping to increase the sample size and avoid overfitting during training.
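The flip part of the augmentation can be sketched as below. The sketch covers only horizontal and vertical flips (filtering and cropping are omitted), applies the same transform to the image and its label mask so that they stay aligned, and uses hypothetical function names.

```python
def hflip(img):
    """Mirror a 2-D list left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror a 2-D list top to bottom."""
    return [row[:] for row in img[::-1]]


def augment(image, mask):
    """Return the original pair plus its horizontal and vertical flips."""
    return [
        (image, mask),
        (hflip(image), hflip(mask)),
        (vflip(image), vflip(mask)),
    ]


image = [[1, 2], [3, 4]]
mask = [[0, 1], [0, 0]]
for img, msk in augment(image, mask):
    print(img, msk)
```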

B. Evaluation Metrics
Two types of evaluation metrics are used to test infrared small target detection performance. On the one hand, we use pixel-level evaluation metrics such as IoU and nIoU, which mainly focus on the shape of the target segmentation. On the other hand, we use the target-level evaluation metric probability of detection (Pd) to assess the localization ability of the proposed model for infrared small targets. Generally, infrared small targets lack shape and texture information, so a single mispredicted pixel will result in a Pd decrease of 11.1% when evaluating a 3 × 3 small target.
1) IoU: Intersection over Union (IoU) is one of the most widely used metrics in the field of segmentation. Specifically, IoU is the area of overlap between the labeled and predicted pixels divided by the area of their union:
$$\mathrm{IoU} = \frac{A_{\mathrm{inter}}}{A_{\mathrm{All}}}$$
where $A_{\mathrm{inter}}$ and $A_{\mathrm{All}}$ represent the intersection area and the union area, respectively.
2) nIoU: The average of the IoU values of all targets in the images at a given judgment threshold is used as the nIoU metric. nIoU is defined as follows:
$$\mathrm{nIoU} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP[i]}{T[i] + P[i] - TP[i]}$$
where $N$, $TP$, $T$, and $P$ denote the number of total samples, true positive, true, and positive samples, respectively.
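The two pixel-level metrics can be sketched on flattened binary masks as follows. The sketch treats IoU per sample and nIoU as the mean of the per-sample values; returning 1.0 when both prediction and label are empty is an assumption made to avoid division by zero, and the function names are hypothetical.

```python
def iou(pred, label):
    """Intersection over union of two flattened 0/1 masks."""
    tp = sum(1 for p, y in zip(pred, label) if p and y)  # TP
    t = sum(1 for y in label if y)                       # T: labeled pixels
    p = sum(1 for v in pred if v)                        # P: predicted pixels
    union = t + p - tp
    if union == 0:
        return 1.0  # assumption: empty prediction and empty label agree
    return tp / union


def niou(preds, labels):
    """Mean of the per-sample IoU values."""
    return sum(iou(p, y) for p, y in zip(preds, labels)) / len(preds)


preds = [[1, 1, 0, 0], [1, 1, 1, 1]]
labels = [[1, 0, 1, 0], [1, 1, 1, 1]]
print(iou(preds[0], labels[0]), niou(preds, labels))
```

Because nIoU averages per-sample ratios, a badly segmented small target lowers it as much as a badly segmented large one, whereas a pixel-accumulated IoU is dominated by larger targets.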

3) Probability of Detection: Pd is a widely used target-level metric. It represents the probability of successful detection of the target:
$$P_{d} = \frac{P_{\mathrm{correct}}}{P_{\mathrm{All}}}$$
where $P_{\mathrm{correct}}$ and $P_{\mathrm{All}}$ are the number of correctly detected targets and the number of all targets, respectively.

4) ROC: The horizontal and vertical axes of the receiver operating characteristic (ROC) curve are the false positive rate (FPR) and the true positive rate (TPR), respectively. FPR is the fraction of actually negative samples that are misclassified as positive. TPR is the fraction of actually positive samples that are classified as positive. The ROC curve is obtained under a sliding threshold and can effectively reflect the overall target detection performance.
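The sliding-threshold construction of the ROC curve and the area under it can be sketched as follows. The sketch works on flattened per-pixel scores and 0/1 labels, uses a caller-supplied list of thresholds, and integrates with the trapezoidal rule; the function names are hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """Return sorted (FPR, TPR) pairs, one per threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return sorted(points)


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
pts = roc_points(scores, labels, thresholds=[0.0, 0.5, 1.0])
print(pts, auc(pts))
```

In this toy case the scores separate the two classes perfectly, so the curve passes through the top-left corner.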

C. Implementation Details
The proposed FTC-Net consists of a CNN-based segmentation branch and a hierarchical transformer branch. In the transformer branch, the number of downsampling layers is set to 4. In the four Swin transformer stages, the channel numbers C and C1 are set to 128 and 64, respectively. The number of MSA heads is set to 4, 8, 16, and 32, respectively. As shown in Fig. 2, the number of Swin transformer blocks is set to 2, 2, 18, and 2, respectively. Input images of different resolutions are first resized to a fixed size of 384 × 384. All models are implemented using the PyTorch library on a computer with four Nvidia GeForce 1080Ti GPUs and are trained from scratch. We set the batch size to 8 and use the Adam optimizer with an initial learning rate of 0.001 to train the designed model.
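For reference, the reported settings can be gathered in a small configuration object. This is only a sketch: the class name `TrainConfig` and its field names are hypothetical, and the sanity checks encode the assumption that each stage doubles the downsampling factor, starting from the 4 × 4 patch size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the settings reported in the text."""
    image_size: int = 384
    patch_size: int = 4
    depths: tuple = (2, 2, 18, 2)      # Swin blocks per stage
    num_heads: tuple = (4, 8, 16, 32)  # MSA heads per stage
    batch_size: int = 8
    learning_rate: float = 1e-3
    optimizer: str = "Adam"

    def __post_init__(self):
        if len(self.depths) != len(self.num_heads):
            raise ValueError("depths and num_heads must list one value per stage")
        # Assumed: the last stage downsamples by patch_size * 2**(stages - 1).
        last_factor = self.patch_size * 2 ** (len(self.depths) - 1)
        if self.image_size % last_factor != 0:
            raise ValueError("image_size must be divisible by the last-stage factor")


cfg = TrainConfig()
print(cfg)
```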
1) Quantitative Results: We used the same dataset, the publicly released code, and the same settings in our experiments to ensure a fair comparison. Experimental results can be observed in Table I.
The prediction results of the above approaches are evaluated with IoU, nIoU, Pd, and ROC. Compared with traditional methods, the CNN-based detection frameworks show a significant improvement in all evaluation metrics. This is because the SIRST dataset has cluttered backgrounds and contains challenging images with different target shapes and sizes, and the CNN-based methods are robust to changing backgrounds. In contrast, the local contrast and low-rank-based methods are generally scene-specific and can only suppress uniform backgrounds to a certain extent. In addition, traditional methods mainly focus on overall target positioning rather than fine shape matching, which results in relatively poor performance.
Compared with the CNN-based methods, our FTC-Net has achieved obvious improvement in detection performance. The designed network combines the advantages of both CNN and transformer, while taking into account the deep and shallow layers of the network as well as long-range dependencies. Therefore, the network can ensure accurate localization and precise segmentation of small infrared targets. As shown in Table I, compared with the state-of-the-art ALC method, our proposed FTC-Net has consistent improvements with IoU, nIoU, and Pd gains of approximately 2.02%, 4.22%, and 2.48%, respectively.
We further evaluate our FTC-Net and other existing methods using the ROC metric. The area under the ROC curve (AUC) is a measure of the performance of the classification model, and a larger value means better performance. The results are shown in Fig. 5 and Table I. It can be seen that the proposed FTC-Net has the best results, indicating the effectiveness of our network. The ROC curves of the conventional methods lie below those of the CNN-based methods, which means relatively poor performance. Compared with the ALCNet method with the attentional local contrast module, the presented FTC-Net has a larger area under the ROC curve, with a value of 0.9658. The results further demonstrate the effectiveness and robustness of our method in infrared small target detection.
2) Qualitative Results: To visually assess the detection performance, qualitative results of five representative methods on the SIRST dataset are shown in Figs. 6 and 7. As can be seen, the ALC model and our FTC-Net are significantly superior to the traditional methods and obtain more accurate segmentation results. As shown in Fig. 6(c) and (d), the traditional method performs well on images where infrared targets are clearly distinguished from the background. However, it easily misses targets that are similar to the background. In addition, there are often a large number of false alarm areas, as shown in Fig. 6(a) and (b). This can be explained from two aspects. First, traditional models tend to focus on the differences between the small target and the background environment, which is not suitable for detecting targets in dark environments. Second, some strong clutter signals may be as sparse as target signals, resulting in a higher false alarm rate. From Figs. 6 and 7, we can conclude that the CNN-based methods perform better than the conventional methods in terms of detection accuracy and are less prone to false alarms. Moreover, compared with ALCNet, our FTC-Net did not miss the target in Fig. 6(c) and did not generate false detections in Fig. 6(a). In terms of detection accuracy, our FTC-Net segments the target contour more accurately. These qualitative results illustrate that the designed dual-path network is well adapted to the challenges of various complex backgrounds, target shapes, and sizes, thus showing better segmentation performance.

E. Ablation Study
To explore whether each component is helpful to model performance, we conduct ablation studies on the SIRST dataset. Specifically, we analyze the detection results by removing every single module.
As shown in Table II, "CNN," "Transformer," and "FTC-Net" denote the CNN branch, the transformer branch, and the proposed method with the feature fusion module, respectively. Compared with the CNN branch alone, the detection performance is better with the addition of the transformer branch, and the IoU value increases from 74.09% to 76.69%. After adding the FFM, the IoU value increases by a further 1.03%. Compared to the baseline, our proposed FTC-Net improves the values of IoU, nIoU, and Pd by 3.63%, 1.36%, and 2.72%, respectively. The results of the ablation study illustrate the effectiveness of fusing the features of the two branches. As for runtime, the proposed FTC-Net takes about 0.313 s to test a 384 × 384 image, which is slightly longer than a single transformer branch or CNN branch.
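As a quick consistency check of the IoU figures quoted above, the two reported steps add up to the reported total gain. The intermediate value 77.72% is derived here from the quoted numbers rather than read from Table II.

```latex
\begin{align*}
76.69\% - 74.09\% &= 2.60\% && \text{(adding the transformer branch)} \\
76.69\% + 1.03\% &= 77.72\% && \text{(adding the FFM)} \\
77.72\% - 74.09\% &= 3.63\% && \text{(total IoU gain over the CNN baseline)}
\end{align*}
```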
The excellent detection capability can be attributed to two factors. On the one hand, the transformer with a self-attention mechanism helps to capture long-range dependencies and thus achieves more accurate segmentation. On the other hand, our results reveal that the FFM can fuse local details and global semantic features, eliminating the ambiguity in the decoder features.

F. Visualization of Feature Maps
To better understand and illustrate the effectiveness of FTC-Net, a visualization of feature maps is presented. Grad-CAM [38] intuitively displays the feature maps learned by the network in the form of a heatmap and can help us analyze the area the network focuses on for a certain category. It performs backpropagation from the node with the largest softmax value and uses the average value of the gradient of each feature map as its weight. The weights of all feature maps for the corresponding category are obtained, and the weighted sum of the feature maps is then computed.
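The weighting step described above can be sketched in plain Python. The sketch assumes the feature maps and the gradients of the class score with respect to them are already available as K × H × W nested lists, and it applies the final ReLU that Grad-CAM uses to keep only positively contributing regions; the function name is hypothetical.

```python
def grad_cam(feature_maps, gradients):
    """Weighted sum of K feature maps, each weighted by its mean gradient.

    feature_maps, gradients: K x H x W nested lists. Returns an H x W heatmap.
    """
    weights = [sum(sum(row) for row in g) / (len(g) * len(g[0]))
               for g in gradients]
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w * fmap[i][j]
    return [[max(v, 0.0) for v in row] for row in cam]  # ReLU


feature_maps = [[[1.0, 2.0]], [[3.0, 0.0]]]
gradients = [[[1.0, 1.0]], [[-1.0, -1.0]]]
print(grad_cam(feature_maps, gradients))
```

In practice the heatmap is then upsampled to the input resolution and overlaid on the image.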
As shown in Fig. 8, we apply the Grad-CAM method to visualize feature maps of our FTC-Net and the CNN-based network. To make a better comparison, we choose a deep feature map of the network to output heatmaps. It can be seen that the CNN-based network is easily disturbed by background regions with a size and brightness similar to the target. As shown in Fig. 8(a) and (d), the CNN-based network generates false attention in backgrounds with strong noise and heavy clutter. In contrast, the feature map from the FTC-Net is more sensitive to the target, which leads to accurate shape segmentation.

G. Error Diagnosis
In this section, we analyze some inaccurate detection results on the SIRST testing dataset. The results in Table I show that the detection probability of our proposed FTC-Net is already quite high. However, there are still false positives and false negatives. As shown in Fig. 9, the segmentation errors mostly come from incorrectly predicted pixels distributed around the target boundary.
False detections occur for two reasons. First, the infrared images in the SIRST dataset are used to capture airborne moving targets far away. As a result, the target edges are blurred, and the difference between the targets and background is small in the image. The segmentation results are prone to generate errors at the image boundary. The second reason that affects detection results is that the ground truth is manually labeled. There are visual biases and ambiguous pixels for the actual images, which influence the training process of the proposed model. Besides, for the small target with around 3 × 3 size, each pixel error will have a great impact on the final detection results.

V. CONCLUSION
Precise shape segmentation is the key point of infrared small target detection. In this work, we propose an infrared small target detection network named FTC-Net. Different from existing target detection methods based on CNN, the proposed FTC-Net contains a hierarchical transformer branch to capture long-range contextual dependencies between the targets and background. In addition, to address the feature inconsistency between the transformer-based and CNN-based branch outputs, we design an FFM that concatenates long-range contextual information and local edge details. We conduct ablation experiments to illustrate the effectiveness of the transformer branch and the feature fusion module. Moreover, qualitative and quantitative results on the SIRST dataset show that the proposed approach achieves high-quality predictions with favorable detection performance and strong generalization ability.