Local Motion and Contrast Priors Driven Deep Network for Infrared Small Target Super-Resolution

Infrared small target super-resolution (SR) aims to recover reliable and detailed high-resolution images with high-contrast targets from their low-resolution counterparts. Since infrared small targets lack color and fine structure information, it is important to exploit the supplementary information among sequence images to enhance the target. In this paper, we propose the first infrared small target SR method, named local motion and contrast prior driven deep network (MoCoPnet), to integrate the domain knowledge of infrared small targets into a deep network, which can mitigate the intrinsic feature scarcity of infrared small targets. Specifically, motivated by the local motion prior in the spatio-temporal dimension, we propose a local spatio-temporal attention module to perform implicit frame alignment and incorporate the local spatio-temporal information to enhance the local features (especially for small targets). Motivated by the local contrast prior in the spatial dimension, we propose a central difference residual group to incorporate the central difference convolution into the feature extraction backbone, which can achieve center-oriented gradient-aware feature extraction to further improve the target contrast. Extensive experiments demonstrate that our method can recover accurate spatial dependency and improve the target contrast. Comparative results show that MoCoPnet outperforms the state-of-the-art video SR and single image SR methods in terms of both SR performance and target enhancement. Based on the SR results, we further investigate the influence of SR on infrared small target detection, and the experimental results demonstrate that MoCoPnet promotes the detection performance. The code is available at https://github.com/XinyiYing/MoCoPnet.


I. INTRODUCTION
INFRARED imaging systems operate in all weather, day and night, and offer high penetrability, sensitivity and concealment. They are widely used in security monitoring, remote sensing investigation, aerospace offense-defense and other military missions. Low-resolution (LR) infrared images can no longer meet the high requirements of practical military missions. Therefore, it is necessary to improve the resolution of infrared images. A straightforward way to obtain high-resolution (HR) infrared images is to increase the size of infrared sensor arrays. However, due to the technical limitations of sensors and the high cost of large infrared sensor arrays, it is necessary and important to develop practical, low-cost and highly reliable infrared image super-resolution (SR) algorithms. Note that modern autonomous driving technology requires the infrared imaging system to detect the target at a fairly long distance. Therefore, the target only occupies a very small proportion of the whole image, and is susceptible to noise and clutter. In this paper, we mainly focus on the infrared small target SR task and investigate its influence on infrared small target detection.
The special imaging mechanism and military application of infrared imaging systems put forward the following requirements for infrared small target SR: 1) High fidelity of super-resolved images. Noise and false contours should be avoided as much as possible. 2) High contrast of super-resolved targets. The target contrast in the super-resolved images should be strengthened to boost the subsequent tasks. 3) High robustness to complex scenes and noise. Small objects are sometimes submerged in clutter and thus have low local contrast to the background. SR algorithms should be robust to various complex scenes and imaging noise. 4) High generalization to insufficient datasets. The lack of infrared image datasets requires that SR algorithms achieve stable results with a relatively small dataset.
The motivations of our method come from data analysis, and can be summarized as: 1) The target occupies a small proportion of the whole infrared image (generally less than 0.12% [1]) and lacks color and fine structure information (e.g., contour, shape and texture). Little information is available for SR within a single image. Therefore, we perform SR on image sequences to use the supplementary information along the temporal dimension to improve the SR performance and the target contrast. 2) Due to the long distance between the target and the imaging system, the mobility of the targets on the imaging plane is limited, leading to small motion of the target between neighborhood frames (i.e., local motion prior [2], [3] in spatio-temporal dimension). Therefore, we design a local spatio-temporal attention (LSTA) module to perform implicit frame alignment and exploit the supplementary information in the local spatio-temporal neighborhood to enhance the local features (especially for small targets). 3) Compared with the background clutter, the contrast and gradient between the target and the background in the local neighborhood are high in all directions (i.e., local contrast prior [4], [5] in spatial dimension). Therefore, we design a central difference residual group (CD-RG) to achieve center-oriented gradient-aware feature extraction, which can encode the local contrast prior to further improve the target contrast.
Based on the above observations, we propose a local motion and contrast prior driven deep network (MoCoPnet) for infrared small target SR. The main contributions can be summarized as follows: 1) We propose the first infrared small target SR method named local motion and contrast prior driven deep network (MoCoPnet) and summarize the definition and requirements of this task. The proposed modules (i.e., central difference residual group and local spatio-temporal attention module) of MoCoPnet integrate the domain knowledge (i.e., local contrast prior and local motion prior) of infrared small targets into deep networks, which can mitigate the intrinsic feature scarcity of data-driven approaches [5]. 2) The experimental results demonstrate that MoCoPnet can achieve state-of-the-art SR performance and effectively improve the target contrast. 3) Based on the SR results, we further investigate the influence of SR on infrared small target detection. The experimental results show that MoCoPnet can promote the detection performance to achieve high signal-to-noise ratio gain (SNRG), signal-to-clutter ratio gain (SCRG) and contrast gain (CG) scores and improved receiver operating characteristic curve (ROC) results.

II. RELATED WORK

A. Single Image SR
Image SR is an inherently ill-posed optimization problem and has been investigated for decades. In the literature, researchers have proposed a variety of classic single image SR (SISR) methods, including prediction-based methods [6], [7], edge-based methods [8], [9], statistics-based methods [10], [11], patch-based methods [9], [12] and sparse representation methods [13], [14]. However, most of the aforementioned traditional methods use handcrafted features to reconstruct HR images, which cannot model the complex SR process and thus limits the SR performance. Recently, due to their powerful feature representation capability, convolutional neural networks (CNNs) have been widely used in the single image SR task and achieve state-of-the-art performance [15], [16]. Dong et al. [17] proposed the pioneering CNN-based work SRCNN to recover an HR image from its LR counterpart. Kim et al. [18] deepened the network to 20 convolutional layers (i.e., VDSR) and achieved improved SR performance by increasing model complexity. Moreover, various increasingly deep and complex architectures (e.g., residual networks [19], recursive networks [20]-[23], densely connected networks [24]-[26], attention-based networks [15], [27]) have also been applied to SISR for performance improvement. Rather than only reducing average image distortion with a norm loss, generative adversarial image SR networks [28], [29] employed the perceptual loss for perceptual quality improvement.

B. Video SR
Existing video SR methods commonly follow a three-step pipeline, including feature extraction, motion compensation and reconstruction [30]. Traditional video SR methods [31], [32] employ handcrafted models to estimate motion, noise and blur kernel and reconstruct HR video sequences. Recent deep learning-based video SR methods are better at exploiting spatio-temporal information owing to their powerful feature representation capability and can achieve state-of-the-art performance. Liao et al. [33] proposed the pioneering CNN-based video SR method to perform motion compensation by optical flow and then ensembled the compensated drafts via CNN. Afterwards, a series of optical flow-based video SR algorithms [34], [35] emerged to explicitly perform motion estimation and frame alignment, which can produce blurred and duplicated results [36]. To avoid the aforementioned problem, deformable convolution [37], [38] has been employed to perform motion compensation explicitly in a unified step [39], [40] through extra offsets. Apart from these explicit motion compensation methods, implicit approaches (e.g., 3D convolution networks [41], [42], recursive networks [43], [44], non-local networks [40], [45]) have also been applied to video SR for performance improvement.

C. Infrared Image SR
With the increased demands of high-resolution infrared images, some researchers perform image SR on infrared images. Traditional methods [46] consider SR as sparse signal reconstruction in compressive sensing. Based on the previous studies, Zhang et al. [47] combined compressive sensing and deep learning to achieve improved SR performance with low computational cost. Han et al. [48] proposed to employ CNNs to recover high-frequency components with upscaled LR images to generate the SR results. He et al. [49] proposed a cascaded deep network with multiple receptive fields for large scale factor (×8) infrared image SR. Liu et al. [50] proposed to use generative adversarial network and perceptual loss to reconstruct the texture details of infrared images. Chen et al. [51] employed an iterative error reconstruction mechanism to perform SR in a coarse-to-fine manner. Huang et al. [52] proposed a progressive super-resolution generative adversarial network and employed the multistage transfer learning strategy to improve the SR performance from small samples. Prajapati et al. [53] proposed channel splitting-based convolutional neural network to eliminate the redundant features for efficient inference. Yang et al. [54] proposed a visible-assisted training strategy to promote details preservation.

D. Attention Mechanism
Since the importance of each spatial location and channel is not uniform, Hu et al. [55] proposed SENet for classification, which consists of selection units to control the switch of passed data. Zhang et al. [15] proposed a channel attention mechanism to calculate the importance along the channel dimension for channel selection. Anwar et al. [56] proposed feature attention to urge the network to pay more attention to the high-frequency regions. Dai et al. [27] proposed second-order attention to adaptively readjust features for powerful feature correlation learning. Wang et al. [57] explored the sparsity in the SR task and proposed sparse masks for efficient inference. The spatial mask and channel mask calculate the importance along both the spatial dimension and the channel dimension to prune the redundant computations. The aforementioned studies only consider the global importance on the spatial and channel dimensions. Since small targets only occupy a small portion of the whole image and have high contrast with the local neighborhood, we design a local attention mechanism which can better characterize the small targets.

E. Sequence Image Infrared Small Target Detection
Sequence image infrared small target detection is significant for long-range precision strikes, aerospace offensive-defensive countermeasures and remote sensing intelligence reconnaissance. According to whether the sequential information is used, sequence image infrared small target detection methods can be divided into two categories: detect before track (DBT) methods and track before detect (TBD) methods. Based on the results of single image infrared small target detection [5], [58]- [61], DBT methods employed the motion trajectory of targets through sequence image projection to eliminate the false targets and reduce the false alarm rate. DBT methods have low computational cost and are easy to implement. However, the performance drops rapidly with low SNR. TBD methods [62]- [64] commonly follow a three-step pipeline, including background suppression, region of interest extraction and target detection. TBD methods are robust to images with low SNR but have high computational cost, which cannot meet the requirements of real-time detection. It is challenging to achieve high detection rate and low false alarm rate in real-time due to the lack of target information, the complex background noise, the insufficient public datasets and the explosion of data amount and the computational cost. Therefore, it is necessary to recover reliable image details and enhance the contrast between target and background for detection.

III. METHODOLOGY
In this section, we introduce our method in detail. Specifically, Section III-A introduces the overall framework of our network. Sections III-B and III-C introduce the two modules which integrate the local contrast prior and the local motion prior of infrared small targets into deep networks.

A. Overall Framework
The overall framework of our MoCoPnet is shown in Fig. 1. Specifically, an image sequence with 5 frames $LR^{t+i}$ ($i \in [-2, 2]$) is first sent to a convolutional layer to generate the initial features $F_{0}^{t+i}$ ($i \in [-2, 2]$), which are then sent to the central difference residual group (CD-RG) to achieve center-oriented gradient-aware feature extraction. Then, each neighborhood feature $F_{CD-RG}^{t+i}$ ($i = -2, -1, 1, 2$) is paired with the reference feature $F_{CD-RG}^{t}$ and sent to two local spatio-temporal attention (LSTA) modules to achieve motion compensation and enhance the local features. Next, the reference feature $F_{CD-RG}^{t}$ is concatenated with two compensated neighborhood features $F_{LSTA2}^{t+k}$, $F_{LSTA2}^{t-k}$ ($k = 1, 2$) and then sent to a residual group (RG) and a convolutional layer for coarse fusion. Afterwards, the two fused features are concatenated and sent to an RG and a convolutional layer for fine fusion. Then, the fused feature is processed by an RG, a sub-pixel layer and a convolutional layer for SR reconstruction and upsampling. Finally, the SR reference frame is obtained by adding the bicubically upsampled LR reference frame, which accelerates the training convergence. Note that the number of input frames is set to 7 in this paper and the process is the same as in Fig. 1(a). We use the mean square error (MSE) between the SR reference frame and the groundtruth reference frame as the loss function of our network.
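The reconstruction tail described above (sub-pixel layer, global residual connection and MSE loss) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions rather than the released implementation: the function names are hypothetical, the channel ordering follows the common depth-to-space convention, and nearest-neighbour upsampling is used only as a stand-in for the bicubic upsampling named in the text.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel (depth-to-space) rearrangement: (C*r*r, H, W) -> (C, H*r, W*r)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

def upsample_nearest(lr, r):
    """Stand-in for bicubic upsampling (assumption made for brevity)."""
    return lr.repeat(r, axis=-2).repeat(r, axis=-1)

def mse_loss(sr, hr):
    """Mean square error between the SR frame and the groundtruth frame."""
    return float(np.mean((sr - hr) ** 2))

# Toy usage: 4x SR of a single-channel 8x8 reference frame.
scale = 4
lr_ref = np.random.rand(1, 8, 8)
residual = pixel_shuffle(np.random.rand(scale * scale, 8, 8), scale)
sr_ref = upsample_nearest(lr_ref, scale) + residual  # global residual connection
```

Because the network only has to predict the residual on top of an upsampled input, the initial output is already close to the target, which is consistent with the faster convergence mentioned above.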

B. Central Difference Residual Group
Central difference residual group (CD-RG) incorporates central difference convolution (CD-Conv [65], [66]) into residual group (RG [15], [26]) to achieve the center-oriented gradient-aware feature extraction, which can utilize the spatial local salient prior to strengthen the contrast of the small targets. Note that, we employ RG as the backbone of our MoCoPnet for the following reasons: RG can generate features with large receptive field and dense sampling rate, which promotes the information exploitation. The reuse of hierarchical features not only improves the SR performance [67] but also maintains the information of small targets [1], [61], [68].
The architecture of the central difference residual group (CD-RG) is shown in Fig. 1(b). The input feature $F_{0}^{t+i}$ is first fed to $D$ central difference residual dense blocks [69] (CD-RDB) to extract hierarchical features. Then, the hierarchical features are concatenated and fed to a 1×1 convolutional layer to generate the output feature $F_{CD-RG}^{t+i}$. As shown in Fig. 1(b1), 1 CD-Conv and $K - 1$ Convs with a growth rate of $G$ are used within each CD-RDB to achieve dense feature representation. The architecture of CD-Conv is shown in Fig. 1(b2). CD-Conv aggregates the center-oriented gradient information, which echoes the spatial local saliency prior of infrared small targets. As shown in Fig. 2, different from the handcrafted dilated local contrast measure (DLCM [5]), which can only preserve the contrast information in one direction, CD-Conv is a learnable measure and can improve the contrast of small targets while maintaining the background information. In conclusion, CD-Conv is more in line with the task of infrared small target SR (i.e., recovering reliable and detailed high-resolution images with high-contrast targets). DLCM and CD-Conv can be formulated as $f(x, y)$ and $g(x, y)$, where $S_{x,y}$ represents the value of a specific location $(x, y)$ in the feature map, and $(i, j)$ is the direction index. $\omega_{i,j}$ is a learnable weight to continuously optimize the local contrast measure and $\theta \in [0, 1]$ is a hyperparameter to balance the contribution between gradient-level detailed information and intensity-level semantic information. Note that $\theta$ is set to 0.7 [65] in our paper.
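To make the role of $\theta$ and $\omega_{i,j}$ concrete, the sketch below assumes the general central difference convolution form of [65], $g(x, y) = \theta \sum_{(i,j)} \omega_{i,j} (S_{x+i,y+j} - S_{x,y}) + (1 - \theta) \sum_{(i,j)} \omega_{i,j} S_{x+i,y+j}$. The exact expression used in this paper may differ in detail; the single-channel setting, the zero padding and the function name `cd_conv2d` are illustrative assumptions.

```python
import numpy as np

def cd_conv2d(s, w, theta=0.7):
    """Single-channel central difference convolution (illustrative sketch).

    s: 2-D feature map, w: (k, k) weights, theta: gradient/intensity balance.
    Zero padding is assumed at the borders.
    """
    k = w.shape[0]
    pad = k // 2
    sp = np.pad(s, pad)                       # constant (zero) padding
    out = np.zeros_like(s, dtype=float)
    for y in range(s.shape[0]):
        for x in range(s.shape[1]):
            patch = sp[y:y + k, x:x + k]
            vanilla = np.sum(w * patch)                # intensity-level term
            gradient = np.sum(w * (patch - s[y, x]))   # center-oriented gradient term
            out[y, x] = theta * gradient + (1 - theta) * vanilla
    return out

# Toy usage: a flat background produces no gradient response in the interior.
flat = np.ones((5, 5))
weights = np.ones((3, 3))
resp_flat = cd_conv2d(flat, weights, theta=1.0)
```

Under this form the output equals a vanilla convolution minus $\theta \, S_{x,y} \sum_{(i,j)} \omega_{i,j}$, so setting $\theta = 0$ recovers an ordinary convolution and $\theta = 1$ keeps only the gradient term.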

C. Local Spatio-Temporal Attention Module
Local spatio-temporal attention (LSTA) module calculates the local response between the neighborhood frame and the reference frame and uses the local spatio-temporal information to enhance the local features of the reference frames. The inputs of LSTA are the reference frame and one neighborhood frame. For a sequence with 7 frames, the operation needs to be repeated 6 times. The architecture of LSTA is shown in Fig. 1(c). The red reference feature $F_{CD-RG}^{t} \in \mathbb{R}^{H \times W \times C}$ and the blue neighborhood feature $F_{CD-RG}^{t-1} \in \mathbb{R}^{H \times W \times C}$ are first fed to 1×1 convolutional layers conv q and conv k for dimension compression to generate $F_0, F_1 \in \mathbb{R}^{H \times W \times C/cr}$, where $cr$ is the compression ratio and is set to 8 in our paper. The process can be formulated as:

$$F_0 = H_{conv_q}(F_{CD-RG}^{t}), \quad F_1 = H_{conv_k}(F_{CD-RG}^{t-1}),$$

where $H_{conv_k}$ and $H_{conv_q}$ represent 1×1 convolutions. Then, we calculate the response between each location $p_0$ in $F_0$ and the corresponding neighborhood (centered in $p_0$) in $F_1$. Afterwards, the response is summed and passed through a softmax along the channel dimension to generate the attention map $M$. In this process, $p_n$ represents the $n$-th value of the local neighborhood centered in $p_0$ with kernel size of kern and dilation rate of dila. The purple 3×3 grid in Fig. 1(c) is the local attention feature map with parameters (kern=3, dila=1). Note that, as shown in Figs. 3(c) and (d), dila can be an integer larger than 1 to enlarge the receptive field without additional computational cost. As shown in Figs. 3(e) and (f), dila can also be fractional to capture the sub-pixel motion between frames, and we employ bilinear interpolation to generate the exact corresponding values.
Finally, a dot product is performed between the local neighborhood feature $F_{CD-RG}^{t-1}(p_n)$ centered in $p_0$ and the corresponding attention map $M(p_n)$ to generate the value of location $p_0$ in the output feature $F_{LSTA}^{t-1}(p_0)$. The process is formulated as:

$$F_{LSTA}^{t-1}(p_0) = \sum_{n} F_{CD-RG}^{t-1}(p_n) \cdot M(p_n).$$

LSTA first calculates the response between the reference frame and its adjacent frames to generate the attention map, and then calculates a weighted summation of these frames using the generated attention maps. In this way, the neighborhood frames can be implicitly aligned and the complementary temporal information can be incorporated to enhance the features of small targets.
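The three steps of LSTA (dimension compression, local response with softmax, attention-weighted aggregation) can be sketched in NumPy as below. This is an illustrative reading of the description, not the released code: the function name `lsta`, the zero padding at the borders and the choice to normalize the channel-summed response with a softmax over the kern×kern neighborhood positions are assumptions, and only integer dilation rates are handled (the fractional case with bilinear interpolation is omitted).

```python
import numpy as np

def lsta(f_ref, f_nbr, w_q, w_k, kern=3, dila=1):
    """Local spatio-temporal attention sketch.

    f_ref, f_nbr: (H, W, C) reference / neighborhood features.
    w_q, w_k: (C, C // cr) weights of the 1x1 convolutions conv q / conv k.
    Returns the aligned neighborhood feature and the attention map M.
    """
    h, w, _ = f_ref.shape
    f0 = f_ref @ w_q                  # a 1x1 convolution is a per-pixel matmul
    f1 = f_nbr @ w_k
    half = kern // 2
    r = half * dila
    pad = ((r, r), (r, r), (0, 0))
    f1_pad = np.pad(f1, pad)
    nbr_pad = np.pad(f_nbr, pad)
    offsets = [(dy * dila, dx * dila)
               for dy in range(-half, half + 1)
               for dx in range(-half, half + 1)]

    def shifted(x, dy, dx):
        # Feature map sampled at p0 + (dy, dx) for every location p0.
        return x[r + dy:r + dy + h, r + dx:r + dx + w]

    # Response between F0(p0) and F1(pn), summed over channels: (H, W, kern*kern).
    resp = np.stack([(f0 * shifted(f1_pad, dy, dx)).sum(axis=-1)
                     for dy, dx in offsets], axis=-1)
    resp = resp - resp.max(axis=-1, keepdims=True)    # numerical stability
    m = np.exp(resp)
    m = m / m.sum(axis=-1, keepdims=True)             # attention map M
    out = np.zeros_like(f_nbr, dtype=float)
    for n, (dy, dx) in enumerate(offsets):
        out += m[..., n:n + 1] * shifted(nbr_pad, dy, dx)
    return out, m

# Toy usage: the target sits at the center of the reference frame and
# at the top-left neighbor position in the neighborhood frame.
ref = np.zeros((5, 5, 2)); ref[2, 2] = 10.0
nbr = np.zeros((5, 5, 2)); nbr[1, 1] = 10.0
aligned, attn = lsta(ref, nbr, np.eye(2), np.eye(2), kern=3, dila=1)
```

In this toy case the attention at the center location concentrates on the top-left offset, so the target in the neighborhood feature is moved to the reference position, which mirrors the implicit alignment behaviour described above.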

IV. EXPERIMENTS
In this section, we first introduce the experiment settings, and then conduct ablation studies to validate our method. Next, we compare our network to several state-of-the-art SISR and video SR methods. Finally, we investigate the influence of SR on infrared small target detection.

A. Experiment Settings
In this subsection, we sequentially introduce the datasets, the evaluation metrics, the network parameters and the training details. (Anti-UAV [72]) releases 250 high-quality infrared video sequences with multi-scale UAV targets. In this paper, we employ the 1st-50th sequences with target annotations of SAITD as the test dataset and the remaining 300 sequences as the training dataset. In addition, we employ Hui and Anti-UAV as test datasets to test the robustness of our MoCoPnet to real scenes. In the Anti-UAV dataset, only the sequences with infrared small targets [1] (21 sequences in total) are selected as the test set. Note that we only use the first 100 images of each sequence for testing to balance computational/time cost and generalization performance.
2) Evaluation Metrics: We employ peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) to evaluate the SR performance. In addition, we introduce signal-to-noise ratio (SNR) and contrast ratio (CR) in the local background neighborhood [58] of targets to evaluate the performance of recovering small targets. As shown in Fig. 4(a), the size of the target area is $a \times b$, and the local background neighborhood is extended from the target area by $d$ both in width and height. Note that the parameters of the local background neighborhood $(a, b, d)$ in HR images are set to (7, 7, 30), (11, 11, 50) and (21, 21, 100) in SAITD, Hui and Anti-UAV respectively. When 4× SR is performed on HR images, the parameters $(a, b, d)$ are set to (29, 29, 120), (45, 45, 200) and (85, 85, 400). When 4× downsampling is performed on HR images, the parameters are set to (3, 3, 10), (3, 3, 10) and (5, 5, 20).
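The target area and its local background neighborhood can be extracted with two boolean masks, as in the sketch below. The function name `local_regions` is hypothetical, and two points are assumptions rather than statements from the paper: $a$ is taken as the width and $b$ as the height of the target area, and the neighborhood is extended by $d$ on every side and clipped at the image border.

```python
import numpy as np

def local_regions(shape, cy, cx, a, b, d):
    """Return (target_mask, background_mask) around a target centered at (cy, cx)."""
    h, w = shape
    target = np.zeros(shape, dtype=bool)
    region = np.zeros(shape, dtype=bool)
    y0, y1 = cy - b // 2, cy + b // 2 + 1     # target rows (height b, odd)
    x0, x1 = cx - a // 2, cx + a // 2 + 1     # target columns (width a, odd)
    target[max(y0, 0):min(y1, h), max(x0, 0):min(x1, w)] = True
    region[max(y0 - d, 0):min(y1 + d, h), max(x0 - d, 0):min(x1 + d, w)] = True
    return target, region & ~target           # background excludes the target

# Toy usage with the SAITD setting (a, b, d) = (7, 7, 30) on a 256x256 frame.
image = np.random.rand(256, 256)
tgt_mask, bg_mask = local_regions(image.shape, 128, 128, 7, 7, 30)
target_peak = image[tgt_mask].max()           # maximum value of the target area
background_std = image[bg_mask].std()         # standard deviation of the background
```

Statistics such as the maximum, mean and standard deviation of each region can then be read directly from the masked pixels.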
To further evaluate the impact of SR algorithms on infrared small target detection, we adopt SNR gain (SNRG), background suppression factor (BSF), signal-to-clutter ratio gain (SCRG), contrast gain (CG) and receiver operating characteristic curve (ROC) for comprehensive evaluation. Note that the common detection evaluation metrics calculate the ratio of the statistics in the local background neighborhood before and after detection. Since we first super-resolve the LR image and then perform detection, the inputs of detection algorithms, which are the outputs of different SR algorithms, are different. Therefore, directly using the common detection evaluation metrics cannot evaluate the impact of SR on detection accurately. To eliminate the influence of different inputs, we modify the first four metrics to calculate the ratio of the statistics in the local background neighborhood between the LR image before SR and the HR target image after detection. The modified evaluation metrics are shown in Fig. 4(b). We then introduce the aforementioned evaluation metrics in detail. SNRG is used to measure the SNR improvement of detection algorithms. In its definition, $[\cdot]_{in}$ and $[\cdot]_{out}$ represent the metrics in the local background neighborhood of the LR images and the HR target images respectively, and $P_t$ and $P_b$ are the maximum values of the target area and the background area respectively. BSF is used to measure the background suppression effect and is defined in terms of $\sigma_b$, the standard deviation of the background area. SCRG is used to measure the SCR improvement of detection algorithms and is defined in terms of $\mu_t$ and $\mu_b$, the mean values of the target area and the background area respectively. CG is used to measure the improvement of contrast between targets and background. Note that, in order to avoid the value of "Inf" (i.e., the denominator is zero) and "NAN" (i.e., the numerator and denominator are both zero), we add $\epsilon$ to each denominator in equations 6-9 to prevent it from being zero. $\epsilon$ is set to 1e-10 in our paper. ROC is used to measure the trend between detection probability $P_d$ and false alarm probability $F_a$, which are formulated as:

$$P_d = \frac{TD}{AT}, \quad F_a = \frac{FD}{NP},$$

where TD and FD are the numbers of true detections and false detections, and AT and NP are the number of targets and the number of image pixels. Note that the criterion for judging a true detection is that the distance between the detected location and the groundtruth location is less than a threshold $\tau$, and $\tau$ is set to 10 pixels [71] in our paper.
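The detection probability and false alarm probability can be computed from a list of detected locations as sketched below. The function name `detection_rates` and the matching rule are assumptions made for illustration: each groundtruth target is matched to at most one detection, every unmatched detection counts as one false detection, and at least one groundtruth target is assumed to exist.

```python
import math

def detection_rates(detections, targets, num_pixels, tau=10.0):
    """Return (Pd, Fa) = (TD / AT, FD / NP) for lists of (y, x) locations."""
    matched = set()
    true_det = 0
    false_det = 0
    for dy, dx in detections:
        hit = None
        for idx, (ty, tx) in enumerate(targets):
            # A detection is true if it lies within tau pixels of an unmatched target.
            if idx not in matched and math.hypot(dy - ty, dx - tx) < tau:
                hit = idx
                break
        if hit is None:
            false_det += 1
        else:
            matched.add(hit)
            true_det += 1
    return true_det / len(targets), false_det / num_pixels

# Toy usage: one target, one nearby detection and one spurious detection.
pd, fa = detection_rates([(52, 53), (200, 200)], [(50, 50)], 256 * 256, tau=10.0)
```

Sweeping the segmentation threshold of a detector and recording the resulting $(F_a, P_d)$ pairs yields the ROC curve.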
All experiments were implemented on a PC with an Nvidia RTX 3090 GPU. The networks were optimized using the Adam method [73] with $\lambda_1 = 0.9$, $\lambda_2 = 0.999$ and the batch size was set to 12. The learning rate was initially set to 1e-3 and halved at 10K, 20K and 60K iterations. We trained our network from scratch for 100K iterations.
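The step-wise learning rate schedule described above can be written as a small function. This is a sketch with a hypothetical name; whether the halving takes effect exactly at the milestone iteration or one step later is an assumption.

```python
def learning_rate(iteration, base_lr=1e-3, milestones=(10_000, 20_000, 60_000)):
    """Learning rate that starts at base_lr and is halved at each milestone."""
    passed = sum(iteration >= m for m in milestones)   # milestones reached so far
    return base_lr * 0.5 ** passed

# Toy usage: sample the schedule over the 100K training iterations.
schedule = [learning_rate(it) for it in (0, 10_000, 20_000, 60_000, 99_999)]
```

In a deep learning framework the same behaviour is typically obtained with a multi-step scheduler using a decay factor of 0.5.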

B. Ablation Study
In this subsection, we conduct ablation experiments to validate our design choices.
1) Central Difference Residual Group: To demonstrate the effectiveness of our central difference residual group (CD-RG), we replace all the CD-Convs in CD-RG by Convs (i.e., residual group) and retrain the network from scratch. The experimental results in Table I show that CD-RG (i.e., CD-Convs) can introduce 0.12dB/0.004 gains on PSNR/SSIM and 0.06/0.09 gains on SNR/CR. This demonstrates that CD-RG can exploit the spatial local contrast prior to effectively improve the SR performance and the target contrast.
In addition, we visualize the feature maps generated by residual group (RG) and CD-RG with a toy example in Fig. 5. Note that the visualization maps are the L2 norm results along the channel dimension [61], [74] and the red and blue boxes represent target areas and edge areas respectively. As illustrated in Fig. 5(a), the input frame of the image sequence consists of a target of size 3×3 (i.e., the white cube at the top) and the clutter (i.e., the white area at the bottom). It can be observed from Figs. 5(b) and (c) that the target contrast in the feature map extracted by CD-RG is higher than that of RG. This demonstrates that CD-RG can enhance the target contrast (from 7.41 to 13.55). In addition, CD-RG can also improve the contrast between high-frequency edges and background (from 6.64 to 13.59). This is because CD-RG aggregates the gradient-level information to concentrate more on the high-frequency edge information, thus improving the SR performance and target contrast simultaneously. Moreover, we conduct ablation experiments to replace all the CD-Convs in MoCoPnet by DLCMs. Note that the training process of MoCoPnet with DLCMs is unstable, with sudden loss divergence due to gradient fracture. By contrast, CD-Conv preserves the image feature information to update all pixels, which ensures the continuity of gradient propagation. The ablation results in Table I show that CD-Conv introduces a significant performance gain on PSNR/SSIM (i.e., 1.01/0.039 on average) and further improves the contrast of small targets (i.e., 0.024/0.022 SNR/CR gain on average).
TABLE II: Ablation results of the local spatio-temporal attention module on the average of SAITD, Hui and Anti-UAV datasets. Note that LSTA1 validates the effectiveness of the module and LSTA2-5 investigate the impact of its parameters, numbers, sub-pixel information
2) Local Spatio-Temporal Attention Module: In MoCoPnet, two cascaded LSTAs with parameters LSTA(kern=3, dila=3) and LSTA(kern=3, dila=1) are used. We visualize the feature maps and attention maps generated by LSTA3 (i.e., an LSTA with kernel size of 3 and dilation rate of 1) with a toy example in Fig. 6. Note that the visualized feature maps are the L2 norm results along the channel dimension [61], [74]. As illustrated in Fig. 6(a1), the target with size 1×1 (i.e., the white cube) is in the middle of the red reference frame. In Fig. 6(a2), the target is in the top left of the blue neighborhood frame. The corresponding features before LSTA are shown in Figs. 6(b1) and (b2). The aligned feature after LSTA is shown in Fig. 6(b3). It can be observed that LSTA can effectively perform frame alignment to achieve motion compensation. In addition, the attention maps are shown in Figs. 6(c1)-(c9), and the position of each attention map corresponds to the spatial arrangement in Fig. 3(b). It can be observed that Fig. 6(c1) has the highest intensity (more than 90% of its values are 1) and represents the top-left motion, which demonstrates that LSTA can effectively capture the target motion to perform frame alignment.
Finally, we replace LSTAs in MoCoPnet by an optical-flow module (OFM) and a deformable alignment module (DAM) to compare our LSTA with the widely used optical flow and deformable alignment techniques. The experimental results are listed in Table II. It can be observed that the PSNR/SSIM/SNR/CR scores of MoCoPnet with LSTAs are higher than those of MoCoPnet with OFM and DAM by 0.11dB/0.004/0.015/0.009 and 0.06dB/0.002/0.005/0.006 respectively. At the same time, the number of parameters and FLOPs of MoCoPnet with LSTA modules are lower than those of MoCoPnet with OFM and DAM by 0.11M/2.70G and 0.19M/3.80G respectively. This demonstrates that LSTA is superior in exploiting the information among frames to improve the SR performance and the target contrast with lower computational cost. This is because, on the one hand, LSTA can directly learn motion compensation through the attention mechanism, without the optical flow estimation and warping that can produce ambiguous and duplicated results [36], [78]. On the other hand, compared with DAM, LSTA can better incorporate the local prior to achieve improved SR performance, and the training process of LSTA is more stable and converges to a good result.
In addition, we visualize the feature maps generated by OFM, DAM and LSTAs with a toy example in Fig. 7. Note that the visualization maps are the L2 norm results along the channel dimension. As illustrated in Fig. 7(a), the input image sequence consists of a random consistent movement of a target with size 3×3 (i.e., the white cube) in the background (i.e., the black area). The feature maps before OFM, DAM and LSTAs are shown in Figs. 7(b), (d) and (f). It can be observed that the target positions in the extracted feature maps are close to the blue dots (i.e., the groundtruth position of the target in the current feature). Then OFM, DAM and LSTAs perform feature alignment on the extracted features. As illustrated in Fig. 7(c), the target positions in the feature maps generated by OFM are close to the blue dots. In Fig. 7(e), the blue dots and the red dots (i.e., the groundtruth position of the target in the reference feature) are both highlighted, which demonstrates that DAM does not perform frame alignment but highlights all the possible positions. The feature maps generated by LSTA1(kern=3, dila=3) and LSTA2(kern=3, dila=1) are shown in Figs. 7(e) and (f). As illustrated in Fig. 7(f), all the target positions in the feature maps generated by LSTA2 are closer to the red dot than those of OFM. This demonstrates that LSTA is superior in motion compensation. Note that it can be observed from Figs. 7(e) and (f) that LSTA1 and LSTA2 achieve coarse-to-fine alignment to highlight the aligned target. This demonstrates the effectiveness and superiority of our coarse-to-fine alignment strategy.

C. Comparative Evaluation
In this subsection, we compare our MoCoPnet with 1 top-performing single image SR method (RCAN [15]), 5 video SR methods (VSRnet [75], VESPCN [34], SOF-VSR [35], TDAN [39] and D3Dnet [30]) and 3 infrared image SR methods (IERN [51], PSRGAN [52] and ChaSNet [53]). For fair comparison, we retrain all the compared methods on the infrared small target dataset [70] and exclude the first and the last 2 frames of the video sequences for performance evaluation. PSNR and SSIM scores are listed in Table III. SNR and CR scores calculated in the local background neighborhood are listed in the 2nd-9th columns of Table IV. It can be observed that MoCoPnet achieves the highest PSNR and SSIM scores and outperforms most of the compared algorithms on SNR and CR scores. The above scores demonstrate that our network can effectively recover accurate details and improve the target contrast. This is because LSTA performs implicit motion compensation and CD-RG incorporates the center-oriented gradient information to effectively improve the SR performance and the target contrast. Note that we also analyze the running time of different methods and the results are shown in Table III. The running time is the total time tested on 100 consecutive HR frames with a resolution of 256×256 and is averaged over 20 runs. It can be observed that our MoCoPnet achieves better SR performance with a reasonable increase in running time.
Qualitative results are shown in Fig. 8. For SR performance, it can be observed from the blue zoom-in regions that MoCoPnet can recover more accurate details (e.g., the sharp edges of buildings, and the lighthouse details closer to the groundtruth HR image). For target enhancement, it can be observed from the red zoom-in regions that, in the first row, MoCoPnet can further improve the contrast of a target that is almost invisible in the results of the other compared methods. In the second row, MoCoPnet is more robust to large motion caused by turntable collections [70] (e.g., artifacts in the zoom-in region of D3Dnet). In the third row, MoCoPnet can effectively improve the target contrast to be even higher than that of the HR images (i.e., 1.82 vs. 1.75).
2) SR on Real Images: SNR and CR scores calculated in the local background neighborhood of super-resolved HR images are listed in the 10th to 17th columns of Table IV. It can be observed that MoCoPnet achieves the best SNR score and the second best CR score on the average of the test datasets under real-world degradation. This demonstrates the superiority of our method in improving the contrast between targets and background.
Qualitative results are shown in Fig. 9. It can be observed that MoCoPnet can recover finer details and achieve better visual quality, such as the edges of buildings and windows. In addition, MoCoPnet can further improve the intensity and the contour details of the targets.

D. Effect On Infrared Small Target Detection Algorithm
In this subsection, we select three typical infrared small target detection algorithms (Top-hat [76], ILCM [77] and IPI [58]) to perform detection on super-resolved infrared images. The parameters of the three infrared small target detection algorithms are shown in Table V. When 4× SR is performed on HR images, the filter size, the block size and the stride, as well as the true detection threshold τ, are each enlarged by a factor of 4. When 4× downsampling is performed on HR images, the filter sizes of Top-hat and ILCM are set to 3×3, the block size and the stride of IPI are set to 15×15 and 3, and the true detection threshold τ is set to 3.0. For simplicity, we only use the best two super-resolved results (i.e., those of D3Dnet and MoCoPnet) to perform detection. We also introduce bicubically upsampled (Bicubic) images and HR images as the baseline results.
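To make the role of the filter size concrete, the following is a minimal NumPy-only sketch of a generic white top-hat filter (the image minus its morphological opening) with the 3×3 filter size quoted above. It is an illustration of the standard operator rather than the implementation of [76]; the function names, the toy image and the segmentation threshold `seg_threshold` are assumptions, and `seg_threshold` is unrelated to the true detection threshold τ.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _morph(image, size, reduce_fn, pad_value):
    """Apply a size x size sliding min or max filter (odd size), padding the border."""
    pad = size // 2
    padded = np.pad(image, pad, mode="constant", constant_values=pad_value)
    windows = sliding_window_view(padded, (size, size))
    return reduce_fn(windows, axis=(-2, -1))

def white_tophat(image, size=3):
    """Image minus its opening (erosion followed by dilation) with a square element."""
    eroded = _morph(image, size, np.min, np.inf)
    opened = _morph(eroded, size, np.max, -np.inf)
    return image - opened

# Toy image: flat background of 0.2 with a 2x2 target of intensity 1.0.
image = np.full((9, 9), 0.2)
image[4:6, 4:6] = 1.0

# Structures smaller than the 3x3 element survive; the flat background is removed.
target_image = white_tophat(image, size=3)

seg_threshold = 0.5  # hypothetical value, chosen only for this toy image
detections = target_image > seg_threshold
```

A target must be smaller than the structuring element to survive the opening, which is consistent with enlarging the filter size together with the image when 4× SR is applied.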
1) Detection on Synthetic Images: The quantitative detection results of super-resolved LR images are listed in Table VI. It can be observed that the SNRG, SCRG and CG of the super-resolved images are generally higher than those of the Bicubic images. This demonstrates that SR algorithms can effectively improve the contrast between the target and the background, thus promoting the detection performance. It is worth noting that the SNRG, SCRG and CG scores of D3Dnet and MoCoPnet can even surpass those of HR. This is because SR algorithms perform better on the high-frequency small targets than on the low-frequency local background, thus achieving higher target contrast than the HR images. In addition, Bicubic achieves the highest BSF score in most cases. This is because SR algorithms act on the entire image, which enhances targets and background simultaneously, while detection algorithms have better filtering performance on a smoothly changing background. Note that the BSF of MoCoPnet is generally higher than that of D3Dnet. This is because MoCoPnet can focus on recovering the local salient features in the image and further improve the contrast between targets and background, which benefits the detection performance.
The qualitative results of super-resolved LR images and detection results are shown in Fig. 10. In the LR images, the target intensities are very low (e.g., the targets in SAITD and Anti-UAV are almost invisible). In the super-resolved images, the target intensities are higher and closer to those of the HR images. This is because SR algorithms can effectively use the spatio-temporal information to enhance the target contrast. Note that our MoCoPnet is more robust to large motion caused by turntable collections [70] (e.g., artifacts in the zoom-in region of D3Dnet in the Hui dataset). In addition, the neighborhood noise in the HR images is suppressed by downsampling followed by super-resolution (e.g., the point noise no longer exists in the zoom-in regions of the Hui and Anti-UAV datasets). Then, we perform detection on the super-resolved images. It can be observed in Fig. 10 that all the detection algorithms perform poorly on the Bicubic images (e.g., the target intensity in the target image is very low and almost invisible in all detection results). This is because bicubic interpolation cannot introduce additional information. In contrast, the target intensities in the target images of the super-resolved images are higher than those of the Bicubic images. Among the super-resolved images, MoCoPnet is superior to D3Dnet in improving the target contrast due to the center-oriented gradient-aware feature extraction of CD-RG and the effective spatio-temporal information exploitation of LSTA.
To evaluate the detection performance comprehensively, we further calculate the ROC results, which are shown in Fig. 11. Note that the ROC results on LR and HR images are used as the baseline results. The targets in HR images have the highest intensity. Therefore, a high detection probability and a low false alarm probability can be obtained, and the detection probability reaches 1 faster (e.g., the ROC results of HR reach 1 the fastest in the SAITD and Hui datasets). Downsampling leads to target intensity reduction, thus reducing the detection probability and increasing the false alarm probability. Bicubic introduces no additional image prior information. Therefore, LR and Bicubic have the worst detection performance and their ROC results are significantly lower than those of the other algorithms (e.g., the ROC results of LR are the lowest and those of Bicubic are the second lowest, except for the ROC of Top-hat in the SAITD dataset). SR algorithms can introduce prior information to improve the contrast between targets and background, thus achieving improved detection accuracy (e.g., the ROC results of MoCoPnet and D3Dnet are higher than those of Bicubic in the SAITD and Hui datasets and even higher than those of HR in the Anti-UAV dataset). Note that the false alarm rates of LR and Bicubic can only reach a relatively low value. This is because IPI achieves detection by sparse and low-rank recovery, which yields a significantly lower false alarm rate than Top-hat and ILCM. On the other hand, IPI suffers from a low detection rate on low-contrast targets. Therefore, the ROC curves of Bicubic and LR images are shorter than those of HR and super-resolved images. The above experimental results show that SR algorithms can recover high-contrast targets, thus improving the detection performance.
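The shortened ROC curves can be illustrated with a small sketch. It assumes a common convention that is not taken from this paper: one target per frame, a target counts as detected if any pixel inside its mask exceeds the threshold, and the false alarm rate is the number of false-alarm pixels divided by the total number of pixels. The function name `roc_points` and the toy score maps are hypothetical.

```python
import numpy as np

def roc_points(score_maps, target_masks, thresholds):
    """Return (false_alarm_rates, detection_probs), one entry per threshold."""
    fa_rates, pd_values = [], []
    total_pixels = sum(m.size for m in score_maps)
    for t in thresholds:
        detected_targets = 0
        false_pixels = 0
        for scores, mask in zip(score_maps, target_masks):
            hits = scores > t
            # Assumption: exactly one target per frame, described by its boolean mask.
            detected_targets += int(np.any(hits & mask))
            false_pixels += int(np.sum(hits & ~mask))
        pd_values.append(detected_targets / len(score_maps))
        fa_rates.append(false_pixels / total_pixels)
    return np.array(fa_rates), np.array(pd_values)

# Two toy 8x8 frames with a sparse detector output (most pixels are exactly zero).
masks = [np.zeros((8, 8), dtype=bool) for _ in range(2)]
masks[0][3, 3] = True
masks[1][5, 5] = True

scores = [np.zeros((8, 8)) for _ in range(2)]
scores[0][3, 3] = 0.9   # target found in frame 0
scores[1][1, 1] = 0.4   # clutter in frame 1; the low-contrast target is missed

fa, pd = roc_points(scores, masks, thresholds=[0.0, 0.5, 0.95])
max_false_alarm_rate = fa.max()
```

With a sparse output map, lowering the threshold to zero cannot raise the false alarm rate beyond the fraction of non-zero background pixels, and a target that was suppressed to zero is never recovered, so the curve ends early and below a detection probability of 1.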
2) Detection on Real Images: The quantitative detection results of super-resolved HR images are listed in Table VII. It can be observed that the detection performance of SR algorithms is superior to that of Bicubic. This demonstrates that MoCoPnet and D3Dnet can effectively improve the contrast between targets and background, resulting in a performance gain in detection. Among the SR algorithms, due to the superior SR and target enhancement performance of our well-designed modules, MoCoPnet achieves the best SNRG, SCRG and CG scores in most cases. Note that the SNRG and SCRG scores (achieved by IPI) of MoCoPnet in the Anti-UAV dataset are 7-8 orders of magnitude lower than those of Bicubic and D3Dnet. This can be explained as follows. First, MoCoPnet achieves the highest CG scores, which demonstrates that the target intensity is effectively and further enhanced by MoCoPnet. Second, the differences come from the performance of background suppression. Since MoCoPnet achieves higher SR performance than Bicubic and D3Dnet, the local backgrounds of Bicubic and D3Dnet are smoother and detection algorithms can achieve better suppression performance on them. IPI is superior in suppressing background clutter. Therefore, the local backgrounds in the target images of Bicubic and D3Dnet are sometimes zero. Since we add a small constant to each denominator in Eqs. (6)-(9) to prevent it from being zero, the SNRG and SCRG scores can be very large when the background is completely suppressed. In addition, bicubic interpolation suppresses the high-frequency components to a certain extent, resulting in the optimal BSF value.
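The effect of a completely suppressed background on these gains can be reproduced with a few lines of arithmetic. The sketch below assumes a commonly used form, SCR = |μ_t − μ_b| / (σ_b + ε) with the gain defined as the output-to-input ratio; Eqs. (6)-(9) are not reproduced here and may differ, and the value of ε and all intensities are hypothetical.

```python
EPS = 1e-8  # hypothetical stabilizing constant added to the denominator

def scr(target_mean, bg_mean, bg_std, eps=EPS):
    """Signal-to-clutter ratio in an assumed common form (not copied from the paper)."""
    return abs(target_mean - bg_mean) / (bg_std + eps)

# Input image: target contrast 0.2 over a background with standard deviation 0.05.
scr_in = scr(0.3, 0.1, 0.05)

# Output A: background suppressed to exactly zero, target contrast 0.5.
scr_zero_bg = scr(0.5, 0.0, 0.0)
# Output B: stronger target (contrast 0.6) but a small residual background.
scr_textured = scr(0.7, 0.1, 0.01)

# Gains are output-to-input ratios.
scrg_zero_bg = scr_zero_bg / scr_in
scrg_textured = scr_textured / scr_in

# Contrast gain compares the target contrast only, so output B wins here.
cg_zero_bg = 0.5 / 0.2
cg_textured = 0.6 / 0.2
```

In this toy setting output B has the higher contrast gain, yet its SCR gain is several orders of magnitude below that of output A, because the denominator of output A collapses to ε. The size of the gap is set by ε and the residual background level, not by how well the target is enhanced.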
The qualitative results of super-resolved HR images and detection results are shown in Fig. 12. It can be observed that the targets in the Bicubic images are blurry while SR can enhance the intensity of the targets (e.g., the highlighted and sharpened targets). We then perform detection on the super-resolved images. Note that SR algorithms can effectively improve the intensity of targets and the contrast against the background, resulting in better detection performance.
To evaluate the detection performance comprehensively, we further present the ROC results in Fig. 13. Note that the ROC results on HR images are used as the baseline results. It can be observed that SR algorithms can improve the detection probability and reduce the false alarm probability in most cases. Compared with D3Dnet, MoCoPnet can further improve the target contrast, thus promoting the detection performance. Note that the false alarm rates of Bicubic can only reach a relatively low value. This is because IPI achieves detection by sparse and low-rank recovery, which yields a significantly lower false alarm rate than Top-hat and ILCM. On the other hand, IPI suffers from a low detection rate on low-contrast targets.

E. Limitation
The proposed method fails when the image sequence contains rapidly moving targets (Fig. 14(a)) or sudden changes (Fig. 14(b)) caused by turntable collections. As we do not have a specific design for handling large motion and sudden change, the motion compensation by LSTAs in these cases can be wrong and our approach may not be able to effectively recover the targets. In future work, we aim to improve the robustness of our method to large motion and sudden change.

V. CONCLUSION
In this paper, we propose a local motion and contrast prior driven deep network (MoCoPnet) for infrared small target super-resolution. Experimental results show that MoCoPnet can effectively recover the image details and enhance the contrast between targets and background. Based on the super-resolved images, we further investigate the effect of SR algorithms on detection performance. Experimental results show that MoCoPnet can improve the performance of infrared small target detection.