Infrared small object tracking based on Att-Siam network

To address practical problems such as background clutter interference and occlusion when infrared imaging guided missiles track small air targets, we propose an improved SiamFC algorithm for infrared small target tracking. By mapping the original spatial data to a high-dimensional Hilbert space, the kernel method makes the nonlinear mapping implicit in a linear learner. This approach is learnable, computationally efficient, linearizable, and generalizes well, and it offers a new and effective route to target tracking that matches the practical needs of the application. We propose a Siamese network tracker (Att-Siam) that fuses a convolutional channel attention mechanism, a stacked channel attention mechanism, and a spatial attention mechanism to improve tracking performance. Feature extraction for the target is enhanced by fusing the channel attention mechanism with two convolution blocks at the low convolutional layers. Tested on an infrared dataset of weak and small air targets, the improved algorithm raises the tracking success rate and accuracy by 32.3% and 20.9%, respectively, over the benchmark algorithm SiamFC. The experimental results show that the proposed algorithm adapts to complex and diverse infrared air scenes and achieves effective, stable, real-time tracking of small infrared air targets.

INDEX TERMS small infrared target; target tracking; twin network; attention mechanism

I. INTRODUCTION
As the field of target tracking develops, research on tracking problems in different scenarios is becoming increasingly in-depth. Multi-target tracking in clutter, where the number of targets is unknown and time-varying, has been a research hotspot in recent years, and methods based on random sets, such as probability hypothesis density filters [1], have become the mainstream for multi-target tracking. Single-target tracking, a part of computer vision, is widely used in practice, for example in drone monitoring, security video surveillance, and traffic monitoring [2]. Although target tracking has advanced rapidly and produced many important results, tracker performance still falls short of practical needs in difficult environments involving complex backgrounds, similar distractors, and deformation.
In recent years, deep learning has been widely applied to single-target tracking, and the powerful feature extraction capability of convolutional networks has greatly improved tracker performance. The fully convolutional Siamese network tracker (SiamFC) [3] pioneered Siamese-network-based tracking; it exploits the translation invariance of the network and removes the size limit on the input image. The Siamese Region Proposal Network (SiamRPN) [4] adds a Region Proposal Network (RPN) to the SiamFC structure, enabling the tracker to predict more accurate target boxes. The Distractor-aware Siamese Network (DaSiamRPN) [5] addresses the imbalance of training samples and improves the discriminative ability of the tracker by constructing hard negative samples, both distractors of the same category and distractors of different categories. Liu Q et al. proposed a multi-task combined convolutional neural network method for infrared target tracking, which is widely used in engineering [6][7].
Kernel-based learning is a learning method developed on the basis of statistical theory. The kernel method maps the original spatial data to a high-dimensional feature space through a nonlinear mapping, so that linear analysis can be performed in that feature space, which makes it an effective way to handle nonlinear pattern recognition problems. The kernel method is also computationally efficient: the kernel function hides the nonlinear mapping inside the linear learner, so the two are computed together, and it replaces the complex inner product operation in the high-dimensional space. This offers a new way to address computational complexity and the "curse of dimensionality" [8]. The kernel method has a simple structure and good generalization performance, and it avoids the over-learning problem. At present, kernel methods have been applied in support vector machines, kernel principal component analysis, time series forecasting, and Fisher discriminant analysis [9]. Liu Q et al. proposed a deep neural network for infrared target tracking and achieved good results [10][11][12].
Integrating attention mechanisms into the tracking model has also become a research direction for improving tracking. RASNet [13] adds three attention modules to the appearance features of the target object, preferentially selects information-rich feature channels, and fuses them with weights before the cross-correlation operation. The global attention mechanism strategy of Zhang Han et al. uses upsampling to restore the feature map size and improve the generalization ability of the model [14]. SiamAtt [15] uses a Siamese attention network to introduce the attention mechanism into the classification branch and uses the classification score to separate foreground from background, thereby predicting the target position. Wang et al. proposed an efficient and lightweight channel attention module that adds few parameters and effectively improves accuracy in image classification and target detection [16]. However, incorporating an attention mechanism into a tracker adds computation, which affects the overall performance of the tracker.
With traditional methods, when an infrared small target occupies few pixels and has low contrast with the background, edge clutter interference can cause tracking failure. This paper proposes a Siamese network tracker (Att-Siam) that fuses a convolutional channel attention mechanism, a stacked channel attention mechanism, and a spatial attention mechanism to improve tracking performance. The method avoids the difficulty of discriminating small infrared targets from background clutter with a single deep feature. It also uses the motion information between frames to predict the position of an occluded target, which solves the problem that the target cannot be tracked accurately once occlusion reduces its feature information, and it offers stable and efficient tracking.

II. RELATED WORK
A. The basic principle of the fully convolutional Siamese network
The Siamese network treats target tracking as a similarity learning problem: it compares the target information in the initial frame of the image sequence with candidate targets in subsequent frames and selects the candidate with the highest similarity. The tracking algorithm based on Fully-Convolutional Siamese Networks (SiamFC) is a classic algorithm in target tracking [14]. The algorithm has high real-time performance: the network parameters of the tracking model are trained offline, and online tracking does not adjust them but directly executes forward propagation, so tracking is fast. It is also robust: neither the network model nor the target template is updated during tracking, so the template is not contaminated and the target can be recaptured even after occlusion. These advantages suit infrared imaging guided missiles, whose target tracking demands high real-time performance and stability. Therefore, SiamFC is chosen as the basic framework for target tracking. The structure of the fully convolutional Siamese network tracking algorithm is shown in Figure 1; it consists of a template branch and a detection branch [17]. The template branch takes the image z as the template input, where the target template z is the image cropped from the given target box in the first frame of the tracking sequence. The detection branch receives the search area x of the current frame, which is an image cropped around the target position of the previous frame. The target template z and the search area x are passed through a convolutional neural network with shared parameters to obtain their respective feature maps ψ(z) and ψ(x).
As a deep feature extractor, the Siamese network extracts the features of the target template z and the search area x and sends them to the similarity function, which is a convolution operation:
$$f(z, x) = \psi(z) * \psi(x) + b \tag{1}$$
where ψ is the feature extraction network, * is the convolution operation, and b is the bias vector. According to formula (1), the convolution uses the target template feature map as the convolution kernel and slides it over the feature map of the search area to obtain the response map f(z, x). Each point of the response map represents the similarity between the target template and the corresponding position of the search area; the larger the value, the greater the similarity. After bicubic interpolation, the position of the target in the current frame is determined from the position of the maximum value in the response map. Experiments applying SiamFC to infrared small air target tracking show that it maintains good tracking speed and accuracy, but background clutter interference and occlusion lead to tracking failure. To this end, based on SiamFC, this paper adds a criterion for judging the target tracking state and proposes an improvement strategy for each state.
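The sliding-window similarity computation described above can be illustrated with a minimal sketch. It assumes single-channel feature maps stored as nested lists and a scalar bias; the function names `cross_correlate` and `argmax_2d` and the toy data are hypothetical and are not taken from the SiamFC implementation, which operates on multi-channel tensors and interpolates the response map.

```python
def cross_correlate(template, search, bias=0.0):
    """Slide the template feature map over the search feature map.

    Both inputs are 2-D lists (one channel) to keep the sketch short.
    """
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            # Inner product between the template and the window at (i, j).
            score = sum(
                template[u][v] * search[i + u][j + v]
                for u in range(th)
                for v in range(tw)
            )
            row.append(score + bias)
        response.append(row)
    return response


def argmax_2d(response):
    """Return (row, col) of the largest value in the response map."""
    best = max(
        (value, i, j)
        for i, row in enumerate(response)
        for j, value in enumerate(row)
    )
    return best[1], best[2]


# Toy data (assumed): the template pattern sits at rows 1-2, columns 2-3.
template = [[1.0, 2.0], [3.0, 4.0]]
search = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 3.0, 4.0],
    [0.0, 0.0, 0.0, 0.0],
]
response = cross_correlate(template, search)
peak = argmax_2d(response)  # window offset with the highest similarity
```

The peak of the response map gives the window offset at which the template matches best, which is the quantity the tracker converts into the target position.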
Ying Chao Li: Infrared multi-small target tracking algorithm based on Att-Siam network VOLUME XX, 2017 1

B. Siamese Network Tracking Based on Mutual Attention Guidance
As shown in Figure 2, the tracking framework proposed in this paper (Siamese Cross-attention Network) consists of three parts: a multi-feature extraction network, a mutual attention guidance network, and a regression network. The backbone network is ResNet50, which can extract features from different layers. A mutual attention module is added between the template branch and the search branch so that the two branches can exchange information. After the features of the two branches are cross-correlated, the resulting feature map is fed to the classification network and the anchor-free regression network to compute the tracking result for the current frame.

C. Multiple Feature Extraction Network
For tracking methods based on the Siamese network, the first step is to input the template image and the search image into the backbone network for feature extraction [20]. Because the tracking task places high demands on localization, the convolutional network should not have too many layers, otherwise localization becomes inaccurate; this paper therefore uses ResNet50 as the backbone network [21]. Using the Conv3,

D. Satellite application mode and process analysis
The near-real-time observation capability of remote sensing satellites in geostationary orbit gives them a faster emergency response. Because the current civilian Gaofen-4 satellite mainly serves meteorological, environmental, and disaster reduction applications, ship target detection and tracking needs are not fully considered in the routine task process. As a result, long delays, slow image processing, and unsatisfactory application results prevent the advantage of near-real-time observation from being fully exploited. Analysis shows that the problems in the current routine task process fall into the following aspects [23].
(1) After the satellite data is downlinked, it is processed centrally by the operation and management department, and timeliness is poor. The satellite operation and management department must process the imaging data of the entire orbit centrally to generate standard Level 2 image products and distribute them to users, and processing tasks also wait in a queue. At present, the processing delay is generally on the order of hours.
(2) Image product generation and target detection and tracking are independent of each other, and this serial process increases the delay. The satellite operation and management department packs and transmits the image products of a single mission to the user only after production is complete, and the user starts target detection and tracking only after receiving all image products of that mission. The task process therefore has a large delay, and timeliness is difficult to guarantee.
(3) Users lack an automated, accurate, and efficient application system, and the detection and tracking algorithms need to be verified and optimized. Application systems for ship target detection and tracking with high-orbit remote sensing images are still in the exploratory stage. Many technical difficulties remain [24], such as how to reduce the false alarm rate caused by broken clouds over a cloudy sea, how to improve positioning accuracy given the lack of control point data at sea, and how to avoid losing the target or following the wrong one against a complex marine background. In addition, as the resolution of geostationary remote sensing satellites improves, image targets at different resolutions have different characteristics, and the corresponding detection and tracking algorithms need to be adapted and optimized [25].
(4) The positioning error of ship targets is affected by factors such as satellite positioning accuracy and attitude control accuracy, and track jitter is obvious. Compared with detection methods such as shore-based radar and the ship automatic identification system, geostationary remote sensing satellites have poor positioning accuracy. If only the original image information is used for ship target detection, the heading and speed errors are large and the track jitter is obvious. It is therefore necessary to study techniques for improving positioning accuracy and reducing the heading and speed errors [26].
To sum up, current satellite imaging products take a long time in task response, data transmission, processing, and application, which greatly reduces timeliness; the time from imaging to the user receiving the image product extends to several hours. Because accurate and efficient processing and application systems are lacking, the accuracy of target detection and tracking is difficult to guarantee. The existing satellite application process therefore cannot effectively meet the requirements of ship target detection and tracking tasks [27].

A. Spatial Attention Mechanism
Compared with the stacked channel attention mechanism, the spatial attention mechanism pays more attention to the positional features of the target image in each channel. It uses the relationship between the spatial features of the channels to construct spatial attention, focusing on the most informative part of the target image in each channel, and is complementary to the stacked channel attention mechanism [28]. The spatial attention mechanism adopted by the Att-Siam tracker is shown in Figure 3. It is divided into a context module and a channel conversion module. The context module takes as input the feature maps output by the stacked channel attention mechanism, of size H × W × C, and computes a spatial attention that is the same for all feature channels: it reduces the C channels to a single channel through a 1×1 convolution, applies the Softmax function, multiplies the result with the input, and passes the fused result to the multi-layer channel conversion module. The multi-layer channel conversion module mainly computes spatial attention that differs across channels. First, a 1×1 convolution reduces the number of output channels to C/8; experiments show that feature extraction is best when the number of channels is reduced to C/8. Then, after a BatchNorm layer and a ReLU layer, a further 1×1 convolution is applied, and finally the Sigmoid function is applied to the input
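A minimal sketch of the context module may make the data flow clearer. It rests on assumptions the text does not confirm: the features are stored as C × H × W nested lists, the 1×1 convolution has no bias, the Softmax is taken over all spatial positions, and "multiply with the input" is read as attention-weighted pooling over positions. The name `context_module` and the example values are hypothetical.

```python
import math


def context_module(features, conv_weights):
    """features: C x H x W nested lists; conv_weights: C weights of a 1x1 conv.

    Returns the spatial attention map and one pooled context value per channel.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])

    # 1x1 convolution: collapse the C channels into a single-channel score map.
    scores = [
        [
            sum(conv_weights[c] * features[c][i][j] for c in range(channels))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Softmax over all spatial positions (shifted by the max for stability).
    top = max(max(row) for row in scores)
    exps = [[math.exp(s - top) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    attention = [[e / total for e in row] for row in exps]

    # Multiply the shared attention map with the input and sum over positions.
    context = [
        sum(
            attention[i][j] * features[c][i][j]
            for i in range(height)
            for j in range(width)
        )
        for c in range(channels)
    ]
    return attention, context


# Example with assumed values: 2 channels on a 2 x 2 grid.
features = [
    [[0.0, 1.0], [2.0, 3.0]],
    [[1.0, 1.0], [0.0, 4.0]],
]
attention, context = context_module(features, conv_weights=[0.5, 0.5])
```

Because one attention map is shared by every channel, this part of the module adds only C convolution weights, which is consistent with the stated aim of keeping the extra computation small.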

B. Classification and Regression Networks
In the regression network of trackers based on the traditional Siamese network, anchor boxes are usually used to regress the prediction box [29]. The anchor-based regression mechanism fixes the anchor boxes by presetting a certain number of aspect ratios and then predicts the width and height of the target box by regressing from the anchor box with the highest score. However, the training and tracking performance of the tracker then depends heavily on how well the anchor boxes are designed [30]. In recent years, anchor-free target box regression has developed rapidly in the field of object detection. Its core idea is

C. Att-Siam Tracker Algorithm
The Att-Siam tracker architecture is shown in Figure 1. Unlike previous deep neural network trackers, the Att-Siam tracker uses an AlexNet variant as the feature extraction backbone. This favors the speed of image feature extraction but reduces its accuracy to some extent. To make up for this loss, this research develops a tracker with high performance and strong generalization ability: the convolutional channel attention mechanism is integrated into the first convolution layer and the last five layers of convolution, and two image features are fused with two convolution blocks to better extract image feature information. To reduce the workload of computing parameters, the stacked channel attention mechanism and the spatial attention mechanism are introduced only in the target image branch, where they extract features using the different characteristics within and between channels. The target image branch and the search image branch are then combined through the improved cross-correlation formula, in which q(z) denotes the target image features obtained with the convolutional channel attention mechanism and the fusion of two image features with two convolution blocks, q(x) denotes the fully convolutional network applied to the search image, the attention term denotes the stacked channel attention mechanism and the spatial attention mechanism, and b represents the offset.
Figure: Tracker architecture of Att-Siam

C. Kernel correlation algorithm
To further improve tracking accuracy, Henriques et al. [29] proposed Kernelized Correlation Filters (KCF). The algorithm adds a window function to ensure the periodicity of the image and reduce the boundary effect. It introduces the kernel trick to map the original sample space into a high-dimensional space and uses diagonalization by the discrete Fourier transform to reduce storage and computation, which improves the efficiency of the model. It also replaces the grayscale feature with the directional gradient histogram feature, extracting the texture and directional features of the target to achieve fast and effective tracking. The KCF algorithm builds a complete theoretical system of kernelized filtering: a large number of training samples are first constructed by cyclic shift, and the kernel trick and the Fourier transform are then used to avoid matrix inversion, which greatly reduces the amount of computation and achieves fast and accurate tracking. The approach has since been widely developed and has inspired much follow-up research. The typical process of the KCF algorithm comprises four steps: ridge regression [30], circulant matrix diagonalization, kernel correlation filtering, and fast target detection.
(1) Ridge regression. For the initial linear regression model, the optimization objective function is
$$\min_w \sum_i \left( w^T x_i - y_i \right)^2 + \lambda \lVert w \rVert^2$$
where λ is the fitting control parameter. Taking the derivative of the objective function and setting it to zero, the optimal solution is
$$w = \left( X_s^H X_s + \lambda I \right)^{-1} X_s^H y \tag{6}$$
where X_s is the sample matrix and y is the label.
(2) Circulant matrix. Denote the n×1 vector $x = [x_1, x_2, \ldots, x_n]^T$. To generate negative samples for training the classifier, the one-dimensional cyclic shift operator P is
$$P = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}$$
Multiplying the vector x repeatedly by the cyclic shift operator P yields the circulant matrix C(x) of a one-dimensional image, composed of n vectors. For two-dimensional images, cyclic shifts within the region of interest produce two-dimensional training samples and, finally, a circulant matrix of two-dimensional matrices.
Traditional target tracking algorithms, such as the mean shift algorithm, the Kalman filter algorithm [31], the particle filter algorithm based on Bayesian filtering [32], the TLD (Tracking Learning Detection) algorithm based on target characteristics [33], and the Compressive Tracking (CT) algorithm [34], perform poorly in tracking accuracy and speed. The kernel correlation algorithm inherits the fast operation speed of the CSK algorithm and improves classifier training and sample detection, which greatly improves tracking accuracy and recognition success rate. The gradient histogram feature also gives the algorithm good tracking speed, surpassing other mainstream algorithms of the same period. However, the KCF algorithm tracks only the target position and does not consider changes in target size or occlusion, so the target region obtained in actual tracking can deviate from the true position. In addition, the KCF algorithm generates a large number of positive and negative samples when updating the target model, and it easily loses the target during long-term tracking or when the target size changes. A large body of literature has studied the KCF algorithm further in response to these problems.
To address scale change, target occlusion, and target loss during tracking, Danelljan et al. [35] introduced a scale pyramid into KCF. On top of position estimation, a scale filter is added to detect changes in the target scale during tracking. Compared with the original KCF algorithm, tracking accuracy improves, but the computational complexity of the model increases significantly, so real-time tracking cannot be guaranteed. On this basis, Li et al. [36] proposed an adaptive-scale kernelized correlation filtering algorithm (SAMF), which no longer trains and stores a separate scale filter and uses only one filter, but requires one more feature extraction and Fourier transform. Although the algorithm simplifies the calculation, it still cannot meet the requirements of real-time tracking when the image is large. Wang [37] proposed a KCF algorithm based on convolutional features. The algorithm extracts target features from high-level and low-level layers, uses a convolutional neural network to compute the response maps of the two levels, and fuses the two response maps to estimate the target position in the current frame. The algorithm mitigates the degradation of tracking when the target scale changes, but it still cannot track well when the target is occluded, and its robustness is low. Reference [38] proposed a multi-scale correlation filtering tracking algorithm that uses two ridge regression models, one with strong plasticity and one with strong stability. The model with strong plasticity tracks the target position and builds an image pyramid centered on that position, and the model with strong stability predicts the change in target scale, so the algorithm achieves multi-scale detection and tracking.
Reference [39] proposed Depth Scaling Kernelized Correlation Filters (DSKCF), which extend the RGB tracking algorithm in KCF to an RGB-D tracking algorithm by fusing depth features, mitigating the problems of target scale change and target occlusion. To deal with problems such as losing the target during tracking, Liu et al. [40] used the mean and standard deviation of the peak data in the response map to correct the peak value in each frame, which improved long-term tracking in the KCF algorithm. To deal with the boundary effect caused by the cyclic shift in the KCF algorithm, Danelljan et al. introduced a spatial regularization factor to constrain the filter weights, which effectively alleviated the boundary effect at the expense of the algorithm's speed. Ruan Honggang [30] proposed a fast scale-adaptive kernel filtering algorithm based on sparse features. On the basis of the original KCF algorithm, a Gaussian window with adjustable bandwidth is introduced, the target position is estimated by combining correlation filtering with sparse features, and the scale change is predicted, which better separates the target from the background. The existing KCF algorithm uses a fixed learning rate when updating the model, which is prone to tracking drift. To this end, Asha et al. proposed dynamically adjusting the learning rate, updating it according to the change in target position between consecutive frames. Chen et al. compare the peak value of the response map with a threshold and update the tracking model only when the peak value is greater than the threshold. Wang et al. use the response map to determine whether the target has drifted, but when the target size changes, this leads to inaccurate model updates.
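The threshold-gated update attributed above to Chen et al. can be sketched as follows. This is an illustrative reading only: the model is represented as a flat list of filter coefficients, the update is the usual linear interpolation with a fixed learning rate, and the names `update_model`, `threshold`, and `learning_rate` as well as their default values are hypothetical.

```python
def update_model(model, observation, response_peak,
                 threshold=0.3, learning_rate=0.02):
    """Blend the new observation into the model only for confident frames.

    model, observation: lists of filter coefficients of equal length.
    response_peak: maximum value of the current response map.
    """
    if response_peak <= threshold:
        # Low confidence (possible occlusion or drift): keep the old model.
        return list(model)
    # Fixed-rate linear interpolation between old model and new observation.
    return [
        (1.0 - learning_rate) * m + learning_rate * o
        for m, o in zip(model, observation)
    ]


old_model = [1.0, 1.0]
new_observation = [2.0, 2.0]
kept = update_model(old_model, new_observation, response_peak=0.1)
blended = update_model(old_model, new_observation, response_peak=0.9,
                       learning_rate=0.5)
```

Skipping the update on low-confidence frames keeps an occluded or cluttered frame from contaminating the model, at the cost of adapting more slowly to genuine appearance changes.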
For complex scenes, a single kernel can hardly meet the tracking performance requirements, and multi-kernel fusion is a natural extension of the KCF algorithm. The literature proposes an edge multi-kernel correlation algorithm (LMKCF) for visual tracking. LMKCF mainly uses low-rank tensor learning to alleviate the redundancy and noise of multi-kernel correlation filters during learning and updating, and it establishes a forward-looking learning and updating strategy. The literature proposes a Constrained Multi-Kernel Correlation Tracking Algorithm (CMKCF), which builds a multi-kernel model from multichannel features with three different properties and applies a spatial clipping operator to the half-kernel matrix to account for boundary effects. The literature introduces the Takagi Sugeno Kang-Fuzzy Logic System (TSK-FLS) into the KCF algorithm and proposes a fuzzy kernel correlation filter (FKCF) algorithm. In the FKCF algorithm, the antecedent mapping of the TSK fuzzy system replaces the kernel mapping, and an improved algorithm for the consequent parameters raises the tracking accuracy of FKCF when the target moves violently. The literature fuzzifies the polynomial kernel and the Gaussian kernel to obtain a more robust kernel function and, from the perspective of multi-kernel fusion, proposes a Multiple Fuzzy Kernelized Correlation Filter (MFKCF). The algorithm has higher tracking accuracy and adaptability. To reduce the cumulative error in the tracking process, the literature proposes a dual-kernel adaptive filtering algorithm based on spatiotemporal saliency (KCFSS). It establishes a dual-kernel tracking model between the original target and the salient area, which helps to reduce the cumulative error and fine-tunes the tracked target position.
Inspired by the successful application of the kernel collaboration method in face recognition and other fields, the literature proposes a target tracking algorithm based on kernel collaboration, which maps the dictionary matrix to a high-dimensional space and introduces L2-regularized least mean squares to speed up the calculation and achieve higher tracking accuracy. To address the insufficient image information and target features in infrared target tracking, scholars have also studied the KCF algorithm in depth to ensure its tracking performance in the infrared domain. Battistone et al. adopted local analysis and a structured support vector machine and achieved a good ranking in the infrared target VOT competition. To solve the problem of fixed target size in KCF, Montero et al. proposed the SKCF algorithm, which introduces an adjustable Gaussian window function and a scale estimation model into the KCF algorithm and achieves high accuracy in infrared target tracking. Zheng Wuxing et al. integrated grayscale and saliency features into the KCF algorithm, and the resulting algorithm can be used effectively for tracking airborne infrared targets.
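The cyclic-shift sample construction and the closed-form ridge regression solution used by KCF can be checked numerically with a small sketch. It is written under simplifying assumptions: the data are real-valued and one-dimensional, so the conjugate transpose reduces to the ordinary transpose, and the linear system is solved directly by Gaussian elimination rather than by the Fourier-domain diagonalization that KCF uses for speed. The function names and the toy base sample, labels, and λ are hypothetical.

```python
def cyclic_shifts(x):
    """Rows are x shifted cyclically by 0, 1, ..., n-1 positions (C(x))."""
    n = len(x)
    return [[x[(j - k) % n] for j in range(n)] for k in range(n)]


def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (aug[r][n] - tail) / aug[r][r]
    return w


def ridge_regression(samples, labels, lam):
    """Closed form w = (X^T X + lam * I)^(-1) X^T y for real-valued data."""
    dim = len(samples[0])
    gram = [
        [
            sum(row[i] * row[j] for row in samples) + (lam if i == j else 0.0)
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    rhs = [sum(row[i] * y for row, y in zip(samples, labels)) for i in range(dim)]
    return solve(gram, rhs)


# Toy data (assumed): one base sample, its cyclic shifts, and a peaked label.
base = [1.0, 2.0, 3.0, 4.0]
samples = cyclic_shifts(base)
labels = [1.0, 0.0, 0.0, 0.0]
weights = ridge_regression(samples, labels, lam=0.1)
```

The direct solve costs on the order of n³ operations; because the sample matrix is circulant, the same solution can be obtained element-wise in the Fourier domain, which is the source of the speed advantage described in this section.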

A. Experimental environment and experimental results of UAV dataset
The experiments are run in the PyTorch framework; the environment is shown in the accompanying table. The SWV123 [20] dataset contains 123 challenging drone-shot videos, and it evaluates tracker performance by the AUC (Area Under Curve) score and the accuracy score. The AUC curve is drawn from the overlap ratio between the tracking box given by the tracker and the ground-truth box, reflecting how much the predicted box and the real box overlap. The accuracy curve is drawn from the distance between the center of the tracking box and the center of the ground-truth box, reflecting the positional gap between the two boxes. The algorithm in this paper and recent mainstream tracking algorithms are tested on all 123 videos of the SWV123 dataset, and the test results are evaluated and ranked. The tracking algorithms used for comparison are ECO [21], SiamRPN++ [3], SiamCAR [10], SiamRPN [2], SiamBAN [11], DaSiamRPN [3], ECO-HC [21] and SRDCF [22]. As shown in Figure 5 and Figure 6, the algorithm in this paper ranks second in the success rate score, which is 0.0638. Although this is 0.4% lower than the first-ranked anchor-based algorithm SiamRPN++, the algorithm ranks first in the accuracy score, reaching 0.850 and exceeding SiamRPN++ by 1%, which indicates that the anchor-free regression network is more robust in computing the target position than the anchor-based regression network. In particular, on the size change, fast motion, and similar object challenges across all videos of the SWV123 dataset, the success rate and accuracy of the proposed algorithm (first or second) outperform most mainstream algorithms, indicating that the mutual attention and anchor-free mechanisms introduced here effectively help the tracker overcome target deformation and similar-object interference.
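The two evaluation quantities described above, box overlap and center distance, can be sketched as follows. The sketch assumes boxes in (x, y, width, height) form, 21 evenly spaced overlap thresholds for the success curve, and a 20-pixel threshold for the accuracy score; these conventions are common in tracking benchmarks but are assumptions here, as are the function names and the toy boxes.

```python
import math


def iou(box_a, box_b):
    """Overlap ratio of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(box_a, box_b):
    """Euclidean distance between the two box centers, in pixels."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2),
                      (ay + ah / 2) - (by + bh / 2))


def success_auc(predictions, ground_truth, steps=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(predictions, ground_truth)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [
        sum(o > t for o in overlaps) / len(overlaps) for t in thresholds
    ]
    return sum(rates) / len(rates)


def precision(predictions, ground_truth, pixel_threshold=20.0):
    """Fraction of frames whose center error is within the pixel threshold."""
    errors = [center_error(p, g) for p, g in zip(predictions, ground_truth)]
    return sum(e <= pixel_threshold for e in errors) / len(errors)


# Toy sequence (assumed): one close prediction and one that is far off.
predictions = [(0.0, 0.0, 2.0, 2.0), (10.0, 10.0, 2.0, 2.0)]
ground_truth = [(1.0, 1.0, 2.0, 2.0), (50.0, 50.0, 2.0, 2.0)]
auc_score = success_auc(predictions, ground_truth)
accuracy_score = precision(predictions, ground_truth)
```

The success score rewards correct box size as well as position, while the accuracy score depends only on the center, which is why a tracker can rank differently on the two measures.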

B. Simulation experiment and result analysis
In order to verify the effectiveness of the proposed swarm [...] Figure 7. The tracking results of a single Monte Carlo simulation of the two algorithms on the two-dimensional plane are shown in Figure 8. In Figure 7, the real motion trajectory of the target is represented by a black dotted line, "." represents the real cluster center, the blue "⊙" represents the starting position of the target motion, " → " represents the direction of target motion, and the 8 labels correspond to the individual targets. In Figure 8, " △ " represents the tracking result of the KFDA (Kernel Fisher discriminant analysis) swarm structure model under box particle PHD filtering, and " ○ " represents the tracking result of the swarm evolution network model under box particle PHD filtering. To better compare the two models, we ran 50 Monte Carlo simulations and averaged the results. The estimated number of groups and the OSPA distance are shown in Figure 9 and Figure 10. Figure 9 shows that both models estimate the number of groups fairly accurately, but the data-trained swarm structure model is more stable than the swarm evolution network model, with a smaller fluctuation range. It also adapts better globally and requires no human intervention: once trained, the model can be used directly. By contrast, the group-number estimate of the swarm evolution network model depends on a threshold that has to be found by repeated trials, which is relatively inefficient.
Figure 10 shows that, over the whole tracking process, the OSPA error of the box particle PHD filter based on the KFDA swarm structure model is smaller than that of the box particle PHD filter based on the swarm evolution network model. The KFDA swarm structure model therefore estimates the number of groups more accurately and with less tracking error.
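For readers unfamiliar with the OSPA distance, the following sketch shows how it can be computed between a set of true positions and a set of estimated positions. The cutoff `c` and order `p` used in the paper are not stated in this section, so the defaults below are placeholders, and the brute-force assignment is chosen only to keep the example dependency-free.

```python
import itertools
import math


def ospa(truth, estimate, c=100.0, p=2):
    """OSPA distance between two finite point sets.

    c is the cutoff distance and p the order (both assumed values here).
    The optimal assignment is found by brute force, so this is only
    practical for small sets.
    """
    small, large = (truth, estimate) if len(truth) <= len(estimate) else (estimate, truth)
    m, n = len(small), len(large)
    if n == 0:
        return 0.0  # both sets are empty
    # Localization term: best one-to-one matching of the smaller set into the larger.
    best = min(
        sum(min(c, math.dist(x, large[j])) ** p for x, j in zip(small, perm))
        for perm in itertools.permutations(range(n), m)
    )
    # Cardinality term: each unmatched point costs the full cutoff c.
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)
```

Averaging `ospa(truth_k, estimate_k)` at each time step over the Monte Carlo runs would give a curve of the kind plotted for the two models. For larger sets, an assignment solver such as `scipy.optimize.linear_sum_assignment` would replace the permutation search.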

C. Analysis of ablation experiments
To verify how much each solution contributes to overall performance when background clutter interference and occlusion occur in the tracking of small infrared air targets, an ablation experiment is performed on the test set. The experimental results are shown in Table 1. Starting from SiamFC with tracking state evaluation, adding the method for the background clutter interference problem improves the success rate and accuracy by 20.5% and 10.7% respectively compared with SiamFC; adding the Kalman filter position prediction method for the occlusion problem improves them by 24.7% and 6.8% respectively; and adding both methods, for background clutter interference and for occlusion, improves them by 32.4% and 21.8% respectively. This shows that each solution improves the tracking of small infrared air targets, and that the improvement is greatest when the solutions are combined. Although the tracking speed drops, the algorithm still has high real-time performance. The accuracy of the tracking state judgment has a great influence on the subsequent processing of the different states, so the selection of the relevant parameters a1, a2, a3 and a4 used to judge the target tracking state is very important.
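The idea of Kalman filter position prediction under occlusion can be illustrated as follows. This is a sketch under stated assumptions rather than the configuration used in the paper: it assumes a constant-velocity motion model applied independently to each image axis, and the noise variances and initial covariance are placeholder values. The class and function names are hypothetical.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one image axis with state (position, velocity)."""

    def __init__(self, position, process_var=1.0, measurement_var=4.0):
        self.x = float(position)  # position estimate
        self.v = 0.0              # velocity estimate
        # Covariance [[p_xx, p_xv], [p_xv, p_vv]] (assumed initial values).
        self.p_xx, self.p_xv, self.p_vv = 10.0, 0.0, 10.0
        self.q = process_var      # process noise added to both diagonal terms
        self.r = measurement_var  # measurement noise of the observed position

    def predict(self, dt=1.0):
        """Propagate the state one frame: x <- x + v*dt, P <- F P F^T + Q."""
        self.x += self.v * dt
        p_xx = self.p_xx + 2 * dt * self.p_xv + dt * dt * self.p_vv + self.q
        p_xv = self.p_xv + dt * self.p_vv
        p_vv = self.p_vv + self.q
        self.p_xx, self.p_xv, self.p_vv = p_xx, p_xv, p_vv
        return self.x

    def update(self, measured_position):
        """Correct the state with an observed position (H = [1, 0])."""
        s = self.p_xx + self.r
        k_x, k_v = self.p_xx / s, self.p_xv / s
        residual = measured_position - self.x
        self.x += k_x * residual
        self.v += k_v * residual
        p_xx = (1 - k_x) * self.p_xx
        p_xv = (1 - k_x) * self.p_xv
        p_vv = self.p_vv - k_v * self.p_xv
        self.p_xx, self.p_xv, self.p_vv = p_xx, p_xv, p_vv


def fill_occlusions(observations, process_var=1.0, measurement_var=4.0):
    """Estimate a 2-D track; None entries mark occluded frames filled by prediction."""
    first = observations[0]
    if first is None:
        raise ValueError("the first frame must contain an observation")
    kx = ConstantVelocityKalman1D(first[0], process_var, measurement_var)
    ky = ConstantVelocityKalman1D(first[1], process_var, measurement_var)
    track = [(kx.x, ky.x)]
    for obs in observations[1:]:
        kx.predict()
        ky.predict()
        if obs is not None:  # target visible: correct with the tracker output
            kx.update(obs[0])
            ky.update(obs[1])
        track.append((kx.x, ky.x))
    return track
```

In a tracker, the `None` entries would correspond to frames that the tracking state evaluation flags as occluded; the predicted position then stands in for the unreliable response-map peak until the target reappears.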
To examine the influence of these parameters on the performance of the algorithm, the SiamFC algorithm is first run on part of the infrared data set, and the APCE value and the maximum peak value of the response map are recorded for each frame during the test. Combined with an analysis of the tracking state, the parameters are preliminarily set to a1=0.45, a2=0.57, a3=0.12 and a4=0.86. Then a1 is varied between 0 and 1 with the remaining 3 parameters fixed, and the performance of the algorithm on the infrared test set is measured. As shown in Figure 11, the success rate is highest when a1=0.54, so a1 is changed to 0.54. The values of a2, a3 and a4 are determined in turn by the same method. Figure 11 shows that the success rate is highest when a1=0.54, a2=0.67, a3=0.09 and a4=0.87, indicating that the judgment of the current tracking state is most appropriate, and the tracking effect best, at these values.
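The two ingredients of this tuning procedure can be sketched as follows. The `apce` function follows the commonly used definition of the average peak-to-correlation energy of a response map. How a1 to a4 enter the state judgment is not specified in this section, so the search routine only shows the one-parameter-at-a-time procedure; `evaluate` is a hypothetical callable that would return the success rate on the infrared test set for a given parameter setting.

```python
def apce(response):
    """Average peak-to-correlation energy of a 2-D response map (list of rows).

    APCE = (F_max - F_min)^2 / mean((F - F_min)^2); a sharp single peak
    gives a high value, a flat or multi-peak map a low one.
    """
    values = [v for row in response for v in row]
    f_max, f_min = max(values), min(values)
    mean_sq = sum((v - f_min) ** 2 for v in values) / len(values)
    return (f_max - f_min) ** 2 / mean_sq if mean_sq > 0 else 0.0


def sweep_one_at_a_time(evaluate, initial, grid):
    """Tune parameters in order, varying one while the others stay fixed.

    evaluate: hypothetical callable mapping a parameter dict to a score
              (for example the success rate on a test set).
    initial:  dict of starting values, e.g. {"a1": 0.45, "a2": 0.57, ...}.
    grid:     candidate values to try for each parameter, e.g. 0.00 to 1.00.
    """
    params = dict(initial)
    for name in initial:
        best_value, best_score = params[name], evaluate(params)
        for candidate in grid:
            trial = dict(params)
            trial[name] = candidate
            score = evaluate(trial)
            if score > best_score:
                best_value, best_score = candidate, score
        params[name] = best_value  # freeze before moving to the next parameter
    return params
```

Because each parameter is frozen before the next is searched, the procedure costs one evaluation per grid point per parameter, but it can miss settings where two parameters need to change together.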

E. Application case study
Since the above ideas have been applied in a practical project, the following application case uses an integrated data processing and application system based on the imaging data of the Gaofen-4 satellite to test and verify target detection and tracking at sea. The Gaofen-4 satellite images with an area-array staring method and has visible light, multi-spectral and infrared imaging capabilities. The resolution is up to 50m in the visible light band and up to 400m in the infrared band. By controlling and adjusting the imaging angle of the camera, the satellite can completely cover China and surrounding areas. To expand the application modes of the Gaofen-4 satellite, in this experiment the satellite images designated areas of the Yellow Sea and the Bohai Sea in the visible light and near-infrared spectrum. In practical applications, the inter-frame interval is set to 1min. From the moment the satellite starts shooting the target area to the generation of the initial ship target track, the total delay is within 6 minutes, and the rolling update delay of the target track is less than 1 minute. The timeliness of the whole process is shown in Table 3.

Table 3. Timeliness of ship target detection and tracking tasks
Task content                                          Time
Gaofen-4 satellite starts shooting                    T0
The original code stream is received on the ground    T0+53s
1st frame 1A image generation                         T0+2min41s
Ship target detection in the first frame              T0+3min29s
3-frame continuous track generation                   T0+5min8s
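The per-stage latencies implied by Table 3 can be worked out with a short script. The task names and time stamps are copied from the table; the parsing helper and its name are illustrative only.

```python
import re

# Rows of Table 3: (task, time stamp relative to the start of shooting T0).
TIMELINE = [
    ("Gaofen-4 satellite starts shooting", "T0"),
    ("The original code stream is received on the ground", "T0+53s"),
    ("1st frame 1A image generation", "T0+2min41s"),
    ("Ship target detection in the first frame", "T0+3min29s"),
    ("3-frame continuous track generation", "T0+5min8s"),
]


def offset_seconds(stamp):
    """Convert a stamp such as 'T0+2min41s' into seconds after T0."""
    match = re.fullmatch(r"T0(?:\+(?:(\d+)min)?(?:(\d+)s)?)?", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return 60 * minutes + seconds


def stage_latencies(timeline):
    """Seconds spent between each task and the one before it."""
    offsets = [offset_seconds(stamp) for _, stamp in timeline]
    return [
        (task, later - earlier)
        for (task, _), earlier, later in zip(timeline[1:], offsets, offsets[1:])
    ]


if __name__ == "__main__":
    for task, seconds in stage_latencies(TIMELINE):
        print(f"{seconds:4d} s  {task}")
    print(f"{offset_seconds(TIMELINE[-1][1]):4d} s  total")
```

The stages take 53 s, 108 s, 48 s and 99 s, for a total of 308 s, which is consistent with the stated bound of 6 minutes for generating the initial ship target track.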
After the experimental verification system was deployed and tested, the time from satellite imaging to the acquisition of target tracking information was reduced, compared with the traditional process, to the minute level, which greatly reduces the delay of ship target detection and tracking applications. Continuous tracking and monitoring were carried out to verify the efficiency of the application modes and processes and their availability in typical application scenarios. The target detection and trajectory extraction process is shown in Figure 12. The target slice is set to 20km × 20km, and the position coordinates of its center point are (123.327°E, 36.7824°N). Figure 12(a) shows the first frame of 1A-level images of the target area, captured in the 50m near-infrared spectrum of the Gaofen-4 satellite; Figure 12(b) shows that, after slicing the image, a total of 3 targets were detected in the red frame area; Figure 12(c) shows the target trajectories generated after tracking the targets continuously for 6 frames.
Figure 12. Target detection trajectory extraction of remote sensing satellite
For intuition, Figure 13 shows the tracking results of our method on four groups of video sequences. In sequence 1, the lens jitters at frame 000022 and causes tracking failure, after which the method in this paper can only continue to track some of the targets. In sequence 2, when the main target separates and the number of targets to be tracked increases sharply (which easily causes tracking failure), the method can still track most of the targets stably. In sequence 3, the method still tracks stably when the target undergoes a large deformation. The method deviates when tracking small targets, but can still keep track of the main target.

V. Conclusion
This paper proposes an improved SiamFC infrared small target tracking algorithm. A mutual attention module is designed to realize information flow between the two branches and, combined with spatial attention, lets the tracker network autonomously focus on key regions of the target features. Comparative experiments on public datasets show that the algorithm tracks better than other mainstream twin network tracking algorithms under similar interference, fast motion, scale change and background interference. Comparison with other classical tracking algorithms on the infrared test set shows that the proposed algorithm performs well in tracking small infrared air targets and remains stable when the target is disturbed or occluded by complex backgrounds. It also meets the requirements of real-time tracking.