Efficient Real-Time Tracking of Satellite Components Based on Frame Matching

To obtain a satellite's in-orbit attitude information, it is necessary to track the satellite's components in video sequences. To address low illumination and target occlusion in the space environment, we propose an efficient satellite component tracking technique based on Rethinking Space-Time Networks with Improved Memory Coverage (STCN). We classify the pixels in the query frame with a feature matching network that establishes correspondences between frames. Unlike STCN, we reduce the contribution of the background region in feature matching and enhance the robustness of the model in low-illumination environments, thus improving the segmentation results. For targets lost due to the overturning and occlusion of satellite components, a position information encoder module is designed to further raise the tracking performance of the model. In addition, we present a local matching module to upgrade the existing feature matching methods. Experiments demonstrate that, compared to STCN, our method improves the tracking performance (J&F) by 10.1% and achieves multi-object recognition at 15+ FPS.


I. INTRODUCTION
With the rapid development of spacecraft technology, the tasks of target satellite identification, tracking and attitude estimation are attracting more and more attention. Developing information acquisition and processing technology for satellites and other aircraft has become an important trend in satellite technology in many countries [1], [2], [3]. Among these technologies, target detection and recognition, whose main purpose is to accurately identify the types of space targets and effectively invert target attributes such as satellite geometric size, is an important prerequisite for satellite docking.
The associate editor coordinating the review of this manuscript and approving it for publication was Rosario Pecora.
In recent years, vision-based satellite component tracking and detection technology has attracted much attention because of its advantages of simple implementation and low power consumption. In practical applications, video sequences are the mainstay. There are three main problems in tracking and detecting satellite components:
(1) When the satellite is in orbit, its illumination intensity changes constantly with its orbital position. In particular, when the satellite moves behind the Earth, the illumination is low and the local components of the satellite are difficult to observe in the video sequence.
(2) While local components are being tracked and detected, the satellite may turn over or be occluded, resulting in the loss of the tracked target and a decline in detection accuracy.
(3) Because the target satellite is always in motion, the images captured by the imaging equipment may shake and blur, resulting in unclear imaging that interferes with the tracking and detection of local components.
These problems pose great challenges to the tracking and detection of satellite local components.
In this paper, the satellite tracking task is realized by video object segmentation (VOS) [6], [7], [8] of satellite components. VOS is a computer vision task with important applications in many fields. In this work, we focus on semi-supervised VOS of satellite components, in which the ground truth segmentation masks of one or more objects are given for the first frame of the video. With the rapid development of deep learning and the introduction of the DAVIS dataset [4], [5], semi-supervised VOS has made great progress in recent years. Many early studies used online learning strategies that fine-tune the network on the given first-frame mask [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. These methods are accurate, but inference takes a long time.
While maintaining accuracy, more recent works are faster than the online fine-tuning methods. Space-time memory networks (STM) [26] were the first to introduce a memory network, which improves the accuracy of video segmentation by storing the features of historical frames, and also introduced the concept of global matching. Collaborative video object segmentation by foreground-background integration (CFBI) [23] improves segmentation by comparing foreground and background features and works well on multiscale targets. Fast end-to-end embedding learning for video object segmentation (FEELVOS) [24] uses global and local matching mechanisms for each frame and segments the current frame through information propagation. Kernelized memory network (KMN) [25] pre-trains the model on static images and introduces a Gaussian kernel to enhance the effectiveness of the memory network. Memory-augmented self-supervised tracker (MAST) [27] proposes a self-supervised training model, which achieves the same performance as the supervised methods without any annotations. Mask Selection Network (MSN) [28] uses temporal consistency to process the video sequence forward and backward, and uses the difference between the resulting masks to correct the network, which effectively suppresses noise and achieves extremely high accuracy. Reliable Propagation-Correction Modulation for Video Object Segmentation (RPCMVOS) [29] corrects the noise propagating in the network from the local correlation frame and the global correlation frame by introducing a propagation modulator and a correction modulator. Multimodal Transformers (MTTR) [30] uses the transformer to model the VOS task as a sequence prediction problem, which greatly simplifies the model and achieves impressive results on multiple metrics. The above methods achieve impressive accuracy, but they have difficulty meeting the real-time requirement of satellite component tracking.
Learning fast and robust target models (FRTM) [31] adopts a new network structure composed of two lightweight modules, which combines online and offline learning and achieves a high frame rate with good performance. Hierarchical Memory Matching Network (HMMN) [32] proposes a new memory module, which effectively utilizes temporal smoothness, classifies the memory, and realizes more accurate memory matching. Rethinking Space-Time Networks with Improved Memory Coverage (STCN) [21] forms an efficient and robust framework by establishing correspondences between frames. However, this model is not robust to the overturning and occlusion of satellite components. Pixel-Level Bijective Matching for Video Object Segmentation (BMVOS) [22] introduces a bijective matching mechanism to give every pixel a chance to contribute. Although this method is fast, its accuracy is insufficient for aerospace requirements.
In this work, we take the STCN [21] network, which combines the best accuracy and speed, as our backbone network. STCN is a simple, effective and efficient framework for video object segmentation. We propose an attention module with shared weights to improve the ability of the model to detect satellite components in low-illumination environments. Through this attention module, we can extract the features of satellite components in the image more effectively and distinguish the foreground from the background. To enhance the tracking accuracy of the model under overturning and occlusion of satellite components, we introduce a position information coding scheme and propose a local matching module based on the transformer [33]. Capturing the position information of satellite components in the video images effectively avoids the target loss that occurs when overturning and occlusion change the gray scale, shape and other features of the components. Because adjacent frames in a video sequence change little, our local matching module matches features from adjacent frames, which brings better performance and more efficient memory usage.
The contributions of this paper are summarized as follows:
1. We propose a video object segmentation model for satellite component tracking, which is more robust than STCN to low illumination in the space environment and to the overturning and occlusion of satellite components.
2. We propose three simple and efficient modules: an attention sharing module, a position information encoder and a local matching module.
3. Our network can reach the speed of 15+ FPS while maintaining high efficiency.

II. SATELLITE COMPONENT TRACKING METHOD
Given the satellite component mask of the first frame of a video sequence, we process the satellite component masks of subsequent frames in sequence. We store the features of keyframes through memory storage, compare the features of the current frame with those of keyframes, and then classify each pixel of the current frame.

A. OVERALL FRAMEWORK
The overview of our framework is illustrated in Figure 2. We adopt STCN [21] as the baseline. As in STCN, we use ResNet-50 [34] as the key encoder to extract image features and ResNet-18 as the value encoder to extract mask features. ResNet, as a classic feature extraction network, is more robust to few-shot learning than the more recently proposed Swin-Transformer [49] and EfficientNet [50]. In addition, image features are more complex and more difficult to extract than mask features, so a deeper network is needed for them. We add the attention module to the key encoder, which helps distinguish the foreground from the background in the image and focuses the subsequent feature matching on the features of satellite components. We add a position information encoder to the value encoder. This module provides the position of the corresponding satellite component in the frame. By sending the mask together with the position information into the value encoder, a more accurate mask feature map can be obtained. Figure 3 shows the inference process of our method. The memory in the figure contains the feature maps of the keyframes and the previous frame. The position information is embedded into the feature maps of the memory bank through the position information module. The current frame is passed through the encoder to obtain its feature map, which is then sent to the attention sharing module and matched locally and globally against the feature maps in the memory. Finally, the matched result is sent to a decoder to obtain a mask.
We send the feature maps of the keyframes and of the previous frame into the global matching module and the local matching module, respectively. The query features of the current frame are compared with the keyframe features in global matching and with the previous-frame features in local matching, searching over all possible key feature combinations. The final matching score is obtained by linearly adding the global matching result and the local matching result. The local matching module is introduced in Section II-D.
Global feature matching computes the correlation degree $S$ between the memory keys and the query keys from their L2 (Euclidean) distance. Here $M$ represents the keyframes in memory, which we collectively call the memory frame, $Q$ represents the current frame, $K$ represents the feature matrix obtained by the key encoder, and $i$ and $j$ respectively represent the positions of the memory frame and the current frame in the video sequence. This yields the matching matrix between the memory frame and the current frame. The result is then obtained by matrix multiplication of the matching matrix with $V_M$, the mask feature matrix in the memory module. Finally, the resulting memory feature matrix $V_I$ is passed to the decoder to generate a mask.
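The global matching step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes, following the STCN baseline, that the correlation is the negative squared Euclidean distance between keys and that the matching matrix is a softmax over the memory positions. The function name `global_matching` and the flattened `(channels, positions)` layout are hypothetical choices made for the example.

```python
import numpy as np

def global_matching(k_mem, k_qry, v_mem):
    """Read memory values for every query position.

    k_mem: (Ck, N_mem) memory keys, k_qry: (Ck, N_qry) query keys,
    v_mem: (Cv, N_mem) memory mask features.
    Returns the readout (Cv, N_qry) and the matching matrix (N_mem, N_qry).
    """
    k_mem = np.asarray(k_mem, dtype=float)
    k_qry = np.asarray(k_qry, dtype=float)
    v_mem = np.asarray(v_mem, dtype=float)
    # Assumed correlation: negative squared L2 distance for every key pair.
    diff = k_mem[:, :, None] - k_qry[:, None, :]        # (Ck, N_mem, N_qry)
    similarity = -np.sum(diff ** 2, axis=0)              # (N_mem, N_qry)
    # Assumed normalization: softmax over the memory dimension.
    similarity -= similarity.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(similarity)
    weights /= weights.sum(axis=0, keepdims=True)
    # Matrix multiplication of mask features with the matching matrix.
    return v_mem @ weights, weights
```

A query key that coincides with one memory key and lies far from the others reads back almost exactly the mask feature stored at that memory position.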

B. ATTENTION SHARING MODULE
At present, the attention mechanism [33] is widely used in various fields of deep learning. It is used in most unsupervised VOS tasks [35], [36], [37], [38], [39], but rarely in semi-supervised tasks. The images captured during the satellite docking process are characterized by a simple background and low illumination. Adding an attention module effectively distinguishes the foreground from the background in the video and enhances the robustness of the model in low-illumination environments. Our attention sharing module is shown in Figure 4. We take the maximum, mean and variance of the feature matrix obtained by the key encoder along the channel dimension, which gives a new feature matrix. This feature matrix passes through a 7 × 7 convolution to produce a single-channel weight matrix, which represents the attention distribution probability of the frame. Finally, the weight matrix is multiplied element-wise with the key features and with the corresponding mask features.
In Figure 5, the abscissa represents the number of pixels, and the ordinate represents the variance of the corresponding pixels along the channel direction. The red dotted line represents the pixels of satellite components, and the blue solid line represents the background pixels. The figure clearly shows that the feature variance of satellite components is generally larger than that of the background. Figure 6 shows our attention maps for three groups of images. The left side of the figure shows the satellite image, and the right side shows the corresponding attention heat map, i.e., the probability map of the attention distribution. We use the variance of the feature map along the channel dimension as one of the output features because the background of a satellite image is simple, so the variance of its features is small, while the features of the satellite components are complex, so their variance is large. Using the variance as an output feature therefore effectively improves the ability of the model to distinguish the foreground from the background.
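The attention sharing computation described in this section can be sketched as below. This is a hedged NumPy illustration under several assumptions that the text does not state: the convolution is zero-padded with stride 1, a sigmoid maps the convolution output to a probability, and the key and mask feature maps share the same spatial size. All function names are hypothetical.

```python
import numpy as np

def channel_statistics(feat):
    """feat: (C, H, W) -> (3, H, W) with per-pixel max, mean and variance over channels."""
    return np.stack([feat.max(axis=0), feat.mean(axis=0), feat.var(axis=0)])

def conv2d_same(x, kernel):
    """Single-output-channel convolution. x: (Cin, H, W), kernel: (Cin, k, k) -> (H, W)."""
    _, k, _ = kernel.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding (assumption)
    h, w = x.shape[1:]
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            window = xp[:, dy:dy + h, dx:dx + w]
            out += np.einsum("c,chw->hw", kernel[:, dy, dx], window)
    return out

def attention_weights(key_feat, kernel):
    """Single-channel attention map in (0, 1); the sigmoid is an assumption."""
    logits = conv2d_same(channel_statistics(key_feat), kernel)
    return 1.0 / (1.0 + np.exp(-logits))

def apply_shared_attention(key_feat, mask_feat, kernel):
    """Multiply the same weight map into the key features and the mask features."""
    w = attention_weights(key_feat, kernel)
    return key_feat * w[None], mask_feat * w[None]
```

Sharing one weight map between the two feature streams is what the name "attention sharing" suggests; with a 7 × 7 kernel of shape `(3, 7, 7)` the module adds only 147 weights in this sketch.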

C. POSITION INFORMATION ENCODER
In the tracking of satellite components, the components often turn over or become occluded. Feature matching that relies entirely on appearance information is easily disturbed by such visual changes. To alleviate this problem, we propose a position information encoder. A deep convolutional neural network can itself encode absolute position information to some degree [40], [41], [42], but this ability is very limited. In machine translation, the relative position information in a sequence is effectively encoded by an extended self-attention mechanism [37], and in object detection, position information has been introduced into the transformer [44]. However, there is no position information coding method suited to satellite component tracking. Our first position code follows [44] and is composed of sinusoidal functions with different frequencies. The difference is that we do not embed the position code by linear addition. To fully retain the original mask features, we embed the position code by concatenating the mask with the position information. The other position information is embedded in the same way. The relative position embedding is a learnable position information matrix with 64 channels, which is resized to the size of the mask feature map by linear interpolation. The coordinate position embedding consists of three coordinate matrices, which respectively represent the position change along the X axis, the position change along the Y axis, and the position change in polar coordinates with the center as the origin. In our position information embedding module, these embeddings are combined with the mask using ⊕, which denotes concatenation along the channel dimension. By embedding the above three position information matrices, the position information of the satellite components in the current frame can be obtained effectively, and a more accurate feature matrix can be provided for the subsequent feature matching.
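The channel-wise concatenation of position information can be illustrated with the NumPy sketch below. It is a sketch under stated assumptions rather than the paper's module: the sinusoidal code uses 8 channels and base 10000, the polar coordinate matrix is taken to be the radius from the image center, coordinates are normalized to [-1, 1], and the learnable 64-channel relative embedding is passed in already resized. All function names are hypothetical.

```python
import numpy as np

def sinusoidal_embedding(h, w, channels=8):
    """2D sinusoidal code of shape (channels, h, w); channels must be divisible by 4."""
    quarter = channels // 4
    freqs = 1.0 / (10000 ** (np.arange(quarter) / quarter))    # assumed frequency schedule
    ys = np.arange(h)[None, :, None] * freqs[:, None, None]    # (quarter, h, 1)
    xs = np.arange(w)[None, None, :] * freqs[:, None, None]    # (quarter, 1, w)
    ys = np.broadcast_to(ys, (quarter, h, w))
    xs = np.broadcast_to(xs, (quarter, h, w))
    return np.concatenate([np.sin(ys), np.cos(ys), np.sin(xs), np.cos(xs)])

def coordinate_embedding(h, w):
    """Three maps: x, y, and the radius from the center (assumed polar term)."""
    y, x = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    return np.stack([x, y, np.sqrt(x ** 2 + y ** 2)])

def embed_positions(mask, relative_embedding):
    """Concatenate mask (1, h, w) with the three position embeddings along channels."""
    h, w = mask.shape[1:]
    return np.concatenate([
        mask,
        sinusoidal_embedding(h, w),
        relative_embedding,            # learnable in the paper; a fixed array here
        coordinate_embedding(h, w),
    ])
```

Concatenation keeps the mask channel untouched, which matches the stated goal of retaining the original mask features; with these sizes the output has 1 + 8 + 64 + 3 = 76 channels.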

D. LOCAL MATCHING MODULE
Global matching compares the feature information of the keyframes in a video sequence with that of the current frame, while local matching compares the feature information of each position in the current frame with its spatial-temporal neighborhood. Global matching is simple to implement and fast at inference. Moreover, because there are many keyframes, global matching is robust to wrong feature matches in some of the keyframes. However, global matching has no notion of temporal consistency.
If the appearance of the segmentation target is similar to that of the background, or different segmentation targets look alike, wrong segmentation results are likely. This is fatal to the tracking of satellite components, because segmentation objects with very similar appearance often occur, such as two very similar solar panel wings or antennas. Local matching mainly focuses on the information in the spatial-temporal neighborhood of each position in the current frame. Because the image, and especially the position information, changes little between adjacent video frames, local matching handles similar targets more efficiently. To enhance the detection accuracy of the model for targets with similar appearance, we not only add the position information encoding module but also propose the local matching module. Several existing works also use local matching [23], [24], [43] or optical flow [45] to improve segmentation accuracy. However, few works use the transformer to design a local matching module. In our local matching module, $f_Q$ represents the feature map of the current frame, $f_M$ represents the feature map of the memory frame, $v_M$ represents the memory mask feature map, $S_{loc}$ represents the local affinity matrix, and $k_Q$ and $k_M$ represent the feature vectors of the current frame and the memory frame, respectively. The current frame $f_Q$ and the memory frame $f_M$ are passed through the key encoder to obtain feature matrices, from which $k_Q$ and $k_M$ are obtained through two fully connected layers, respectively. As in the transformer, $k_Q$ and $k_M$ serve as the query and the key, respectively. $p(i)$ represents the spatial neighborhood in the memory frame centered on pixel $i$ of the query frame. $P(k_Q)$ and $v_p$ are the relative position embeddings [37] in the local affinity matrix and the memory mask feature map, respectively. The final local matching result is obtained by multiplying the local affinity matrix with the mask feature map.
VOLUME 10, 2022
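The neighborhood-restricted matching described above can be sketched as follows. This NumPy example is only an illustration under assumptions: the affinity is a scaled dot product normalized by a softmax inside a square window of hypothetical size `radius`, and the relative position terms $P(k_Q)$ and $v_p$ are omitted for brevity.

```python
import numpy as np

def local_matching(k_qry, k_mem, v_mem, radius=1):
    """Windowed attention between the current frame and the previous frame.

    k_qry, k_mem: (Ck, H, W) query and memory keys; v_mem: (Cv, H, W) mask features.
    Each query pixel attends only to memory pixels within `radius` of its position.
    """
    ck, h, w = k_qry.shape
    cv = v_mem.shape[0]
    out = np.zeros((cv, h, w))
    for y in range(h):
        for x in range(w):
            # Spatial neighborhood p(i), clipped at the image border.
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            keys = k_mem[:, y0:y1, x0:x1].reshape(ck, -1)   # (Ck, n)
            vals = v_mem[:, y0:y1, x0:x1].reshape(cv, -1)   # (Cv, n)
            # Assumed affinity: scaled dot product, softmax over the window.
            scores = k_qry[:, y, x] @ keys / np.sqrt(ck)
            scores = np.exp(scores - scores.max())
            out[:, y, x] = vals @ (scores / scores.sum())
    return out
```

Because each pixel compares against at most `(2 * radius + 1) ** 2` memory positions instead of all of them, the memory cost grows linearly with the number of pixels, which is consistent with the memory-efficiency argument made for local matching.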

III. EXPERIMENTS
In this section, we describe the experimental results of this study. Section A introduces our datasets and evaluation metrics. Section B describes our training details. Section C shows the performance of our model in different scenarios and compares it with the latest methods. To verify the effectiveness of each proposed module, we conduct extensive ablation experiments in Section D. Section E gives the detailed running time of each component of the model.

A. DATASETS AND EVALUATION METRICS
The satellite image dataset we use comes from Systems Tool Kit (STK), which is the world's top satellite simulation software, produced by AGI in the United States.
We use STK version 10. We mainly use the high-precision visual simulation module of STK, which provides users with high-fidelity visualization of space scenes. We collected 18 video sequences as our training set and 6 video sequences as our validation set. Each video sequence contains one satellite and 40 to 60 satellite images. Figure 8 shows some images from the dataset. To ensure the validity of our dataset when simulating the satellite docking scene, it includes many images in which the satellite turns over or is occluded. We choose the solar wing and the antenna as our tracking objects. First, these two components exist in almost all satellites, so using them to estimate satellite attitude during docking is widely applicable. Second, solar wings and antennas always come in pairs. Many of the most advanced current models have difficulty distinguishing two similar tracking objects, yet tracking similar objects is very common in satellite component tracking.
We use J&F, an evaluation index commonly used in VOS. The J score is calculated as the average Intersection over Union (IoU) between the prediction mask and the ground truth mask, which describes the accuracy over the whole mask area. The F score is calculated as the average boundary similarity between the prediction mask and the ground truth mask, which describes the accuracy of the object boundary. J&F is the average of the two. The evaluation metrics are defined as follows:

$$J = \frac{|S_p \cap S_{GT}|}{|S_p \cup S_{GT}|}, \qquad F = \frac{2 B_p B_r}{B_p + B_r}, \qquad J\&F = \frac{J + F}{2}$$

where $S_p$ represents the prediction result of the model and $S_{GT}$ represents the ground truth. $B_p$ and $B_r$ respectively represent the precision and recall of the prediction mask relative to the ground truth, and are defined as follows:

$$B_p = \frac{P_T}{P_{all}} \tag{12}$$

$$B_r = \frac{P_T}{P_{GT}} \tag{13}$$

where $P_T$ is the number of boundary elements correctly predicted by the model, $P_{all}$ is the total number of boundary elements predicted by the model, and $P_{GT}$ is the total number of boundary elements in the ground truth.

TABLE 1. Satellite component dataset. We annotated thousands of images for testing and training. All our video sequences contain overturning of the satellite components. In the table, overturning of satellite components is defined as a change in the color information of the tracked object within a video sequence, and occlusion of satellite components is defined as the tracked object being absent from some frames of a video sequence.
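The metrics defined in this section can be computed as in the sketch below. The IoU and precision/recall definitions follow the text; the use of the harmonic mean for F follows the common DAVIS convention, and the handling of empty masks and zero counts is an assumption added for the example. Extracting the boundary element counts from masks is outside the scope of the sketch, so they are passed in directly.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumption)
    return float(np.logical_and(pred, gt).sum() / union)

def boundary_f(p_true, p_all, p_gt):
    """F from boundary counts: correct (P_T), predicted (P_all), ground truth (P_GT)."""
    precision = p_true / p_all if p_all else 0.0   # B_p
    recall = p_true / p_gt if p_gt else 0.0        # B_r
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(j, f):
    """J&F: the mean of region similarity and boundary accuracy."""
    return (j + f) / 2
```

For instance, a prediction that covers the single ground truth pixel plus one extra pixel has J = 0.5, since the intersection contains one pixel and the union two.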

B. TRAINING DETAILS
We train our model in PyTorch [47] on an 11 GB 2080Ti GPU with the Adam optimizer [46]. During data preprocessing, we first reduce the short edge of the image to 480 pixels, which effectively speeds up training and inference with little impact on accuracy. We then randomly flip the image horizontally and apply color jitter. We use a batch size of 4 and train for 3000 iterations. In each iteration, we select three temporally ordered frames from a video sequence, with the first frame as the starting frame, to form a training sample for global matching. Then, the frame preceding each of the last two frames is taken as a training sample for local matching, so a total of five images are taken from the video sequence. We first use the first frame for global matching and the frame preceding the second frame for local matching to predict the second frame. We then use the first and second frames for global matching, together with the frame preceding the third frame for local matching, to predict the third frame.
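The frame selection for one training sample can be sketched as below. This is one possible reading of the description, not the authors' sampler: it assumes the three global frames are drawn at random and kept at least two frames apart so that the five selected images are distinct. The function name `sample_training_frames` is hypothetical.

```python
import random

def sample_training_frames(num_frames, rng=random):
    """Pick frame indices for one training sample.

    Returns (global_frames, local_frames): three increasing indices for global
    matching and the frames immediately preceding the last two of them for
    local matching.
    """
    if num_frames < 5:
        raise ValueError("need at least 5 frames for five distinct images")
    # Draw three distinct indices, then shift them so that consecutive global
    # frames are at least two apart (assumption, keeps all five images distinct).
    a, b, c = sorted(rng.sample(range(num_frames - 2), 3))
    t0, t1, t2 = a, b + 1, c + 2
    return (t0, t1, t2), (t1 - 1, t2 - 1)
```

With these indices, frame `t1` is predicted from `t0` (global) and `t1 - 1` (local), and frame `t2` from `t0` and `t1` (global) and `t2 - 1` (local).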
The momentum parameters of our Adam optimizer are set to $\beta_1 = 0.9$ and $\beta_2 = 0.999$, the base learning rate is $10^{-5}$, and the L2 weight decay is $10^{-7}$. The learning rate decays with a ratio of $\gamma = 0.1$. We use cross entropy as our loss function. In addition, during training we back-propagate only the top-p% of pixels with the highest loss. p is 100 for the first 1000 iterations, decreases linearly to 15 between iterations 1000 and 2000, and then remains unchanged. Our model does not need any hyperparameters to be set at inference. Figure 9 shows the loss curves of the three methods during training. All three methods learn efficiently. However, after 1000 iterations, the loss of our model is clearly lower than that of the other two, which suggests that our model is better suited to satellite component tracking. Because it is difficult to obtain a large number of labeled satellite images, we use transfer learning to address the few-shot learning problem. Specifically, we use the pre-trained STCN [21] parameters to initialize the backbone of the network. The attention module, position information module and local matching module are initialized from a standard normal distribution. Moreover, a Siamese network [48] is used in the backbone structure of our network, which is more robust to few-shot learning. Figure 10 shows the influence of pre-training on the training loss. The model converges more easily when the pre-trained weights are used. The base learning rate of the model without pre-training is $10^{-6}$, and its other hyperparameters are the same as those of the pre-trained model. We also tried different base learning rates and optimizers for the models without pre-training. Most of these models hardly converge, and some even suffer gradient explosion, with the loss becoming NaN.
Using a pre-trained model not only makes training easier but also improves the performance of the model [51]. For few-shot learning in particular, using a pre-trained model is very important.
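The top-p% loss selection and its schedule, described in the training details above, can be sketched in plain Python. The schedule values come from the text; averaging the selected pixel losses and rounding the pixel count are assumptions, and the function names are hypothetical.

```python
def top_p_percent(iteration, start=1000, end=2000, p_start=100.0, p_end=15.0):
    """Percentage of highest-loss pixels kept at a given training iteration."""
    if iteration < start:
        return p_start
    if iteration >= end:
        return p_end
    # Linear decrease from p_start to p_end between `start` and `end`.
    frac = (iteration - start) / (end - start)
    return p_start + frac * (p_end - p_start)

def bootstrapped_loss(pixel_losses, iteration):
    """Mean of the hardest top-p% per-pixel losses (averaging is an assumption)."""
    p = top_p_percent(iteration)
    k = max(1, round(len(pixel_losses) * p / 100.0))
    hardest = sorted(pixel_losses, reverse=True)[:k]
    return sum(hardest) / k
```

Halfway through the linear phase, at iteration 1500, the schedule keeps 57.5% of the pixels.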
We use data augmentation to enrich our dataset. Specifically, we apply the same color jitter (brightness=0.1, contrast=0.03, saturation=0.03) and random grayscale with a probability of 0.05 to all images extracted from a video sequence. After that, we apply a color jitter of (brightness=0.01, contrast=0.01, saturation=0.01) and a different random affine transformation to each image.
Table 2 compares our method with the most advanced methods on VOS benchmarks at present. We compare three standard metrics: region similarity J, contour accuracy F and their average J&F. For model speed, we report multi-object FPS, which is defined as the total number of output masks divided by the total time the model takes to process all images. The speed of the comparison methods is measured on the same device according to the same standard.
TABLE 2. Comparisons between different methods. All models are trained on the same device using our dataset, and the corresponding pre-training weights are loaded before training. All three methods are iterated 3000 times.
In Table 3, we list the performance of our model in different scenarios. The table shows that our model performs well in common satellite scenes. Our model is not only robust to the overturning and occlusion of satellites, but also shows excellent performance under low illumination and complex backgrounds. Figure 11 compares our model with other models in terms of speed and accuracy. Although our model is slower than STCN, it is far superior to the other models in the evaluation metric J&F. Figure 12 shows how our model compares with other models during mask propagation. The red box and the blue box in the figure compare the results for the left wing and the right wing of the satellite under different models. The other models have difficulty distinguishing satellite components with similar appearance, since they have no specific design for low illumination and highly similar objects. Figure 13 compares the results of different models when satellite components overturn or are occluded. GT in the figure represents the ground truth. Owing to its attention sharing module, position information encoder and local matching module, our model handles the overturning and occlusion of satellite components more effectively than the other models.
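The multi-object FPS used for the speed comparison in this section can be computed as in the short sketch below. The function name and the example numbers are hypothetical; only the definition (output masks divided by total processing time) comes from the text.

```python
def multi_object_fps(masks_per_frame, total_seconds):
    """Total number of output masks divided by the total processing time."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return sum(masks_per_frame) / total_seconds

# Illustrative numbers only: three frames with two tracked components each,
# processed in half a second.
example_fps = multi_object_fps([2, 2, 2], 0.5)
```

Counting masks rather than frames means that a sequence with more tracked components per image reports a higher FPS for the same wall-clock time, which should be kept in mind when comparing against single-object frame rates.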

D. ABLATION STUDY
In Table 4, we analyze the effect of each module on the model. The comparison of speed and accuracy between different modules can be seen more intuitively in Figure 14. Enabling a single module alone yields only a very limited improvement in accuracy. We are most interested in the local matching module, partly because it consumes the most computing resources. On its own, the local matching module improves accuracy only slightly despite this cost, but when it is enabled together with the other modules, the performance of the model is greatly enhanced. In short, position information and attention information play a greater role in local matching. We speculate that this is because the gap between the memory frame and the query frame is small in local matching, so similar position information and similar attention weight matrices contribute more strongly to matching.
As shown in Table 5, the performance of the model with pre-training is far superior to that without pre-training. This is because the model can hardly converge without loading pre-training weights. It is difficult to learn basic texture and color features from our dataset alone, especially for the encoders that need to extract features.

E. ANALYSIS OF RUNNING TIME AND REAL-TIME PERFORMANCE
We analyze the running time of each component of the two models in Figure 15. The running time in the graph represents the time required for the model to process one image. Although our method is slower than STCN, it still processes one image in 66.2 milliseconds, which corresponds to about 15 images per second, so it can be used in real time. The added local matching module takes a lot of computing resources but improves the performance of the model, as described in the ablation experiment.

IV. CONCLUSION
We propose a simple and efficient tracking framework for satellite components. The proposed attention sharing module effectively improves the performance of the model and addresses the problem of low-illumination tracking in the space environment. The proposed position information module and local matching module effectively address the loss of tracking targets caused by the overturning and occlusion of satellite components. Compared with the most advanced methods at present, our method performs better in tracking satellite components. However, when the tracking target is lost for a long time in the video sequence and the satellite components are overturned, the local matching module fails because there is no corresponding tracking target in the previous frame, and global matching alone may not achieve the ideal performance. Moreover, the satellite docking task requires faster and more lightweight models. We will address these limitations in future work.