Transformer Meets Remote Sensing Video Detection and Tracking: A Comprehensive Survey

Transformers have shown excellent performance in the remote sensing (RS) field owing to their long-range modeling capabilities. Remote sensing video (RSV) moving object detection and tracking play indispensable roles in military activities as well as urban monitoring. However, transformers in these fields are still at the exploratory stage. In this survey, we comprehensively summarize the research prospects of transformers in RSV moving object detection and tracking. The core designs of remote sensing transformers and advanced transformers are first analyzed, mainly covering the evolution of attention mechanisms for specific tasks, the design of input mappings for fitting ability, diverse feature representations, and model optimization. The architectural characteristics of RSV detection and tracking are then described from two aspects. The first is moving object detection, covering motion-based traditional background subtraction and appearance-based deep learning models. The second is object tracking for single and multiple targets. The research difficulties mainly include the blurred foreground in RSV data, irregular object movement in traditional background subtraction, and severe object occlusion in object tracking. Following that, the potential significance of transformers is discussed in light of several thorny problems in RSV. Finally, we summarize ten open challenges of transformers in RSV, which may serve as a reference for promoting future research.

Moving object detection and tracking are the fundamental premise for advanced visual tasks, such as scene content analysis and understanding [7], [34], [35]. They are widely used in intelligent monitoring, dynamic observation of moving objects, and other application scenarios. In addition, well-developed object detection and tracking methods have good reference value for remote sensing video (RSV) interpretation [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46]. For example, improved natural-scene models can be transplanted into RSV detection and tracking, with great development potential and prospects for further research [47], [48], [49], [50], [51]. RSV moving object detection and tracking are discussed in this review in the hope of bringing some application value. The relationships among the sections are shown in Fig. 1.
For moving object detection (MOD), motion-based traditional machine learning methods label sparse foreground objects by modeling background information, typically solving the model iteratively with the alternating direction method of multipliers (ADMM) [52]. However, the complex background and sparse foreground characteristics of RSV make noise robustness a significant research hot spot. These methods are sensitive to irregular object motion [53], [54], [55] and rely on interframe registration. On the other hand, appearance-based deep learning models mainly take advantage of feature learning with convolutional neural networks (CNNs) [56], [57] and recurrent neural networks (RNNs) [58]. They rely on many training samples while lacking semantic distinction for motion artifacts [56]. Attention mechanisms are added to enhance object semantic features or to distinguish objects from the background, enabling more accurate detection [38], [58], [59].
RSV detection and tracking still leave great room for improvement in model representation and performance optimization [7], [34], [35]. Transformers show strong potential for dealing with temporal dynamics [93], [94], [95], [96], [97], [98]. RSV detection and tracking methods can therefore draw inspiration and further improvement from transformers with high efficiency and low latency. A general overview of the application of transformers to RSVs is thus needed, especially for moving object detection and tracking, which will benefit RSV interpretation. In this article, we mainly discuss the practical problems transformers can solve for RSV detection and tracking. The development of RS transformers is first analyzed. RSV moving object detection and tracking methods are then systematically investigated. The potential development of transformers in RSV object detection and tracking is discussed before raising ten open challenges. The primary contributions of this article are summarized as follows.
1) RS transformers are introduced from backbones to various downstream tasks, while the advanced transformers are covered from backbones to video and efficient transformers. The discussion focuses on input embedding, position encoding, and diversified feature design to help readers grasp the research status effectively.
2) RSV detection and tracking are investigated in terms of model design optimization and performance analysis. We elaborate on moving object learning detection and object transformer tracking, analyzing the model characteristics and research difficulties in detail. Besides, the datasets with corresponding evaluation indicators and experimental performance are also introduced.
3) Potential research directions of transformers in RSV detection and tracking are pointed out. Then, ten open challenges faced by transformers and RSV are discussed from the corresponding theoretical basis, providing a good reference for promoting future work.
The rest of this article, as shown in Fig. 2, is organized as follows. Section II describes the motivation for this review. RS transformers are briefly summarized in Section III. Section IV portrays moving object learning detection methods in RSVs. Section V explains object transformer tracking in RSVs, including single object tracking (SOT) and multiple object tracking (MOT). Section VI discusses the potential of transformers in RSV moving object detection and tracking, and Section VII provides ten promising open challenges. Finally, Section VIII concludes this article.

II. MOTIVATION
Transformer is essentially suitable for video tasks due to the sequential nature of video [93], [99], [100], [101], [102]. It has recently been shown to closely resemble the structure of the human hippocampus without the aid of any biological knowledge [32], [33]. Moreover, the attention mechanism in transformer imitates the selection mechanism in brain activity. Supported by these brain-inspired biological theories, the interpretability advantages of transformers can be explored more deeply [29], [103]. With the gradual increase in performance and memory requirements in the RS field, RS transformers based on locality, feature diversity, and hierarchy have successively enriched the backbone networks [11], [104], [105], [106], [107]. Besides, the corresponding training techniques have been improved to adapt to different downstream RS tasks, which can effectively enhance spatial information under limited computing resources [14], [15], [108], [109], [110].
1) RSV data characteristics with complex scenes: Low contrast between foreground and background can lead to blurred object boundaries, which may be aggravated by object shadows or noise [36], [51], [62], [118], [119]. Meaningless geometric properties and motion patterns from outliers interfere with the model performance. In addition, the local redundancy of video data can introduce a lot of repeated computation [7]. 2) Foreground separation and spatiotemporal information utilization for MOD: The traditional model relies on motion information, responding weakly to irregularly moving foregrounds and excessively to texture changes [120], [121], [122]. For deep learning models, it is crucial to effectively use motion information and spatiotemporal continuity to prevent false/missed alarms while achieving efficient detection [4], [57], [58], [116], [123]. Transformer can not only be used to enhance object semantic features but also improve the long-range modeling ability thanks to the sequential nature of video [96], [97], [101], [124].

III. RS TRANSFORMERS
Transformer, composed purely of attention mechanisms [131], [132], has been shown effective for long-term relationship construction in an encoder-decoder mode for natural language processing tasks [30]. Besides, the excellent performance of transformer stems not only from multihead self-attention (MHSA) but from all the components in the block playing a role [133], [134]. Next, we introduce the transformer preliminaries, RS transformers, and advanced transformers.

A. Transformer Preliminaries
Transformer mainly contains a position encoding module, a multihead attention mechanism, a feedforward network (FFN), residual connections, and layer normalization modules. The overall architecture is illustrated in Fig. 3. Next, we describe the encoder and decoder modules from the image processing perspective.
1) Encoder Module: The output is finally passed to the decoder after N stacked encoder blocks.
a) Input embedding: The input elements are embedded in a distributional space W so that the machine can process the input sequences [30]

$\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_N]$

where $\tilde{X} \in \mathbb{R}^{h^2C \times N}$ is the flattened patch sequence of the input $X \in \mathbb{R}^{H \times W \times C}$. $(H, W)$ and $C$ are the resolution and channel of the input, respectively. $\tilde{x}_i \in \mathbb{R}^{h^2C}$ is the $i$th flattened patch, $(h, h)$ denotes the resolution of each patch, and $N = HW/h^2$ represents the number of patches. The patch embeddings $\check{X}$ are generated by mapping the flattened sequence $\tilde{X}$ to W through the embedding matrix E

$\check{X} = E\tilde{X}.$

b) Position encoding: The RNN is a linear sequence that naturally encodes the position information into the model. The convolutional layer of the CNN retains position-relative information, while transformer, which contains no recurrence, learns the position information through the hidden state computation. The position information is beneficial to transformer [141]

$X = \check{X} + P$

where the positional encoding P has the same dimension as the patch embeddings. It is added to $\check{X}$ to supply the positional information. Besides, there are various kinds of position encodings, as shown in Fig. 4, such as sinusoidal functions [10], [30], [108], [124], relative positional encodings [25], [136], [138], [142], learnable embeddings [12], [22], [141], [143], [144], and dynamic position encoding with depthwise convolution (DWconv) [145].

c) MHSA mechanism: As an essential part of the transformer model, MHSA operates differently from modular neurons. Transformer with several head attention layers reproduces the contents of memory during computation [146], [147], [148]. This shows that transformer can move information to the output and to other places in the context. As shown in the lower left part of Fig. 3, the MHSA mechanism A, as the core part of transformer, concatenates the self-attention outputs $A_i$ [13], [14], [30]

$A = \mathrm{Concat}(A_1, A_2, \ldots, A_m)$

Fig. 5. (a) Self-attention [13], [14], [30]. (b) Spatial reduction attention [24]. (c) Pooling attention [150]. (d) Efficient attention [28].
where m is the number of heads. The self-attention mechanism collects the relevant information between each token and the other tokens in the sequence. As shown in Fig. 5(a), it can be calculated as

$A_i = \mathrm{Softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d}}\right) V_i$

where d is the per-head token dimension [30]. The single-head self-attention result $A_i$ is computed by the dot product of the Softmax output with the value $V_i$. Besides, the attention matrix $Q_i K_i^\top$ is normalized into a probability distribution by the Softmax function.
$Q_i = M_i^q X$, $K_i = M_i^k X$, and $V_i = M_i^v X$ are intermediate representations of the input tokens X, usually expressed as different linear transformations of the tokens [149]. $M_i^q$, $M_i^k$, and $M_i^v$ are the learned weight matrices for the query, key, and value, respectively. Different single-head self-attention results can be constructed by mapping the tokens with varying weight matrices.
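For concreteness, the following is a minimal PyTorch sketch of the MHSA computation above; the class layout, head count, and fused QKV projection are illustrative choices rather than the formulation of any cited model.

import torch
import torch.nn as nn

class MHSA(nn.Module):
    # Minimal multi-head self-attention:
    #   A_i = Softmax(Q_i K_i^T / sqrt(d)) V_i,  A = Concat(A_1, ..., A_m)
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads, self.d = heads, dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3)  # learned M^q, M^k, M^v stacked
        self.proj = nn.Linear(dim, dim)        # output projection after Concat

    def forward(self, x):                      # x: (batch, N tokens, dim)
        b, n, _ = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # split channels into m heads: (b, heads, n, d)
        q, k, v = (t.view(b, n, self.heads, self.d).transpose(1, 2)
                   for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5  # Q_i K_i^T / sqrt(d)
        attn = attn.softmax(dim=-1)            # normalize rows into probabilities
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)  # concat heads
        return self.proj(out)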
To reduce the computational complexity of the transformer model, most methods modify the attention module from different perspectives, especially the computation of attention weights [151]. For example, ShiftViT replaces the attention mechanism with a partial shift operation [152], [153]. Some frameworks change the attention weight calculation to a first-order Taylor approximation, which reduces the computational complexity to linear [154], [155]. SwinV2 proposes a scaled cosine function to replace the dot product operation [156]. d) Add and norm strategy: Transformer suffers from loss of practical information and the gradient vanishing problem due to its stacked layers [133]. Some frameworks have proved that residual connection and layer normalization can alleviate these problems [133], [134]. As shown in the upper left part of Fig. 3, the combination takes three different forms. They are written as

$\tilde{A} = \mathrm{LN}(\hat{A} + X) \qquad (6)$

$\tilde{A} = \mathrm{LN}(\hat{A}) + X \qquad (7)$

$\tilde{A} = \mathrm{MHSA}(\mathrm{LN}(X)) + X \qquad (8)$

Here, post-norm [30], res-post-norm [156], and pre-norm residual units [157] are defined in (6)-(8), respectively. $\hat{A}$ represents the input matrix of the residual normalization module, i.e., the output matrix of the MHSA mechanism, and $\tilde{A}$ is the result after the residual and layer-normalized computation. The LN function expresses layer normalization. e) Feed forward network: This module is crucial to the entire transformer structure; it takes the averaged attention values and transforms them into a more tractable form before they are input to the next layer [152]. It usually takes the following form:

$\mathrm{FFN}(\tilde{A}) = \mathrm{ReLU}(\mathrm{LinearLayers}(\tilde{A})).$
We take the post-norm representation in (6) as an example. The FFN consists of a linear layer and an activation function [157].
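The three residual normalization placements in (6)-(8) differ only in where LN is applied around a sublayer. A hypothetical PyTorch encoder block illustrating all three (the norm_style switch and the 4x FFN expansion are assumptions, not taken from the cited works):

import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim, heads=8, norm_style="post"):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_style = norm_style

    def _sublayer(self, x, fn, ln):
        if self.norm_style == "post":      # (6): LN(f(x) + x)
            return ln(fn(x) + x)
        if self.norm_style == "res-post":  # (7): LN(f(x)) + x
            return ln(fn(x)) + x
        return fn(ln(x)) + x               # (8): pre-norm, f(LN(x)) + x

    def forward(self, x):
        x = self._sublayer(x, lambda t: self.attn(t, t, t,
                                                  need_weights=False)[0],
                           self.ln1)
        return self._sublayer(x, self.ffn, self.ln2)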
2) Decoder Module: Each autoregressive decoder takes the previously generated result as input when producing the next output. Its components are similar to those of the encoder; the differences are the masked MHSA and the multihead cross-attention mechanisms. a) Masked MHSA mechanism: This mechanism has the same structure as the MHSA in the encoder. The difference is that the input tokens need to be masked by adding −∞ [158], [159], that is, relying only on the token information up to the current position without any future information [141]

$A_i = \mathrm{Softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d}} + M_i\right) V_i$

where $Q_i$, $K_i$, and $V_i$ are the projection results of the input tokens X with the corresponding learned linear matrices $M_i^q$, $M_i^k$, and $M_i^v$. $M_i$ is the mask of the $i$th head self-attention. b) Multihead cross-attention mechanism: This mechanism is designed to handle two embedded inputs with the same dimension, which is different from MHSA. The key-value pairs come from one input, and the query from another. It can capture contextual information more effectively [161]. The multihead cross-attention mechanism in the decoder module can be written as

$A_i = \mathrm{Softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d}}\right) V_i$

The inputs K and Q for calculating the attention weight matrix come from the encoder result and the masked MHSA output of the decoder, respectively. Besides, the encoder output V is assigned the attention weights to highlight the regions of interest [162]. Some methods construct cross attention from a clustering perspective to improve model rationality [163], [164].
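A minimal sketch of the −∞ masking that realizes the masked MHSA above, assuming the common upper-triangular (causal) mask construction:

import torch

def causal_attention(q, k, v):
    # q, k, v: (batch, heads, n, d). Positions after the current token
    # receive -inf before Softmax, so their attention weights become zero.
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device),
                      diagonal=1)                     # True above diagonal = future
    scores = scores.masked_fill(mask, float("-inf"))  # add the mask M_i
    return scores.softmax(dim=-1) @ v

For cross attention, the same scaled dot-product routine is applied without the mask, with q projected from the decoder tokens and k, v projected from the encoder output.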

B. RS Transformers
Transformers have developed rapidly in the RS field. The differences among RS transformers are mainly reflected in the following three aspects: 1) processing: the definition that maps a specific task to model input/output as a sequence of vectors; 2) diversity of position embedding types: such as sinusoidal functions [30] and learnable embeddings [141]; and 3) efficient transformer designs: such as question-specific structured sparsity patterns in masked attention. Various RS transformers are listed in Table I. They are briefly summarized in this section, covering transformer backbones for feature representation learning and high/mid-level and low-level transformers in RS interpretation.
1) Transformer Backbones: They are gradually expanding in the RS field. The supervised and self-supervised learning RS transformers will be discussed in the following subsections. a) Supervised learning transformers: A straightforward approach is to replace the backbone with transformer blocks [12], [104], [105], [154]; for example, MAP-SwinT [104] replaces ResNet in MAP-Net [165] with SwinT blocks to achieve multiscale feature extraction. Some models add specific modules to their backbones for feature enhancement [11], [166]. The value tokens of the CTN model are calculated by a 2-D convolution layer, as shown in Fig. 6(a), realizing the combination of convolution and transformer [11].
Swin transformer has seen good development in several RS tasks [20], [106], [107], [167], [168]. SwinT blocks are adopted as encoders in semantic segmentation, with corresponding decoders constructed to generate enhanced semantic features [106], [167]. For generating high-quality RS image time series, SwinSTFM [107] proposes a feature extraction and fusion module composed of SwinT blocks, shown in Fig. 7. An unmixing-based fusion block is introduced in the multilevel fusion module to complete the fusion of features at different levels. SwinSUNet [20] designs a pure transformer network with a Siamese U-shaped structure [169] for the image change detection task, as shown in Fig. 8. In the low-level vision task of pansharpening, DR-NET uses SwinT blocks to process the multispectral and panchromatic images separately before performing feature fusion [168]. Besides, it introduces a convolutional block attention module (CBAM) and efficient channel attention [170] in the image reconstruction stage to make the network focus on crucial information, thereby obtaining images with uniform spectral information and sufficient spatial details. b) Self-supervised learning transformers: Self-supervised learning is a variant of unsupervised learning, which uses self-supervision to analyze the laws and key information in the datasets [10], [46], [171], [172], [173]. It learns a general feature representation to make the model transferable to downstream tasks [10], [22], [174]. Using a label-free self-distillation contrastive learning mechanism, LaST captures long-range contextual information of RS images with a SwinT backbone [174]. It alleviates the hard negative sample problem through self-distillation contrastive learning.
As a self-supervised pretraining transformer, BERT [175] achieves good generalization performance in RS. HSI-BERT [22] introduces BERT into hyperspectral image (HSI) classification to capture the global dependencies across pixels. The pixel embedding, which contains a learned linear transformation and a learned positional embedding, is used in all input dimensions. SITS-BERT [10] adopts a BERT-based self-supervised learning for model pretraining. It captures spectral-temporal features in RS image time-series classification tasks after fine-tuning.
2) High/Mid-Level RS Transformers: They are mainly described for image classification, object detection, semantic segmentation, and change detection tasks. a) Image classification: It has crucial research value as a primary RS interpretation task, mainly based on transformers or neural networks. Most frameworks use a hybrid scheme to improve modeling capabilities.
Fig. 9. Local-enhanced transformer block [19].
CNN-enhanced transformers: They use ViT or SwinT variants as a central framework to perform feature extraction on different RS images. It is found that MHSA and convolution modules exhibit opposite behaviors, which resemble low-pass and high-pass filters, respectively [176]. Therefore, CNNs and transformers have been fused differently to promote the representation learning.
Generally, for HSI classification, most models first adopt a convolutional network to map the image into corresponding convolutional features and then use a transformer model to perform the subsequent classification [13], [18], [108], [109], [143], [177], [178]. In addition, DHViT adopts a convolutional token embedding to adjust tokens [18]. SSFTT proposes a Gaussian-weighted feature tokenizer module by adding a Gaussian-distribution-weighted matrix [13]. It makes the tokens conform to the distribution characteristics of the samples. Moreover, some methods directly split the image and input it into the transformer after flattening [11], [19], [22], [174]. SPRLT-Net proposes a spatial partition restore module to extract complex spatial relationships [19]. The flowchart is shown in Fig. 9, where the spatial partition module splits the HSI patch into several overlapping subpatches centered on a pixel. At the same time, the spatial restore module is used to aggregate all subpatches into a feature map. Some models have improved the self-attention mechanism to realize spectral awareness for HSI. For example, BS2T introduces a multihead spatial-spectral self-attention module, which acts as the spectral information on the attention weight matrix [177]. HiT proposes a conv-permutator module with DWconv operations to encode spatial-spectral features along the height, width, and spectral dimensions [109]. SSTN proposes a spatial attention and a spectral association module [178]. Among them, the spectral module generates masks through a 3-D convolution operation on spatial information to model the correlations between spectral kernels and spatial information. Besides, it finds the optimal architecture setting through a factorized architecture search framework to achieve better accuracy.
To further enhance the overall performance of transformer, some frameworks use a parallel design between transformer and CNN to extract local and global information [160], [179]. As shown in Fig. 6(b), CTNet concatenates the semantic features of ViT streams with the local structural features of CNN streams to predict sample labels [160]. GLNS designs a fusion network to integrate the output features and uses a twofold loss function to compact the classification features [179]. MSTNet proposes a multilevel feature aggregation decoder to improve the feature expression ability, which fuses the different-level features generated by the transformer encoder blocks [143]. For joint classification of hyperspectral and light detection and ranging data, DHViT proposes a spectral sequence and a spatial hierarchical transformer module [18]. The former sends the flattened feature vector to transformer for extracting spectral features. The latter extracts the spatial features of the two modal data. Finally, a cross-attention module is presented to exchange the classification and patch tokens of different modal features for achieving heterogeneous feature fusion.
Transformer-enhanced CNNs: As an essential means to obtain discriminative features, the attention mechanism can effectively improve the modeling ability. The model with different attention mechanisms represents different information [183].
In channel feature learning, the convolution operation, which fuses all the channels by default, pays more attention to the receptive field, while some models use the channel attention mechanism to realize adaptive enhancement of the feature weights of virtual channels [180], [186], [189] or to strengthen the correlation between channel features [181]. The calculation process of channel attention is shown in Fig. 10: the feature undergoes a global average pooling module and two fully connected layers, realizing the channel weighting. Notably, SAFF proposes a nonparametric self-attention layer, which applies spatialwise and channelwise weighting sequentially [186]. CAG proposes a cross-attention mechanism consisting of a horizontal and a vertical attention mechanism [184]. It uses a combination of weight multiplication and maximum weight matching strategies to enlarge the feature differences.
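As an illustration of the Fig. 10 pipeline (global average pooling followed by two fully connected layers), a squeeze-and-excitation-style sketch is given below; the reduction ratio r is an assumed hyperparameter, not a value from the cited models.

import torch.nn as nn

class ChannelAttention(nn.Module):
    # GAP -> FC -> ReLU -> FC -> sigmoid, then reweight the channels.
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):                     # x: (b, c, h, w)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c))  # per-channel weights in (0, 1)
        return x * w.view(b, c, 1, 1)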
Some models adopt the self-attention mechanism instead of spatial convolution operation to capture long-distance information relations effectively [187], [188], [189]. WFCG proposes a position and a channel attention module composed of the self-attention mechanism to simulate spatial and channel attention [187]. These two modules are concatenated in series to capture higher level abstract HSI feature information. In HSI classification, the spectral attention mechanism is introduced to capture long-range dependencies of feature maps [188], [189]. In particular, a feedback spatial attention module using multiscale spatial information and a feedback spectral attention module are proposed in FADCNN to strengthen semantic information in the spatial-spectral dense networks [189].
b) Object detection with transformers: The attention mechanism is mainly used for feature enhancement. IAANet adopts MHSA to model the coarse-grained candidate regions at the pixel level and outputs attention-aware features to distinguish objects from the background [14]. In the design of channel feature correlation, SSE-CenterNet introduces a spatial shuffle-group enhance attention module, which shuffles the channels to improve the relationship between groups [192]. It divides the feature map into multiple groups along the channel dimension and generates an attention factor at each spatial location within each group to learn higher level semantic information.
Fig. 11. CBAM [168], [193], [194], [202], [204], [209].
CBAM extends the channel attention [194]. As shown in Fig. 11, it includes channel and spatial attention in series and is applied to multiscale feature enhancement in the object detection field [193]. Some models use improved hybrid attention modules to perform multiscale feature enhancement [195], [196]. They obtain the final features by multiplying the generated attention weight map with the original feature map to highlight object features. For example, RSADet proposes a lightweight scale attention module, including a parallel spatial and a channel max pooling submodule [195]. FPN-MSDAM proposes a multiscale deformable attention module, which cascades multiscale features along the channel axis and generates attention maps using a convolution layer and a sigmoid function [196].
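A compact sketch of the CBAM pattern just described (channel attention followed by spatial attention in series); the shared MLP, the avg/max pooling pair, and the 7×7 convolution follow the commonly used formulation and are assumptions here.

import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, r=16, kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // r), nn.ReLU(),
                                 nn.Linear(channels // r, channels))
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):                                 # x: (b, c, h, w)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                # shared MLP on avg pool
        mx = self.mlp(x.amax(dim=(2, 3)))                 # and on max pool
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)  # channel attention
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)   # (b, 2, h, w)
        return x * torch.sigmoid(self.conv(s))            # spatial attention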
c) Semantic segmentation with transformers: Some models use the transformer blocks [103], [210] as the encoder to extract multiscale features and design the decoder with different attention modules for feature fusion and refinement [105], [106], [167]. For example, SETR-MFPD designs a dimension attention module including channel and spatial attention mechanisms to connect the multiscale feature pyramid decoder [105]. DC-Swin designs a decoder with a densely connected feature aggregation module [106]. As shown in Fig. 12(a), it generates enhanced semantic features through a shared spatial attention (SSA) and a shared channel attention (SCA) with cross-scale connections. Another method feeds the local features and global contextual features into the transformer encoder to realize dual-branch semantic correlation [15]. The projected local feature tokens are set as query, and the contextual feature tokens as key and value.
The attention mechanism variants can be incorporated into the network backbone for capturing feature correlations [151], [197], [199]. In channel attention mechanism designs, UDA-SS proposes a covariance-metric-based channel attention module for an unsupervised framework [199]. It assigns high weights to feature maps with high covariance through convolution and channel correlation computation for representing other feature maps. SCAttNet cascades the channel and spatial attention modules of CBAM [197]. MANet proposes a multiattention network that combines kernel and channel attention mechanisms to refine information in positions and channels [151]. Among them, the kernel attention mechanism uses kernel smoothers to replace the attention weight matrix calculation, while the channel attention uses the attention weight calculation based on the dot product.
Fig. 12. (a) Densely connected feature aggregation module [106]. (b) SwinT embedding U-Net model [198].
For the U-Net backbone improvement, the attention mechanism enhances the feature extraction ability with good segmentation accuracy [183]. MaResU-Net replaces the skip connections of the baseline network with a linear attention mechanism and adopts the $\ell_2$-norm to ensure nonnegativity [155]. ST-UNet introduces a relational aggregation module (RAM) to integrate the SwinT block into the U-Net encoder hierarchically [198]. As shown in Fig. 12(b), it proposes a spatial interaction module (SIM) across window MHSA (W-MSA) and shifted window MHSA (SW-MSA) blocks to improve modeling capabilities. This module includes dilated convolution and global average pooling operations. A feature compression module (FCM), consisting of a soft pooling operation and a bottleneck block with dilated convolution, is introduced to improve the segmentation accuracy of small-scale objects while preserving details. STrans-Fuse proposes a parallel two-branch structure of SwinT and CNN [2]. It designs an adaptive fusion module based on the self-attention mechanism to enhance spatial details selectively. d) Image change detection with transformers: This task is to identify surface changes from a pair of bitemporal RS images covering the same place. Some models concatenate the multitemporal feature maps into the transformer encoder to achieve spatiotemporal context modeling and then input the enhanced features into subsequent convolutional layers to generate the final prediction results [3], [207]. CDViT proposes a transformer block composed of two cascaded MHSAs to model the spatial and temporal context features [3]. BIT uses transformer to enhance the original features [16]. It proposes a Siamese semantic tokenizer to generate two token sets from the extracted bitemporal features. The cascaded token sets are fed to a ViT encoder and then sent to a Siamese transformer decoder after splitting.
For the multiscale features, transformer blocks are used to operate on different scale features [21], [110], [201]. MSCANet introduces a spatial attention module for token embedding and designs a transformer structure for each scale [21]. It also proposes a contextual aggregation connection to aggregate high-level decoding features into low-level features for fusing multiscale information.
The attention mechanism plays a vital role in the consistency of cross-temporal features [200], [202], [203], [204], [206]. For example, Bi-SRNet adopts the self-attention mechanism in both the temporal and change branches [206]. Remarkably, a cross-temporal semantic reasoning block is proposed in the change branch, where attention maps are projected onto the opposite temporal branches. DASNet adds a dual-attention mechanism to obtain distinguishable feature representations [200]. As shown in Fig. 13, it consists of a spatial attention module for modeling local contextual features and a channel attention module for long-range semantic dependencies. Besides, CBAM is used for obtaining more discriminative multiscale features [202], [204]. SRCDNet proposes a stacked attention module with multiple CBAMs to enhance adequate information in hierarchical features [204]. DARNet introduces a hybrid attention module to fuse bitemporal multiscale features [203]. It contains an efficient spatial-temporal attention module with cross attention to capture the long-range feature dependencies and a channel attention module from CBAM to model the channel contextual information. A residual connection is finally added to facilitate the error backpropagation. e) Other image processing fields with transformers: The attention mechanism, especially the channel attention module, plays an essential role in modeling key features [182], [191]. In the satellite image time-series classification task, CA-TCN adds a channel attention block to enhance the critical features in the channel dimension and mine deeper phenological information [182].
Fig. 14. Pan-sharpening transformer branch [208]. MS and PAN mean multispectral and panchromatic, respectively.
3) Low-Level RS Transformers: In the image despeckling field, SAR-CAM introduces a continuous attention module, which consists of multiple concatenated residual channel attention blocks (RCABs) and CBAM with residual connections [209]. The RCAB adopts the channel attention module with residual connection to make the network focus on highfrequency channel features.
The convolutional features are used to implement transformer modeling [17], [130], [208]. For the multi-image super-resolution task, TR-MISR proposes a transformer-based fusion module to fuse low-resolution image features after the encoder [130]. The fused features are input into the decoder to obtain high-resolution images. Beyond that, some models design parallel transformer and CNN branches [17], [208]. In super-resolution HSI restoration, Interactformer proposes an interactive attention unit based on elementwise multiplication to adjust the information interaction between branches [17]. In addition, a separable self-attention module is designed in the transformer branch to achieve linear-complexity computation. It obtains attention weights along the width and height dimensions of the features and finally applies them to the input in turn. PAN-Tran designs a pan-sharpening transformer in the transformer branch to realize the fusion of panchromatic and multispectral image features [208]. As shown in Fig. 14, this branch contains a hard-attention and a soft-attention module to fuse the two kinds of image information.

C. Advanced Transformers
1) Transformer Backbones: Similar to the types of RS transformer backbones, the supervised-learning-based, self-supervised-learning-based, and reinforcement-learning-based transformers are introduced.
a) Supervised learning transformers: We divide the supervised transformer backbones into pure transformer and convolutional transformer backbones for easy distinction.
Pure transformers: ViT, which only uses the transformer encoder, requires a lot of training data and needs further development regarding feature and data diversity [210]. In input token operations, PVT adopts a spatial reduction operation to reduce the spatial dimension of key-value pairs [24]. As shown in Fig. 5(b), it realizes downsampling of the input sequence, while PVTv2 replaces this with an average pooling operation [212]. The MViT series introduces pooling constraints [23], [150]. As shown in Fig. 5(c), it incorporates decomposed relative position embeddings and uses residual connections to compensate for the pooling stride effect in attention computation [150]. LV-ViT adds local supervision on the output of each patch, which exploits the complementary information between the patch and class tokens [213].
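The spatial reduction idea of Fig. 5(b) can be sketched as follows: keys and values are downsampled by a stride-R convolution before attention, shrinking the attention matrix from N×N to N×(N/R²). This is an illustrative sketch of the mechanism, not the exact PVT implementation [24].

import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    # Queries keep full resolution; keys/values come from a spatially reduced map.
    def __init__(self, dim, heads=8, R=4):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=R, stride=R)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, h, w):                              # x: (b, h*w, dim)
        b, n, c = x.shape
        kv = self.sr(x.transpose(1, 2).reshape(b, c, h, w))  # (b, c, h/R, w/R)
        kv = kv.flatten(2).transpose(1, 2)                   # (b, n/R^2, c)
        return self.attn(x, kv, kv, need_weights=False)[0]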
Some transformers focus on designing attention mechanisms [214], [215], [216], [217], [218]. For local attention mechanisms, Focal Transformer designs three window levels for each query, incorporating fine-grained local and coarse-grained global interactions [214]. ELSA proposes an enhanced local self-attention [215]. It has a Hadamard attention that uses the Hadamard product to generate local attention efficiently and a ghost head inspired by GhostNet [219] to increase channel capacity. To capture long-distance information, BOAT proposes a bilateral local attention, which uses a feature-space local attention as a supplement to the image-space local attention [216]. To improve the patch feature expression ability in local areas, transformer nesting methods divide the patch into several subpatches in a nested way and pass them through inner and outer transformer blocks in turn after flattening [217], [218].
Different from the standard transformer block in Fig. 15(a), different attention mechanisms can be stacked in the transformer block, which achieves two consecutive attention mechanisms, as in Fig. 15(b) [103], [156], [220]. SwinT proposes W-MSA and SW-MSA, which realize cross-window connections as well as expand the receptive field [103], while Twins-SVT stacks the global subsampled attention and the locally grouped self-attention, achieving an effective attention paradigm [220].
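A minimal sketch of the window partition and cyclic half-window shift underlying the W-MSA/SW-MSA alternation; the attention masking of wrapped-around positions used in [103] is omitted for brevity.

import torch

def window_partition(x, ws):
    # x: (b, h, w, c) -> (num_windows * b, ws*ws, c); MHSA then runs per window.
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

def shifted_windows(x, ws):
    # Cyclic shift by half a window so the next block's windows straddle
    # the previous block's window borders (cross-window connections).
    return window_partition(torch.roll(x, shifts=(-ws // 2, -ws // 2),
                                       dims=(1, 2)), ws)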
Convolutional transformers: The convolutional token embedding can be incorporated to capture local information [140], [221]. CvT replaces the linear projection of input tensors with a convolutional projection to reduce semantic ambiguity [221]. To further control the interest region of the transformer model, DAT proposes a deformable attention module that shifts key-value pairs to target regions via a query-independent offset network [25]. NAT controls the receptive field of each token within its neighborhood by taking the position corresponding to the query as the center [139]. Besides, DWconv performs well in reducing data dimensions while maintaining network performance; examples include replacing the entire or part of the attention calculation [223], [227], designing the positional encoding [142], and expanding the receptive field in the FFN [212], [230].
The attention mechanism can also be replaced to achieve stable performance with less computational overhead [140], [142], [222], [224], [227], [246], [247]. CSWin performs the self-attention operations on horizontal and vertical stripes in parallel [140]. It adjusts the stripe width according to the network depth. VAN proposes a large kernel attention module, which captures long-range relationships through a decomposition of large-kernel convolution operations [222]. CoaT designs a conv-attentional module, which adopts a co-scale mechanism to predict results using a series of serial and parallel blocks [142]. ACmix proposes a two-stage manner to integrate convolution and self-attention [224]. Moreover, a parallel strategy can be used to realize the fusion of convolution and attention [227], [246]. As shown in Fig. 16, Mixformer adopts bidirectional interactions to enhance the model ability across branches simultaneously [227].
The high-resolution architecture could be integrated with visual transformers to enhance cross-resolution interactions [229], [230]. Besides, HRViT performs a heterogeneous branch to optimize key components of the model jointly [229]. The mix-block is designed to reduce the computational cost and achieve efficient networks. ViTAE stacks the reduction and normal cells to form different variant structures [228]. The reduction cell obtains multiscale context information through the pyramid reduction module. It uses MHSA and parallel convolutional modules to model long-range dependencies and local context. The normal cell has a similar structure to the former except for the pyramid reduction module. Conformer designs a dual-branch structure with CNN and transformer [226]. It fuses representations through a feature coupling unit module.
In applying and improving the ResNet bottleneck block, TRT-ViT follows the hierarchical route from stage to block and forms hybrid architectures with the bottleneck in a standard transformer [225]. BoTNet designs a bottleneck transformer block, which replaces the convolution layer with MHSA [231]. It significantly improves performance by replacing the last three bottleneck blocks with the designed block. RepLKNet replaces the self-attention with a depthwise large convolution kernel, resulting in a larger effective receptive field [248]. b) Self-supervised learning transformers: Visual-transformer-based self-supervised learning frameworks have been proposed to learn features with stronger generalization [26], [245]. SimMIM proposes a self-supervised learning framework based on masked image modeling to learn semantic information [26]. As shown in Fig. 17, it randomly masks some input patches and predicts the masked patch values with a transformer encoder and a lightweight one-layer prediction network. Swin UNETR transfers this scheme to medical image pretraining, achieving good experimental results after fine-tuning [232].
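A sketch of the random patch masking at the heart of masked image modeling; the mask ratio and the learnable mask token are illustrative, not the exact SimMIM settings.

import torch

def random_mask(patches, mask_token, ratio=0.6):
    # patches: (b, n, dim); mask_token: (dim,) learnable vector.
    # A random subset of patch embeddings is replaced by the mask token;
    # the network is trained to predict the original content at those positions.
    b, n, _ = patches.shape
    mask = torch.rand(b, n) < ratio       # True = masked
    out = patches.clone()
    out[mask] = mask_token
    return out, mask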
c) Reinforcement learning with transformers: Reinforcement learning was adopted to make models learn attention decisions for deciding the focused perceptual area even before transformer was proposed [249], [250]. To make transformer suitable for the reinforcement learning optimization process, GTrXL designs a gating layer to replace the residual connection, with remarkable model stability [27], as shown in Fig. 18. On this basis, AT-RL adds an adaptive attention span to selectively focus on past time steps, improving the attention computational efficiency [233]. CoBERL combines GTrXL with long short-term memory (LSTM) and BERT [175] with contrastive objectives to learn a better representation [234]. As for offline reinforcement learning, the agent only learns from limited data without environmental interaction. Transformer has shown great potential and mines optimal policies in data with its powerful sequence modeling ability [251], [252].
Fig. 19. (a) Token shift module [235]. (b) Long short-term transformer [240].
2) Video Transformers: Video tasks need to deal with temporal dynamics information. Current transformer-based models have been extensively explored with the development of pure video transformers [93], [101], [235], [236], [237], [238]. a) Video classification: Some models focus on improvements to the transformer block. ViViT migrates transformer from image to video tasks and proposes different structural paradigms [93]. It develops improvement strategies in feature embedding, spatiotemporal encoding, and self-attention. As shown in Fig. 19(a), TokShift-xfmr designs a token shift module [235]. It swaps partial content of the current frame with that of neighboring time stamps to model the temporal relationship within the transformer encoder.
b) Video action recognition: The temporal attention mechanism is introduced to make the model learn dynamic scenes efficiently, albeit with increased memory consumption. X-ViT restricts temporal attention to a local temporal window to achieve space-time attention with linear complexity [238]. It exploits the depth of transformer to obtain full temporal coverage of video sequences, and different positional embeddings are designed for space and time tokens.
Some models employ divided temporal and spatial attention instead of self-attention to aggregate spatiotemporal information [94], [98], [236], [237]. VidTr proposes a topK pooling operation based on the standard deviation in the temporal attention [94]. It reduces the temporal dimension and eliminates the redundancy caused by the same content in multiple frames. Besides, Motionformer designs an approximation scheme to speed up the calculation [236]. TIME designs a self-supervised model to learn the video temporal dynamics, eliminating spurious correlations in the spatiotemporal dynamics [101].
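The divided space-time factorization can be sketched as a temporal attention pass (each location attends across frames) followed by a spatial pass (each frame attends within itself). This is a minimal sketch of the general pattern, not the exact formulation of any of the cited models.

import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    # Factorized attention over a token grid of T frames x S spatial patches.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, T, S):                         # x: (b, T*S, dim)
        b, _, c = x.shape
        # temporal pass: sequences of length T, one per spatial location
        xt = x.view(b, T, S, c).transpose(1, 2).reshape(b * S, T, c)
        xt = xt + self.time_attn(xt, xt, xt, need_weights=False)[0]
        # spatial pass: sequences of length S, one per frame
        xs = xt.view(b, S, T, c).transpose(1, 2).reshape(b * T, S, c)
        xs = xs + self.space_attn(xs, xs, xs, need_weights=False)[0]
        return xs.view(b, T, S, c).reshape(b, T * S, c)

Compared with full spatiotemporal attention over T·S tokens, the factorization reduces the attention cost from O((TS)²) to O(T²S + S²T) per block.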
c) Video restoration: A neural network is used for feature extraction, while transformer handles feature alignment and long-term dependence modeling [95], [124], [239]. As shown in Fig. 20, ET-Net proposes a token pyramid aggregation strategy with transformer to model the internal correlation and intersected correlation of tokens [124]. VRT designs a temporal mutual self-attention to achieve feature extraction and alignment [95]. The proposed attention connects multihead mutual attention and MHSA in parallel. In addition, the attention mechanism plays an essential role in temporal feature processing, effectively highlighting object edge features [239], [253]. d) Video object and instance segmentation: VisTR applies an encoder-decoder transformer to model feature similarity and instance feature prediction in temporal order [96]. STM uses the attention mechanism to perform calculations between the image information of the current frame and the object masks of past frames [99]. In particular, as shown in Fig. 19(b), AOT proposes a long short-term transformer to model long- and short-term memory [240]. It designs an identification mechanism to achieve a unified object segmentation strategy, which embeds multiple object masks into one feature space. e) Video reasoning: In future frame modeling tasks, learning the spatial relationships and object dynamics is vital [137]. AVT designs a ViT encoder for each video frame to anticipate future actions [97]. To reduce memory consumption, it proposes a causal transformer decoder using causal masking to focus on specific input parts. OCVT takes targets as the center based on unsupervised learning, which encodes the scene as tokens and uses transformer to learn the spatiotemporal dynamics between targets [137].
f) Video frame interpolation: It aims to synthesize intermediate frames in video frames for improving the frame rate. The CNN and transformer are combined to improve the attention in the transformer block, achieving the long-distance pixel correlation [98], [241]. VFIformer designs a cross-scale window-based attention mechanism to expand the receptive field and gather multiscale information [241].
3) Efficient Transformers: An efficient transformer with low latency and high parameter efficiency has always been crucial [228], [242], [243], [244]. Such models can run efficiently on resource-constrained hardware, with representation improved by adjusting the loss function, training, or modeling techniques. This part introduces model design and knowledge distillation. a) Transformer model designs: Efficient self-attention is crucial for long sequence modeling. ResT adds a DWconv operation to MHSA to reduce the dimensions of key-value pairs [28]. As shown in Fig. 5(d), it adds a convolutional operation to the attention weight calculation to increase the interactions among different heads.
The mix-block could reduce the computational cost and achieve efficient networks. SPViT proposes a weight-sharing scheme between MHSA and convolutional operations, which adopts a single-path search space to formulate the operation search as a subset selection problem [243]. Alternatively, some methods choose to stack blocks alternately [228], [242], [244]. CoAtNet alternately stacks the DWconv and self-attention to design the model cleverly [242]. Next-ViT effectively stacks the next convolution and transformer block by a next hybrid strategy [244].
b) Knowledge distillation: DeiT designs a transformer-specific distillation to improve ViT, distilling a teacher CNN backbone network into a transformer-based student model [29]. A distillation token from the teacher model interacts with the patch embeddings. In this way, the student model can improve its training speed and quality. Based on DeiT, DINO introduces self-distillation with no labels, which combines the proxy task in BYOL [172] with ViT for self-supervised learning [245]. It learns the semantic segmentation representation of the input image efficiently.

IV. MOVING OBJECT LEARNING DETECTION IN RSVS
The primary purpose of MOD is to locate and identify continuously moving objects in a given video and then track these objects successfully [4], [35], [36], [116], [254]. This field is generally divided into motion-based and appearance-based approaches. The former models the background to realize foreground motion detection. The latter applies artificial neural networks to extract the motion and appearance information of objects. The categories of MOD are classified in detail in Table III. The modeling of the traditional motion-based methods and the construction of the appearance-based methods are briefly summarized to give a clear understanding of MOD. The last part of this section introduces the attributes and characteristics of some popular datasets, as well as evaluation metrics, so that readers can gain a holistic understanding of MOD. In the following, we introduce the two kinds of models separately.

A. Motion-Based Models
These frameworks mainly detect moving objects according to motion patterns, which adopt frame difference and background subtraction to separate the foreground and background [120], [121], [122]. Frame difference eliminates most of the unchanged background through computing pixelwise differences in intensities between consecutive frames and then extracts moving objects. Background subtraction is the mainstream method in traditional MOD, which achieves foreground detection with different background models.
1) Frame Difference: It has the advantages of high efficiency and low memory consumption. The difference calculation between frames is a relatively simple operation to eliminate background information

$\Delta d_t = |d_{t+1} - d_t|$

where $d_t$ is the $t$th video frame, and $\Delta d_t$ represents the absolute interframe difference containing foreground and noise. Current frameworks mainly learn how to separate the foreground objects after the frame difference calculation. The most direct method applies a threshold to separate moving objects from the background, although different threshold selections affect the number of detected moving objects [120]. AMS-DAT designs a binarization threshold under the premise of object scale invariance [120]. Some frameworks use prior morphological information to remove background noise [35], [54], [55], [255]. For differentiating foreground from noise, PDT proposes a local noise model via fitting noise patterns with a probability distribution [55]. AMS-DAT uses the spatiotemporal continuity of object motion to eliminate false detections [120].
Three- and multiframe difference methods have been proposed to detect irregularly moving objects [35], [256]. VTD-FastICA takes three consecutive frames as input and uses an improved independent component analysis method, FastICA, to integrate image information in the spatial domain [256]. MMB models frames as perturbed low-rank matrices to detect slow-moving objects and uses a pipeline filter to draw the trajectory [35].
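A minimal sketch of frame differencing with a fixed binarization threshold; the threshold value tau is illustrative (AMS-DAT [120] derives it adaptively), and the three-frame variant intersects two consecutive differences to suppress the ghosting left at an object's previous position.

import numpy as np

def frame_difference(frames, tau=25):
    # frames: (n, h, w) grayscale video; returns n-1 binary foreground masks.
    diffs = np.abs(np.diff(frames.astype(np.int16), axis=0))  # |d_{t+1} - d_t|
    return diffs > tau

def three_frame_difference(frames, tau=25):
    # Keep only pixels that change in both adjacent differences.
    d1 = np.abs(frames[1:-1].astype(np.int16) - frames[:-2])
    d2 = np.abs(frames[2:].astype(np.int16) - frames[1:-1])
    return (d1 > tau) & (d2 > tau)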
2) Background Subtraction: This traditional method mainly separates a video sequence into foreground and background, which labels the moving objects through the background model. It can be divided into the following steps.

1) Given a video sequence with n frames

$D = [d_1, d_2, \ldots, d_n] \in \mathbb{R}^{s \times n}$

where $d_t$ is the $t$th vectorized video frame, and s is the number of pixel values contained in each frame. Generally, the sequence is decomposed into three components, namely the background matrix $B = [b_1, b_2, \ldots, b_n] \in \mathbb{R}^{s \times n}$, the foreground matrix $R = [r_1, r_2, \ldots, r_n] \in \mathbb{R}^{s \times n}$, and the noise matrix $E = [e_1, e_2, \ldots, e_n] \in \mathbb{R}^{s \times n}$.
2) Low rank is generally imposed on the background and sparsity on the foreground. The optimization problem is defined as

$\arg\min_{B,R,E}\; \mathrm{Rank}(B) + \lambda_1 \Omega(R) + \lambda_2 \|E\|_F^2 \quad \mathrm{s.t.}\quad D = B + R + E \qquad (13)$

where $\lambda_1 > 0$ and $\lambda_2 > 0$ are the weights of the foreground term $\Omega(R)$ and the noise term $\|E\|_F^2$, respectively. Rank(·) expresses a low-rank matrix factorization, and Ω(·) refers to the structured sparse induced norm of R, which is generally expressed as $\sum_{f \in \mathcal{F}} \|R_f\|_{1/\infty}$ to promote sparsity on the foreground [257]. $\|\cdot\|_F$ represents the Frobenius norm. E is used to represent noise explicitly so that the model can better reflect the fundamental data structure. The background is highly correlated in a lower dimensional subspace, while the foreground exists as sparse outliers against the background.
3) To solve (13), the ADMM is used for optimization, which transforms a multivariable optimization problem into single-variable subproblems [254], [258], [259].
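As an illustrative sketch of steps 2) and 3), the following solves a simplified version of (13) by ADMM: Rank(·) is relaxed to the nuclear norm, the structured sparsity Ω(R) is replaced with a plain ℓ1 penalty, the explicit noise term E is omitted from the constraint, and the default λ and μ follow common heuristics rather than any cited paper.

import numpy as np

def rpca_admm(D, lam=None, mu=None, iters=100):
    # D: (s, n) float matrix whose columns are vectorized frames.
    # min ||B||_* + lam * ||R||_1  s.t.  D = B + R,
    # with Y the Lagrange multiplier of the constraint.
    s, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(s, n))
    mu = mu if mu is not None else 0.25 * s * n / np.abs(D).sum()
    Y, R = np.zeros_like(D), np.zeros_like(D)
    shrink = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
    for _ in range(iters):
        # B-step: singular value thresholding of D - R + Y/mu (low-rank background)
        U, sig, Vt = np.linalg.svd(D - R + Y / mu, full_matrices=False)
        B = (U * shrink(sig, 1.0 / mu)) @ Vt
        # R-step: elementwise soft thresholding keeps the sparse foreground
        R = shrink(D - B + Y / mu, lam / mu)
        Y += mu * (D - B - R)                 # dual ascent on the constraint
    return B, R

Thresholding the magnitude of each column of R then yields the moving-object mask for the corresponding frame.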
These models are mainly divided into statistical background and sparse background models, which are introduced in the following subsections. a) Statistical background: It models the background from adjacent pixel information, and its input features are pixel level and region level.
For the background construction, HMAO regards the background and foreground as peer unknown variables and decomposes the background into temporally low-frequency and high-frequency components [36]. VSF-BST utilizes thermal pixel intensity and spatial video salient features with the Akin-Based Local Whitening Boolean Pattern (ALWBP) feature descriptor [118]. It considers the effect of neighboring pixels, discriminating the foreground in flat cluttered regions. AV-BSM proposes a real-time adaptive vector-based background subtraction method [260]. Each pixel is transformed into a vector with a spatial-temporal signal through a vector representation method. It uses a specified time interval scheme to initialize the background model.
For foreground detection, AV-BSM determines whether it is a foreground object by calculating the number of vector collinearity [260], while some methods employ the Markov random fields for improving the robustness [36], [118].
b) Sparse background: This method mainly decomposes the video sequence into a low-rank background and a sparse foreground [257], [261].
Background modeling estimates the rank minimization of the background based on principal component analysis (PCA). SLRC proposes a dedicated background model for multiscenario video sequences, which uses dictionary-learning-based sparse coding to represent the background model for each scene [262]. MODSM imposes a saliency map on the background, enabling the estimated foreground to contain high-level semantic objects with fewer false alarms [263]. Foreground modeling, in turn, generally emphasizes the smooth constraint of the foreground boundary to reduce noise influence. Since moving objects are collections of spatially correlated pixels, structured sparsity is mostly adopted instead of pixelwise sparsity. SLRC adds contextual regularization and sparse representation into the foreground model [262]. KRMARO integrates kinematic regularization into the principal component pursuit of the foreground, which uses the Euclidean distance and motion angle to model the motion of the candidate region [121]. 3DTV-RPCA presents 3-D total variation regularization to achieve the continuity of moving objects [53].
For the optimization problem of the objective function, ILR-SUSD proposes an inexact alternating direction method based on augmented Lagrange multiplier and proximal operators to solve the optimization problem [115]. E-LSD provides the direct expansion of the ADMM to solve video with poor spatial resolution and low contrast [254]. SLRC develops a three-stage alternating optimization method consisting of the SOFT-IMPUTE method, PALM, and 2-D FFT [262]. MCMD develops a batch optimization method with ADMM and an online stochastic optimization method [258]. KRMARO integrates a backtracking behavior into an inexact augmented Lagrange multiplier, which obtains the moving objects only when the frames are optimally aligned [121]. To eliminate the satellite motion influence and reduce false alarm rates in video frames, MCMD proposes a moving confidence score through the dense optical flow estimation to emphasize the difference between real object motion and satellite movement [258]. 3DTV-RPCA introduces an auxiliary variable to model noisy data for reducing noise impact [53].
Unlike the above robust PCA-based methods, O-LSD is an online structured sparse model combining the stochastic optimization and the structured sparse penalty to improve update estimation [264]. STOMF proposes a temporal difference motion prior model to obtain the motion information matrix and weight matrix for extracting the entire motion regions [122]. Besides, a postprocessing method is presented to detect normal-scale and small-scale moving objects using partial spatial information reconfirmation and partial spatial background information reuse methods.
Tensor, a higher dimensional data structure than 2-D matrix, is more appropriate for capturing higher order relationships in data. WSNM-STTN decomposes the video frames into tensor form based on E-LSD [254] and applies a weighted Schatten p-norm to the background for providing an adaptive threshold [259]. TLISD proposes a tensor low-rank and invariant sparse decomposition method for background [119]. Based on the tensor PCA, 3D-PSCATV-CS provides an automatic weight assignment to the singular value tubes of the background tensor [265].
In the foreground constraint, 3D-PSCATV-CS adopts a 3-D piecewise smoothness constraint combination based on anisotropic total variation (3D-PSCATV) for the foreground to encode the spatiotemporal smoothness and temporal coherence [265]. TLISD models the illumination changes as noise variables via the k-support norm and generates a set of illumination-invariant representations as prior maps to distinguish moving foregrounds from illumination changes [119]. TF-TTV proposes a dynamic half-thresholding low-rank tensor total variation (DHLRTTV) and a static half-thresholding low-rank tensor total variation (SHLRTTV) algorithm for dynamic and static background influence, respectively [37]. DHLRTTV divides the foreground into the dynamic background and the exact foreground. It adopts $\ell_{1/2}$-norm regularization to diminish the dynamic background effect and tensor total variation regularization for foreground smoothness. SHLRTTV, compared with DHLRTTV, ignores the dynamic background component. The augmented Lagrange multiplier with an alternating direction minimizing approach is finally proposed to solve the optimization problem.

B. Appearance-Based Models
The traditional motion-based models require consistent global illumination and rely on video registration [55], [56]. In addition, they are sensitive to irregular motions and texture changes in the physical world. In response, several appearance-based deep learning MOD frameworks have emerged [4], [58], [116], [123]. They are divided into four categories: image-object-detection-based, RNN-based, visual-tracking-based, and optical-flow-based models.
1) Image-Object-Detection-Based Methods: An object detector or semantic segmentation method can be employed for MOD directly [266]. ML-SAR uses Faster-RCNN to detect object shadows in video SAR frames [57]. As shown in Fig. 21(a), WS-MOD trains the detector with foreground masks, which are generated as binary pseudolabels by background subtraction and threshold segmentation [268]. To eliminate obvious false positives, LRP adopts a discrete histogram mixture model with a recursive learning algorithm to measure the object category possibility [266]. ML-SAR employs an improved density-based clustering method over consecutive frames to correlate object shadows reliably [57]. For missed alarms, it presents a Bi-LSTM to predict lost locations from contextual detection information.
To aggregate more detailed features, DeepFoveaNet proposes two encoder-decoder network modules inspired by the monocular vision of birds [38]. It contains a Peripheral-CNN for detecting contextual information in the scene and a Deepfovea-CNN for small moving foregrounds to simulate visual attention. DSFNet proposes a 2-D static stream with a feature fusion block to obtain object details [116]. To extract object motion cues, it presents a lightweight 3-D dynamic stream with three 3-D convolutional layers. The overall flow is shown in Fig. 21(b), where FFB represents the feature fusion module and 2D conv and 3D conv represent the 2-D and 3-D convolution blocks, respectively. The two stream features are fused in a progressive hierarchical manner. ClusterNet combines motion and appearance information through a convolutional network and obtains object locations with heatmap estimation [56].
2) RNN-Based Method: As shown in Fig. 22, ETE-MOD uses a deep convolutional encoder-decoder network to extract the semantic information of video frames [58]. It proposes an attention ConvLSTM, which adds a soft attention mechanism after the ConvLSTM, to enhance the semantic features. In addition, it adopts a spatial transformer network to improve robustness to global and local motion, as well as a conditional random field layer to smooth foreground boundaries.
3) Visual-Tracking-Based Methods: Trackers need accurate and robust object features to achieve correct results, so feature extraction in tracker-based methods is critical. Dogfight uses pixelwise and channelwise attention to distinguish object boundaries from the background [59]. As illustrated in Fig. 23(a), the pooling and attention block contains a spatial pyramid pooling module and an attention module with pixelwise and channelwise branches. The channelwise attention is implemented by channelwise multiplication of the attention vector with the convolutional feature maps, whereas the pixelwise attention performs pixelwise multiplication of a pixel attention mask to give functional regions high weights. To generate high-quality object proposals, UDOLO proposes an object occupancy map in Fig. 23(b), which serves as a selective attention mechanism guiding the detector to focus on essential parts [39].
To obtain object candidate regions, JP-DP-TBD, which is based on the dynamic-programming-based track-before-detect (DP-TBD) algorithm, uses both object position and radial velocity information in video SAR images and the corresponding range-Doppler spectrum [4]. ES-TBD adopts an expanding and shrinking strategy, combining the particle filter and dynamic programming algorithms to obtain effective transition states for object position components [269]. It presents a region-partitioning-based track-before-detect algorithm to maintain known object trajectories and detect newborn objects.

4) Optical-Flow-Based Methods:
These methods calculate object velocity between frames for MOD. To gain object candidate positions from adjacent frames, OF-VAS uses optical flow to obtain object motion information and generates candidate objects with the Otsu segmentation method [123]. The flowchart is shown in Fig. 24; a Gabor filter combines the obtained results into a quaternion image, and the final detections are achieved by the quaternion Fourier transform and phase spectrum reconstruction.
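A minimal sketch of the first stage of this idea, dense optical flow followed by Otsu binarization to obtain candidate regions, is shown below with OpenCV; the Farneback parameters are illustrative defaults, and the quaternion Fourier refinement stage is omitted.

```python
import cv2
import numpy as np

def motion_candidates(prev_gray, curr_gray):
    """Candidate moving-object boxes from dense optical flow + Otsu."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    mag = np.linalg.norm(flow, axis=2)                  # per-pixel motion
    mag8 = cv2.normalize(mag, None, 0, 255,
                         cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(mag8, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Connected components give the candidate object boxes (x, y, w, h)
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    return [tuple(stats[i, :4]) for i in range(1, n)]
```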
Optical flow can also be used for unsupervised MOD. ACM-MOD proposes an unsupervised adversarial contextual model consisting of a generator and an inpainter [270]. The generator produces an object mask from the image and its optical flow, while the inpainter attempts to inpaint the optical flow masked out by the generator. The two are trained jointly in an adversarial manner to learn the complex relationship between foreground and background.

C. Datasets and Evaluation Metrics
1) Datasets: Some public RS datasets for moving object detection are listed in Table IV. The PESMOD dataset comes from small-object drone videos on the Pexels website [271]. Its targets include vehicles and pedestrians, and its main challenge is occlusion in complex environments. The Chang Guang Satellite Technology Company Ltd. (CGSTL) provides many free RSVs for scientific research. The Valencia dataset from CGSTL is widely used for MOD experimental verification [55], [120]. It covers different city-scale information with small moving objects. In addition, the MOD task of the VISO dataset includes a training set with 13 470 images, a validation set with 535 images, and a test set with 3725 images [35]. Its main challenges include complex backgrounds, illumination changes, and dense lanes.
2) Evaluation Metrics: There are multiple evaluation indicators for MOD, including precision, recall rate, F1 score, the precision-recall (PR) curve, average precision (AP), and mAP. They are defined as follows.

a) Precision: It is the proportion of true positives (TP) among all detections, which consist of TP and false positives (FP):
$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \tag{14}$$
where TP counts truly detected boxes with correct coverage and FP counts falsely detected boxes. Due to the small target pixels in RSV, a TP can also be defined as a detection that overlaps the ground truth box [35].

b) Recall rate: It is the ratio between TP and all ground truth boxes, where false negatives (FN) are the ground truth boxes missed by the detector:
$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{15}$$

c) F1 score: It is the harmonic mean of precision and recall rate, a traditional criterion for binary classification between objects of interest and nonobjects:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{16}$$

Here, the IoU value between the $i_\alpha$th ground truth box and the $i_\beta$th detected box in the $i$th frame of a given video sequence is defined as
$$\mathrm{IoU}_{i_\alpha, i_\beta} = \frac{\left|\mathrm{box}_{i_\alpha} \cap \mathrm{box}_{i_\beta}\right|}{\left|\mathrm{box}_{i_\alpha} \cup \mathrm{box}_{i_\beta}\right|} \tag{18}$$
where $\mathrm{box}_{i_\alpha}$ and $\mathrm{box}_{i_\beta}$ represent the corresponding box areas, $\cap$ and $\cup$ denote the intersection and union of the two regions, and $|\cdot|$ is the number of pixels occupied by a region. With $N_i$ and $M_i$ denoting the numbers of detections and ground truths in the $i$th frame, the pairwise IoU values form an $N_i \times M_i$ correlation matrix (CM). The cost tensor (CT) is composed of the CMs of $K$ consecutive frames to reduce the computational burden. The final correlation index is obtained through the optimal associations between detections and ground truths, computed with the Hungarian algorithm in the spatiotemporal domain.
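These per-frame definitions translate directly into code. Below is a minimal Python sketch, where the function names and the 0.5 IoU gate are illustrative assumptions, that builds the IoU-based CM for one frame, solves it with the Hungarian algorithm via SciPy, and derives precision, recall, and F1 as in (14)-(16); the multiframe cost tensor is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def frame_prf(dets, gts, iou_thr=0.5):
    """Per-frame precision, recall, and F1 via the IoU correlation matrix."""
    cm = np.zeros((len(dets), len(gts)))           # N_i x M_i CM
    for i, d in enumerate(dets):
        for j, g in enumerate(gts):
            cm[i, j] = box_iou(d, g)
    rows, cols = linear_sum_assignment(-cm)        # maximize total IoU
    tp = int(np.sum(cm[rows, cols] >= iou_thr))    # gated matches = TP
    fp, fn = len(dets) - tp, len(gts) - tp
    p = tp / max(tp + fp, 1)
    r = tp / max(tp + fn, 1)
    f1 = 2 * p * r / max(p + r, 1e-9)
    return p, r, f1
```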

V. OBJECT TRANSFORMER TRACKING IN RSVS
RSV object tracking plays an indispensable role [82], [86], [90], [112], [125], [127]. It provides a cost-effective processing method for motion analysis and object monitoring, especially in scenarios where on-site measurement equipment is difficult to install. The current research status of SOT and MOT is discussed in this section.

A. Single-Object Tracking
It is mainly divided into two directions: the traditional correlation filter (CF) tracker based on estimation, and the deep-learning-based tracker. The CF tracker fuses hand-crafted or convolutional features of the tracked object on top of a pure CF and estimates the object location through Bayesian methods. SOT is mainly divided into four steps.
1) Build a tracker model. For a video sequence, the object in the first frame is sent to the tracker for subsequent tracking.
2) Extract the object candidate region at each subsequent frame. The candidate region is obtained by an inference-based filter or a DNN-based model.
3) Achieve the accurate object position and mark it with a rectangular box. The position is inferred from the object information saved in the tracker.
4) Update the tracker. The object information at the current frame is sent into the tracker. If the sequence has terminated, the loop ends; otherwise, continue from step 2 (a code sketch of this loop is given below).

These models are classified in detail in Table V, which marks the key characteristics and the baseline model corresponding to each tracker. It also records the usage of transformer/attention in each tracker across template extraction, search region extraction, and correlation calculation. To make each tracker more intuitive, the challenges it addresses are recorded, mainly including occlusion, similar objects, and complex scenes. The commonly used RS tracking datasets and evaluation metrics are also introduced later for a more comprehensive understanding.
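The following Python sketch mirrors the four-step loop with a hypothetical `Tracker` interface; all method names are illustrative placeholders rather than any published API, and a CF- or DNN-based tracker would fill in the methods with real logic.

```python
class Tracker:
    """Hypothetical minimal SOT interface (placeholder implementations)."""
    def init(self, frame, box):
        self.box = box                      # 1) build the tracker model

    def candidate_region(self, frame):
        return frame, self.box              # 2) e.g., a search window

    def locate(self, region):
        frame, box = region
        return box                          # 3) a real tracker infers a new box

    def update(self, frame, box):
        self.box = box                      # 4) refresh stored object info


def track_sequence(frames, init_box):
    tracker = Tracker()
    tracker.init(frames[0], init_box)
    boxes = [init_box]
    for frame in frames[1:]:
        region = tracker.candidate_region(frame)
        boxes.append(tracker.locate(region))
        tracker.update(frame, boxes[-1])
    return boxes
```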

1) CF-Based Trackers:
The CF tracker trains with positive and negative samples based on the object bounding box in the first frame [5], [61], [62]. Its weights are updated in subsequent frames to prevent temporal degradation and increase the tracker's discriminative capability. The general structure is shown in Fig. 25, in which the feature extraction part uses hand-crafted features, DNN features, or both. CF trackers are mainly divided into basic and deep-learning-based CF trackers. a) Basic CFs: The CF is fast and real time without additional training, which is suitable for large-scale RSV object tracking. Several components can be added before the input is fed to the filter to improve accuracy. SCT divides the tracking object into multiple cognitive units and sends these units into an attentional weight map calculation module to obtain the final input [60]. CFME combines KCF [272] with a motion estimation method to determine the object position and mitigate boundary effects [62].
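As an illustration of the underlying principle, the sketch below implements a basic MOSSE-style filter: ridge regression solved in the Fourier domain, which is the common core that trackers such as KCF extend with kernels and richer features. It is a minimal sketch under those assumptions, not the exact formulation of any method cited above.

```python
import numpy as np

def train_cf(patches, gauss_peak, lam=1e-2):
    """Train a basic correlation filter: given grayscale patches cropped
    around the target and a desired Gaussian response `gauss_peak`, solve
    a regularized least squares problem in the Fourier domain."""
    A = np.zeros_like(np.fft.fft2(patches[0]))   # numerator accumulator
    B = np.zeros_like(A)                         # denominator accumulator
    G = np.fft.fft2(gauss_peak)
    for p in patches:
        F = np.fft.fft2(p)
        A += G * np.conj(F)       # correlation of desired output with patch
        B += F * np.conj(F)       # patch energy spectrum
    return A / (B + lam)          # regularized filter in Fourier domain

def respond(H, patch):
    """Correlation response map; its peak gives the new target position."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(patch)))
```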
Optical flow, as an important tool for detecting object motion, also plays an important role in SOT. MOFT obtains the object position with the Lucas-Kanade optical flow method [273]. HKCF adopts optical flow to detect motion information and the histogram of oriented gradients (HOG) to capture object texture information [61]. For hyperspectral video, MHT decomposes the data into constituent spectra and corresponding abundances and then embeds them into the CF [274]. For SAR video, JKCF uses the cell-averaged CFAR to extract the object shadow in the image and the energy in the corresponding range-Doppler spectra [5]. Besides, it adopts interframe correlation with trajectory matching to suppress false tracks. The final shadow and energy bounding boxes are both sent to the dual KCF.
For object feature extraction, PAC proposes spatial and appearance selective attentions [40]. The former, which generates an object location response map through weighted Boolean maps, captures the object's topological structure, while the appearance selective attention treats distractors around the object as negative samples. During the CF weight update process, WTIC employs information compensation, which introduces background information into the CF to distinguish the tracking object from the corresponding background [275]. In addition, JKCF proposes a normalized interaction factor to update the learning rate [5]. STSD adds spatial-temporal information constraints to the objective function, which makes the filter update conservatively when the appearance changes drastically [276].
The CF tracker can be combined with other models in different ways to improve performance [47], [125], [277]. CFKF proposes a tracking confidence module to couple the CF tracker with the Kalman filter [277]. It evaluates the CF confidence through the average peak-to-correlation energy algorithm and passes the result to the Kalman filter for trajectory correction. Du et al. [47] run KCF in parallel with three-frame differencing to prevent drift and obtain results by calculating an attraction value. MBLT proposes motion estimation to predict the object position probability and a road segmentation method to constrain the object moving area [125]. These two results are finally masked onto the CF result to generate the final bounding box.
Model postprocessing is particularly significant for performance improvement [5], [62], [275], [276], [278]. To solve the occlusion problem, CFME uses the peak value of the filter response patch to determine whether the object is occluded or the occlusion has ended [62]; if occluded, the motion estimation result is used as the object position. IMMCF considers the maximum response score and the average peak correlation energy [278]; if occluded, the interacting multiple model is used to predict the object position. To prevent track drift, WTIC proposes tracking status monitoring indicators to evaluate the tracking status [275]. JKCF presents a target localization interactive correction with the peak-to-sidelobe ratio (PSR) to prevent tracking drift and reinitializes the tracker when it crashes unexpectedly [5]. STSD employs a multiscale patch-based contrast measure scheme to correct the target position, preventing shadow targets from being affected by clutter [276].
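A typical ingredient of such status monitoring is the PSR mentioned above. A minimal sketch follows; the exclusion window size and the fallback threshold are illustrative choices rather than values from the cited trackers.

```python
import numpy as np

def psr(response, exclude=5):
    """Peak-to-sidelobe ratio of a CF response map; a low PSR is a common
    signal of occlusion or tracking drift."""
    peak = response.max()
    py, px = np.unravel_index(response.argmax(), response.shape)
    mask = np.ones_like(response, dtype=bool)
    mask[max(0, py - exclude):py + exclude + 1,
         max(0, px - exclude):px + exclude + 1] = False   # exclude peak area
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-9)

# e.g., if psr(resp) < 6.0, fall back to motion estimation or reinitialize.
```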
b) Deep-learning-based filters: Convolutional features are added to the tracker model to enrich feature diversity. To emphasize the importance of different channel features, CGRCF proposes a channel attention module with channel and graph regularization methods [279]. Likewise, A³DCF advocates an adaptive attribute-aware spatial attention mechanism with channel-specific regularization [73]. It identifies the discriminative information of each channel and mitigates the influence of irrelevant information. To suppress the influence of distractors, JMMAC designs a multimodal fusion network with global and local networks, obtaining accurate response maps [70].
Cascading CF and DNN can achieve robust tracking [68], [280]. ACFN adds a subset of CF trackers and designs an attention network composed of prediction and selection subnetworks to select trackers adaptively [280]. MMNet proposes a fine-grained perception module before the CF [68]. It performs a self-attention mechanism on the shallow features to obtain more fine-grained correlation information.
2) Deep-Learning-Based Trackers: Deep learning trackers generally migrate pretrained classification models to the tracker and fine-tune the model weights on tracking data to achieve effective object tracking [75], [127], [281], [282]. These trackers are divided into three major categories, namely CNN-based, RNN-based, and Siamese-based trackers. a) CNN-based models: As single-branch trackers, they mainly use MDNet [283] as a baseline and train a feature extractor as well as a video-specific classifier on the first frame for subsequent tracking. The general tracking process is shown in Fig. 26; the light dashed line indicates the model backpropagation.
To increase the object representation ability, TTS introduces a spatial mechanism, which applies max and average pooling operations to the original convolution features, making the tracker pay more attention to the object [284]. RT-MDNet+LV adds an attention regularization term to suppress the background and highlight the target region [72]; the regularization is defined as the weighted local variances of the convolution feature. TCTrack [281] designs an adaptive temporal transformer to refine the feature map. As shown in Fig. 27, the subscript t of Feature encode_t represents the tth video frame; the temporal information is used to enhance the spatial features. CRAM combines appearance and optical flow motion features [285], integrating the two response maps from the same separate regression network for the final location prediction. CAT introduces a center regression module and a corner regression module [74]. Besides, it proposes a lightweight attention module in the corner regression; the features weighted in this manner pay more attention to the regions that benefit corner regression. DACapT introduces the capsule network into the feature extraction to model feature similarity [286]. It adopts a group attention mechanism to make the model attend to the object and a penalty attention module to provide discriminative attributes.
For different modality inputs, M⁵L integrates an attention fusion module that concatenates weighted modalities to obtain the final fused feature [69]. CBPNet designs a channel attention mechanism to make the model focus on significant regions [64]. As for the occlusion challenge in satellite videos, AD-OHNet uses the spatiotemporal context to calculate the object's average moving direction and distance [287]. Besides, it adopts deep reinforcement learning to make the tracker proceed along the original direction, and the object appearance model continues training with the previous positive and negative samples. b) RNN-based models: They employ the gating mechanism in LSTM to compute the information flow at the current time step and utilize different attention mechanisms for feature enhancement [75], [76]. HART, which imitates the structure of the human visual cortex, proposes a cascaded form of spatial and appearance attention before the features are fed into the LSTM [75]. The appearance attention comprises parallel ventral and dorsal streams, and the final input features are obtained by the Hadamard product of these two feature results. ARNN trains jointly with a bidirectional LSTM [76]. As shown in Fig. 28, the intra- and interattention mechanism is formed with an interattention and an intraattention model, augmenting the object patch-level features.
c) Siamese-based models: These dual-branch architectures generally determine the current object response position by calculating the similarity between the template region feature in the first frame and the search area feature in the current frame [308]. The general process is shown in Fig. 29. It is worth noting that DualTFR achieves effective tracking with a pure transformer backbone network [309]. During the video frame preprocessing stage, DeepMAT proposes a dynamic target-aware attention module to obtain an accurate global search area [305]. CFD-SiamRPN++ integrates a clustering-based frame differencing method in the input blocks to enhance the discriminability of small objects [322]; it fuses the original block with a fine difference map generated by k-means clustering. In hyperspectral video processing, BRRF-Net proposes a band regrouping module, which divides HSI patches into groups of RGB-like image patches [48]. It quantifies each band by capturing nonlinear correlations between bands and then reorganizes them according to their importance. Similarly, SiamMRANN divides HSI patches into several three-band image patches and inputs them into the Siamese network in parallel [282]. H³Net divides RGB and hyperspectral video data into spatial and spectral branches and then concatenates the spatial and spectral features into the Siamese tracker [112]. It adopts an unsupervised learning framework that trains on these two data types sequentially using the principle of cycle consistency.
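At the core of these dual-branch architectures is a similarity computation in which the template feature acts as a correlation kernel slid over the search-region feature. A minimal PyTorch sketch of this step is given below; the shapes are illustrative assumptions, not any specific tracker's configuration.

```python
import torch
import torch.nn.functional as F

def siamese_response(template_feat, search_feat):
    """Cross-correlate a template feature with a search-region feature.

    template_feat: (C, Hz, Wz) from the first-frame crop.
    search_feat:   (C, Hx, Wx) from the current frame, with Hx > Hz.
    Returns a single-channel response map whose peak locates the object.
    """
    z = template_feat.unsqueeze(0)      # (1, C, Hz, Wz) used as conv kernel
    x = search_feat.unsqueeze(0)        # (1, C, Hx, Wx)
    return F.conv2d(x, z)               # (1, 1, Hx-Hz+1, Wx-Wz+1)

# e.g., siamese_response(torch.rand(256, 8, 8), torch.rand(256, 32, 32))
```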
Channel and/or spatial attention modules with different connection modes, such as cascade and parallel, can be added to the feature extraction to enhance tracker adaptability [303], [304], [320]. This improves the tracker's sensitivity to discriminative object features [127], [298]. CGACD designs a twofold correlation-guided attention module to obtain enhanced features [298]. It is based on channel and spatial attention mechanisms, which act on search regions and template features, respectively. SiamMRANN proposes a multilevel residual attention module to focus on spatial and spectral aspects of local objects [282]; its loss function incorporates the tracking results of multilevel features to achieve accurate object regression. AiATrack introduces an attention-in-attention module after the dot product operation of the attention mechanism [312]. The proposed module is shown in Fig. 30 and can be used in self-attention or cross-attention blocks to suppress noise.
The transformer encoder-decoder can be used to aggregate template and search area features [42], [307]. As shown in Fig. 31, TrDiMP adopts the transformer architecture to enhance object cues, where Mask represents the template feature mask [42]. In addition, pyramid features have great advantages in feature enhancement [310], [314], [316]. The multiscale features can be sent to a pooling attention mechanism similar to Fig. 5(c) [314]. SiamTPN designs a transformer pyramid network block [316], which uses a lateral cross-attention approach for cross-scale feature fusion.
The similarity calculation can be replaced by cross-attention operations [41], [301], [314], [321]. Among them, TransT contains a cross-feature augment module composed of multihead cross attention [41]. TT-ATOM designs cascaded pixel-level and channel-level cross attention to realize interactive modeling across channels [314]. Different from the above pixel-level calculation, CSWinTT flattens the template and search area features into a window sequence [126]; it proposes a multiscale cyclic shifting window to generate a large number of samples, realizing window-level attention. SwinTrack designs a vision-motion integrated transformer, which fuses a motion token into the decoder to embed the tracklet [78]. MixFormer adopts multiple stacked asymmetric mixed attention modules with patch embedding, integrating feature extraction and correlation [311]. The transformer can also take other features, such as saliency or hierarchical features, as input after the correlation calculation [310], [317], [318], enhancing its ability to capture global context information. The Gaussian mixture model (GMM) can produce an object mask to improve tracking performance and prevent tracker drift [49], [66]; the result is then fused with the output of the Siamese tracker to predict the object position, making full use of both tracking and detection capabilities. SRN-TFM presents a deep motion regression network formed with optical flow, which is a crucial complement to the Siamese tracker [50]. In addition, an adaptive fusion strategy based on the PSR combines the deep motion network with the tracker, and a trajectory-fitting motion model fits the object motion pattern to alleviate tracking drift.
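The sketch below illustrates the general pattern of such cross-attention correlation with standard PyTorch modules; the dimensions and the residual arrangement are illustrative assumptions rather than the exact design of TransT or the other trackers above.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Minimal sketch: search tokens attend to template tokens, replacing
    an explicit correlation operation with multihead cross attention."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_tokens, template_tokens):
        # search_tokens: (B, Nx, dim); template_tokens: (B, Nz, dim)
        fused, _ = self.attn(query=search_tokens,
                             key=template_tokens,
                             value=template_tokens)
        return self.norm(search_tokens + fused)   # residual + norm

# x = torch.rand(1, 1024, 256); z = torch.rand(1, 64, 256)
# out = CrossAttentionFusion()(x, z)              # (1, 1024, 256)
```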
3) Datasets and Evaluation Metrics: RSV datasets provide important reference value for SOT development and further promote model research.
a) Datasets: The UAV123 dataset incorporates long-term aerial tracking sequences and highlights camera viewpoint and bounding box aspect ratio changes [323]; its sequences total more than 110K frames. Some of the DTB70 video sequences are recorded from a DJI Phantom 2 Vision+ drone on college campuses, and others come from YouTube [111], improving the variety of object appearances and scenes. The UAVDT dataset marks 840K bounding boxes [114], with sequence lengths ranging from 83 to 2970 frames; only 50 sequences are used for SOT testing. As a hyperspectral video dataset, the WHU-Hi-H³ dataset provides additional spectral information across 25 bands in the 600-900 nm range [112]. It designs nine scenes, which are divided into 69 video sequences, and the tracked objects include cars, rigid objects, people, and shadows.
The VISO dataset, captured by Jilin-1, contains different real-world traffic situations [35]. The tracked objects include airplanes, cars, ships, and trains. Twenty-seven video sequences are used for SOT, containing 3159 trajectories with a total of 1120K frames. The SatSOT dataset uses data collected by Jilin-1, Skybox, and Carbonite-2 and contains 27 664 frames [113]. To reflect more complex background information, it does not set a uniform resolution, and the number of frames per video ranges from 120 to 750. Objects include ships, cars, planes, and trains, whose sizes range from 21 to 780 605 pixels. The SV248S dataset utilizes six open-source satellite video datasets provided by CGSTL [7] and constructs 248 video sequences; approximately 40 tracked objects, including ships, motor vehicles, and aircraft, are selected from each source dataset.
These datasets cover rich appearance variations and challenging attributes, and all video sequences are accurately labeled with tracking targets for tracker evaluation. The detailed information of these datasets is listed in Table VI. b) Evaluation metrics: Most trackers adopt the one-pass evaluation, that is, initializing with the ground truth position of the first frame in a video sequence and reporting the average accuracy/success score [61], [127], [280], [286]. They follow the OTB evaluation methodology, calculating the success and accuracy scores without any parameters [324], [325]. The specific indicators are as follows.
Precision plot: Given the center positions $(\beta_1^G, \beta_2^G)$ and $(\beta_1^T, \beta_2^T)$ of the ground truth and tracked boxes in each frame of the video sequence, the center location error (CLE) is defined as their Euclidean distance
$$\mathrm{CLE} = \sqrt{(\beta_1^G - \beta_1^T)^2 + (\beta_2^G - \beta_2^T)^2}.$$
The percentage of frames in which the CLE is less than a given threshold is calculated for a specified video sequence. An accuracy curve is then drawn over different thresholds with the corresponding frame percentages, and the proportion of the area under the curve (AUC) gives the total accuracy score of the tracker. Success plot: Given the ground truth area $\mathrm{box}_1$ and the tracked area $\mathrm{box}_2$ of each frame in the video sequence, the IoU value $\mathrm{IoU}_{1,2}$ can be calculated by (18). The enhanced IoU (EIoU) considers the location error and the IoU comprehensively
$$\mathrm{EIoU} = \delta_1 \cdot \mathrm{IoU}_{1,2} + \delta_2 \cdot (1 - \mathrm{NE})$$
where $\delta_1$ and $\delta_2$ are nonnegative weight coefficients with $\delta_1 + \delta_2 \le 1$, and NE represents the normalized Euclidean distance between the center positions of the ground truth and tracked boxes [7]. The percentage of frames in which the IoU/EIoU is larger than a given threshold is then calculated; the success curve is drawn in the same way as the accuracy curve, and the total success score of the tracker is obtained from the proportion of the AUC. The precision and success evaluation metrics show different types of tracking accuracy at all thresholds.
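To make the protocol concrete, the following sketch computes both plots' summary scores from per-frame IoU and CLE values. The threshold grids are the commonly used ones, and the 20-pixel precision point is the conventional reporting choice; a specific benchmark may sample differently.

```python
import numpy as np

def success_score(ious, thresholds=np.linspace(0, 1, 21)):
    """Success plot: fraction of frames whose IoU (or EIoU) exceeds each
    threshold; the AUC of this curve is the success score."""
    ious = np.asarray(ious)
    curve = [(ious > t).mean() for t in thresholds]
    return np.trapz(curve, thresholds), curve

def precision_score(cles, thresholds=np.arange(0, 51)):
    """Precision plot: fraction of frames whose center location error is
    below each pixel threshold; index 20 is the 20-pixel point."""
    cles = np.asarray(cles)
    curve = [(cles < t).mean() for t in thresholds]
    return curve[20], curve
```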
Enhanced normalized union score (ENUS): It is a highly compatible and accurate evaluation method that can evaluate different types of tracker boxes, such as tight polygon boxes. It combines the EIoU with a normalized union term $U$ through nonnegative weight coefficients $\sigma_1$ and $\sigma_2$ satisfying $\sigma_1 + \sigma_2 \le 1$, where
$$U = \max\Big(1 - \Big|\frac{\mathrm{Precision}}{\mathrm{Precision}_0} - 1\Big|^{\gamma},\, 0\Big) \cdot \mathrm{Recall}$$
represents the product of recall and precision. $\mathrm{Precision}_0$ is determined according to the type of tracker box, and $\gamma$ is a regularization factor [7]. c) Performance evaluation: The precision and success score comparisons of SOT methods on the available RSV datasets are listed in Tables VII and VIII. DeepMAT [305] adopts

B. Multiple-Object Tracking
MOT methods associate the same objects across frames in a given sequence to generate optimal motion trajectories with object identities [82], [86], [90]. The categories of MOT methods are listed in Table IX, which explains their key characteristics in terms of detection hypotheses and detection-tracklet association. As with SOT, the end of this subsection introduces common tracking datasets and evaluation metrics for completeness.
1) Two-Stage Structures: Following the tracking-by-detection paradigm, these traditional methods cast MOT as a data association problem, in which detection hypotheses are associated into object trajectories [326]. The pipeline consists of two steps.
1) Preprocessing: Objects in a video sequence are detected by a pretrained image detector or background subtraction, and each object is described comprehensively using discriminative features, such as textures and structural features.
2) Multiframe data association: Target trajectories are assigned through data association among all targets in all frames, treating MOT as a multiframe multiobject association problem.
According to whether future frame information is required to process the current frame, these two-stage structures are divided into online and offline methods. The online method uses only the current and past frames to estimate the current object states, while the offline method also takes future frames as input to estimate object trajectories.
a) Online methods: These methods match the current frame detections with the previous tracklets until the end of the video sequence. The overall process is shown in Fig. 32. We divide these methods into motion-based, appearance-based, and object-interaction-based methods.
Motion-based models: In the object detection stage, GMPHD-SAR adopts the morphological operations and border tracking to extract object candidates from clutter-suppressed SAR video frames [327]. As shown in Fig. 33, SFMFMOT proposes an improved NMS module to combine FairMOT [89] with a slow-feature-based bounding box proposal extraction module for extracting object bounding boxes [82].
During the data association phase, the Kalman filter or other motion models are used to learn the trajectory features of different detections/pixels [82], [83]. They distinguish moving object trajectories and fill in missing detection parts, and prior information can be used to aid tracking. GMPHD-SAR adopts the Gaussian mixture probability hypothesis density (GMPHD) filter for tracking under the assumption that each target follows a linear Gaussian dynamic model [327]. Using the shadow characteristic of moving targets and road information in SAR video frames, SDT-SAR adopts a pretrained CNN and filters to complete tracking [328]. Structural constraint event aggregation (SCEA) exploits structural constraints to achieve data association [83], fusing data association costs with the assigned events to estimate the optimal assignment between well-tracked objects and detections. Besides, a structural constraint object recovery (SCOR) method is presented to recover missing objects between frames through the updated well-tracked objects and structural constraints.
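For reference, a minimal constant-velocity Kalman filter of the kind commonly used as an MOT motion model is sketched below; the state layout and noise levels are illustrative, not those of any cited tracker.

```python
import numpy as np

class ConstantVelocityKF:
    """Constant-velocity Kalman filter with state (cx, cy, vx, vy)."""
    def __init__(self, cx, cy, q=1.0, r=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])
        self.P = np.eye(4) * 10.0
        self.F = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                           [0, 0, 1, 0], [0, 0, 0, 1]], float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
        self.Q, self.R = np.eye(4) * q, np.eye(2) * r

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                       # predicted object center

    def update(self, z):
        """z is the center of the detection matched to this tracklet."""
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
```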
Appearance-based models: These models adopt the tracking-by-detection paradigm and focus on object appearance feature extraction. In the video frame preprocessing stage, ER-MOT proposes an adaptive resolution optimization (ARO) method [128]. It scales the image adaptively by applying the linear relationship between the gray value distribution (GVD) and the image size.
To capture discriminative features between similar detections, ER-MOT adopts HOG, local binary patterns, and RGB histogram features of the detections [128]. TC-MOT proposes a Siamese-based appearance model [79]. The overall tracking process is shown in Fig. 34, where HC and LC mean high confidence and low confidence, respectively. The tracker uses online transfer learning (OTL) to fine-tune the model parameters, making it suitable for specific tracking sequences. HMAR proposes a human mesh and appearance restoration method to extract 3-D appearance, pose, and location information of detections [329]; a transformer is then used to propagate the spatiotemporal information for learning associations across frames. IQHAT designs a target identification module to obtain the identity assignment probabilities of detections and a local target quantification (LTQ) module to obtain the density map [43]. An identity-quantity harmony (IQH) module is proposed to jointly optimize the two modules.
In the trajectory inference stage, the Hungarian algorithm and the Kalman filter can be employed to generate the final trajectories [43], [329], [330]. ER-MOT adopts the greedy bipartite graph technique to correlate previous tracklets with current detections [128]. It proposes a trajectory reliability assessment metric to eliminate incorrect samples, based mainly on the affinity between tracklets and detections. TC-MOT proposes a confidence-based data association method, which defines a tracklet confidence [79]. Tracklets with high confidence are associated locally with the current frame detections through the Hungarian algorithm, while low-confidence tracklets are associated globally with detections or other tracklets later.
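A minimal sketch of one such Hungarian association step is given below; the `affinity` function and the gating value are placeholders for whatever motion/appearance similarity a given tracker defines.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracklets, detections, affinity, min_affinity=0.3):
    """One online association step between tracklets and detections."""
    A = np.zeros((len(tracklets), len(detections)))
    for i, t in enumerate(tracklets):
        for j, d in enumerate(detections):
            A[i, j] = affinity(t, d)            # similarity in [0, 1]
    rows, cols = linear_sum_assignment(-A)      # maximize total affinity
    matches = [(i, j) for i, j in zip(rows, cols) if A[i, j] >= min_affinity]
    unmatched_trk = set(range(len(tracklets))) - {i for i, _ in matches}
    unmatched_det = set(range(len(detections))) - {j for _, j in matches}
    return matches, unmatched_trk, unmatched_det
```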
Instance segmentation can be adopted to extract objects that occupy a small proportion of the frame, which is a conventional measure in RSV [330], [331], [332]. It enhances the appearance representation of detections and offers a useful reference for RSV tracking. ODTS constructs a foreground GMM and a universal background GMM for each object to compute the corresponding confidence maps [331]. It adopts Lagrangian dual decomposition to combine the structured tracker with a video segmentation method. Inspired by PointNet [333], the PointTrack series treats each instance as a 2-D point cloud and the remaining regions as an environment point cloud [330], [332]. The randomly sampled point cloud data are combined with multiple data patterns composed of offsets, original RGB colors, and categories. Moreover, a point weighting layer is introduced into the foreground to summarize the instance features, and the final instance features are obtained from the foreground, environment, and position embeddings. PointTrackV2 adds the focal loss to the instance segmentation to settle the pixel-level class imbalance problem [332].
Object-interaction-based models: To learn object features and the relative position information between objects, these models use the interaction characteristics between the tracked object and its adjacent objects, combining object motion and appearance information to achieve better trajectory predictions [80], [81], [334]. In the appearance and motion model design stage, IMM-MPT computes a 4-D color histogram for detections in the color space to incorporate spatial information into the appearance model [81]. The processing flow is shown in Fig. 35(a), where PCHC means pedestrian color histogram computation. Besides, it proposes an IMM formed with the Kalman filter in Fig. 35(b), including a stationary model, a constant velocity model, and a constant acceleration/deceleration model. This tracker represents the data association as a weighted bipartite graph problem and uses the Munkres algorithm to produce the best assignments.
The tracking process can also be described as an optimization problem [6], [334]. BQP-MOT proposes a binary quadratic program to find each object position in the current frame, constrained mainly by individual object information and context cues [334]. It presents a modified Frank-Wolfe algorithm with SWAP steps to speed up the optimization and solve the objective function directly. JMDT-EM employs the gating technique to eliminate infeasible association hypotheses in the data association module [6]. Based on the expectation-maximization iterative optimization method, the tracker solves the optimization problem by alternately calculating the complete likelihood function and the tracking states. In particular, MLMRF models the data association as a reidentification (Re-ID) problem [80]. It combines LSTM with the local maximal occurrence Re-ID model [335] to build an appearance model and uses the Kalman filter for motion prediction. Besides, a label cost term is adopted to reidentify detections as existing objects, and a fast α-expansion algorithm solves the model optimization problem.
b) Offline methods: The overall process is shown in Fig. 36. These methods obtain the detections of the entire video sequence and then derive the final trajectories by performing global data association. The Kalman filter is often used to achieve global correlation [88], [117], [336], and approximate solutions have been proposed for the global association optimization model to balance memory and performance effectively [87], [129], [337], [338]. The current offline models are mainly divided into graph-based, network-flow-based, and iterative-approximation-based methods.
Graph-based models: They regard each detection as a node and the relationship between detections across frames as edge weights in a graph structure. The data association graph is then constructed from edges with high similarity [84], [85], [339]. IT-MOT exploits the interaction between nonassociable tracklets to improve tracker performance [84]. Its objective function is defined with unary and pairwise terms: the unary term measures the affinity between associable tracklets by integrating appearance, motion, and temporal consistency, while the pairwise term comprises close interaction (CI) and distant interaction (DI) terms. Quadratic pseudo-Boolean optimization (QPBO) is then used to approximate the optimal solution. GMI-MOT regards object localization as a Markov inference problem via a graphical model, designing the appearance and motion models as node potentials [339]; the edge potential is used to smooth the distance and angle of objects connected by the same edge. CCC regards MOT as a correlation co-clustering problem [85]. It combines top-down MOT with bottom-up motion segmentation and defines both in a graph structure. The tracker centers on the high-level concept of semantic objects and treats the combination of bounding boxes belonging to the same object as a correlation clustering problem, while the motion segmentation centers on the low-level concept of grouping pixels and treats the grouping of point trajectories as a correlation clustering problem in terms of pairwise potentials.
Network-flow-based models: The data association optimization is treated as a multidimensional assignment problem, that is, a one-to-one data mapping should be found between multiple sets [129]. Exploiting pairwise similarity, these models use linear programming, minimum energy functions, or greedy algorithms to solve data association problems [44], [337]. In the object detection design stage, JTA combines target shadow and echo energy information [86]. A cell-averaged CFAR and a modified OS-CFAR are proposed to detect target shadows in imagery and energy information in the range-Doppler spectrum domain, respectively. As shown in Fig. 37, TBC creates counting constraints with a spatiotemporal sliding window on the density map for object detection [337]. It integrates object appearance and motion information into the flow constraints to incorporate video context information. Besides, it designs a mixed-integer linear programming (MILP) problem, combining the object-count constraint with the flow constraints.
In the data association stage, JTA estimates the object state vector with different data association methods for different mode trajectories [86]. It also introduces the M/N-logic-based method to associate the two modules' information. HDA divides the data association into detection association and tracklet association [44]. It estimates the detection affinity by employing the object pose and appearance features in the detection associations. A Siamese tracklet affinity network (STAN) is proposed with the tracklet affinity to generate the final trajectory. It models the long-term object action dependence by LSTM and introduces a coherency-aware Siamese predictor to bidirectionally generate the unseen trajectory states for two tracklets.
Iterative-approximation-based models: Iteratively approximating the interframe assignment is adopted to solve for the global optimal solution, which correlates across the video sequence to construct trajectories. DCM-MOT generates low-level tracklets from detections through KLT [326]. As shown in Fig. 38, CNL and ML in the dynamic clustering block denote cannot-link and must-link constraints, respectively. The tracker adopts the Dirichlet process mixture model (DPMM) [347] to dynamically cluster tracklets and proposes two appearance representation models for rigid and nonrigid objects, namely a superpixel model (DPMM-SP) and a deformable part model (DPM²). TLMHT defines five categories of tracklet hypotheses with dummy detections and forms track-level associations by using the similarity between any two different detections within five frames [338]. An iterative maximum weighted independent set (MWIS) algorithm is proposed to solve the multiple-hypothesis tracking problem through a hypothesis category transfer model. Besides, a polynomial-time approximation (PTA) algorithm is introduced into the model optimization process, converting the MWIS problem in a hypothesis subset into a bipartite graph matching problem.
Tensor approximation can be exploited to solve the data association optimization [87], [129]. R1TA-MOT reshapes the optimization as a rank-1 tensor approximation problem and proposes a tensor power iterative method [87]. It captures higher order motion information through the assignment constraints inherited from the multidimensional assignment formulation. Dual-L1-MOT proposes a dual L1-normalized context/hypercontext-aware tensor power iterative optimization to obtain the detection correlation [129]. The final global trajectories are produced through the serial expansion of all batch associations.
2) One-Shot Structures: An end-to-end model is built to generate detections and corresponding trajectories, which mainly combines object detection methods with Re-ID or motion information to achieve tracklet association [348], [349], [350].
The spatiotemporal context information reflects the morphological changes of objects in different periods, which is particularly important for the subsequent trajectory inference [90], [91]. As shown in Fig. 39, PCAN distills a set of prototypes by clustering the spatiotemporal memory with a GMM [91]. It contains frame-level and instance-level prototype cross-attention modules to achieve a generalizable yet compact feature representation. TGarM regards MOT as a multitask learning problem based on graph spatiotemporal reasoning [90]. It calculates the edge weights between features through an attention mechanism and uses graph convolution network reasoning to obtain the current message. The current feature state is obtained with a readout function from the previous feature state and the current message.
To enhance task-related feature representation, CSTrack proposes a reciprocal network with a self-attention mechanism [92]. This network constructs self-relation and cross-relation weight maps to facilitate object detection. DHIAN adopts a Re-ID branch to extract appearance features and encodes detections through the historical locations of tracklets with corresponding time stamps [345]. GRN-MOT proposes two subnetworks to extract object state attributes, namely a global response generation (GRG) subnetwork and a motion displacement regression (MDR) subnetwork [45]. A logical inference methodology estimates object response values using the object states from past frames, and the regression subnetwork calculates the pixelwise offset.
For tracklet generation across detections, DHIAN proposes a GNN-based human interaction model to utilize the relative position information between tracked objects and their surrounding objects [345]. DAN performs data association by pairing permutation to calculate the affinity matrix between the current frame object features and the previously stored features and then generates reliable trajectories [344]. GRN-MOT proposes two matching approaches, namely target-independent and target-dependent matching [45]. The former uses a greedy matching algorithm based on center point distance to link objects, while the target-dependent matching minimizes the global CM to optimize the assignment.
MOT can be formulated with two homogeneous branches to obtain the current frame trajectories, namely a detection branch and a Re-ID branch [51], [89], [90], [92]. The overall process of FairMOT is shown in Fig. 40 [89], where DLA denotes the deep layer aggregation network used to extract video frame features. FairMOT-SAR applies this design to moving shadow tracking in SAR video with good performance [51]. For the Re-ID module, CSTrack proposes a scale-aware attention network (SAAN) with a spatial attention module and a channel attention module, enhancing the multiscale detection features and suppressing background noise [92]. The branch imbalance problem has been addressed with a set of detailed training schemes [89], [90]. TGarM proposes a multitask adversarial gradient learning strategy to make the loss gradients follow similar statistical distributions [90]. SiaBi-GRU proposes a tracklet cleaving and reconnection network for trajectory postprocessing, cutting impure tracklets and reconnecting identical ones [346].
Unlike the supervised models described above, a cross-input consistency self-supervised learning method is proposed in [46]. It computes detections on an unlabeled video corpus during preprocessing and proposes two input-hiding schemes to obtain learning signals, namely visual-spatial and occlusion-based hiding. A tracker is applied independently to the two input variations to derive tracked outputs, and consistency is enforced by backpropagating the similarity of the two results.
3) Datasets and Evaluation Metrics: Some RSV tracking datasets for multiple objects are listed in this subsection, covering different challenges in various data types. In addition, commonly used evaluation metrics are described to measure the performance of multiobject trackers more comprehensively, so that the characteristics and shortcomings of trackers can be determined in time for subsequent optimization. a) Datasets: Table X lists the characteristics of each dataset in terms of sequence number and total frames. The VisDrone2018 dataset is a large-scale drone video dataset for multiple vision tasks, which was filmed in 14 different Chinese cities [351]. In the MOT task, this dataset defines two tracklet tasks according to whether the tracker requires single-frame detections. It contains 56 sequences for training, 7 for validation, and 16 for testing, and the target categories include pedestrian, car, van, bus, and truck. The MOT task in the VisDrone2019 dataset merges the two tracklet tasks of VisDrone2018-MOT [352]. In the MOT task of the UAVDT dataset, 30 sequences are used for training and 20 for testing [114]; the training and test sequences use different shooting angles to prevent tracker overfitting. The VISO dataset uses the last seven of its 47 video sequences (658 tracklets and 89 509 bounding boxes) for testing in the MOT task [35].
b) Evaluation metrics: MOT evaluation incorporates multiple metrics to assess tracker performance comprehensively from different perspectives. The evaluation metrics are synthesized from several MOT datasets and methods, aiming to give a comprehensive grasp of the evaluation protocol [34], [35], [114], [351], [352].
The location term in the multiple object tracking precision (MOTP) can be expressed either as the total IoU between the true positives and the corresponding ground truths or as the total Euclidean distance between their center positions. The identification F1 score (IDF1) expresses the ratio of correctly identified detections over the average number of ground truths and computed detections. MT is recorded as the number of targets covered for more than 80% of their length, while ML counts those covered for less than 20%.
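As a concrete reference, the sketch below computes two standard MOT summary scores from pre-accumulated counts. The helper names are illustrative; MOTA, though part of the standard protocol, is not spelled out in the text above, and the IDF1 expression is its conventional identity-level F1 form.

```python
def mota(fp, fn, idsw, num_gt):
    """MOT accuracy: false positives, misses, and identity switches
    accumulated over all frames, normalized by total ground truths."""
    return 1.0 - (fp + fn + idsw) / max(num_gt, 1)

def idf1(idtp, idfp, idfn):
    """IDF1: correctly identified detections (IDTP) over the average
    number of ground truth and computed detections."""
    return 2.0 * idtp / max(2 * idtp + idfp + idfn, 1)

# e.g., mota(fp=120, fn=340, idsw=15, num_gt=5000) -> 0.905
```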

VI. POTENTIALS IN RSVS
Transformer has achieved beneficial results in both RS image and video fields [12], [13], [14], [15], [16], [17]. Single-object transformer tracking is especially prominent, with clearly improved performance [41], [42], [301], [307], [310]. There are still untapped potentials in RSV moving object detection and tracking tasks, such as the feature extraction of sparse foregrounds, the influence of complex background noise, and the utilization of spatiotemporal context information. The future developments of transformers in RSV moving object detection and tracking are discussed in this section.

A. Transformer in MOD
MOD comprises traditional background subtraction and deep learning methods [35], [36], [37], [38], [39]. The former uses the spatiotemporal correlation of the background together with the motion cues of objects; it is sensitive to texture changes and irregular object motion, and blurred RS scenes challenge model performance. The deep learning methods rely on object appearance and need to balance model performance and speed. In this subsection, we describe the development prospects of transformers from the perspectives of motion-based and appearance-based models.
1) Motion-Based Models: Background subtraction divides a video sequence into foreground, background, and noise and relies on interframe registration. The self-attention mechanism in transformer can perform a more accurate spatial mapping between moving and fixed images, which provides a sufficient guarantee for interframe registration [353]. The sparse background method models the background with rank minimization and the foreground with structured sparsity [257], [261], [262], [264], adopting motion information to ensure the continuity of moving targets [53], [121], [122]. Multihead attention can induce the model to learn context features interactively, which can help maintain continuous detection under irregular motion [94], [96], [97], [124], [235]. Multilevel attention feature aggregation and hybrid attention modules can improve the feature representation of foreground objects and suppress noise interference [143], [181], [186].
2) Appearance-Based Models: Image-object-detection-based methods focus on feature aggregation and motion information fusion. In feature aggregation, attention mechanisms or feature fusion blocks focus on moving objects [38], [116]. Nowadays, transformer variants focus on local feature areas to improve the expressive ability of local regions [183], [198], for example, by designing local attention mechanisms, stacking attention paradigms, or combining convolution and attention [179], [215], [220], [227], [246]. Interframe information fusion has so far been realized through convolutional networks [56], [116]. Transformer models the similarity of interframe information and has the advantage of global modeling when learning object dynamics in the video scene [96], [124], [137], [241]. Attention mechanisms have been used in RNN-based and tracking-based models to extract and enhance object semantic features effectively [39], [58], [59], and various attention mechanisms and transformer variants have been used in RS tasks to enhance features [2], [16], [17], [21], [168], [208]. They can assist the network in mining deeper feature information and extracting high-quality detections.

B. Transformer in Object Tracking
RSV object tracking aims to track objects marked in the first frame throughout subsequent frames. The development prospects of transformers for SOT and MOT methods are discussed in the following subsections.
2) Multiple-Object Tracking: Accurate detections and combined modal features are significant for tracker robustness. Transformers perform well in RS image tasks, especially detection and segmentation [11], [14], [22], [108]. They can assist or replace detectors to improve MOT model accuracy.
In the data association stage, the Hungarian algorithm and the Kalman filter are widely used. Transformer, as a global context model, has good prospects for computing the optimal detection association. The two-stage and single-shot models are discussed in this subsection. a) Two-stage models: In the object detection process, online methods mainly rely on the object detector for bounding box extraction. They use spatiotemporal aggregation of object features and instance segmentation methods to enhance the appearance representation [329], [330], [331], [332]. Besides, distinguishing similar objects in different ways lays a solid foundation for subsequent trajectory-level associations [43], [79], [128]. To obtain more accurate detections and reduce the impact of incorrect data associations, RS transformers can extract finer object detection boxes and suppress background noise [14], [15], [105].

VII. TEN OPEN CHALLENGES WITH TRANSFORMER IN RSV
RS transformer development has gradually grown while facing some optimization, interpretability, efficiency, and versatility challenges. Fig. 41 depicts the open problems faced by transformer and RSV. It includes, but is not limited to, transformer interpretability, brain-inspired and physics-informed transformer, transformer with causal inference and few-shot learning, efficient and multimodal transformer, multiobjective optimization with transformer, multiscale geometric network with transformer, and transformer in RS tasks. They are introduced in detail as follows.

A. Transformer Model Interpretability
It is found that the attention heads in small transformers are interpretable and learn context information, while interpretability becomes more complicated for multilayer models with high isolation costs [146], [147]. Intuitively, transformers can attend to more of the input in a certain way and perform an approximately global analysis. Some brain studies explain how the brain works by adding perturbations to parts of the brain [359]; analogously, we can perturb part of the model to analyze the inner mechanism of transformer. In addition, the inputs to an attention head can interact in different ways to generate more complex behaviors with better performance [17], [151], [158], [206], [222], [223]. Therefore, it is necessary to explore the internal structures fundamentally to exploit the advantages of transformers more proficiently in RSV, for example, by explaining each transformer module from different perspectives or by comprehending transformer through feature visualization, influence functions, or saliency maps.

B. Brain-Inspired Transformer
Neural networks treat functions as computational properties and are trained to learn external representations for adapting to tasks [360], yet they remain largely dependent on the input, without the ability to understand deep logical semantics such as object concepts or structural and causal scene understanding. This leads to poor generalization ability and pushes the networks into a bottleneck period [361]. How to successfully adopt biological plausibility to improve network performance has therefore become an unavoidable topic, and progress can be made through the study of brain anatomy and physiology [362].
Biologically realistic neural network architectures perform best at representing fundamental dynamics [360]. Transformer can replicate the spatial representation of the hippocampal structure accurately after being equipped with a recursive positional encoding [32], [33], suggesting that transformer resembles the human hippocampus without the aid of any biological knowledge. Moreover, transformers can significantly improve the ability of neural networks to mimic the various calculations performed by grid cells and other parts of the brain. This lays a biological foundation for transformer studies and makes them more valuable for research. Current networks still need to provide more information about neural representation and brain cognition, and we could continue drawing inspiration from the brain cognition and neuroscience fields [363].

C. Physics-Informed Transformer
Embedding physical information into models is already a popular trend in several fields [364], [365]. Quantum evolutionary algorithms, inspired by quantum theory, have been widely used in multiobjective evolutionary algorithms [366], [367], [368], [369], [370], [371], [372]. Training a neural network is a nonconvex optimization problem driven by the interaction and evolution of millions of parameter weights, which can be viewed as analogous to the interaction of a large number of physical molecules. Physics-inspired models have accordingly been proposed one after another [373], [374], [375].
Physics-informed transformers have developed rapidly. Wave-MLP improves the token representation for distinguishing semantic information in different images according to the wave-particle duality of quantum mechanics [375]. It represents each token in wave-function form, with a phase and an amplitude, and dynamically aggregates tokens according to their semantic information. Physics-informed models may process high-dimensional data better but can be slow to solve, which still needs to be researched. More physical information should be integrated into transformers by learning the distribution laws of the data, so as to achieve better performance in a shorter training time.
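The sketch below illustrates the phase-amplitude idea described above in its simplest form: tokens are lifted to complex waves and mixed across the token dimension. It loosely follows the spirit of Wave-MLP, but the projection matrices, mixing scheme, and toy data are our own assumptions, not the paper's exact formulation.

```python
import numpy as np

def wave_token_mixing(tokens, w_theta, w_mix):
    """Represent each token as a wave a * exp(i*theta) and mix tokens.

    tokens:  (n, d) real token features, used here as amplitudes.
    w_theta: (d, d) projection that estimates a per-channel phase.
    w_mix:   (n, n) token-mixing weights (a learned "Token-FC").
    Returns the real part of the complex-valued aggregation.
    """
    amplitude = np.abs(tokens)                 # amplitude encodes content
    theta = tokens @ w_theta                   # phase modulates semantics
    waves = amplitude * np.exp(1j * theta)     # complex wave representation
    mixed = w_mix @ waves                      # aggregate across tokens
    return mixed.real                          # back to real-valued features

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))             # 16 tokens, 32 channels
out = wave_token_mixing(x, 0.1 * rng.standard_normal((32, 32)),
                        0.1 * rng.standard_normal((16, 16)))
print(out.shape)  # (16, 32)
```

Because the phase can rotate a token's contribution toward or away from its neighbors, two tokens with equal amplitude can reinforce or cancel, which is how the wave view encodes semantic agreement.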

D. Integration of Causal Inference With Transformer
Causal inference is divided into three stages: association, intervention, and counterfactuals [376]. It estimates causal relationships from observational data, which helps ensure that the results are correct and unbiased. Besides, it has great potential for exploring how attributes influence model predictions, thereby promoting the development of deep learning models [374], [377].
Visual transformer research must pursue both accuracy and manageable computational complexity. Most methods model only the correlations between features, resulting in limited causal reasoning ability. Developing transformers with causal reasoning capabilities therefore helps expose the underlying mechanism of the model and improves interpretability, paving the way toward general models. Knowledge graphs for causal inference have been built on top of transformers, providing logical evidence for the final prediction [378], [379], [380]. How causal reasoning can improve architecture performance is still an open problem; causal intervention could be added to the transformer to deal with spurious correlations.
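As a toy illustration of causal intervention, the snippet below performs backdoor adjustment on a made-up discrete example; the probabilities and variable names are entirely hypothetical and only show how adjusting for a confounder removes a spurious correlation.

```python
import numpy as np

# A confounder z (e.g., scene context) influences both an attended
# feature x and the label y. All values below are fabricated.
p_z = np.array([0.7, 0.3])                    # P(z)
p_y_given_xz = np.array([[[0.9, 0.1],         # P(y | x=0, z=0)
                          [0.4, 0.6]],        # P(y | x=0, z=1)
                         [[0.5, 0.5],         # P(y | x=1, z=0)
                          [0.2, 0.8]]])       # shape (x, z, y)

# Backdoor adjustment: P(y | do(x)) = sum_z P(y | x, z) P(z),
# which blocks the spurious path through the confounder.
p_y_do_x = np.einsum('xzy,z->xy', p_y_given_xz, p_z)
print(p_y_do_x)  # rows: do(x=0), do(x=1)
```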

E. Efficient Transformer
Achieving high performance with low-cost strategies can improve transformer effectiveness and computational efficiency. At the same time, energy consumption and efficiency are usually related, and determining the balance between them is a meaningful future research topic. We discuss lightweight networks and neural architecture search (NAS) as two routes to deploying efficient transformers.
Given the performance of transformers on various tasks, practical transformers have been designed through NAS [394], [395], [396], [397]. A current problem lies in the interpretability of NAS; in addition, model designs remain limited by existing structural design experience. Finding innovative elements in the search space, reducing costly parameter optimization, and eliminating the manual configuration of all parameters are challenges for the future.
Compared with lightweight models based on conventional neural networks, these transformers achieve similar or even higher accuracy [219], [405], [406], [407], [408], yet there is still room for improvement in parameter counts and floating-point operations. Balancing speed against accuracy and achieving good results on resource-constrained devices, such as mobile devices, remain essential directions for future research [409].
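One representative low-cost strategy is to linearize self-attention so that cost grows linearly rather than quadratically with the number of tokens. The sketch below follows the kernel feature-map formulation of linear transformers (Katharopoulos et al., 2020); it is one published technique among many, and the tensor shapes and toy input are illustrative.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """O(n) attention via a kernel feature map.

    q, k, v: (batch, tokens, dim). Replacing softmax(QK^T)V with
    phi(Q) (phi(K)^T V) lets the K^T V product be computed once,
    avoiding the (tokens x tokens) attention matrix entirely.
    """
    phi = lambda x: torch.nn.functional.elu(x) + 1        # positive features
    q, k = phi(q), phi(k)
    kv = torch.einsum('bnd,bne->bde', k, v)               # (dim, dim) summary
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)

x = torch.randn(2, 1024, 64)
out = linear_attention(x, x, x)                           # linear in tokens
print(out.shape)  # torch.Size([2, 1024, 64])
```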

F. Multiobjective Optimization With Transformer
In the real world, we often encounter problems in which two or more conflicting objectives must be optimized simultaneously while a set of constraints is satisfied, such as maximizing the receiver operating characteristic convex hull in machine learning [410], [411], [412], [413], [414]. Such problems are called multiobjective optimization problems and arise widely in many fields [415], [416], [417], [418], [419]. Several evolutionary algorithms have been proposed to solve them [415], [420], [421], [422], [423], but their performance still needs to improve when applied to transformer optimization involving very many objectives, and iterative optimization further increases the computational complexity of the model.
Many real-world industrial applications and scientific problems, including transformers, exhibit time-dependent behavior [93], [101], [237], [424], [425], [426]. Dynamic multiobjective optimization has therefore received increasing attention; it is characterized by objective functions, constraints, and associated parameters that change over time [427], [428], [429]. The current difficulties are how to converge rapidly to the new true Pareto-optimal front and how to find a widely distributed set of Pareto-optimal solutions as the transformer environment changes.
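For readers less familiar with this vocabulary, the snippet below shows the core primitive behind all of these methods: identifying the nondominated (Pareto-optimal) subset of candidate solutions. The candidate scores are fabricated purely for illustration.

```python
import numpy as np

def pareto_front(objectives):
    """Return indices of nondominated points (all objectives minimized).

    objectives: (n, m) array, one row per candidate, e.g., a transformer
    configuration scored by error rate, latency, and parameter count.
    A point is dominated if another point is no worse in every objective
    and strictly better in at least one.
    """
    n = objectives.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        if not keep[i]:
            continue
        dominated_by = (np.all(objectives <= objectives[i], axis=1) &
                        np.any(objectives < objectives[i], axis=1))
        if dominated_by.any():
            keep[i] = False
    return np.flatnonzero(keep)

# Toy tradeoff: (error, GFLOPs) for five hypothetical model variants.
scores = np.array([[0.12, 4.0], [0.10, 6.5], [0.15, 2.0],
                   [0.11, 6.0], [0.13, 4.5]])
print(pareto_front(scores))  # indices of nondominated configurations
```

In the dynamic setting described above, this front must be re-estimated each time the objectives drift, which is exactly where fast convergence becomes critical.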

G. Multiscale Geometrical Neural Networks With Transformer
The wavelet scattering network uses wavelet filters and is a feature extraction network that sits between traditional image recognition and deep learning, with a structure highly similar to a CNN [430], [431]. It is theoretically supported by rigorous mathematics and signal processing, performs well under few-shot learning, and guarantees translation invariance and deformation stability. The multiscale geometrical neural network (MGNN), developed from the wavelet scattering network, additionally offers rotation awareness, directionality, and self-adaptive ability [432], [433].
Many methods combine neural networks with multiscale geometric analysis, mainly in two ways. One applies a multiscale geometric analysis tool as a transformation in the feature space to extract features and then feeds the extracted feature vectors into a neural network for processing [433], [434], [435], [436], [437], [438]. The other directly uses a parallel MGNN with directional bases [8], [439]. In the future, transformers can be combined with these tools to develop new multiscale geometric analysis methods and to construct parallel MGNNs with directionality. Choosing an appropriate MGNN for each task is also a future research direction. Spatiotemporal information processing in the video field is computationally heavy, so combining transformers with MGNNs to obtain fast and practical models is likewise an important research topic.
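To show what the first path looks like in code, here is a minimal first-order scattering transform in which oriented Gabor filters stand in for directional wavelets; the filter sizes, frequencies, and averaging window are illustrative choices, not those of any cited work.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size, freq, angle, sigma=3.0):
    """A simple oriented Gabor filter, standing in for a directional wavelet."""
    ax = np.arange(size) - size // 2
    yy, xx = np.meshgrid(ax, ax, indexing='ij')
    rot = xx * np.cos(angle) + yy * np.sin(angle)
    gauss = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return gauss * np.cos(2 * np.pi * freq * rot)

def scattering_first_order(image, freqs=(0.2, 0.1), n_angles=4, size=15):
    """First-order scattering: modulus of wavelet responses, then averaging.

    The modulus discards phase (yielding translation invariance at the
    averaging scale) while the oriented filters retain directionality.
    """
    low_pass = np.ones((size, size)) / size**2           # crude averaging
    feats = [fftconvolve(image, low_pass, mode='same')]  # zeroth-order path
    for f in freqs:
        for k in range(n_angles):
            psi = gabor_kernel(size, f, np.pi * k / n_angles)
            u = np.abs(fftconvolve(image, psi, mode='same'))  # modulus
            feats.append(fftconvolve(u, low_pass, mode='same'))
    return np.stack(feats)                               # (paths, H, W)

img = np.random.default_rng(0).random((64, 64))
print(scattering_first_order(img).shape)  # (9, 64, 64): 1 + 2 freqs x 4 angles
```

The stacked feature maps would then be fed to a downstream network, exactly as in the first combination strategy above.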

H. Few-Shot Learning With Transformer Based on Knowledge- and Data-Driven Models
Inspired by the human visual system, few-shot learning designs models with solid generalization ability from fewer training samples [440], [441], [442], [443], [444], addressing settings in which little training data can be obtained. Transformer-based few-shot learning methods have been proposed one after another [445], [446], [447]. For example, HCTransformer explores how ViT can be applied to few-shot learning tasks [445]: it adopts a hierarchically cascaded transformer with a knowledge distillation framework and designs an attribute-surrogate supervision method to learn from the information in labeled data.
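For intuition about how transformer embeddings can be used in this regime, the sketch below applies the standard prototypical-network classifier (Snell et al., 2017) on top of hypothetical embeddings; it is not HCTransformer's method, and the encoder, shapes, and toy episode are assumptions.

```python
import torch

def prototypical_predict(support, support_labels, query, n_classes):
    """Few-shot classification by nearest class prototype.

    support: (n_support, d) embeddings, e.g., from a frozen ViT encoder.
    query:   (n_query, d) embeddings of unlabeled samples.
    Each class prototype is the mean of its support embeddings; queries
    are assigned to the closest prototype in Euclidean distance.
    """
    protos = torch.stack([support[support_labels == c].mean(0)
                          for c in range(n_classes)])    # (n_classes, d)
    dists = torch.cdist(query, protos)                   # (n_query, n_classes)
    return dists.argmin(dim=1)

# Toy 3-way 5-shot episode with 16-D embeddings.
torch.manual_seed(0)
support = torch.randn(15, 16)
labels = torch.arange(3).repeat_interleave(5)
query = support[::5] + 0.1 * torch.randn(3, 16)          # near class samples
print(prototypical_predict(support, labels, query, 3))   # expected: 0, 1, 2
```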
Knowledge- and data-driven models need to be combined so that the model both reasons logically and learns rules from data, and several challenges remain. On the knowledge-driven side, quickly mastering a large body of human commonsense knowledge and letting the model learn it automatically are open challenges, for example, when facing environments with ambiguous conditions. At the data level, small sample sizes, low-resolution images, and complex target relationships within an image cannot guarantee a good learning effect. Moreover, balancing the training time and performance of the model and achieving commercially viable accuracy with transformer-based few-shot learning are also significant topics for future research.

I. Multimodal Transformer
A multimodal transformer receives a variety of inputs with distinct characteristics and can generate additional modal data, which makes more complex intelligent tasks possible [354], [448], [449], [450], [451]. It realizes perception and interaction between modalities through the mutual fusion of information, indicating that the transformer has the potential to serve as the basis of a general intelligent agent. Xu et al. [448] describe the challenges of multimodal transformers, including modal fusion, region-level alignment, and versatility, which offers valuable research inspiration.
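A common fusion pattern behind the interaction described above is bidirectional cross-attention, sketched below with PyTorch's built-in multihead attention; the module layout, dimensions, and modality names are illustrative assumptions, not the design of any cited system.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse two modalities with cross-attention: each stream queries the other.

    A minimal sketch of the fusion pattern; real multimodal transformers
    add positional encodings, multiple layers, and task-specific heads.
    """
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.a2b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # Stream A attends to stream B, and vice versa (residual updates).
        a, _ = self.a2b(feat_a, feat_b, feat_b)
        b, _ = self.b2a(feat_b, feat_a, feat_a)
        return self.norm_a(feat_a + a), self.norm_b(feat_b + b)

# Toy usage: 100 optical tokens fused with 50 tokens from another sensor.
optical = torch.randn(2, 100, 64)
radar = torch.randn(2, 50, 64)
fused_a, fused_b = CrossModalFusion()(optical, radar)
print(fused_a.shape, fused_b.shape)
```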
For tasks on different modalities, a specific learning strategy must be designed for each study because of the massive gap between the learning tasks, which otherwise leads to insufficient model fusion. The multimodal transformer is also limited to imitating the apparent abilities of the brain without drawing on human cognitive research, which tends to produce mere data fitting. A general multimodal transformer, in turn, entails a more complex design of model parameters, so the tradeoff between model generality and computational cost will become a significant challenge in the future.

J. Transformer in RS Tasks
In this subsection, we focus on RS change and anomaly detection, object detection, and tracking, which play critical roles in detecting and preventing nonagricultural land-use events as well as in air defense and surveillance.
1) Transformer for Change and Anomaly Detection: Compared with hand-crafted methods, CNN-based RS change detection methods can robustly model some complex change types [202], [204], [206], [452], [453], [454], [455], [456], [457], [458], [459]. Transformers show excellent potential for change detection, with several open challenges [3], [16], [21], [110], [207]: how to handle input images with different resolutions, how to reduce data dependence in diverse scenarios such as class imbalance, and how to capture more semantic information and fully exploit the spatiotemporal context without increasing the model parameters. In addition, research on settings such as the open world will improve model flexibility and stability. Using low- and high-level features with transformers to generate more discriminative information and to improve robustness against pseudo-change information is also an important future research topic.
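To illustrate one way a transformer can exploit spatiotemporal context for change detection, the sketch below jointly attends over patch tokens from two dates and scores per-location change; this is a generic bitemporal fusion pattern with assumed shapes, not the architecture of any cited method.

```python
import torch
import torch.nn as nn

class BitemporalChangeHead(nn.Module):
    """Sketch of transformer-based change scoring on bitemporal features.

    Patch features from the two dates are concatenated along the token
    axis so that self-attention can relate positions across time; a
    per-token head then scores change at each spatial location.
    """
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=2 * dim,
                                         batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.time_embed = nn.Parameter(torch.zeros(2, 1, dim))
        self.head = nn.Linear(dim, 1)

    def forward(self, t1_tokens, t2_tokens):
        # Tag each date with a learned temporal embedding, then attend jointly.
        x = torch.cat([t1_tokens + self.time_embed[0],
                       t2_tokens + self.time_embed[1]], dim=1)
        x = self.encoder(x)
        n = t1_tokens.shape[1]
        # Change score per location from the difference of refined tokens.
        return self.head(x[:, :n] - x[:, n:]).squeeze(-1)

t1, t2 = torch.randn(2, 196, 64), torch.randn(2, 196, 64)
print(BitemporalChangeHead()(t1, t2).shape)  # torch.Size([2, 196])
```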
RS image anomaly detection aims to find unusual objects or pixels, such as trees, aircraft, or rare minerals, without prior knowledge of abnormal samples [9], [460], [461], [462], [463], [464], [465], [466], [467], [468], [469]. Combining such models with transformers to obtain robust feature extraction, and thereby suppress the influence of complex pseudo changes, is a future research topic. Anomaly detection also shows huge performance differences across scenarios, so model versatility has become an essential research direction. Besides, the latent value of the data needs to be mined to design robust methods that are not vulnerable to deception algorithms [470]. Meanwhile, transformer-based video anomaly detection methods have developed rapidly, and enabling the model to select anomalous video segments adaptively is a further research direction [100], [471], [472].
2) Transformer for Object Detection and Tracking: The low spatial resolution, complex background, and small object sizes of RSV make both intraframe and interframe information important [56]. Combining transformers, which capture global context information, is therefore a promising research direction. The local redundancy of video data introduces many repeated calculations, which transformers can alleviate by capturing long-distance dependencies [145]. Real-time performance is essential for military activities and urban monitoring and remains a problem to be dealt with in the future [34].
As a particular case of semantic video segmentation, MOD focuses on segmenting the foreground objects. Frame differencing and background subtraction, which rely on motion information, are sensitive to irregular motion and texture changes [116], and handling complex, rapidly changing natural RS scenes is even more challenging [58]. Appearance-based neural networks lack semantic discrimination for motion artifacts, which makes rich spatiotemporal semantic information crucial [56].
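To ground the motion-based baselines just mentioned, the sketch below implements plain frame differencing and a running-average background model on synthetic data; the threshold and update rate are illustrative, and the simplicity of the decision rule makes clear why texture change or irregular motion produces false foreground.

```python
import numpy as np

def frame_difference_mask(frames, threshold=25):
    """Motion mask from simple temporal differencing.

    frames: (T, H, W) grayscale sequence. Pixels whose absolute difference
    between consecutive frames exceeds the threshold are marked moving;
    any intensity variation above threshold fires, motion-related or not.
    """
    diffs = np.abs(np.diff(frames.astype(np.int16), axis=0))
    return (diffs > threshold).astype(np.uint8)    # (T-1, H, W) binary masks

def running_average_background(frames, alpha=0.05, threshold=25):
    """Background subtraction with an exponentially updated background model."""
    bg = frames[0].astype(np.float64)
    masks = []
    for f in frames[1:]:
        masks.append((np.abs(f - bg) > threshold).astype(np.uint8))
        bg = (1 - alpha) * bg + alpha * f          # slow background update
    return np.stack(masks)

video = np.random.default_rng(0).integers(0, 256, (8, 32, 32))
print(frame_difference_mask(video).shape,
      running_average_background(video).shape)     # (7, 32, 32) each
```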
Single-object transformer trackers have developed well in RSV [41], [42], but some challenges remain in multiple object tracking (MOT). Online methods suffer from model drift, irregular motion, similar appearances, and occlusion, making it impossible to recover correct associations from early errors [43], [79], [80], [82], [83]. Offline methods adopt different local and global optimizations based on accurate detections, which brings higher computational costs [44], [84], [85], [86], [87], [88]. Natural video object trackers can be migrated to RSV without considering the characteristics of RS data [90], but doing so may squander the trackers' advantages, resulting in many false alarms or lost targets [82].

VIII. CONCLUSION
This article has summarized, and looked ahead to, the role of transformers in RSV moving object detection and tracking. It deepens the understanding of RS transformers, covering the constraints of input mapping, the range of the receptive field, the approximation and combination of attention modules, and efficient model construction with low redundancy and high inference speed. RSV moving object detection and tracking methods have been summarized together with their characteristics and limitations, and the corresponding RSV datasets and evaluation indicators have been introduced to promote detection and tracking research. The sequential nature of the transformer drives its expansion into the video field and provides a valuable reference for RSV interpretation. In future research, the potential of transformers will drive extensive work in RSV detection and tracking, and diverse RS datasets with corresponding evaluation indicators will further promote robust moving object detection and tracking.