Video Sparse Transformer with Attention-guided Memory for Video Object Detection

Detecting objects in a video, known as Video Object Detection (VOD), is challenging since appearance changes of objects over time may bring detection errors. Recent research has focused on aggregating features from adjacent frames to compensate for the deteriorated appearances of a frame. Moreover, using distant frames is also proposed to deal with deteriorated appearances over several frames. Since an object’s position may change significantly at a distant frame, they only use features of object candidate regions, which do not depend on their position. However, such methods rely on object candidate regions’ detection performance and are not practical for deteriorated appearances. In this paper, we enhance features element-wisely before the object candidate region detection, proposing Video Sparse Transformer with Attention-guided Memory (VSTAM). Furthermore, we propose aggregating element-wise features sparsely to reduce processing time and memory cost. In addition, we introduce an external memory update strategy based on the utilization of the aggregation to hold long-term information effectively. Our method achieved 8.3% and 11.1% accuracy gain from the baseline on ImageNet VID and UA-DETRAC datasets. Our method demonstrates superior performance against state-of-the-art results on widely used VOD datasets.


I. INTRODUCTION
Video object detection (VOD) extends still-image object detection [1], [2] to videos. Still-image object detectors struggle to detect objects stably in a video because object appearance changes over time. A video, however, carries rich temporal information: the same object may appear in multiple frames for a certain period. Incorporating temporal information into detectors has therefore been proposed to improve accuracy. The mainstream approach in recent years is feature refinement that considers spatiotemporal information [3]-[5]. It aggregates useful features from surrounding frames to compensate for the deteriorated features of the target frame. FGFA [3] and MANet [6] utilize optical flow, whereas TSSD-OTA [7] and other methods [8], [9] exploit recurrent neural networks to propagate features from neighboring frames.
Recently, leveraging frames distant from the target frame has been proposed [4], [5], [10], because methods that consider only neighboring frames [3], [6], [7] struggle when deteriorated appearances persist over several frames. When distant frames are used, object misalignment becomes an issue due to significant changes in object position. Existing methods therefore focus on aggregating object candidate region features generated by Region Proposal Networks (RPN) [1]. This allows features to be aggregated independently of object positions; however, it cannot suppress false-negative detections, since it assumes that the object candidates have already been detected. To recover missed detections, feature refinement before object candidate detection is necessary. Besides, memory consumption becomes crucial when leveraging distant frames if a static sliding window is used, or if an external memory is used that is updated randomly [11] or in order [5], [12]. Accordingly, adaptively holding the most vital frame features based on their utilization in aggregation is preferable.

FIGURE 1: An example of the behavior of the VSTAM framework. It spatiotemporally aggregates related information from past frames, including ones in the external memory, to refine the target frame's features. The orange and yellow arrows represent the highly related positions between frames.
Motivated by the above observations, we propose Video Sparse Transformer with Attention-guided Memory (VSTAM), which sparsely refines features at the element level, considering both nearby and distant dependencies. It refines features element-wise by considering all features in the sampled frames before object candidate detection. To avoid time-consuming and memory-intensive refinement, we propose a sparse aggregation method that takes the redundancy of a video into account. Moreover, VSTAM possesses an external memory that adaptively holds the most vital frame features. Figure 1 illustrates the concept of the proposed method. The proposed method uses sparse attention to aggregate the features associated with each element from multiple locations and frames, and sequentially retains the more vital frame features in the external memory. Even if the features of some objects or frames are degraded, it uses features from other locations in other frames for aggregation. Moreover, if valuable frame features are in the sliding window, they can be moved into the external memory and aggregated later. Despite its structural simplicity, VSTAM outperforms existing methods on ImageNet VID [13] and UA-DETRAC [14].
The contributions of our paper are as follows.
• We propose a spatiotemporal feature enhancement framework with an adaptively updated external memory for video object detection, which sparsely refines the features at the element level, considering both nearby and distant dependencies.
• To realize element-level aggregation efficiently, we propose a video sparse transformer that learns the aggregation of sparsely distributed features in space and time.
• With a simple but effective feature enhancement framework, we achieved SoTA results using ResNet-101 in online settings, with 85.7% mAP on ImageNet VID and 90.39% AP on UA-DETRAC. We also readily achieved a larger accuracy improvement than SoTA methods on the more complex YouTube-VIS dataset.

A. VIDEO OBJECT DETECTION
VOD extends still-image object detection [1], [15], [16] to tackle video issues such as appearance changes over time. Methods can be categorized into two groups: box- and feature-level methods. Box-level methods exploit tracking [17] and tubelets [18], [19] to associate related boxes, and IoU [20] over time to create temporal links. Despite improvements, they need to detect objects in most frames to associate detection results. Besides, they cannot be trained in an end-to-end manner, or they require high computational costs.
Feature-level methods, on the other hand, enhance detection frame features with surrounding ones. According to the temporal duration they utilize, they can be divided into two subcategories: short- and long-term feature refinement.
Short-term feature refinement methods exploit optical flow [3], [6], [21], recurrent neural networks [7], [8], and attention mechanisms [22]-[25]. Although they can improve detection by using nearby frames to enhance the whole features, the significant misalignment between frames impedes feature refinement when distant frames come in. Therefore, it is difficult for them to deal with video issues such as motion blur lasting for multiple frames.
Long-term feature refinement methods [4], [5], [10] leverage distant frames to overcome multiple deteriorated frames. They mainly consider object-level features robust against object misalignment between frames. SELSA [4] considers the semantic impact between related object candidate regions in all the frames. RDN [10] distills relation through repeatedly refining supportive object proposals with high confidence. EBFA [26] proposes a temporal and spatial alignment module to refine object-wise features. Furthermore, HVR-Net [27] proposes to consider the relation of object-level features among different videos. MEGA [5] considers both local and global aggregation in time to enhance the feature representation. These methods, however, employ object-wise aggregation after the region proposals, which heavily rely on detection accuracy. On the contrary, our proposed method refines the whole features element-wisely, considering spatiotemporal information before detection. Table 1 briefly summarizes these methods.

B. EXTERNAL MEMORY FOR VOD
External memory has been studied in the feature-level method and can be classified into two categories based on updating strategy. The first category employs the first-in, first-out strategy, which utilizes the memory to extend past features [5], [12]. However, they do not consider the importance of each feature to keep.
The second one dynamically updates the external memory depending on the specific sampling strategy, and our approach falls into this category. OGEMN [11] proposes a top-down object-guided strategy, which computes features that give a high confidence level of belonging to the detected objects and selects the higher ones to store. MAMBA [28] employs a random sampling strategy considering video redundancy and proposes feature-wise deleting to remove redundant features for efficient computation. Our method differs from the above methods in that it selects features based on the sum of the attention weights of the aggregated elements at the frame level, in a more straightforward bottom-up manner. We summarize these methods in Table 2.

VOLUME 4, 2016. This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

TABLE 1 (columns: Method | Methodology | Highlights | Limitations)
(not recovered) | (not recovered) | Feature refinement, regardless of object misalignment across frames. | Fixed sliding window.
SELSA [4] | Refining features after RPN considering a whole video sequence. | Tolerance to appearance changes of objects for a while. | Ignoring the balance of nearby and distant frames.
(not recovered) | (not recovered) | More precise feature alignment among frames. | Fixed sliding window.
MEGA [5] | Refining features after RPN by separating nearby and distant frames. | Considering the balance of nearby and distant frames. | Difficult to deal with false-negative detections.
HVR-Net [27] | Refining features considering the relation of multiple videos. | Mutual correction with multiple videos. | Difficult to deal with false-negative detections.

C. TRANSFORMER
Transformer [29] is an architecture for learning sequential data dependency. The vanilla transformer [29] and the similar non-local network [30] are powerful models; however, their computational cost becomes prohibitive for large tensors due to the considerable sequence length and resolution. Some attempts have been reported to reduce the cost by making the transformer's self-attention map sparse [31]-[33]. Sparse masks, such as sliding windows, enable a transformer to skip computation at masked-out positions [31], [32]. Although these masks work well on NLP tasks [31], [33] and on a single image [32], we cannot directly apply them to a video sequence due to spatial and temporal constraints. Our proposed video sparse attention properly captures long-range dependencies in space and time with reasonable computational cost and memory consumption. Recently, TimeSformer [34] and SSTVOS [35] have been proposed to make a transformer sparse for video classification and video object segmentation, respectively. However, TimeSformer considers only the element-wise temporal information at the same position over multiple frames, whereas SSTVOS considers only temporally and spatially local elements. Hence, they are vulnerable to significant object motion. In contrast, our method also accounts for objects moving across distant frames by incorporating randomness. We summarize the highlights and limitations of these methods in Table 3.
DETR [16] and Deformable DETR [36] have recently been proposed for still-image object detection by utilizing Transformer. Our method differs from these methods because it considers spatiotemporal information for feature enhancement. Furthermore, their extension to VOD has been proposed [37]. However, since it neither considers temporal information for feature refinement nor leverages distant frames adaptively, the temporal information is not effectively utilized.

III. PROPOSED METHOD
Our proposed VSTAM is depicted in Figure 2. It consists of five components: feature embedding with frame selection, encoder, decoder, detection modules, and external memory. First, to feed short- and long-range temporal information, we collect both nearby and distant frames. The feature embedding module then extracts features. The features are further compressed and flattened into one dimension [16], and concatenated in timeline order with the features in the external memory to form a one-dimensional sequence. The sequence, together with positional encoding, is passed to the encoder module to exploit the long-range sequential dependency among frames. Next, the high-level encoded features and the positioned frame query are passed into the decoder module for aggregation to obtain enriched features, which are passed to the detection module for object detection. Finally, the features to be kept are selected for the next-frame detection. The external memory is updated based on the attention weight of each frame in the decoder to utilize the features of essential frames in the distant frame window.

A. FRAME SELECTION FROM SHORT- AND LONG-RANGES
Given video frames $\{I_t\}_{t=1}^{T}$ with $I_t \in \mathbb{R}^{H_0 \times W_0 \times C_0}$, where $T$ is the length of a video and $H_0$, $W_0$, and $C_0$ respectively denote the height, width, and number of channels, our goal is to detect objects in the current frame $I_k$ (at time $k$) with reference frames $R_k$, where $|R_k| = m$ for a given $m$. Reference frames are used for aggregation to obtain the enriched feature of the current frame.
To capture the long-term temporal dependencies of a video, we need to collect reference frames from short- and long-term periods. For a given $m$ (the number of past frames used for aggregation), we define the set $S_{sparse}$ of time differences from the current frame (Fig. 3b). We then define $R_k = \{I_{k-n} \mid n \in S_{sparse}\}$. In this way, we can effectively collect nearby and distant frames as reference frames. $I_k$ and $R_k$ are fed to the feature embedding module.

TABLE 2 (columns: Method | Methodology | Highlights | Limitations)
OGEMN [11] | Updating memory based on object features that contribute to detection. | Considering the importance of each feature. | Complex and many calculations involved.
(not recovered) | (not recovered) | Fast and simple strategy considering video redundancy. | Ignoring the importance of each feature.

TABLE 3 (columns: Method | Methodology | Highlights | Limitations)
TimeSformer [34] | Aggregating features in the current frame and ones in the element-wise temporal direction at the same position. | Consideration of long-term temporal information. | Vulnerable to object misalignment across frames.
(not recovered) | (not recovered) | Simple sparse aggregation. | Ignoring global spatiotemporal information in a video.

FIGURE 2: The architecture of the proposed Video Sparse Transformer with Attention-guided Memory (VSTAM). It receives the detection frame and the reference frames sampled sparsely in time. It then compresses the features from the backbone and converts them into a 1D sequence along with the ones in the external memory. The encoder spatiotemporally samples the elements sparsely, and the decoder outputs the enhanced features for detection. To update the external memory based on the importance of features, the weight of each element in the aggregation is accumulated for each frame. Finally, the features of the frames with the highest weights are kept in the external memory in order.
Our reference frames consist of nearby frames, sampled temporally densely, that compensate for short-term blur, and distant frames, sampled temporally sparsely, that are less affected by rare poses. Compared with the standard dense sampling [3] (Fig. 3a), our selection covers a broader temporal range with the same number of reference frames. We use "nearby frames" to refer to the last five frames and "distant frames" to refer to the frames farther in the past, since existing works [3], [12] that focus only on nearby frames apply weights only up to five frames.
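The construction $R_k = \{I_{k-n} \mid n \in S_{sparse}\}$ can be illustrated with a minimal sketch. The offset lists below are hypothetical values chosen only to contrast dense and sparse selection with the same $m$; they are not the definition of $S_{sparse}$ used in this work. The zero padding for frames that do not exist follows the training details reported later in the paper, and the function name is an assumption.

```python
import numpy as np

def select_reference_frames(frames, k, offsets):
    """Return the reference set R_k = {I_{k-n} : n in offsets}.

    frames  : array of shape (T, H0, W0, C0)
    k       : index of the current frame (0-based in this sketch)
    offsets : positive time differences n (illustrative stand-in for S_sparse)
    Frames that would fall before the start of the video are zero-padded.
    """
    refs = []
    for n in offsets:
        idx = k - n
        if idx >= 0:
            refs.append(frames[idx])
        else:
            refs.append(np.zeros_like(frames[0]))
    return np.stack(refs)

# Hypothetical offsets with m = 5 entries each (assumptions, not from the paper).
dense_offsets = [1, 2, 3, 4, 5]      # nearby-only selection
sparse_offsets = [1, 2, 3, 8, 16]    # nearby frames plus sparser distant frames

video = np.random.rand(20, 4, 4, 3)  # toy clip: 20 frames of 4x4 RGB
refs = select_reference_frames(video, k=10, offsets=sparse_offsets)
```

With the same number of reference frames, the sparse list reaches 16 frames into the past instead of 5, which is the trade-off the paragraph above describes.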

B. FEATURE EMBEDDING
FIGURE 3: The reference frame selection for feature aggregation from a video clip: (a) nearby-only frame selection; (b) nearby-distant frame selection. $m$ is the number of past frames used for aggregation. Dark and light orange indicate the current frame and the selected reference ones, respectively.

Given the selected frames $R_k$ and $I_k$, the feature embedding module extracts features $\{F_k\}$. We utilize a weight-shared ResNet [38] or ResNeXt [39]. Following DETR [16], we use a 1 × 1 convolution to reduce the channel dimension of the features $F_k \in \mathbb{R}^{H \times W \times C}$ from $C$ to a smaller dimension $d$, creating new features $\tilde{F}_k \in \mathbb{R}^{H \times W \times d}$. We then collapse the spatial dimensions of $\tilde{F}_k$ into one dimension, resulting in $HW \times d$ features. The external memory contains additional flattened features $\{\hat{E}_q\}_{q=1}^{p}$. The newly sampled features and the ones stored in the external memory are concatenated in timeline order. The number of frames is $L = p + m + 1$. In this way, we obtain a feature sequence $Z \in \mathbb{R}^{LHW \times d}$.
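The shape bookkeeping of the embedding step can be traced with a small NumPy sketch. The toy sizes, the random weights, and the assumption that memory frames precede the window frames in the timeline are all illustrative; a 1 × 1 convolution is modeled as a per-pixel linear map over channels.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, d = 6, 8, 32, 4      # toy sizes, chosen only for illustration
m, p = 5, 2                   # reference frames and external-memory slots
L = p + m + 1                 # total number of frames in the sequence

# Backbone features for the m reference frames and the current frame,
# assumed here to be ordered from oldest to newest.
feats = rng.standard_normal((m + 1, H, W, C))

# A 1x1 convolution is a per-pixel linear map over channels (C -> d).
w_reduce = rng.standard_normal((C, d))
compressed = feats @ w_reduce                  # shape (m+1, H, W, d)

# Collapse the spatial dimensions: each frame becomes an HW x d matrix.
flat = compressed.reshape(m + 1, H * W, d)

# The external memory already holds p flattened frames of shape HW x d.
memory = rng.standard_normal((p, H * W, d))

# Concatenate in timeline order (memory frames assumed older) to form Z.
Z = np.concatenate([memory, flat], axis=0).reshape(L * H * W, d)
```

The resulting `Z` has `L * H * W` rows, so the sequence length grows linearly with the number of frames, which is what makes dense attention over it expensive.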

C. VIDEO SPARSE TRANSFORMER
Based on the vanilla transformer [29], we develop the video sparse transformer (VST) so that it aggregates information from multiple frame features. The vanilla transformer exploits a self-attention mechanism to learn the dependencies among elements and gather information for an input sequence. Although a vanilla transformer considers all the elements, doing so for a video sequence is unnecessary because of the redundancy in the video (e.g., objects may appear at similar positions for a certain period in multiple frames). We thus follow recent work [32], [33] that makes self-attention sparse and samples elements more efficiently for VST.

1) Video sparse attention
To realize video sparse attention, the video sparse attention masking operation $M(\cdot)$ is applied to self-attention [29]. In the resulting uni-head sparse self-attention, $K$, $V$, and $Q$ are the key, value, and query, respectively, $l$ is the length of the input sequence, and $d_v$ and $d_k$ are the embedding dimensions of the value and key. The masking operation uses a sparse mask $M \in [0, 1]^{l \times l}$.

Video sparse attention is designed with the following considerations. First, to refer to all the elements in a frame, both spatially locally and globally, we introduce frame attention (Fig. 4a). It allows the self-attention to refer only to the elements within each frame, thus improving the features by considering their spatial context. To compensate for the lack of temporal information, we introduce two further sparse masks: random attention and position attention.
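For reference, the masked self-attention introduced at the start of this subsection is commonly written in the form below. This is a sketch under the assumption that $M(\cdot)$ suppresses the scores of masked-out query-key pairs before the softmax, with $Q, K \in \mathbb{R}^{l \times d_k}$ and $V \in \mathbb{R}^{l \times d_v}$; the exact formulation used in this work may differ.

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left( M\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) \right) V,
\qquad
M(A)_{ij} =
\begin{cases}
  A_{ij}, & M_{ij} = 1, \\
  -\infty, & M_{ij} = 0 .
\end{cases}
```

Under this convention, a masked-out pair receives zero weight after the softmax, so only the elements selected by the sparse mask contribute to the aggregation.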
Random attention (Fig. 4b) masks a certain percentage of the elements, allowing access to a wide range of features. Different from the original one [31] in NLP, we mask each frame with a random probability r instead of the entire sequence because a video is divided into frames.
Although random attention enables us to obtain information from multiple frames, it sometimes cannot aggregate features reliably when objects remain in a specific area over multiple frames. To reliably extract information from around the same location over multiple frames, we introduce position attention (Fig. 4c). It aggregates features from the corresponding location in the temporal direction. Because it considers only the same position in each frame, it is sensitive to object motion. Therefore, we apply a mask shaped like a 3 × 3 dilated convolution kernel (rate = 2) [40] to each element, giving position attention a wider field of view for robustness against object motion.
The combination of the frame, random, and position masks constitutes video sparse attention (Fig. 4d), which is applied to the self-attention of the transformer [29]. We exploit this sparse transformer for both the encoder and decoder.
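The three masks and their union can be sketched with dense boolean matrices. This is an illustrative NumPy sketch, not the authors' implementation: random attention is interpreted here as keeping each key with probability `r` independently within every frame, position attention as a 3 × 3 neighbourhood with dilation 2 around the same location in every frame, and all function names are assumptions. A practical implementation would avoid materializing the full `n × n` matrices.

```python
import numpy as np

def video_sparse_mask(L, H, W, r=0.1, dilation=2, seed=0):
    """Boolean (L*H*W, L*H*W) mask; True means the query may attend to the key."""
    rng = np.random.default_rng(seed)
    n = L * H * W
    # Elements are ordered frame by frame, row by row within a frame.
    frame = np.repeat(np.arange(L), H * W)        # frame index of each element
    ys = np.tile(np.repeat(np.arange(H), W), L)   # row of each element
    xs = np.tile(np.arange(W), L * H)             # column of each element

    # Frame attention: every element sees all elements of its own frame.
    frame_mask = frame[:, None] == frame[None, :]

    # Random attention: each key is kept with probability r for each query.
    random_mask = rng.random((n, n)) < r

    # Position attention: same location in any frame, widened to a 3x3
    # neighbourhood whose taps are `dilation` pixels apart.
    dy = np.abs(ys[:, None] - ys[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    position_mask = np.isin(dy, (0, dilation)) & np.isin(dx, (0, dilation))

    return frame_mask | random_mask | position_mask

def sparse_attention(Q, K, V, mask):
    """Single-head attention where masked-out pairs receive zero weight."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

L, H, W, d = 3, 4, 4, 8
mask = video_sparse_mask(L, H, W)
x = np.random.default_rng(1).standard_normal((L * H * W, d))
out, weights = sparse_attention(x, x, x, mask)
```

Because frame attention always includes the query's own frame, every row of the mask has at least one allowed key, so the softmax is well defined.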

2) Encoder and decoder
The encoder and decoder follow the original layered architecture of the transformer [29] except for its self-attention. We replace the standard self-attention with video sparse attention. Given the positioned and flattened feature sequence $Z$, we obtain the embedded sequence $\hat{Z} \in \mathbb{R}^{LHW \times d}$ via the encoder.
The function of the decoder is to generate the refined features. To decode at each element of the features, a query sequence $Q_k \in \mathbb{R}^{HW \times d}$ is required; it is obtained by flattening the embedded features $\tilde{F}_k$. Then, the decoder outputs the enriched feature sequence $\hat{Q}_k$ using the query sequence $Q_k$ and the embedded sequence $\hat{Z}$.

D. DETECTION
Thanks to the decoder, we have the refined sequence $\hat{Q}_k$. To utilize it for detection, we expand its spatial dimensions. It contains the spatiotemporal local and global information of the video sequence; however, it loses detailed information due to the compression by the 1 × 1 convolution in the feature embedding process. Therefore, we decompress $\hat{Q}_k$ with a 1 × 1 deconvolution in the channel direction to generate the features $\tilde{Q}_k \in \mathbb{R}^{H \times W \times C}$. Then, we merge the features $\tilde{Q}_k$ and $F_k$ by element-wise sum to acquire the final refined features for detection.
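The expansion and merge described above amount to a reshape, a channel-wise linear map, and a residual sum. The following NumPy sketch uses toy sizes and random weights as stand-ins; modeling the 1 × 1 deconvolution as a per-pixel linear map from $d$ to $C$ channels is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, d = 6, 8, 32, 4                 # toy sizes for illustration

F_k = rng.standard_normal((H, W, C))     # backbone features of the current frame
Q_hat = rng.standard_normal((H * W, d))  # refined sequence from the decoder

# Restore the spatial layout, then expand the channels d -> C.
w_expand = rng.standard_normal((d, C))
Q_spatial = Q_hat.reshape(H, W, d) @ w_expand   # shape (H, W, C)

# The element-wise sum re-injects the detail lost by the channel compression.
refined = Q_spatial + F_k
```

The sum acts like a residual connection: the detector still sees the original backbone features, with the aggregated spatiotemporal context added on top.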

E. EXTERNAL MEMORY
To adaptively store the features of the most vital frames in the external memory, we select them based on the importance of each frame. We measure importance by the attention weights, which are already computed when VST aggregates each element. The element-level attention weights are accumulated for each frame to measure its importance. This is the "attention ranking" shown in Fig. 2, and we keep the top $p$ frame features as $\{\hat{E}_q\}$ in the external memory, arranged in order of cumulative attention weight. Two types of feature candidates can be stored in the external memory. One is the features already kept in the external memory for the current frame. The other is the features of the distant frames newly loaded in the sliding window. The former are from distant frames deemed vital and stored at the time of the previous detection. The latter are from newly sampled distant frames and are stored at the time of the current detection. This design handles the issue that features that were critical in the past are not always valid after video scene changes. The distant frames are defined in Section III-A. The role of the external memory is to hold long-term features and deal with scenes that are difficult to detect using only neighboring frames, so adjacent frames are not stored.
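The attention ranking can be sketched as follows. This is a minimal illustration under stated assumptions: the decoder's attention weights are available as a dense matrix, each key element carries the identifier of the frame it came from, and the function name and toy numbers are hypothetical.

```python
import numpy as np

def update_external_memory(attn, frame_ids, candidate_frames, p):
    """Pick the p candidate frames that received the most attention.

    attn             : (num_queries, num_keys) attention weights from the decoder
    frame_ids        : (num_keys,) frame identifier of each key element
    candidate_frames : frames allowed into memory (current memory entries and
                       distant frames in the window; nearby frames excluded)
    p                : memory capacity in frames
    """
    received = attn.sum(axis=0)   # total weight each key element received
    scores = {f: received[frame_ids == f].sum() for f in candidate_frames}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:p]

# Toy example: four frames with two elements each; frame 13 is a nearby frame
# and is therefore not a candidate for the memory.
frame_ids = np.repeat(np.array([3, 7, 12, 13]), 2)
attn = np.array([
    [0.05, 0.05, 0.30, 0.20, 0.15, 0.15, 0.05, 0.05],
    [0.10, 0.05, 0.25, 0.25, 0.10, 0.10, 0.10, 0.05],
])
kept = update_external_memory(attn, frame_ids, candidate_frames=[3, 7, 12], p=2)
```

Since the weights are a by-product of aggregation, this ranking adds only a sum and a sort per frame, which is consistent with the low overhead the section describes.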

A. DATASETS AND METRICS
We utilized the two datasets shown in Table 4 for our validation.

1) ImageNet VID
ImageNet VID [13] is a large-scale benchmark for video object detection. It is a challenging dataset because it offers a wide variety of appearance changes of objects over time. It has 30 categories and contains 3,862 training and 555 validation videos with frame rates of 25 and 30 fps. We evaluate our method on the validation set and use the mean average precision (mAP), following the widely adopted protocol in [3].
2) UA-DETRAC
UA-DETRAC [14] is a large-scale benchmark for real-world traffic scenes. It contains 60 videos in the training set and 40 videos in the test set. The videos are recorded at 25 fps, with a resolution of 960×540 pixels. We validate our method on the test set and use the average precision (AP) at an IoU threshold of 0.7 as the evaluation metric for precise localization.
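To make the effect of the 0.7 threshold concrete, the sketch below computes the IoU of two axis-aligned boxes and checks whether a prediction would count as a match. The box format `(x1, y1, x2, y2)` and the helper names are assumptions for illustration; the full AP computation (ranking by score, precision-recall integration) is not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_match(pred, gt, threshold=0.7):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return iou(pred, gt) >= threshold

gt = (0.0, 0.0, 100.0, 100.0)
close_pred = (10.0, 0.0, 110.0, 100.0)   # shifted by 10% of the box width
loose_pred = (30.0, 0.0, 130.0, 100.0)   # shifted by 30% of the box width
```

In this toy case the box shifted by 30% would still pass a 0.5 threshold but fails at 0.7, which is why the stricter threshold rewards precise localization.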

B. NETWORK ARCHITECTURE
1) Feature extractor
We utilize ImageNet-pretrained ResNet-50 and ResNet-101 [38] for detailed analysis and performance comparison, respectively. Following a common practice [3], [6], we enlarge the resolution of the features by changing the stride of the first convolution block in the last stage, conv5, from 2 to 1. We also set the dilation of these convolutional layers to 2 to retain the receptive field size.

3) VSTAM
We use the frame selection of Fig. 3b at both the training and inference stages, with a temporal window size of m = 5. Unless otherwise stated, we conducted our experiments with the external memory size set to p = 2. Therefore, we utilize L = 8 frames. In the embedding process, we set the compressed dimension d of the features to 128. In VST, we utilize multi-head attention with h = 8 heads. The encoder and decoder each have 4 layers. We use a sinusoidal positional encoding. For the sparse attention, we set the random ratio to r = 10% at each frame.
In DET, we select the same 30 classes as in the VID dataset and follow the data augmentation strategy proposed in [4]. For UA-DETRAC, we utilize only its training dataset. The input images are resized so that their shorter side is 600 pixels. If a frame specified by the frame selection does not exist, zero padding is performed. In addition to the frames for the fixed window, p frames are randomly sampled from the video clip for the external memory. Each GPU holds two mini-batches, and each mini-batch contains one set of images or frames. The model with a vanilla transformer was trained with one mini-batch due to memory constraints, and its hyperparameters were adjusted according to the batch size [44]. We employ a base learning schedule, denoted 1×, of 13 epochs with learning rate decay, dividing the rate by ten at epochs 9 and 12. We utilize the 1× learning schedule for component analysis and the 3× one for comparison with competitive models, based on the observation in [45]. The initial learning rate is set to $10^{-4}$. At inference, NMS with an IoU threshold of 0.5 is adopted to suppress duplicate detection boxes.
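The 1× schedule described above is a step schedule, which can be sketched in a few lines. The 1-based epoch numbering and the assumption that the decay takes effect from the start of epochs 9 and 12 are choices made for this sketch; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(9, 12), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone epoch.

    epoch is 1-based here (an assumption); the 1x schedule runs epochs 1..13.
    """
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops

# Learning rate for every epoch of the 1x schedule.
schedule = [learning_rate(e) for e in range(1, 14)]
```

Under these assumptions, the rate is $10^{-4}$ for epochs 1 to 8, $10^{-5}$ for epochs 9 to 11, and $10^{-6}$ for epochs 12 and 13.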

D. COMPARISON WITH THE STATE-OF-THE-ARTS
1) Comparison on ImageNet VID
We employ ResNet-101 [38] and ResNeXt-101 [39] as feature extractors for a fair comparison. We compare the models separately since the accuracy differs among offline, online, and post-processing cases. For comparison, we show the offline setting results for the case where five frames extend the sliding window into the future (L = 13). Table 5 shows the comparison with state-of-the-art methods under online, offline, and post-processing conditions. Among all the methods, VSTAM achieves the best performance on all backbones and conditions. With the ResNet-101 backbone, our online model achieves 85.7% mAP, a 1.1% absolute improvement over the recent and most powerful competitor, MAMBA [28], which utilizes external memory. Compared with FGFA [3], MANet [6], and STSN [43], which aggregate element-wise features from nearby frames, VSTAM is better by more than 8 points. The gap between VSTAM and these methods comes from feature aggregation with global spatiotemporal context. VSTAM also outperforms methods [4], [5], [26] that utilize object-wise feature aggregation. The object-wise approaches provide effective improvement; however, they cannot refine features unless object candidate regions are detected. Our method considers element-wise features from distant and nearby frames before the region proposal network, leading to the best performance on ImageNet VID. Figure 5 shows the detection results of Faster R-CNN, FGFA, MEGA, and VSTAM. We see that VSTAM improves detection even in severely degraded scenes.
By replacing the backbone from ResNet-101 to ResNeXt-101, our model achieves a better performance of 87.0% mAP, as expected. In an offline setting, our model achieves an accuracy of 87.6%. Moreover, we applied post-processing to the offline model, as many offline methods do. For the post-processing method, we adopt Seq-NMS [20], which refines the scores of weaker detections using nearby frames. Our method still performs the best, obtaining 86.4% and 88.1% mAP with the ResNet-101 and ResNeXt-101 backbones, respectively. Furthermore, TransVOD++ [37], which extends Deformable DETR [36] spatio-temporally, employs a strong backbone [51] and detector [36]. For a fair comparison, we replace our backbone and detector with them; our model reaches 91.1% mAP in the offline setting, 1.1 points higher than TransVOD++. We see that element-wise feature refinement and attention-guided external memory are essential to detect objects stably. Table 6 shows an accuracy and speed comparison on the same architecture and GPUs. We can see that VSTAM is superior in accuracy while being faster than most methods. Next, we replaced the detector with R-FCN [50] to compare the performance of the methods with external memory under the same conditions and GPU. As shown in Table 7, we confirm that VSTAM is superior in both accuracy and speed. OGEMN and MAMBA have two-step frame-wise and object-wise aggregation based on complex update and delete rules, requiring more processing time. On the other hand, VSTAM deals with features only element-wise and reduces run-time through a simple rule that holds the features most used in the enhancement.

2) Comparison on UA-DETRAC
The detection results on the UA-DETRAC dataset are reported in Table 8. YOLOv3-SPP, MSVD_SPP, and SpotNet are still image detectors, and they propose to improve accuracy by detector modification and introducing spatial attention mechanisms. In contrast, TFEN and FFAVOD-SpotNet utilize temporal information to refine features. We remark that the methods on UA-DETRAC cannot be fairly compared because they use different feature extractors and detectors.
Our online model with ResNet-101 achieves 90.39% AP, a 2.29% absolute improvement over FFAVOD-SpotNet [55], which employs a strong base detector [2] and a feature refinement module to fuse multiple nearby frame features channel-wise in offline settings.

FIGURE 5: Visualized comparison against the state-of-the-art methods on ImageNet VID. We show the detection results of our method against Faster R-CNN [1], FGFA [3], and MEGA [5]. FGFA refines features considering nearby frames, and MEGA incorporates distant frames with object candidate detection. All models use ResNet-101 [38] as the backbone and Faster R-CNN [1] as the detector. We show randomly sampled frames from a video clip. Our detection suppresses false-negative and false-positive detections.

E. ERROR ANALYSIS
Using TIDE [56], we analyze what types of errors in VOD are resolved. TIDE classifies object detection errors into misclassification, incorrect localization, duplicate detection, false-positive detection, and miss. We performed the error analysis on the ImageNet VID dataset to use the officially published model [5]. Figure 6 shows the error results of Faster R-CNN [1], MEGA [5], and VSTAM, where the horizontal axis shows the error items and the vertical axis shows the amount of error accumulation proposed in TIDE [56]. We see that Faster R-CNN produced many "Cls", "Bkg", and "Miss" detection errors. This is due to the appearance changes of objects over time in videos. In contrast, VSTAM significantly reduces the amount of errors, especially for "Bkg" and "Miss", compared to MEGA. In addition, our element-wise aggregation also reduces the class error. We see that it is crucial to enhance the features before object candidate detection for VOD.

FIGURE 6: Visualized results of error analysis on ImageNet VID by TIDE [56]. Each bar indicates the amount of errors accumulated in each category. "Cls" means that the model detected the object but misclassified it into another class. "Loc" means that the model detected the object with poor localization. "Both" means that both "Cls" and "Loc" occurred. "Dupe" represents duplicated detection of an object. "Bkg" means a false positive, while "Miss" means that the model does not detect the object even though an object exists there. Compared to MEGA, an object-wise feature enhancement method, the proposed method significantly suppresses "Bkg" and "Miss". Best viewed digitally and in color.

[Figure caption fragment] We see that VSTAM detects targets robustly in the occluded scene.

F. COMPONENT ANALYSIS
To evaluate the effectiveness of each component in VSTAM, we conduct ablation studies with ResNet-50 and the 1× learning schedule (i.e., 13 epochs). Table 9 lists the ablation result of our module variants.

1) Investigation of VSTAM
To confirm the effectiveness of VSTAM, we use Faster R-CNN [1] as the baseline. First, we can see that the introduction of VST brings significant gains on both datasets. We thus conclude that element-wise aggregation is effective as feature aggregation for video. Next, we confirm that introducing the attention-guided external memory improves accuracy. Accordingly, all the components are essential for VSTAM. Figure 7 compares the detection results of the baseline [1] and VSTAM on UA-DETRAC. We confirm that VSTAM improves the detection results.
To check the effectiveness of sparse sampling in VSTAM, VST was replaced with a vanilla transformer. VST is about 1.2 points more accurate than the vanilla transformer on both datasets. Indeed, we confirm that for video sequences, properly performing the sparse sampling achieves higher accuracy. Furthermore, VSTAM processes one frame in 52 ms, while the run-time with a vanilla transformer is 342 ms. VSTAM with VST and VSTAM with the vanilla transformer consume 2.1 GiB and 7.2 GiB of memory per frame during inference, respectively. In both cases, Faster R-CNN requires an additional 5.1 GiB. Thanks to our sparse sampling, VST therefore runs about 6.58 times as fast as the vanilla transformer (658% of its throughput) and reduces memory consumption by 70.8%.
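The speed and memory figures follow directly from the measurements quoted above. A short check, with helper names chosen here for illustration:

```python
def speedup(baseline_ms, new_ms):
    """How many times faster the new method is than the baseline."""
    return baseline_ms / new_ms


def reduction_pct(baseline, new):
    """Relative reduction of a cost, in percent."""
    return 100.0 * (1.0 - new / baseline)


# Per-frame run-time: vanilla transformer 342 ms, VST 52 ms.
print(f"speedup: {speedup(342, 52):.2f}x")                     # 6.58x, i.e. about 658%
# Per-frame memory: vanilla transformer 7.2 GiB, VST 2.1 GiB.
print(f"memory reduction: {reduction_pct(7.2, 2.1):.1f}%")     # 70.8%
```

Both values exclude the 5.1 GiB used by Faster R-CNN, which is the same in the two configurations.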

2) Effect of feature refinement to RPN
The object-wise feature refinement methods evaluated on ImageNet VID employ features from RPN to deal with object misalignment among frames [4], [5], [10]. These methods rely on RPN detection, so their performance degrades heavily when RPN performance is poor. In contrast, our method refines the features before RPN. We evaluate how VSTAM affects the RPN in terms of Average Recall (AR). We select the top k = 5, 10, 100 proposals generated by RPN and calculate AR_k. Table 10 shows the difference in RPN recall between the baseline model and the model with VSTAM. We can see that all the metrics are improved by the proposed method, confirming the effectiveness of the feature refinement before RPN.
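As an illustration of the metric, the following sketch computes recall over the top-k proposals for a single image. The IoU threshold of 0.5, the box format (x1, y1, x2, y2), and the function names are assumptions, since the evaluation protocol is not detailed here. AR would additionally average this value over images and, depending on the protocol, over several IoU thresholds.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def recall_at_k(proposals, gt_boxes, k, iou_thr=0.5):
    """Fraction of ground-truth boxes covered by the k highest-scoring proposals.

    proposals: list of (score, box) pairs produced by the RPN
    gt_boxes:  list of ground-truth boxes
    """
    if not gt_boxes:
        return 0.0
    top_k = sorted(proposals, key=lambda p: p[0], reverse=True)[:k]
    covered = sum(
        any(iou(gt, box) >= iou_thr for _, box in top_k) for gt in gt_boxes
    )
    return covered / len(gt_boxes)


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.3, (20, 20, 30, 30))]
for k in (1, 2, 3):
    print(k, recall_at_k(props, gts, k))
```

In this toy case the second object is only recovered at k = 3 because its proposal has a low score. Better features before the RPN would be expected to raise such scores and thus the recall at small k.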

3) Investigation of Video Sparse Transformer
We investigate VST from three aspects: sparse aggregation, video sparse attention, and the random attention ratio.
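As a rough picture of what is being ablated, the sketch below builds a boolean attention mask from three sparse patterns. The definitions are assumptions made for illustration only: "frame attention" is taken to connect elements within the same frame, "position attention" to connect elements at the same spatial position across frames, and "random attention" to add each remaining connection with probability equal to the ratio r. The function name and token layout are likewise hypothetical.

```python
import random


def build_sparse_mask(num_frames, tokens_per_frame, random_ratio, seed=0):
    """Return mask[q][k] = True where query token q may attend to key token k.

    Tokens are indexed as frame * tokens_per_frame + position (assumed layout).
    """
    rng = random.Random(seed)
    total = num_frames * tokens_per_frame
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        q_frame, q_pos = divmod(q, tokens_per_frame)
        for k in range(total):
            k_frame, k_pos = divmod(k, tokens_per_frame)
            same_frame = q_frame == k_frame        # assumed "frame attention"
            same_position = q_pos == k_pos         # assumed "position attention"
            random_link = rng.random() < random_ratio  # "random attention"
            mask[q][k] = same_frame or same_position or random_link
    return mask


def density(mask):
    """Fraction of query-key pairs that are attended."""
    return sum(map(sum, mask)) / (len(mask) * len(mask[0]))


for r in (0.0, 0.1, 1.0):
    print(r, round(density(build_sparse_mask(3, 4, r)), 2))
```

With a ratio of 0 only the frame and position patterns remain, and with a ratio of 1 the mask is fully dense, which matches the two extremes of the random attention ratio described in this subsection.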
Effect of sparse frame selection on aggregation modules: We examine the effect of element aggregation across different frame selections. VSTAM aggregates information from a wide range of frame spans. We investigate how the frame selection affects element-level aggregation. In this experiment, we exclude external memory. We also compare our results with the sparse attention-based aggregation methods SSTVOS [35], Big Bird [31], and TimeSformer [34]. Note that since there is no official implementation of SSTVOS, we reproduced it and obtained a result 0.1 points higher than the score reported in the paper. We also note that the percentage of random attention in Big Bird is set to 10% for a fair comparison. Table 11 shows the performance with the two types of frame selection, where "Nearby-only" and "Nearby-Distant" represent dense sampling (Fig. 3a) and sparse sampling (Fig. 3b), respectively. Although all methods improve accuracy over the baseline, SSTVOS and TimeSformer lose accuracy when exploiting far frames. Big Bird does not handle sequential information well when adapted to video, resulting in lower scores. This is probably because it was proposed for NLP tasks. VST, on the contrary, improves accuracy by utilizing distant frames rather than only nearby ones. Figure 8 shows the difference in detection results depending on the frame selection of our method. We see that the sampling method that includes distant frames is robust to appearance changes over time, such as motion blur.
FIGURE 8: ... Table 11. We show the detection results for four consecutive frames in which motion blur occurs. The correct object labels on the first video sequence are "fox" and "domestic_cat". The second video sequence contains "squirrel" and "domestic_cat". We confirm that the sparse frame selection stabilizes detection.
Investigation of sparse attention: Table 12 shows the accuracy impact of each sparse attention method between the baseline and VST. Using only the frame attention leads to a significant accuracy decrease. Accuracy using only the random attention or the position attention is insufficient. We see that combining the frame, random, and position attentions improves accuracy, meaning that each is necessary.
Effect of random attention: We investigate the impact of the ratio of random attention. Table 13 shows the performances under different ratios (r%) of random attention. r = 0% is identical to using only frame attention and position attention in Table 12, while r = 100% is identical to using a vanilla transformer. We see that the accuracy improves as r increases from 5% to 10%, but it gradually decreases from 15% to 100%. This indicates that introducing random attention is effective for feature aggregation, but relying on it too heavily degrades accuracy.
Table 14 summarizes the accuracy effects of changing the number of additional frames p. To evaluate the effect of adaptively selected features, we also increased the sliding window width m for comparison. Although the extended sliding window improves the accuracy, the gain of the external memory is more prominent. The gain from additional frames in the external memory saturates after about two frames, beyond which no significant gain is obtained. Figure 9 shows the difference in detection results between the attention-guided external memory and the extended sliding window, each with two additional frames. We confirmed that the adaptive external memory updates prevent class errors and false-negative detections. By default, the update candidates of the external memory are only distant frames. We confirm that when nearby frames are included in the update candidates, the accuracy decreases compared with using only distant frames. It is therefore better to use only distant frames as candidates in order to exploit long-term information. We also show the case where the update frames are randomly selected to see whether the attention-guided update rule is valid. We see that the attention-guided updates are more accurate than random ones because they retain important frames adaptively. Figure 10 shows how past features are stored in the external memory for a given video. The memory sometimes stores frames that are more than 50 frames away from the target frame. It is difficult to hold such a distant frame in a sliding window, which confirms the importance of adaptive updates. Besides, the external memory tends to store the frames that capture objects clearly.
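The update behavior discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the per-frame "usage" score (taken here as the total attention weight a frame received during aggregation), and the window radius are all hypothetical.

```python
def distant_candidates(reference_frames, target_frame, window_radius):
    """Keep only reference frames outside the nearby window of the target."""
    return [f for f in reference_frames if abs(f - target_frame) > window_radius]


def update_memory(memory, candidates, usage, capacity):
    """Retain the `capacity` frames that the aggregation used the most.

    memory:     frame indices currently held in the external memory
    candidates: frame indices eligible to enter the memory
    usage:      dict mapping frame index -> accumulated attention weight
    """
    pool = set(memory) | set(candidates)
    # Highest usage first; ties are broken by frame index for determinism.
    ranked = sorted(pool, key=lambda f: (-usage.get(f, 0.0), f))
    return ranked[:capacity]


candidates = distant_candidates([40, 70, 98, 99], target_frame=100, window_radius=5)
memory = update_memory([12], candidates, {12: 0.05, 40: 0.30, 70: 0.10}, capacity=2)
print(candidates, memory)
```

Because retention depends on how much a frame was used rather than on its distance, a heavily used frame can remain in memory long after it would have left a fixed-width sliding window.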

V. APPLICATION TO VIDEO INSTANCE SEGMENTATION
To validate the versatility of the proposed method, we applied our method to video instance segmentation (VIS), a combination of object detection, instance segmentation, and object tracking across frames. We evaluated our method on the YouTube-VIS dataset [57]. YouTube-VIS contains 40 object categories and consists of 2,238 training videos, 302 validation videos, and 343 test videos. The training is conducted on the training videos. Since the evaluation on the test set is currently closed, the evaluation is performed on the validation set.
To apply the proposed method to VIS, we replaced Faster R-CNN with MaskTrack R-CNN [57], an extension of Mask R-CNN [58] with a tracking branch to link the same object instances across two frames.
We compared our method with TF-Blender [49] and TROI [48], which refine features to improve accuracy. The results with ResNet-50 are shown in Table 15. Our proposed method outperforms them on all evaluation metrics. With our proposed method, MaskTrack R-CNN is improved by more than 8.7% on the AP metric. TF-Blender utilizes nearby frames, but it aggregates frame-wise features, and the gain is limited. This is because instance segmentation requires more precise feature refinement. TROI proposes a temporal ROI alignment to extract ROI features from other frames based on their similarity; however, it is not sufficient for hard-to-detect scenes because the refinement is applied to object-level features. On the contrary, our approach is based on element-wise aggregation before object candidate detection, allowing us to improve the representation more precisely. Figure 11 shows VIS results of the baseline (MaskTrack R-CNN [57]) and the proposed method on example frames in the validation set. We see that by refining element-wise features with the temporal information, false negatives are reduced, and masks are stabilized.

VI. CONCLUSION
We introduced a novel framework, VSTAM, for video object detection. It refines features element-wise and spatiotemporally, considering object misalignment before detection. The proposed video sparse transformer in VSTAM aggregates features sparsely and thereby considerably reduces time and memory cost. Moreover, we demonstrated significant accuracy improvements by storing the most utilized frame features during the aggregation in external memory. Extensive evaluations also demonstrated that it outperforms SOTAs on publicly available datasets.
FIGURE 9: Visualized comparison of detection results between the attention-guided adaptive reference frame and static extended sliding window frame on ImageNet VID. We used the models in Table 14. We show the detection results for four successive frames in which motion blur occurs. The numbers below each sequence indicate the distance from the current frame to the additional reference frame. The proposed method reduces class errors by adaptively updating the external memory.
The detailed error analysis reveals that our method significantly reduces background false-positive and false-negative detections. To achieve more stable detection, reducing class misclassification is necessary. We plan to incorporate tracking into feature refinement so that the same object continues to be detected as the same class over time.
Results are plotted if their confidence score is larger than 0.45. We confirm that the proposed method suppresses false negative detections. Best viewed digitally and in color.