Local Memory Read-and-Comparator for Video Object Segmentation

Recently, the memory-based approach, which performs non-local matching between previously segmented frames and a query frame, has led to significant improvements in video object segmentation. However, the positional proximity of the target objects between the query and the local memory (previous frame), i.e. temporal smoothness, is often neglected. Some attempts have been made to address this problem, but they are sensitive to large movements of target objects. In this paper, we propose local memory read-and-compare operations to address the problem. First, we propose local memory read and sequential local memory read modules to exploit the temporal smoothness between neighboring frames. Second, we propose the memory comparator to read the global memory and the local memory adaptively by comparing the affinities of the global and local memories. Experimental results demonstrate that the proposed algorithm yields more accurate segmentation results than recent state-of-the-art algorithms. For example, the proposed algorithm improves video object segmentation performance by 0.4% and 0.5% in terms of $\mathcal{J}\&\mathcal{F}$ on the most commonly used datasets, DAVIS2016 and DAVIS2017, respectively.


I. INTRODUCTION
Video object segmentation (VOS) aims at cutting out objects of interest from the background in a video. It is a fundamental task for many computer vision techniques, including video editing and video summarization. It also plays an essential role in facilitating real-world applications such as autonomous driving and augmented reality [1]. Object deformation, occlusion, and appearance change are challenging problems [2]. To overcome these issues, semi-supervised VOS, which uses a completely annotated mask at the first frame of a video to segment out the target object, has been widely studied. Recently, much semi-supervised VOS research has been conducted with the development of deep neural networks.
Representatively, the space-time memory network [3] and its follow-up works [4], [5], [6], [7] proposed memory-based VOS algorithms and achieved outstanding performance and efficiency. By assigning a number of previously predicted frames as memory, they predict segmentation results for the query frame through the readout process, as shown in Figure 1(a). First, they construct affinities between the memory and the query frame to conduct the readout process. The affinities then transfer the encoded features of the memory to the query frame for reliable prediction.
However, since many memory-read processes are performed in a non-local manner [8], they overlook the property that target objects have spatiotemporal smoothness across the video. In general, object movements between neighboring frames are confined. In this regard, recent studies [9], [10] attempted to handle this continuity by performing local matching within a specific search range, but they are vulnerable to fast or large movements beyond the corresponding search range. In contrast, as illustrated in Figure 1(b), the proposed method adaptively reads out the global and local memory to address this problem: it adaptively reads the local information to consider the contiguity of target objects across adjacent frames.
In this paper, we propose a robust approach to VOS based on the local memory read-and-comparator. First, we propose a local memory read (LMR) and a sequential local memory read (SLMR) to transfer segmentation information to neighboring frames in a hierarchical manner. Next, we design a memory comparator to read the global memory and the local memory adaptively according to the affinity between the memory frames and the query frame. Experimental results demonstrate that the proposed local memory read-and-comparator is effective and outperforms state-of-the-art VOS algorithms. The main contributions of this paper are as follows:
• Effective local memory read operators to deal with the spatial contiguity between adjacent frames.
• Memory comparator to selectively use the local memory and global memory.
The rest of this paper is organized as follows. Section II reviews related work on the three main approaches to video object segmentation: the unsupervised, interactive, and semi-supervised settings. Section III describes the proposed algorithm. Section IV compares the proposed algorithm with the state-of-the-art VOS algorithms and analyzes the proposed operators quantitatively and qualitatively. Finally, Section V concludes the paper.

B. INTERACTIVE VOS
Interactive VOS aims to refine segmentation results with repeated user inputs, such as points, scribbles, or bounding boxes. A round-based interactive VOS process [43], which iterates each round of interaction until the user is satisfied, is adopted in many recent interactive VOS algorithms [7], [44], [45], [46], [47], [48]. Cheng et al. [48] proposed the difference-aware fusion to fuse the results of the previous round and the current round with learnable parameters. Heo et al. [47] introduced a guided interactive VOS system based on a reliability attention module for the annotated frame.

C. SEMI-SUPERVISED VOS
Semi-supervised VOS is a task to predict target objects in a video using an accurately and densely annotated mask at the first frame. Early works used superpixels [49] or random walkers [50]. With the development of deep convolutional neural networks, VOS methods have focused on online and offline learning. Table 1 lists a summary of CNN-based semi-supervised VOS algorithms. Online learning VOS methods [11], [12], [13], [14], [15] fine-tune pretrained networks with the first-frame annotation of the video. Therefore, they inevitably consume additional time during inference to train the network for each video.
On the other hand, in order to eliminate the time-consuming process in online learning, offline learning algorithms have been studied based on propagation, detection, and matching. Specifically, propagation-based algorithms [16], [17], [18], [19], [20] propagate predicted masks in the previous frame to the query frame to carry out VOS. For example, AGSS [18] generated attention with the previous frame and its prediction to guide the query frame.
Matching-based algorithms [21], [22], [23], [24], [25], [26] perform pixel-wise feature matching between the query frame and other frames. For example, PML [21] classified the encoded feature of each pixel in the query frame into foreground or background based on the feature distance between the annotated frame and the query frame. Also, [23], [24], [26] measured the pixel-wise feature distance at the query frame with the previous frame as well as the annotated frame. LLGC [25] used more unlabeled frames for matching to improve robustness with a graph-based learning algorithm.
Recently, Oh et al. [3] introduced the space-time memory network (STM), which transfers features from the memory to the query frame. STM encodes several past predictions into the memory and employs non-local matching [8] to transfer target object features in the memory to the query frame. As variants of STM, many memory-based algorithms [5], [6], [7], [9], [10], [27], [28], [30] have achieved impressive performance on semi-supervised VOS. DMN [6] generated object templates and employed a dynamic memory network to align positional changes of target objects. STCN [7] replaced the dot product in the matching operation with the L2 distance. In addition, some memory-based methods [5], [9], [10], [30] considered the temporal smoothness between the previous frame and the query frame. RMNet [30] used motion between neighboring frames to limit the matching regions at the query frame. LCM [5] and AOT [10] adopted the relative positional encoding [51] with sine and cosine functions. HMMN [9] and AOT [10] transferred temporally adjacent object features by computing similarities between the query and previous frames within a local region for each pixel.

III. PROPOSED ALGORITHM
We segment out the objects of interest in a video frame by frame, starting from the complete annotation at the first frame. To this end, we develop the local memory read-and-compare algorithm. Figure 2 illustrates the overview of the proposed VOS algorithm. To predict the segmentation result at the query frame, we transfer the values of the previously segmented memory frames. Given $T$ memory frames (the global memory) and the previous frame (the local memory), we first apply the global memory read (GMR) operation to transfer the global memory.
Then, the proposed network propagates the value features of the target objects at the previous frame using the proposed LMR and SLMR operations at various resolutions. We also design the memory comparator to employ the propagated features adaptively according to the reliability of the LMR and SLMR operations.

A. FEATURE EXTRACTION
1) QUERY FEATURE
We extract a key feature from the query $Q$ using the key encoder in [7]. The key encoder takes an image as input and yields a key feature through ResNet50 [52] and a 3 × 3 convolution layer. Specifically, from 'res2', 'res3', and 'res4' in ResNet50, multi-scale frame features $F^Q_r \in \mathbb{R}^{H_r W_r \times C^f_r}$ are obtained, where $r \in \{2, 3, 4\}$ and $C^f_r$ denote the feature stage with $1/2^r$ resolution of the input image and the number of channels at $r$, respectively. Then, for each feature stage $r$, $F^Q_r$ is fed into the 3 × 3 convolution layer to obtain a key feature $K^Q_r$. In this way, multi-scale query key features $\{K^Q_r\}_{r=2}^{4}$ are extracted from the query.
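For concreteness, the following PyTorch sketch illustrates this multi-scale key extraction; the module layout and names are illustrative rather than the exact implementation in [7].

```python
import torch.nn as nn
from torchvision.models import resnet50

class KeyEncoder(nn.Module):
    """Illustrative multi-scale key encoder in the spirit of [7]:
    ResNet-50 stages res2/res3/res4 followed by 3x3 key projections."""
    def __init__(self, c_key=64):
        super().__init__()
        resnet = resnet50(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.res2, self.res3, self.res4 = resnet.layer1, resnet.layer2, resnet.layer3
        # One 3x3 projection per stage; C^f_r = 256/512/1024 as stated in Sec. III-E.
        self.key_proj = nn.ModuleDict({
            "2": nn.Conv2d(256, c_key, 3, padding=1),
            "3": nn.Conv2d(512, c_key, 3, padding=1),
            "4": nn.Conv2d(1024, c_key, 3, padding=1),
        })

    def forward(self, image):
        f2 = self.res2(self.stem(image))   # 1/4 resolution (r = 2)
        f3 = self.res3(f2)                 # 1/8 resolution (r = 3)
        f4 = self.res4(f3)                 # 1/16 resolution (r = 4)
        feats = {"2": f2, "3": f3, "4": f4}
        keys = {r: self.key_proj[r](f) for r, f in feats.items()}
        return feats, keys                 # {F^Q_r}, {K^Q_r}
```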

2) MEMORY FEATURE
Given the global memory $G$ and the local memory $L$, we extract value features as well as key features. The key features for the local memory are extracted in the same manner as the query key features. For the local memory $L$, multi-scale frame features $\{F^L_r\}_{r=2}^{4}$ and key features $\{K^L_r\}_{r=2}^{4}$ are obtained from the key encoder. For the value features, we encode the image and the object mask jointly using ResNet18, and the encoded feature is concatenated with $F^L_r$ for each feature stage $r$. Finally, through a 3 × 3 convolution layer, a value feature $V^L_r \in \mathbb{R}^{H_r W_r \times C^v_r}$ for the local memory is obtained for each $r$. On the other hand, we extract only single-scale key and value features for the global memory at feature stage 4. Every frame in the global memory is independently embedded into key and value features, which are then stacked along the temporal dimension to obtain the global memory key and value [7].
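A minimal sketch of the value-feature path follows, assuming a 4-channel input (RGB image plus a single-object mask) and the standard ResNet-18 stage widths; the exact fusion layout is an assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ValueEncoder(nn.Module):
    """Illustrative value encoder: ResNet-18 on the image/mask pair, fused with
    the frame feature F^L_r of each stage by a 3x3 convolution."""
    def __init__(self, c_val=(64, 128, 256), c_frame=(256, 512, 1024)):
        super().__init__()
        resnet = resnet18(weights=None)
        # Accept 4 input channels: RGB image + single-object mask (assumed).
        resnet.conv1 = nn.Conv2d(4, 64, 7, stride=2, padding=3, bias=False)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.stages = nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3])
        # ResNet-18 stage widths are 64/128/256.
        self.fuse = nn.ModuleList([
            nn.Conv2d(c_m + c_f, c_v, 3, padding=1)
            for c_m, c_f, c_v in zip((64, 128, 256), c_frame, c_val)
        ])

    def forward(self, image, mask, frame_feats):
        x = self.stem(torch.cat([image, mask], dim=1))
        values = []
        for stage, fuse, f in zip(self.stages, self.fuse, frame_feats):
            x = stage(x)
            values.append(fuse(torch.cat([x, f], dim=1)))  # V^L_r per stage r
        return values
```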

B. MEMORY READ OPERATOR
We employ three memory read operators, GMR, LMR, and SLMR, to predict the segmentation at the query frame from the global and local memories.

1) GLOBAL MEMORY READ (GMR)
GMR plays the equivalent role to the space-time memory read operation [3]. Given $T$ memory frames, we obtain the value feature $V^G_4 \in \mathbb{R}^{T H_4 W_4 \times C^v_4}$ and the key feature $K^G_4 \in \mathbb{R}^{T H_4 W_4 \times C^k_4}$ for the global memory. GMR is designed to transfer the value feature $V^G_4$ based on the affinity between the global key $K^G_4$ and the query key $K^Q_4$. To this end, we first compute the global similarity matrix $S^G_4$ using the negative L2 distance employed in [7],

$$S^G_4(i, j) = -\| k^Q_{4,i} - k^G_{4,j} \|_2^2, \qquad (1)$$

where $k^Q_{4,i}$ and $k^G_{4,j}$ are the feature vectors at the $i$th position in $K^Q_4$ and the $j$th position in $K^G_4$, respectively. Then, $S^G_4$ is normalized to obtain a global affinity matrix $W^G_4$, which is defined as

$$W^G_4(i, j) = \frac{\exp(S^G_4(i, j))}{\sum_{j'} \exp(S^G_4(i, j'))}. \qquad (2)$$

We compute a global readout feature $R^G_4$ for the query via the matrix multiplication

$$R^G_4 = W^G_4 V^G_4, \qquad (3)$$

which can be considered as a value estimate at the query frame transferred from the global memory.
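The GMR readout can be summarized by the following sketch, which implements (1)-(3) with the expanded form of the negative squared L2 distance.

```python
import torch

def global_memory_read(k_q, k_g, v_g):
    """Sketch of GMR: negative squared L2 similarity, softmax affinity,
    and readout by matrix multiplication (shapes follow the paper's notation).

    k_q: (HW, Ck) query key at stage 4
    k_g: (THW, Ck) global memory key
    v_g: (THW, Cv) global memory value
    """
    # S^G(i, j) = -||k^Q_i - k^G_j||^2, expanded to avoid materializing differences.
    sim = 2 * k_q @ k_g.T - (k_q ** 2).sum(-1, keepdim=True) - (k_g ** 2).sum(-1)
    aff = torch.softmax(sim, dim=1)   # W^G: each query pixel's weights over memory
    return aff @ v_g                  # R^G: (HW, Cv) readout for the query

# Toy usage: T=2 memory frames of 30x54 pixels at stage 4, Ck=64, Cv=256.
k_q = torch.randn(30 * 54, 64)
k_g = torch.randn(2 * 30 * 54, 64)
v_g = torch.randn(2 * 30 * 54, 256)
r_g = global_memory_read(k_q, k_g, v_g)  # (1620, 256)
```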

2) LOCAL MEMORY READ (LMR)
We design the LMR operation to convey the segmentation information of the local memory to the query frame. Since the previous frame shares more common features with the query frame than any other frame, especially appearance information such as edges and boundaries, we perform LMR not only on coarse-scale key features but also on fine-scale features. For each $r$th feature stage, we transfer the local value feature $V^L_r \in \mathbb{R}^{H_r W_r \times C^v_r}$ using the affinity between the local key $K^L_r \in \mathbb{R}^{H_r W_r \times C^k_r}$ and the query key $K^Q_r$. Unlike GMR, LMR computes the local similarity $S^L_r$ within a local region $N_i$ for each pixel $i$ in the query to exploit the spatiotemporal smoothness between neighboring frames. Specifically, $S^L_r$ is defined as

$$S^L_r(i, j) = \begin{cases} -\| k^Q_{r,i} - k^L_{r,j} \|_2^2, & j \in N_i, \\ -\infty, & \text{otherwise}, \end{cases} \qquad (4)$$

where $k^Q_{r,i}$ and $k^L_{r,j}$ are the feature vectors for the $i$th pixel in the query and the $j$th pixel in the local memory, respectively. The local region $N_i$ is the set of pixels sampled from the $(2d+1) \times (2d+1)$ window around the $i$th pixel with stride 1; the similarity is computed only for pixels in the local region and set to $-\infty$ for the others. Then, $S^L_r$ is normalized via the softmax operation to obtain the local affinity matrix $W^L_r$, which has zero values between distant pixels. Similar to GMR, a local readout feature is obtained by

$$R^L_r = W^L_r V^L_r. \qquad (5)$$

In the LMR operation, $W^L_r$ deals with smooth movements between adjacent frames, since it transfers the value features within $N_i$ for each pixel $i$. Therefore, $R^L_r$ is able to consider the space-time continuity of objects in video frames.
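A sketch of LMR at a single stage is given below; it uses unfold to gather each query pixel's $(2d+1) \times (2d+1)$ neighborhood, so only valid similarities are stored (see Section III-E), and positions outside the frame are masked to $-\infty$.

```python
import torch
import torch.nn.functional as F

def local_memory_read(k_q, k_l, v_l, d=2):
    """Sketch of LMR at one stage: similarity restricted to a (2d+1)x(2d+1)
    window, so only (2d+1)^2 values per pixel are stored.

    k_q, k_l: (Ck, H, W) query/local keys; v_l: (Cv, H, W) local value.
    """
    ck, h, w = k_q.shape
    win = (2 * d + 1) ** 2
    # Gather, for every query pixel, the window of local-memory keys/values.
    k_n = F.unfold(k_l[None], 2 * d + 1, padding=d).view(ck, win, h * w)
    v_n = F.unfold(v_l[None], 2 * d + 1, padding=d).view(-1, win, h * w)
    q = k_q.view(ck, 1, h * w)
    sim = -((q - k_n) ** 2).sum(0)              # (win, HW): -||k^Q_i - k^L_j||^2
    # Out-of-frame padding positions are treated as -inf, as in (4).
    valid = F.unfold(torch.ones(1, 1, h, w), 2 * d + 1, padding=d).view(win, h * w)
    sim = sim.masked_fill(valid == 0, float('-inf'))
    aff = torch.softmax(sim, dim=0)             # W^L: zeros outside N_i
    r = (v_n * aff[None]).sum(1)                # R^L: (Cv, HW)
    return r.view(-1, h, w)
```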

3) SEQUENTIAL LOCAL MEMORY READ (SLMR)
We find that affinities between the query and the local memory vary according to the level of the feature stage, even at the same position. In other words, the affinity of a pixel at level $r$ is different from the affinity of the corresponding pixel at the higher level $r+1$, since the key features represent different object properties according to the depth of the encoder.
Based on this observation, we propose SLMR to diversify the propagation process of the local values with higher-level features. For this purpose, we reorganize the local key feature $K^L_r$ and the local value feature $V^L_r$ with the affinity $W^L_{r+1}$ at the higher level (coarser scale). Let $\tilde{W}^L_{r+1} \in \mathbb{R}^{H_{r+1} \times W_{r+1} \times H_{r+1} \times W_{r+1}}$ denote the 4D affinity tensor reshaped from $W^L_{r+1}$. Here, $\tilde{W}^L_{r+1}(x, y, p, q)$ denotes the affinity between a pixel $(x, y)$ in the query and a pixel $(p, q)$ in the local memory at the $(r+1)$th feature stage. Also, let $\tilde{K}^L_r \in \mathbb{R}^{H_r \times W_r \times C^k}$ and $\tilde{V}^L_r \in \mathbb{R}^{H_r \times W_r \times C^v}$ be the 3D tensors reshaped from $K^L_r$ and $V^L_r$, respectively. For each feature stage $r$, we obtain a sequential local key $K^S_r \in \mathbb{R}^{H_r \times W_r \times C^k}$ and a sequential local value $V^S_r \in \mathbb{R}^{H_r \times W_r \times C^v}$ using those tensors:

$$K^S_r(x, y, c) = \sum_{p, q} \tilde{W}^L_{r+1}(\bar{x}, \bar{y}, p, q)\, \tilde{K}^L_r(2p, 2q, c), \qquad (6)$$

$$V^S_r(x, y, c) = \sum_{p, q} \tilde{W}^L_{r+1}(\bar{x}, \bar{y}, p, q)\, \tilde{V}^L_r(2p, 2q, c), \qquad (7)$$

where $\bar{x} = \lfloor x/2 \rfloor$ and $\bar{y} = \lfloor y/2 \rfloor$. This is repeated for all pixels $(x, y)$ and channels $c$. As in (6) and (7), we obtain the sequential local key and value via the weighted sum with the affinity at the higher level. Figure 3 illustrates the reorganization process for $K^S_r$ and $V^S_r$. After $K^S_r$ and $V^S_r$ are reshaped to matrices, the similarity matrix $S^S_r$ and the affinity $W^S_r$ between $K^Q_r$ and $K^S_r$ are computed sequentially, as in LMR, to acquire a sequential local readout feature

$$R^S_r = W^S_r V^S_r. \qquad (8)$$
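The reorganization in (6) and (7) can be sketched as follows; mapping each coarse memory pixel $(p, q)$ to the fine pixel $(2p, 2q)$ is an assumption here, since the exact fine-scale indexing is defined by Figure 3.

```python
import torch

def sequential_local_reorganize(w_hi, k_fine, v_fine):
    """Rough sketch of the SLMR reorganization (6)-(7).

    w_hi:   (Hh, Wh, Hh, Wh) 4D affinity at stage r+1 (query x memory)
    k_fine: (H, W, Ck) local key at stage r, with H = 2*Hh, W = 2*Wh
    v_fine: (H, W, Cv) local value at stage r
    """
    k_c = k_fine[::2, ::2]                     # sample fine features at (2p, 2q)
    v_c = v_fine[::2, ::2]
    # Weighted sum over memory pixels (p, q) for every coarse query pixel.
    k_s = torch.einsum('xypq,pqc->xyc', w_hi, k_c)
    v_s = torch.einsum('xypq,pqc->xyc', w_hi, v_c)
    # Each fine pixel (x, y) uses the affinity at (x//2, y//2): upsample by 2.
    k_s = k_s.repeat_interleave(2, 0).repeat_interleave(2, 1)
    v_s = v_s.repeat_interleave(2, 0).repeat_interleave(2, 1)
    return k_s, v_s                            # K^S_r, V^S_r at (H, W)
```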

C. MEMORY COMPARATOR
We propose the memory comparator to use the readout features obtained from GMR, LMR, and SLMR adaptively. Figure 4(a) illustrates the diagram of the memory comparator. The proposed memory comparator estimates pixel-wise weights for the local readout features $\{R^L_r\}_{r=2}^{4}$ and the sequential local readout features $\{R^S_r\}_{r=2}^{3}$ by comparing the similarity matrix $S^G_4$ in GMR with $\{S^L_r\}_{r=2}^{4}$ in LMR and $\{S^S_r\}_{r=2}^{3}$ in SLMR.

1) TOP-K SELECTION
We select the top-$k$ values in each row of the similarity matrices and remove the others, thereby obtaining $S^{G_k}_4 \in \mathbb{R}^{H_4 W_4 \times k}$, $\{S^{L_k}_r \in \mathbb{R}^{H_r W_r \times k}\}_{r=2}^{4}$, and $\{S^{S_k}_r \in \mathbb{R}^{H_r W_r \times k}\}_{r=2}^{3}$. Through the top-$k$ operation, the memory comparator considers the $k$ primary similarities between the query and the memory. Since there is only one scale ($H_4 \times W_4$) for the global similarity matrix, we sequentially upsample $S^{G_k}_4$ using bilinear interpolation to obtain $\{S^{G_k}_r \in \mathbb{R}^{H_r W_r \times k}\}_{r=2}^{3}$.
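The following sketch illustrates the top-$k$ selection and the sequential bilinear upsampling of the global similarity; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def topk_similarities(s_g4, s_l, h4, w4, k=5):
    """Sketch: keep the k largest similarities per row of each similarity
    matrix, then bilinearly upsample the global map to the finer scales.

    s_g4: (H4*W4, M) global similarity at stage 4
    s_l:  dict {r: (Hr*Wr, Mr)} local (or sequential local) similarities
    """
    out = {4: s_g4.topk(k, dim=1).values}                  # (H4*W4, k)
    h, w = h4, w4
    for r in (3, 2):                                       # one 2x upsample per stage
        g = out[r + 1].T.reshape(1, k, h, w)
        g = F.interpolate(g, scale_factor=2, mode='bilinear', align_corners=False)
        h, w = 2 * h, 2 * w
        out[r] = g.reshape(k, h * w).T                     # (Hr*Wr, k)
    s_l_k = {r: s.topk(k, dim=1).values for r, s in s_l.items()}
    return out, s_l_k
```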

2) SIMILARITY COMPARISON BLOCK
The similarity comparison block (SCB) takes a pair of the global similarity $S^{G_k}_r$ and the local similarity $S^{L_k}_r$ (or the sequential local similarity $S^{S_k}_r$) for each feature stage $r$. When $S^{G_k}_r$ and $S^{L_k}_r$ are given, SCB produces reliability weights that indicate which pixels in the local readout feature are more reliable than those in the global readout feature. When a pixel $i$ has a larger local similarity than global similarity, SCB assigns a high weight to the local readout feature for pixel $i$. As in Figure 4(b), SCB compares $S^{L_k}_r$ and $S^{G_k}_r$ via element-wise subtraction followed by the softmax operation. Thus, a difference map $D^L_r \in \mathbb{R}^{H_r W_r \times k}$ is obtained by

$$D^L_r = \mathrm{softmax}\big(\alpha\,(S^{L_k}_r - S^{G_k}_r)\big), \qquad (9)$$

where $\alpha$ is a scale factor and the softmax operation is applied to each row. $D^L_r$ is fed into a 1 × 1 convolution with a single output channel and the sigmoid operation sequentially, resulting in the reliability weight $H^L_r \in \mathbb{R}^{H_r W_r \times 1}$. Thus, $H^L_r$ is designed to limit the usage of $R^L_r$ to pixels where $R^G_r$ is unreliable for estimating segmentation results, by comparing the similarities. Then, a weighted local readout feature $\tilde{R}^L_r$ is given by

$$\tilde{R}^L_r = H^L_r \odot R^L_r, \qquad (10)$$

where $\odot$ denotes that each coefficient in $H^L_r$ is multiplied to all $C^v_r$ coefficients in $R^L_r$ at the same spatial position. As in Figure 4(a), SCB is applied to both local and sequential local readout features for all feature stages. In this way, the weighted readout features $\{\tilde{R}^L_r\}_{r=2}^{4}$ and $\{\tilde{R}^S_r\}_{r=2}^{3}$ are obtained and fed into the decoder.
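SCB can be sketched as below; applying the softmax along the top-$k$ dimension in (9) is an assumption.

```python
import torch
import torch.nn as nn

class SimilarityComparisonBlock(nn.Module):
    """Sketch of SCB: softmax over the scaled difference of top-k similarities,
    a 1x1 convolution to a single channel, and a sigmoid reliability weight."""
    def __init__(self, k=5, alpha=3.0):
        super().__init__()
        self.alpha = alpha
        self.conv = nn.Conv2d(k, 1, kernel_size=1)

    def forward(self, s_l_k, s_g_k, r_l, h, w):
        # s_l_k, s_g_k: (H*W, k) top-k similarities; r_l: (H*W, Cv) local readout.
        d = torch.softmax(self.alpha * (s_l_k - s_g_k), dim=1)   # difference map (9)
        d = d.T.reshape(1, -1, h, w)                             # (1, k, H, W)
        h_l = torch.sigmoid(self.conv(d)).reshape(h * w, 1)      # reliability weight H^L
        return h_l * r_l                                         # weighted readout (10)
```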
D. DECODER
Figure 5 shows the architecture of the decoder. In the decoder, features are gradually upsampled by a factor of two using the readout features, i.e. $R^G_4$, $\{\tilde{R}^L_r\}_{r=2}^{4}$, and $\{\tilde{R}^S_r\}_{r=2}^{3}$, and the frame features $\{F^Q_r\}_{r=2}^{4}$ through skip-connections. As in Figure 5, the multi-scale readout features are processed according to their feature scales. Finally, the output of the final layer is upsampled by a factor of four using bilinear interpolation to be of the same size as the input frame.
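The decoder can be sketched as follows; the block widths and fusion layout are assumptions, and the skip-connections from the frame features $\{F^Q_r\}_{r=2}^{4}$ are omitted for brevity, since Figure 5 defines the actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Illustrative decoder: fuse the readout features stage by stage, upsample
    by 2x twice, and finish with a 4x bilinear upsample to the input size."""
    def __init__(self, c_v=(64, 128, 256), c_g=256, width=256):
        super().__init__()
        self.fuse4 = nn.Conv2d(c_g + c_v[2], width, 3, padding=1)
        self.fuse3 = nn.Conv2d(width + 2 * c_v[1], width, 3, padding=1)
        self.fuse2 = nn.Conv2d(width + 2 * c_v[0], width, 3, padding=1)
        self.head = nn.Conv2d(width, 2, 3, padding=1)   # per-object fg/bg logits

    def forward(self, r_g4, r_l, r_s):
        # Stage 4 fuses R^G and ~R^L_4; stages 3 and 2 add ~R^L_r and ~R^S_r.
        x = F.relu(self.fuse4(torch.cat([r_g4, r_l[4]], dim=1)))
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        x = F.relu(self.fuse3(torch.cat([x, r_l[3], r_s[3]], dim=1)))
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        x = F.relu(self.fuse2(torch.cat([x, r_l[2], r_s[2]], dim=1)))
        return F.interpolate(self.head(x), scale_factor=4,
                             mode='bilinear', align_corners=False)
```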

E. IMPLEMENTATION DETAILS
1) LOSS
The proposed network is trained to minimize the loss

$$L = L_{pce} + \beta L_{scale}, \qquad (11)$$

where $L_{pce}$ is the pixel-wise cross-entropy in [53] between the segmentation prediction and the ground-truth. Also, the scale loss $L_{scale}$ in (12) is designed to minimize the distances between query key features at different scales, where $i$, $i'$, and $i''$ denote the equivalent positions in the query key features of the three stages. $L_{scale}$ is used until 1K iterations. We propose $L_{scale}$ to boost the training of the memory comparator in the early training stage.
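A sketch of the training loss is given below. The exact form of $L_{scale}$ in (12) is not reproduced in the text, so an L2 penalty between query key features at corresponding positions across scales is assumed; likewise, plain cross-entropy stands in for the variant in [53].

```python
import torch
import torch.nn.functional as F

def total_loss(logits, target, keys, beta=1e-4, it=0, scale_until=1000):
    """Sketch of loss (11): pixel-wise cross-entropy plus an assumed scale loss.

    keys: {r: (Ck, Hr, Wr)} query key features for r = 2, 3, 4.
    """
    l_pce = F.cross_entropy(logits, target)
    l_scale = logits.new_zeros(())
    if it < scale_until:                 # L_scale is used only until 1K iterations
        for r in (2, 3):
            # Align positions i, i', i'' by pooling stage r down to stage r+1.
            k_lo = F.avg_pool2d(keys[r][None], 2)
            l_scale = l_scale + F.mse_loss(k_lo, keys[r + 1][None])
    return l_pce + beta * l_scale
```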

2) TRAINING AND INFERENCE
We train the proposed model on the training videos of DAVIS2017 [41] and YouTube-VOS [42]. We randomly select three different frames within 10 frames: one for the global memory, another for the local memory, and the other for the query frame. We set the mini-batch size to 8 and use the Adam optimizer [54]. The training is repeated for 200K iterations on an RTX 3090 GPU. We initialize the key encoder and the value encoder with the pre-trained weights of STCN [7]. During inference, every 5th frame except the previous frame is picked for the global memory, and the previous frame is used for the local memory.

3) PARAMETERS
The channel dimensions $C^f_2$, $C^f_3$, and $C^f_4$ are set to 256, 512, and 1024, respectively. The key feature dimensions $C^k_2$, $C^k_3$, and $C^k_4$ are all set to 64. For the value features, the numbers of channels $C^v_2$, $C^v_3$, and $C^v_4$ are set to 64, 128, and 256, respectively. Also, we experimentally set the offset of the local region to $d = 2$, use top-5 selection in the memory comparator, and set $\alpha = 3$ in (9) and $\beta = 10^{-4}$ in (11).
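These hyperparameters can be collected as follows; the variable names are illustrative.

```python
# Hyperparameters as stated in the paper (names are illustrative).
CONFIG = {
    "c_frame": {2: 256, 3: 512, 4: 1024},   # C^f_r
    "c_key":   {2: 64,  3: 64,  4: 64},     # C^k_r
    "c_value": {2: 64,  3: 128, 4: 256},    # C^v_r
    "d": 2,          # local-region offset: (2d+1)x(2d+1) = 5x5 window
    "k": 5,          # top-k in the memory comparator
    "alpha": 3.0,    # scale factor in (9)
    "beta": 1e-4,    # scale-loss weight in (11)
}
```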

4) MEMORY MANAGEMENT IN LMR AND SLMR
Since LMR and SLMR are performed at fine scales as well as coarse scales, constructing dense similarities and affinities for each feature stage may lead to memory overflow. To prevent this issue, we construct the local similarities and affinities so that only valid values are stored. Since the number of valid values in the similarities and affinities for each pixel is $(2d+1)^2$, the memory complexity is only $O(H_r W_r \cdot (2d+1)^2)$ instead of $O(H_r W_r \cdot H_r W_r)$ at feature stage $r$.
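To make the saving concrete, consider feature stage 2 of a 480p frame with $d = 2$; the 854-pixel width is an assumption based on the standard 480p aspect ratio:

$$H_2 W_2 = \frac{480}{4} \times \frac{854}{4} \approx 120 \times 214 = 25{,}680, \qquad \frac{O(H_2 W_2 \cdot H_2 W_2)}{O\big(H_2 W_2 \cdot (2d+1)^2\big)} = \frac{25{,}680}{25} \approx 10^3,$$

i.e. the local construction stores roughly a thousand times fewer entries than the dense one at the finest stage.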

IV. EXPERIMENTAL RESULTS
In this section, we first compare the proposed algorithm with the state-of-the-art VOS algorithms on various datasets. Second, we analyze the proposed local read operations and memory comparator through various ablation studies.
A. DATASETS
1) DAVIS
DAVIS [2], [41] is a densely annotated VOS dataset, which is the most commonly used to evaluate VOS algorithms. It provides 480p videos in two separate datasets: DAVIS2016 and DAVIS2017. DAVIS2016 provides 50 single-object annotated videos, which are divided into 30 for training and 20 for validation. DAVIS2017 provides 60/30/30 videos for the training/validation/test-dev sets with multi-object annotations. The region similarity $\mathcal{J}$, the contour accuracy $\mathcal{F}$, and their mean $\mathcal{J}\&\mathcal{F}$ are used as metrics in the experiments.

2) YouTube-VOS
YouTube-VOS [42] is a large-scale VOS dataset. It provides 3471 training videos and 474/507 validation videos for the YouTube2018/YouTube2019 datasets with multi-object annotations at various resolutions. In our evaluation, we resize the input frames to 480p. It has 65 seen and 26 unseen object categories. We measure $\mathcal{J}_S$ and $\mathcal{F}_S$ for the seen categories and $\mathcal{J}_U$ and $\mathcal{F}_U$ for the unseen categories. We also use the overall score $\mathcal{G}$, which is the mean of the four metrics.
B. COMPARATIVE ASSESSMENT
1) DAVIS
Table 2 compares the proposed algorithm with the existing semi-supervised VOS algorithms on the validation sets of DAVIS2016 and DAVIS2017. The scores in Table 2 are from the respective papers. For DAVIS2016, the proposed algorithm improves the segmentation performance by 0.4%, 0.4%, and 0.3% in terms of $\mathcal{J}\&\mathcal{F}$, $\mathcal{J}$, and $\mathcal{F}$, respectively. Also, for DAVIS2017, in spite of its difficulty, the proposed algorithm achieves performance improvements of 0.5%, 0.6%, and 0.5% in terms of $\mathcal{J}\&\mathcal{F}$, $\mathcal{J}$, and $\mathcal{F}$. This indicates that the proposed local read-and-comparator model is effective for both single-object and multi-object cases.
2) YouTube-VOS
Table 3 compares the proposed algorithm with the existing VOS algorithms on the YouTube2018 and YouTube2019 validation sets. In terms of $\mathcal{G}$, the proposed algorithm achieves the best performance on YouTube2018 and the same performance as the state-of-the-art [7] on YouTube2019. Specifically, for the seen categories, the proposed algorithm stands in second and third place on YouTube2018 and YouTube2019, respectively. On the other hand, the proposed method shows the best segmentation results for the unseen categories on both YouTube2018 and YouTube2019. This indicates that the proposed method has superior generalization performance compared with the state-of-the-arts. The proposed local read operations and memory comparator are robust to unseen categories by exploiting the spatiotemporal smoothness between neighboring frames. Figure 6 shows a qualitative comparison of the proposed algorithm (LMRC) with STM [3] and STCN [7] on the DAVIS2017 and YouTube2019 validation sets, where failed predictions are marked in yellow boxes with dotted lines. Both STM and STCN fail to accurately segment detailed regions, such as the bike wheels in the 'bike-packing' and 'mbike-trick' sequences. Also, they are vulnerable to overlapping objects of the same category, as in the YouTube-VOS examples. In the '56e991f4a6' sequence, they fail to recognize the boundaries of the two overlapping cheetahs. In the 'a9cee00b66' sequence, STM even merges them into one object in the end. In contrast, the proposed algorithm provides accurate results by exploiting the local memory effectively.

C. ANALYSIS
1) ABLATION STUDY
We first analyze the effectiveness of the proposed components: LMR, SLMR, and the memory comparator (MC). In Table 4, we report the $\mathcal{J}\&\mathcal{F}$, $\mathcal{J}$, and $\mathcal{F}$ scores and frames per second (fps) for various settings on the DAVIS2017 validation set. We also measure $\mathcal{G}$, $\mathcal{J}_S$, $\mathcal{F}_S$, $\mathcal{J}_U$, and $\mathcal{F}_U$ on the YouTube2018 validation set. We train each case in the same manner as in Section III-E.
Setting A is the baseline, which uses GMR only. In setting B, LMR is employed at only a single scale, the 4th feature stage, which is denoted as LMR-S. Settings B and C show that LMR improves performance. Also, the performance gap between B and C indicates that multi-scale readout features are effective in transferring the information of the local memory to the query. In addition, we see that LMR dramatically increases the performance on the unseen categories of YouTube2018. This is because LMR effectively transfers features within the local region, and the local readout feature is trained to emphasize the pixel level more than the category level. We also observe from settings D and F that SLMR effectively increases the accuracy of the segmentation results. Note that SLMR lowers the overall performance without the memory comparator, but improves the performance for both seen and unseen categories with the memory comparator on YouTube2018. Finally, settings E and F outperform settings C and D, respectively, by employing the proposed memory comparator. These results demonstrate that the memory comparator significantly improves the performance while requiring little additional time.

2) LOCAL REGION AND TOP-K SELECTION
We analyze the local region of LMR and SLMR, and the top-$k$ selection in the memory comparator on the YouTube2018 and DAVIS2017 validation sets. Table 5(a) shows that $d = 2$ provides the best performance. Also, we observe that there are no significant changes according to the size of the local region. This is because LMR and SLMR are used adaptively based on the reliability weights. Table 5(b) shows how the performance varies as $k$ changes; $k = 5$ yields the best performance on both datasets. Figure 7 shows the reliability weights $H^L_3$ and $H^L_4$, provided by the memory comparator, for three scene cases: static, dynamic, and fast movement. We observe three properties of the reliability weight. First, $H^L_3$ has high reliability weights near object edges, which indicates that local readout features are used intensely on object edges to deal with the spatiotemporally smooth motions of target objects between adjacent frames. Second, the $H^L_4$ maps in the dynamic scene are generally higher than those in the static scene. In a static scene, the global readout features are sufficiently reliable since the frames in the global memory have similar features to each other. On the other hand, the global readout features in dynamic scenes are generally unreliable, and thus the local readout features should be used with high weights. From the $H^L_4$ maps in the dynamic and static scenes, we can observe that the proposed memory comparator provides effective reliability maps for accurate segmentation. Third, the memory comparator effectively filters out the local readout features at fast-moving regions of the object (the right leg within the yellow box) with low reliability weights. Thus, the memory comparator deals with the problem of large movements beyond the local region $N_i$.

V. CONCLUSION
We proposed a novel VOS algorithm that propagates the fused readout features of the local and global memories. First, we developed LMR and SLMR to convey the segmentation information hierarchically and deal with the spatial proximity between adjacent frames. Second, we designed the memory comparator to adaptively read the local memory by comparing the similarities of the local memory and the global memory. Experimental results demonstrated that the proposed algorithm outperforms the recent state-of-the-art algorithms and overcomes the limitations of the existing memory-based approaches. Although the proposed method exploits the immediately adjacent frame, frames two or more steps behind should also be taken into account as local frames, together with the global memory, in a discriminative manner. In future work, we will design a scheme that fuses multiple local frames with the global memory to deal with spatial contiguity.