Cascaded MPN: Cascaded Moment Proposal Network for Video Corpus Moment Retrieval

Video corpus moment retrieval aims to localize the temporal moment corresponding to a textual query in a large video corpus. Previous moment retrieval systems largely fall into two categories: (1) anchor-based methods, which preset a set of video segment proposals (via sliding windows) and predict the proposal that best matches the query, and (2) anchor-free methods, which directly predict the frame-level start-end time of the moment related to the query (via regression). Both have inherent weaknesses: (1) anchor-based methods are vulnerable to the heuristic rules used to generate video proposals, which restrict the lengths of predicted moments; and (2) anchor-free methods, being based on frame-level interplay, suffer from insufficient understanding of contextual semantics in long and sequential videos. To overcome these challenges, our proposed Cascaded Moment Proposal Network incorporates the following two main properties: (1) Hierarchical Semantic Reasoning, which provides video understanding from the anchor-free level to the anchor-based level by building a hierarchical video graph, and (2) Cascaded Moment Proposal Generation, which precisely performs moment retrieval by devising cascaded multi-modal feature interaction between anchor-free and anchor-based video semantics. Extensive experiments show state-of-the-art performance on three moment retrieval benchmarks (TVR, ActivityNet, DiDeMo), while qualitative analysis shows improved interpretability. The code will be made publicly available.


I. INTRODUCTION
Comprehending visual context together with natural language has been a desideratum of the vision-language research community. Numerous notable works have made great strides in bridging computer vision and natural language processing, including video/image captioning [30], [34], video moment retrieval [1], [8], and video/image question answering [12], [25]. In particular, the recent success of video streaming services (e.g., YouTube) has drawn interest in fine-grained video search technologies. Accordingly, Video Corpus Moment Retrieval (VCMR) [13] is a task to localize a moment in a large video corpus, which includes two sub-tasks: (1) identifying the relevant video among multiple videos and (2) searching for a specific moment in the identified video. Concretely, during training a VCMR system is given a single video-query pair with a boundary label, so that the system is trained to find the moment related to the query in that video. At inference, the system is given a video corpus and a query, and is required to find the moment at the corpus level. In this respect, VCMR is a generalized form of single-video moment retrieval.
From a methodological perspective, moment retrieval methods are typically grouped into two categories: (1) anchor-based methods and (2) anchor-free methods [16], [22], [36], [40]. Anchor-based methods [16], [22] follow the intuitive solution of first presetting a set of candidate video proposals (via sliding windows) and then classifying the proposals. Figure 1(a) depicts candidate proposal generation using sliding windows of different lengths, where the proposal with the highest similarity to the given query is selected. Anchor-based methods are capable of understanding contextual semantics across long and sequential video frames, but they suffer from a structural boundary limitation: the candidate proposals must be predefined in some heuristic manner. On the other hand, anchor-free methods [36], [40] carry no burden of generating predefined temporal boundaries, as they directly predict the start-end time of the moment pertinent to the query. Figure 1(b) shows an example of an anchor-free method [36], which regresses the start and end times from the joint video-query embedding using a multilayer perceptron (MLP). Regression is free of the preset-boundary problem, but the MLP easily overfits and struggles to capture heterogeneous vision-language semantics, so performance remains far from satisfactory.
As shown in Figure 1(c), a recent anchor-free method [40] devises a two-dimensional moment score map, where one dimension indicates the start frame of a moment and the other its end frame. To build this score map, the system predicts frame-level start-time and end-time probabilities, then multiplies the two to obtain a joint start-end probability in the form of a 2D moment score map; each element of the map thus contains a frame-level moment score for its start and end frames. Although the 2D moment score map alleviates the aforementioned preset-boundary problem, it still confronts the inherent weakness of anchor-free methods: insufficient understanding of contextual semantics across long and sequential frames. This is because existing anchor-free systems utilize only frame-level video-query similarities, so they are not aided by context-level understanding of video segments of varying lengths. Such contextual semantic understanding can be even more crucial in scenes based on multi-character interaction (e.g., dramas, movies), combined with auxiliary modalities such as subtitles and sounds.
To overcome the aforementioned challenges of existing methods, our proposed Cascaded Moment Proposal Network (Cascaded MPN) incorporates the following two main properties: (1) Hierarchical Semantic Reasoning (HSR), which provides video contextual understanding from the anchor-free level to the anchor-based level by building a hierarchical video graph, and (2) Cascaded Moment Proposal Generation (Cascaded MPG), which precisely performs moment retrieval by devising cascaded multi-modal feature interaction between anchor-free and anchor-based video semantics. In the overall pipeline, HSR provides anchor-based and anchor-free level semantics by building a bipartite graph between video frames and subtitles, and these multi-level (anchor-based, anchor-free) semantics generate multi-level moment score maps based on their similarity to the given query. Finally, Cascaded MPG associates the contextual meanings from each level of moment score map in a recursive manner and predicts the moment pertinent to the query. Cascaded MPN shows effectiveness on three challenging benchmarks (i.e., TVR, DiDeMo, and ActivityNet), and the code will be made publicly available.

A. VIDEO MOMENT RETRIEVAL
Video moment retrieval (VMR) is the task of localizing a moment pertinent to a given natural language sentence. Spurred by the remarkable advancements of natural language processing [5], [23], VMR systems have developed from localizing temporal activities to understanding a natural language query and retrieving the relevant moment [6], [33], [35], [38]. Furthermore, improvements in video representation learning [3], [26] have boosted the performance of retrieval systems, from video retrieval to moment retrieval. The first retrieval attempts [15], [24] targeted temporal activity localization, which aims to predict the start-end time of a moment corresponding to pre-defined actions. Subsequent advances in natural language processing bridged temporal activity localization to language-based moment retrieval. Gao et al. [8] first proposed video moment retrieval, which localizes moments given a sentence describing actions. Hendricks et al. [1] proposed VMR in a simplified format for clip-level video understanding. In the meantime, Mithun et al. [21] proposed a different type of retrieval system, called video retrieval, that finds the video related to a text among multiple videos. Recent efforts advance toward a more general form of video moment retrieval: Lei and Li et al. [13], [14] propose systems that perform moment retrieval at the video corpus level, incorporating both video retrieval and single-video moment retrieval. Video corpus moment retrieval embodies the insight that moment retrieval systems should operate in the general situation of being given multiple videos. Taking another step toward generality, we strive for enhanced interpretability in VCMR.
B. VIDEO PROPOSAL GENERATION
A video moment retrieval system estimates moments by predicting the start-end time of the moments related to a given query. The current literature can be largely grouped into two categories depending on how the moments are predicted: anchor-based methods and anchor-free methods. Anchor-based methods facilitate context-level video representation learning, which is suitable for learning sequential semantics in the video. Previous anchor-based works generate several proposals of different sizes with sliding windows and retrieve the most pertinent one. Lin et al. [16] applied a reinforcement-learning algorithm balancing exploration and exploitation to select top-K proposals. Ma et al. [19] proposed surrogate proposal selection, which reduces redundant proposals by selecting one surrogate from each defined proposal group. Anchor-free methods, in contrast, are versatile in predicting moments without temporal boundary constraints. Yuan et al. [36] proposed Attention-Based Location Regression, which regresses the start-end points of the moment related to the query via a series of multi-modal co-attentions. Zhang et al. [40] designed a two-dimensional map with start time and end time as axes, which covers diverse video moments of different lengths. Xiao et al. [32] proposed a two-stage candidate proposal generation method, which prepares the two-dimensional map as in [40] and searches candidate moment proposals from the map. Wang et al. [31] also proposed a two-stage coarse-to-fine-grained multi-modal interaction between video and query. Although many novel ideas have been proposed, there is still room for improvement in converging the anchor-free and anchor-based manners; we therefore strive to perform fine-grained moment retrieval by integrating the strengths of these two methods.

III. METHOD
Cascaded MPN takes a video and a textual query as inputs and produces a moment score map that scores how similar each moment is to the query. Figure 2 presents the overall pipeline of Cascaded MPN: we define anchor-based and anchor-free semantic features in Hierarchical Semantic Reasoning and explain how they are associated to predict the best-matched moment in Cascaded MPG. In training, the system is given a single video-query pair and is trained to find the moment in the paired video. At inference, it is required to find the moment in a video corpus.
A. HIERARCHICAL SEMANTIC REASONING
The inputs are N_v frame features and N_s subtitle features, where N_v and N_s are the number of frames and subtitles in a single video. To reason hierarchical semantics at the anchor-based and anchor-free levels, we first define (1) anchor-free semantic features and then construct (2) anchor-based semantic features founded on the anchor-free semantics.

2) ANCHOR-FREE SEMANTIC REASONING
As shown in the Anchor-free Semantic Reasoning part of Figure 2, video frames that share the same subtitle hold common contextual meaning from that subtitle. To impart this common meaning to the video frames, we build a bipartite graph between the shared frames and the subtitle. To this end, we reorganize the video frames to be aligned with a single subtitle s_i, and embed the aligned frames v^{s_i} and subtitle words w^{s_i} as

E_v^{s_i} = LN(φ_v(v^{s_i}) + PE),  E_w^{s_i} = LN(φ_w(w^{s_i}) + PE),

where φ_v and φ_w are d-dimensional embedders, LN is layer normalization [2], and PE is the positional encoding [27]. The frame embedding E_v^{s_i} and word embedding E_w^{s_i} contain the common semantic of subtitle s_i; to hold this semantic in each video feature, we formally construct the video-subtitle graph G^{s_i} = (H^{s_i}, E^{s_i}) by regarding E_w^{s_i} and E_v^{s_i} as the node group H^{s_i} in Equation 3. For the edges E^{s_i} of the video-subtitle graph, we design a bipartite graph between the words and frames.
H^{s_i} = [E_v^{s_i} || E_w^{s_i}] ∈ R^{M^{s_i} × d},   (3)

where M^{s_i} = M_v^{s_i} + M_w^{s_i} is the number of nodes in the node group H^{s_i} and [·||·] denotes concatenation along the frame and word axes. To aid understanding, the Anchor-free Semantic Reasoning section of Figure 2 depicts a diagram of the bipartite graph showing the connectivity between nodes. To associate the frame embeddings with the word embeddings, we conduct multi-head graph attention [29]. In each head, the attention coefficient α^k_{mn} associates any two linked nodes m and n within the node group H^{s_i}, where k in α^k_{mn} denotes the k-th head:

α^k_{mn} = exp(LeakyReLU(w_k^T [W_k H_m^{s_i} || W_k H_n^{s_i}])) / Σ_{l∈N_m} exp(LeakyReLU(w_k^T [W_k H_m^{s_i} || W_k H_l^{s_i}])),

where w_k ∈ R^{2d} is a weight vector and W_k ∈ R^{d×d} is a shared embedding. H_m^{s_i} ∈ R^d is the m-th node feature in H^{s_i}, and H_n^{s_i}, H_l^{s_i} follow the same meaning. N_m is the set of all nodes linked to node m in the bipartite graph. All nodes are updated with these attention coefficients, and we define the video-subtitle features Z^{s_i} by averaging the updated node features over the heads:

Z_m^{s_i} = (1/K) Σ_{k=1}^{K} Σ_{n∈N_m} α^k_{mn} W_k H_n^{s_i},

where K is the number of attention heads. We use the term video-subtitle because the frames and words in one subtitle obtain shared semantics through the attention. We define the final anchor-free level semantic features by adding Z^{s_i} to the original H^{s_i}:

V^{s_i} = (Z^{s_i} + H^{s_i})[: M_v^{s_i}],

where we keep only the video features in (Z^{s_i} + H^{s_i}) as V^{s_i}, supposing that the subtitle semantics are already involved in the frames via the video-subtitle features Z^{s_i}; [: i] is the slicing operation along the node axis.
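The bipartite graph attention above can be illustrated with a minimal NumPy sketch. This is a single-head simplification on a toy graph (the paper uses K heads and real frame/word features); the shapes and the edge layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def bipartite_gat_head(H, adj, W, w):
    """One attention head over the frame-word bipartite graph.

    H   : (M, d) node features (frame nodes then word nodes of one subtitle)
    adj : (M, M) 0/1 connectivity; adj[m, n] = 1 links node n to node m
    W   : (d, d) shared embedding, w : (2d,) attention weight vector
    """
    M, d = H.shape
    WH = H @ W
    # e[m, n] = LeakyReLU(w^T [WH_m || WH_n]), split into source/target parts
    e = leaky_relu((WH @ w[:d])[:, None] + (WH @ w[d:])[None, :])
    e = np.where(adj > 0, e, -np.inf)            # attend only over linked nodes N_m
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    Z = alpha @ WH                               # aggregate neighbour features
    return Z + H                                 # residual Z + H, as in the text

# toy graph: 2 frame nodes fully linked to 2 word nodes of one subtitle
rng = np.random.default_rng(0)
d = 4
H = rng.normal(size=(4, d))
adj = np.zeros((4, 4))
adj[:2, 2:] = adj[2:, :2] = 1                    # bipartite edges
V = bipartite_gat_head(H, adj, rng.normal(size=(d, d)), rng.normal(size=2 * d))
print(V.shape)  # (4, 4); slicing the first two rows keeps the frame features
```

Keeping only the first M_v^{s_i} rows of the result corresponds to the slicing step that yields the anchor-free features V^{s_i}.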

3) ANCHOR-BASED SEMANTIC REASONING
In anchor-based semantic reasoning, we first collect all the anchor-free level features V = {V^{s_i}}_{i=1}^{N_s} ∈ R^{N_v × d} and uniformly divide V into N segments. From these segments, we build N anchor-based semantics C_N ∈ R^{N × d}. Within each segment, we perform multi-head self-attention on V using a Transformer [28] and take the average of the segment along the frame axis:

C_N[j] = Avg(Head(V_j)),  j = 1, . . . , N,

where Head denotes the multi-head self-attention of the Transformer and V_j is the j-th segment. The following Cascaded MPG operates on the anchor-free semantic features V ∈ R^{N_v × d} and the anchor-based semantic features C_N ∈ R^{N × d}.
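The segment-then-pool construction can be sketched as follows. For brevity this uses a single attention head with identity projections instead of a full Transformer block; the frame count and dimension are toy assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def self_attention(X):
    # minimal single-head self-attention (identity Q/K/V projections for brevity)
    scores = softmax(X @ X.T / np.sqrt(X.shape[1]))
    return scores @ X

def anchor_based_semantics(V, N):
    """Split anchor-free features V (N_v, d) into N segments,
    attend within each segment, and average along the frame axis."""
    segments = np.array_split(V, N, axis=0)
    return np.stack([self_attention(seg).mean(axis=0) for seg in segments])

V = np.random.default_rng(1).normal(size=(16, 8))   # 16 frames, d = 8
C = anchor_based_semantics(V, N=4)
print(C.shape)  # (4, 8): one d-dimensional feature per anchor-based segment
```

Each row of `C` plays the role of one entry of C_N, summarizing a contiguous span of frames.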

B. CASCADED MOMENT PROPOSAL GENERATION
Cascaded Moment Proposal Generation (Cascaded MPG) is introduced to perform moment prediction with multi-level (anchor-free, anchor-based) multi-modal interaction: it takes these two semantics as inputs and produces a two-dimensional map containing query-moment similarity scores, as in Figure 3(a), where one dimension represents the start time of a moment and the other represents the end time. Cascaded MPG assumes a score map for frame-wise moment retrieval, as in previous works [13], [14], but also performs contextual reasoning associated with the anchor-based semantics. To this end, Cascaded MPG includes two main processes: (1) Conditional Moment Score Map generation and (2) Sparsity Pooling, which together contribute to the multi-modal feature interaction between anchor-based and anchor-free semantics.

1) CONDITIONAL MOMENT SCORE MAP
The Conditional Moment Score Map (CMSM) is a two-dimensional moment score map containing query-moment similarity scores, as in Figure 3(a). To build the CMSM, we define a conditional moment score map generator f_cond by multiplying the start-time probability of the moment P(t_st | v, q) ∈ R^{L×1} and the conditional end-time probability of the moment P(t_ed | I_st; v, q) ∈ R^{L×L}. Given a d-dimensional video feature v = [v_1, . . . , v_L] ∈ R^{L×d} with L frames and a sentence feature q ∈ R^d, the start-time probability P(t_st | v, q) is computed along the frame axis from video-query similarities, where Conv1D_st and Conv1D_ed below denote 1D convolution layers that embed into start and end probabilities, and t_st, t_ed are frame-level start-end times. For the conditional end-time probability, we first define the conditional probability P(t_ed | i_st; v, q) for each start index i_st, where W_v, W_q ∈ R^{(d+1)×d} are weight matrices and [·||·]_{axis=n} is concatenation along axis n. We stack all P(t_ed | i_st; v, q) along the column axis as in Equation 13 to build the conditional end-time probability P(t_ed | I_st; v, q) ∈ R^{L×L}. Finally, f_cond(v, q) ∈ R^{L×L} builds the CMSM by multiplying the start-time and conditional end-time probabilities, where ⊗ is column-wise and · is element-wise multiplication:

P(t_ed | I_st; v, q) = {P(t_ed | 0; v, q); P(t_ed | 1; v, q); · · · ; P(t_ed | (L − 1); v, q)},   (13)
f_cond(v, q) = U_m · (P(t_st | v, q) ⊗ P(t_ed | I_st; v, q)) ∈ R^{L×L},   (14)

where U_m ∈ R^{L×L} is an upper triangular mask composed of ones, which removes the scores of moments whose end time comes before their start time. The anchor-free score map f_cond(V, E_q) and the anchor-based score map f_cond(C_N, E_q) in Figure 2 are then defined by taking the video feature v to be the anchor-free features V or the anchor-based features C_N, respectively.
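Equations 13-14 can be sketched directly: given a start distribution and a per-start end distribution, the map entry (i, j) scores the moment starting at frame i and ending at frame j. The toy probabilities below are random placeholders, not learned outputs.

```python
import numpy as np

def cmsm(p_start, p_end_cond):
    """Conditional Moment Score Map (Equations 13-14).

    p_start    : (L,)   start-time probability per frame
    p_end_cond : (L, L) row i = end-time distribution conditioned on start i
    Returns an (L, L) map where entry (i, j) scores the moment [i, j].
    """
    L = p_start.shape[0]
    score = p_start[:, None] * p_end_cond     # column-wise joint probability
    U = np.triu(np.ones((L, L)))              # upper triangular mask U_m
    return U * score                          # zero out end-before-start moments

L = 4
rng = np.random.default_rng(2)
p_st = rng.dirichlet(np.ones(L))              # sums to 1 over start frames
p_ed = rng.dirichlet(np.ones(L), size=L)      # each row sums to 1
M = cmsm(p_st, p_ed)
print(M.shape)  # (4, 4)
```

Everything strictly below the diagonal is zero, so only valid (start ≤ end) moments compete for the maximum.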

2) SPARSITY POOLING
Sparsity pooling is introduced to mitigate the redundantly overlapping moments of the frame-level moment score map. Figure 3(b) shows the moment score map in a 3-dimensional view; highly overlapping candidate moments in the map keep similar scores within a local region of the video, which forfeits the chance of retrieval in various areas and degrades retrieval performance. To resolve this, our proposed sparsity pooling h(x) makes the distribution of the score map sporadic, which allows the retrieval system to explore diverse moments of various positions and lengths. In detail, sparsity pooling h(x) takes a moment score map x ∈ R^{L×L} as input and outputs the same score map with a sparse distribution. To this end, we first build a sparsity mask a ∈ R^{L×L} as in Figure 4, through the following processes: (1) computing the 2D max-pooling output x_N from the original score map x with kernel size N × N and stride N; (2) generating the 2D upsampled map x̂ by nearest-neighbor upsampling back to the original size of x; and (3) finally, preparing the sparsity mask by element-wise dividing x by x̂. These processes can be described as:

x_N = MaxPool2D_{N×N}(x),   (15)
x̂ = Upsample_{nearest}(x_N),  a = x ./ x̂,  h(x) = a · x,   (16)

where ./ and · are element-wise division and multiplication. In Equations 15 and 16, x̂ contains the local maximum score of the original score map x. Through this process, sparsity pooling h(x) maintains the maximum score within each N × N window and builds a sparse distribution inside the windows.
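The three steps (max-pool, nearest-neighbor upsample, element-wise divide) can be sketched in a few lines of NumPy; the positive random map below is a toy stand-in for a real score map, and L is assumed divisible by N.

```python
import numpy as np

def sparsity_pooling(x, N=2):
    """Sparsify an (L, L) moment score map while keeping each local maximum.

    (1) N x N max-pool with stride N, (2) nearest-neighbour upsample back to
    L x L, (3) mask a = x ./ x_hat, output h(x) = a * x.
    """
    L = x.shape[0]
    # (1) 2D max pooling: reshape into N x N blocks and take the block max
    xN = x.reshape(L // N, N, L // N, N).max(axis=(1, 3))
    # (2) nearest-neighbour upsampling to the original size
    x_hat = np.repeat(np.repeat(xN, N, axis=0), N, axis=1)
    # (3) element-wise division gives the sparsity mask (1 at local maxima)
    a = x / x_hat
    return a * x

x = np.random.default_rng(3).uniform(0.1, 1.0, size=(4, 4))
y = sparsity_pooling(x, N=2)
```

Inside each window, the mask equals 1 exactly at the local maximum and is below 1 elsewhere, so the peak score survives unchanged while its overlapping neighbors are suppressed.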

3) CASCADED MOMENT PROPOSAL GENERATION
This section introduces the cascaded moment proposal generation (Cascaded MPG) algorithm in detail. Based on the anchor-free semantics V ∈ R^{N_v × d} and the anchor-based semantics C_N ∈ R^{N × d}, Cascaded MPG produces the 2D moment score map for moment prediction, where the algorithm applies the conditional moment score map generator f_cond and sparsity pooling h(x) in a recursive manner. Figure 5 summarizes the pipeline of Cascaded MPG. In the first stage, f_cond(V, E_q) builds the anchor-free moment score map M. Sparsity pooling h(x) bridges to the next stage by performing sparsity masking on the score map. In the last stage, f_cond(C_N, E_q) builds an anchor-based moment score map N_N; the map is then up-sampled to the size of the original anchor-free score map and added to the output of the sparsity pooling. The whole pipeline of Cascaded MPG is described in Algorithm 1.
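The two-stage fusion can be sketched as below. The score maps are random placeholders standing in for f_cond(V, E_q) and f_cond(C_N, E_q), and the nearest-neighbor up-sampling of the coarse anchor-based map is an assumption about the up-sampling operator.

```python
import numpy as np

def sparsify(x, N=2):
    # sparsity pooling h(x): keep local maxima (see the Sparsity Pooling section)
    L = x.shape[0]
    xN = x.reshape(L // N, N, L // N, N).max(axis=(1, 3))
    x_hat = np.repeat(np.repeat(xN, N, axis=0), N, axis=1)
    return (x / x_hat) * x

def cascaded_mpg(free_map, anchor_map):
    """Fuse an (L, L) anchor-free map with a coarse (N, N) anchor-based map."""
    L, N = free_map.shape[0], anchor_map.shape[0]
    masked = sparsify(free_map)                              # stage 1: masking
    up = np.repeat(np.repeat(anchor_map, L // N, axis=0),    # stage 2: up-sample
                   L // N, axis=1)
    return masked + up                                       # and add

rng = np.random.default_rng(4)
out = cascaded_mpg(rng.uniform(0.1, 1, (8, 8)), rng.uniform(0.1, 1, (4, 4)))
print(out.shape)  # (8, 8)
```

The final moment prediction then corresponds to the argmax over the fused map, restricted to the valid upper-triangular region.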

4) TRAINING
The anchor-free semantics V ∈ R^{N_v × d} and anchor-based semantics C_N ∈ R^{N × d} are trained under two types of loss: (1) a video-level loss and (2) a moment-level loss. For the video-level loss, we use a hinge loss on the cosine similarity with the query feature E_q ∈ R^d:

L_v = max(0, c + s(p(V^−), E_q) − s(p(V^+), E_q)),

where + denotes the positive from the video-query pair and − denotes a negative from other videos in the batch, p(·) is 1D max-pooling, and s(·, ·) is cosine similarity. c = 0.1 is the margin, and we select the surrogate cosine similarity by max-pooling over the anchor-based and anchor-free semantics. For the moment-level loss, we use the cross-entropy loss CE on the ground-truth start-end time (g_st, g_ed) and the predicted start-end time probabilities:

L_m = CE(P(t_st | v, q), g_st) + CE(P(t_ed | I_st; v, q), g_ed).

The total loss L is defined from L_v and L_m with the hyperparameters α and β: L = α L_v + β L_m.
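The two losses can be sketched on toy values. The similarities, probabilities, and the loss weights `alpha`, `beta` here are illustrative assumptions (the paper leaves α and β as hyperparameters).

```python
import numpy as np

def video_hinge_loss(sim_pos, sim_neg, c=0.1):
    """Video-level hinge loss on cosine similarities (margin c = 0.1)."""
    return max(0.0, c + sim_neg - sim_pos)

def moment_ce_loss(p_start, p_end, g_st, g_ed, eps=1e-12):
    """Cross-entropy on predicted start/end distributions vs. ground truth."""
    return -np.log(p_start[g_st] + eps) - np.log(p_end[g_ed] + eps)

p_st = np.array([0.1, 0.7, 0.2])                 # toy start distribution
p_ed = np.array([0.05, 0.15, 0.8])               # toy end distribution
L_v = video_hinge_loss(sim_pos=0.8, sim_neg=0.3)  # margin satisfied -> 0.0
L_m = moment_ce_loss(p_st, p_ed, g_st=1, g_ed=2)
alpha, beta = 1.0, 1.0                            # hypothetical loss weights
L = alpha * L_v + beta * L_m
```

The hinge term is zero once the positive video outscores the negative by more than the margin, so late in training the moment-level term dominates.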

A. DATASETS
We validate our proposed Cascaded MPN on three recent benchmarks (TVR, DiDeMo, ActivityNet Captions). (1) The TV show Retrieval (TVR) dataset [13] is constructed from 6 TV shows across 3 genres: medical dramas, sitcoms, and crime dramas. TVR contains 109K queries on 21.8K videos with subtitles; each video depicts multi-character interactions and is 60-90 seconds in length. For fair comparison [13], [14], [37], we also split TVR into 80% train, 10% val, 5% test-private, and 5% test-public, where the test-public split is prepared for the official leaderboard.

For the evaluation of VCMR, a prediction is correct if: (1) the predicted video matches the ground-truth video; and (2) the predicted moment has high overlap with the ground-truth moment. Average recall at K (R@K) over all queries is used as the evaluation metric, where the temporal Intersection over Union (tIoU) measures the overlap between the predicted moment and the ground truth. We first predict the top-100 videos from the video corpus by measuring p(s(V, E_q)) in Equation 19, and Cascaded MPG localizes the best-matched moment among those videos.
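The tIoU criterion used above is the standard interval overlap measure; a small self-contained sketch with made-up timestamps:

```python
def t_iou(pred, gt):
    """Temporal IoU between two (start, end) moments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# a predicted moment counts as correct at R@K when its video matches
# and this value passes the chosen threshold (e.g., tIoU = 0.7)
print(t_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5: a 4 s overlap over an 8 s union
```

Raising the threshold from 0.5 to 0.7 therefore demands substantially tighter boundaries from the same ranked list.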

2) TRAINING DETAILS
We use the same video features as [14]: SlowFast [7] pre-trained on Kinetics [10] and ResNet-101 [9] pre-trained on ImageNet [4]. The text features are contextualized token features from pre-trained RoBERTa [17]. We first report results trained from scratch on TVR and ActivityNet, where Cascaded MPN consistently outperforms the runner-up models without pre-training. As subtitles are unavailable in ActivityNet and DiDeMo, we use video features from the video encoder in place of the anchor-free semantic features. As a further experiment on DiDeMo, we complement the subtitles with auxiliary Automatic Speech Recognition (ASR) features from [14], which makes the anchor-free semantics available and gives a performance gain of up to 3.31 at tIoU=0.7, R@1 on DiDeMo. As reported in [14], previous results on DiDeMo and TVR were also obtained with pre-training on the large-scale dataset HowTo100M [20]; for fair comparison, we also present results with HowTo100M pre-training on DiDeMo and TVR in Tables 1 and 2. In addition, for the two sub-tasks of VCMR, SVMR and VR, Table 2 presents results on TVR, where Cascaded MPN also validates its effectiveness.

D. ABLATION STUDY
We perform ablation studies with several variants of Cascaded MPN. Table 3 summarizes ablative results for sparsity pooling (SP), anchor-free semantic reasoning (AFr), anchor-based semantic reasoning (ABr), and the conditional moment score map (CMSM). For the ablation of CMSM, we utilize P(t_ed | v, q) instead of P(t_ed | I_st; v, q) and define the score map generator as f_cond(v, q) = U_m · (P(t_st | v, q) ⊗ P(t_ed | v, q)). CMSM is also valuable in that it saves about half of the training time through early saturation. Table 4 presents experimental results for variants of the cascade length. A cascade of n = 3 layers shows the highest performance with kernel sizes N = 2, 4, 8 in sparsity pooling, while longer cascades slightly deteriorate performance. This is because removing many redundant candidates is effective in the early stages of cascaded proposal generation, but beyond 5 layers few redundant candidates remain, so further masking may damage the proposal scores and drop performance.
E. QUALITATIVE RESULTS
Figure 6 presents moment predictions with the conditional start-end probabilities, and Figure 7 presents the moment score map after sparsity pooling. In Figure 6, the red curve is the start-probability distribution and the blue curve is the end-probability distribution; the final moment prediction is made from these two distributions. Since the end probabilities take the start probabilities as a conditional prior, they have high values right after the start time. In Figure 7, we can see that the intensive score distribution around specific moments is alleviated into a sporadic distribution through sparsity masking, which gives retrieval a chance in various areas and boosts recall. Together, these show that the conditional moment score map generation and sparsity pooling have a positive effect on retrieval.

F. LIMITATIONS
Cascaded MPN uses many attention weights to represent the two different types of video representations, (1) anchor-free semantics and (2) anchor-based semantics, which takes considerable time to fully learn each feature. In this regard, further study could generate these two hierarchical representations more efficiently (e.g., weight sharing, model pruning) or explore other types of representation. We believe much valuable research can be built on the motivation of overcoming these limitations.

V. CONCLUSION
We propose Cascaded MPN for video corpus moment retrieval to overcome two main challenges: (1) anchor-based methods are vulnerable to the heuristic rules used to generate video proposals, which restrict the lengths of predicted moments; and (2) anchor-free methods systemically suffer from insufficient understanding of long and sequential video semantics. Our proposed Cascaded MPN therefore incorporates the following two properties: (1) Hierarchical Semantic Reasoning, which provides video understanding from the anchor-free level to the anchor-based level by building a hierarchical video graph, and (2) Cascaded Moment Proposal Generation, which precisely performs moment retrieval by devising cascaded multi-modal interaction between anchor-free and anchor-based level video semantics. Experimental results on three benchmarks show the effectiveness of Cascaded MPN.