Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations at frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.


Introduction
Text-Video Retrieval (TVR) [25,6,13,18,34,3,37] has made significant progress in recent years. Given a sentence, TVR aims to retrieve a semantically relevant video from a video database, and vice versa. Compared to other cross-modal video tasks, it is much easier for TVR to obtain paired text-video data from the Internet, such as movies with corresponding captions or YouTube videos with titles. Thus, based on massive text-video pair data, TVR becomes an important proxy task to understand video contents. However, a natural semantic gap between the two modalities, i.e., video and text, raises a great challenge, which hinders industrial-level applications of TVR. To this end, recent methods aim to distill cross-modal knowledge from large-scale pre-training experts [21,14,19,32] and leverage cross-modal contrastive learning to explore both intra-modal representation and cross-modal interaction.
A pioneering vision-language pre-training work is Contrastive Language-Image Pretraining (CLIP) [30], which collects 400 million image-text pairs to learn general multi-modal knowledge. Based on CLIP, recent methods [25,13,34,6] aim to transfer the well-pretrained knowledge from the image-text domain to the text-video domain. For example, CLIP4Clip [25] develops temporal ensemble modules to aggregate sequential CLIP frame features into a global one. Besides, massive matched and unmatched text-video pairs are constructed to learn video representations and text embeddings via cross-modal contrastive learning. By fine-tuning on text-video data, CLIP4Clip achieves great success in TVR. Different from learning global video representations and sentence embeddings, DRL [34] shows that the token-wise interaction between frames and words reveals more fine-grained cross-modal knowledge. Specifically, DRL constructs a similarity matrix between different frame representations and word embeddings, and then infers a comprehensive text-video matching score by considering dense frame-word correlations. However, these methods only consider a single cross-modal interaction, at either the video-sentence level or the frame-word level, which results in biased retrieval. In the human sense, we recognize a video-text pair by simultaneously analyzing video-sentence, clip-phrase, and frame-word interactions, due to the intrinsic hierarchical semantic structure in video and text data, as shown in Figure 1.
Figure 1: HCMI measures the semantic similarity between a video and a sentence by considering multi-level interactions at frame-word, clip-phrase, and video-sentence granularities, simultaneously. Both visual and text aggregation modules adopt the same architecture, which is given in the right column. FC indicates the fully connected layer. The visual and text encoders adopt standard CLIP architectures in [30], e.g., $L_v = L_t = 12$ for CLIP with ViT-B/16.
In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), which hierarchically explores video-sentence, clip-phrase, and frame-word interactions to understand text-video contents comprehensively. To utilize large-scale pre-training knowledge, HCMI leverages CLIP as initial visual and text encoders, similar to [25,34]. To explore fine-grained cross-modal interactions, HCMI first constructs hierarchical visual representations and text embeddings at respective frame-clip-video and word-phrase-sentence granularities, as shown in Figure 1. Taking the visual modality as an example, HCMI performs self-attention to gather semantic-correlated frames into several clip representations, which are further fused into a global video representation. Similar to the video modality, a sentence also has multi-level representations, e.g., consisting of words and phrases, and can be represented in a word-phrase-sentence manner. Thus, based on hierarchical video and text representations, HCMI leverages cross-modal contrastive learning to learn inter-modal relationships at frame-word, clip-phrase, and video-sentence granularities, respectively, which achieves a more comprehensive cross-modal comparison than the previous methods.
Besides, we argue that it is unreasonable to regard all videos as individual categories and repel text embeddings from all other video representations with different sample IDs, although this practice is widely used in cross-modal contrastive learning. For example, the text 'a man is running' should not form a negative pair with other videos that contain 'man running' content. Thus, HCMI adopts an adaptive label denoising strategy to discover potential positive text-video pairs even with different sample IDs, which avoids confusing gradient updates, and proposes a marginal sample enhancement strategy to improve feature discrimination.
Consequently, HCMI obtains new state-of-the-art performance on various benchmarks. Our contributions are three-fold: 1. we propose a novel method, dubbed Hierarchical Cross-Modal Interaction (HCMI), which explores multi-level cross-modal interactions at video-sentence, clip-phrase, and frame-word granularities to understand text-video contents comprehensively; 2. we design adaptive label denoising and marginal sample enhancement strategies to discover potential positive pairs and enlarge hard sample margins, which can avoid noisy gradients and improve feature discrimination; 3. HCMI obtains new state-of-the-art Rank@1 retrieval results of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
This paper is organized as follows. Section II reviews recent related works on text-video retrieval tasks. Section III introduces our proposed Hierarchical Cross-Modal Interaction method, as well as adaptive label denoising and marginal sample enhancement. Section IV gives experimental results and a detailed analysis, followed by the conclusion in Section V.

Related Work
Previous methods for video understanding focus on designing 3D convolution kernels to capture spatio-temporal information [33,36,11]. Recently, Vision Transformer (ViT) [8] has shown great potential in many vision tasks, especially when massive supervision is available. Thus, many works investigate transformer-based video encoders for video content understanding.
A recently emerging topic is how to explore semantic supervision from large-scale unlabeled data [30,26]. In the image-text domain, knowledge transfer and multi-level alignment are two common ways for supervision. For example, CMR [40] transfers valuable knowledge from existing annotated data to new data via a joint learning paradigm. DSCMR [41] minimizes the discrimination loss in both the label space and the common representation space to supervise the model, which obtains promising results. Compared with manually labeling videos, it is much more convenient to collect large-scale video and text pairs from the Internet. For example, HowTo100M [27] contains millions of instructional videos, and WebVid-2.5M [3] collects 2.5M video-text pairs from the web. Based on massive video-text pairs, plenty of pre-trained model-based methods [24,12,28,42,1,20,26,9] have dominated the video-text retrieval leaderboard. In general, these methods can be roughly divided into two categories: video-sentence-interaction-based and frame-word-interaction-based methods.
The video-sentence-interaction-based methods [25,23,29,3,7,12,22,15,16], whose retrieval process is extremely concise and efficient, employ separate text and video embedding extractors to map the texts and videos into a common feature space, and then directly conduct the retrieval task based on cosine similarities between their feature representations. For example, ClipBERT [18] proposes an end-to-end approach with visual and text encoders pre-trained on the image-language dataset and leverages sparse sampling to alleviate the training burden. FROZEN [3] treats an image as a single-frame video and designs a curriculum learning schedule to train the model on both image and video datasets. Different from FROZEN, which pre-trains a new model on video-text retrieval, the previous SOTA method CLIP4Clip [25] transfers the knowledge from the image-text pre-trained model CLIP [30] to solve the video-text retrieval task. Also based on the pre-trained CLIP model, CAMoE [6] proposes a multi-stream Corpus Alignment network with single gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to further improve the retrieval performance. The video-sentence-interaction-based methods consider the alignment between the global video embedding and the sentence embedding. However, in the human sense, we compare a video and a sentence in multi-level aspects, such as different frames and words, which is ignored by previous video-sentence-interaction-based methods.
The frame-word-interaction-based methods first utilize embedding extractors to transform each text (video) into a sequence of token (frame) embeddings, and then use an interaction module to capture fine-grained clues between the token and frame embeddings. Inspired by [9], the pioneering work FILIP [39] proposes a fine-grained interactive language-image method that leverages token-wise maximum similarity between visual and textual tokens, e.g., patches and words, to guide the image-text contrastive learning. Similarly, DRL [34] proposes a Weighted Token-wise Interaction to explore the fine-grained clues between sentence tokens and video frame embeddings and a Channel DeCorrelation Regularization to reduce feature redundancy from a micro-view. Since these methods explore fine-grained similarity measurements between the visual and text modalities, they achieve better text-video retrieval performance than video-sentence-based methods.
However, almost all of the existing methods only consider single cross-modal interaction from either the video-sentence level or the frame-word level. Few of them explore the intrinsic hierarchical semantic structure in the video and text data. Hence, in this paper, we propose a novel method, dubbed Hierarchical Cross-Modal Interaction, which hierarchically explores video-sentence, clip-phrase, and frame-word interactions to understand text-video contents comprehensively.

Hierarchical Cross-Modal Interaction
Given a set of videos $v = \{v_i\}_{i=0}^{N}$ and corresponding captions $t = \{t_i\}_{i=0}^{N}$, our core motivation is to learn a visual encoder $f_v(\cdot)$ and a text encoder $f_t(\cdot)$ that capture visual and text semantics well. The overall pipeline of Hierarchical Cross-Modal Interaction (HCMI) is given in Figure 2.

Figure 2: The pipeline of the proposed method. Given hierarchical visual and text features, HCMI first produces text-video pair labels via the adaptive label denoising strategy. Besides, the hard negative samples will be selected according to text-video matching scores. Then, the produced text-video labels and hard negative samples are sent to the hierarchical cross-modal contrastive loss and triplet loss, respectively.

Multi-Level Representations
Unlike an image, a video has a hierarchical frame-clip-video structure, which describes the video content at different granularities. Similarly, a text consists of words, phrases, and a global sentence. In a human sense, we measure the similarity between a video and a text in terms of multiple aspects, e.g., a frame matches a certain word or a clip matches a certain phrase. Following this motivation, HCMI designs and learns multi-level video and text representations to compute cross-modal similarity from multiple aspects.
It is noted that the encoder outputs $V^f_i$ and $T^w_i$ contain frame-level and word-level information, which depict fine-grained contents of video and text, respectively. To further extract features that capture temporal visual information and long-term word dependence, HCMI leverages self-attention to automatically aggregate semantically related frames into clip representations and related words into phrase embeddings. Taking the visual modality as an example, the aggregation function $g_v(\cdot)$ maps the frame-level features $V^f_i$ to clip-level features $V^c_i$. Notably, $N_f$ and $N_c$ are the frame number and clip number, respectively, and $h(\cdot)$ is a two-layer FC-ReLU module that changes the channel dimension. In effect, $g_v(\cdot)$ aggregates several semantically related frame representations into a single one, which therefore contains clip information. The text aggregation function $g_t(\cdot)$ is defined analogously and produces $T^p_i \in \mathbb{R}^{N_p \times D}$, which contains phrase information. Instead of manually designing clip and phrase information, $V^c_i$ and $T^p_i$ are produced automatically by aggregating semantically related frames and words.
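The aggregation step can be illustrated with a minimal sketch. It assumes that $h(\cdot)$ maps every frame feature to $N_c$ scores and that a softmax over the frames turns each score column into aggregation weights; this exact form, the helper names `matmul`, `softmax`, and `aggregate`, and the random weights are illustrative assumptions rather than the definition used by HCMI.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(tokens, w1, w2):
    """Aggregate N_in token features (N_in x D) into N_out group features (N_out x D).

    Assumption: h(.) = FC -> ReLU -> FC maps each token to N_out scores, and a
    softmax over the tokens turns each score column into aggregation weights.
    """
    hidden = [[max(0.0, v) for v in row] for row in matmul(tokens, w1)]  # N_in x D
    scores = matmul(hidden, w2)                                          # N_in x N_out
    weights = [softmax(list(col)) for col in zip(*scores)]               # N_out x N_in
    return matmul(weights, tokens)                                       # N_out x D

random.seed(0)
N_f, N_c, D = 12, 6, 8  # 12 frames aggregated into 6 clips
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N_f)]
w1 = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(D)]
w2 = [[random.gauss(0, 0.1) for _ in range(N_c)] for _ in range(D)]
clips = aggregate(frames, w1, w2)
print(len(clips), len(clips[0]))  # 6 8
```

Under the same assumption, calling `aggregate` on word features would play the role of $g_t(\cdot)$, and applying it again with a single output group would give a video-level or sentence-level feature.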
Similar to $g_v(\cdot)$ and $g_t(\cdot)$, $V^c_i$ and $T^p_i$ can be further aggregated into the video-level representation $V^v_i$ and the sentence-level embedding $T^s_i$. In this way, HCMI describes a video and a sentence at frame-clip-video and word-phrase-sentence granularities, respectively. Next, we propose hierarchical contrastive learning for different granularities.
For $\{V^f_i, T^w_i\}$, a token-wise interaction function $TI(\cdot, \cdot)$ is used, in which $\langle \cdot, \cdot \rangle$ denotes the dot product. $TI(V^f_i, T^w_i)$ first calculates a pair-wise similarity matrix between frames and words and then aggregates all token-wise similarities into an overall score. Based on this score, $L_{f-w}$ is a symmetric cross-modal contrastive loss that measures the cross-modal similarity between a set of frames and words.
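As an illustration of the two steps of token-wise interaction (pair-wise similarity matrix, then aggregation into one score), the sketch below assumes a max-then-mean aggregation in both directions, in the spirit of FILIP [39]. The aggregation rule and the function name `token_wise_interaction` are assumptions made for illustration; the weighting actually used by HCMI may differ.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def token_wise_interaction(video_tokens, text_tokens):
    """Score a video/text pair from its token-level features.

    Assumption: each frame keeps its best-matching word, each word keeps its
    best-matching frame, and the two mean scores are averaged.
    """
    sim = [[dot(f, w) for w in text_tokens] for f in video_tokens]  # N_f x N_w
    frame_to_word = sum(max(row) for row in sim) / len(sim)
    word_to_frame = sum(max(col) for col in zip(*sim)) / len(sim[0])
    return 0.5 * (frame_to_word + word_to_frame)

frames = [[1.0, 0.0], [0.0, 1.0]]
matching_words = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
unrelated_words = [[-1.0, 0.0], [0.0, -1.0]]
score_match = token_wise_interaction(frames, matching_words)
score_unrelated = token_wise_interaction(frames, unrelated_words)
print(score_match > score_unrelated)  # True
```

The same function applies unchanged to clip and phrase features, since it only needs two sets of vectors of equal dimension.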
Similar to $L_{f-w}$, $L_{c-p}$ uses $TI(\cdot, \cdot)$ to measure the cross-modal similarity between clips and phrases.
For $\{V^v_i, T^s_i\}$, $L_{v-s}$ uses cosine similarity to measure the cross-modal similarity between video and sentence representations.
Finally, the loss function for hierarchical cross-modal interaction combines $L_{f-w}$, $L_{c-p}$, and $L_{v-s}$, where $\alpha$ and $\beta$ balance the different terms.
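A minimal sketch of how the three terms could be combined is given below. It makes two assumptions that are not confirmed by the text: each term is an InfoNCE-style symmetric loss with a temperature `tau` (a hypothetical parameter), and $\alpha$ and $\beta$ weight the clip-phrase and video-sentence terms, respectively, which is suggested by the ablation discussion but not stated explicitly. The function names are illustrative.

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]

def symmetric_contrastive_loss(sim, tau=0.05):
    """InfoNCE over a B x B score matrix whose diagonal holds the matched pairs."""
    b = len(sim)
    scaled = [[s / tau for s in row] for row in sim]
    v2t = -sum(log_softmax(scaled[i])[i] for i in range(b)) / b  # rows: video -> text
    cols = [list(c) for c in zip(*scaled)]
    t2v = -sum(log_softmax(cols[i])[i] for i in range(b)) / b    # columns: text -> video
    return 0.5 * (v2t + t2v)

def hierarchical_loss(sim_fw, sim_cp, sim_vs, alpha=0.5, beta=0.1):
    """Assumed combination: frame-word term plus weighted clip-phrase and video-sentence terms."""
    return (symmetric_contrastive_loss(sim_fw)
            + alpha * symmetric_contrastive_loss(sim_cp)
            + beta * symmetric_contrastive_loss(sim_vs))

good = [[0.9, 0.1], [0.2, 0.8]]  # matched pairs score highest
bad = [[0.1, 0.9], [0.8, 0.2]]   # mismatched pairs score highest
print(hierarchical_loss(good, good, good) < hierarchical_loss(bad, bad, bad))  # True
```

In practice the three score matrices would come from frame-word, clip-phrase, and video-sentence similarities computed over the same batch.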

Adaptive Label Denoising
Furthermore, most existing methods simply treat all video-text pairs with different sample IDs as negative pairs. For example, a text has only one positive video, which comes from the same data pair, and all the videos from other data pairs are treated as negative datapoints. However, a text often describes other, similar videos in the dataset as well, so treating such datapoints as negatives harms the retrieval performance of the model.
To tackle the above issue, we design a novel adaptive label denoising scheme to discover such potentially similar datapoints. Specifically, for each video in a data batch, we aim to discover other videos with similar content, which should not be treated as negative pairs. For a video $v_i$, we obtain two views $v^1_i$ and $v^2_i$ by randomly sampling its frames twice. The feature embeddings of these two views, i.e., $V^1_i$ and $V^2_i$, should be similar, as they describe the same video content. We therefore define two videos to be similar if $v^1_i$ and $v^1_j$ are more similar than $v^1_i$ and $v^2_i$. Similarly, supposing there are two views $V^1_j$ and $V^2_j$ for the video $v_j$, the video $v_j$ can be treated as similar to $v_i$ when the inequality in Eq. (8) holds, where $\cos(\cdot, \cdot)$ denotes the cosine similarity between the inputs and $V^k_i$ represents the video-level representation of the $k$-th view of video $v_i$. Based on Eq. (8), we can obtain pair-wise similarities in a data batch. For example, for the $i$-th sample, we collect all the samples that meet the condition of Eq. (8), and these samples make up a subset $N^+_i$. Notably, the samples in $N^+_i$ are regarded as videos similar to the $i$-th sample, which cannot be treated as negative samples.
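The construction of the similar sets can be sketched as follows. The criterion used here, that view 1 of video $j$ is closer to view 1 of video $i$ than the second view of $i$ itself, follows the prose description only; the precise inequality of Eq. (8) may involve further terms, so the function `similar_sets` and the toy features are assumptions for illustration.

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similar_sets(view1, view2):
    """Return, for every video i, the indexes j != i assumed to share its content.

    view1[i] and view2[i] are video-level features of two frame samplings of video i.
    Assumed criterion: video j is similar to video i when view 1 of j is closer to
    view 1 of i than the second view of i itself is.
    """
    n = len(view1)
    result = []
    for i in range(n):
        self_sim = cos(view1[i], view2[i])
        result.append({j for j in range(n)
                       if j != i and cos(view1[i], view1[j]) > self_sim})
    return result

# Videos 0 and 1 show nearly the same content; video 2 is unrelated.
view1 = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
view2 = [[0.9, 0.3], [0.9, 0.3], [0.1, 0.95]]
positives = similar_sets(view1, view2)
print(positives)  # [{1}, {0}, set()]
```

The threshold is adaptive in the sense that it is set per video by the agreement of its own two views, so no global similarity cutoff has to be tuned.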
In a data batch, when videos $v_i$ and $v_j$ are defined as similar through Eq. (8), the video-text pairs $(v_i, t_i)$ and $(v_j, t_j)$ are considered as positive. Thus, Eqs. (3), (5) and (6) can be reformulated in terms of two index sets: $N^+_i$ denotes the indexes of data pairs similar to the data pair $(v_i, t_i)$, and $N^-_i$ denotes the indexes of data pairs dissimilar to $(v_i, t_i)$.
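One plausible way to use the sets $N^+_i$ inside a contrastive loss is sketched below, for the video-to-text direction only. It assumes that the softmax probability mass of sample $i$ and of every index in $N^+_i$ is counted as positive; other reformulations (for example, simply dropping $N^+_i$ from the denominator) are equally compatible with the description, so this is an assumption, as are the temperature `tau` and the function name.

```python
import math

def denoised_contrastive_loss(sim, positives, tau=0.05):
    """Video-to-text InfoNCE in which row i treats {i} | positives[i] as positives."""
    total = 0.0
    for i, row in enumerate(sim):
        scaled = [s / tau for s in row]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        pos = sum(exps[j] for j in {i} | positives[i])
        total += -math.log(pos / sum(exps))
    return total / len(sim)

# Videos 0 and 1 are near-duplicates, so their cross scores are high.
sim = [[0.9, 0.8, 0.1],
       [0.8, 0.9, 0.2],
       [0.1, 0.2, 0.9]]
plain = denoised_contrastive_loss(sim, [set(), set(), set()])
denoised = denoised_contrastive_loss(sim, [{1}, {0}, set()])
print(denoised < plain)  # True
```

With empty sets the function reduces to the usual one-positive-per-row loss, which is the behaviour the denoising strategy is meant to relax.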

Marginal Sample Enhancement
When optimizing the contrastive loss, it is easy for the model to distinguish negative sample pairs with obviously different contents, which carry limited information. Thus, the model should focus more on hard samples with subtle content differences, and we design a marginal sample enhancement loss with a margin coefficient $\theta$ to sharpen the model's ability to distinguish hard text-video pairs.
Here, we only use $\{V^v_i, T^s_i\}$ at the video-sentence granularity, as the gradients will be back-propagated to $\{V^f_i, T^w_i\}$ and $\{V^c_i, T^p_i\}$.
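Figure 2 describes this term as a triplet loss over hard negative samples. A minimal sketch under that reading is given below; selecting the single hardest in-batch negative per video and per text, and averaging the two directions, are assumptions, and `marginal_sample_loss` is an illustrative name.

```python
def marginal_sample_loss(sim, theta=0.1):
    """Hinge loss on the hardest in-batch negative for each video and each text.

    sim[i][j] is the video-sentence similarity of video i and text j; the
    diagonal holds the matched pairs.
    """
    b = len(sim)
    total = 0.0
    for i in range(b):
        pos = sim[i][i]
        hardest_text = max(sim[i][j] for j in range(b) if j != i)   # best wrong text for video i
        hardest_video = max(sim[j][i] for j in range(b) if j != i)  # best wrong video for text i
        total += max(0.0, theta + hardest_text - pos)
        total += max(0.0, theta + hardest_video - pos)
    return total / (2 * b)

easy = [[0.9, 0.1], [0.2, 0.8]]   # negatives are far below the positives
hard = [[0.9, 0.85], [0.2, 0.8]]  # text 1 is almost as close to video 0 as text 0
print(marginal_sample_loss(easy))  # 0.0
print(marginal_sample_loss(hard) > 0.0)  # True
```

Pairs whose negatives already sit more than $\theta$ below the positive contribute nothing, so the gradient concentrates on the marginal samples.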

Overall Objective
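The overall objective function of HCMI combines the hierarchical cross-modal interaction loss with the marginal sample enhancement loss, where $\lambda$ controls the balance between the two terms.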

Experiments
Experiments are conducted on five Text-Video Retrieval (TVR) benchmarks, and several ablation studies are given to demonstrate the effectiveness of each component in the proposed Hierarchical Cross-Modal Interaction (HCMI).

Datasets and Evaluation Metrics
HCMI is evaluated on five public benchmarks:
MSR-VTT [38] is the most popular TVR benchmark, which contains 10,000 videos with 20 captions each. We report the results on the standard full split.
DiDeMo [31] contains 10,000 videos with 40,000 sentences. Following [25], all captions are concatenated into a single query for text-video retrieval.
LSMDC [2] contains 118,081 videos with an equal number of caption sentences from 202 movies. We adopt the standard split for training and testing.
ActivityNet [5] contains 20,000 YouTube videos. Following [25], we concatenate all captions of a video as a single query.
Following [25], we report experiments under the standard TVR metrics at rank-K, which evaluate the percentage of query samples for which the right answer is found in the top-K retrieved results. We report the rank-1, rank-5, rank-10, median rank, and mean rank metrics. The median rank and mean rank are the median and mean of the ranks of all correct results. Notably, higher rank-1, rank-5, and rank-10 are better, and lower median rank and mean rank are better.
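These metrics can be computed directly from a query-by-candidate score matrix. The sketch below assumes that the correct video for query $i$ sits at index $i$ and that ties are resolved in favour of the correct item; the dictionary keys `R@K`, `MdR`, and `MnR` are abbreviations chosen here for illustration.

```python
import statistics

def retrieval_metrics(sim):
    """Compute R@1/5/10 (in percent), median rank, and mean rank for text-to-video retrieval.

    sim[i][j] is the score of query text i against video j; video i is the
    correct answer for query i. Ranks are 1-based.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in (1, 5, 10)}
    metrics["MdR"] = statistics.median(ranks)
    metrics["MnR"] = sum(ranks) / n
    return metrics

sim = [[0.9, 0.1, 0.3],
       [0.2, 0.4, 0.8],
       [0.1, 0.2, 0.7]]
metrics = retrieval_metrics(sim)
print(metrics)
```

In this toy example the ranks are 1, 2, and 1, so two of three queries are answered at rank 1. Transposing `sim` gives the video-to-text direction.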

Implementation Details
The basic visual and text encoders adopt the pre-trained weights in CLIP [30], which include ViT-B/16 and ViT-L/14 architectures [8]. Besides, a 4-layer temporal transformer is added on top of the visual encoder to capture temporal information. Similar to CLIP4Clip [25], parameters of the temporal transformer are initialized from the first 4 layers of the text encoder in CLIP. The frame length $N_v$ and word length $N_w$ are 12 and 32 for MSR-VTT, MSVD, LSMDC, and 64 and 64 for DiDeMo and ActivityNet, respectively. The network is optimized by Adam for 5 epochs. The batch size is 128 for ViT-B/16 and 64 for ViT-L/14. The initial learning rate is 1e-7 for the CLIP parameters and 1e-4 for the non-CLIP parameters, respectively. The hyper-parameters are set as $N_c = N_p = 6$, $\alpha = 0.5$, $\beta = 0.1$, and $\theta = \lambda = 0.1$, which will be analyzed in the ablation study.
We use 32 A100 GPUs for the ViT-L/14 backbone and 8 A100 GPUs for the ViT-B/16 backbone under the 128 batch-size setting for experiments. Each GPU is equipped with 8 CPU cores and 48G RAM. The algorithm is implemented in Python.
For the HCMI with SMoE, we explore the best setting for different datasets using 64 A100 GPUs.
Here, we utilize ViT-L/14 as the basic model, which has around 450M parameters. Then, we replace the FFN layers in ViT-L/14 with SMoE layers to expand the model scale. For MSR-VTT, LSMDC and MSVD with 12 frames and 32 words, we use 64 experts in each SMoE layer, which gives about 17B parameters for the whole model. For ActivityNet and DiDeMo with 32 frames and 64 words, we use 8 experts in each SMoE layer to fill up the GPUs. Finally, after experiments, we find that k = 2 gives the best results for MSR-VTT and LSMDC, and k = 4 is the best for MSVD. For ActivityNet and DiDeMo, k = 1 gives the best results.

Ablation Study
In this part, we analyze the effect of each component and hyper-parameter in HCMI and visualize some important results.

Evaluation on each component
We conduct experiments on MSR-VTT to evaluate the effect of each module and method in HCMI. 'Base' indicates the baseline CLIP4Clip model with a temporal transformer and ViT-B/16. 'GDP' is the global dot product interaction between video representations and sentence embeddings, which is used in the CLIP4Clip method. 'TWI' uses the token-wise interaction in [34], which explores a dense relationship between different frames and words. 'HCI' is the multi-level cross-modal interaction in HCMI, which explores frame-word, clip-phrase, and video-sentence relationships simultaneously. 'Denoise' and 'MSE' indicate adding the adaptive label denoising and marginal sample enhancement strategies. 'Dual' means using the dual-softmax proposed in [6]. Notably, 'Dual' is only used during the inference stage as a post-processing strategy.
The results are given in Table 1. It can be seen that, compared to the global dot product interaction in 'Base', our hierarchical cross-modal interaction ('HCI') brings a significant gain, e.g., a 4.9% improvement on R@1. Compared to the token-wise interaction 'TWI', 'HCI' also shows obvious superiority, due to the extra clip-phrase and video-sentence interactions. This demonstrates that considering multi-level interaction between video and text can significantly improve performance. Besides, both adaptive label denoising and marginal sample enhancement bring further gains, which shows that HCMI can successfully discover potential positive samples in a batch. Finally, our HCMI is made up of Base+HCI+Denoise+MSE+Dual.
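The 'Dual' post-processing can be illustrated with a small sketch of one common formulation of dual softmax at inference time: each score is multiplied by a prior obtained from a softmax over all queries for the same video. The `temperature` value and the function name `dual_softmax` are assumptions made for illustration, and the exact variant in [6] may differ in detail.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dual_softmax(sim, temperature=100.0):
    """Re-weight a text-by-video score matrix with a prior computed over the queries.

    Each column (one video) is soft-maxed across all texts, and the result
    multiplies the original score, so a video that several texts compete for is
    down-weighted for all but its strongest query.
    """
    cols = [softmax([s * temperature for s in col]) for col in zip(*sim)]
    prior = [list(r) for r in zip(*cols)]  # back to text-by-video layout
    return [[p * s for p, s in zip(prow, srow)] for prow, srow in zip(prior, sim)]

# Both texts initially prefer video 0.
sim = [[0.60, 0.55],
       [0.70, 0.30]]
revised = dual_softmax(sim)
best = [row.index(max(row)) for row in revised]
print(best)  # [1, 0]
```

After re-weighting, text 0 is redirected to video 1 because text 1 claims video 0 more strongly. Note that this step needs the full set of test queries at once, which is why it acts as post-processing rather than as part of the model.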

Evaluation on hyper-parameters
In this part, we evaluate the effect of each hyper-parameter in HCMI on four benchmarks, i.e., MSR-VTT, LSMDC, DiDeMo, and ActivityNet. Here, the post-processing strategy of Dual Softmax is not used.
First, we conduct experiments on different $N_c$ or $N_p$. Here, we set $N_c = N_p$, which indicates how many clips or phrases HCMI obtains from frames or words. The results are given in Figure 3. It can be seen that, when $N_c$ is larger than 6 on MSR-VTT and LSMDC, the improvement becomes weak. The reason is that we sample 12 frames for the MSR-VTT and LSMDC datasets, so $N_c = 6$ is enough to capture the clip information. However, for DiDeMo and ActivityNet with 64 frames, a large $N_c$ gives an obvious gain. To balance accuracy against model complexity, we choose $N_c = 6$ for all the datasets, which gives good overall performance on the five datasets.
Then, we evaluate the effect of different $\alpha$ and $\beta$. In Figure 6, we find that $\alpha = 0.5$ and $\beta = 0.1$ give the best performance. An interesting phenomenon is that, when $\alpha > 0.5$ or $\beta > 0.1$, the performance drops seriously. The reason is that the frame-word interaction in Eq. (3) and Eq. (9) also contains the clip-phrase and video-sentence information, as frames and words make up clips and phrases. Thus, large $\alpha$ and $\beta$ lead to unbalanced loss contributions among the frame-word, clip-phrase, and video-sentence granularities. Besides, the automatically aggregated clip representations and phrase embeddings also contain video- and sentence-level information. Thus, the optimal $\alpha$ is larger than $\beta$, and the performance is sensitive to large values of both $\alpha$ and $\beta$.
Finally, λ is evaluated in

Visualization
The core motivation of HCMI is to explore hierarchical cross-modal interactions between frame-word, clip-phrase, and video-sentence. Thus, we give some matching results between frames and words to verify the effectiveness of fine-grained interaction. From Figure 4, most words can be matched with semantically related frames. In particular, some important nouns and verbs play important roles in TVR with large similarity scores, which justifies the effectiveness of exploring hierarchical semantic similarities. Besides, we visualize auto-aggregated clip-phrase matching pairs in Figure 5. For each clip and phrase, we only visualize the top-3 tokens for simplicity. From these results, HCMI can effectively aggregate semantically related frames into an integrated clip and cluster important keywords into a phrase. Since clips and phrases are learned from datasets, clip-phrase matching results are worse than frame-word matching. Exploring a stronger aggregation mechanism is therefore one of our future research directions.

Comparison with State-of-the-Art Methods
In this part, we compare our HCMI with state-of-the-art methods on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet benchmarks. The MSR-VTT results are given in Table 2. It can be seen that HCMI significantly surpasses CLIP4Clip by 10.4% in R@1 and outperforms the recent method DRL by 1.6% in R@1. This is because HCMI considers multi-level interactions between video and text modalities, which are ignored by CLIP4Clip and DRL. It is noted that the reason for the weak performance of ViT-L/14 is that a large-scale model tends to overfit on a small-scale dataset. Thus, for the large-scale LSMDC dataset in Tables 3 and 4, HCMI with ViT-L/14 obtains an obvious gain over the previous methods. Besides, MDMMT-2 also adopts ViT-L/14 as its backbone, but HCMI has a stronger ability to capture cross-modal correlation, thereby yielding superior retrieval performance.
Then, we report the best results of HCMI with the DSL strategy. These results demonstrate that the large-scale model can benefit from cross-modal knowledge when having enough data for training.
Besides the dense model, we further expand HCMI to a huge model with about 17B parameters, which is denoted as HCMI huge in the Tables. Especially on MSR-VTT and LSMDC with large-scale videos, HCMI huge shows clear superiority over dense models. This shows the potential of large-scale models in tackling multi-modal tasks.
In summary, HCMI obtains new state-of-the-art performance on popular TVR benchmarks.

Discussion about application
From the experimental results, HCMI can significantly improve text-video retrieval performance, thereby boosting industrial applications of this task. For example, HCMI can be used in the Web&App search and recommendation business to recommend relevant videos when a user provides a paragraph of description. The Internet advertising industry can also benefit from HCMI, because it can recommend proper digital advertisement content to a user who is reading a commodity article. Besides, HCMI can be used to generate a paragraph of video description by searching for a corresponding sentence for each video segment. The above potential applications show the importance of HCMI for both academia and industry.

Conclusion
In this paper, we revisited recent Text-Video Retrieval (TVR) methods and analyzed their pros and cons. Considering the comprehensive interaction between two modalities in human perception, we proposed a novel method, named Hierarchical Cross-Modal Interaction (HCMI), which hierarchically explores video-sentence, clip-phrase, and frame-word interactions to understand text-video contents. Besides, two boosting strategies, i.e., adaptive label denoising and marginal sample enhancement, were designed to further improve the performance. Consequently, HCMI has been demonstrated to surpass the existing TVR methods on five benchmarks, i.e., MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, by a notable margin.