Debiased Video-Text Retrieval via Soft Positive Sample Calibration

With the enormous number of videos emerging on various video apps, semantic video-text retrieval has become a critical task for improving the user experience. The primary paradigm for video-text retrieval learns semantic video-text representations in a common space by pulling positive samples close to the query and pushing negative samples away. In practice, however, video-text datasets contain only annotations of positive samples; negative samples are randomly drawn from the entire dataset. Consequently, there may exist soft positive samples: samples drawn as negatives that share the same semantics as positive samples. Indiscriminately forcing the model to push all negative samples away from the query produces inaccurate supervision and misleads video-text representation learning. In this paper, we introduce debiased video-text retrieval objectives that calibrate the punishment of soft positive samples. In particular, we propose a novel uncertainty measure framework to estimate the credibility of negative samples for each instance. The estimated reliability of negative samples is then used to identify soft positive samples and rescale their contribution within video-text retrieval losses, including the triplet loss and the contrastive loss. Experimental results on five widely used datasets demonstrate that our debiased video-text retrieval objectives achieve significant performance improvements and establish a new state-of-the-art.


I. INTRODUCTION
Nowadays, people are swamped with the massive volume of videos provided by various video apps, such as YouTube and TikTok. To improve the user experience, video retrieval, a critical topic in the fields of information retrieval and video technology, has attracted increasing research interest.
The critical challenge of video-text retrieval is that the distributions and representations of videos and texts are inconsistent, making it difficult to measure the similarity between different modalities. To tackle this problem, the dominant approaches [1], [2], [3], [4], [5], [6], [7] for video-text retrieval first encode the different modalities into a common representation space, and then leverage a suitable distance metric to measure semantic similarities. Within this paradigm, recent work has focused on designing more complicated encoders [1], [6], [7] to obtain better modality representations, or more sophisticated matching strategies [2], [3], [5], [8]. For example, Gabeur et al. [2] focus on the multi-modality information in videos and incorporate multi-layer transformers to learn strong video features. Chen et al. [3] introduce a hierarchical video-text encoder, which factorizes video-text matching into hierarchical levels, including events, actions, and entities.
Existing approaches have achieved remarkable performance by leveraging the strong representation ability of deep neural networks. Meanwhile, few of them are aware that the data preparation process of video-text retrieval introduces biases. Annotators of video-text datasets are required to describe an entire untrimmed video in a few sentences [9], [10], [11], [12], [13], resulting in general annotations that disregard video specifics and can be paired with multiple videos. Furthermore, video-text retrieval datasets only contain annotations for positive video-text pairs, i.e., video v_i and text t_i match in semantics, with no labeled negative pairs, i.e., video v_i and text t_j do not match.
Existing methods [1], [2], [3], [4], [6], [7], [14], [15], [16], [17] randomly sample negatives from the whole distribution, which contains inescapable noise. There are samples that are drawn as negatives but have semantics comparable to the query, termed "soft positive samples". As shown in Fig.1, given a query t_0 "A man is singing and playing the guitar", only the annotated video v_0 is treated as a positive sample. All the other samples are regarded as negative, although some of them (e.g., v_1^- and v_2^-) can also match the query t_0 perfectly.

Fig. 1. The negative samples (v_*^-) are randomly sampled over the whole distribution, which may contain biases. For example, the negative ones in the blue shadow (v_1^-, v_2^-, v_4^-) also have close semantics to the query, which we refer to as "soft positive samples." Indiscriminately enforcing the model to maximize the distance between the query and these soft positive samples leads to biased supervision in optimization. With our debiased video-text retrieval, the soft positive samples receive appropriate supervision based on their uncertainty scores η_ij, resulting in a reasonable semantic space for cross-modal matching.
Conventional video-text retrieval methods push all the negative samples away from the query t_0, including the soft positive samples v_1^-, v_2^-, and v_4^-. However, indiscriminately enforcing the model to maximize the distance between the query and soft positive samples leads to inaccurate supervision, misleads video-text representation learning, and severely disrupts the common space.
To tackle the above challenges, we introduce the Debiased Video-Text Retrieval (DVTR) method, which calibrates the biased supervision of soft positive samples. We first introduce the video-text matching uncertainty estimation module, which evaluates the uncertainty [18] between the query and candidate samples to identify the soft positive samples. Specifically, a novel hierarchical probabilistic encoder is introduced to map video-text pairs into probabilistic embeddings [19]. Then, a heterogeneity-aware multi-modal uncertainty learning strategy is presented to estimate the matching uncertainty of a given video-text pair. The uncertainty is defined as the probability of semantic mismatch of a given video-text pair; a lower uncertainty means the candidate sample has a higher semantic similarity with the query sample. As shown in Fig.1, v_1^-, v_2^-, and v_4^- may be identified as soft positive samples of the query t_0 if the estimated uncertainty scores η_01, η_02, and η_04 are small. We then propose the debiased video-text representation learning module, in which we weightedly reduce the penalty of soft positive samples in video-text retrieval losses by their uncertainty scores to address the biased supervision. As shown in Fig.1, the distances between the query t_0 and v_1^-, v_2^-, and v_4^- are rescaled by their uncertainty scores. In this way, our representation learning module can precisely capture the semantics of videos and texts. Note that the proposed debiased video-text retrieval objectives can be applied to any state-of-the-art video-text retrieval model by simply adjusting its losses. To verify the effectiveness of our proposed method, we conduct extensive experiments on five widely used video-text retrieval benchmark datasets: MSRVTT [9], MSVD [10], VATEX [11], ActivityNet [13], and LSMDC [12]. Extensive experiments with state-of-the-art performance demonstrate the effectiveness of our debiased objectives.
In brief, our contributions can be summarized as follows: • We propose the novel Debiased Video-Text Retrieval (DVTR) method to alleviate the biased supervision of soft positive samples. DVTR can be integrated into most video retrieval models for better retrieval performance, with a small additional computational cost at training time and no extra time consumption at test time.
• We introduce the uncertainty estimation module to identify the soft positive samples, which precisely estimates the uncertainty of video-text pairs via novel hierarchical probabilistic encoders and a heterogeneity-aware multi-modal uncertainty learning strategy.
• We present the debiased video-text representation learning objectives, in which we calibrate the biased supervision of soft positive samples by reducing their penalty in proportion to their uncertainty scores.
• Extensive experimental results on five widely used datasets demonstrate that our debiased video-text retrieval method achieves significant performance improvements and establishes a new state-of-the-art in video-text retrieval.
II. RELATED WORK

In this section, we briefly review the previous methods most relevant to our work, covering video-text retrieval, bias in video-text retrieval, and uncertainty estimation.

A. Video-Text Retrieval
Video-text retrieval aims to perform effective semantic retrieval across video and text data. Recent semantic video-text retrieval methods [1], [2], [3], [4], [5], [6], [7] either develop a powerful encoder to map video and text into a shared embedding space, or design more sophisticated matching strategies to align video and text at different levels. For example, Gabeur et al. [2] introduce a transformer-based encoder architecture that aggregates multiple modality features extracted from videos to build effective video representations. Dong et al. [1] propose a novel dual network that exploits multi-level encodings to obtain global, local, and temporal patterns in videos and sentences. However, these methods are limited by the scale of their training data and achieve relatively poor retrieval performance. Most recently, with the promising performance of CLIP [20], some CLIP-based methods [6], [7], [21] achieve remarkably high performance and outperform other methods by a large margin. Some approaches [17], [22] focus on large-scale datasets. For example, Bain et al. [17] propose an end-to-end trainable model that is designed to exploit large-scale image and video captioning datasets for video-text retrieval. Ko et al. [22] introduce a multi-modal self-supervised framework to capture significant information from noisy and weakly correlated large-scale datasets by using a variant of dynamic time warping. Additionally, some recent works [23], [24] focus on splitting video and text into fine-grained levels to bridge the semantic gap between the visual and textual modalities. For example, inspired by the reading strategy of humans, Dong et al. [23] propose a reading-strategy inspired visual representation learning method to represent videos. Zhang et al. [24] propose a local-global graph pooling network that disentangles the video and text into four levels with graph neural networks and exploits a hierarchical pooling strategy to maximize the mutual information between pooled features and the corresponding graph node features.
Different from the above work, this paper focuses on addressing the biased supervision of soft positive samples in video-text retrieval representation learning, rather than building more complex feature extractors or matching strategies. Furthermore, our approach can be combined with state-of-the-art methods to further improve their performance.

B. The Bias in Video-Text Retrieval
Recently, the bias in video representation learning has attracted more and more attention. Many methods [4], [25], [26], [27], [28], [29], [30], [31] have been proposed to address a variety of biases. Some approaches [4], [26] focus on the bias induced by the strict assumption of video-text retrieval, i.e., that only a single text is relevant to a query video and vice versa [25]. Hence, several approaches [4], [25], [26], [27] have been proposed to model one-to-many or many-to-many correspondences in the retrieval task. For example, Patrick et al. [4] introduce a multi-modal cross-instance text generation task as an auxiliary to extract the inner one-to-many correspondences between instances for video-text retrieval. Chun et al. [26] encode image and text into probability distributions of concepts, and implicitly perform many-to-many matching between those concepts. Some recent works [30], [32], [33] make efforts to alleviate biases in other areas. For example, Cheng et al. [32] explore the effectiveness of various video features for visual search and test different search strategies over different types of queries. Liu et al. [33] introduce cross-modal semantic importance consistency to achieve invariance in the semantics of items during cross-modal alignment; it measures the semantic importance of items and learns a more reasonable representation vector by inter-calibrating the importance distributions. Yang et al. [30] present a novel contrastive self-supervised loss to update foreground features in a noise-free manner for instance segmentation, considering the different roles of noisy labels in different subtasks' losses.
In this paper, we observe the biased supervision of soft positive samples. We introduce the Debiased Video-Text Retrieval (DVTR) method, which tackles this bias by directly identifying the soft positive samples among the negatives and correcting their punishment.

C. Uncertainty Estimation
Uncertainty estimation aims to capture what a model does not know or is not confident about. Various aspects of uncertainty have been explored [18], [19], [34], [35], [36], including data-dependent uncertainty, model uncertainty, and annotation uncertainty. For example, Oh et al. [19] learn the uncertainty of image representations to obtain a more robust embedding. Chang et al. [34] learn the uncertainty of input data to alleviate the influence of observation noise for better network optimization. Zheng and Yang [35] estimate the uncertainty of predicted pseudo labels for semantic segmentation. Uncertainty has also attracted much research attention in other areas [37], [38], [39], [40]. Kim et al. [37] consider two types of uncertainty in multispectral pedestrian detection to alleviate the miscalibration of image pairs. Cheng et al. [38] argue that the unlabeled data in deep semi-supervised hashing methods is not always reliable. They introduce an uncertainty-aware and multi-granularity consistency-constrained semi-supervised hashing method to alleviate the negative effects of noisy supervised signals, where the uncertainty score is estimated by Monte Carlo dropout. Kim et al. [39] introduce a class uncertainty-aware loss for object detectors, in which the uncertainty score of classification modulates the detector loss function. Su et al. [40] introduce an uncertainty-aware loss function in the multi-view stereo scenario to measure the reliability of the estimated depth map.
In this work, we explore the reliability estimation of the randomly sampled negative pairs and alleviate the biased supervision of soft positive samples by re-weighting their contributions in video-text representation learning. To the best of our knowledge, this is the first work to utilize uncertainty to resist biases in video-text retrieval.

III. PROBLEM FORMULATION
Given a video v_i and its corresponding caption t_i, the task of video-text retrieval is to retrieve the videos (or texts) whose semantics are similar to the query text (or video). Here, we take v as a query and t as a candidate as an example. The primary paradigm of video-text retrieval [1], [2], [3], [4], [6], [7], [14], [15], [16] is to encode the different modalities into a common representation space, then leverage distance metrics to directly compare the semantic similarity of video and text. Specifically, the cross-modal common space is built by ranking losses, in which the representational similarity between a given query v_i and its positive samples D^p_{t_i} is maximized while the similarity with negative samples D^n_{v_i} is minimized. We denote the uncertainty [18], [19] that v_i and t_j have similar semantic content as η_ij ∈ [0, 1]. Ideally, none of the negative samples match the query, i.e., η_ij = 1, ∀ t_j ∈ D^n_{v_i}. Unfortunately, there are no negative-pair annotations in the datasets; standard approaches thus sample negatives t_j for a given query v_i from the whole text distribution {t_j}_{j≠i}^N instead. There exist samples that are drawn as negatives but have semantics comparable to the query (η_ij < 1), termed "soft positive samples". Indiscriminately enforcing the model to maximize the distance between the query and soft positive samples leads to biased supervision and misleads video-text representation learning. To this end, we propose our debiased video-text retrieval method to mitigate the biased supervision of soft positive samples in the following section.

IV. THE PROPOSED METHOD

A. Model Overview
To address the biased supervision of soft positive samples in video-text representation learning, we propose our Debiased Video-Text Retrieval (DVTR) method. The overall architecture is illustrated in Fig.2.

Fig. 2. The framework of the debiased video-text retrieval network. The video-text matching uncertainty estimation module takes global- and local-level representations of each pair (v_i, t_j) as input to obtain its uncertainty score η_ij. Specifically, the pair (v, t) first goes through the feature encoders F_v and F_t to extract features. Then the probabilistic encoders P_v, P_t project each modality into the probabilistic embeddings z_v and z_t. By jointly considering the divergence of the distributions and the distance of sampled point embeddings, we approximate the uncertainty η_ij. The uncertainty is then employed to guide the video-text representation learning by reducing the penalty of the soft positive samples with η_ij.

The framework consists of the following components: • Basic Video-Text Retrieval Backbone: We construct a basic retrieval backbone for video-text retrieval. Specifically, we employ visual and textual encoders to extract video and text representations. Then, by minimizing the ranking loss functions, we construct a cross-modal common space in which the representation distance between a given query (video or text) and its positive samples is minimized while the distance to negative samples is maximized. Unfortunately, such a basic retrieval method cannot handle the biased supervision introduced by the soft positive samples.
• Video-Text Matching Uncertainty Estimation: To identify the soft positive samples, we propose a video-text matching uncertainty estimation module to measure the uncertainties between the query and negative samples. We propose hierarchical probabilistic encoders to map video and text into probabilistic embeddings. In addition, we propose a heterogeneity-aware multi-modal uncertainty learning strategy to comprehensively measure the discrepancy of multi-modal probabilistic distributions. Based on the estimated video-text matching uncertainty, we can precisely detect the soft positive samples.
• Debiased Video-Text Representation Learning: The debiased video-text representation learning aims to calibrate the inappropriate supervision of soft positive samples. The uncertainty scores between the query and each negative are first measured by the video-text matching uncertainty estimation module. Based on the estimated uncertainties, we modify the two frequently used ranking loss functions by reducing the penalty of soft positive samples in proportion to their uncertainty scores.

We introduce the basic video-text retrieval module in Sec.IV-B. The proposed video-text matching uncertainty estimation module is introduced in Sec.IV-C. The debiased loss functions are detailed in Sec.IV-D. Finally, we elaborate on the training and inference flow in Sec.IV-E. The math notations of this paper are summarized in Tab.I.

B. Basic Video-Text Retrieval Backbone
Given a video v and text t, the basic video-text retrieval module aims to encode video and text into the common representation space. The local and global features of video-text pairs are then generated and fed into the video-text uncertainty estimation module.
1) Video Encoder F_v(v; θ_{F_v}): Given a video v, the video encoder is used to learn the local- and global-level representations:

(l_v, g_v) = F_v(v; θ_{F_v}),

where θ_{F_v} is the parameter of the video encoder, and l_v and g_v are the local- and global-level video representations, respectively.
The global video representation g_v is obtained by applying a pooling strategy to l_v. Specifically, the video encoder is designed as a transformer-based architecture. For HiT [16], we use the outputs of the feature-level layer as the local-level representations of the video and conduct mean pooling over the semantic-level layer to aggregate the global-level representations. For CLIP2Video [7], the vision transformer (ViT) [41] is adopted to encode every frame into features, and the temporal and spatial information is combined to generate the local-level representations. Following [7], we then apply global average pooling to obtain the final global-level representations.
2) Text Encoder F_t(t; θ_{F_t}): Given a text t, the text encoder F_t(t; θ_{F_t}) is used to encode it as local- and global-level representations (l_t, g_t), where θ_{F_t} denotes the learnable parameter. Specifically, we adopt the base BERT [42] as the text encoder and fine-tune it. For HiT [16], we use the outputs of the word-level layer as the local-level representations l_t, and perform average pooling over the semantic-level layer to aggregate the global-level representations g_t. For CLIP2Video [7], we obtain the local-level features l_t from the hidden states of BERT and take the highest number in each hidden state as the global-level features g_t.
3) Video-Text Matching: Based on the above video and text encoders, the parameters θ_{F_v} and θ_{F_t} are updated by minimizing the triplet loss (Eq.1) or the contrastive loss (Eq.2):

L_TL = Σ_{i=1}^B Σ_{j≠i} ( [λ + S_ij^- − S_ii^+]_+ + [λ + S_ji^- − S_ii^+]_+ ),   (1)

L_CL = −(1/B) Σ_{i=1}^B ( log( exp(S_ii^+/τ) / Σ_{k=1}^B exp(S_ik/τ) ) + log( exp(S_ii^+/τ) / Σ_{k=1}^B exp(S_ki/τ) ) ),   (2)

where S_ij is the similarity of v_i and t_j, S^+ denotes positive pairs, S^- denotes negative pairs, [·]_+ = max{·, 0}, λ and τ are the margin and temperature parameters, respectively, and B is the batch size.
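The two ranking losses above can be sketched in PyTorch as follows; this is an illustrative sketch (the function names, margin default, and the batch-wise similarity-matrix convention are assumptions, not the paper's released code):

```python
import torch
import torch.nn.functional as F

def triplet_loss(sim, margin=0.5):
    """Bidirectional hinge-based triplet loss over a B x B similarity matrix.

    sim[i, j] is the similarity of video i and text j; the diagonal holds
    the positive pairs S_ii^+.
    """
    pos = sim.diag().view(-1, 1)  # S_ii^+ per row
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # video-as-query and text-as-query hinge terms, positives masked out
    cost_t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_v = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return (cost_t + cost_v).sum() / sim.size(0)

def contrastive_loss(sim, tau=0.07):
    """Symmetric InfoNCE-style contrastive loss with temperature tau."""
    logits = sim / tau
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

Both functions consume the same B x B similarity matrix, which is also the form the debiased variants later operate on.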

C. Video-Text Matching Uncertainty Estimation
In this section, we introduce the video-text matching uncertainty estimation module, which evaluates the uncertainty η_{i*} between the query sample v_i and candidate samples t_* by encoding the video and text into a probabilistic embedding space [19]. Different from common point-embedding methods, which project an input x into an embedding vector z with fixed dimension d, i.e., a point in R^d, probabilistic embedding maps the input x into a probabilistic distribution in R^d, which not only preserves the semantic information but also captures the inherent uncertainty [18] in data. We build the video-text matching uncertainty estimation module by extending probabilistic embedding to the video-text retrieval scenario.
1) Hierarchical Probabilistic Encoders: Given a pair (v_i, t_j) consisting of the i-th video and the j-th text (for clarity, we omit i and j in this section), we first propose the video probabilistic encoder P_v(l_v, g_v; θ_{P_v}) and the text probabilistic encoder P_t(l_t, g_t; θ_{P_t}) to project the local- and global-level representations of video and text into probabilistic embeddings z_v and z_t, where θ_{P_*} denote the parameters. The local and global features l_v, g_v of video v are first obtained by the feature extractor F_v(v; θ_{F_v}) to capture the semantic information. Then the video probabilistic encoder projects them into a probabilistic embedding z_v = P_v(l_v, g_v; θ_{P_v}), where z_v = N(v_µ, v_Σ). The mean v_µ and the variance v_Σ of the video distribution z_v are obtained as follows:

v_µ = LN(g_v + MLP^µ_v(Attn^µ_v(l_v))),   (3)
v_Σ = σ(MLP^Σ_v(Attn^Σ_v(l_v))),   (4)

where LN(·) is the LayerNorm [43], Attn^*_v(·) is the self-attention layer, MLP^*_v indicates the multilayer perceptron (MLP) layer, and σ(·) is the sigmoid function.
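As a concrete illustration, a probabilistic encoder of this shape could be sketched as follows in PyTorch; the layer wiring, dimensions, and pooling choices here are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ProbabilisticEncoder(nn.Module):
    """Maps local features l (B, N, d) and a global feature g (B, d) to the
    mean and diagonal variance of a Gaussian embedding N(mu, diag(var)).
    Layer names and wiring are illustrative assumptions."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.attn_mu = nn.MultiheadAttention(d, heads, batch_first=True)
        self.attn_var = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp_mu = nn.Linear(d, d)
        self.mlp_var = nn.Linear(d, d)
        self.norm = nn.LayerNorm(d)

    def forward(self, l, g):
        # Self-attention over the local tokens, then mean-pool to one vector.
        a_mu, _ = self.attn_mu(l, l, l)
        a_var, _ = self.attn_var(l, l, l)
        mu = self.norm(g + self.mlp_mu(a_mu.mean(dim=1)))      # distribution mean
        var = torch.sigmoid(self.mlp_var(a_var.mean(dim=1)))   # diagonal variance in (0, 1)
        return mu, var
```

The sigmoid keeps each variance entry bounded and strictly positive, so the downstream sampling and Wasserstein terms are always well-defined.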
Similar to the video probabilistic encoder, the text probabilistic encoder P_t(l_t, g_t; θ_{P_t}) encodes the text representations l_t and g_t as a probabilistic embedding z_t = P_t(l_t, g_t; θ_{P_t}), where z_t = N(t_µ, t_Σ). The local and global features l_t, g_t of text t are obtained by F_t(t; θ_{F_t}) to capture the semantic information. The mean t_µ and the variance t_Σ of the text probabilistic embedding z_t are formulated as follows:

t_µ = LN(g_t + MLP^µ_t(Attn^µ_t(l_t))),   (5)
t_Σ = σ(MLP^Σ_t(Attn^Σ_t(l_t))).   (6)

2) Heterogeneity-Aware Multi-Modal Uncertainty Learning: Existing probabilistic embedding methods [19], [26] mainly focus on single-modal data. They estimate the probability that the semantics of z_v and z_t are matched by directly calculating the distance of points sampled from the distributions. We argue that the inherent heterogeneity gap between the visual and textual modalities may invalidate such random sampling measures, especially when the sample size is small. Thus, in this work, we measure the matching probability between two probabilistic distributions by jointly considering the similarity of the points sampled from each distribution and the divergence between the distributions:

D(z_v, z_t) = v^T t − KL(z_v ∥ z_t),   (7)

where v and t are sampled from the distributions z_v and z_t by Monte-Carlo sampling. If the two probabilistic distributions are aligned well in the common space, Eq.7 degenerates to the standard form D(z_v, z_t) = v^T t, which only considers the distance between point embeddings. Once the two probabilistic distributions are not aligned well, Eq.7 measures the discrepancy between the probabilistic embeddings by introducing the divergence of the distributions.
During training, based on the similarity of the distributions, the loss function is formulated as follows:

L_U = (1/B) Σ_{i=1}^B ( −(1/K²) Σ_{k=1}^K Σ_{k'=1}^K v_i^{(k)T} t_i^{(k')} + KL(z_{v_i} ∥ z_{t_i}) ),   (8)

where K is the Monte-Carlo sampling number and KL(·∥·) is the KL divergence. However, the KL divergence is asymmetric and suffers from vanishing gradients. To address this, we adopt the 2-Wasserstein distance [44] to minimize the discrepancy between z_v and z_t. As z_v and z_t follow Gaussian distributions, the 2-Wasserstein-based term reduces to:

W_2²(z_v, z_t) = ∥v_µ − t_µ∥₂² + ∥v_Σ^{1/2} − t_Σ^{1/2}∥_F².   (9)

Then, we measure the uncertainty of the pair (v_i, t_j) by:

η_ij = σ(MLP_η(D(z_{v_i}, z_{t_j}))),   (10)

where MLP_η denotes an MLP layer and σ is the sigmoid activation function. Furthermore, we optimize L_η to minimize the uncertainty of the annotated positive pairs:

L_η = (1/B) Σ_{i=1}^B η_ii.   (11)

Following the aforementioned definition of uncertainty, we can effectively identify the soft positive samples among the negatives, since they have an η_ij close to that of the positive samples, i.e., the candidate video/text is highly semantically consistent with the query sample.
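The Monte-Carlo sampling, the 2-Wasserstein distance between diagonal Gaussians, and the combined distribution similarity can be sketched as follows (a minimal PyTorch sketch; the exact way the point-similarity and divergence terms are combined is an illustrative assumption):

```python
import torch

def sample_embeddings(mu, var, K=7):
    """Draw K Monte-Carlo samples from N(mu, diag(var)) via reparameterization."""
    eps = torch.randn(K, *mu.shape, device=mu.device)
    return mu.unsqueeze(0) + var.sqrt().unsqueeze(0) * eps  # (K, B, d)

def wasserstein2(mu_v, var_v, mu_t, var_t):
    """Squared 2-Wasserstein distance between diagonal Gaussians, per batch row."""
    return ((mu_v - mu_t) ** 2).sum(-1) + \
           ((var_v.sqrt() - var_t.sqrt()) ** 2).sum(-1)

def distribution_similarity(mu_v, var_v, mu_t, var_t, K=7):
    """Averaged point-sample similarity minus a distribution-divergence penalty."""
    zv = sample_embeddings(mu_v, var_v, K)  # (K, B, d)
    zt = sample_embeddings(mu_t, var_t, K)
    # Average dot product over all K x K sample pairs, per batch element.
    point_sim = torch.einsum('kbd,lbd->b', zv, zt) / (K * K)
    return point_sim - wasserstein2(mu_v, var_v, mu_t, var_t)
```

When the two Gaussians coincide, the divergence term vanishes and only the point-sample similarity remains, mirroring the degenerate case discussed above.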

D. Debiased Video-Text Representation Learning
As shown in Fig.1, we observe that candidate samples with higher semantic similarity to the query have lower uncertainty. Based on this observation, we propose a novel debiased video-text representation learning module, in which we reduce the penalty of soft positive samples in ranking losses in proportion to their uncertainty scores. Specifically, the two most commonly used ranking losses, the triplet loss and the contrastive loss, are modified with uncertainty.
1) The Debiased Triplet Loss: The triplet loss is commonly used in video-text retrieval to make the similarity of positive pairs S_ii^+ at least λ larger than that of negative pairs S_ij^-. The conventional formulation of the triplet loss L_TL is shown in Eq.1.
Minimizing L_TL maximizes the similarity of positive pairs S_ii^+ and minimizes the similarity of negative pairs S_ij^-. It is implemented by pulling the representation of the positive sample v_i close to the query t_i and pushing the representation of the negative sample v_j away from the query t_i, until the negative samples are separated from the positive one by at least a margin of λ. However, when facing soft positive samples, existing models still try to push them away by imposing a significant penalty, leading to a wrong optimization direction.
In our debiased video-text representation learning, we define the debiased triplet loss as follows:

L_UTL = Σ_{i=1}^B Σ_{j≠i} ( [λ + η_ij^- S_ij^- − S_ii^+]_+ + [λ + η_ji^- S_ji^- − S_ii^+]_+ ),   (12)

where, for the pairs with low uncertainty scores, such as soft positive samples, a smaller weight η_ij^- reduces the penalty, leading to a smaller gradient in optimization. For the pairs with high uncertainty, η_ij^- S_ij^- stays high, resulting in a large gradient in optimization. Thus, L_UTL effectively prevents pairs with high semantic similarity from being pushed far apart in the embedding space.

Algorithm 1: Debiased Video-Text Retrieval Training
Input: training pairs {(v_i, t_i)}; max epoch number E
Output: learned parameters θ_{F_v}, θ_{F_t}, θ_{P_v}, θ_{P_t}, and θ_U
repeat
  // Training the uncertainty estimation module
  Extract features; generate probabilistic embeddings;
  Calculate L_U and L_η;
  // Training the debiased retrieval module
  Calculate η_ij for negative pairs;
  Calculate the debiased loss L_UTL or L_UCL;
until the max epoch E is reached
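A minimal sketch of an uncertainty-weighted triplet loss of this form, assuming a B x B similarity matrix sim and an uncertainty matrix eta with entries in [0, 1]:

```python
import torch

def debiased_triplet_loss(sim, eta, margin=0.5):
    """Triplet loss in which each negative similarity is rescaled by its
    uncertainty score eta[i, j]; soft positives (low eta) are down-weighted.
    An illustrative sketch, not the paper's released code."""
    B = sim.size(0)
    pos = sim.diag().view(-1, 1)  # S_ii^+
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    # video-as-query direction: negative text t_j weighted by eta[i, j]
    cost_t = (margin + eta * sim - pos).clamp(min=0).masked_fill(mask, 0)
    # text-as-query direction: negative video v_j weighted by eta[j, i]
    cost_v = (margin + eta.t() * sim.t() - pos).clamp(min=0).masked_fill(mask, 0)
    return (cost_t + cost_v).sum() / B
```

Setting eta to all ones recovers the ordinary triplet loss, while driving eta toward zero for a suspected soft positive removes its hinge penalty entirely.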
2) The Debiased Contrastive Loss: The contrastive loss is also widely adopted in video-text retrieval. It aims to make the similarity of the positive pair S_ii^+ account for the largest proportion of the summed similarities of all pairs in a batch, Σ_{k=1}^B S_ik. The conventional formulation of the contrastive loss L_CL is shown in Eq.2.
By minimizing L_CL, the similarity of positive pairs S_ii^+ approaches 1, and the similarity of all the other negative pairs Σ_{k≠i}^B S_ik^- approaches 0. It is implemented by pulling the representations of v_i and t_i as close as possible and pushing the representations of v_i and t_j as far apart as possible. Thus, the contrastive loss also cannot handle the soft positive samples well, due to the conflict between pulling semantically similar samples close to the query and pushing away the soft positive samples.
In our debiased video-text representation learning, we define the debiased contrastive loss as follows:

L_UCL = −(1/B) Σ_{i=1}^B log( exp(S_ii^+/τ) / ( exp(S_ii^+/τ) + Σ_{k≠i}^B η_ik^- exp(S_ik^-/τ) ) ).   (13)

In L_UCL, the issue of soft positive samples is well addressed, since the lower the uncertainty η_ik^- of a pair, the smaller its gradient and contribution to the optimization.
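An uncertainty-weighted contrastive loss of this form can be sketched as follows (illustrative; the placement of the eta weights on the negative terms in the denominator follows the description above):

```python
import torch

def debiased_contrastive_loss(sim, eta, tau=0.07):
    """Symmetric contrastive loss whose negative terms are weighted by the
    uncertainty scores eta[i, k]; an illustrative sketch."""
    B = sim.size(0)
    eye = torch.eye(B, device=sim.device)
    # Weight 1 for the positive (diagonal), eta for every negative pair.
    w = eye + (1 - eye) * eta
    exp_sim = torch.exp(sim / tau)
    v2t = -torch.log(exp_sim.diag() / (w * exp_sim).sum(dim=1))
    t2v = -torch.log(exp_sim.diag() / (w * exp_sim).sum(dim=0))
    return 0.5 * (v2t + t2v).mean()
```

With eta equal to all ones this reduces to a standard symmetric InfoNCE loss; shrinking the eta of a high-similarity negative shrinks its share of the denominator, and hence its gradient.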

E. Training and Inference
We summarize our training algorithm in Alg.1. We first learn the uncertainty estimation model with positive pairs. Then, we train the video-text retrieval method with the losses L_UTL or L_UCL, in which the contribution of negative samples is calibrated by their uncertainty. Given the positive training pairs {(v_i, t_i)}_{i=1}^B, we first extract the local and global features of video and text with the basic retrieval module. These features are fed into the corresponding probabilistic encoders P_v and P_t, which map them into probabilistic embeddings. We train the video and text probabilistic encoders by minimizing L_U and L_η. To train the debiased retrieval module, we first obtain negative pairs {(v_i, t_j)}_{i≠j}^B and calculate the uncertainty η_ij for each negative sample. The debiased L_UTL or L_UCL is then optimized, with each sample receiving proper supervision. In the inference stage, given a query sample, we extract its features with F_v/F_t and sort the similarities between the query and candidates to choose the best-matching samples.
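The alternating two-stage training of Alg.1 can be illustrated with a self-contained toy example; all modules here (linear encoders, the tiny uncertainty head, the loss shapes) are deliberately simplified stand-ins, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ToyDVTR(nn.Module):
    """Toy stand-in: linear feature encoders, linear probabilistic heads,
    and a one-unit uncertainty head. Purely illustrative."""
    def __init__(self, d_in=16, d=8):
        super().__init__()
        self.f_v = nn.Linear(d_in, d)   # video feature encoder
        self.f_t = nn.Linear(d_in, d)   # text feature encoder
        self.p_v = nn.Linear(d, 2 * d)  # probabilistic head -> (mu, logvar)
        self.p_t = nn.Linear(d, 2 * d)
        self.unc = nn.Linear(1, 1)      # uncertainty head over a W2 distance

    def embed(self, v, t):
        return self.f_v(v), self.f_t(t)

    def gauss(self, head, x):
        mu, logvar = head(x).chunk(2, dim=-1)
        return mu, logvar.exp()

    def uncertainty(self, g_v, g_t):
        mu_v, var_v = self.gauss(self.p_v, g_v)
        mu_t, var_t = self.gauss(self.p_t, g_t)
        # Pairwise squared 2-Wasserstein distance between diagonal Gaussians.
        w2 = ((mu_v.unsqueeze(1) - mu_t.unsqueeze(0)) ** 2).sum(-1) + \
             ((var_v.sqrt().unsqueeze(1) - var_t.sqrt().unsqueeze(0)) ** 2).sum(-1)
        return torch.sigmoid(self.unc(w2.unsqueeze(-1))).squeeze(-1)  # eta in (0,1)

def train_step(model, v, t, opt, tau=1.0):
    g_v, g_t = model.embed(v, t)
    # Stage 1: fit the uncertainty module so positives (diagonal) get low eta.
    eta = model.uncertainty(g_v.detach(), g_t.detach())
    loss_eta = eta.diag().mean()
    # Stage 2: debiased contrastive retrieval loss with eta-weighted negatives.
    sim = (g_v @ g_t.t()) / tau
    B = len(v)
    w = torch.eye(B) + (1 - torch.eye(B)) * eta.detach()
    loss_ret = -torch.log(sim.exp().diag() / (w * sim.exp()).sum(1)).mean()
    loss = loss_eta + loss_ret
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

The detach calls mirror the alternation in Alg.1: stage 1 updates only the probabilistic and uncertainty heads, stage 2 updates only the retrieval encoders under fixed eta.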

V. EXPERIMENTS
A. Experimental Details

1) Datasets: To achieve a comprehensive evaluation of our DVTR, we carry out experiments on five widely used [1], [2], [3], [6], [7], [14], [16] video-text retrieval datasets with various scales and video sources. Table II summarizes the brief statistics of these datasets.

a) MSRVTT [9]: The MSRVTT dataset consists of 10K videos collected from YouTube. Each video lasts 10 to 30 seconds and is annotated with about 20 natural sentences in English. Our results are reported on the train/test splits named Full [9] and 1k-A [45]. In the Full split, 6,513 videos are used for training, 497 for validation, and 2,990 for testing. The 1k-A split, introduced by [45], uses 9K videos for training and 1K for testing and validation.

b) MSVD [10]: The MSVD dataset consists of 80K English sentences for 1,970 videos from YouTube. Each video is described by around 40 sentences. Our results are reported based on the standard split that uses 1,200, 100, and 670 videos for training, validation, and testing, respectively.

c) VATEX [11]: VATEX is a multilingual video-text dataset with 34,911 videos. Each video, collected from YouTube, has a duration of 10 seconds and at least 10 English captions. In our work, we only use the English annotations. We use the official split: 25,991, 3,000, and 6,000 videos for training, validation, and testing, respectively.

d) ActivityNet [13]: The ActivityNet dataset consists of 20,000 YouTube videos. We follow the setting of [51], which concatenates all the captions of a video into a paragraph, and evaluate the model on the "val1" split.

e) LSMDC [12]: The LSMDC dataset contains 118,081 videos and an equal number of captions extracted from 202 movies, with a split of 109,673, 7,408, and 1,000 videos as the train, validation, and test sets. Each video is extracted from a movie and ranges from 2 to 30 seconds.
2) Implementation Details: Following Alg.1, we apply our debiased video-text retrieval objectives to HiT [16] and CLIP2Video (C2V) [7] to obtain our DVTR + HiT and DVTR + C2V models, respectively. We follow HiT [16] to set up the DVTR + HiT model, using 30 frames and 25 caption tokens. The initial learning rate is set to 2e-5 and the network is optimized with the AdamW [58] optimizer. We warm up the learning rate over the first 10% of steps and then apply cosine decay. The batch size is 128 and we train for 40 epochs. We follow CLIP2Video [7] to set up the DVTR + C2V model. We initialize the spatial transformer (ViT) with CLIP (ViT-B/16) [20] by reusing parameters of matching dimensions in CLIP. We use 12 frames and 32 caption tokens in the DVTR + C2V model. We fine-tune the model with the Adam optimizer and, following CLIP [20], decay the learning rate with a cosine schedule. The learning rate is set to 1e-7 for both the video and text encoders, and 2e-5 for our uncertainty estimation module. The batch size is 128 and we train for 5 epochs. The number of Monte-Carlo samples is set to 7 for both DVTR + HiT and DVTR + C2V. We set λ to 0.5 in Eq.12 and τ to 0.07 in Eq.13.
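For reference, the reported hyperparameters can be gathered into one place. This is a plain-Python summary sketch; the key names are our own, not the authors' actual configuration format:

```python
# Hyperparameters as reported in the implementation details above.
# The dictionary structure and key names here are hypothetical.
CONFIGS = {
    "DVTR+HiT": {
        "frames": 30, "caption_tokens": 25,
        "optimizer": "AdamW", "lr": 2e-5,
        "warmup": 0.10, "lr_schedule": "cosine",
        "batch_size": 128, "epochs": 40,
    },
    "DVTR+C2V": {
        "init": "CLIP ViT-B/16",
        "frames": 12, "caption_tokens": 32,
        "optimizer": "Adam",
        "lr_encoders": 1e-7, "lr_uncertainty": 2e-5,
        "lr_schedule": "cosine",
        "batch_size": 128, "epochs": 5,
    },
    # shared across both variants
    "shared": {"mc_samples": 7, "lambda": 0.5, "tau": 0.07},
}
```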

3) Evaluation Metrics: We adopt common metrics to report retrieval performance, including Recall at K (R@K) and Median Rank (MedR). R@K is the fraction of queries that correctly retrieve the desired item within the top K of the ranking list; a higher R@K means better retrieval performance. Following tradition, K = 1, 5, 10 are adopted (for ActivityNet, K = 1, 5, 50). MedR computes the median rank of the correct target for each query, where a lower score indicates better performance. Furthermore, rsum, the sum of the R@K scores, is reported as an overall metric.
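All three metrics can be computed directly from the 1-based rank of each query's ground-truth item. A minimal implementation following the standard definitions (with R@K expressed in percent, as is common in retrieval papers):

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth item ranks within the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    """Median of the ground-truth ranks; lower is better."""
    s = sorted(ranks)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

# ranks[i] = 1-based position of the ground-truth item for query i (toy data)
ranks = [1, 3, 2, 8, 1, 15, 4, 1, 6, 2]

r1, r5, r10 = (recall_at_k(ranks, k) for k in (1, 5, 10))
rsum = r1 + r5 + r10  # overall metric: the sum of the R@K scores
medr = median_rank(ranks)
```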

4) Compared Methods: To validate the effectiveness of our DVTR, we compare against baselines from the following three groups.
a) Conventional Video-Text Retrieval Models: The conventional video-text matching methods [2], [3], [16] focus on mining the multi-modality information from video and text to improve retrieval performance.

TABLE III: VIDEO-TEXT RETRIEVAL COMPARISON WITH STATE-OF-THE-ART METHODS ON THE MSRVTT DATASET
• HGR [3] proposes a hierarchical graph reasoning module that decomposes video-text matching into global-to-local levels.
• MMT [2] presents a multi-modal transformer to jointly encode the different modality features in video and allows them to make hierarchical interaction with the text feature.
• HiT [16] proposes a hierarchical transformer for video-text retrieval. It performs hierarchical cross-modal contrastive matching at both the feature and semantic levels, achieving multi-grained matching between the video and text modalities.
b) The Pretrained Model based Video-Text Retrieval Models: The pre-trained model based video-text retrieval methods [6], [7], [21], [49] transfer the ability of the pre-trained model to the cross-modal retrieval task by fine-tuning in the downstream datasets.
• CLIP-straight [21] directly adopts CLIP [20] to obtain video and text representations for video-text retrieval.
• CLIP4Clip [6] aims to transfer the knowledge of the CLIP model to video-text retrieval and introduces several cross-modal fusion modules to investigate an appropriate cross-modal matching strategy.
• CLIP2Video (C2V) [7] presents a temporal difference block to capture motion across fine-grained temporal video frames, and a temporal alignment block to re-align the tokens of video clips and phrases and improve multi-modal matching.
• X-Pool [49] focuses on the information difference between video and text and proposes a pooling strategy whose main mechanism is scaled dot-product attention, allowing a text to attend to its most semantically similar frames.
c) The Debiased Video-Text Retrieval Models: The debiased cross-modal retrieval methods [15], [27], [48], [50] reveal and alleviate the bias in retrieval datasets. All of them can be applied directly to many different video-text retrieval models to improve retrieval performance.
• TT [15] aims to alleviate the bias in captions and introduces multiple text encoders as complementary cues to provide enhanced supervision for the retrieval model.
• CMGSD [27] proposes an adaptive margin that changes with the distance between positive and negative pairs, mitigating the influence of soft negative samples.
• CAMoE [48] introduces an alignment strategy named dual softmax, which rectifies the similarity matrix to avoid one-way optimum matching in cross-modal matching.
• QB-NORM [50] presents a re-normalization strategy to alleviate the impact of hub embeddings that lie close to many queries in the common space.
In the following, the best performance is highlighted in bold and "-" means no result reported.

TABLE IV: VIDEO-TEXT RETRIEVAL COMPARISON WITH STATE-OF-THE-ART METHODS ON THE MSVD DATASET

TABLE V: VIDEO-TEXT RETRIEVAL COMPARISON WITH STATE-OF-THE-ART METHODS ON THE VATEX DATASET

B. Results

1) Comparison With State-of-the-Art Methods: Tab.III, Tab.IV, Tab.V, Tab.VI and Tab.VII show the performance comparison between our model and state-of-the-art methods on the five benchmark datasets. Our model surpasses all state-of-the-art methods on the five datasets across most evaluation metrics for both text-to-video and video-to-text retrieval. On the MSRVTT Full and 1k-A partitions, our retrieval performance in terms of rsum exceeds the recent state-of-the-art method T2VALD [5] by over 10 points. On MSVD, VATEX, ActivityNet and LSMDC, DVTR + HiT also outperforms the comparison methods by a large margin. The significant improvements achieved by DVTR indicate that samples which are labeled negative but semantically close to the positives have seriously disrupted video-text representation learning in prior state-of-the-art methods.
We also compare our DVTR + C2V model with state-of-the-art methods that use extra training data, such as pretraining on HowTo100M [59] or adopting the pretrained features of CLIP [20]. On the MSRVTT, MSVD, VATEX, and ActivityNet datasets, we achieve state-of-the-art performance compared with all baselines. On the LSMDC dataset, we outperform the state-of-the-art model on most of the metrics. The results show that pretrained models still suffer from the negative impact of soft positive samples when transferred to downstream tasks, although they retain powerful representations and achieve significant retrieval performance. We notice that the increase brought by DVTR is not as large as for the non-pretrained methods. This may be because pretrained models are trained on vast and comprehensive datasets, in which the probability η+ that one sample shares similar semantics with another randomly drawn sample is smaller, so they suffer minor biases.
2) Comparison With Other Denoising Methods: Tab.VIII shows the performance comparison between our proposed DVTR and other denoising methods on the Full split of MSRVTT. Following CMGSD [27] and TT [15], we adopt CE [14] as our backbone model, keep all its settings unchanged, and apply our DVTR to it. Specifically, CMGSD assigns samples a dynamic margin to shorten their optimization, but it neglects the soft positive samples and still gives them the same supervision as ordinary negative samples. TT uses multiple text encoders to provide abundant text supervision, and its improvement can be attributed to this additional supervision, but it never considers the impact of soft positive samples. The results show that our method surpasses CMGSD [27] and TT [15] and further upgrades the retrieval performance of the baseline by considering the semantics of soft positive samples: our method gains 2.7, 5.6, and 4.7 points on R@1, R@5, and R@10 for CE, respectively. These results show that our model can effectively fix the bias and improve retrieval performance by identifying the soft positive samples and correcting their inaccurate supervision.

3) Comparison With Different Probabilistic Embeddings:
In this work, we theoretically analyze the biased supervision problem in Sec.III and trace its root cause to existing methods drawing negative samples from the whole dataset, which introduces bias. Inspired by probabilistic embedding [19], we propose an innovative and effective solution: locate the biased samples and rescale their contributions by their uncertainty score, as shown in Eq.12 and Eq.13. As reviewed in the related work, PCME [26] also tackles the biased supervision problem, but it conjectures that the problem stems from many-to-many relationships not being modeled, and therefore introduces probabilistic embeddings to capture such relationships. The uncertainty in PCME is a by-product of providing interpretability for retrieval results.
In this section, we compare our method with PCME, extending the PCME model by substituting it for the video-text matching uncertainty estimation module in DVTR (DVTR_pcme + HiT and DVTR_pcme + C2V). Results are shown in Tab.IX. According to the results, the PCME model performs poorly on the R@K metrics of video-text retrieval tasks, possibly because probabilistic embeddings are better at capturing relations than at representing individual samples. DVTR_pcme + HiT and DVTR_pcme + C2V perform well, which demonstrates the effectiveness of our proposed debiased framework in identifying the soft positive samples and calibrating the biased supervision. The DVTR + * models outperform their DVTR_pcme + * counterparts, indicating the effectiveness of the proposed hierarchical probabilistic encoder and heterogeneity-aware multi-modal uncertainty learning, and further demonstrating that the proposed video-text matching uncertainty estimation module estimates the uncertainty score more accurately than PCME.

VI. ABLATION STUDIES

A. Debiased Loss Functions
Tab.X shows the results of the ablation studies on the MSRVTT Full dataset for the video-text retrieval task. L_UTL and L_UCL denote the debiased triplet ranking loss and the debiased contrastive learning loss, respectively. According to Tab.X, both L_UTL and L_UCL help the model achieve better performance, illustrating that the debiased loss functions help the model learn better representations of videos and texts. Furthermore, the debiased contrastive loss yields better performance, indicating that the bias may affect the conventional contrastive loss more strongly.

Fig. 3. Visualization of the similarity matrix on a batch of the MSVD test set. The maximum value in each row and column is highlighted with a red border. The diagonal holds the ground-truth pairs (v*, t*). For text-to-video retrieval, the model returns the most similar video in each row; for video-to-text retrieval, the most similar texts over the columns are returned.
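The calibration behind L_UCL can be sketched as an InfoNCE-style loss in which each negative exponential is rescaled by its reliability score. This is an illustrative approximation of the idea in Eq.13, not its exact form; the similarity and reliability values below are made up:

```python
import math

def debiased_contrastive_loss(sim_row, pos_idx, eta, tau=0.07):
    """InfoNCE-style loss for one query, where each negative exponential is
    rescaled by its reliability eta[j]: eta ~ 1 keeps the full penalty for a
    true negative, eta ~ 0 largely removes a soft positive from the loss."""
    pos = math.exp(sim_row[pos_idx] / tau)
    neg = sum(eta[j] * math.exp(s / tau)
              for j, s in enumerate(sim_row) if j != pos_idx)
    return -math.log(pos / (pos + neg))

sims = [0.8, 0.75, 0.1, 0.05]        # query vs. 4 candidates; index 0 is positive
all_negative = [1.0, 1.0, 1.0, 1.0]  # standard contrastive: full penalty everywhere
calibrated = [1.0, 0.05, 1.0, 1.0]   # candidate 1 flagged as a soft positive

biased = debiased_contrastive_loss(sims, 0, all_negative)
debiased = debiased_contrastive_loss(sims, 0, calibrated)
```

Down-weighting the soft positive (the high-similarity "negative" at index 1) shrinks its contribution to the denominator, so the model is no longer punished for scoring it highly.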

B. Monte-Carlo Sampling Numbers
Tab.XI reports the effect of the number of Monte-Carlo samples on the retrieval results, measured by the geometric mean of R@1, R@5, and R@10 for text-to-video retrieval. In these experiments, we only vary the number of Monte-Carlo samples and keep the other parameters unchanged. The results show that retrieval performance grows with the number of samples. Due to computational limits, we finally choose K = 7 as the number of samples.
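The role of the sample count can be illustrated with a toy estimator: the spread of a Monte-Carlo similarity estimate between two Gaussian embeddings shrinks roughly as 1/sqrt(K), which is consistent with performance stabilizing as K grows. The estimator below is a simplified sketch, not the paper's exact matching function, and the Gaussian parameters are made up:

```python
import random
import statistics

random.seed(1)

def mc_similarity(mu_v, mu_t, sigma, k):
    """Monte-Carlo estimate of the expected cosine similarity between two
    diagonal Gaussians N(mu, sigma^2 I), averaged over k sampled pairs."""
    total = 0.0
    for _ in range(k):
        zv = [m + sigma * random.gauss(0, 1) for m in mu_v]
        zt = [m + sigma * random.gauss(0, 1) for m in mu_t]
        dot = sum(a * b for a, b in zip(zv, zt))
        norm = (sum(a * a for a in zv) ** 0.5) * (sum(b * b for b in zt) ** 0.5)
        total += dot / norm
    return total / k

mu_v, mu_t, sigma = [1.0, 0.0, 0.5], [0.9, 0.1, 0.4], 0.3

# Repeat the estimate 200 times per sample count and measure its spread:
# larger k gives a more stable estimate.
spread = {
    k: statistics.stdev(mc_similarity(mu_v, mu_t, sigma, k) for _ in range(200))
    for k in (1, 7, 15)
}
```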

C. Hyperparameter of Triplet Loss
λ is the key parameter of the triplet loss function. To explore a suitable value for the margin λ, we conduct experiments reported in Tab.XII. According to Tab.XII, the model achieves the best performance when the margin λ is 0.5.

D. Hyperparameter of Contrastive Loss
The temperature τ in the contrastive loss is a sensitive parameter. To analyze the effect of τ, we conduct experiments with a fine-grained step of 0.01 and report the results in Tab.XIII. As Tab.XIII shows, the best retrieval performance is achieved when τ is set to 0.07.

VII. FURTHER ANALYSIS

A. Visualization of the Similarity Matrix
In this section, we study how the debiased objectives affect retrieval by visualizing the output similarity matrix. Fig.3 visualizes the similarity scores between queries and candidate samples in video-text retrieval. From Fig.3 we conclude that: (1) the conventional method fails in most retrieval cases since it struggles with the bias of soft positive samples, while the proposed method performs well; (2) both the conventional method and our proposed method encourage the model to give a high similarity score to the diagonal, i.e., the positive samples; (3) for the off-diagonal negatives, our debiased video-text retrieval objectives reduce their punishment in a weighted manner, resulting in accurate similarity scores for the soft positive samples and small scores for the true negatives.
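The row/column argmax rule described in the Fig.3 caption can be written down directly. The 3x3 similarity matrix below is a toy example (rows index text queries, columns index videos, as in the figure description), including one deliberate retrieval miss:

```python
# Toy similarity matrix: rows = text queries, columns = videos;
# the diagonal holds the ground-truth pairs (values are hypothetical).
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],   # query 1 scores video 2 highest: a retrieval miss
    [0.1, 0.2, 0.7],
]

n_rows, n_cols = len(sim), len(sim[0])

# text-to-video: for each text (row), return the most similar video
t2v = [max(range(n_cols), key=lambda j: sim[i][j]) for i in range(n_rows)]
# video-to-text: for each video (column), return the most similar text
v2t = [max(range(n_rows), key=lambda i: sim[i][j]) for j in range(n_cols)]

# a retrieval is correct when the argmax lands on the diagonal
t2v_hits = sum(t2v[i] == i for i in range(n_rows))
v2t_hits = sum(v2t[j] == j for j in range(n_cols))
```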

B. Visualization of the Soft Positive Sample Calibration
To show the effectiveness of debiasing more clearly, we randomly select a query and show its training loss along with the uncertainty over the whole MSVD dataset. From Fig.4 we conclude that: (1) over 95% of the negative samples have an uncertainty above 0.9, which indicates that they are real negatives; (2) the model quickly fits the real negative samples, so they have lower losses; (3) the model struggles with the biased supervision problem: the wrong supervision forces the model to give low similarity scores to the soft positive samples by imposing large losses, even though these samples share similar semantics with the query; (4) the proposed negative sample reweighting effectively reduces the punishment of soft positive samples.

C. Visualization of the Retrieval Results

Fig.6 shows the retrieval results of our video-to-text retrieval together with the output of the uncertainty estimation module. For both visualization examples, we display the top-3 retrieved results, among which the correct results are marked in green and the wrong ones in red. The uncertainty between each retrieved result and the query is shown beside or below the retrieved results. We find that even when no positive candidate appears among the retrieved results, the proposed debiased retrieval model still returns the most relevant result, which is semantically close to the query. Furthermore, our uncertainty estimation module identifies such cases by assigning an accurate uncertainty score.

D. Time Complexity
The uncertainty estimation module in DVTR is built on the features of the video-text retrieval backbone. Additional time is required only when DVTR projects v and t into probabilistic embeddings. At the projection stage, the main time-consuming operator is the attention mechanism, which requires additional O(N^3) time. Since numerous methods [4], [5], [16] already adopt transformers or attention mechanisms to learn better representations of video and text, DVTR requires no additional time consumption compared with these methods.

E. Space Complexity
The common point-embedding approach requires O(N) space to store the features in the joint embedding space. In our DVTR, extra space is used at the probabilistic embedding and Monte-Carlo sampling stages. At the probabilistic embedding stage, DVTR projects v and t into probabilistic distributions, so the µ and σ of both video and text must be stored, doubling the space requirement. For Monte-Carlo sampling, DVTR needs K^2 extra storage by sampling K points each from the video and text distributions. Thus, the additional space requirement of DVTR is O(2N · K^2).

VIII. CONCLUSION
In this work, we tackle the biased supervision caused by soft positive samples in video-text retrieval learning and propose the novel Debiased Video-Text Retrieval (DVTR) method to alleviate it. We first introduce a novel video-text matching uncertainty estimation module, which identifies the soft positive samples by evaluating the uncertainty of the query and candidate samples with probabilistic embeddings. Then, a debiased video-text representation learning objective is employed to fix the inaccurate supervision by reducing, in a weighted manner, the penalty of soft positive samples in the ranking losses. DVTR can be integrated into most video retrieval models for better retrieval performance, with little extra computational cost at training time and no additional time consumption at test time. Comprehensive experimental results on five widely used datasets demonstrate the superiority of the proposed method over other state-of-the-art video-text retrieval methods.