Modeling Multimodal Uncertainties via Probability Distribution Encoders Included Vision-Language Models

In the field of multimodal understanding and generation, tackling inherent uncertainties is essential for mitigating ambiguous interpretations across multiple targets. We introduce the Probability Distribution Encoder (PDE), a versatile, plug-and-play module that utilizes sequence-level and feature-level interactions to model these uncertainties as probabilistic distributions. Furthermore, we demonstrate its adaptability by seamlessly integrating PDE into established frameworks. Compared to previous methods, our probabilistic approach substantially enriches multimodal semantic understanding. In addition to specific tasks, the unlabeled data contains rich prior knowledge, especially multimodal uncertainties. However, current pre-training methods are designed based on point representations, which hinders the effective functioning of our distribution representations. Therefore, we incorporate this uncertainty modeling into three new pre-training strategies: Distribution-based Vision-Language Contrastive Learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). Empirical experiments show that our models achieve State-of-the-Art (SOTA) results in a range of downstream tasks, including image-text retrieval, visual question answering, visual reasoning, visual entailment and video captioning. Furthermore, the qualitative results reveal several superior properties conferred by our methods, such as improved semantic expressiveness over point representations, and the ability to generate diverse yet accurate predictions.


I. INTRODUCTION
Humans inherently possess the ability to comprehend real-world objects with precision, which includes discerning objects with similar semantics and mapping relationships across diverse modalities. Our computational models are engineered to emulate this capability, navigating through complex multimodal semantic landscapes while being acutely aware of inherent data uncertainties. However, the pursuit of such exactitude presents challenges. While multimodal data boasts a rich semantic depth, it also brings about more ambiguity and noise than its single-modality counterparts.
The associate editor coordinating the review of this manuscript and approving it for publication was Li He.
Building upon the pursuit of exactitude in navigating multimodal semantic landscapes, multimodal representation learning techniques serve as a pivotal approach for enhancing sophisticated interpretation across diverse data types [2]. Nevertheless, these methods face their own challenges. Chief among them is the issue of uncertainty, which manifests both within individual modalities and across different modalities, as corroborated by recent works [3], [4]. Consider the image labeled as (a.0) in Fig. 1 as an example: within a single visual region, one can observe a variety of objects including a billboard, several zebras, and mountains. Consequently, it becomes ambiguous which objects are being referred to when discussing this particular region. In the language example labeled as (b) in Fig. 1, complexities arise from intricate relationships among words, contributing to uncertainties like synonymy and hyponymy. In Fig. 1 (c) and (d), the same object is often represented differently across modalities like text and images, exemplifying the challenges of inter-modal uncertainty. The aforementioned multimodal uncertainties pervade a range of tasks in multimodal understanding and generation, such as cross-modal retrieval, visual question answering, and video captioning. These uncertainties pose considerable challenges to the effective training of AI models for these specialized applications. Rather than addressing these issues, existing methods [5], [6], [7] often overlook these uncertainties, which results in limited capabilities in comprehending complex concept hierarchies and a lack of prediction diversity. Therefore, it is imperative to model such multimodal uncertainties.
FIGURE 1. Exemplifying multimodal uncertainties and exploring a case of language uncertainty through point and distribution representations with examples from MSCOCO dataset [1].
In tackling uncertainties inherent to the feature representation space, the Gaussian distribution stands as a leading approach [3], [8], [9], [10]. In these approaches, the derived uncertainty relies on individual features, neglecting the interplay of all features, which is crucial for understanding inherent relationships. To mitigate this, we employ a specialized component, the Probability Distribution Encoder (PDE), to capture these uncertainty semantics. Beyond the interaction with entire objects, we extend the interactions to word tokens and image patches during the formulation of distribution representations, aiming to learn additional information. In Fig. 1 (e), we showcase two types of representations for language uncertainty, where distribution representations reveal richer semantic relationships than point representations. Moreover, the variance within these distribution representations serves as a metric for text-related uncertainty. In addition, distribution representations facilitate diverse generations, yielding several plausible predictions through random sampling. Expanding upon this, in Sec. V-A, we show the effectiveness of the PDE across diverse scenarios of multimodal understanding and generation. Moreover, as illustrated in Fig. 2, we offer a qualitative example in the context of video captioning tasks to provide a deep and intuitive understanding of PDE's functionality. In contrast to current methods, the PDE module not only contributes a richer array of possibilities within each video frame but also effectively captures multimodal uncertainties. Consequently, our plug-and-play PDE module serves to enhance the robustness of the models.
In addition to the aforementioned tasks with labeled data, the use of unlabeled multimodal data is flourishing. Concurrent with this trend, various Vision-Language Pre-training (VLP) methods have emerged for self-supervised learning from unlabeled data, offering performance gains in a range of downstream applications [11], [12], [13], [14], [15], [16]. However, existing deterministic representations often lack the ability to grasp uncertainty in pre-training data, as they merely pinpoint positions in semantic space and gauge relationships between targets using certainty metrics like Euclidean distance. This raises the question: what is an effective approach to modeling multimodal uncertainties within pre-training datasets?
Therefore, to learn this multimodal uncertainty in a self-supervised manner, we propose a Multimodal uncertainty-Aware vision-language Pre-training (MAP) framework. Specifically, we integrate uncertainty modeling into the multimodal pre-training strategies, yielding the following three tasks: D-VLC, D-MLM, and D-ITM.
The contributions of our work are outlined as follows:
• We delve into the multimodal uncertainties. Moreover, we introduce a new plug-and-play module, termed Probability Distribution Encoder (PDE), to model the uncertainty in distribution representations.
• Our proposed PDE methodology enhances the effectiveness of frameworks in various multimodal understanding and generation tasks. Moreover, to the best of our knowledge, we are the first to integrate uncertainty learning into video captioning.
• We formulate three uncertainty-aware multimodal pre-training strategies, namely D-VLC, D-MLM, and D-ITM, to learn multimodal uncertainties in large-scale unlabeled datasets. To our knowledge, this effort represents one of the first attempts to harness the probabilistic nature of distributions in VLP. Our code is available at https://github.com/IIGROUP/MAP.
• We seamlessly integrate our proposed pre-training tasks into a comprehensive end-to-end MAP framework. Moreover, the empirical evaluations demonstrate that the MAP model achieves SOTA performance across multiple downstream tasks. Additionally, our qualitative analysis showcases the effectiveness of our design in capturing multimodal uncertainties within cross-modal tasks, enabling the model to generate diverse and accurate predictions.

II. RELATED WORK

A. PROBABILITY DISTRIBUTION REPRESENTATIONS
Current representation learning approaches predominantly utilize point representations, aiming to closely align these features with the ground truth in high-dimensional space [18], [19]. However, many tasks present inherent uncertainties, suggesting the need for multiple suitable point representations. To tackle this, several works have introduced probability distribution representations, enriching inference and enhancing model robustness to avoid overfitting to a singular solution. Moreover, recent studies have addressed the uncertainty of input objects, achieving progress in single-modal settings. For instance, W2GM introduces word distributions formed from Gaussian mixtures to cater to multiple word meanings, entailment, and abundant uncertainty information, proposing an energy-based max-margin objective for learning these distributions [20].
Building upon the theme of uncertainty, Smoothed Box employs Gaussian convolutions to craft embeddings under the guidance of uncertain annotations [21]. This approach allows for a nuanced understanding of soft inclusions among various concepts. To tackle the long-tail issue in relation prediction, the Gaussian distribution is employed to encapsulate uncertainty in object relationships, aiding scene graph generation [22]. In the multimodal domain, recent efforts in constructing distributions have led to progress in diversifying predictions for cross-modal retrieval tasks [4]. Unlike existing methods that construct distributions at the feature level for an entire image or sentence, our approach models each token within them, such as patches in an image and words in a sentence. Consequently, our method is capable of handling interactions at both the sequence level and the feature level, facilitating the realization of multimodal uncertainty learning.

B. VISION-LANGUAGE PRE-TRAINING (VLP)
Recently, Vision-Language (VL) pre-training models have garnered significant attention within the multimodal research community by addressing real-world challenges through the pre-training and fine-tuning paradigm.

III. APPROACHES
First, we outline the PDE module in Sec. III-A, as shown in Fig. 3. Then, we describe MAP in Sec. III-B, with an overview given in Fig. 4. After that, we delineate the distribution-based VLP strategies in Sec. III-C. As detailed in Sec. III-D, following pre-training, the model is fine-tuned on specific VL downstream tasks.

A. PROBABILITY DISTRIBUTION ENCODER (PDE)
The inputs of PDE are derived from the embedding space that encompasses various modalities. To learn the complex multimodal uncertainty, we further model the features using multivariate Gaussian distributions. In detail, for each feature input, PDE calculates a mean vector ($\mu$) and a variance vector ($\sigma^2$), where the mean vector represents the central position of distributions in the probabilistic space, and the variance vector depicts the extent of distributions along each dimension. As illustrated in Fig. 3, we present the detailed architecture of PDE, encompassing both sequence-level and feature-level interactions. In particular, the Multi-Head (MH) operation handles sequence-level interactions, whereas the feed-forward layer tackles feature-level interactions. In the MH operation, the input representations $H \in \mathbb{R}^{T \times D}$ are divided into $k$ heads, with $T$ representing the sequence length and $D$ denoting the hidden size. Within each head, the representations are segregated and channeled into two paths ($\mu$, $\sigma^2$). In every path, the input representations in $\mathbb{R}^{T \times D/(2k)}$ are projected onto $Q^{(i)}$, $K^{(i)}$, and $V^{(i)}$. For instance, the operation within the $\mu$ path is as follows:
\[
\begin{aligned}
[Q_\mu^{(i)}, K_\mu^{(i)}, V_\mu^{(i)}] &= H_\mu^{(i)} W_{qkv}, \\
\mathrm{Head}_\mu^{(i)} &= \mathrm{Act}\left(Q_\mu^{(i)} K_\mu^{(i)\top} / \sqrt{d_k}\right) V_\mu^{(i)}, \\
\mathrm{MH}_\mu &= \mathrm{concat}\left[\mathrm{Head}_\mu^{(1)}, \ldots, \mathrm{Head}_\mu^{(k)}\right] W_O,
\end{aligned}
\tag{1}
\]
where $d_k$ is set to $D/(2k)$, and the weight matrix $W_{qkv} \in \mathbb{R}^{d_k \times 3d_k}$ projects the features into the subspace of each head. Likewise, the weight matrix $W_O \in \mathbb{R}^{k d_k \times D}$ projects the concatenated results of the $k$ heads into the output space. Furthermore, the term "Act" encompasses an activation function and a normalization function, enabling sequence-level interaction. The operations in the $\sigma^2$ path are similar to those in the $\mu$ path. Moreover, given the correlation between the input point representation and the mean vector, an "add" operation is utilized to derive the mean vector. The rationale for these design choices is elaborated in Sec. V-C2. After PDE processing, each token, whether visual or linguistic, is represented as a Gaussian distribution within a high-dimensional probabilistic space.
FIGURE 5. The SWINPDE model integrates the PDE module into the SwinBERT [29] architecture for enhanced video captioning.
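To make the dimension bookkeeping of the MH operation concrete, the following is a minimal pure-Python sketch of a single head in the µ path. It assumes the standard scaled dot-product form with row-wise softmax standing in for "Act"; the function names (`head_dim`, `mu_path_head`, `matmul`, `softmax_rows`) are hypothetical and not taken from the released code.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, used here as an assumed stand-in for 'Act'."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def head_dim(hidden_size, num_heads):
    """d_k = D / (2k): every head is split again into a mu and a sigma^2 path."""
    assert hidden_size % (2 * num_heads) == 0
    return hidden_size // (2 * num_heads)

def mu_path_head(h, w_qkv):
    """One head of the mu path (illustrative only).

    h:     T x d_k slice of the input representations
    w_qkv: d_k x 3*d_k projection matrix
    """
    d_k = len(h[0])
    qkv = matmul(h, w_qkv)                        # T x 3*d_k
    q = [row[:d_k] for row in qkv]
    k = [row[d_k:2 * d_k] for row in qkv]
    v = [row[2 * d_k:] for row in qkv]
    k_t = [list(col) for col in zip(*k)]          # d_k x T
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)        # T x d_k
```

With the hidden size of 768 and the k = 6 heads reported in the implementation details, `head_dim(768, 6)` gives a per-path head width of 64, so W_qkv would be 64 × 192 and W_O would be 384 × 768 under the shapes stated above.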
Our design makes PDE a modular, plug-and-play component that integrates with existing point-based frameworks. For instance, to tackle the challenges of video captioning, we incorporate PDE into the widely-used SwinBERT architecture [29], resulting in a modified model known as SWINPDE, as shown in Fig. 5. In this context, we employ a video transformer to extract spatial-temporal video representations from raw video frames. The PDE module then converts these representations into probabilistic distributions, capturing and learning from the uncertainties. In the final stage, the multimodal transformer processes the sampled representations from the PDE, translating them into coherent natural language sentences for captions via a sequence-to-sequence mechanism. This approach not only enhances the diversity of video captioning but also demonstrates the adaptability of our module in various scenarios. Furthermore, to extend its applicability to a diverse set of Vision-Language (VL) downstream tasks and pre-training contexts, we embed PDE into our MAP framework, as delineated in Sec. III-B.

B. MODEL OVERVIEW OF MAP

1) FEATURE EXTRACTION
We utilize a vision feature encoder and a language feature encoder. Specifically, the CLIP-ViT [11] serves as the vision encoder, while RoBERTa-Base [30] is employed for language encoding. An image is embedded into a patch feature sequence $\{v_{[CLS]}, v_1, \ldots, v_N\}$, with $v_{[CLS]}$ representing the overall vision feature, and similarly, the input text is transformed into a token sequence $\{w_{[CLS]}, w_1, \ldots, w_M\}$, where $w_{[CLS]}$ denotes the overall language feature.

2) CROSS-MODAL TRANSFORMER
In recent studies, multimodal transformers primarily fall into two categories for fusing diverse modalities: single-stream [12], [31], [32] and dual-stream [28], [33], [34]. In a common setting, the length of image patch sequences significantly surpasses that of text sequences, which poses a challenge for jointly computing attention scores due to the overwhelming weight of vision features [35]. To handle this challenge, we opt for a dual-stream architecture, entailing two separate transformer branches that fuse the input modalities through multiple attention matrices.
As detailed in Fig. 4, we present the overall structure of MAP, including $N_L$ layers of cross-modal encoders. In detail, each encoder layer comprises two Self-Attention (SA) modules and two Cross-Attention (CA) modules. Within the SA block of each modality, the query, key, and value vectors are linearly projected from either vision or language features.
Within the vision-to-language CA module of the $i$-th layer, the query vectors are derived from the language feature $T'_i$ output by the SA module, while the key and value vectors are derived from the vision feature $I'_i$. The Multi-Head Attention (MHA) operation in this CA block allows the language features to integrate visual information; the language-to-vision CA block mirrors its vision-to-language counterpart, with the roles of the two modalities exchanged. The operations of the $i$-th layer encoder unfold as follows:

C. DISTRIBUTION-BASED PRE-TRAINING TASKS
To capture the uncertainty semantics in common sense, we pre-train the MAP utilizing distribution-based VLP strategies on large-scale unlabeled data. In the pre-training phase, we apply PDEs after the feature extractors and after the cross-modal transformer, respectively. Specifically, the PDE following the feature extractors derives unimodal distribution representations to execute the coarse-grained multimodal alignment. Furthermore, situated at the end of MAP, another PDE is entrusted with fine-grained multimodal alignment.

1) COARSE-GRAINED PRE-TRAINING TASKS
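We introduce a method termed D-VLC (Distribution-based Vision-Language Contrastive Learning) to achieve coarse-grained multimodal alignment of unimodal representations prior to fusion. Specifically, we employ the 2-Wasserstein distance [36], [37], [38] to measure the distance between multivariate Gaussian distributions. Considering two Gaussian distributions as an example, $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$, the 2-Wasserstein distance ($D_{2W}$) between them is:
\[
D_{2W} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\right)^{1/2}\right).
\tag{3}
\]
Furthermore, both $\Sigma_1$ and $\Sigma_2$ are diagonal matrices, which implies that $\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} = \Sigma_1 \Sigma_2$. The formula above can be rewritten as:
\[
D_{2W} = \|\mu_1 - \mu_2\|_2^2 + \|\sigma_1 - \sigma_2\|_2^2,
\tag{4}
\]
where $\sigma$ denotes a standard deviation vector. The distribution representations of the [CLS] tokens from the PDEs following the feature extractors represent the overall unimodal representations. The similarity between text and an image is calculated as:
\[
s(v, w) = a \cdot D_{2W}(v, w) + b,
\tag{5}
\]
where $a$ acts as a negative scale factor, because similarity decreases as distance grows, and $b$ serves as a shift value. Within a batch containing $N$ image-text pairs, we identify $N$ positive matched samples alongside $N(N-1)$ negative samples, employing the InfoNCE loss as follows:
\[
\mathcal{L}_{nce}^{I2T}(i) = -\log \frac{\exp(s(v_i, w_i)/\tau)}{\sum_{n=1}^{N} \exp(s(v_i, w_n)/\tau)}, \qquad
\mathcal{L}_{nce}^{T2I}(i) = -\log \frac{\exp(s(w_i, v_i)/\tau)}{\sum_{n=1}^{N} \exp(s(w_i, v_n)/\tau)},
\tag{6}
\]
where $\tau$ represents a learned temperature parameter. The above expressions are aggregated to form the D-VLC loss $\mathcal{L}_{\text{D-VLC}}$.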
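As an illustration of the coarse-grained alignment described above, the sketch below computes the closed-form 2-Wasserstein term for diagonal Gaussians, maps it to a similarity with a negative scale and a shift, and evaluates a per-sample image-to-text InfoNCE loss. It is a sketch under stated assumptions: the function names are hypothetical, the defaults `a=-0.005` and `b=6.0` are the values reported in the implementation details, and the fixed `tau` is a placeholder for the learned temperature.

```python
import math

def wasserstein2_diag(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + ||sigma1 - sigma2||^2 for diagonal Gaussians.

    sigma1 and sigma2 are standard-deviation vectors. Mathematically this is
    the squared 2-Wasserstein distance between the two distributions.
    """
    mean_term = sum((p - q) ** 2 for p, q in zip(mu1, mu2))
    std_term = sum((p - q) ** 2 for p, q in zip(sigma1, sigma2))
    return mean_term + std_term

def similarity(distance, a=-0.005, b=6.0):
    """Affine map from distance to similarity; a < 0 so closer means more similar."""
    return a * distance + b

def info_nce_i2t(sim_matrix, tau=1.0):
    """Per-image InfoNCE losses; sim_matrix[i][n] is the similarity of image i and text n.

    The matched text for image i is assumed to sit at index i.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / tau for s in row]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_z - logits[i])
    return losses
```

The text-to-image direction follows by transposing `sim_matrix` before calling `info_nce_i2t`.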

2) FINE-GRAINED PRE-TRAINING TASKS
After the cross-modal transformer, fine-grained interaction is enabled on each token across different modalities. We propose two methods to handle the fine-grained multimodal pre-training, which are Distribution-based Masked Language Modeling (D-MLM) and Distribution-based Image-Text Matching (D-ITM). D-MLM requires the model to predict masked words by interpreting the text in conjunction with an image. The conventional Masked Language Modeling task, initially employed as a pre-training task for BERT [39], aims at enhancing contextual modeling capabilities. In the VLP scenario, missing words are reconstructed using information from other features and modalities. Following the configurations of several multimodal models [15], [35], the model selects text tokens with a probability of 15%; of the selected tokens, 80% are replaced by the [MASK] token, 10% are substituted with random words, and the remaining 10% are left unchanged. To predict the masked words, we sample points from the distribution representations, and D-MLM minimizes a Cross-Entropy (CE) loss across the $\mu$ vectors and the other sample point vectors:
\[
\mathcal{L}_{\text{D-MLM}} = \frac{1}{K}\left(\mathrm{CE}(\phi(\mu), y) + \sum_{i=1}^{K-1} \mathrm{CE}(\phi(z^{(i)}), y)\right),
\tag{7}
\]
where $K$ denotes the sample number and $y$ represents the label of the masked word. $\mu$ is the mean vector, whereas $z^{(i)}$ stands for the stochastic sample point vectors; these vectors are subsequently fed into the classifier $\phi$. In the inference phase, the final output is derived by averaging the prediction results of all samples:
\[
P = \frac{1}{K}\left(\phi(\mu) + \sum_{i=1}^{K-1} \phi(z^{(i)})\right).
\tag{8}
\]
D-ITM is a binary classification task that predicts whether an image-text pair is matched or not. In detail, we extract the point vectors from the [CLS] distributions of the multimodal representations, and merge them to generate the results:
\[
\mathcal{L}_{\text{D-ITM}} = \frac{1}{K}\left(\mathrm{CE}(\phi(\mathrm{concat}[v_\mu, w_\mu]), y) + \sum_{i=1}^{K-1} \mathrm{CE}(\phi(\mathrm{concat}[v^{(i)}, w^{(i)}]), y)\right),
\tag{9}
\]
where $v_\mu$ and $w_\mu$ represent the mean vectors of the vision and language [CLS] distributions, respectively, while $v^{(i)}$ and $w^{(i)}$ denote the sampled points. The D-ITM classifier is denoted by $\phi$. The matched image-text pairs serve as positive examples. Negative examples are generated through the random substitution of either images or text descriptions.
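The 15% / 80-10-10 masking scheme used by D-MLM can be sketched as follows. This is an illustrative pure-Python version operating on string tokens; the function name `mask_tokens`, the `vocab` argument, and the use of `None` as the "no prediction" label are assumptions made for the example rather than details from the released code.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted_tokens, labels) for masked language modeling.

    Each token is selected with probability mask_prob. A selected token is
    replaced by [MASK] 80% of the time, by a random vocabulary word 10% of
    the time, and kept unchanged 10% of the time. labels holds the original
    token at selected positions and None elsewhere.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(tok)       # not selected: left as-is, no label
            labels.append(None)
            continue
        labels.append(tok)              # selected: the model must recover tok
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))
        else:
            corrupted.append(tok)
    return corrupted, labels
```

A call such as `mask_tokens(sentence_tokens, vocab, random.Random(0))` yields a reproducible corruption of one sentence; the loss is then computed only at positions whose label is not `None`.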

3) PRE-TRAINING OBJECTIVES
We observe that the random sampling process increases the training difficulty. Training the model solely with the aforementioned losses induces a variance collapse. As all sampled vectors converge to the optimal position, the distribution eventually degenerates into a point, losing the ability to learn multimodal uncertainty. Hence, we apply a regularization loss to prevent the uncertainty level of distributions from being lower than a specified threshold:
\[
\mathcal{L}_{reg} = \max\left(0, \gamma - h(\mathcal{N}(\mu, \sigma^2))\right),
\tag{10}
\]
where $\gamma$ serves as a threshold, modulating the uncertainty levels in the learned distributions. Additionally, the function $h(\mathcal{N}(\mu, \sigma^2))$ quantifies the entropy of multivariate Gaussian distributions, as further detailed below:
\[
h(\mathcal{N}(\mu, \sigma^2)) = \frac{1}{2} \log\left(\det(2\pi e \Sigma)\right),
\tag{11}
\]
where $\Sigma$ denotes the covariance matrix, which is diagonal. Furthermore, with the diagonal vector of $\Sigma$ being $\sigma^2$, (11) can be reformulated as follows:
\[
h(\mathcal{N}(\mu, \sigma^2)) = \frac{d}{2}(1 + \log 2\pi) + \sum_{i=1}^{d} \log \sigma_i,
\tag{12}
\]
where $d$ indicates the feature dimension. We observe that directly sampling from $\mathcal{N}(\mu, \sigma^2)$ prevents gradients from propagating back through the sampling step. To address this, by employing the reparameterization trick [40], we sample a random variable $\epsilon$ from a standard normal distribution, rather than directly sampling from $\mathcal{N}(\mu, \sigma^2)$:
\[
z = \mu + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).
\tag{13}
\]
Following (13), the output $z$ is distributed according to the predicted distributions derived from the PDE. Consequently, we can decouple the calculations of the mean and standard deviation from the sampling operation. This decoupling allows these parameters to be trainable.
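The entropy-based regularizer and the reparameterized sampling step can be sketched in a few lines. The version below assumes a hinge-style penalty that is zero once the entropy reaches the threshold γ, which is one natural reading of "preventing the uncertainty level from being lower than a specified threshold"; the function names are hypothetical, and a real implementation would use an autodiff framework rather than plain floats.

```python
import math
import random

def gaussian_entropy_diag(sigma):
    """Entropy of a diagonal Gaussian: d/2 * (1 + log 2*pi) + sum_i log sigma_i."""
    d = len(sigma)
    return 0.5 * d * (1.0 + math.log(2.0 * math.pi)) + sum(math.log(s) for s in sigma)

def regularization_loss(sigma, gamma):
    """Assumed hinge form: penalize only when the entropy falls below gamma."""
    return max(0.0, gamma - gaussian_entropy_diag(sigma))

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, I), drawn independently per dimension.

    The randomness lives entirely in eps, so mu and sigma stay on a
    deterministic path that gradients could flow through.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

Because the entropy depends only on the standard deviations, shrinking every σ toward zero drives the entropy down and the penalty up, which counteracts the variance collapse described above.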
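In summary, during the pre-training phase, the model executes forward propagation three times in a single step, conducting the tasks D-MLM, D-ITM, and D-VLC in sequence. Thus, the entire pre-training objective is expressed as follows:
\[
\mathcal{L} = \mathcal{L}_{\text{D-MLM}} + \mathcal{L}_{\text{D-ITM}} + \mathcal{L}_{\text{D-VLC}} + \alpha \mathcal{L}_{reg},
\tag{14}
\]
where $\alpha$ denotes the weight of $\mathcal{L}_{reg}$.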

D. FINE-TUNING
For applying our MAP model on the VL downstream tasks, we employ the fine-tuning method as illustrated in Fig. 6 after the pre-training stage.To address various downstream tasks, we construct a basic MLP layer for comprehension tasks.Initially, we extract point vectors from the distribution representations of the [CLS] tokens.Subsequently, we merge point representations from different modalities as overall features to perform classification, implementing a mean pooling operation on all sampled vectors.
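The fine-tuning feature construction described above can be sketched as follows. This is an illustrative version that assumes "merging" means concatenating the vision and language point vectors and that all K points are random draws; whether the mean vector itself counts among the K points is not specified here, and the function names are hypothetical.

```python
import random

def sample_points(mu, sigma, k, rng):
    """Draw k point vectors from a diagonal Gaussian N(mu, sigma^2)."""
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)] for _ in range(k)]

def pooled_multimodal_feature(v_mu, v_sigma, w_mu, w_sigma, k=5, seed=0):
    """Concatenate sampled vision/language [CLS] points and mean-pool over the k samples."""
    rng = random.Random(seed)
    v_samples = sample_points(v_mu, v_sigma, k, rng)
    w_samples = sample_points(w_mu, w_sigma, k, rng)
    merged = [v + w for v, w in zip(v_samples, w_samples)]   # list concatenation
    dim = len(merged[0])
    return [sum(vec[j] for vec in merged) / k for j in range(dim)]
```

The pooled vector would then be passed to the task-specific MLP head; the default `k=5` matches the sample number reported in the implementation details.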

IV. EXPERIMENTAL SETTINGS

A. VL UNDERSTANDING AND GENERATION TASKS
We assess the performance of MAP on the widely recognized Vision-Language (VL) understanding and generation benchmarks, such as video captioning, image-text retrieval, visual question answering, visual reasoning and visual entailment.

1) VIDEO CAPTIONING
In the field of video captioning, MSRVTT [47] stands out with its collection of 10K open-domain video clips, each accompanied by 20 ground-truth captions.Adopting the standard split, our dataset encompasses 6.5K training videos and 2.9K testing videos.We benchmark our method against earlier research using the same test split.In keeping with common evaluation methods [29] in video captioning, we provide detailed comparisons using well-known metrics such as BLEU4, METEOR, ROUGE-L, and CIDEr.

2) IMAGE-TEXT RETRIEVAL
Image-Text retrieval tasks encompass two sub-tasks: Image Retrieval (IR) task and Text Retrieval (TR).Both sub-tasks necessitate the AI system to rank images or text based on the understanding of image-text similarity.We utilize the popular MSCOCO [1] and Flickr30K [48] datasets in the experiments, specifically employing the Karpathy & Fei-Fei 5K MSCOCO test set and the Flickr30K test set, and report the top-K retrieval results.
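For reference, the top-K retrieval metric (Recall@K, reported as TR@K and IR@K) can be computed as in the sketch below. The function name and the representation of ground truth as sets of candidate indices are assumptions made for the example.

```python
def recall_at_k(sim_matrix, ground_truth, k):
    """Percentage of queries with at least one correct candidate in the top k.

    sim_matrix[q][c] is the similarity between query q and candidate c;
    ground_truth[q] is the set of candidate indices that match query q.
    """
    hits = 0
    for q, row in enumerate(sim_matrix):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if any(c in ground_truth[q] for c in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim_matrix)
```

For text retrieval the queries are images and the candidates are captions, so each ground-truth set may contain several indices; for image retrieval the roles are swapped.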

3) VISUAL QUESTION ANSWERING
The objective of the task is to accurately address queries posed in natural language, based on the visual content within provided images. In line with prior work [35], we perform experiments on the VQA2.0 dataset [49], treating the task as a classification problem. For evaluation, we use accuracy as the principal metric.

4) VISUAL REASONING
Within the scope of visual reasoning, the NLVR2 [50] task requires the system to evaluate the consistency between textual descriptions and their corresponding dual-image sets. This dataset encompasses a total of 107,292 instances, each comprising a human-annotated English sentence paired with two photographs. We utilize accuracy as the metric for evaluation.

5) VISUAL ENTAILMENT
Visual Entailment (VE) is a concept that involves pairs of images and sentences, where the premise is established by an image, diverging from the traditional textual entailment tasks that utilize natural language sentences.The SNLI-VE dataset [51] aims to assess the performance of sophisticated VE models by gauging their ability to accurately infer the semantic congruence between the image and the associated text.Accuracy is employed as the evaluation metric.

B. BASELINES
In our experiments for image-text tasks, our model is compared with an array of SOTA VLP baselines, including but not limited to ALBEF [28] and METER [35]. For a fair comparison, we follow the definition of model size in [32] for classification. In detail, considering model parameter efficiency, VLP models can be categorized into at least two distinct tiers: Base and Large. (1) "Base" corresponds to VLP models with a similar size to BERT-Base [39]. (2) "Large" corresponds to VLP models with a similar size to BERT-Large. Furthermore, we summarize all referenced VLP models with their model sizes and pre-training datasets in Table 1.
For video-based tasks, we benchmark our approach against existing SOTA methods, such as POS+VCT [57], STG-KD [58], ORG-TRL [59], OpenBook [60], and SwinBERT [29]. In the video-related experiments, we follow the same training procedure as the SOTA methods for a fair evaluation. In detail, the model parameters of SWINPDE are initialized randomly. Subsequent training occurs on the training set and evaluation is conducted on the test split.

C. PRE-TRAINING DATASETS
Our pre-training datasets comprise MSCOCO [1], Visual Genome (VG) [54], SBU [55] and Conceptual Captions (CC-3M) [56]. Moreover, Table 2 provides statistics on the images and text contained in the pre-training datasets of all referenced models. These datasets are assembled from various public sources. However, some image URLs are unavailable, potentially leading to a lower number of images than initially estimated. During the pre-processing phase, we standardize each image to a size of 288 × 288 pixels.

D. IMPLEMENTATION DETAILS
Following a widely-used setting [35], we set the hidden feature size to 768, the head number to 12 in the MHA operation, and the layer number ($N_L$) of the cross-modal transformer to 6. For data processing, each image is resized and cropped to 384 × 384, with the image patch size set to 16. For text, we set the maximum token length to 50. In the PDE module, the default activation function ("Act") in (1) is Softmax and the head number $k$ is set to 6. In all experiments, the AdamW optimizer is employed, with the learning rate first warmed up and then linearly decayed. In the process of extracting point vectors from distribution representations, we set the sample number $K$ to 5. The experiments are conducted on 8 NVIDIA A100 GPUs.
In the pre-training phase, our MAP is pre-trained with D-MLM, D-ITM, and D-VLC. In detail, $a$ is set to −0.005 and $b$ is set to 6 in (5) of the D-VLC task. For the regularization loss of distributions in (10), we set the threshold $\gamma$ to 300. In (14) of the full loss, $\alpha$ is 0.01. The model undergoes 100K pre-training steps, utilizing a batch size of 4,096. The learning rate of the feature extractors is set to 1e-5, while that of the cross-modal transformer and the PDE is set to 5e-5.
In the fine-tuning stage, MAP is trained for 10 epochs, with the learning rates of the feature extractors, cross-modal transformer, and PDE set to 5e-6, 2.5e-5, and 2e-4, respectively. In the video-based tasks, SWINPDE is trained on the dataset for 15 epochs, using a learning rate of 3e-4. Adopting a similar principal architecture to SwinBERT [29], we utilize VidSwin as the encoder for video data and incorporate our PDE to learn the uncertainty. After random sampling from the distribution representations, we engage the same multimodal transformer to decode the captions. For each video in the dataset, random cropping is executed on all frames, consistently targeting the same spatial coordinates to extract a 224 × 224 region.
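The warm-up followed by linear decay mentioned in the implementation details can be written as a simple function of the training step. This is a minimal sketch: the function name is hypothetical and the number of warm-up steps is not given in the text, so the 10% figure in the usage note is purely an illustrative assumption.

```python
def learning_rate(step, total_steps, peak_lr, warmup_steps):
    """Linear warm-up from 0 to peak_lr, then linear decay back to 0.

    Requires 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))
```

For example, with the 100K pre-training steps and the 5e-5 peak rate of the cross-modal transformer, and an assumed 10K warm-up steps, `learning_rate(55000, 100000, 5e-5, 10000)` sits halfway down the decay ramp at half the peak rate.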

V. RESULTS AND ANALYSIS

A. RESULTS ON VL UNDERSTANDING AND GENERATION TASKS

1) RESULTS FOR RANDOMLY INITIALIZED MAP IN VQA2.0
To assess the performance of MAP without the influence of additional data, we compare it with existing methods in the VQA2.0 task for vision-language understanding. Additionally, we aim to understand the effectiveness of the PDE. As revealed in Table 3, MAP performs better than all other methods that do not utilize extra data, achieving SOTA results on the VQA2.0 task. These findings suggest that PDE effectively incorporates multimodal uncertainty into the models, even in the absence of large-scale pre-training datasets. Such outcomes lend further support to the effectiveness and generalizability of our distribution representation modeling approach.

FIGURE 8. An additional example with some images and captions of "chef", "kitchen", "person", "bike" and so on. The samples are from MSCOCO.

generation of diverse captions. It further validates that our PDE is capable of grasping uncertainty nuances, and distribution representation effectively models this complexity.

3) EVALUATION ON IMAGE-TEXT RETRIEVAL
As shown in Table 5, our MAP model demonstrates superior performance on the MSCOCO dataset, consistently surpassing competing models in all evaluation metrics. On the Flickr30K dataset, the model consistently ranks in either the first or second position across the reported metrics. This is particularly noteworthy, considering that ALBEF, a model specifically designed for retrieval tasks, falls short in comparison to our MAP model. Despite ALBEF's large-scale pre-training with 14M images, our MAP model surpasses it across all metrics on the MSCOCO retrieval task. This performance shows the effectiveness of our approach to uncertainty modeling. On the Flickr30K dataset, our MAP either achieves the best performance or is only about 0.2 points behind the best score. This slight discrepancy could be attributed to the relatively smaller sample size of Flickr30K compared to other datasets, which might render the model more susceptible to overfitting.
We next compare with PCME [4], which also employs probabilistic distribution representations for retrieval tasks. On the MSCOCO dataset, PCME scores 44.2/31.9 on TR@1/IR@1, while our MAP model achieves 79.3/60.9. PCME, which adopts a dual-tower architecture for retrieval, employs a soft contrastive loss with sampled points from distributions. In contrast, our contrastive loss on MAP is based on the 2-Wasserstein distance, a measure that operates directly on the distributions. This further underscores the effectiveness of our approach to uncertainty modeling.

B. QUALITATIVE RESULTS
Upon deriving the distribution representations from the PDE, we conduct a series of 2D illustrative experiments employing clustering algorithms in Sec. V-B1 and Sec. V-B2. First, we employ the pre-trained MAP to encode images and text into distribution representations. After that, these illustrative experiments are performed to uncover non-linear relationships within the high-dimensional embeddings. Specifically, we separately examine the $\mu$ and $\sigma^2$ representations in the experiments, with each experiment evaluating over a thousand image-text pairs.
In Sec. V-B3, we examine the impact of PDE on the results generated by MAP in the VQA2.0 task. Specifically, we show the results output by the models. For models incorporating PDE, we perform three samplings from the predicted distributions and subsequently convert these embeddings back into natural language.

1) VISUALIZATION ON THE DISTRIBUTION REPRESENTATIONS
We focus on the visualization of distribution representations generated by the pre-trained MAP model. Fig. 7 presents the characteristics of distribution representations, showing that distributions with similar semantic meanings tend to cluster. Both the geometry of the images and the textual descriptions manifest congruent characteristics. The completeness of the enclosing ellipses signifies a robust semantic coverage across both visual and textual elements. For example, owing to the part-whole relationship between images ''4'' and ''8'', the area under ellipse ''8'' nearly subsumes that of ellipse ''4''. The intersection of the ellipses for images ''0'', ''4'', and ''8'' and their captions indicates several shared themes (such as ''a young boy'') in both visual and textual data. Additional similar behavioral patterns of our MAP model are presented in Fig. 8. The qualitative findings affirm that the model's uncertainty modeling effectively encapsulates complex semantic relationships and rich contextual nuances.
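As a worked illustration of how a 2D Gaussian maps to a 95% confidence ellipse, the sketch below computes the semi-axis lengths for an axis-aligned (diagonal-covariance) Gaussian. It relies on the fact that the chi-square quantile with two degrees of freedom has the closed form -2 ln(1 - p). The function name and the axis-aligned assumption are illustrative and do not describe the plotting code behind Fig. 7.

```python
import math

def confidence_ellipse_axes(var_x: float, var_y: float, confidence: float = 0.95):
    """Semi-axis lengths of the confidence ellipse of a diagonal 2D Gaussian."""
    # Chi-square quantile for 2 degrees of freedom: -2 * ln(1 - p).
    scale = -2.0 * math.log(1.0 - confidence)
    return math.sqrt(scale * var_x), math.sqrt(scale * var_y)

# A unit-variance Gaussian gives a circle of radius about 2.45.
axis_x, axis_y = confidence_ellipse_axes(1.0, 1.0)
# A larger variance along x stretches the ellipse along that axis.
wide_x, wide_y = confidence_ellipse_axes(4.0, 1.0)
```

Under this construction, a distribution with larger variances yields a larger ellipse, which is how one ellipse can nearly subsume another when the means are close.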

2) COMPARATIVE VISUALIZATION OF POINT REPRESENTATIONS AND DISTRIBUTION REPRESENTATIONS
In our study, we aim to explore the differences between the representations generated by our proposed method and those produced by ALBEF [28], a well-known method in the field of multimodal representation learning. As shown in Fig. 9, we follow the same method and visualize the features of the same image-sentence pairs for ALBEF (4M). Compared to ALBEF, our method has the advantage of capturing the rich semantics and diverse concepts embedded in the image-sentence pairs. Our method effectively captures intra-modal and inter-modal uncertainty through distributions, as reflected in the distributions' characteristics within the representation space, such as shapes and positions. This capability is crucial for many downstream tasks like visual question answering, which require a nuanced understanding of both visual and textual content.

3) CASES FOR DIVERSE PREDICTIONS
As illustrated in Fig. 10, we explore the advantages of distribution representations, which allow for a diverse range of prediction results. In multimodal tasks, semantic uncertainty is a prevalent challenge. In multimodal understanding tasks such as VQA, a notable benefit of uncertainty modeling is the ability to sample multiple predictions from distribution representations, thereby yielding diversity. Take Case 3 in Fig. 10 for instance, where MAP furnishes multiple plausible answers (field, park, and grass), closely mirroring real-world scenarios. On the other hand, the point representations from MAP without PDE invariably generate a single answer, overlooking other possible descriptions. Moreover, the distribution representations extend their utility to other multimodal tasks like video captioning, enabling the generation of diverse and fitting descriptions. This benefit arises from the diversity created by uncertainty modeling.

C. ABLATION STUDIES

1) IMPACT OF PROBABILITY DISTRIBUTION REPRESENTATIONS ON VL TASKS
As illustrated in Table 7, applying PDE improves performance in various VL downstream tasks. Regardless of whether the model is initialized with random or pre-trained weights, distribution representations consistently outperform point representations in terms of VL understanding. The superior performance of distribution representations can be attributed to their ability to capture multimodal uncertainties, thereby conveying a more nuanced semantic understanding.

2) DESIGN CONSIDERATIONS FOR PDE
As illustrated in Table 8, we investigate the influence of various designs on the performance of the PDE. When the sequence-level interaction in PDE is eliminated, we refer to the result as ''MLP only'' (MultiLayer Perceptron), a prevalent approach in previous studies [3], [4], [9]. Our PDE (Softmax) exceeds the performance of the ''MLP only'' approach on VQA2.0, thereby gaining an advantage from the sequence-level information. To explore the impact of the structure on the outcomes, we consider several candidate activation functions: ReLU, ReLU², Sigmoid, and Softmax. ReLU indicates the activation status of the relationship between tokens, while ReLU² enhances ReLU by being differentiable. We note that ''MLP only'' surpasses ReLU and ReLU², illustrating the importance of sequence-level interaction design. Sigmoid maps input data to a range between 0 and 1, smoothly assigning weights among different tokens. Lastly, Softmax surpasses the others on VQA2.0, indicating that Softmax is apt for expressing the correlation between tokens. As a result, we select Softmax as the primary activation function.
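The candidate activations differ in how they turn token-to-token scores into interaction weights. The following sketch applies each of them to a hypothetical score matrix. The function name `token_weights` and the toy scores are assumptions for illustration only and do not reproduce Equation 1 or the PDE implementation.

```python
import numpy as np

def token_weights(scores, act: str = "softmax") -> np.ndarray:
    """Map raw token-to-token scores to interaction weights."""
    scores = np.asarray(scores, dtype=float)
    if act == "relu":
        return np.maximum(scores, 0.0)
    if act == "relu2":
        return np.maximum(scores, 0.0) ** 2
    if act == "sigmoid":
        return 1.0 / (1.0 + np.exp(-scores))
    if act == "softmax":
        # Subtract the row maximum for numerical stability.
        shifted = scores - scores.max(axis=-1, keepdims=True)
        exp = np.exp(shifted)
        return exp / exp.sum(axis=-1, keepdims=True)
    raise ValueError(f"unknown activation: {act}")

# Hypothetical scores of one token against three other tokens.
toy_scores = np.array([[-1.0, 0.0, 2.0]])
weights = {
    name: token_weights(toy_scores, name)
    for name in ("relu", "relu2", "sigmoid", "softmax")
}
```

Only Softmax normalizes the weights of each row to sum to one, so tokens compete for a fixed budget. ReLU and ReLU² zero out negative scores, and Sigmoid weighs each token independently.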

3) EVALUATING THE ROLE OF DIFFERENT PRE-TRAINING STRATEGIES
Table 9 shows the influence of different pre-training tasks on the performance of VL downstream tasks. Removing D-MLM pre-training results in the lowest performance among all pre-training strategies, underscoring the crucial role of D-MLM in pre-training. Additionally, both D-VLC and D-ITM aid the model in understanding the semantic similarity between different modalities. Concerning specific tasks, D-VLC yields more substantial improvements on VQA2.0, while D-ITM proves more effective in enhancing performance on SNLI-VE and NLVR2.

4) ANALYZING THE EFFICACY ACROSS DIFFERENT LAYER ARCHITECTURES OF THE CROSS-MODAL TRANSFORMER
As shown in Table 10, we explore the influence of layer count on the VQA2.0 task, considering both random initialization and pre-training with D-MLM, D-ITM, and D-VLC on the MSCOCO dataset. With random initialization, a model with six layers demonstrates optimal performance, albeit hitting a performance plateau. With pre-training, the eight-layer model surpasses its six-layer counterpart, indicating that pre-training aids in overcoming the bottleneck posed by parameter limitations. This improvement is likely attributable to large-scale data pre-training mitigating over-fitting when more parameters are involved. Additionally, the benefits of pre-training diminish as the number of layers decreases, owing to the model's constrained learning capacity.

VI. CONCLUSION
In this study, we focus on quantifying the multimodal uncertainties associated with real-world objects via probabilistic modeling. Leveraging both sequence-level and feature-level interactions, we introduce a Probability Distribution Encoder (PDE) designed to obtain distributional representations across various modalities. To facilitate its application, PDE integrates seamlessly into well-established vision-language models, as in SWINPDE. Our qualitative results highlight the advantages of employing distribution representations over point representations, particularly in enhancing semantic expressiveness and producing diverse predictions once uncertainties are learned. To exploit large-scale unlabeled data for multimodal uncertainty learning, we formulate three new pre-training tasks: D-MLM, D-ITM, and D-VLC. Moreover, we present an end-to-end Multimodal uncertainty-Aware vision-language Pre-training model (MAP) aimed at acquiring generic distributional representations. Empirical evidence suggests that these distribution representations significantly contribute to the performance in vision-language understanding and generation tasks. Our models and methods set new benchmarks, achieving SOTA results on multiple datasets and tasks.

FIGURE 2. Qualitative examples presenting the effectiveness of our PDE-based framework (SWINPDE) in generating captions. The produced captions maintain semantic coherence and offer a variety of expressions that accurately describe the video content.

new tasks: Distribution-based Masked Language Modeling (D-MLM), Distribution-based Image-Text Matching (D-ITM), and Distribution-based Vision-Language Contrastive Learning (D-VLC) pre-training strategies. We construct new objectives and computational processes for these multimodal pre-training tasks, ensuring they adapt effectively to the distribution-based representations. Following fine-grained interactions, D-ITM and D-MLM are deployed for overall-level and token-level alignment of images and text. Moreover, D-VLC addresses coarse-grained multimodal alignment by measuring entire distributions to align representations across modalities.

FIGURE 3. Structure of the Probability Distribution Encoder (PDE) module.

FIGURE 4. MAP's pre-training architecture and objectives, utilizing PDE to model representations as multivariate Gaussian Distributions (GD). ''N_L'' denotes the cross-modal transformer layer count. An illustrated example with a two-dimensional GD is provided. ''Rep.'' indicates representations.

FIGURE 6. Fine-tuning our MAP on the VL downstream tasks. ''N_L'' is the number of layers in the cross-modal transformer. ''Rep.'' indicates representations.

FIGURE 7. Visualization of the distribution representations from pre-trained MAP. The images and related captions come from the MSCOCO dataset. Each 2D Gaussian distribution is represented as an ellipse with 95% confidence. The labels of images and related captions are in the legend.

FIGURE 9. Visualization analysis of distribution representations and point representations. The image-text pairs are from the MSCOCO dataset.

FIGURE 10. Predictions sampled from the distribution representations. The samples come from the VQA2.0 dataset.

From a quantitative view, pre-training on unlabeled data clearly benefits our MAP model, which outperforms PCME by a large margin. This comparative analysis further demonstrates the robust performance of our MAP model.

4) EVALUATION ON VQA2.0, NLVR2, AND SNLI-VE
As shown in Table 6, our MAP outperforms the previous SOTA models in Group 1. For example, compared to VLMo-Base, MAP improves by 0.53 points on the NLVR2 dev set. Furthermore, our model achieves a performance boost of +0.35 points on the VQA2.0 test-dev set and +0.54 points on the SNLI-VE validation set over METER. It is noteworthy that MAP consistently outperforms SimVLM-Base, which was trained with 1.8 billion pre-training images, across all tasks.

TABLE 1. Baseline models, with complexity measured in parameters and data scale measured in pre-training images.

TABLE 2. Details of pre-training datasets in Table 1.

TABLE 3. Evaluation on VQA2.0 of models with random initialization. Best scores are in bold.

TABLE 4. Evaluation on MSRVTT of models with random initialization. Best scores are in bold.

TABLE 5. An overall image-text retrieval SOTA comparison, with best scores in bold and second-best scores underlined.

TABLE 6. A comparative analysis with SOTA models on visual question answering, visual reasoning, and visual entailment. The highest scores are in bold, and the second-highest scores are underlined.

Table 4 offers a comprehensive comparison of performance metrics on the MSRVTT dataset, where SWINPDE leads among competitive models. Specifically, SWINPDE achieves a large improvement (+2.8) in BLEU4 score over SWIN-BERT. This outcome shows that our PDE, through the use of distribution representations for visual information, successfully incorporates uncertainty factors, leading to the

TABLE 7. The effectiveness of probability distribution representations on VL downstream tasks. For ''MAP w/o PDE'', we train a new model without PDE to conduct the experiments. Pre-training methods for MAP: D-MLM, D-ITM. Pre-training methods for MAP w/o PDE: MLM, ITM.

TABLE 8. Effect of different structures of PDE. We explore different designs of ''Act'' in Equation 1. ''Normal'' denotes the normalization operation.

TABLE 9. The effect of distribution-based pre-training tasks. We pre-train the model on the MSCOCO dataset.

TABLE 10. The effect of different layer numbers in the cross-modal transformer on VQA2.0.