I. Introduction
Multimodal Large Language Models (MM-LLMs) such as Sora have become some of the most prominent AI achievements of recent years. Although deep neural networks have achieved excellent performance on individual language or vision tasks, there has long been a lack of systematic models for multimodal tasks. For example, in an image understanding task, a purely visual perception algorithm must first extract information such as image type and color, and a text generation model must then produce the corresponding description. As a result, multimodal models have required a great deal of manual representation engineering, which hinders the further expansion of multimodal applications. In 2021, OpenAI published the CLIP (Contrastive Language-Image Pretraining) model [1]. While CLIP was initially known for its zero-shot learning capability, it has shown great capacity in subsequent downstream tasks such as object detection, object segmentation, image generation, and video action recognition. This demonstrates that CLIP's greatest contribution is breaking the shackles of a fixed set of labels and allowing more flexible inference in downstream tasks. Diffusion models, GPT models, and Sora are typical models built on CLIP. Figure 1 shows CLIP-based applications: image generation (left) and stick-figure style generation (right).
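To make the label-free inference described above concrete, the following minimal sketch performs zero-shot classification with a publicly released CLIP checkpoint via the Hugging Face transformers library. The checkpoint name, image path, and candidate labels are illustrative assumptions on our part, not prescribed by [1]; the point is that the label set is supplied as free-form text at inference time rather than fixed during training.

```python
# Minimal zero-shot classification sketch with CLIP (illustrative;
# checkpoint, image path, and labels are assumptions, not part of [1]).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
# Any set of natural-language labels can be supplied at inference time;
# CLIP is not tied to a fixed label vocabulary.
labels = ["a photo of a cat", "a photo of a dog", "a stick-figure drawing"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# Image-text similarity scores, converted to a probability over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Because the candidate labels are ordinary text prompts, downstream tasks can redefine them freely, which is the flexibility that later CLIP-based systems exploit.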