
Understanding and Mitigating the Soft Error of Contrastive Language-Image Pre-training Models


Abstract:

In recent years, MultiModal Large Language Models (MM-LLMs) based on Contrastive Language-Image Pre-training (CLIP) have achieved state-of-the-art results in many fields. CLIP bridges the gap between language models and image models, realizes zero-shot image classification, and performs excellently in tasks such as text-to-image generation, image style transfer, and long video generation. However, there are few studies on the fault tolerance of CLIP under soft errors, which hinders the deployment of multimodal large models in safety-critical fields. Based on an analysis of the fault tolerance of common multimodal large models, we propose a soft error mitigation framework. The experiments in this paper show that the framework can effectively detect and mitigate soft errors.
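As background for the fault-tolerance question raised above, the sketch below shows how a soft error is commonly emulated in DNN reliability studies: a single bit flip injected into one randomly chosen 32-bit weight of a PyTorch model. The helper names and the injection policy are illustrative assumptions and are not the mitigation framework proposed in this paper.

```python
# Minimal sketch (assumed, not the paper's framework): emulate a soft error
# by flipping a single bit of one randomly selected float32 parameter.
import random
import struct

import torch


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of an IEEE-754 single-precision value."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def inject_single_bit_fault(model: torch.nn.Module, seed: int = 0) -> None:
    """In place, flip a random bit in a random float32 parameter of `model`."""
    rng = random.Random(seed)
    params = [p for p in model.parameters() if p.dtype == torch.float32]
    target = rng.choice(params)
    flat = target.data.view(-1)
    index = rng.randrange(flat.numel())
    bit = rng.randrange(32)  # flips in high exponent bits tend to be the most damaging
    flat[index] = flip_bit(flat[index].item(), bit)
```

Comparing a model's outputs before and after such an injection is a common way to estimate how vulnerable its parameters are to soft errors.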
Date of Conference: 18-20 August 2024
Date Added to IEEE Xplore: 10 September 2024
Conference Location: Changsha, China

I. Introduction

MultiModal Large Language Models (MM-LLMs) such as Sora have become some of the most prominent AI achievements of recent years. Although DNNs achieve excellent results on separate language or image tasks, systematic models for multimodal tasks have long been lacking. For example, in an image understanding task, we need pure visual perception algorithms to extract information such as image type and color, and then a text generation model to produce the corresponding text. This means that multimodal models require a great deal of manual representation engineering, which hinders the further expansion of multimodal applications. In 2021, OpenAI published the CLIP (Contrastive Language-Image Pre-training) model [1]. While CLIP was initially known for its zero-shot learning capability, it has shown great capacity in subsequent downstream tasks such as object detection, object segmentation, image generation, and video action recognition. This demonstrates that CLIP's greatest contribution is breaking the shackles of a fixed set of labels and allowing more flexible inference in downstream tasks. Diffusion models, GPT models, and Sora are typical models built on CLIP. Figure 1 shows CLIP-based applications: image generation (left) and stick-figure style generation (right).
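To make the zero-shot inference that CLIP enables concrete, the following sketch classifies an image against free-form text labels with the publicly released openai/clip-vit-base-patch32 checkpoint through the Hugging Face transformers API; the checkpoint name, candidate labels, and image path are illustrative assumptions, not artifacts of this paper.

```python
# Minimal sketch of CLIP zero-shot image classification, assuming the
# Hugging Face `transformers` implementation of CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any RGB image (placeholder path)
labels = ["a photo of a cat", "a photo of a dog", "a stick-figure drawing"]

# Encode the image and all candidate captions, then compare them in the
# shared embedding space; no label-specific training is required.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarity -> probabilities
print({label: round(p.item(), 3) for label, p in zip(labels, probs[0])})
```

Because the label set is just a list of strings, the same model can be pointed at a new classification task simply by rewriting the captions, which is the flexibility the paragraph above refers to.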
