I. Introduction
Multimodal Large Language Models (MM-LLMs) such as Sora have become some of the most prominent AI achievements of recent years. Although deep neural networks have achieved excellent performance on individual language or vision tasks, there has long been a lack of systematic models for multimodal tasks. For example, in an image understanding task, a purely visual perception algorithm must first extract information such as image type and color, and a text generation model must then produce the corresponding description. As a result, multimodal models have required a great deal of manual representation engineering, which hinders the further expansion of multimodal applications. In 2021, OpenAI published the CLIP (Contrastive Language-Image Pretraining) model [1]. While CLIP was initially known for its zero-shot learning capability, it has shown great capacity in subsequent downstream tasks such as object detection, object segmentation, image generation, and video action recognition. This demonstrates that CLIP's greatest contribution is breaking the shackles of a fixed set of labels and allowing more flexible inference in downstream tasks. Diffusion models, GPT models, and Sora are typical models built on CLIP. Figure 1 shows CLIP-based applications: image generation (left) and stick-figure style generation (right).
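To make the label-free inference described above concrete, the following minimal sketch performs zero-shot classification with a publicly released CLIP checkpoint via the Hugging Face transformers library. The checkpoint name, image path, and candidate labels are illustrative assumptions on our part, not prescribed by [1]; the point is that the label set is supplied as free-form text at inference time rather than fixed during training.

```python
# Minimal zero-shot classification sketch with CLIP (illustrative;
# checkpoint, image path, and labels are assumptions, not part of [1]).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
# Any set of natural-language labels can be supplied at inference time;
# CLIP is not tied to a fixed label vocabulary.
labels = ["a photo of a cat", "a photo of a dog", "a stick-figure drawing"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# Image-text similarity scores, converted to a probability over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Because the candidate labels are ordinary text prompts, downstream tasks can redefine them freely, which is the flexibility that later CLIP-based systems exploit.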