I. Introduction
Foundation models [1] are becoming increasingly important in artificial intelligence (AI). Compared with small, specialized models tailored to specific tasks or domains, “one-for-all” general-purpose foundation models typically exhibit superior capability and generalization across a wide range of downstream tasks. Numerous foundation models have emerged in recent years, such as SimCLR [2], the masked autoencoder (MAE) [3], and the Segment Anything Model (SAM) [4] for computer vision; Bidirectional Encoder Representations from Transformers (BERT) [5] and the generative pre-trained transformer (GPT) series [6], [7] for natural language processing; as well as contrastive language-image pre-training (CLIP) [8] and Flamingo [9] for vision-language tasks.