I. Introduction
Benefiting from large models and large-scale datasets in NLP, researchers have found that large models [1], [2], [3], [4], [5] exhibit a crucial emergent ability: in-context learning. The purpose of in-context learning is to help the model comprehend new tasks and make predictions for new examples based on a provided prompt. Typically, the prompt is a concise, structured input that provides context for the task, such as a task description or an example input-label pair. Although in-context learning is well established in NLP [6], [7], [8], it has only just begun to be explored in vision. Visual in-context learning is becoming increasingly important in computer vision, particularly with the rise of large-scale models. Although these models achieve impressive results on many tasks [9], [10], they often require huge amounts of data and computation to train, making them impractical for many real-world applications. As such, visual in-context learning matters for developing more efficient and accurate computer vision systems that can operate in real-world settings. However, research in this area remains relatively limited, so we concentrate on visual in-context learning and carry out preliminary studies.