
Exploring Effective Factors for Improving Visual In-Context Learning


Abstract:

In-Context Learning (ICL) aims to understand a new task from a few demonstrations (i.e., a prompt) and to predict new inputs without tuning the model. While it has been widely studied in NLP, it is still a relatively new area of research in computer vision. To reveal the factors influencing the performance of visual in-context learning, this paper shows that Prompt Selection and Prompt Fusion are two major factors that directly impact inference performance. Prompt selection is the process of choosing the most suitable prompt for the query image; it is crucial because high-quality prompts help large-scale vision models comprehend new tasks rapidly and accurately. Prompt fusion combines the prompt and the query image to activate knowledge stored within large-scale vision models; however, changing the fusion method significantly affects performance on new tasks. Based on these findings, we propose a simple framework, prompt-SelF, to improve visual in-context learning. Specifically, we first use pixel-level retrieval to select a suitable prompt, then apply different prompt fusion methods to activate the diverse knowledge stored in the large-scale vision model, and finally ensemble the predictions obtained from the different fusion methods to produce the final result. We conduct extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, prompt-SelF outperforms the meta-learning-based OSLSM method on 1-shot segmentation for the first time, indicating the great potential of visual in-context learning. The source code and models will be available at https://github.com/syp2ysy/prompt-SelF.
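
As a rough illustration of the pipeline described above, the PyTorch-style sketch below strings the three steps together: pixel-level retrieval for prompt selection, prompt fusion under several spatial arrangements, and ensembling of the per-arrangement predictions. The function names, the cosine-similarity retrieval variant, and the (arrange, realign) callable interface are assumptions made for illustration only; the authors' actual implementation is in the repository linked above.

```python
import torch
import torch.nn.functional as F


def pixel_level_retrieval(query_feat, candidate_feats):
    """Score each candidate prompt by the cosine similarity between its
    flattened spatial feature map and the query's, and return the index
    of the best-matching candidate (one plausible retrieval variant)."""
    q = F.normalize(query_feat.flatten(), dim=0)
    scores = [torch.dot(q, F.normalize(c.flatten(), dim=0)) for c in candidate_feats]
    return int(torch.stack(scores).argmax())


@torch.no_grad()
def prompt_self(model, extract_feat, query_img, candidates, arrangements):
    """prompt-SelF-style inference sketch: (1) select one (image, label)
    prompt pair by pixel-level retrieval, (2) fuse it with the query under
    several spatial arrangements, (3) ensemble the resulting predictions."""
    q_feat = extract_feat(query_img)
    idx = pixel_level_retrieval(q_feat, [extract_feat(img) for img, _ in candidates])
    prompt_img, prompt_label = candidates[idx]

    preds = []
    for arrange, realign in arrangements:   # e.g. a 2x2 canvas and its flipped variants
        canvas = arrange(prompt_img, prompt_label, query_img)
        pred = model(canvas)                # frozen inpainting-style vision model
        preds.append(realign(pred))         # map the prediction back to query coordinates
    return torch.stack(preds).mean(dim=0)   # simple averaging ensemble
```

Here the per-arrangement outputs are simply averaged; any other ensembling rule could be substituted without changing the overall structure.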
Published in: IEEE Transactions on Image Processing (Volume: 34)
Page(s): 2147 - 2160
Date of Publication: 31 March 2025

PubMed ID: 40168207

I. Introduction

Benefiting from large models and large-scale datasets in NLP, researchers have realized that large models [1], [2], [3], [4], [5] possess a crucial emergent ability: in-context learning. The purpose of in-context learning is to help a model comprehend a new task and make predictions for new examples based on a provided prompt. Typically, the prompt is a concise, structured input that provides context for the task, such as a task description or an example input-label pair. While in-context learning is a well-established topic in NLP [6], [7], [8], it has only recently emerged in the field of vision. Indeed, visual in-context learning is becoming increasingly important in computer vision, particularly with the rise of large-scale models. Although these models achieve impressive results on many tasks [9], [10], they often require huge amounts of data and computation to train, making them impractical for many real-world applications. As such, visual in-context learning is key to developing more efficient and accurate computer vision systems that can operate in real-world settings. However, research in this area is still relatively limited, so we concentrate on visual in-context learning and carry out preliminary studies.
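
To make the notion of a visual prompt concrete, the following minimal sketch shows one common way an input-label example and a query image are fused into a single canvas for an inpainting-style model, in the spirit of the image-inpainting formulation of visual in-context learning [11], [12]. The helper name, tensor shapes, and 2x2 layout are illustrative assumptions rather than the exact construction used by any particular method.

```python
import torch


def build_icl_canvas(prompt_img, prompt_label, query_img):
    """Arrange one in-context example and the query as a 2x2 grid image:
    top row = (prompt image, prompt label map), bottom row = (query image,
    masked cell). An inpainting-style model is asked to fill the masked
    cell, and the filled content is read out as the query's prediction."""
    c, h, w = query_img.shape                           # all inputs assumed (C, H, W)
    masked = torch.zeros_like(query_img)                # the cell to be inpainted
    top = torch.cat([prompt_img, prompt_label], dim=2)  # (C, H, 2W)
    bottom = torch.cat([query_img, masked], dim=2)      # (C, H, 2W)
    canvas = torch.cat([top, bottom], dim=1)            # (C, 2H, 2W)
    query_box = (h, 2 * h, w, 2 * w)                    # bottom-right quadrant to read out
    return canvas, query_box
```

How the four cells are arranged on the canvas is the kind of design choice this paper studies under the name prompt fusion.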

References
[1] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” OpenAI, San Francisco, CA, USA, Tech. Rep., 2018.
[2] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI blog, San Francisco, CA, USA, Tech. Rep., 2019.
[3] T. B. Brown et al., “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1877–1901.
[4] M. Peters et al., “Deep contextualized word representations,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Human Lang. Technol., 2018, pp. 2227–2237.
[5] Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” 2019, arXiv:1907.11692.
[6] S. Garg, D. Tsipras, P. Liang, and G. Valiant, “What can transformers learn in-context? A case study of simple function classes,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 30583–30598.
[7] S. M. Xie, A. Raghunathan, P. Liang, and T. Ma, “An explanation of in-context learning as implicit Bayesian inference,” in Proc. Int. Conf. Learn. Represent., 2021.
[8] H. Liu et al., “Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 1950–1965.
[9] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 16000–16009.
[10] Z. Li, Y. Sun, L. Zhang, and J. Tang, “CTNet: Context-based tandem network for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 12, pp. 9904–9917, Dec. 2022.
[11] X. Wang, W. Wang, Y. Cao, C. Shen, and T. Huang, “Images speak in images: A generalist painter for in-context visual learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 6830–6839.
[12] A. Bar, Y. Gandelsman, T. Darrell, A. Globerson, and A. A. Efros, “Visual prompting via image inpainting,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 25005–25017.
[13] O. Rubin, J. Herzig, and J. Berant, “Learning to retrieve prompts for in-context learning,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Human Lang. Technol., 2022, pp. 2655–2671.
[14] S. Min et al., “Rethinking the role of demonstrations: What makes in-context learning work?” in Proc. Conf. Empirical Methods Natural Lang. Process., 2022, pp. 11048–11064.
[15] S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi, “MetaICL: Learning to learn in context,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Human Lang. Technol., 2022, pp. 2791–2809.
[16] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen, “What makes good in-context examples for GPT-3?” in Proc. DeeLIO Workshop, 2022, pp. 100–114.
[17] S. Min, M. Lewis, H. Hajishirzi, and L. Zettlemoyer, “Noisy channel language model prompting for few-shot text classification,” in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics, 2022, pp. 5316–5330.
[18] D.-H. Lee, “Good examples make a faster learner: Simple demonstration-based learning for low-resource NER,” in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics, 2022, pp. 2687–2700.
[19] D. Zhou et al., “Least-to-most prompting enables complex reasoning in large language models,” in Proc. Int. Conf. Learn. Represent., 2022.
[20] Y. Zhang, K. Zhou, and Z. Liu, “What makes good examples for visual in-context learning?” in Proc. Adv. Neural Inf. Process. Syst., 2023, pp. 17773–17794.
[21] X. Chen et al., “Context autoencoder for self-supervised representation learning,” Int. J. Comput. Vis., vol. 132, no. 1, pp. 208–223, Jan. 2024.
[22] H. Bao, L. Dong, S. Piao, and F. Wei, “BEiT: BERT pre-training of image transformers,” in Proc. Int. Conf. Learn. Represent., 2021.
[23] A. Lewkowycz et al., “Solving quantitative reasoning problems with language models,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 3843–3857.
[24] L. Ouyang et al., “Training language models to follow instructions with human feedback,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 27730–27744.
[25] M. Chen et al., “Improving in-context few-shot learning via self-supervised training,” in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2022, pp. 3558–3573.
[26] Y. Chen, R. Zhong, S. Zha, G. Karypis, and H. He, “Meta-learning via language model in-context tuning,” in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics, 2022, pp. 719–730.
[27] X. Garcia and O. Firat, “Using natural language prompts for machine translation,” 2022, arXiv:2202.11822.
[28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Human Lang. Technol., Jun. 2019, pp. 4171–4186.
[29] O. Press, M. Zhang, S. Min, L. Schmidt, N. Smith, and M. Lewis, “Measuring and narrowing the compositionality gap in language models,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2023, pp. 5687–5711.
[30] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal, “QASC: A dataset for question answering via sentence composition,” in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 5, 2020, pp. 8082–8090.
