I. Introduction
The rapid development of deep learning techniques in Natural Language Processing (NLP) has led to significant advances in Language Models (LMs), making them fundamental to many NLP tasks, such as text classification [1], [2] and question answering [3]. Recent popular LMs, such as Google’s BERT [4] and OpenAI’s GPT family [5], are composed of multiple layers of Transformer blocks with millions of parameters [6]. Pre-training these LMs on massive text corpora collected from the Internet is common practice [7]. Large-scale LMs can understand and generate fluent natural language [8], and with only minor parameter updates they can be applied directly to a variety of downstream tasks. In particular, a pre-trained LM can be fine-tuned on a small private dataset for a domain-specific application without incurring the high cost of training from scratch [4].
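To make the pre-train-then-fine-tune workflow concrete, the following is a minimal sketch of fine-tuning a pre-trained LM on a small labeled corpus for text classification. The checkpoint name, dataset, library (Hugging Face Transformers/Datasets), and hyperparameters are illustrative assumptions, not the setup used in this paper.

```python
# Sketch: fine-tune a pre-trained LM on a small classification dataset.
# All names and hyperparameters below are illustrative assumptions.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "bert-base-uncased"  # assumed pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A small public subset stands in for a private, domain-specific dataset.
dataset = load_dataset("imdb", split="train[:2000]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="ft-out",
    num_train_epochs=2,            # a few epochs suffice on a pre-trained backbone
    per_device_train_batch_size=16,
    learning_rate=2e-5,            # small learning rate: only minor parameter updates
)

Trainer(model=model, args=args, train_dataset=dataset).train()
```

Because the Transformer backbone already encodes general language knowledge from pre-training, only a small classification head and modest weight updates are learned here, which is what keeps the cost far below training from scratch.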