I. Introduction
The past several years have witnessed the tremendous success of transformer-based models such as BERT [1] and GPT-2 [2]. These models achieve state-of-the-art performance and have become dominant in most NLP tasks, while their application scenarios and model scale continue to grow rapidly. Transformer-based models are mainly composed of word embedding layers, self-attention layers, and feed-forward layers.
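For concreteness, in the canonical Transformer formulation the self-attention layer and the position-wise feed-forward layer compute, respectively,

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad
\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2,

where Q, K, and V are the query, key, and value matrices, d_k is the key dimension, and W_1, b_1, W_2, b_2 are learned feed-forward parameters. Specific models differ in minor details (e.g., BERT and GPT-2 replace the ReLU activation with GELU), but share this overall layer composition.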