
ARC: A Layer Replacement Compression Method Based on Fine-Grained Self-Attention Distillation for Compressing Pre-Trained Language Models


Abstract:

The primary objective of model compression is to preserve the performance of the original model while reducing its size as much as possible. Knowledge distillation has become the mainstream approach to model compression owing to its strong performance. However, existing knowledge distillation methods designed for small and medium pre-trained models struggle to extract knowledge effectively from large pre-trained models, while methods targeting large pre-trained models have difficulty compressing them to a smaller scale. This paper therefore proposes a new model compression method, Attention-based Replacement Compression (ARC), which introduces random layer replacement on top of fine-grained self-attention distillation. In the pre-training distillation stage, the method captures the important features of the original model through fine-grained self-attention distillation; distilling from the upper layers of the large teacher model yields richer information. In the fine-tuning compression stage, one-to-one random replacement of Transformer layers fully exploits the hidden knowledge of the large pre-trained model. Compared with more complex compression methods, ARC both simplifies the training process and broadens the applicability of the compressed model. This paper compares knowledge distillation methods for pre-trained models of different sizes on the GLUE benchmark. Experimental results demonstrate that the proposed method achieves significant improvements across parameter scales, particularly in accuracy and inference speed.
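
To make the two training signals in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released implementation. It assumes attention maps are available per layer, that teacher and student share head counts and hidden dimensions (as one-to-one layer replacement implies), and all names here (layer_map, replace_prob, attention_distillation_loss, forward_with_random_replacement) are illustrative assumptions.

    import random
    import torch.nn.functional as F

    def attention_distillation_loss(student_attn, teacher_attn, layer_map):
        # Fine-grained self-attention distillation (sketch).
        # student_attn / teacher_attn: per-layer attention probabilities,
        # each of shape (batch, heads, seq_len, seq_len). layer_map pairs
        # each student layer with an upper teacher layer, following the
        # abstract's claim that the teacher's upper layers carry richer
        # information.
        loss = 0.0
        for s_idx, t_idx in layer_map.items():
            s = student_attn[s_idx].clamp_min(1e-9)
            t = teacher_attn[t_idx].clamp_min(1e-9)
            # KL divergence between the attention distributions over keys.
            loss = loss + F.kl_div(s.log(), t, reduction="batchmean")
        return loss / len(layer_map)

    def forward_with_random_replacement(hidden, student_layers,
                                        teacher_layers, layer_map,
                                        replace_prob=0.3):
        # One-to-one Transformer-layer random replacement (sketch).
        # During fine-tuning, each student layer is swapped for its mapped
        # teacher layer with probability replace_prob. Teacher parameters
        # are assumed frozen (requires_grad=False), yet gradients still
        # flow through their activations to earlier student layers.
        for s_idx, layer in enumerate(student_layers):
            if random.random() < replace_prob:
                hidden = teacher_layers[layer_map[s_idx]](hidden)
            else:
                hidden = layer(hidden)
        return hidden

In a full implementation, the distillation loss would be combined with the task loss, and the layer mapping and replacement probability would be the main tunable choices.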
Page(s): 848 - 860
Date of Publication: 03 September 2024
Electronic ISSN: 2471-285X
