Abstract:
The primary objective of model compression is to preserve the performance of the original model while reducing its size as much as possible. Knowledge distillation has become the mainstream approach to model compression due to its excellent performance. However, current knowledge distillation methods for medium and small pre-trained models struggle to effectively extract knowledge from large pre-trained models, while methods targeting large pre-trained models face challenges in compressing them to a smaller scale. Therefore, this paper proposes a new model compression method called Attention-based Replacement Compression (ARC), which introduces layer random replacement on top of fine-grained self-attention distillation. In the pre-training distillation stage, the method first captures the important features of the original model through fine-grained self-attention distillation, extracting the upper layers of the large teacher model to obtain richer information. In the fine-tuning compression stage, one-to-one Transformer-layer random replacement training then fully exploits the hidden knowledge of the large pre-trained model. Compared with other, more complex compression methods, ARC not only simplifies the training process of model compression but also broadens the applicability of the compressed model. This paper compares knowledge distillation methods for pre-trained models of different sizes on the GLUE benchmark. Experimental results demonstrate that the proposed method achieves significant improvements across different parameter scales, especially in accuracy and inference speed.
Published in: IEEE Transactions on Emerging Topics in Computational Intelligence (Volume: 9, Issue: 1, February 2025)
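As a rough illustration of the fine-tuning compression stage described in the abstract, the sketch below shows one plausible way to implement one-to-one Transformer-layer random replacement in PyTorch. The class name RandomReplacementEncoder, the Bernoulli replacement probability replace_prob, and the use of nn.TransformerEncoderLayer as stand-ins for teacher and student layers are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of one-to-one Transformer-layer random replacement.
# All names and the replacement schedule are assumptions for illustration only.
import torch
import torch.nn as nn


class RandomReplacementEncoder(nn.Module):
    """Pairs each student layer with one teacher layer; during training,
    each pair randomly decides whether the (frozen) teacher layer or the
    trainable student layer processes the hidden states."""

    def __init__(self, teacher_layers, student_layers, replace_prob=0.5):
        super().__init__()
        assert len(teacher_layers) == len(student_layers)  # one-to-one pairing
        self.teacher_layers = nn.ModuleList(teacher_layers)
        self.student_layers = nn.ModuleList(student_layers)
        self.replace_prob = replace_prob  # assumed Bernoulli replacement rate
        for p in self.teacher_layers.parameters():
            p.requires_grad = False  # only student layers receive gradients

    def forward(self, hidden_states):
        for t_layer, s_layer in zip(self.teacher_layers, self.student_layers):
            # During training, the student layer replaces its teacher
            # counterpart with probability replace_prob; at inference time,
            # only the compressed student layers are used.
            use_student = (not self.training) or (torch.rand(1).item() < self.replace_prob)
            hidden_states = (s_layer if use_student else t_layer)(hidden_states)
        return hidden_states


# Toy usage: six teacher layers paired with six slimmer student layers.
teacher = [nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
           for _ in range(6)]
student = [nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                      dim_feedforward=1024, batch_first=True)
           for _ in range(6)]
model = RandomReplacementEncoder(teacher, student, replace_prob=0.5)
x = torch.randn(2, 16, 768)   # (batch, sequence length, hidden size)
out = model(x)                # training forward randomly swaps in student layers
```

Freezing the teacher layers and sampling the replacement per pair reflects, under the stated assumptions, the idea of letting small student layers gradually take over the large teacher's computation during fine-tuning, while inference relies on the student layers alone.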