Abstract:
Knowledge distillation aims to transfer knowledge from a large teacher model to a lightweight student model, enabling the student to achieve performance comparable to the teacher. Existing methods explore various strategies for distillation, including soft logits, intermediate features, and even class-aware logits. Class-aware distillation, in particular, treats the columns of logit matrices as class representations, capturing potential relationships among instances within a batch. However, we argue that representing class embeddings solely as column vectors may not fully capture their inherent properties. In this study, we revisit class-aware knowledge distillation and propose that effective transfer of class-level knowledge requires two regularization strategies: separability and orthogonality. Additionally, we introduce an asymmetric architecture design to further enhance the transfer of class-level knowledge. Together, these components form a new methodology, Class Discriminative Knowledge Distillation (CD-KD). Empirical results demonstrate that CD-KD significantly outperforms several state-of-the-art logit-based and feature-based methods across diverse visual classification tasks, highlighting its effectiveness and robustness.
Published in: IEEE Transactions on Emerging Topics in Computational Intelligence (Volume: 9, Issue: 2, April 2025)
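
The class-aware view described in the abstract can be illustrated with a short sketch. The code below is not the paper's implementation: the function name class_aware_kd_loss, the temperature value, the ortho_weight parameter, and the squared-off-diagonal orthogonality penalty are illustrative assumptions. It transposes the (batch, classes) logit matrices so that each column becomes a class representation, matches the student's class-wise distributions to the teacher's with a KL term, and adds a simple penalty that nudges distinct student class representations toward mutual orthogonality, roughly in the spirit of the separability and orthogonality regularization named above.

    # Minimal sketch of class-aware logit distillation (assumed form, not CD-KD itself).
    import torch
    import torch.nn.functional as F

    def class_aware_kd_loss(student_logits, teacher_logits, temperature=4.0, ortho_weight=0.1):
        """student_logits, teacher_logits: (batch_size, num_classes) tensors."""
        # Columns of the logit matrix as class representations: (num_classes, batch_size).
        s_class = student_logits.t()
        t_class = teacher_logits.t()

        # Class-aware KD term: softmax over the batch dimension for each class row,
        # then KL divergence between teacher and student class-wise distributions.
        s_log_prob = F.log_softmax(s_class / temperature, dim=1)
        t_prob = F.softmax(t_class / temperature, dim=1)
        kd = F.kl_div(s_log_prob, t_prob, reduction="batchmean") * temperature ** 2

        # Illustrative orthogonality regularizer (assumed form): penalize off-diagonal
        # cosine similarities between the student's class representations.
        s_norm = F.normalize(s_class, dim=1)
        gram = s_norm @ s_norm.t()                       # (num_classes, num_classes)
        off_diag = gram - torch.diag(torch.diag(gram))
        ortho = (off_diag ** 2).mean()

        return kd + ortho_weight * ortho

    if __name__ == "__main__":
        torch.manual_seed(0)
        student = torch.randn(32, 10, requires_grad=True)
        teacher = torch.randn(32, 10)
        loss = class_aware_kd_loss(student, teacher)
        loss.backward()
        print(f"loss = {loss.item():.4f}")

In this sketch the separability aspect is only implicit in the per-class softmax over the batch dimension; the actual CD-KD loss formulations and the asymmetric architecture design are specified in the paper.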