NeuRes: Highly Activated Neuron Responses Transfer via Distilling Sparse Activation Maps

In recent years, Knowledge Distillation has attracted significant interest for mobile, edge, and IoT devices due to its ability to transfer knowledge from a large and complex teacher to a lightweight student network. Intuitively, Knowledge Distillation forces the student to mimic the teacher's neuron responses, improving the student's generalization by deploying distillation losses as regularization terms. However, the non-linearity of the hidden layers and the high dimensionality of the feature maps make knowledge transfer a rigorous task. Though numerous methods have been proposed to transfer the teacher's neuron responses in the form of diverse feature characteristics such as attention, contrastive representation, and so on, to the best of our knowledge, no prior work has considered feature-level non-linearity during distillation. In this work, we ask: can feature-level non-linearity-based approaches improve student performance? To investigate this question, we propose a novel knowledge distillation technique called NeuRes (Neuron Responses), which distills Sparse Activation Maps (SAMs) to transfer the highly activated neuron responses to the student and enhance its representational capability. The proposed NeuRes selects the highly activated neuron responses, producing Sparse Activation Maps (SAMs), and transfers the knowledge based on activation normalization. NeuRes also transfers translation-invariant features using auxiliary classifiers and augmented data to improve the student's generalization. Detailed ablation studies and extensive experiments on model compression, transferability, adversarial robustness, and few-shot learning verify that NeuRes outperforms state-of-the-art distillation techniques on standard benchmark datasets.


I. INTRODUCTION
Over the past few decades, vision-based deep learning approaches have attracted immense attention by providing strong performance in diverse visual tasks such as segmentation, detection, classification, recognition, reconstruction, and so on [1], [2], [3], [4], [5], [6], [7], [19]. However, heavy and complex deep architectures demand high computational resources and energy consumption, which makes them unsuitable for deployment in low-resource and real-time applications such as mobile, IoT, and edge computing devices [20]. Moreover, 6G networks and AI-enabled edge, cloud, and fog computing infrastructures demand fast and rational learning systems, i.e., lightweight AI experts, to facilitate real-time execution on low-computation terminal devices such as smartphones, IoT, and vehicular edge devices [51], [52]. On the other hand, lightweight architectures perform relatively poorly compared to heavy and complex architectures due to their lower generalization capability [8], [9]. To this end, numerous approaches have been suggested to compress [12], [13], [15], prune [10], [11], quantize [14], or mimic [16], [17], [22], [23] teacher networks into or by a small, lightweight student architecture. Knowledge Distillation is a very effective technique that provides explicit supervision to the lightweight student to achieve the teacher's performance without increasing the number of parameters or the computational complexity [16], [18], [22], [23]. Hinton et al. [23] first introduced the knowledge distillation (KD) technique to transfer predictive logits distributions from the teacher(s) to the student.
(The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Shorif Uddin.)
(VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
KD [23] transfers predictive logits-level knowledge by softening the logits distribution, dividing it by a constant factor called the temperature, and deploys no distillation loss in the intermediate feature space. Having lower representational capacity than the teacher networks, students struggle to mimic the intermediate feature representations, i.e., the latent space [24], [25]. Numerous studies show that reducing only the Kullback-Leibler (KL) divergence between the global logits distributions of the student and teacher is not enough. To improve this scenario, numerous feature-level knowledge distillation techniques have been proposed that force the student to mimic characteristics of the intermediate features of the teacher, such as contrastive representations, attention, instance similarity, inter-channel correlations, and so on [21], [26], [27], [28]. Intuitively, the principal goal of knowledge distillation is to improve student generalization by forcing the student to learn the neuron responses of the teacher network(s). Feature-level knowledge distillation techniques perform better than logits-level distillation techniques. Though transferring this intermediate representational knowledge improves the student's generalization capability, due to the non-linearity of the intermediate activation maps, achieving optimal performance is still a rigorous task.
FIGURE 2. Overview of the process of transferring contrastive knowledge on augmented data [38] and highly activated neuron responses. The blue and orange colors denote teacher and student knowledge, respectively. (a) and (d) represent the highly activated and sparse response transfer to the student. (b) and (c) denote the contrastive knowledge among the similar and dissimilar instances in the teacher and student.
Activation functions introduce non-linearity. A neural network with a mixture of activation boundaries estimates a high-performing function [39], [40], and a combination of activation boundaries creates a better decision boundary [41]. These properties of activation functions indicate that transferring knowledge refined through an activation function improves performance by creating better decisions. Heo et al. [33] showed that transferring knowledge by creating an activation boundary improves performance in the knowledge-transfer task. However, does transferring the activation boundary provide optimal performance in knowledge distillation tasks? Fig. 3 shows that though transferring the intermediate activation maps improves performance, the results are sub-optimal. In this work, we focus on the questions: does transferring high-dimensional, dense representations provide optimal performance, and is transferring this dense knowledge highly sensitive to errors in the input data? Stanton et al. [25] discussed that though students struggle to learn the teacher's knowledge perfectly, they may have the capacity to learn better than existing outcomes suggest. Liu et al. [28] showed that directly transferring the activation maps to the student, i.e., pixel-to-pixel knowledge transfer, is very sensitive to errors and less effective [30], [33].
Inspired by these observations, we focus on transferring highly activated neuron responses via distilling sparse activation maps and propose NeuRes, which shifts and scales the activation boundary. Fig. 1 shows the overview of the process of shifting and scaling the activation boundary of the ReLU (rectified linear unit) activation function, where the black and orange axes denote the original and shifted coordinate systems, respectively. In deep learning models, ReLU provides non-linearity and is defined by σ(x) = max(0, x). Fig. 1(a) and Fig. 1(b) show the conventional ReLU activation function and the process of coordinate shifting, respectively. The proposed NeuRes transfers the highly activated neurons to the student, which aims to mimic the sparse and highly activated neuron responses of the teacher network. This sparse knowledge-transfer technique addresses the problem of high dimensionality and provides implicit non-linearity, allowing the student to improve generalization using its own capacity. As the student has lower generalization capability and the lower-activated neuron responses are dropped during training, we also feed augmented data to both the teacher and the student, which helps the student learn the teacher's translation-invariant features on augmented data and improves generalization (see the Experiments section (IV) for details). In Fig. 2, we show the overview of the process of feeding both augmented and original data to help the student improve generalization using contrastive knowledge. Fig. 2(a) and Fig. 2(d) indicate the highly activated neuron responses of the teacher and student networks, respectively. Fig. 2(b) and Fig. 2(c) show the contrastive knowledge between the outcomes of the auxiliary classifiers for original and augmented data. The contrastive knowledge with other instances is also deployed as a regularization term.
Let A_i^S and A_i^T denote the i-th activation maps of the student N_S and the teacher N_T, respectively. Firstly, we filter the highly activated neuron responses to obtain the Sparse Activation Maps (SAMs), S_i^T and S_i^S, from A_i^T and A_i^S, respectively, based on magnitude, by shifting and scaling the activation boundary as depicted in Fig. 1. The highly activated, sparse neuron responses are then transferred to the student N_S from the teacher N_T, along with the contrastive knowledge over input instances and augmented data, as shown in Fig. 2. The teacher's predictive logits-level knowledge (logits distribution) is improved by reducing the KL divergence between the auxiliary classifier (trained using augmented data and a self-supervision approach) and the main classifier [38]. While transferring knowledge, the student is forced to learn the improved predictive logits-level knowledge of the teacher by deploying the distillation losses. Fig. 2 shows the overall process of distilling the teacher's knowledge. During distillation, we consider four losses: L_kd, L_ce, L_ax−st, and L_res. The overall loss function L_NeuRes is shown in Eq. 12. The details of the proposed technique are discussed in Section III. Our core contributions can be summarized as:
• We provide an extensive investigation of the hypothesis that activation boundary shifting and scaling, along with data augmentation and a self-supervision approach, is a very effective technique to make the student learn better.
• We show that transferring sparse representations, combined with data augmentation, is better than transferring dense representations.
• Based on the analysis and observations, we propose a novel technique called NeuRes to transfer highly activated neuron responses and to enable the student to learn translation-invariant features via distilling Sparse Activation Maps (SAMs).
• We provide extensive results on the popular standard benchmarks CIFAR100, STL10, and TinyImageNet for image classification, transferability, adversarial robustness transfer, and few-shot training tasks, with detailed ablation studies.

II. RELATED WORKS
A. FEATURE-LEVEL AND LOGITS-LEVEL KNOWLEDGE DISTILLATION
Knowledge distillation was introduced by Hinton et al.
in [23]. The core concept was to reduce the KL divergence between the logits of the teacher and the student. To soften the logits distribution and make it easier for the student to learn, they divide the logits by a constant factor called the temperature, τ. This technique transfers predictive logits-level knowledge, i.e., the predictive distribution, to the student. Such a knowledge distillation technique has no control over supervision of the intermediate structured knowledge [45], so, due to the student-teacher representational capacity gap, the students fail to properly mimic the intrinsic distribution [43], [44] of the teacher. For that reason, reducing only the KL divergence is not enough. Feature-level knowledge distillation techniques were therefore proposed to supervise the students in mimicking intermediate feature-level knowledge such as contrastive representations [21], attention [26], similarity [27], inter-channel diversity-preserved knowledge [28], and so on [21], [38]. Though these works perform better, they still suffer from fidelity [25] and generalization problems. High fidelity means faithfulness to the predictions of the network, which is related to generalization: a network that works well on the training dataset is not guaranteed to perform well on the test dataset, whereas a well-generalized network does.
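To make the temperature mechanism concrete, here is a minimal sketch of Hinton-style logit softening and the resulting KL term; the τ² scaling convention and the example logits are illustrative choices, not values taken from this paper:

```python
import math

def softmax(logits, tau=1.0):
    """Softmax with temperature tau; larger tau softens the distribution."""
    exps = [math.exp(z / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [8.0, 2.0, 1.0]
student_logits = [6.0, 3.0, 1.0]

# KD loss term: KL between temperature-softened distributions, conventionally
# scaled by tau^2 to keep gradient magnitudes comparable across temperatures.
tau = 4.0
kd_loss = tau ** 2 * kl_divergence(softmax(teacher_logits, tau),
                                   softmax(student_logits, tau))
```

Raising τ flattens both distributions, exposing the teacher's "dark knowledge" about the relative likelihood of wrong classes.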
Fidelity denotes the degree of generalization [25]. Overfitted models try to memorize the teacher representations on the training dataset, which does not ensure generalization over the test data. To improve generalization and reduce overfitting, Xu et al. [38] and Vu et al. [48] proposed techniques to improve the student's generalization using augmented datasets. Xu et al. [38] and Yang et al. [42] proposed methods to improve teachers by training auxiliary classifiers. We consider transferring highly activated neuron responses via distilling sparse activation maps along with the contrastive knowledge of the augmented and original data.

B. ACTIVATED NEURON RESPONSES DISTILLATION
To transfer neuron responses, Romero et al. [30] proposed a knowledge distillation technique that reduces the MSE loss between the neuron responses of the teacher and student. Though this technique effectively transfers the neuron responses of the high-level layers, it struggles to transfer low-level features. Later, Yim et al. [46] transferred neuron responses by reducing the spatial dimension, with the inter-channel correlations transferred based on the Gram matrix [47]. Zagoruyko et al. [26] reduced the channel dimension and focused on attention to transfer neuron responses. Reducing the dimension in this way incurs a loss of information. Heo et al. [33] proposed an activation-boundary-based approach without reducing the dimension. Though we also transfer neuron responses, our work differs from the response-distillation-based techniques in two respects: 1) instead of transferring all the activated neuron responses, we transfer the highly activated neuron responses without reducing the dimension, which facilitates transferring both low- and high-level information by shifting the activation boundary, and 2) as the lower activations are dropped, we improve generalization using an augmented dataset and auxiliary classifiers.

III. PROPOSED NeuRes VIA DISTILLING SAMs
The overall architecture and algorithm of NeuRes are depicted in Fig. 4 and Algorithm 1, respectively. The proposed NeuRes is composed of four fundamental components: A) feature-level non-linearity, B) highly activated neuron responses transfer via distilling Sparse Activation Maps (SAMs), C) translation-invariant feature learning via auxiliary classifiers fed with augmented data, and D) deploying NeuRes non-linearity constraints during training the student.
FIGURE 4. Overview of the proposed NeuRes technique. The input data is augmented using four transformations in terms of color, rotation, cropping, and dropping, similar to [38]. During training the student, both the teacher and the student receive the augmented data X̃ and the original data X as input. The auxiliary classifiers receive the features of the augmented data and the main classifiers receive the features of the original data as input. The purple and orange colors at the top denote the main and auxiliary classifiers, respectively. The classifiers bordered by blue and orange indicate the teacher and student classifiers, respectively. F and A denote the intermediate feature maps before and after the ReLU activation function, respectively. The details are discussed in Section III and Algorithm 1.
This implies that A_i contains both lower and higher activated neuron responses. We restrict the transfer of the activated nodes of lower magnitudes. The candidate selection is done by shifting the separating hyperplane of the activation function φ, transforming the activation map A_i ∈ R^{B×C×H×W} into the Sparse Activation Map S_i ∈ R^{B×C×H×W} through a filter function ϕ as S_i = ϕ(A_i). The filter function ϕ retains the highly activated neuron responses by shifting the separating hyperplane of the activation function φ. In Fig. 1, we show the output and overview of the filter function, which filters out the responses of the low-activated neurons via shifting and scaling the separating hyperplane of the ReLU activation function.

2) FILTER FUNCTION (ϕ)
Firstly, we normalize the activation map A into the range [−ξ, ξ] to obtain the activation map Â with a shifted separating hyperplane. This step shifts the separating hyperplane to re-define the activeness of the neurons: the activated neurons with lower magnitude fall into the negative region, i.e., they are treated as inactive neurons.
The normalized activation map Â is then fed into the activation function φ to obtain the Sparse Activation Map (SAM), S. So the filter function can be defined as S = ϕ(A) = φ(Â), where ξ denotes the temperature that scales the magnitude of the separating hyperplane. We evaluate the performance of NeuRes for different values of ξ and observe the best performance at ξ = 20.
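The filter function described above can be sketched as follows. This is a hypothetical reading on a flattened (1-D) activation map with per-map min-max normalization; the paper's exact normalization scheme may differ:

```python
def relu(x):
    """Standard ReLU, sigma(x) = max(0, x)."""
    return max(0.0, x)

def filter_function(activation_map, xi=20.0):
    """Sketch of the filter phi: min-max normalize the activation map into
    [-xi, xi] (shifting the ReLU separating hyperplane), then re-apply ReLU
    so low-magnitude activations land in the negative region and are zeroed,
    yielding a Sparse Activation Map (SAM)."""
    lo, hi = min(activation_map), max(activation_map)
    span = hi - lo if hi > lo else 1.0
    shifted = [2.0 * xi * (a - lo) / span - xi for a in activation_map]
    return [relu(s) for s in shifted]

A = [0.1, 0.4, 2.5, 0.0, 3.0, 1.2]  # post-ReLU activations (all >= 0)
sam = filter_function(A, xi=20.0)   # only the strongest responses survive
```

Note how activations below the new hyperplane (the map's midpoint under this normalization) map to zero, which is exactly what makes the resulting map sparse.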

B. HIGHLY ACTIVATED NEURON RESPONSES TRANSFER VIA DISTILLING SAMs
The resulting Sparse Activation Maps (SAMs), S, contain the highly activated neuron responses. This sparse knowledge is then distilled to transfer the knowledge of the excited neuron responses. Let N_S and N_T be the student and teacher networks, with corresponding SAMs S^S and S^T, respectively. The Sparse Activation Map S^T is distilled and the neuron responses are transferred by minimizing the L2 distance between S^T and S^S. During training the student, this L2 loss is used as a regularization term along with the cross-entropy loss. The constraint deployed to perform feature-level non-linearity while transferring the knowledge for the i-th layer is L_res^i = ||S_i^T − S_i^S||_2. The total response loss is then L_res = Σ_{i=1}^{L} L_res^i, where L indicates the number of SAMs.
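As a sketch, the per-layer constraint and the total response loss might be computed as below (flattened SAMs, plain L2 distance, equal layer weights; any per-layer weighting used in the paper is omitted):

```python
def l2_distance(t, s):
    """L2 distance between two flattened activation maps."""
    return sum((ti - si) ** 2 for ti, si in zip(t, s)) ** 0.5

def response_loss(teacher_sams, student_sams):
    """Total response loss L_res: sum of per-layer L2 distances between the
    teacher and student Sparse Activation Maps."""
    return sum(l2_distance(t, s) for t, s in zip(teacher_sams, student_sams))
```

The loss is zero exactly when the student's SAMs match the teacher's, so it acts as a regularizer pulling the student's highly activated responses toward the teacher's.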

C. TRANSLATION INVARIANT FEATURE LEARNING
So far, we have discussed the proposed NeuRes for a single task (i.e., a single classifier) using the original datasets.
In this section, we explain the process of achieving richer predictive logits-level knowledge using augmented data along with the original dataset as input. Following SSKD [38], we perform four transformations of the original data: 1) cropping, 2) rotation, 3) dropping, and 4) color transformation.
Let X and X̃ denote the original and transformed datasets, respectively. Firstly, the backbone N_b^T(·) and the classifier N_c^T(·) of the teacher network are trained by imposing only the cross-entropy loss on the original data X, i.e., N_c^T(N_b^T(X)). After training, we freeze the weights of the teacher backbone N_b^T(·) and add an auxiliary classifier N_ax^T(·) on top of the teacher network to learn from the augmented dataset X̃, i.e., N_ax^T(N_b^T(X̃)). This training technique improves the predictive logits-level knowledge of the teacher network. The auxiliary classifiers are trained by deploying a cosine-similarity-based contrastive prediction loss between the main N_c^T(·) and auxiliary N_ax^T(·) classifiers.
The cosine function between two feature vectors u and v is cos(u, v) = (u · v) / (||u|| ||v||), which yields the pairwise similarity matrix C_{i,j} over instances. The self-supervised loss is then defined over this similarity matrix.
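The cosine function, and a pairwise similarity matrix of the kind C_{i,j} appears to denote, can be sketched as follows (the matrix construction is our assumption, not the paper's exact formulation):

```python
def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def similarity_matrix(features):
    """Pairwise cosine similarities C[i][j] over a batch of feature vectors;
    a softmax over each row would turn it into a similarity distribution."""
    return [[cosine_similarity(fi, fj) for fj in features] for fi in features]
```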

D. DEPLOYING NeuRes CONSTRAINTS AS THE REGULARIZATION TERM
The student network N_S also includes an auxiliary classifier N_ax^S(·) alongside the main classifier N_c^S(·), mirroring the teacher network. During knowledge transfer, the student is forced to mimic the knowledge of both the auxiliary and main classifiers of the teacher through the corresponding classifiers, i.e., (N_c^T(X) → N_c^S(X)) and (N_ax^T(X̃) → N_ax^S(X̃)), by reducing the KL divergence. The auxiliary classifier of the student also learns the self-supervised logits knowledge of the student's main classifier, i.e., (N_c^S(X) → N_ax^S(X̃)). The auxiliary logits-distribution distillation loss L_ax−st reduces the KL divergence between the temperature-softened auxiliary logits distributions of the teacher and student. The self-supervised loss L_sax is estimated by the KD loss between the instance-similarity probability matrices obtained from C_{i,j} (Eq. 6) for two input examples X_i and X_j. The KD [23] loss, which reduces the KL divergence on the original dataset between N_c^S(N_b^S(X)) and N_c^T(N_b^T(X)), is L_kd = −τ² Σ_{m=1}^{M} p_m^T log p_m^S, where p^T and p^S are the temperature-softened (τ) softmax distributions of the teacher and student logits and M denotes the number of classes. The overall distillation loss is L_NeuRes = L_ce + L_kd + β1 L_ax−st + β2 L_res (Eq. 12), where β1 and β2 are the balancing hyperparameters adopted from [38] and L_ce indicates the cross-entropy loss.
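Under a four-term reading of the overall objective (an assumption on our part; the paper may weight or group the terms differently, e.g., folding L_sax into the auxiliary term), the losses combine as:

```python
def neures_loss(l_ce, l_kd, l_ax_st, l_res, beta1=1.0, beta2=1.0):
    """One plausible reading of the overall objective: cross-entropy and
    logit-level KD terms plus the auxiliary self-supervised and sparse
    response terms, balanced by beta1 and beta2 (adopted from SSKD [38]).
    The exact weighting in the paper may differ."""
    return l_ce + l_kd + beta1 * l_ax_st + beta2 * l_res
```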

IV. EXPERIMENTS
A. DATASETS
In this paper, the experimental results are evaluated on three standard benchmark datasets: CIFAR100, STL10, and TinyImageNet. The CIFAR100 dataset contains a total of 60,000 labelled images of dimension 32 × 32, of which 50,000 are training and 10,000 are validation images, distributed over 100 classes. The STL10 dataset consists of 5,000 training and 8,000 test images, each of resolution 96 × 96, distributed into 10 classes; the dataset is derived from the ImageNet [50] dataset. The TinyImageNet dataset, a small version of ImageNet with 200 classes, includes 100,000 training, 10,000 validation, and 10,000 test images of dimension 64 × 64.

B. SETUP
We follow the experimental setup and parameter tuning of [21] and [38]. We run 240 epochs for every experiment except training the auxiliary classifiers of the teachers, for which we run 60 epochs. We investigate the performance of our proposed method on six different architecture setups. Every reported result is the average of three individual runs. The initial learning rate for every architecture setup is set to 0.05. The evaluation is performed on an NVIDIA GeForce GTX 1080 GPU with a batch size of 128. Following the state of the art, we select the SGD optimizer with a weight decay of 5 × 10^−4 and a momentum of 0.9. We evaluate the performance for different values of ξ and obtain the best performance for ξ = 20. We evaluate the proposed NeuRes on four vision-based tasks: 1) image classification on the CIFAR100 dataset to verify model-compression ability, 2) transferability on the CIFAR100 → TinyImageNet and CIFAR100 → STL10 dataset pairs, 3) robustness transfer on adversarial examples on the CIFAR100 dataset, and 4) few-shot training scenarios. Table 1 shows the significance of NeuRes on the classification task for model compression. Our method achieves state-of-the-art performance for every student-teacher architecture setup, and NeuRes consistently outperforms existing state-of-the-art methods. We also notice that the students improve their generalization compared to the baseline approaches, owing to the improved predictive logits-level knowledge at the auxiliary classifiers of the teacher network. These accuracy improvements indicate that NeuRes successfully improves student performance by transferring neuron responses via distilling sparse activation maps (SAMs). Fig. 7 compares our proposed NeuRes with the logits-level and feature-level state-of-the-art methods KD [23] and CRD [21], respectively, in terms of F1-score and confusion matrix.
The figure shows that NeuRes obtains the best F1 score, and the confusion matrix shows that NeuRes reduces misclassifications compared to the others.

C. MODEL COMPRESSION
Metric: Let TP, FP, TN, and FN denote the true positive, false positive, true negative, and false negative predictions, respectively. The precision and recall are Pn = TP / (TP + FP) and Re = TP / (TP + FN), respectively, and the F1-score is F1 = 2 · Pn · Re / (Pn + Re).
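The metrics above reduce to a few lines of code; a sketch from the confusion-matrix counts:

```python
def precision(tp, fp):
    """Pn = TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Re = TP / (TP + FN)."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision (Pn) and recall (Re)."""
    pn, re = precision(tp, fp), recall(tp, fn)
    return 2 * pn * re / (pn + re)
```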

D. ROBUSTNESS TRANSFERABILITY AGAINST ADVERSARIAL ATTACKS
The proposed NeuRes trains the student using sparse activation maps on both original and augmented data, which also transfers robustness to the student. The sparsity and the translation-invariant feature learning further improve the robustness of the student. Table 3 shows the significance of NeuRes in improving the robustness of the student against adversarial attacks. An adversarial attack attempts to fool a trained network with perturbed images (images with added noise). Firstly, the student is trained using the NeuRes approach on the CIFAR100 dataset; the trained student is then evaluated on the perturbed datasets. The adversarial attack is performed on the CIFAR100 dataset with a perturbation magnitude of ε = 8/255 using the Fast Gradient Sign Method (FGSM) [49]. WideResNet-40-2 and WideResNet-16-2 are selected as the teacher-student architecture setup. Table 3 indicates that the proposed NeuRes enables the student to outperform the state of the art in robustness transferability.
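For reference, the FGSM perturbation used in this evaluation has a one-line core, x_adv = x + ε · sign(∇_x L); the toy linear scorer below is purely illustrative (its weights, input, and analytic gradient are our assumptions, not the paper's setup):

```python
def sign(x):
    """Elementwise sign: -1, 0, or +1."""
    return (x > 0) - (x < 0)

def fgsm_perturb(x, grad, eps=8 / 255):
    """FGSM: move each input coordinate by eps in the direction that
    increases the loss, i.e., x_adv = x + eps * sign(dL/dx)."""
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# Toy example: squared-error loss L = (w.x - y)^2 for a linear scorer,
# whose input gradient is 2 * (w.x - y) * w.
w, y = [0.5, -1.0], 0.0
x = [0.2, 0.3]
pred = sum(wi * xi for wi, xi in zip(w, x))
grad = [2 * (pred - y) * wi for wi in w]
x_adv = fgsm_perturb(x, grad)  # loss on x_adv exceeds the loss on x
```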

E. TRANSFERABILITY OF THE STUDENT ON THE CLEAN DATASETS
As the proposed NeuRes transfers the sparse and highly activated neuron responses, the student achieves better representational capacity.
TABLE 1. … [53]. Bold results represent the best performances. The competing results are quoted from [21]. '*' shows the results of the project provided by [21] re-run in our environment.
TABLE 2. Transferability of students on the clean datasets. We use WideResNet-40-2 and WideResNet-16-2 as the teacher and student architectures, respectively. The competing results are quoted from [21]. Bold values indicate the best performances.
To evaluate the performance of the student trained by NeuRes, we perform transferability experiments from CIFAR100 to the STL10 and TinyImageNet datasets. WideResNet-40-2 and WideResNet-16-2 are considered as the teacher and the student, respectively. Firstly, both the teacher and the student are trained on the CIFAR100 dataset. The representations of the trained student are then transferred to the STL10 and TinyImageNet datasets. While transferring the representations, the feature-extraction part of the student is frozen; only the main linear classifier N_c^S(X) is fine-tuned on the TinyImageNet and STL10 datasets. The top-1 and top-5 accuracy comparisons in Table 2 depict the effectiveness of the proposed NeuRes in terms of transferability compared to the state-of-the-art methods.
From Table 2, we observe that NeuRes outperforms existing state-of-the-art works on transferability tasks. The results in Table 2 indicate that sparsity and highly activated neuron responses, along with translation-invariant feature learning, transfer robust representations to the student.

F. ATTENTION COMPARISON: CLASS ACTIVATION MAP
Our proposed method NeuRes is highly related to attention. To evaluate the significance of our proposed method in terms of focusing on the interesting regions, i.e., optimal attention, we investigate the class activation maps of the trained student. From Fig. 8, we observe that the proposed method successfully pays attention to the expected regions and provides better results compared to the KD [23] knowledge distillation method. Table 4 and Fig. 5 show that our proposed NeuRes outperforms the state-of-the-art methods in the few-shot training scenario. The training is performed in three different few-shot scenarios: we feed 25%, 50%, and 75% of the training data and the original test set of the CIFAR100 dataset to evaluate the significance of NeuRes in few-shot training. More specifically, if the dataset contains N images in total, then 25%, 50%, and 75% of training data indicate that 0.25N, 0.5N, and 0.75N images, respectively, are used during training. We use the same dataset-split mechanism as SSKD [38]. ResNet8x4 and ResNet32x4 are selected as the student and teacher networks, respectively. From Table 4, we see that the proposed NeuRes outperforms existing state-of-the-art methods in all few-shot training scenarios.

G. SIGNIFICANCE IN FEW-SHOT TRAINING SCENARIO
We notice from Table 4 that our method NeuRes can achieve comparable performance using only 25% of the training dataset. The reported results in Table 4 verify the effectiveness of NeuRes in few-shot training.

H. EFFECTS ON DECISION BOUNDARY
We validate the quality of the global features and the ability of the proposed method to match the teacher's decision boundary through t-SNE visualization. t-distributed stochastic neighbor embedding (t-SNE) visualizes high-dimensional data in a two- or three-dimensional space by performing statistical dimensionality reduction. Fig. 6 shows the visualizations for the teacher, KD [23], and NeuRes. ResNet32x4 and ResNet8x4 are selected as the teacher and the student, respectively. The visualizations are obtained from the corresponding pretrained student networks. We evaluate the performance in improving decision boundaries on the CIFAR100 test dataset. From Fig. 6, we notice that our proposed method improves the decision boundary compared with the original KD [23] method: the decision boundary of our method is more similar to the teacher's than that of the baseline KD [23]. This outcome further demonstrates the capacity of the proposed technique and shows that NeuRes improves the decision boundary by deploying activation boundary shifting and scaling.

V. CONCLUSION
We propose a novel knowledge distillation technique, called NeuRes, to transfer the highly activated neuron responses along with translation-invariant features via distilling Sparse Activation Maps (SAMs). With the help of auxiliary classifiers and augmented data, the proposed technique improves the representational capability as well as the predictive logits knowledge of the teacher network to perform better knowledge distillation and transfer. NeuRes outperforms existing state-of-the-art results in transferring knowledge to the student. The proposed NeuRes loss L_NeuRes can be deployed as an auxiliary performance enhancer for any knowledge distillation technique to further improve student performance. We validate the effectiveness of the proposed technique through extensive experiments on model compression, transferability, robustness transfer, and few-shot training scenarios. Our proposed method requires identical spatial dimensions of the teacher and student feature maps. In future work, we will investigate transferring knowledge to students with different architectures.