Knowledge Transferred Fine-Tuning: Convolutional Neural Network Is Born Again With Anti-Aliasing Even in Data-Limited Situations

Anti-aliased convolutional neural networks (CNNs) are models that introduce blur filters to intermediate representations of CNNs to achieve high accuracy in image recognition tasks. A promising way to prepare a new anti-aliased CNN is to introduce blur filters to the intermediate representations of pre-trained (non anti-aliased) CNNs, since many researchers have released them online. Although this scheme can build the new anti-aliased CNN easily, the blur filters drastically degrade the pre-trained representations. Therefore, to take full advantage of the benefits of introducing blur filters, fine-tuning using massive amounts of training data is often required. This can be problematic because the training data is often limited. In such a “data-limited” situation, the fine-tuning does not bring about a high performance because it induces overfitting to the limited training data. To tackle this problem, we propose “knowledge transferred fine-tuning.” Knowledge transfer is a technique that utilizes the representations of a pre-trained model to help ensure generalization in data-limited situations. Inspired by this concept, we transfer knowledge from intermediate representations in a pre-trained CNN to an anti-aliased CNN while fine-tuning. The key idea of our method is to transfer only the essential knowledge for image recognition in the pre-trained CNN using two types of loss: pixel-level loss and global-level loss. The former loss transfers the detailed knowledge from the pre-trained CNN, but this knowledge may contain “aliased” non-essential knowledge. The latter loss, on the other hand, is designed to increase when the pixel-level loss transfers non-essential knowledge while ignoring the essential knowledge, i.e., it penalizes the pixel-level loss. Experimental results demonstrate that the proposed method using just 25 training images per class on ImageNet 2012 can achieve higher accuracy than a conventional pre-trained CNN.


I. INTRODUCTION
Introducing blur filters to image recognition models often plays a crucial role in generalization because they can relax differences in the scale or position of objects in images [1], [2]. For example, Neocognitron [1] achieves robustness against positional shifts of images by introducing blur filters into its ancestor model, Cognitron [3]. Recent studies have demonstrated that modern convolutional neural networks (CNNs) [4]-[6] can also improve their accuracy by introducing blur filters into intermediate representations.^1 For example, Zhang [7] proposed an ''anti-aliased CNN'' that introduces blur filters into conventional down-sampling operations such as pooling and strided convolution. These down-sampling operations usually ignore the sampling theorem [8], and thus high-frequency signals alias into low frequencies during sampling. In the anti-aliased CNN, the (low-pass) blur filter suppresses the aliasing caused by down-sampling. As a result, anti-aliased CNNs can achieve higher accuracy on image recognition tasks than conventional CNNs without blur filters. Building on this success, various studies [9]-[12] have extended the anti-aliased CNN and demonstrated that blur filters work well for many image recognition tasks. In sum, introducing blur filters is a promising way to improve the accuracy of CNNs.

The associate editor coordinating the review of this manuscript and approving it for publication was Saqib Saeed.
While the studies above implicitly assumed training anti-aliased CNNs from scratch, many researchers, universities, and companies have released pre-trained (non anti-aliased) CNNs online. Therefore, it might be possible to prepare a new anti-aliased CNN by introducing blur filters into a pre-trained one without training from scratch. Although this approach is promising, the blur filters drastically degrade the pre-trained representations. Thus, to take full advantage of the benefits of introducing blur filters, fine-tuning [13]-[15] with massive amounts of training data is required to rebuild anti-aliased representations from the degraded pre-trained ones. This can be difficult because the training data in image recognition is often limited. In such a ''data-limited'' situation, fine-tuning cannot bring about high performance because it induces overfitting to the limited training data.
In this paper, we propose a novel fine-tuning method for anti-aliased CNNs in the data-limited situation,^2 called ''knowledge transferred fine-tuning,'' which combines fine-tuning with knowledge transfer by means of teacher-student collaboration. Knowledge transfer [17] shares representations and constrains the decision boundary of a student model by that of a teacher model. As a result, it avoids overfitting to the limited training data [17], [18] and effectively builds the representations in the student model [19], [20]. These advantages of knowledge transfer are a good fit for the fine-tuning problems of the anti-aliased CNN. Therefore, with the concept of knowledge transfer as a basis, our method fine-tunes an anti-aliased CNN while transferring knowledge from a pre-trained (non anti-aliased) CNN, which has not been overfitted to the limited training data, as a teacher model.

^1 ''Intermediate representations'' here refers to the intermediate layer outputs of a CNN.
^2 A preliminary version of this paper was presented at the IEEE International Conference on Image Processing (ICIP 2021) [16]. In the current paper, we have included new experimental settings (see Subsection IV-B) and analyzed the effects of our method from viewpoints other than accuracy (Subsection IV-A5). Further, we have compared our method with other well-known knowledge transfer methods and found them less effective in our problem setting (Subsection IV-A4). These new experiments, analyses, and comparisons may be suggestive for future research directions.
Our method transfers knowledge from the intermediate representations of the pre-trained CNN so as to affect the degraded pre-trained representations directly. Our aim is to transfer only the essential knowledge from the pre-trained CNN, as its representations may contain non-essential ''aliased'' knowledge. To achieve this aim, our method transfers knowledge using two types of loss: pixel-level loss and global-level loss. The pixel-level loss transfers knowledge by directly calculating the distance between the representations of the teacher and student. This loss is almost the same as the one used in an existing method called FitNets [20], which transfers detailed knowledge from the teacher model. However, since the detailed knowledge may contain aliased knowledge, the pixel-level loss (i.e., FitNets) risks transferring non-essential knowledge rather than essential knowledge. To mitigate this risk, we introduce a novel loss called the ''global-level loss.'' In the global-level loss, two functions, f_t and f_s, extract coarse knowledge that contributes to recognition from the intermediate representations of the teacher and student, respectively. When the pixel-level loss transfers non-essential knowledge while ignoring essential knowledge, the global-level loss increases because f_s has difficulty extracting the knowledge that contributes to recognition from the intermediate representations of the student. In other words, this loss can penalize the pixel-level loss. As a result, the complementary behavior of these losses makes it possible to transfer the essential knowledge of the teacher model maximally. Our method prepares an anti-aliased CNN anew by introducing blur filters into a pre-trained CNN and then fine-tunes it with the recognition loss (the standard loss used in image recognition), the pixel-level loss, and the global-level loss.
We evaluated the proposed method on two natural image recognition datasets, ImageNet 2012 [21] and Caltech-256 [22], and found that it is more accurate than the simple fine-tuning using only the recognition loss.
The main contributions of this paper are as follows.
1) We present knowledge transferred fine-tuning. This method transfers knowledge from a pre-trained CNN to its anti-aliased version by using the pixel-level and global-level losses while performing fine-tuning.
2) We show that ResNet-18 and 34 trained with our method achieve higher accuracy than the pre-trained CNN in a 25 images/class situation evaluated on the ImageNet 2012 [21] dataset. Simple fine-tuning could not achieve this even in a 500 images/class situation.
3) We demonstrate that, interestingly, the global-level loss works particularly well for achieving high accuracy when the datasets in pre-training and fine-tuning differ, i.e., in a transfer learning scenario. On the basis of this finding, we introduce a new hypothesis about aliased signals in the pre-trained representations.

II. RELATED WORK
We first introduce anti-aliased CNN and its variants in Subsection II-A. Then, we review existing methods of knowledge transfer in Subsection II-B.

VOLUME 10, 2022
A. CONVOLUTIONAL NEURAL NETWORK AND ANTI-ALIASING
Local connectivity and weight sharing are central ideas of the neural networks used for image recognition. These ideas are usually embodied as convolutional neural networks (CNNs) [1], [23]. In recent studies, CNNs have been scaled up and have shown impressive performance [4]-[6]. Most modern CNNs are built on the same principle: alternating convolution and down-sampling operations. Down-sampling operations, e.g., pooling and strided convolution, progressively reduce the resolution of intermediate representations and give CNNs efficient computation and a larger receptive field [24]. Interestingly, however, they usually ignore the classic sampling theorem [8]. This theorem states that the sampling rate must be at least twice the highest frequency of the signal; if it is not, high-frequency signals alias into low frequencies during sampling. Therefore, when down-sampling a signal, the textbook solution is to anti-alias by low-pass filtering it first. Even though this is well known, the down-sampling operations in modern CNNs are typically performed without anti-aliasing [4]-[6].
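The textbook solution above can be illustrated with a minimal NumPy sketch (not the paper's implementation): a fixed 3×3 binomial kernel, which approximates a Gaussian, low-pass filters a feature map before stride-2 subsampling, while the naive variant simply drops rows and columns.

```python
import numpy as np

def blur_downsample(x, stride=2):
    """Anti-aliased downsampling: low-pass filter with a 3x3 binomial
    (approximately Gaussian) kernel, then subsample by `stride`.
    x: 2-D feature map of shape (H, W)."""
    k1 = np.array([1.0, 2.0, 1.0])
    kernel = np.outer(k1, k1)
    kernel /= kernel.sum()                     # normalized blur filter
    x = np.asarray(x, dtype=float)
    H, W = x.shape
    padded = np.pad(x, 1, mode="edge")
    blurred = np.zeros((H, W))
    for i in range(H):                          # plain 2-D convolution
        for j in range(W):
            blurred[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return blurred[::stride, ::stride]

def naive_downsample(x, stride=2):
    """Conventional downsampling that ignores the sampling theorem:
    just drop rows and columns."""
    return np.asarray(x, dtype=float)[::stride, ::stride]
```

Shifting the input by one pixel swaps which samples the naive variant keeps, whereas the blurred map changes only slightly; this is the mechanism behind the shift-invariance results discussed later.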
Zhang [7] pointed out this issue and proposed an ''anti-aliased'' CNN that introduces Gaussian blur filters into conventional down-sampling operations. In the anti-aliased CNN, the (low-pass) blur filters suppress the aliasing caused by down-sampling. As a result, it obtains better intermediate representations than conventional CNNs without blur filters. One major benefit of anti-aliased CNNs is their higher accuracy, but they are also known to obtain shift-invariant representations [7], [25], which makes them less affected by small spatial shifts or translations of the input image.
Various studies [9]-[12] have built on the initial success of the anti-aliased CNN through a number of extensions. Li et al. [9] proposed wavelet integrated CNNs (WaveCNets), in which down-sampling operations are replaced with a discrete wavelet transform decomposition. WaveCNets have a stronger theoretical justification than anti-aliased CNNs because the wavelet decomposition can directly identify and remove the high-frequency signals so as to satisfy the sampling theorem. Sinha et al. [10] determined the strength of blur filters in a curriculum manner [26] and demonstrated that this works well for many image recognition tasks. Zou et al. [11] proposed content-aware anti-aliasing, in which a low-pass filter is adaptively applied at different spatial locations instead of a single Gaussian blur filter. Moving beyond standard image recognition, Xie et al. [12] showed that denoising intermediate representations is effective in the context of adversarial training [27], [28], which aims to improve robustness to adversarial examples [27], [29]. Note that denoising has almost the same anti-aliasing effect as blur filters. The above studies have shown that introducing blur filters is a promising way to improve CNN models. However, they have not considered training in a data-limited situation, which is the focus of this paper.

B. KNOWLEDGE TRANSFER
Knowledge transfer was originally devised as a technique for training a lightweight model, i.e., a shallow and/or narrow neural network, by means of teacher-student collaboration [30], [31]. Specifically, it trains a lightweight student model by transferring knowledge from a heavyweight teacher model. In a pioneering work, Hinton et al. [17] proposed a knowledge transfer method in which the output of the student mimics that of the teacher for the same input image. It shares representations and constrains the decision boundary of the student model by that of the teacher model, thereby bringing certain advantages to the student. Both Hinton et al. [17] and Kimura et al. [18] demonstrated knowledge transfer methods that avoid overfitting to limited training data. In another study, Furlanello et al. [19] used the same CNN architecture for both the teacher and student models and iteratively trained the CNN from scratch using knowledge transfer. Interestingly, their trained CNN achieved higher accuracy than the originally trained one, demonstrating that knowledge transfer can effectively build the representations in the student model. On the basis of these studies, we combine fine-tuning of anti-aliased CNNs in the data-limited situation with knowledge transfer.
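The output-mimicking transfer of Hinton et al. [17] can be sketched in a few lines of NumPy. This is a generic illustration of that line of work, not a detail of this paper: the temperature T and the plain cross-entropy form are standard choices, and the logits are placeholders.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; T > 1 softens the distribution so
    the teacher's non-target class probabilities carry information."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=4.0):
    """Hinton-style knowledge transfer: the student mimics the
    teacher's softened output distribution via cross-entropy
    (equivalent to KL divergence up to a constant in the teacher)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -np.sum(p_t * np.log(p_s + 1e-12))
```

The loss is minimized when the student's softened distribution matches the teacher's, which is how the teacher constrains the student's decision boundary.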
As described in Section I, our method transfers knowledge from the intermediate representations of the pre-trained CNN. This is a somewhat different methodology from the method proposed by Hinton et al. [17], though its benefits are almost the same. In a well-known pioneering work, Romero et al. [20] proposed FitNets, which transfers knowledge from intermediate representations to effectively build a student model with a very deep architecture. It transfers knowledge by calculating the distance between the intermediate representations of the teacher and student and bringing them closer. This knowledge transfer method is almost the same as one of our proposed losses (the pixel-level loss). After FitNets, several variants of knowledge transfer from intermediate representations were proposed [32]-[35]. For example, Zagoruyko and Komodakis [32] proposed transferring knowledge in the form of spatial attention, where spatial attention maps are computed by summing the intermediate representations along the channel dimension. Huang and Wang [33] transformed the intermediate representations into a Gram matrix [36] and then transferred the knowledge. Heo et al. [34] transferred knowledge by using the sign values of the intermediate representations rather than their exact values. Interestingly, these methods showed that transforming the intermediate representations and mitigating the constraining effect improves the performance of knowledge transfer. However, they were only evaluated with massive training data and did not consider the data-limited situation, which is the main target of our work. We argue that these previous methods are insufficient for our presumed use case because, in data-limited situations, the student model needs more knowledge from the teacher to help with generalization than in situations with massive training data.
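The transforms used by [32] and [33] can be sketched as follows (a minimal NumPy illustration; the exact normalizations in the original papers may differ). Both discard pixel-level detail, which is why they constrain the student more loosely than FitNets-style direct matching.

```python
import numpy as np

def spatial_attention(feat):
    """Attention-transfer style map: sum of squared activations over
    the channel dimension, then L2-normalized.
    feat: feature tensor of shape (C, H, W); returns (H, W)."""
    att = np.sum(feat ** 2, axis=0)
    return att / (np.linalg.norm(att) + 1e-8)

def gram_matrix(feat):
    """Gram-matrix transform: channel-by-channel inner products,
    discarding the spatial layout entirely.
    feat: (C, H, W); returns (C, C)."""
    C, H, W = feat.shape
    f = feat.reshape(C, H * W)
    return f @ f.T / (H * W)
```

A transfer loss in those methods is then a distance between the teacher's and student's transformed tensors rather than between the raw feature maps.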
In contrast to these existing knowledge transfer methods, our proposed method transfers knowledge by means of the pixel-level and global-level losses. The complementary behavior of these losses transfers the essential knowledge of the teacher model maximally.

III. PROPOSED METHOD
Figure 1 shows an overview of the knowledge transferred fine-tuning. We prepare an anti-aliased CNN anew by introducing blur filters into a pre-trained CNN. The pre-trained and anti-aliased CNNs therefore have almost the same architecture, but the anti-aliased CNN has blur filters. Our method aims to transfer only the essential knowledge from the ''aliased'' intermediate representations of the pre-trained CNN (teacher model) to the anti-aliased CNN (student model) using a ''transfer unit.'' In this paper, we define ''essential'' to mean free of aliased signals and ''non-essential'' to mean containing many aliased signals. In our method, as in the usual knowledge transfer methods [17], [20], [32]-[34], the teacher and student models receive the same image as input, and then we transfer the knowledge with the transfer unit. The configuration of the transfer unit is shown in Figure 2.

For transferring only the essential knowledge, the transfer unit calculates two types of loss: pixel-level loss and global-level loss. The pixel-level loss tries to transfer detailed knowledge, which leads the student model to inherit the recognition ability of the teacher model [20]. However, this risks transferring non-essential knowledge from the teacher model, since the detailed knowledge may contain aliased knowledge. To mitigate this risk, the global-level loss penalizes the pixel-level loss if it transfers non-essential knowledge while ignoring essential knowledge.
These two losses are formulated as follows. First, let t, s ∈ R^{C×H×W} be tensors. Since the teacher and student models use convolution operations to calculate their intermediate representations, t and s can be regarded as the intermediate representations of the teacher and student models, respectively. C, H, and W denote the number, height, and width of the feature maps. The pixel-level loss calculates the distance between the representations as

L_pixel = d(t, s),   (1)

where d is a function that calculates the distance between two representations (e.g., the mean absolute error (MAE)). The global-level loss transfers the coarse knowledge that contributes to image recognition. If the pixel-level loss transfers non-essential knowledge while ignoring essential knowledge, the global-level loss increases. We define the global-level loss as

L_global = d(f_t(t), f_s(s)).   (2)

In this formulation, we use the same function d as in Equation (1) to calculate the distance, but other functions could be used as well. f_t and f_s are trainable functions that extract and output coarse information contributing to recognition. In our implementation, we construct them as small CNNs with 1×1 convolutions, global average pooling [37], and fully connected layers, as shown in Figure 2. We set the number of input and output channels of the 1×1 convolution to be the same. The global average pooling summarizes all pixel-level information of the feature maps by averaging, so the representation after this pooling is a tensor consisting of 1×1 resolution feature maps. Therefore, the global-level loss can transfer coarse knowledge (independent of the pixel-level information) extracted from the summarized representation. Without any guidance for f_t, it cannot learn to extract recognition information, and the global-level loss may be trapped in a trivial solution, e.g., f_t(t) = f_s(s) = 0. To avoid this, we guide f_t with an ''auxiliary'' recognition loss L_aux_recog, as shown in Figure 2.
L_aux_recog is a recognition loss calculated with the output of f_t and the image label. Accordingly, the dimension of the f_t and f_s outputs is the same as the number of classes. The auxiliary recognition loss forces f_t to maximally extract the coarse recognition information from the representation t. If the pixel-level loss transfers non-essential knowledge while ignoring essential knowledge, f_s has difficulty extracting the recognition information because the recognition information in s shrinks compared with that in t. Therefore, L_global can penalize the pixel-level loss for transferring too much non-essential knowledge. In the experiments, we show that the complementary behavior of these two losses works well for fine-tuning the anti-aliased CNN. Finally, we fine-tune the anti-aliased CNN with the pixel-level and global-level losses and the recognition loss L_recog, which is calculated with the output of the student model and the image label. Softmax cross-entropy [38] is usually utilized as L_recog (and L_aux_recog). For example, the recognition loss L_recog calculated with softmax cross-entropy is defined as

L_recog = -Σ_c l_c log(softmax(o)_c),   (3)

where o = student(x) and

softmax(o)_c = exp(o_c) / Σ_j exp(o_j).   (4)

Here, student(x) and l indicate the output of the student model and the one-hot vector corresponding to the label of input image x, respectively. As shown in Figure 1, the knowledge transferred fine-tuning uses several transfer units, so we average all losses calculated in the transfer units as

L_pixel = (1/N) Σ_{n=1}^{N} L^n_pixel,  L_global = (1/N) Σ_{n=1}^{N} L^n_global,   (5)

where N is the number of transfer units and L^n_pixel and L^n_global indicate the losses calculated in the n-th transfer unit. The final loss is defined as

L = L_recog + λ (L_pixel + L_global),   (6)

where λ is a hyper-parameter that determines the effect of the pixel-level and global-level losses. f_t and f_s are trained with L_aux_recog and L_global, respectively. Their optimizations are performed at the same time as the optimization of the student model using Equation (6).
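The loss computation above can be sketched in NumPy. This is a minimal illustration of Equations (1)-(6), not the training code: f_t and f_s are reduced to global average pooling plus one linear layer (the paper's heads also include a 1×1 convolution), and the weight matrices wt and ws are placeholders for the trainable parameters.

```python
import numpy as np

def mae(a, b):
    """Distance d in Eqs. (1) and (2): mean absolute error."""
    return np.mean(np.abs(a - b))

def softmax_xent(o, l):
    """Eqs. (3)-(4): softmax cross-entropy between logits o and
    one-hot label vector l."""
    p = np.exp(o - o.max())
    p /= p.sum()
    return -np.sum(l * np.log(p + 1e-12))

def global_head(feat, w):
    """Minimal stand-in for f_t / f_s: global average pooling over the
    spatial dimensions, then a linear layer to #classes outputs.
    feat: (C, H, W); w: (C, num_classes)."""
    pooled = feat.mean(axis=(1, 2))         # 1x1 feature maps, shape (C,)
    return pooled @ w

def transfer_losses(t, s, wt, ws):
    """Pixel-level (Eq. 1) and global-level (Eq. 2) losses for one
    transfer unit; t, s are teacher/student features of shape (C, H, W)."""
    l_pixel = mae(t, s)
    l_global = mae(global_head(t, wt), global_head(s, ws))
    return l_pixel, l_global

def total_loss(l_recog, pixel_losses, global_losses, lam):
    """Eqs. (5)-(6): average the per-unit losses over the N transfer
    units, then combine with the recognition loss, weighted by lambda."""
    return l_recog + lam * (np.mean(pixel_losses) + np.mean(global_losses))
```

In training, the auxiliary recognition loss would be `softmax_xent(global_head(t, wt), label)`, updating only the teacher-side head wt, as described above.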
Note that the auxiliary recognition loss does not affect the anti-aliased CNN, since it is calculated only with the output of f_t and the image label. As such, it is not included in Equation (6). The above procedure is used only in the training phase. During inference, we can remove the teacher model and the transfer units and use only the student model as an anti-aliased CNN. Therefore, our method does not affect the inference cost, e.g., inference time and memory usage.

IV. EXPERIMENTS
To evaluate the proposed knowledge transferred fine-tuning, we conduct experiments on image recognition tasks in two scenarios: one in which the pre-training and fine-tuning are done with the same dataset and the other in which they are done with different datasets. In both scenarios, we use the CNNs trained with ImageNet 2012 [21] as the pre-trained ones. Therefore, in the first scenario, we continue to use ImageNet 2012 for fine-tuning, and in the second scenario, we use Caltech-256 [22] for fine-tuning.

A. EXPERIMENTS ON ImageNet
In this subsection, we describe experiments on the first scenario, i.e., experiments on ImageNet 2012. We evaluated the quantitative effects of our method and of simple fine-tuning, which uses only the recognition loss, in the data-limited situation. We further compared the proposed method with other fine-tuning methods, including fine-tuning with frozen parameters, other knowledge transfer methods, and variations of the proposed method.

1) DATASET
We used the ImageNet 2012 [21] dataset for the natural image recognition task. ImageNet 2012 has a total of 1,000 classes and consists of about 1.28 million training images and 50,000 validation images.
In the experiments, we randomly selected 10, 25, 50, 100, 200, 500, and 750 training images per class from the full training set to prepare data-limited situations.^4 Unfortunately, the test server for ImageNet 2012 was not accessible at the time of writing, so we report recognition accuracies on the validation images, following previous studies [6], [7].

2) OPTIMIZATION DETAILS
We chose ResNet-18, ResNet-34 [6], and VGG-16 [5] as pre-trained CNN architectures. Since these models are simple but have almost all the components used in modern CNNs, they are good testbeds. For the experiments, we used a CNN pre-trained on ImageNet 2012 as the teacher and the anti-aliased CNN obtained by introducing 3×3 blur filters into the pre-trained CNN [7] as the student. We placed four transfer units (N = 4) at the convolution layers just before each down-sampling layer; for example, we introduced them at the 5th, 9th, 13th, and 17th convolution layers of ResNet-18. We used the same optimization settings as in the pre-training, except for the learning rate and the number of training epochs. The batch size and the momentum parameter were thus set to 256 and 0.9. The learning rate was set to 0.001 for ResNet and 0.0001 for VGG; these rates are 10^-2 times the initial learning rates. We set the learning rate of the transfer units (f_t and f_s) to 0.01, since they are trained from initialized weights. The number of training epochs was set to 30. The hyper-parameter λ in Equation (6) was chosen from λ ∈ {50, 60, 70, 80, 90, 100}. These candidates for λ are based on previous studies [32], [33].

^4 To ensure reproducibility, we used exactly the same images as those in Kayhan et al.'s work [39] for fine-tuning in the 50 images/class situation. The list of these images can be found online at https://github.com/oskyhn/CNNs-Without-Borders.
We used the MAE as the function d in Equations (1) and (2). We also used the softmax cross-entropy as the recognition loss L recog and the auxiliary recognition loss L aux_recog .

3) COMPARISON WITH SIMPLE FINE-TUNING
We first evaluated the quantitative effects of the proposed and simple fine-tuning methods in the data-limited situation. Note that simple fine-tuning corresponds to setting λ = 0 in Equation (6). The hyper-parameter λ was tuned on ResNet-18 in the 50 images/class situation. The relationship between λ and the recognition accuracy is shown in Figure 3; λ = 70 was optimal. To check the feasibility of our method, we used this value across all experiments. Table 1 lists the recognition accuracies of the two methods. As we can see in the table, the proposed method consistently obtained higher accuracy than simple fine-tuning. Further, our method obtained higher accuracy than the pre-trained CNN in many cases (indicated in bold). Remarkably, ResNet-18 and 34 achieved higher accuracy than the pre-trained CNN even in the 25 images/class situation; simple fine-tuning could not achieve this even in the 500 images/class situation. These results demonstrate that our method works well for fine-tuning the anti-aliased CNN even in the data-limited situation.
The effect of our method on VGG-16 seemed somewhat small. We found that this was partly due to the sub-optimal hyper-parameter λ = 70 for VGG-16, as this λ was tuned on ResNet-18 in the 50 images/class situation. We searched for the optimal λ again, from among λ ∈ {100, 200, 300, 400, 500}, and found that it was 400 in the 25 images/class situation, where the method achieved 71.61%, a higher accuracy than the pre-trained CNN. Therefore, even on VGG-16, the proposed method may have a similar effect as on ResNet-18 and 34. It is worth pointing out that even with a sub-optimal λ, the proposed method showed a gain compared with simple fine-tuning. Determining the hyper-parameters that maximize accuracy will be the focus of future work.

4) COMPARATIVE ANALYSIS
In this subsection, we present a comparison of our method with existing methods. We examined fine-tuning with frozen parameters in certain shallow layers [14], [15], several knowledge transfer methods other than ours, and variations of the proposed method. Fine-tuning with frozen parameters helps avoid overfitting in the data-limited situation. As the other knowledge transfer methods, we used transferring knowledge from the output distribution [17]-[19], from spatial attention maps [32], from the Gram matrix of the intermediate representations [33], and from the sign values of the representations [34]. The formulations of these methods and the hyper-parameter settings are provided in the Appendix. One variation of the proposed method uses only the pixel-level loss (λ = 60 was optimal) and the other uses only the global-level loss (λ = 70 was optimal). Table 2 lists the comparison results for ResNet-18 in the 50 images/class situation. The fine-tuning methods with frozen parameters (second to fifth rows) could not obtain higher accuracy than simple fine-tuning (67.88%). The blur filters also degrade the representations of the shallow layers, so freezing their parameters makes it difficult to build new anti-aliased representations from the degraded ones. We can see in the table that as the number of frozen parameters increases, the accuracy of the anti-aliased CNN decreases. Therefore, although freezing parameters is a promising way to avoid overfitting when training conventional CNNs, it is difficult to derive the same benefit when fine-tuning anti-aliased CNNs.
Next, we evaluate the other knowledge transfer methods (sixth to ninth rows), some of which [32]-[34] are improved versions of the method proposed by Romero et al. [20] (i.e., the pixel-level loss in our method). Interestingly, all of them obtained higher accuracy than simple fine-tuning but not than the pre-trained CNN (69.76%) or fine-tuning with the pixel-level loss (70.07%). In other words, while the other knowledge transfer methods could not fully transfer the knowledge from the representations in the pre-trained CNN, the pixel-level loss was able to do so. This may be because these four methods mitigate the constraining effect of the knowledge transfer compared with the pixel-level loss. As discussed in Subsection II-B, such methods may be insufficient in data-limited situations. For example, transferring knowledge from the output distribution [17] does not constrain the intermediate representations directly, and the methods of [32]-[34] do not directly transfer the pixel-level knowledge but rather transform it into spatial attention maps, a Gram matrix, or sign values. The results in Table 2 show that mitigating the constraining effect of the knowledge transfer by these transformations is not effective in the data-limited situation.
We also compared the proposed method with its variations. As we can see in Table 2, all of them were more accurate than the pre-trained CNN, unlike the other methods. This suggests that the proposed method can effectively build the new anti-aliased representations from the degraded ones while avoiding overfitting. We also found that using only the pixel-level loss or only the global-level loss achieved good results, suggesting that each loss transfers almost all of the essential knowledge from the pre-trained CNN even though it focuses only on detailed or coarse knowledge. However, the best accuracy was observed when our method used both losses. This suggests that each loss by itself is not sufficient to transfer the full essential knowledge, and that the complementary behavior of the pixel-level and global-level losses works well for achieving high image recognition accuracy.

5) MEASURING SHIFT-INVARIANCE BY CONSISTENCY
In previous studies, anti-aliased CNNs achieved a high degree of shift-invariance [7], [25], i.e., they were less affected by small spatial shifts or translations than conventional CNNs. A natural question about our method is whether an anti-aliased CNN fine-tuned with it retains this shift-invariance along with the recognition accuracy. We evaluate this point by calculating the consistency, which measures the shift-invariance of a CNN.
We used the consistency metric proposed by Zhang [7]. Intuitively, consistency measures how often a CNN outputs the same recognition result given two different shifts of the same image. It is formulated as

Consistency = E_{x∈D, h1,w1,h2,w2} [ 1( arg max o_{h1,w1} = arg max o_{h2,w2} ) ],

where x is an image drawn from the whole dataset D, and (h1, w1) and (h2, w2) parameterize the two shifts (height/width). o_{h,w} is the output of the student model, i.e., student(x_{h,w}), where x_{h,w} denotes the input image shifted by (h, w). 1 denotes the indicator function. In the experiment, we selected the shift parameters h1, w1, h2, w2 at random. Table 3 shows the consistency values of several CNN models based on ResNet-18. Since the shift parameters were selected at random, we calculated the consistency five times and averaged the results. The consistency values of the proposed and simple fine-tuning methods are from the 50 images/class situation. We also used the anti-aliased ResNet-18 model provided by Zhang [7], which was trained from scratch using all of the training data in ImageNet 2012. It achieved a higher recognition accuracy (71.69%) than the anti-aliased CNN fine-tuned with our method (70.22%), but since it uses much more training data than is available in the data-limited situation, it is not considered a direct competitor.
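The consistency metric can be sketched as follows. This is a minimal NumPy illustration assuming a hypothetical `predict` function that returns class scores; circular shifts via `np.roll` stand in for the crop-based shifts used in practice.

```python
import numpy as np

def consistency(predict, images, shifts, rng=None):
    """Fraction of images for which the model's top-1 prediction agrees
    between two random shifts of the same image.
    predict: image -> 1-D array of class scores (assumed interface)
    images:  iterable of 2-D arrays
    shifts:  maximum shift in pixels along each axis."""
    rng = np.random.default_rng(rng)
    agree = 0
    for x in images:
        h1, w1, h2, w2 = rng.integers(0, shifts + 1, size=4)
        o1 = predict(np.roll(x, (h1, w1), axis=(0, 1)))
        o2 = predict(np.roll(x, (h2, w2), axis=(0, 1)))
        agree += int(np.argmax(o1) == np.argmax(o2))
    return agree / len(images)
```

Because the shift parameters are random, repeated runs give slightly different values, which is why the paper averages over five runs.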
As we can see in Table 3, our proposed method obtained higher consistency values than the pre-trained CNN. This suggests that our method retains the shift-invariant property of anti-aliasing. More interestingly, it also obtained a higher consistency than the anti-aliased ResNet-18. Since the simple fine-tuning and the anti-aliased ResNet-18 have almost the same consistency values, this shift-invariant property of our method probably does not come from the fine-tuning itself. These findings indicate that our proposed method may be an important component for building shift-invariant CNN models.

B. EXPERIMENTS ON CALTECH-256
In this subsection, we evaluate our method with the second scenario, where we fine-tuned the ImageNet 2012 pre-trained CNN with Caltech-256 [22]. This scenario is often referred to as ''transfer learning.''

1) DATASET
We used the Caltech-256 [22] dataset for the natural image recognition task. Caltech-256 consists of a total of 30,607 images split into 256 classes.
In the experiments, we randomly selected 5, 10, 15, 20, 30, and 45 training images per class from the full training set to prepare data-limited situations. The remaining images, not sampled for training, were used as the test images.
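A minimal sketch of this per-class sampling (function and variable names are our own, not from the paper):

```python
import random
from collections import defaultdict

def split_per_class(samples, n_train, seed=0):
    # samples: list of (image_id, label) pairs; draw n_train training images
    # per class at random, the rest become the test set
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in samples:
        by_class[item[1]].append(item)
    train, test = [], []
    for label in sorted(by_class):
        items = list(by_class[label])
        rng.shuffle(items)
        train.extend(items[:n_train])
        test.extend(items[n_train:])
    return train, test
```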

2) OPTIMIZATION DETAILS
The optimization in this experiment is quite similar to that in Subsection IV-A, so here we explain only the details that differ. We used ResNet-18 [6] as the pre-trained CNN architecture. The learning rate was set to 0.001 in all but the final fully connected layer (FC in Figure 1), where it was set to 0.01. This is because the numbers of classes in ImageNet 2012 and Caltech-256 differ, so the fully connected layer had to be re-initialized to fit the new number of classes. The number of training epochs was set to 60 to ensure sufficient convergence. We used λ = 70 as the hyper-parameter of the proposed method across all experiments in this subsection. Note that this hyper-parameter was tuned in the 50 images/class situation on ImageNet 2012.
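In PyTorch-style training, such a per-layer learning rate is typically expressed with optimizer parameter groups. The sketch below is our own and assumes torchvision's ResNet naming, where the final layer's parameters are `fc.weight` and `fc.bias`; it builds the groups without depending on the framework itself:

```python
def build_param_groups(named_params, base_lr=0.001, fc_lr=0.01):
    # route the re-initialized final fully connected layer ("fc.*") to the
    # larger learning rate; every other parameter keeps the base rate
    fc_params, other_params = [], []
    for name, param in named_params:
        (fc_params if name.startswith("fc") else other_params).append(param)
    return [{"params": other_params, "lr": base_lr},
            {"params": fc_params, "lr": fc_lr}]
```

The returned list has the shape expected by, e.g., `torch.optim.SGD(model.parameters(), ...)`-style constructors that accept parameter groups.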

3) RESULTS
We evaluated the quantitative effects of the proposed and comparison methods in the data-limited situation. Table 4 lists the experimental results. The first row shows the accuracies of the fine-tuned conventional ResNet-18 (i.e., without blur filters), and the remaining rows show the accuracies of the anti-aliased CNNs. First, our method achieved the best accuracy in all experimented situations, which demonstrates that it is also effective in scenarios such as transfer learning. It is worth noting that this result was obtained without hyper-parameter tuning in this experimental setting: the tuned hyper-parameter λ = 70 is the optimal value for the 50 images/class situation on ImageNet 2012, not Caltech-256. This implies that our method is robust to the choice of hyper-parameters; moreover, tuning the hyper-parameter on Caltech-256 itself may yield even higher accuracy. The simple fine-tuning did not achieve higher accuracy than the conventional (no blur) model in the 5, 10, and 15 images/class situations. This result is probably due to overfitting to the limited training data and is consistent with the results in Subsection IV-A. In fact, as the training data increased, the performance of the simple fine-tuning approached that of our method and surpassed the conventional model.
We also evaluated the proposed method using only the pixel-level loss and found that it achieved the second-best accuracy in the data-limited situation, following the variant using both losses. This result is consistent with those in Table 2. However, the gap between the two methods was larger than in Table 2, which implies that the complementary behavior of the pixel-level and global-level losses is more important in transfer learning. One reasonable hypothesis is that the aliased signals in the pre-trained CNN contribute somewhat to recognizing ImageNet 2012 but not Caltech-256. In the pre-training with ImageNet 2012, the CNN is trained only with the objective of maximizing recognition accuracy. Consequently, CNNs tend to use any available intermediate representations to maximize accuracy, even those that are not interpretable to humans, such as aliased signals. Several previous studies have shown this tendency of CNNs [40], [41]. Therefore, the pre-trained CNN transforms the aliased signals layer by layer so that they contribute to ImageNet 2012 recognition. However, the intermediate representations obtained by this transformation are useless for Caltech-256, since they are tuned only for ImageNet 2012 and not for general image recognition. As a result, the global-level loss, which penalizes the pixel-level loss for transferring too much non-essential knowledge, becomes more significant. Furthermore, as the training data increased, the gap between our method with and without the global-level loss became small. This result is reasonable in that, with more training data, the anti-aliased CNN becomes able to build the new anti-aliased representations without the complementary behavior of the proposed losses.

4) EVALUATION OF ANTI-ALIASING EFFECT OF OUR METHOD
Finally, we qualitatively evaluate the anti-aliasing effect of our knowledge transferred fine-tuning. Figure 4 shows the input image and the corresponding intermediate representations of three CNNs, which are the models shown in the first (''ResNet-18''), third (''Only pixel-level loss''), and fourth (''Proposed method'') rows in Table 4. All CNNs were fine-tuned in the five images/class situation. We show some feature maps from the intermediate representation of the 9th convolution layer. Each row shows the feature map of the same channel.
We can see in Figure 4 that, in the conventional ResNet-18, many aliased signals were present in the intermediate representations. In contrast, the other two CNNs (Proposed method, Only pixel-level loss) clearly suppressed the aliasing effects, since they have the (low-pass) blur filters. This result is consistent with Zhang's [7], and such anti-aliasing effects lead the anti-aliased CNN to achieve high recognition accuracy. The difference between the proposed method and the method with only the pixel-level loss is rather slight, but our method (i.e., using both the pixel-level and global-level losses) shows stronger anti-aliasing effects than using only the pixel-level loss. For clarity, we show line profiles of the pixel values under the feature maps. These results demonstrate the effectiveness of our method in that the global-level loss penalizes the pixel-level loss if it transfers too much of the non-essential ''aliased'' knowledge.
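Such a line profile is simply one row of a feature map plotted as a 1-D signal; a minimal sketch (our own, with a hypothetical `feature_map` argument):

```python
import numpy as np

def line_profile(feature_map, row):
    # pixel values along one horizontal line of an (H, W) feature map;
    # strong high-frequency oscillation along the profile indicates aliasing
    return np.asarray(feature_map, dtype=float)[row, :]
```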
Overall, the results in Subsections IV-A and IV-B demonstrate that knowledge transferred fine-tuning is simple yet effective for fine-tuning anti-aliased CNNs in data-limited situations. This is because our method succeeds in building the new anti-aliased intermediate representations from the degraded ones. The simple fine-tuning, which only uses the recognition loss, induces overfitting to the limited training data and does not outperform our method. We therefore conclude that our method is essential for preparing anti-aliased CNNs in data-limited situations.

V. CONCLUSION
We have developed knowledge transferred fine-tuning for building anti-aliased convolutional neural networks (CNNs) in data-limited situations. In such situations, simple fine-tuning cannot achieve high performance because it induces overfitting to the limited training data. In our method, we prepare a new anti-aliased CNN by introducing blur filters to a readily available pre-trained CNN. On the basis of the idea of knowledge transfer, our method transfers knowledge from the pre-trained CNN, which has not been overfitted to the limited training data, while fine-tuning the anti-aliased CNN. We aim to transfer only the essential knowledge from the pre-trained CNN because its intermediate representations may include non-essential ''aliased'' knowledge. To achieve this aim, our method transfers the knowledge using two types of loss: pixel-level loss and global-level loss. The pixel-level loss transfers detailed knowledge, while the global-level loss transfers coarse recognition knowledge and penalizes the pixel-level loss if it transfers non-essential knowledge while ignoring essential knowledge.
Our evaluations on the ImageNet 2012 dataset showed that the knowledge transferred fine-tuning achieves higher accuracy than the simple fine-tuning and other comparison methods. The results on ResNet-18 and ResNet-34 indicate that our method can achieve higher accuracy than the pre-trained CNN even in the 25 images/class situation; the simple fine-tuning cannot achieve this even in the 500 images/class situation. This suggests that our method may greatly enhance the feasibility of anti-aliased CNNs because it can build a new anti-aliased CNN even in data-limited situations. Furthermore, in the fine-tuning with Caltech-256, the knowledge transferred fine-tuning achieves higher accuracy than the comparison methods, which demonstrates its effectiveness in scenarios such as transfer learning.
In future work, we plan to improve the proposed method so that it can achieve higher accuracy without hyper-parameter tuning. While the hyper-parameter tuned with other models also showed good accuracy, it was found to be suboptimal. We plan to investigate a method that determines the hyper-parameter from the behavior of the losses during training. There are various approaches to determining hyper-parameters [42]-[44] in multi-task learning [45]. However, these approaches aim to determine the balance of losses in multi-task learning, which makes it difficult to change the hyper-parameter significantly. As discussed in Subsection IV-A3, the optimal hyper-parameters are quite different between ResNet and VGG-16. Therefore, we consider that a new methodology is needed for determining the hyper-parameter for knowledge transferred fine-tuning. Finally, we also plan to investigate the generalization ability of our method in application scenarios other than those used in this paper.

APPENDIX FORMULATION OF OTHER KNOWLEDGE TRANSFER METHODS
In this appendix, we formulate the other knowledge transfer methods used in Subsection IV-A4 and describe the tuned hyper-parameters. The following four methods were used in our comparisons: 1) output distribution [17], [18], 2) spatial attention maps [32], 3) Gram matrix of the representations [33], and 4) sign values of the representations [34]. Hereafter, we formulate them in order.
First, in the knowledge transfer from the output distribution [17]-[19], the loss for fine-tuning adds the transfer term λ L_out to the recognition loss. L_out is the softmax cross-entropy loss with a temperature between the outputs of the teacher and student, defined as

L_out = − Σ_{k=1}^{K} y^t_{k,T} log y^s_{k,T},

where y^t_{k,T} and y^s_{k,T} are the temperature-scaled softmax outputs of the teacher and student for class k, and T is a temperature parameter often used in this knowledge transfer method [17], [19]. It is known to help avoid overfitting to limited training data [17], [18]. We tuned the hyper-parameters λ ∈ {1, 2, 3, 4, 5} and T ∈ {1, 2, 3, 4, 5} and found λ = 4 and T = 2 to be optimal. Second, in the knowledge transfer from the spatial attention maps [32], the loss for fine-tuning adds the transfer term λ L_at to the recognition loss. Strictly speaking, we need to calculate the loss L_at on each transfer unit; we omit this for the sake of clarity. L_at transfers the knowledge from the spatial attention maps as L_at = d(AT(t), AT(s)).
Here, AT(A) = F(A) / ||F(A)||_2, with F(A) = Σ_{i=1}^{C} |A_i|^2, where A_i (∈ R^{H×W}) indicates the i-th channel of the intermediate representation A. Therefore, in Equation (15), F(A) simply sums up the squared intermediate representation along the channel direction, and AT(A) normalizes it. As the distance function, the mean squared error (MSE) was used in Zagoruyko and Komodakis [32], but for a fair comparison, we used the same function as the proposed method, i.e., the MAE. We tuned the hyper-parameter λ ∈ {50, 60, 70, 80, 90, 100}, and found λ = 70 to be optimal. Third, in the knowledge transfer from the Gram matrix of the intermediate representation [33], the loss for fine-tuning adds the transfer term λ L_gram to the recognition loss. We omit the calculation on each transfer unit for the sake of clarity, the same as Equation (13). L_gram transfers the knowledge from the Gram matrix as

L_gram = d(Gram(t), Gram(s)),

where the function Gram(·) transforms the intermediate representation into its Gram matrix, e.g., Gram(t) = t_flat · t_flat^⊤. Here, t_flat ∈ R^{C×(H·W)} is a 2-D tensor flattened from its 3-D version, t (∈ R^{C×H×W}). This Gram matrix is known to represent the texture information of the input image [36]. As the distance function, the MSE was used in Huang and Wang [33], but for a fair comparison, we used the same function as the proposed method, i.e., the MAE. We tuned the hyper-parameter λ ∈ {50, 60, 70, 80, 90, 100}, and found λ = 60 to be optimal. Finally, in the knowledge transfer from the sign values of the representations [34], the loss for fine-tuning adds the transfer term λ L_abt to the recognition loss. We omit the calculation on each transfer unit for the sake of clarity, the same as Equation (13). L_abt transfers the knowledge from the sign values of the representations as

L_abt = (1/M) Σ_{i,j,k} e_{i,j,k},

e_{i,j,k} = d(µ, s_{i,j,k})   if t_{i,j,k} > 0 and |s_{i,j,k}| < µ,
            d(−µ, s_{i,j,k})  if t_{i,j,k} ≤ 0 and |s_{i,j,k}| < µ,
            0                 otherwise,

where µ is the margin parameter and M = C × H × W. t_{i,j,k} and s_{i,j,k} are the elements of the teacher and student representations at row j and column k on the i-th channel.
By properly setting µ, Equation (20) constrains the sign values of the teacher and student representations to match each other.
Heo et al. [34] used the MSE as the distance function, but for a fair comparison, we used the same function as the proposed method, i.e., the MAE. In this method, we used the intermediate representations before the Rectified Linear Unit (ReLU) non-linearity [46], since this method needs negative representation values in Equation (20). Note that the representations used in our proposed method are taken after the ReLU. Finally, we tuned the hyper-parameter λ ∈ {50, 60, 70, 80, 90, 100}, and found λ = 80 to be optimal.
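The four transfer losses above can be sketched in NumPy as follows. This is our own illustration, not the papers' implementations: d is taken as the MAE where a distance is needed, representations follow the C × H × W convention, and the per-transfer-unit summation is omitted as in the text.

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax over the class axis, computed stably
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kd_output_loss(teacher_logits, student_logits, T=2.0):
    # L_out: soft cross-entropy between temperature-scaled teacher and student
    y_t = softmax(teacher_logits, T)
    y_s = softmax(student_logits, T)
    return float(-np.mean(np.sum(y_t * np.log(y_s + 1e-12), axis=1)))

def attention_map(a):
    # AT(A): sum of squared channels F(A), then L2 normalization; a: (C, H, W)
    a = np.asarray(a, dtype=float)
    f = np.sum(a ** 2, axis=0)                # F(A), shape (H, W)
    return f / (np.linalg.norm(f) + 1e-12)

def gram_matrix(t):
    # Gram(t): flatten (C, H, W) -> (C, H*W), then channel inner products
    t = np.asarray(t, dtype=float)
    t_flat = t.reshape(t.shape[0], -1)
    return t_flat @ t_flat.T                  # (C, C)

def ab_loss(t, s, mu=1.0):
    # L_abt: margin penalty on sign disagreement, averaged over all elements
    t = np.asarray(t, dtype=float)
    s = np.asarray(s, dtype=float)
    inside = np.abs(s) < mu                   # only penalize within the margin
    pos = (t > 0) & inside                    # teacher active: pull s toward +mu
    neg = (t <= 0) & inside                   # teacher inactive: pull toward -mu
    err = np.zeros_like(s)
    err[pos] = np.abs(mu - s[pos])            # d(mu, s) with d = MAE
    err[neg] = np.abs(-mu - s[neg])           # d(-mu, s)
    return float(err.mean())                  # averages over M = C*H*W elements
```

Each function corresponds to one comparison method's transfer term; in fine-tuning, the term is weighted by λ and added to the recognition loss.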