FS-BAN: Born-Again Networks for Domain Generalization Few-Shot Classification

Conventional few-shot classification (FSC) aims to recognize samples from novel classes given limited labeled data. Recently, domain generalization FSC (DG-FSC) has been proposed with the goal of recognizing novel class samples from unseen domains. DG-FSC poses considerable challenges to many models due to the domain shift between base classes (used in training) and novel classes (encountered in evaluation). In this work, we make two novel contributions to tackle DG-FSC. Our first contribution is to propose Born-Again Network (BAN) episodic training and comprehensively investigate its effectiveness for DG-FSC. As a specific form of knowledge distillation, BAN has been shown to achieve improved generalization in conventional supervised classification with a closed-set setup. This improved generalization motivates us to study BAN for DG-FSC, and we show that BAN is promising for addressing the domain shift encountered in DG-FSC. Building on the encouraging findings, our second (major) contribution is to propose Few-Shot BAN (FS-BAN), a novel BAN approach for DG-FSC. Our proposed FS-BAN includes novel multi-task learning objectives: Mutual Regularization, Mismatched Teacher, and Meta-Control Temperature, each of which is specifically designed to overcome the central and unique challenges in DG-FSC, namely overfitting and domain discrepancy. We analyze different design choices of these techniques. We conduct comprehensive quantitative and qualitative analysis and evaluation over six datasets and three baseline models. The results suggest that our proposed FS-BAN consistently improves the generalization performance of baseline models and achieves state-of-the-art accuracy for DG-FSC. Project Page: https://yunqing-me.github.io/Born-Again-FS/.


I. INTRODUCTION
WHILE modern deep learning models achieve superior performance in many visual recognition tasks, e.g., image classification [8] and object detection [46], they require large amounts of labeled data during training [50]. In contrast, in few-shot classification (FSC) [9], [49], [53], [10], models are required to classify samples from novel categories given only a few labeled data from each category.

A. Domain Generalization FSC
Recently, meta-learning based FSC [53], [60], [10], [29] has achieved outstanding performance in the single domain setup, where the base classes for training and the novel classes for evaluation are from the same domain. However, in real-world applications, deployed models are often required to classify objects from domains that are unseen during training, given limited labeled data (e.g., recognizing rare bird species in a fine-grained setup [57]). In particular, our work addresses this challenging domain generalization (DG) FSC: to recognize samples from novel classes of unseen domains given only a few labeled data of each class. We follow recent DG-FSC works [57], [51] and assume access to several seen domains during training; however, we do not have access to samples from the unseen domains that will be encountered during evaluation. DG-FSC has attracted a fair amount of attention recently [6], [57], [51], [39]. Due to the significant discrepancy between the seen domains used in training and the unseen domains encountered in evaluation, existing FSC models designed only for the single domain setup often perform poorly [6].

Yunqing Zhao and Ngai-Man Cheung are with the Information Systems Technology and Design Pillar, Singapore University of Technology and Design, Singapore 487372 (email: yunqing_zhao@mymail.sutd.edu.sg, ngaiman_cheung@sutd.edu.sg). Correspondence to: Ngai-Man Cheung.

Fig. 1: In this visualization, we select 5 novel classes with 200 query samples per class from an unseen domain (Places [73]). Each point indicates the feature representation of RelationNet (RN) [53] with a backbone network (ResNet-10 [20]), projected by LDA [17], [40]. We use the linear regression prediction accuracy ("LR-Acc") to demonstrate the improved decision boundaries: baseline RN (left), BAN episodic training applied to RN (mid), and our further proposed FS-BAN applied to RN (right). See numerical results and comparisons in Sec. VI.
Therefore, DG-FSC still has much room for improvement in generalization under the domain shift setup.

B. Born-Again Networks (BANs)
In their pioneering work, Breiman and Shang [3] proposed born-again trees. Given a complex predictor, e.g., an ensemble of multiple trees, they train a single tree whose outputs (decisions) match those of the complex predictor. This single born-again tree is simpler and more interpretable than the complex predictor while still maintaining decent decision performance [59]. More recently, [12] investigated knowledge transfer [22] from one model (the teacher) to another (the student). Focusing on conventional image classification tasks, they first train the teacher network to convergence using the standard cross-entropy loss; then, they train the student network with the dual goals of predicting the correct label and matching the teacher's probability prediction.

arXiv:2208.10930v4 [cs.CV] 8 May 2023
Surprisingly, even though the teacher and student models have identical network structures, and the same training data is used for teacher training and the knowledge transfer process, they reported that with this BAN approach the student consistently outperforms the teacher's accuracy in various conventional image classification setups, e.g., DenseNets [24] on CIFAR-10 and CIFAR-100 [28]. The student models were found to generalize better. This is attributed to the distillation of dark knowledge, i.e., the teacher's predictions on the wrong outputs, and importance weighting, i.e., the teacher's confidence on the correct outputs. Recently, Zhu and Li [2] presented a rigorous analysis of this improved generalization. From the perspective of multi-view data structure, they argue that the BAN approach can be viewed as a combination of implicit ensembling and knowledge distillation [22], enabling the student model to learn multi-view features and eventually achieve better generalization than teacher models with identical structures. Beyond conventional image classification, BAN has been applied in other areas, e.g., multi-task natural language processing [7].

C. Motivation and Our Contributions
This work is motivated by the empirical results and theoretical analysis presented by [12] and [2]. Both works suggested that BANs can achieve improved generalization without modifying the network structure, which could be extremely useful for existing FSC models, especially under domain shift. In particular, our first contribution is to propose BAN episodic training for DG-FSC. Note that previous work has focused on applying BAN in conventional supervised training [12], [2], [55]; our work on applying BAN in episodic training is novel. In Sec. IV, we discuss the subtleties in BAN episodic training and perform a rigorous study to show that BAN can lead to models with improved generalization on novel tasks sampled from an unseen domain. Furthermore, we also validate that BAN enables the learning of more compact features with a lower intra-class to inter-class variance ratio, which is useful for few-shot learning as discussed in [17] (see the Linear Discriminant Analysis (LDA) [40] of features in Figure 1).
Based on the encouraging results in Sec. IV, our second contribution is to propose Few-Shot BAN (FS-BAN), which addresses the unique issues in DG-FSC. Specifically, different from conventional image classification, DG-FSC poses unique challenges that inhibit the improvement of BAN episodic training: (i) because of the limited labeled data in FSC, the teacher model in BAN training may suffer from overfitting, which degrades the knowledge transferred to the student model; (ii) in DG-FSC, the student model needs to handle unseen domains during the evaluation stage.
To address the above challenges in BAN for DG-FSC, we propose FS-BAN (Sec. V), which builds upon the baseline BAN method (Sec. IV). FS-BAN consists of novel multi-task learning objectives: (i) Mutual Regularization (MR): we extend BAN with additional feedback from the student to the teacher, encouraging the teacher to continue to improve using soft predictions from the student. The student's soft prediction provides additional regularization that alleviates overfitting in the teacher model. This technique achieves significant improvements in all experiments. (ii) Mismatched Teacher (MM): to address domain shift, we propose the technique of a mismatched teacher: a teacher model trained on a domain different from that of the current training task. Our proposed mismatched teacher is an imitation procedure that exposes an FSC model to domain shift during the training stage. We show in experiments that this imitation during training leads to better generalization to unseen domains and better domain robustness. (iii) Meta-Control Temperature (MCT): temperature is an important parameter that controls the distillation of knowledge in BAN training [22]. It is usually treated as a hyperparameter and manually pre-set to a fixed value for the entire training (regardless of different domains and tasks). In contrast, we propose to meta-learn the temperature during training to improve adaptation to diverse domains.
The proposed FS-BAN can be readily applied to existing FSC models without modifying their structure. Experiment results show that FS-BAN achieves new state-of-the-art results for DG-FSC on six benchmark datasets, with three popular FSC baseline models. We further show in comprehensive ablation studies that the different learning objectives in FS-BAN indeed address the challenges described above.
Our contributions in this paper are summarized as follows: 1) As a pioneering effort, we propose BAN episodic training as our first contribution (Sec. IV). We carefully study its effectiveness for DG-FSC, compare it to related work, and empirically validate its improved generalization. 2) As our second contribution, we propose FS-BAN for DG-FSC (Sec. V). FS-BAN consists of multi-task learning objectives that better address the unique challenges posed by DG-FSC: few labeled support data in an episode and domain shift in the testing phase. FS-BAN overcomes these challenges; to our knowledge, such an approach has not been explored before. 3) We conduct extensive experiments and show that FS-BAN consistently improves three baseline FSC models on six public datasets. Our approach outperforms the state of the art in both the conventional FSC and DG-FSC setups. We also perform detailed ablation studies to demonstrate the effectiveness of FS-BAN.

II. RELATED WORKS
In this section, we review the literature from different perspectives, as our work involves FSC, domain generalization, and effective knowledge transfer. We highlight how our problem setup differs from, and is more challenging than, the closely related traditional FSC and domain generalization tasks.

A. Metric Learning for Few-Shot Classification
FSC [10], [49] models aim to recognize novel classes given few labeled data. Among them, metric learning based methods [49], [60], [53], [41] learn to compare the relation between the unlabeled query data and the labeled support data. The prediction for each query image is a confidence (probability) distribution over the categories of a training task. Metric learning based ideas have attracted a fair amount of attention for FSC tasks. Meanwhile, there is no need to further fine-tune the model parameters or select hyperparameters at test time [55], [6].

Fig. 2: In conventional supervised learning, BAN samples a batch of images {(x, y) ∈ (X, Y)} of all categories in the dataset and distills the knowledge from the teacher model to the student in each generation. In related work, Tian et al. [55] conducted the born-again process over generations to obtain a powerful backbone network and transferred it to the downstream FSC task.
In this paper, we focus our experiments on three popular metric-based FSC models as baselines, similar to recent work [57]: MatchingNet [60], RelationNet [53], and Graph Neural Network (GNN) [13], due to their simplicity and easy implementation. However, these models often fail to make predictions on novel tasks from unseen domains, due to domain shift [6], [67], [56] and overfitting to the base-class data from the source domains seen in training. Therefore, our proposed FS-BAN builds on these models and aims for further improvement and generalization.

B. Domain Generalization FSC
The traditional domain adaptation (DA) problem typically allows the model to learn with sufficient unlabeled data from the target domain [36], [34], [66], [1], [19] during the training stage. Therefore, the domain discrepancy between the source and target domains can be explicitly reduced. Different from DA, domain generalization (DG) [30], [42] aims to learn feature representations that generalize well to unseen domains at test time [70], [69], [71]. For traditional supervised classification tasks, [31], [14] propose to add regularization objectives during training to improve generalization performance. However, the label space for training and testing is shared; therefore, there is still prior knowledge of the target domain.
In DG-FSC, models need to recognize samples of novel categories from unseen domains, given only a few (e.g., 5-shot) labeled support data. Very recently, [57] applied the learned feature-wise transformation layer (LFT) [44] to modulate the channel-wise scale and shift parameters, aiming to produce diverse and entangled feature representations of different domains. [51] applies explanation-guided layer-wise relevance propagation (LRP) to enhance discriminative features during training with multiple seen domains. [68] addresses a similar problem but uses unlabeled data from target domains in the training phase.
Our proposed FS-BAN, in contrast, aims to improve the generalization of FSC models under episodic training in the DG-FSC setup: it overfits less to hard targets and is more robust to arbitrary unseen domains with disjoint label spaces (e.g., training on the Cars domain [27] but testing on bird species [21]) during evaluation. Our setup is more challenging than conventional supervised learning but closer to real-world applications and model deployment environments.

C. Knowledge Distillation and Born-Again Network
Knowledge distillation (KD) [22], [4] typically aims to transfer the "knowledge" of a larger and stronger machine learning model (the teacher), learned on a large-scale dataset, to another, compact model (the student) with a small training dataset [33], [72]. KD has shown empirical benefits in several applications, e.g., model compression [62] and transfer learning [65]. Usually, a KD method trains a student network that benefits from the teacher's knowledge and obtains good performance.
Born-Again Network (BAN) [12] is a special case of KD that transfers knowledge from well-trained teacher(s) to a student with an identical network structure and training data. Taking advantage of this, BAN can produce multiple generations by repeating the knowledge-transfer process (we discuss this in Sec. III). Surprisingly, previous works [12], [7] found that the student can consistently outperform the teacher in prediction accuracy on conventional supervised learning tasks, which suggests improved generalization to the test data. Recently, [55] applied BAN to a conventional classification task (i.e., Figure 2) to obtain a backbone network. Then, they apply a standard transfer learning pipeline on the student model to handle downstream single-domain FSC tasks. In this work, we design FS-BAN for DG-FSC, which takes advantage of BAN's improved generalization without modifying the network structure or requiring additional training data. Compared to the similar work [55], our designs are clearly different, as shown in Figure 3 and Figure 5. Comparison results with [55] show the superiority of our designs (see Table V).

III. PRELIMINARY
In this section, we discuss the concepts of BAN and DG-FSC. Concretely, in Sec. III-A, we review the mechanism of BAN in conventional supervised image classification; in Sec. III-B, we formulate the DG-FSC problem setup and the episodic training process of existing FSC models.

Fig. 3: Our proposed BAN episodic training. A task $\mathcal{T}$ with $N_w$ categories is sampled (here $N_w = 3$). The support set of $\mathcal{T}$ is applied to adapt the teacher and student models. Then the teacher, conditioned on the support set, predicts the query samples of $\mathcal{T}$ and transfers the knowledge to the student. Compared to [55] (see Sec. IV), which adopts the transfer learning approach, we directly apply BAN in episodic training, which simulates the realistic setting of the evaluation phase for FSC.

A. BANs for Conventional Supervised Image Classification
We follow the definition of BAN in conventional classification problems [12]. Consider a dataset containing image samples $\mathcal{X}$ and true labels $\mathcal{Y}$. Generally, the prediction on the input samples $\mathcal{X}$ is parameterized by a network $f_{\theta_0}(\mathcal{X})$. $f_{\theta_0^*}(\cdot)$ is called the teacher network, and it can be obtained by minimizing the cross-entropy loss w.r.t. the ground-truth labels:

$\theta_0^* = \arg\min_{\theta_0} \mathcal{L}_{CE}(\mathcal{Y}, \hat{\mathcal{Y}}_{\theta_0})$,   (1)

where $\hat{\mathcal{Y}}_{\theta_0} = \mathrm{SS}(f_{\theta_0}(\mathcal{X}), \tau)$. $\mathrm{SS}(\cdot, \tau)$ is the SoftMax function with a temperature $\tau$ over $N$ training classes:

$\mathrm{SS}(z, \tau)_i = \frac{\exp(z_i/\tau)}{\sum_{j=1}^{N} \exp(z_j/\tau)}$,   (2)

where $z$ is the input to the SoftMax layer. Eqn. 2 softens or sharpens the soft predictions when $\tau > 1$ or $\tau < 1$, respectively. As in Figure 2, BAN enables another model ($f_{\theta_1}(\cdot)$, the student) to exploit the rich information contained in the teacher's predicted probability distribution, by minimizing the distance $D$ between the output distribution of the teacher $f_{\theta_0^*}(\cdot)$ and that of the student $f_{\theta_1}(\cdot)$:

$\min_{\theta_1} \; \lambda_1 \mathcal{L}_{CE}(\mathcal{Y}, \hat{\mathcal{Y}}_{\theta_1}) + \lambda_2 D(\hat{\mathcal{Y}}_{\theta_0^*}, \hat{\mathcal{Y}}_{\theta_1})$,   (3)

where the first term is the classification loss w.r.t. the one-hot ground truth, and the second term employs the soft prediction of the fixed teacher model for knowledge transfer.
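As a concrete illustration of the tempered SoftMax in Eqn. 2, consider the following minimal NumPy sketch (the function name `tempered_softmax` is ours, not from the paper):

```python
import numpy as np

def tempered_softmax(z, tau=1.0):
    """SoftMax with temperature tau over logits z: tau > 1 softens the
    distribution, tau < 1 sharpens it (cf. Eqn. 2)."""
    z = np.asarray(z, dtype=np.float64) / tau
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p_sharp = tempered_softmax(logits, tau=0.5)  # peaked distribution
p_soft = tempered_softmax(logits, tau=4.0)   # mass spread across classes
```

A higher temperature spreads probability mass more evenly over classes, which is what lets the student see the teacher's "dark knowledge" about non-target classes.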
Since the student has an identical structure and training data to the teacher, this born-again process can be applied sequentially over multiple generations: in the k-th generation (gen-k, k ≥ 1), the student $f_{\theta_k}(\cdot)$ is trained to optimize a sum of the cross-entropy loss and the distance between its prediction and the soft targets from the student obtained in gen-(k-1):

$\min_{\theta_k} \; \lambda_1 \mathcal{L}_{CE}(\mathcal{Y}, \hat{\mathcal{Y}}_{\theta_k}) + \lambda_2 D(\hat{\mathcal{Y}}_{\theta_{k-1}^*}, \hat{\mathcal{Y}}_{\theta_k})$.   (4)

The student $f_{\theta_{k-1}^*}(\cdot)$ obtained in the (k-1)-th generation now becomes the new teacher. In particular, $f_{\theta_0^*}(\cdot)$ denotes the first teacher, trained with only the cross-entropy loss w.r.t. one-hot labels in gen-0. Interestingly, previous work [12], [55] reported improved generalization of the student network with BAN training in this conventional supervised learning setup, which motivates us to investigate BAN for DG-FSC. We discuss this in Sec. IV.
Episodic training. We denote the input images as $\mathcal{X}$ and the corresponding labels as $\mathcal{Y}$. In each training iteration, instead of directly sampling a batch of images with their true labels (as in conventional supervised learning), we sample an $N_w$-Way (number of classes) $N_s$-Shot (number of labeled samples per class) task $\mathcal{T}$ of a source domain $\mathcal{D}$ from several seen domains. The support set $\mathcal{S}$ and the query set $\mathcal{Q}$ are formed by randomly selecting $N_s$ and $N_q$ samples, respectively, from each of the $N_w$ categories (usually, $N_w = 5$). In this context, the batch size is one task (or an episode), and the samples in $\mathcal{S}$ and $\mathcal{Q}$ are pseudo-labeled; their labels change across episodes.
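The episodic sampling described above can be sketched as follows (an illustrative sketch; the function name `sample_task` and the dict-based dataset layout are our assumptions, not the paper's code):

```python
import random

def sample_task(dataset, n_way=5, n_shot=1, n_query=15):
    """Sample one N_w-Way N_s-Shot episode.

    dataset: dict mapping class name -> list of samples.
    Returns (support, query) lists of (sample, pseudo_label) pairs,
    where pseudo-labels 0..n_way-1 are re-assigned per episode."""
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for pseudo, cls in enumerate(classes):
        items = random.sample(dataset[cls], n_shot + n_query)
        support += [(x, pseudo) for x in items[:n_shot]]
        query += [(x, pseudo) for x in items[n_shot:]]
    return support, query
```

Note that the pseudo-label assigned to a real class changes from episode to episode, which is why memorizing label identities does not help the model.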
Metric learning based FSC. Suppose a metric-based FSC model $f$ is parameterized by $\theta$. For each sampled task $\mathcal{T}$, $f_\theta(\cdot)$ first extracts the feature embeddings of both the support samples $\mathcal{S}$ and the query samples $\mathcal{Q}$; then it predicts the label of each query sample by comparing its relation to the support sample features (i.e., conditioned on the labeled support set):

$\hat{\mathcal{Y}}_\theta^q = f_\theta(\mathcal{Q} \mid \mathcal{S})$,   (5)

where $\hat{\mathcal{Y}}_\theta^q$ are the predictions for the query samples over the $N_w$ classes. Generally, we aim to minimize the prediction error on the query set with the cross-entropy loss w.r.t. one-hot labels:

$\min_\theta \; \mathcal{L}_{CE}(\mathcal{Y}^q, \hat{\mathcal{Y}}_\theta^q)$.   (6)

In the testing phase, we evaluate the accuracy on the query sets of tasks sampled from novel classes of unseen domains. We follow the DG setup [57], [51], [31]: we do not access any samples from unseen domains in the training phase. Therefore, our FSC models are expected to learn robust and discriminative knowledge that transfers well to other domains. Note that in this DG-FSC setup, the label spaces of the source domains and the target unseen domains are disjoint, different from some recent DG literature [35], [32], [31].
IV. BORN-AGAIN EPISODIC TRAINING FOR DG-FSC

BAN episodic training. Motivated by the theoretical analysis in [2] and the improved generalization observed in conventional supervised learning [12], in this section we propose BAN episodic training for DG-FSC. We conduct a rigorous study and show the effectiveness of BAN for existing FSC models under domain shift, which motivates us to propose FS-BAN (discussed in the next section). As in Figure 3 and the description in Sec. III-B, in each training iteration of DG-FSC during the k-th generation of BAN, rather than sampling a batch of images of all classes, we sample a task $\mathcal{T}$ with $N_w$ categories. We apply the support set of $\mathcal{T}$ to adapt both the teacher and student models. Then, the models, conditioned on the support set, predict the query samples of $\mathcal{T}$. After that, similar to Eqn. 4, we optimize the student network $f_{\theta_k}(\cdot)$ by leveraging the one-hot labels and the soft targets predicted by the teacher network $f_{\theta_{k-1}^*}(\cdot)$ on the same query set $\mathcal{Q}$:

$\min_{\theta_k} \; \lambda_1 \mathcal{L}_{CE}(\mathcal{Y}^q, \hat{\mathcal{Y}}_{\theta_k}^q) + \lambda_2 \tau^2 D(\hat{\mathcal{Y}}_{\theta_{k-1}^*}^q, \hat{\mathcal{Y}}_{\theta_k}^q)$,   (7)

where $\lambda_1$ and $\lambda_2$ are coefficients of the weighted sum, and $D$ is a distance between distributions; we use the JS divergence [11] as the distance metric. Meanwhile, since the magnitudes of the gradients produced by the soft targets are scaled by $1/\tau^2$, we multiply the second term of Eqn. 7 by $\tau^2$ to maintain the balance [22]. In the meta-testing phase, the temperature is set to $\tau = 1$ to evaluate the accuracy on novel tasks. The teacher is discarded; hence the outcome of BAN episodic training is the student model, without any additional parameters.

Table I (caption): It is clear that the major gain is obtained at gen-0 → gen-1. Deeper generations come with expensive training costs and diminishing improvement, and a negative impact is observed after the empirical optimal generation (gen-3 in Table I). We note that this observation is consistent with that of BAN in conventional supervised learning [12], [55].
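A minimal sketch of the loss in Eqn. 7, with the JS divergence as the distance $D$ and the $\tau^2$ gradient rebalancing, might look as follows (illustrative; `ban_episodic_loss` and its signature are our naming, and it operates on already-tempered probability vectors):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))

def ban_episodic_loss(student_probs, teacher_probs, onehot, tau,
                      lam1=1.0, lam2=1.0):
    """lam1 * CE(one-hot, student) + lam2 * tau^2 * JS(teacher, student).
    The tau^2 factor compensates for the 1/tau^2 gradient scaling of
    tempered soft targets (cf. Eqn. 7)."""
    ce = -float(np.sum(np.asarray(onehot) *
                       np.log(np.asarray(student_probs) + 1e-12)))
    return lam1 * ce + lam2 * tau**2 * js_divergence(teacher_probs, student_probs)
```

One appealing property of JS over plain KL here is symmetry and boundedness, which keeps the distillation term well behaved when teacher and student disagree strongly.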
Experiment setups. To validate the effectiveness of the proposed BAN episodic training for DG-FSC, we design two evaluation setups: (a) testing on novel classes of the same domain (miniImageNet), and (b) testing on novel classes of different unseen domains. To enable episodic training, in each iteration we sample a 5-Way 1-Shot task from the base classes of miniImageNet [45]. In the testing stage, we randomly sample 1000 tasks from novel classes of either miniImageNet or different unseen domains to evaluate the performance of BAN in setup (a) and setup (b), respectively, with the average accuracy reported. We include detailed dataset information in Sec. VI.
Results and analysis. The experiment results are shown in Figure 1 (qualitatively), Table I, and Figure 4 (quantitatively). Empirically, our observations can be summarized as follows: 1) Diminishing improvements: compared to the born-again learning step of gen-0 → gen-1, the improvement becomes small in deeper generations. We even observe a performance drop after the empirical optimal generation. Similar observations are found in other applications of conventional supervised learning [55], [63]; see the detailed analysis in Figure 4. 3) Visualization: we extract and analyze the features produced by the backbone network on a novel task during evaluation.
Compared to the baseline model, we observe that BAN leads to more discriminative features with better decision boundaries; see details in Figure 1.

Comparison with BAN transfer learning for FSC. Recently, Tian et al. [55] proposed to adopt BAN training in conventional supervised learning (as in Figure 2) to obtain a powerful backbone network as the feature encoder. Then, in evaluation, they transfer it to the unseen FSC task $\mathcal{T}$, extract features of the support set of $\mathcal{T}$, fit a new classifier, and predict the query samples. In Table II, we compare the proposed BAN episodic training with [55]. For a fair comparison, for both methods ResNet-10 [20] is the feature encoder and ProtoNet [49] is the classifier, which computes the feature distance between the query and the center of the support samples of each class (i.e., the "prototype") for prediction. We show that our proposed BAN episodic training achieves competitive performance with [55] in different (DG-)FSC setups. On the other hand, episodic training attempts to simulate a realistic evaluation setting by learning to solve FSC tasks, and it has been shown to be very useful for tackling novel, unseen classes given limited labeled data [49], [53], [60], [13]. Therefore, we are motivated to apply BAN directly in episodic training for (DG-)FSC, as in Figure 3. Critically, in contrast to Tian et al. [55], taking advantage of episodic training, we do not modify the network structure or remove/add any layers during the entire training/test phase, and the classifier of our proposed method is compatible with many existing FSC models, which can potentially achieve better performance (see experiments in Sec. VI). Next, we propose our improved method of BAN episodic training to tackle the unique challenges of DG-FSC. To pursue an efficient learning process and avoid computationally expensive sequential training, we exploit the major gain of BAN episodic training at gen-0 → gen-1 and train only a single generation of the student (i.e., k = 1 in Eqn. 7) in the rest of the paper.

Fig. 5 (caption): (1) Mutual regularization: to overcome the potential overfitting of teachers due to limited labeled data in an episode, we propose to regularize the teacher to match the soft distribution from the student. (2) Mismatched teacher: to explicitly consider domain shift in training, for a task sampled from domain $\mathcal{D}_i$, we propose to select a mismatched teacher trained on $\mathcal{D}_j$ for the knowledge transfer, where $i \neq j$. (3) Meta-control the temperature: the temperature $\tau$ is meta-updated across iterations by evaluating the performance of the updated student on a task from $\mathcal{D}_j$ ($i \neq j$).
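The ProtoNet classifier used in this comparison can be sketched as follows (a minimal NumPy sketch of nearest-prototype prediction on pre-extracted features; names are ours):

```python
import numpy as np

def prototype_predict(support_feats, support_labels, query_feats, n_way):
    """Average each class's support features into a 'prototype', then
    assign each query to the nearest prototype (squared Euclidean distance)."""
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in range(n_way)])
    # Pairwise squared distances: (n_query, n_way)
    d = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)
```

In the 1-shot case each prototype is simply the single support feature, so the classifier reduces to nearest-neighbor matching in feature space.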

V. FEW-SHOT BAN
We show in Sec. IV the promising results of BAN episodic training for DG-FSC, which indicate better generalization to novel-class tasks from unseen domains during evaluation. However, the improvement of baseline BAN may be inhibited by several unique challenges of DG-FSC: 1) A particularity of BAN is that the teacher network is trained with an identical structure and the same training data as the student. In this few-shot scenario, overfitting of the teacher network can degrade the knowledge transferred to the student. 2) DG-FSC requires the FSC model to recognize novel tasks from unseen domains that are not accessible during training. Inspired by a recent DG work [31] for conventional image classification, it is useful to imitate such domain shift during training so that domain robustness can be improved.
3) The key hyperparameter of BAN, the temperature $\tau$, is often pre-set to a fixed value across different source domains, which can be sub-optimal. For DG-FSC tasks, we instead seek a temperature suitable for the various seen domains, such that the student model generalizes better to unseen domains. To address these issues, we propose few-shot born-again networks (FS-BAN), comprising novel multi-task learning objectives with different teacher-student interactions, as shown in Figure 5. We show in experiments that these challenges are greatly mitigated with only a marginal increase in training cost.

A. Mutual Regularization
We show in Table I that BAN improves DG-FSC up to the optimal generation. We attribute this to the fact that the teacher used at gen-k (k > 1) learned cross-category knowledge [63] in the previous generation. However, since we train $f_{\theta_k}(\cdot)$ only after $f_{\theta_{k-1}}(\cdot)$ has converged, BAN suffers from sequential training, which severely reduces training efficiency.
To let the teacher also reap the benefits of soft knowledge, [63] emphasizes the importance of high-quality secondary information in the prediction distribution and proposes Top Score Difference (TSD) regularization to make the prediction distribution less peaked at the primary class of the input samples. Differently, to avoid sequential training, we propose an alternative method that makes use of the student's prediction distribution for each task. Concretely, we add a feedback path from the student to the teacher and mutually regularize (MR) the teacher and the student with the soft predictions from each other. Besides the student being trained with Eqn. 7 (k = 1), the well-trained teacher network is further fine-tuned by

$\min_{\theta_0^*} \; \mathcal{L}_{MR} = \tau^2 D(\hat{\mathcal{Y}}_{\theta_1}^q, \hat{\mathcal{Y}}_{\theta_0^*}^q)$.   (8)

As the improvements on unseen domains and the TSD analysis in the ablation study show, $\mathcal{L}_{MR}$ reliably counteracts the overfitting of the teacher. Since $N_w$ classes are randomly selected for each FSC task, the true categories corresponding to the pseudo-labels vary across tasks. This allows the teacher to learn meaningful cross-category information from samples instead of memorizing the pseudo-labels. Besides, the student can leverage the one-hot labels (in Eqn. 7) to ensure that both teacher and student are updated in the correct direction, resulting in non-degenerate solutions.

Table (caption, referencing Figure 5): We follow the experiment setup as in [57]. Let All = {miniImageNet, CUB, Cars, Places, Plantae} be the union of all domains for training and testing. In the training phase, we sample tasks from multiple seen domains, e.g., All \ {CUB}. In the testing phase, we evaluate the model on tasks sampled from the leave-one-out unseen domain, e.g., CUB. miniImageNet is always the source domain. FS-BAN-lite indicates that we do not include $\mathcal{L}_{MCT}$ in FS-BAN, since it requires more GPU memory for training.

Design Choices. There are several potential variants for regularizing the teacher network, including using both the classification loss and $\mathcal{L}_{MR}$.
However, our goal is to make the teacher a regularizer that provides soft knowledge and guides the approximate training direction, so that overfitting is not passed on to the student. Updating the teacher with the cross-entropy loss may retain overfitting. Moreover, applying $\mathcal{L}_{MR}$ to intermediate network layers is also possible, but it is computationally complex and difficult to design well. Therefore, we choose the simple option of applying $\mathcal{L}_{MR}$ in the output space.
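The mutual-regularization update can be sketched as follows (illustrative; we assume JS divergence as the distance $D$ with symmetric $\tau^2$ scaling, and the helper names are ours). The key design choice from the discussion above is that the teacher receives no cross-entropy term:

```python
import numpy as np

def _ce(onehot, probs, eps=1e-12):
    return -float(np.sum(onehot * np.log(probs + eps)))

def _js(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))

def mr_losses(student_probs, teacher_probs, onehot, tau, lam1=1.0, lam2=1.0):
    """Student: CE to one-hot labels + distance to the teacher (Eqn. 7, k=1).
    Teacher: distance to the student's soft prediction only (no CE term),
    so it acts as a regularizer rather than re-fitting the hard labels."""
    d = _js(teacher_probs, student_probs)
    student_loss = lam1 * _ce(onehot, student_probs) + lam2 * tau**2 * d
    teacher_loss = tau**2 * d
    return student_loss, teacher_loss
```

The one-hot term on the student side is what anchors both networks to correct update directions and prevents degenerate agreement.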

B. The Mismatched Teachers
One core issue in DG-FSC is that we cannot access statistics of the target domain during training. Therefore, a reasonable way to improve the performance of an FSC model on unseen domains is to improve its robustness so that it produces more stable predictions across various seen domains.
To formulate a training scheme that resembles the test phase on unseen domains, inspired by the recent work [31], we train the student in a way that exposes it to domain shift, making it robust to a mismatch with the source domain on which the current teacher is trained. Concretely, for each source domain $\mathcal{D}_i$, we train a teacher network using the training data of $\mathcal{D}_i$ via Eqn. 6, and we denote the teacher obtained on $\mathcal{D}_i$ as $f^{\mathcal{D}_i}_{\theta_0^*}(\cdot)$. In each iteration, for a task $\mathcal{T}$ sampled from $\mathcal{D}_i$, the student is updated in the same way as Eqn. 7, but the teacher is obtained from a different domain $\mathcal{D}_j$ and has never seen $\mathcal{D}_i$ before. We update the student using the ground truth and the mismatched soft outputs of $f^{\mathcal{D}_j}_{\theta_0^*}(\cdot)$:

$\mathcal{L}_{MM} = \lambda_1 \mathcal{L}_{CE}(\mathcal{Y}^q, \hat{\mathcal{Y}}_{\theta_1}^q) + \lambda_2 \tau^2 D(\mathrm{SS}(f^{\mathcal{D}_j}_{\theta_0^*}(\mathcal{Q} \mid \mathcal{S}), \tau), \hat{\mathcal{Y}}_{\theta_1}^q)$.   (9)

How can a mismatched teacher (MM) help DG-FSC? Our insight is that if the student can be adapted to predict accurately on tasks from domain $\mathcal{D}_i$ while guided by a mismatched teacher obtained on domain $\mathcal{D}_j$ ($i \neq j$), then its robustness to domain shift in the testing phase has increased. As the minima quality analysis [31] in Sec. VI shows, $\mathcal{L}_{MM}$ improves domain robustness compared to the baseline model. We further note that $\mathcal{L}_{MM}$ improves generalization by a large margin on fine-grained unseen domains.
Design Choices. In each task, a teacher is randomly selected from a domain mismatched with that of the current task. Compared to the conventional choice of a teacher from the same source domain, the student f_{θ_1}(·) is penalized for wrong predictions by a mismatched teacher that performs poorly on the current source domain. To minimize the total loss, the student must learn to solve the task from the correct labels while being regularized by a teacher that is under domain shift. In the ablation study, we show the separate benefit of L_MM: on its own, it consistently outperforms the baseline BAN.
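The student update described above can be sketched as follows. This is a hedged NumPy sketch: the names (`student_loss_mm`, `lam`) are ours, and the soft term follows standard knowledge distillation form rather than the paper's exact equations.

```python
import numpy as np

def softmax(logits, tau=1.0):
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Standard CE against hard ground-truth labels (tau = 1)."""
    p = softmax(logits)
    n = len(labels)
    return float(-np.log(p[np.arange(n), labels] + 1e-12).mean())

def student_loss_mm(student_logits, mismatched_teacher_logits, labels,
                    lam=0.5, tau=4.0):
    """Ground-truth CE keeps the update direction correct, while the soft
    term regularizes the student toward a teacher trained on a *different*
    source domain, exposing the student to domain shift."""
    p_t = softmax(mismatched_teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    soft = float(-np.sum(p_t * np.log(p_s + 1e-12), axis=-1).mean())
    return cross_entropy(student_logits, labels) + lam * soft
```

Setting `lam=0` recovers plain cross-entropy training; the regularization strength corresponds to the coefficient studied in the ablation on loss weights.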

C. Meta-Control the Temperature
A fixed temperature (e.g., τ = 4) is often applied in BAN and KD training to soften the predicted probability distribution, so that the student model can learn the inter-class relationships predicted by the well-trained teacher network. In the DG-FSC setup, however, the same fixed temperature is applied to source domains that may differ substantially, which leads to sub-optimal performance; for example, in some tasks a higher temperature may blur the differences between classes. We therefore propose meta-learned temperature tuning across source domains. Our idea is that with a τ adaptively tuned to the various seen domains, the student can learn appropriate inter-class knowledge and improve its performance on unseen domains.
Instead of directly updating τ, we propose a meta-learning scheme [10], [18], [38], [1] to efficiently tune the temperature (MCT). In iteration t, we sample two subtasks from two different source domains: T_1 ∈ D_i and T_2 ∈ D_j. First, we update the student network f_{(θ1,t)}(·) on T_1, given the current temperature τ_t. Then, for task T_2, we freeze the weights of the updated student f_{(θ1,t+1)}(·) and evaluate the effectiveness of the temperature τ_t used to train the student on T_1, by testing the performance of f_{(θ1,t+1)}(·). In this step, we use only the cross-entropy loss (τ = 1), matching the testing phase. The temperature τ_{t+1} is obtained by evaluating f_{(θ1,t+1)}(·) on the query set Q_2 = {X^{(q,2)}, Y^{(q,2)}} of T_2:

τ_{t+1} = τ_t − η ∇_{τ_t} L_CE(f_{(θ1,t+1)}(X^{(q,2)}), Y^{(q,2)}),

where η is the meta learning rate; the dependence on τ_t arises through the student update on T_1.
τ_{t+1} is then used in iteration t+1. With the adaptively fine-tuned temperature, we obtain a meta-learned hyperparameter that is trained to adapt to diverse domains.

Design Choices. There are several potential ways to tune the temperature: (i) the simplest is to update the temperature directly as an ordinary learnable parameter in episodic training; (ii) we update the student on T_1 and evaluate the effectiveness of the temperature on T_2, with both tasks drawn from the same domain, i.e., T_1, T_2 ∈ D_1; (iii) (the proposed L_MCT in FS-BAN) we update the student on T_1 and evaluate the effectiveness of the temperature on T_2, with the two tasks drawn from different source domains, i.e., T_1 ∈ D_i, T_2 ∈ D_j, as shown in Figure 5. As Sec. VI shows, the temperature under setup (iii) converges gradually during training and yields better evaluation performance, indicating that it finds a value suitable for diverse domains and training tasks. We therefore choose this setup for FS-BAN.
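One meta-step of this temperature tuning can be sketched as follows. This is a simplified sketch under our own assumptions: the hypothetical callables `inner_update` and `query_loss` stand in for the student update on T_1 and the cross-entropy evaluation on T_2's query set, and we approximate the gradient with central finite differences, whereas the actual method would backpropagate through the inner update.

```python
def meta_update_temperature(tau_t, inner_update, query_loss, lr=0.1, eps=1e-3):
    """One MCT-style meta-step (sketch).

    inner_update(tau): trains the student on task T1 with temperature tau
        and returns the resulting student weights.
    query_loss(weights): plain cross-entropy (tau = 1) of those weights on
        the query set of T2, sampled from a different source domain.

    The gradient of the query loss w.r.t. tau is approximated here by
    central finite differences."""
    g = (query_loss(inner_update(tau_t + eps))
         - query_loss(inner_update(tau_t - eps))) / (2.0 * eps)
    return tau_t - lr * g
```

For instance, if the evaluation loss as a function of τ were (τ − 3)², a meta-step from τ = 4 would move the temperature toward 3, illustrating how τ drifts toward values that help the student generalize across domains.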

D. Multi-Task Learning Objectives
The final learning objective of FS-BAN is:

L_FS-BAN = λ_1 L_CE + λ_2 L_MR + λ_3 L_MM,

where the distillation temperature τ used in the soft-prediction terms is meta-learned via L_MCT (Sec. V-C). Next, we conduct comprehensive experiments to evaluate the effectiveness of FS-BAN on public datasets with popular FSC models as baselines. Detailed ablation studies and analyses are performed both qualitatively and quantitatively.
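The weighted multi-task objective described above can be sketched as a simple combination (the function name is ours; the default coefficients shown are the ones selected by the grid search reported in the experiments, λ_1 = 1, λ_2 = 0.8, λ_3 = 0.5):

```python
def fs_ban_objective(l_ce, l_mr, l_mm, lam1=1.0, lam2=0.8, lam3=0.5):
    """Weighted multi-task loss of FS-BAN: cross-entropy plus the mutual
    regularization and mismatched-teacher terms. The distillation
    temperature inside l_mr / l_mm is meta-learned separately (L_MCT),
    so it does not appear as a weighted term here."""
    return lam1 * l_ce + lam2 * l_mr + lam3 * l_mm
```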

VI. EXPERIMENTS
In this section, we discuss the experiment settings and evaluate the proposed FS-BAN on six publicly available datasets with three popular metric baseline FSC models. We also conduct detailed ablation studies.

A. Datasets
We evaluate the proposed FS-BAN on six publicly available datasets: miniImageNet [45], tieredImageNet [47], Caltech-UCSD Birds 200 (CUB) [61], Stanford Cars (Cars) [27], Places [73], and Plantae [58]. We follow the dataset split protocol of previous work [57] for a fair comparison and summarize it in Table VII. In the meta-training phase, we use standard data augmentation techniques, including image jittering, random crop, random horizontal flip, and normalization, for better generalization. In the meta-validation and meta-test stages, we do not use data augmentation.

B. Baseline Models
Since FS-BAN does not require additional learnable parameters, it can readily be applied to existing FSC methods. We apply FS-BAN to three popular metric-based FSC models to validate its effectiveness: MatchingNet [60], RelationNet [53], and Graph Neural Network (GNN) [13]. All these baseline models share the same feature extractor as the backbone network and differ only in the metric-based classifier head used for prediction. Among other DG-FSC methods, we compare with [57], which applies feature-wise transformation layers (LFT) to improve generalization. We also compare with layer-wise relevance propagation (LRP) [51] and more state-of-the-art FSC models in both single-domain and DG setups.

C. Experiment Setups
For a fair comparison, in all experiment setups we follow [57], [51] and use ResNet-10 [20] as the backbone network for both the baseline models and our method. We initialize the temperature at τ = 4 and parameterize it through a SoftPlus activation to ensure it stays non-negative: τ = log(1 + e^{τ̃}), where the underlying parameter τ̃ is updated in each iteration, as described in Sec. V-C.
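The SoftPlus parameterization of the temperature can be sketched as follows (variable names are ours; the stable form avoids overflow for large inputs):

```python
import numpy as np

def softplus(x):
    """Numerically stable SoftPlus: log(1 + e^x) = max(x, 0) + log1p(e^{-|x|})."""
    x = np.asarray(x, dtype=float)
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

# Initialize the raw parameter so the effective temperature starts at 4;
# SoftPlus keeps the temperature positive no matter how the raw value moves.
raw_tau = np.log(np.expm1(4.0))  # inverse SoftPlus of 4
tau = float(softplus(raw_tau))
```

During training, the optimizer updates `raw_tau` freely over the reals while `tau` is guaranteed to remain positive.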

D. Implementation Details
We strictly follow the standard FSC setups [48], [64], [53], [49]: either 5-Way 1-Shot or 5-Way 5-Shot tasks are sampled in the training and testing stages. In each task, we sample N_q = 16 query images per category to compute the loss and accuracy. We train FS-BAN for 800 epochs (100 tasks sampled from a random source domain in each epoch). We use the Adam optimizer [26] with default hyperparameters, e.g., a learning rate of 0.001. In the testing phase, we sample 1,000 tasks of novel classes, from the unseen target domain in setups 1) and 2) and from the same source domain in setup 3), for evaluation. We select the model checkpoint with the best validation accuracy and report the average accuracy on the test set with a 95% confidence interval.
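The episodic sampling protocol above can be sketched as follows (a NumPy sketch; the function name and return convention are ours):

```python
import numpy as np

def sample_episode(labels, n_way=5, n_shot=5, n_query=16, seed=None):
    """Sample one N-Way K-Shot task: choose n_way classes, then n_shot
    support indices and n_query query indices per class, mirroring the
    5-Way 1/5-Shot protocol with 16 queries per category."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])  # shuffle within class
        support.extend(idx[:n_shot])
        query.extend(idx[n_shot:n_shot + n_query])
    return np.array(support), np.array(query), classes
```

A 5-Way 5-Shot episode thus yields 25 support and 80 query images; classes are re-labeled 0..4 within the episode by downstream code.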
In addition, we follow the prior works [57], [48] and pre-train the backbone model (a ResNet-10 [20] feature encoder with a linear layer as the classifier) on the 64 base classes of mini-ImageNet by minimizing the standard cross-entropy loss, as in Eqn. 1. We then remove the classifier head and use the pre-trained backbone weights to initialize the student network for episodic DG-FSC training. Therefore, at the beginning of the meta-training stage, the student is equipped with a feature encoder that already extracts discriminative features. We use this technique in all our experiments, as it has been shown to be very useful for FSC in prior works [57], [48], [15], [37].

E. Experiment Results
The results of experiment setups 1), 2), and 3) are shown in Table III, Table IV, and Table V, respectively. In all setups, our proposed FS-BAN consistently improves the different baseline FSC models to state-of-the-art, presenting desirable performance on unseen domains. Since there are no true labels in episodic training for DG-FSC (i.e., samples are pseudo-labeled within each task), the results imply that our models indeed learn generalizable knowledge that helps tackle different tasks on novel classes of unseen domains, as analyzed in Table XII, where we show improved accuracy and a lower top-score difference in the prediction distributions. Compared to the prior state-of-the-art method that introduces additional learnable parameters [57], our proposed FS-BAN addresses the unique issues of DG-FSC, including overfitting and domain shift, benefits network generalizability on unseen target domains, and improves performance without additional inference cost at deployment. We further note that in setup 3), where base classes for training and novel classes for evaluation come from the same domain (hence the domain gap is reduced), our method still achieves consistent and considerable improvement, even with fewer backbone parameters, as shown in Table V. As the following ablation studies show, our proposed learning objectives in FS-BAN successfully address the unique challenges posed in DG-FSC, and generalizability is greatly improved on unseen domains.

TABLE VIII: Meta-test accuracy (%) with different implementation techniques for mutual regularization. The model is trained on several seen source domains and evaluated on the leave-one-out unseen domain with 5-Way 5-Shot tasks. The feature encoders of all models are pre-trained on mini-ImageNet.

F. Ablation Study
Ablation study of learning objectives of FS-BAN. To evaluate the effectiveness of each individual component in the multi-task learning objectives of the proposed FS-BAN, we conduct comprehensive ablation studies and observe the empirical performance of FS-BAN in the DG-FSC setup. We use MatchingNet as the baseline model. We sample 5-Way 5-Shot tasks for training and evaluation, and other settings are the same as setup 1). The results are shown in Table VI.
We show that each individual learning objective (L_MR, L_MM, L_MCT) in FS-BAN improves the baseline models effectively, and that they are complementary to each other. Moreover, FS-BAN with the full multi-task learning objectives achieves a good balance of performance across different unseen domains (the last row in Table VI).
In practice, to ensure that the feedback from the student's predictions for L_MR is reasonable and does not mislead the fine-tuning of the teacher network, especially at the beginning of student training, we introduce a "student warmup" process: the student trains for 10 epochs on randomly sampled tasks before providing feedback to the well-trained teacher network. We note that the backbone of the student model is pre-trained on the mini-ImageNet training classes. We also reduce the learning rate (lr) of the teacher network by a factor of 5 relative to that of the student network, so that the teacher model is only moderately updated. In Table VIII, we study the impact of "student warmup" and "reduced lr for teacher" on L_MR and show that these techniques improve the performance of L_MR by a considerable margin.
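These two stabilization heuristics can be sketched as a small configuration (the constant and function names are ours; a real training loop would consult `teacher_receives_feedback` before applying the L_MR update to the teacher):

```python
STUDENT_LR = 1e-3                # Adam default used for the student
TEACHER_LR = STUDENT_LR / 5.0    # teacher is updated 5x more conservatively
WARMUP_EPOCHS = 10               # student trains alone before giving feedback

def teacher_receives_feedback(epoch):
    """L_MR feedback to the teacher is enabled only after the warmup, so
    early, unreliable student predictions cannot mislead the teacher."""
    return epoch >= WARMUP_EPOCHS
```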
Interestingly, when we use FS-BAN with only the mismatched teacher (L_MM), the performance is better than all baselines and state-of-the-art models when Cars is the unseen domain. This suggests improved generalization on fine-grained datasets.
Ablation study of coefficients in learning objectives. We perform a grid search to select the loss coefficients λ_2 of L_MR and λ_3 of L_MM. We fix λ_1 = 1 for the cross-entropy loss and tune λ_2 and λ_3. Table IX shows the accuracy for different choices of coefficients; the student is trained under setup 1) with 5-Way 5-Shot tasks and MatchingNet as the baseline model. We observe that the performance of our proposed model is not very sensitive to the coefficients, and our method significantly outperforms the baseline (where λ_2 = λ_3 = 0). Nevertheless, the mismatched teacher may harm the accuracy of the student if the domain-shift training in L_MM is over-emphasized. Here, we find that λ_3 = 0.5 is the best weight for L_MM, which indicates that it is not over-emphasized. Based on the empirical results in Table IX, we choose λ_1 = 1, λ_2 = 0.8, λ_3 = 0.5 as the coefficients, which perform well in most experiments.
Mutual Regularization leads to better separation boundaries. In this analysis, we validate the effectiveness of L_MR. For simplicity, models are trained on base classes of miniImageNet with 5-Way 1-Shot tasks.
In Figure 7, we visualize the performance of the teacher and the student in the meta-validation phase on tasks sampled from novel classes of miniImageNet. Compared to the baseline model and the original BAN (i.e., a teacher without L_MR), both the teacher and the student gain better generalization performance on novel classes. Meanwhile, the student consistently outperforms the improved teacher network, which suggests that L_MR preserves the advantage of the baseline BAN and yields non-degenerate solutions.
Where does this improved performance come from? Qualitatively, in Figure 6, we follow [17] and sample tasks from novel classes (miniImageNet) and an unseen domain (Places), then project the features of query samples (extracted by the backbone network) onto the first two LDA [40] components, i.e., the directions that minimize the intra-class to inter-class variance ratio. In the plots, we observe that L_MR yields better class separability, which leads to better generalization on novel classes of unseen domains.
Quantitatively, we further follow [17] to analyze the quality of the learned features for few-shot tasks via feature clustering (R_FC) and hyperplane variation (R_HV). For R_FC, we explicitly compute the intra-class to inter-class variance ratio. Denoting the j-th sample of class i by x_{i,j}, the feature extractor by E, the centroid feature of class i by μ_i, and the centroid feature of all classes by μ, we have:

R_FC = [ (1/(N_w N_q)) Σ_i Σ_j ||E(x_{i,j}) − μ_i||²_2 ] / [ (1/N_w) Σ_i ||μ_i − μ||²_2 ],

where N_w and N_q are the number of classes and the number of query samples per class, respectively. When R_FC = 0, samples of the same category are mapped to a single point, and there is no uncertainty in the hyperplane separating arbitrary samples from two classes. Similarly, hyperplane variation (R_HV) measures the sensitivity of separating hyperplanes to data sampling. For both R_FC and R_HV, a lower value corresponds to better class separation. We compute R_FC and R_HV by sampling 200 query images per category, averaging over 1,000 novel 5-Way 1-Shot tasks of the unseen domain. The numerical results are shown in Table X: regularized by L_MR, the well-trained teacher network is continually improved and gains better performance on unseen novel classes, which in turn leads to a student with improved generalizability that consistently outperforms even the teacher. Furthermore, as the TSD analysis in Sec. VI-G shows, the improvement comes from the teacher network's awareness of cross-category information; thus a simple L_MR brings better class separation and feature clustering performance on unseen domains.

Temperature convergence and analysis with L_MCT. We visualize the meta-learned temperature (initialized at τ = 4) trained with experiment setup 1), as shown in Figure 8. In both 1-Shot and 5-Shot training, the meta-controlled temperature gradually converges, finding its own equilibrium.
Therefore, the hyperparameter adaptively tuned across diverse domains is the reason for the improvements, as shown by the numerical results in Table VI. Meanwhile, Table XI further shows that directly updating the temperature as an ordinary learnable parameter brings no improvement and introduces overfitting.

Mismatched Teachers improve solution robustness. How does a randomly selected, mismatched teacher in FS-BAN improve robustness for DG-FSC (see Table III)? One explanation is that converging to a 'wide' minimum leads to a more robust solution. Recently, some DG literature on conventional supervised learning [5], [31], [25] has analyzed model robustness by evaluating the quality of the solution minima.
Following [31], [25], we compare model robustness by adding Gaussian noise to the model parameters and observing the accuracy change in the testing phase, as shown in Figure 9. In most cases, FS-BAN (with only L_MM) exhibits higher robustness under perturbation, which suggests better minima quality and better generalization on held-out unseen domains. Another interesting observation is that in some cases we obtain a small improvement by adding noise to the model weights, a by-product related to a recent work [57].
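The minima-quality probe above can be sketched as follows (a NumPy sketch with our own names; `evaluate` is a placeholder for the meta-test accuracy of a model with the given weight tensors):

```python
import numpy as np

def perturbed_score(params, evaluate, sigma=0.05, n_trials=5, seed=0):
    """Add i.i.d. Gaussian noise to every weight tensor and re-evaluate,
    averaging over several noise draws. A model at a wider (flatter)
    minimum should lose less score under the same perturbation."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        noisy = [w + sigma * rng.standard_normal(w.shape) for w in params]
        scores.append(evaluate(noisy))
    return float(np.mean(scores))
```

Comparing `perturbed_score` across models at increasing `sigma` traces out curves like those in Figure 9: the slower the score decays, the wider the minimum.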

G. Top-score Difference Analysis
Prior work [63] shows that a better student model can be obtained with a more tolerant teacher, i.e., one that is less focused on the primary class when making predictions. That is, the teacher passes reasonable inter-class knowledge to the student (i.e., probability predictions over all categories). Following these findings, in episodic training we measure the Top-score difference (TSD) of the probability predictions produced by the teacher network:

TSD = f_{θ0,a1}(x) − (1/(M−1)) Σ_{m=2}^{M} f_{θ0,am}(x),

where f_{θ0,am}(·) denotes the m-th largest value in the probability distribution f_{θ0}(·). We set a fixed M = 3, representing the number of potentially semantically similar classes for each image in the episode, including the primary class (the class assigned the highest probability). TSD thus measures the gap between the prediction probability of the primary class and the average of the other M − 1 highest-scoring classes.

TSD for L_MR. In FS-BAN, L_MR requires the teacher network f_{θ0}(·) to match the soft distribution produced by the student f_{θ1}(·); the teacher can therefore learn cross-category similarity information from the student. Here, we quantify these benefits via statistical measurements during training. As shown in Table XII, L_MR indirectly reduces the TSD of the teacher network during training, which suggests that the produced soft predictions are less peaked and the similarity knowledge is well preserved. In the testing phase, L_MR also gives FS-BAN higher accuracy. Therefore, the teacher network overfits less and preserves the meaningful soft knowledge transferred from the student.
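The TSD statistic can be sketched as follows (a NumPy sketch; the function name is ours):

```python
import numpy as np

def top_score_difference(probs, M=3):
    """TSD: gap between the top-1 probability and the mean of the next
    M-1 highest probabilities, averaged over samples.

    probs: array of shape (N, C), each row a predicted distribution."""
    p = np.sort(np.asarray(probs, dtype=float), axis=1)[:, ::-1]  # descending
    return float((p[:, 0] - p[:, 1:M].mean(axis=1)).mean())
```

A uniform prediction yields TSD = 0 (maximally tolerant), while a one-hot prediction yields TSD = 1 (maximally peaked).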
TSD for L_MM. What does the student learn from the mismatched teacher? To understand the working mechanism of FS-BAN's mismatched teachers, a natural approach is to observe the behavior of the mismatched teacher itself. We vary the source domain and observe the performance of the teacher trained on miniImageNet (hence, whenever miniImageNet is not the source domain, this teacher acts as a mismatched teacher). We sample 5-Way 5-Shot tasks from the novel classes of each domain and measure the TSD and the accuracy of the teacher. As Table XIII shows, when we evaluate the teacher network on a mismatched source domain, its accuracy is well above random prediction, so the mismatched teacher remains meaningful. On the other hand, it has a noticeably lower TSD than on miniImageNet in meta-testing. In this DG-FSC scenario, the focus of the mismatched teacher shifts toward predicting inter-class similarity, and the student is trained to adapt to unseen domains by adapting to the "unseen" (mismatched) teacher. At the same time, the student model is optimized by the cross-entropy loss against the ground truth, which guarantees a correct update direction.
In the literature, several regularizers such as Label Smoothing [54] and the Confidence Penalty [43] have been proposed to improve generalizability to unseen samples in classification tasks by penalizing overconfident classifier predictions, thereby mitigating overfitting to the training data. However, these regularizers share a common drawback: they encourage the probabilities to be uniformly distributed over all training classes, regardless of whether those classes are actually similar to each other. In contrast, in our proposed method the networks in L_MR and L_MM are regularized to match soft, informative predictions designed specifically to counter the overfitting and domain shift in DG-FSC, and they achieve considerable improvement.

H. Training Student with a Stronger Teacher
In Sec. V, to strike a good balance between performance and training cost, the proposed FS-BAN does not involve sequential training over generations; we train only one generation of the student. Consequently, the architecture and size of the student are not restricted to match those of the teacher.
One possible way to further improve the performance of the student network is therefore to introduce a stronger teacher network, i.e., one with more parameters and higher capacity.
In Table XIV, we conduct a study to empirically validate this assumption: we set the scale of the teacher backbone equal to (the born-again networks setup) or larger than (the common knowledge distillation setup) that of the student. We consider different backbone networks that are popular in FSC [55], [57], [10], [49], [53] as the feature encoder: Conv-4/6 (4/6-layer convolutional networks) and ResNet-10/18 [20]. We use the same setting as experiment setup 1): the student is trained on multiple seen source domains and tested on the leave-one-out target domain with 5-Way 5-Shot tasks, with MatchingNet [60] as the baseline model.
As can be observed in Table XIV, when the backbone networks of the teacher and the student are the same, our proposed FS-BAN improves the performance of the student by a considerable margin. On the other hand, if we choose a teacher network with a larger backbone, the performance of the student network can be further improved.

I. Comparison to BAN with Transfer Learning
In this section, we compare our proposed method with the simple baseline [55] that leverages BAN with a transfer-learning approach for FSC, using experiment setup 1), where we have multiple source domains. For each seen source domain, we follow [55] and initialize a linear layer as the classifier head; all heads share the feature encoder (ResNet-10 [20]). In each training epoch, we randomly select a source domain and the corresponding classifier head, and the model is optimized by minimizing the standard cross-entropy loss, as in Eqn. 1. For DG-FSC evaluation, we follow [55] to transfer the obtained feature encoder to novel tasks and fit a new linear classifier to predict the query samples. For a fair comparison, we apply the same backbone and data augmentation techniques as in our method. The results are in Table XV: our proposed method achieves competitive performance across the different DG-FSC setups. Moreover, we note that, following [55], their BAN training runs for two generations, so its training cost is higher than that of our method.
VII. DISCUSSION

Conclusion. In this work, we first propose Born-Again Network (BAN) episodic training for domain generalization few-shot classification (DG-FSC) and reveal that BAN leads to more discriminative features and better decision boundaries on novel tasks from unseen domains. This suggests that, similar to observations in conventional supervised learning, BAN is also promising for DG-FSC. To the best of our knowledge, this is the first study of BAN for episodic training. Motivated by this, we propose Few-Shot BAN (FS-BAN) as our main contribution. FS-BAN consists of multi-task learning objectives: Mutual Regularization, Mismatched Teacher, and Meta-Control of the Temperature, which aim to address the unique challenges posed specifically in DG-FSC: overfitting and domain shift. The effectiveness of FS-BAN is demonstrated by competitive accuracy on six benchmark datasets with three baseline FSC models, together with qualitative and quantitative ablation studies.
Limitation. For a fair comparison, we exactly follow previous work (e.g., [57]) in the choice of domains and datasets. However, given the extremely wide range of domains to which DG-FSC could be applied, it is not feasible to validate our findings on all possible domains. On the other hand, our comprehensive qualitative and quantitative results, supported by our analysis, provide evidence that our method should generalize to other domains. Meanwhile, FS-BAN does not affect the inference stage, since we do not modify the model structure; its effectiveness on other open-world domains can therefore easily be validated with existing FSC models.
Future Work. While the performance of state-of-the-art FSC algorithms has improved greatly, both within a single domain and on unseen domains containing diverse classes, accuracy on fine-grained domains remains poor; an example can be observed in the results for the Cars domain in Table III. Future work will consider different types of unseen domains, including this fine-grained setup, which remains challenging for all current FSC models.