Calibration of Few-Shot Classification Tasks: Mitigating Misconfidence from Distribution Mismatch

As many meta-learning algorithms improve performance in solving few-shot classification problems for practical applications, the accurate prediction of uncertainty has become essential. In meta-training, the algorithm treats all generated tasks equally and updates the model to perform well on the training tasks. During training, some tasks may make it difficult for the model to infer the query examples from the support examples, especially when a large mismatch exists between the support set and the query set. This distribution mismatch leads the model to incorrect confidence, i.e., a calibration problem. In this study, we propose a novel meta-training method that measures the distribution mismatch and enables the model to predict with more precise confidence. Moreover, our method is algorithm-agnostic and can be readily extended to a range of meta-learning models. Through extensive experimentation, including dataset shift, we show that our training strategy prevents the model from becoming indiscriminately confident and thereby helps the model produce calibrated classification results without loss of accuracy.


I. INTRODUCTION
With the great success of deep learning over the last decade, there has been growing interest in investigating more intuitive methods for mimicking human intelligence. One desirable characteristic of human cognition is the ability to learn new information quickly. Few-shot learning requires machines to learn from only a few examples, and meta-learning algorithms have been studied as its principal solvers. Meta-learning algorithms learn how to adapt rapidly to new data with the help of prior across-task knowledge obtained through training on many other few-shot tasks. Noteworthy approaches include learning a metric space [1]-[3], learning update rules [4], [5], or learning an initialization [6], [7]. For meta-training to solve the few-shot classification (FSC) problem, it is generally assumed that each task randomly chooses classes and randomly samples the support and query examples of the selected classes, and that all generated tasks are treated equally. The meta-training process tries to extract prior information so that the meta-learned model can perform well on every training task.
However, for some tasks, the model may have difficulty inferring the queries from the support examples, and this difficulty may cause a calibration problem. Fig. 1 depicts such a challenging example of a sampled batch of tasks from a real image set. In Task 1, each query image is easily matched to its corresponding support image (well-defined task), whereas Task 4 includes somewhat unclear matching between the support and the query images (ill-defined task). Due to the distribution mismatch, the model may learn wrong confidences for the predictions of the queries because common meta-learning approaches force the model to learn anyway.
Calibration in FSC is critical yet challenging. The over- or under-confidence of a network can result in dramatically different outcomes in decision-making processes such as autonomous driving [8] or medical diagnosis [9]. Nevertheless, common calibration methods in classification, such as Platt scaling [10], histogram binning [11], or isotonic regression [12], are not applicable in FSC because they assume that the target examples come from known (trained) classes. In FSC tasks, where few training data are given, a model is vulnerable to misconfidence, but recent studies on FSC focus on boosting accuracy and overlook the calibration of misconfident predictions. We thus design an algorithm-agnostic calibration method for an FSC model that reduces the learning signal from ill-defined tasks. As Fig. 1 shows, our method assigns each task a weight according to the support-query discrepancy: the computed weight for the well-defined Task 1 is high, whereas in Task 4 even the adapted model may not confidently distinguish the queries, which results in a lower weight. This novel training allows the model to predict with precise confidence and ultimately obtain better calibration ability.

(Fig. 1 caption: D_spt, D_qry, and w_τ denote the support set, query set, and task calibration weight of the τ-th task, respectively; c_i is the i-th class in each task. Typical FSC algorithms force the model to classify the query set given the support set, without considering the distribution match.)
To summarize, our contributions are as follows. 1) We propose a simple and straightforward method by which a few-shot classification model learns ill-defined tasks in a conservative way, so as not to assign wrong confidence; the experiments demonstrate the improved calibration ability of our method. 2) We apply the novel meta-update to other few-shot learning algorithms, including a metric-based framework, confirming that our method is versatile and can easily be extended across a range of few-shot learning models. Moreover, we validate our work through a dataset shift experiment, which is designed to be challenging due to the degradation of estimation ability [13].
Our work is the first to quantify task difficulty and apply task calibration in the FSC problem. While "difficulty" is an ambiguous term and several different factors may create a difficult task, such as observation noise, data ambiguity, and data imbalance, in this study "difficulty" means the support-query discrepancy within a task. The work of Xu et al. [14] is somewhat related to ours in that they used rank-based meta-update weights in the reinforcement learning context by prioritizing each sampled trajectory. However, they naïvely used the scale of the task-specific loss to estimate difficulty.

II. RELATED WORKS

A. FEW-SHOT CLASSIFICATION
Few-shot classification is an extremely challenging task because it must manage high degrees of uncertainty [15]. In an environment with sparse training examples for each task, models suffer from severe instability due to high uncertainty [16]. The estimation and judgment of uncertainty become even more challenging when high-variance data, such as a large-scale image set, are used. To overcome this issue, several recent studies have designed Bayesian neural networks, which often use posterior approximation to quantify parameter uncertainty [17]-[22].
Probabilistic MAML [23], a variant of MAML [7], approximates the distribution over model parameters by optimizing the variational lower bound. Bayesian MAML [17] uses Stein variational gradient descent (SVGD) for more flexible posterior modeling. ABML [18], ML-PIP [24], and ABPML [20] all amortize the approximate inference. Neural Processes [25] and their variants [26], [27] combine a stochastic process with a neural network to learn a distribution over functions while representing uncertainty, although Neural Processes have yet to be proven capable of solving large-scale classification problems. Ghadirzadeh et al. [21] applied the idea of Probabilistic MAML to few-shot policy adaptation.
Bayesian models must perform posterior inference, and most approaches use empirical approximation, which usually requires heavy computation. In contrast, our method is computationally light and does not rely on posterior approximation. The reader may also refer to Section IV-D.

B. DISTRIBUTION MISMATCH
Our work was initially inspired by Malinin & Gales [28], who build a Dirichlet Prior Network to explicitly model distributional uncertainty. Distributional uncertainty is ascribed to a mismatch between the training and test data [28]; in our setup, the mismatch corresponds to the discrepancy between the support and query sets and is often caused by intra-class variations [29]. We quantified this discrepancy without requiring any additional parameters or networks. In parallel, Sensoy et al. [30] parameterized a Dirichlet distribution with an evidence vector to regularize the predictive distribution by penalizing divergence from the uncertain state.
Bohdal et al. [31] proposed a novel meta-learning loss that directly minimizes a differentiable expected calibration error (ECE), but the loss function is not widely applicable to existing few-shot learning algorithms. The work of Xu et al. [14] is somewhat related to ours in that they gave rank-based weights to sampled trajectories; by evaluating the importance of each trajectory and balancing the loss, they prevented the model from becoming biased toward specific tasks. Our task calibration, in contrast, quantifies the distribution mismatch within each task and balances the loss to mitigate the misconfidence on ill-defined tasks.

1) Problem Setup
A meta-learning problem is composed of tasks from a task distribution p(τ), and each meta-training task τ ∼ p(τ) comes with a training set and a test set. We call the training set the support set, D_spt, which contains k labeled examples for each of N classes; this is called the N-way k-shot problem. Moreover, there is a disjoint query set, D_qry. Meta-training is an iterative process of optimizing on D_qry after having seen D_spt. This setup mimics learning a novel task with only a few training examples: the adaptation on D_spt corresponds to training with the novel task data, and given these few examples, a meta-learner is expected to make predictions on unseen data, which in our setup is D_qry. That is, the meta-learner evolves by acquiring general across-task information through optimization on D_qry.
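As a concrete illustration, an N-way k-shot episode can be sampled roughly as follows (a minimal sketch; the function name `sample_task` and the toy dataset are hypothetical, not from the paper's code):

```python
import random

def sample_task(dataset, n_way=5, k_shot=1, k_query=15, rng=None):
    """Sample one N-way k-shot episode from {class_label: [examples]}."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for new_label, c in enumerate(classes):
        # Draw k + k' distinct examples so support and query never overlap.
        examples = rng.sample(dataset[c], k_shot + k_query)
        # Relabel the chosen classes 0..N-1 within the episode.
        support += [(x, new_label) for x in examples[:k_shot]]
        query += [(x, new_label) for x in examples[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 examples each.
data = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
spt, qry = sample_task(data, n_way=5, k_shot=1, k_query=15)
```

With the paper's settings (N = 5, k = 1, k′ = 15), one episode yields 5 support and 75 query examples.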

2) Model-Agnostic Meta-Learning (MAML)
Before we introduce our new approach, we briefly review model-agnostic meta-learning (MAML) [7] to facilitate readers' understanding. MAML learns a proper initialization of a globally shared parameter θ that is quickly adjusted to each task τ with a few gradient descent steps. After adapting to the support examples, MAML updates θ by optimizing on the query set, minimizing the task-specific loss L_τ with a task-adapted parameter ϕ_τ. The bi-level optimization is computed as follows:

inner-level (task adaptation): ϕ_τ = θ − α ∇_θ L_τ(θ; D_spt^(τ)),  (1)

outer-level (meta-update): θ ← θ − β ∇_θ Σ_τ L_τ(ϕ_τ; D_qry^(τ)),  (2)

where α and β are the inner- and outer-loop learning rates. With a task difficulty measure, we can easily modify the bi-level optimization so that task-specific losses are weighted adaptively according to difficulty. Note that the meta-update in (2) aggregates gradients over multiple tasks. By measuring task difficulty, we can simply multiply a difficulty-dependent weight w_τ into each task's loss, i.e., L = Σ_{τ=1}^{T} w_τ L_τ(ϕ_τ), where T is the number of tasks in a meta-batch. If a task τ has a large distribution mismatch, a small value of w_τ decreases the contribution of ∇_θ L_τ.
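The weighted bi-level optimization can be sketched on a toy quadratic loss (a hypothetical, first-order illustration in the spirit of (1)-(2); `maml_step`, the scalar task targets, and the weight values are ours, not the paper's implementation):

```python
import numpy as np

def grad(phi, target):
    # Gradient of the quadratic task loss L(phi) = ||phi - target||^2.
    return 2.0 * (phi - target)

def maml_step(theta, tasks, weights, inner_lr=0.1, outer_lr=0.05, inner_steps=5):
    """One weighted first-order meta-update: inner adaptation on the support
    target, then a w_tau-weighted sum of query-loss gradients."""
    meta_grad = np.zeros_like(theta)
    for (spt_target, qry_target), w in zip(tasks, weights):
        phi = theta.copy()
        for _ in range(inner_steps):              # inner loop: adapt on support
            phi -= inner_lr * grad(phi, spt_target)
        meta_grad += w * grad(phi, qry_target)    # outer loop: query gradient
    return theta - outer_lr * meta_grad

theta = np.zeros(2)
tasks = [(np.array([1.0, 0.0]), np.array([1.0, 0.0])),   # well-defined: spt == qry
         (np.array([0.0, 1.0]), np.array([5.0, -5.0]))]  # mismatched support/query
theta = maml_step(theta, tasks, weights=[0.7, 0.3])
```

The down-weighted second task contributes less to the meta-gradient, which is exactly the lever the task calibration weights pull on.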

B. MEASUREMENT OF DISTRIBUTION MISMATCH 1) Inter-Class Similarity as Distribution Mismatch
Our first intuition for a good metric to estimate task difficulty is inter-class similarity: as the similarity between classes becomes greater, it becomes more difficult to distinguish one class from another.
We compute the inter-class query embeddings in the embedding space mapped by h_ϕτ, the feature extractor before the classifier. With the pairs of query inputs and labels (x_q^(τ), y_q^(τ)), we collect the average feature for every class c, where h_ϕτ maintains the feature dimensions. h_ϕτ contains information about both the support and query data: the average features of the query inputs are parameterized by the task parameter ϕ_τ, while ϕ_τ is derived from the adaptation on the support data as in (1). Therefore, the relation between the support and query data determines h_ϕτ. By normalizing and concatenating the N-way query embeddings, we obtain a matrix H_τ ∈ R^{N×m}, where m is the embedding dimension.
If we take L to be the cross-entropy loss, ϕ_τ from the task adaptation (1) approximates the maximum likelihood estimate (MLE) given the observed support examples.
Let ϕ_τ* be the parameter from a task τ* where there is no distribution mismatch; this is satisfied when D_spt^(τ*) has exactly the same distribution as D_qry^(τ*). In meta-learning, we aim to maximize p(D_qry^(τ*)|θ), which is achieved when θ = ϕ_τ*, from (4); that is, the parameter adapted to the support set sufficiently distinguishes the query set. To make a clear distinction between different classes, the embedded representation vectors should have low correlation. Therefore, in task τ*, the embedded vectors of the query inputs have low correlation, because the classification boundary established by ϕ_τ* optimally distinguishes D_qry^(τ*) (Fig. 2, left). In contrast, if there is a significant distribution mismatch between D_spt and D_qry, the adapted parameter cannot reliably separate the query classes, which implies a higher correlation between the embedded queries (Fig. 2, right).
Thus, we can compare the inter-class cosine similarity between tasks τ* and τ using the average features, where H_τ H_τ^⊤ is the inter-class cosine similarity matrix, because (H_τ H_τ^⊤)_{ij} is the cosine similarity between the class-average embeddings for all i, j ∈ {1, . . . , N}. From now on, we simply use its norm as a similarity score, s_τ := ∥H_τ H_τ^⊤∥. For a general task τ, a significant mismatch between the distributions of D_spt and D_qry yields a higher similarity score than that of the mismatch-free task, i.e., s_τ ≥ s_τ*. (5)

We use cosine similarity because it is used throughout the contrastive learning literature (e.g., SimCLR [32] and MoCo [33]) to measure the distance between feature representations. Contrastive learning studies use cosine similarity to embed positive samples close together and negative samples far apart, which has been shown to improve the alignment and uniformity of the representations on the hypersphere [34]. We therefore consider cosine similarity an appropriate criterion for measuring the distance between features; its magnitude is bounded to [−1, 1]. We also experimented with Euclidean distance rather than cosine similarity, but the results were not significant. We suppose this is because the magnitude of the Euclidean distance is not bounded, so the similarity scores were inaccurate: some tasks have very large distance scales, whereas others have very small ones.

Fig. 3 shows the similarity scores for tasks from different distributions of real-world images. We measured the inter-class similarity scores of 1000 training tasks, where the tasks are sampled from manipulated distributions. The dataset used here was mini-ImageNet [35], and we followed a 5-way 1-shot MAML experimental protocol. We next describe how we manipulated the task sampling distribution.
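A minimal sketch of the similarity score on synthetic features, assuming ∥·∥ denotes the Frobenius norm (the function name and toy data are ours, not the paper's):

```python
import numpy as np

def similarity_score(query_feats, query_labels, n_way):
    """s_tau = ||H H^T|| for row-normalized class-average query embeddings H;
    the norm is taken here as the Frobenius norm (an assumption)."""
    H = np.stack([query_feats[query_labels == c].mean(axis=0)
                  for c in range(n_way)])
    H /= np.linalg.norm(H, axis=1, keepdims=True)   # unit-norm rows
    return np.linalg.norm(H @ H.T)                  # norm of the cosine matrix

labels = np.repeat(np.arange(5), 15)        # 5-way, 15 queries per class
separated = np.eye(5).repeat(15, axis=0)    # class-distinct features (75 x 5)
collapsed = np.ones((75, 5))                # indistinguishable features
s_well = similarity_score(separated, labels, 5)   # H H^T ~ identity
s_ill = similarity_score(collapsed, labels, 5)    # H H^T all ones
```

Well-separated class embeddings give a near-identity cosine matrix (score √N), while fully correlated embeddings give an all-ones matrix (score N), matching the intuition that ill-defined tasks score higher.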

2) Similarity Scores of Few-Shot Tasks
• Standard is a general 1-shot experiment. For the support set, 1 image is sampled from each class; for the query set, 15 images are sampled from each corresponding class. • Well-defined tasks are sampled so that there is no distribution mismatch, as in τ*. Although it is difficult to perfectly eliminate the mismatch, it is possible to make the distributions of D_spt and D_qry similar. To this end, we sampled 100 support images per class and 15 corresponding query images. By turning the few-shot tasks into data-rich many-shot tasks, we could simulate the low-uncertainty regime. The parameter was adapted to the 100 support images (without changing the number of adaptation steps), allowing ϕ_τ to be near-optimal for maximizing the likelihood of the query set.

• Poorly-defined tasks are constructed to create an extremely large support-query discrepancy and a large distribution mismatch. Similar to the Standard setting, a Poorly-defined task samples 1 support image and 15 query images, but they are from different classes. For instance, the class labels of the support images could be {hair slide, carousel, wok, photocopier, jellyfish}, and the class labels of the query images could be {ant, vase, dalmatian, school bus, king crab}.
In Fig. 3, in the extremely discrepant case where the classes of the support images do not correspond to those of the query images, the inter-class similarity is high because the model cannot infer the query set by observing the support set. On the other hand, in well-defined tasks, the adapted model distinguishes the query representations well because they are less correlated. General FSC tasks lie between these well-defined and poorly-defined tasks. The similarity score thus indicates whether a given input task has a large distribution mismatch.

C. TASK CALIBRATION MAML
As we emphasized, our method is applicable to various few-shot learning algorithms; nevertheless, for better understanding, we begin by applying it to the MAML algorithm [7]. We propose Task Calibration model-agnostic meta-learning (TCMAML), a variant of MAML that estimates task difficulty during meta-training and uses it to modulate the optimization of the global parameters. Specifically, we measure the distribution mismatch for each task and constrain the feedback from highly mismatched tasks (see Fig. 2). In Section III-B, we elaborated on how the similarity between the average features relates to the distribution mismatch.
The aggregated meta-loss in (2) assumes equal reliability, 1/T, for uniformly sampled tasks. We instead use a different weight for each task, which allows training with prior knowledge about the task configuration. TCMAML estimates w_τ as a normalized −s_τ value,

w_τ = g(−s_τ),  (6)

where g(·) is a normalization function over the meta-batch, and consequently modifies the meta-update rule (2) as follows:

θ ← θ − β ∇_θ Σ_{τ=1}^{T} w_τ L_τ(ϕ_τ; D_qry^(τ)).  (7)

From (6) and (7), the inequality (5) leads to smaller weight values on the loss functions of the misaligned tasks. Designing w_τ this way assigns a large weight to the well-defined tasks, where the measured inter-class feature similarity is low, and a small weight to the ill-defined tasks, where the similarity is high. In this sense, TCMAML accounts for the difficulty of a task by measuring the similarity between the query examples.
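The score-to-weight step can be sketched with a softmax over negative scores, one natural choice of normalization (the helper name, temperature, and example scores are illustrative):

```python
import numpy as np

def calibration_weights(scores, temp=1.0):
    """w_tau = softmax(-s_tau / temp): low similarity (well-defined task)
    maps to a large weight; high similarity maps to a small weight."""
    z = -np.asarray(scores, dtype=float) / temp
    z -= z.max()                      # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

s = [2.3, 2.5, 2.2, 4.8]              # made-up scores; task 4 is ill-defined
w = calibration_weights(s, temp=0.5)
# Weighted meta-loss: L = sum_tau w_tau * L_tau(phi_tau)
```

The task with the highest similarity score receives the smallest weight, so its gradient contributes least to the meta-update.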
The full procedure of TCMAML is described in Algorithm 1. Starting from the randomly initialized model θ, we sample a few-shot task batch where each task poses an N -way classification (Line 6). Recall that we described an example of a task batch where T = 4 and N = 5 in Fig. 1. After we obtain ϕ τ using D (τ ) spt (Line 9), we aggregate five feature vectors using ϕ τ and D (τ ) qry (Lines 10-13), and then we compute the similarity score (Lines 14-16). After we repeat this procedure four times, we obtain the similarity scores of Task 1 to Task 4. Then, when we update the meta-parameter θ, we use calibration weights (Lines 17-18) to under-weight the ill-defined tasks (e.g., Task 4 in Fig. 1) and over-weight the well-defined tasks (e.g., Task 3 in Fig. 1).
Fortunately, from a complexity standpoint, our task calibration adds little computational burden. Note that Algorithm 1 is based on the MAML algorithm, and the only differences between TCMAML and MAML are Lines 10-15 and Lines 17-18 (all other operations are identical to MAML). Computing the average feature vectors for every class c is implemented in matrix form, so it does not add O(N) complexity. Computing the similarity score is also done by matrix multiplication. Moreover, in few-shot learning, N is typically small (e.g., 5), so the O(N²) space complexity is negligible.
We can use any normalization function when computing w_τ. A common choice is the softmax function, which uses exponential scaling (ES); linear scaling (LS) is also possible:

w_τ = exp(−s_τ / T) / Σ_{τ′} exp(−s_τ′ / T),  (8)

where T is a scaling factor. ES requires tuning the scaling factor, whereas LS requires no tuning. Unless T is very large, ES imposes much larger weight variations than LS, so LS stays closer to the original update, i.e., equal weighting. In our experiments, we searched over T and empirically found that ES performed better than LS; every result is based on ES by default. Refer to Section IV-C2 for TCMAML with linear scaling.

(Fig. 4: Overall pipeline of our algorithm. We generally used a Conv-4 backbone for the meta-parameter θ. The red color indicates the task calibration procedure, in which we compute the similarity score (s_τ) and calibration weight (w_τ) for each task; the weighted sum of the task-specific losses is minimized.)

D. HOW DOES TASK CALIBRATION WORK?
The example batch of tasks and their task calibration weights in Fig. 1 shows that for a difficult task, TCMAML imposes a small weight to mitigate overconfident learning. To validate whether the task calibration weighting strategy truly helps the calibration of a meta-learner, we investigated what happens during meta-training. We sampled a meta-batch of size 4 (T = 4) from a training set and selected the most difficult task (highest similarity score) and the easiest task (lowest similarity score). We compared the prediction outcomes on the query set before and after the meta-update; for MAML, the meta-update corresponds to (2), and for TCMAML, to (7). In Fig. 5, we collected the prediction probabilities of the queries, sorted them in descending order, and then averaged over the queries: for 5-way k-shot learning, the average prediction vector is 5-dimensional. Therefore, the leftmost value (p_1) on the x-axis of Fig. 5 is equivalent to the average confidence. Before the meta-update, the learner was adapted only to the support examples, so the confidence of the query predictions is low. After the meta-update, both MAML (After) and TCMAML (After TC) classify the queries more confidently. However, TCMAML behaves differently on the two tasks: it obtained lower confidence than MAML on the difficult task, whereas on the easy task it predicted with higher confidence. This implies that TCMAML learned the difficult task with a smaller step, which mitigated the overconfidence.
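The sorted-and-averaged prediction vector plotted in Fig. 5 can be computed as follows (a small sketch with made-up probabilities):

```python
import numpy as np

def average_sorted_predictions(probs):
    """Sort each query's class probabilities in descending order, then average
    over queries. The first entry equals the average confidence."""
    s = np.sort(probs, axis=1)[:, ::-1]
    return s.mean(axis=0)

probs = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                  [0.2, 0.5, 0.1, 0.1, 0.1]])
p = average_sorted_predictions(probs)   # p[0] = (0.7 + 0.5) / 2 = 0.6
```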

E. ALGORITHM-AGNOSTICISM
We emphasize that our task calibration approach can be easily extended to a metric-based approach. Unlike optimization-based models such as MAML, a metric-based model does not require a process of explicit adaptation to the task. Instead, the model calculates the distances between the query and support data in a non-parametric way. In this study, we use a prototypical network, i.e., ProtoNet [2], which is a high-performance metric-based few-shot learning method, and propose and apply our task-calibrated version, TCProtoNet.
Using the distance of each query to the support prototypes p = [p_1, . . . , p_N]^⊤, we collect the average distance vector for each class, d_{ϕτ,c}, by averaging the queries' prototype-distance vectors over the k′ queries of class c. Normalizing and concatenating these N vectors yields a matrix D_τ, and the update weights are computed based on the similarity score s_τ = ∥D_τ D_τ^⊤∥. We empirically demonstrate that TCProtoNet is better calibrated than the vanilla ProtoNet.
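A plausible sketch of this distance-based score, under our reading of the definition in the text: class-average prototype-distance vectors are row-normalized, stacked into D_τ, and scored by ∥D_τ D_τ^⊤∥ (the exact averaging formula and the Frobenius norm are our assumptions; names and toy data are ours):

```python
import numpy as np

def protonet_similarity_score(query_feats, query_labels, prototypes):
    """s_tau = ||D D^T|| for row-normalized class-average distance vectors D,
    mirroring the embedding-based score (formula shape is an assumption)."""
    n_way = prototypes.shape[0]
    # Euclidean distance of every query to every prototype (N columns).
    dists = np.linalg.norm(query_feats[:, None, :] - prototypes[None, :, :],
                           axis=2)
    D = np.stack([dists[query_labels == c].mean(axis=0)
                  for c in range(n_way)])
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    return np.linalg.norm(D @ D.T)

protos = 10 * np.eye(3)
labels3 = np.repeat(np.arange(3), 5)
sep_q = protos.repeat(5, axis=0)     # queries sit exactly on their prototypes
col_q = np.zeros((15, 3))            # all queries collapsed at the origin
s_sep = protonet_similarity_score(sep_q, labels3, protos)
s_col = protonet_similarity_score(col_q, labels3, protos)
```

As with the embedding-based score, indistinguishable queries produce identical distance rows and a higher score than well-separated queries.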
In a similar manner, we extended the task calibration approach to several other algorithms, including Bayesian methods [17], [18] and convex optimization-based learning [36]. The details of the implementation for each algorithm are described in Appendix A. The empirical result implies that the task calibration is versatile and applicable to a range of few-shot learning algorithms.
Most of our implementation details follow Chen et al. [29], thanks to their work on fair comparison in the FSC setup, for both 1-shot and 5-shot tasks. All reported accuracies are test accuracies evaluated on 1000 tasks randomly drawn from the test set. By default, we used a Conv-4 backbone, a CNN with four convolutional blocks [35], so as not to overestimate performance by using deeper networks. Furthermore, all MAML-based algorithms were trained with a first-order approximation of the derivatives for efficiency [7].
Dataset Shift. Additionally, we conducted dataset shift experiments [29], in which the train set was the entire mini-ImageNet dataset and the validation and test sets were from CUB-200. The shift mini-ImageNet → CUB-200 was implemented with a ResNet-18 backbone [39] for better convergence. Dataset shift is significantly more challenging than a typical FSC setting because the model has to predict examples from distributions that are completely different from those of the train set. This is an intriguing situation from a practical perspective: for most real-world problems, it cannot be assumed that new tasks are sampled from a distribution the model is familiar with.
Hyperparameters. In most experiments, we used a Conv-4 backbone, except for the dataset shift experiments. The Conv-4 architecture [35] stacks 4 blocks, each comprising (Convolution + BatchNorm + ReLU + MaxPool). During training, we used data augmentation (RandomResizedCrop, ColorJitter, RandomHorizontalFlip), and each image was resized to a resolution of 84 (224 in the dataset shift). For MAML and TCMAML, meta-training lasted 1600 epochs for 5-shot and 2400 epochs for 1-shot experiments, with 25 episodes per epoch. One episode sampled 4 tasks (T = 4), and each task sampled 15 queries per class (k′ = 15). We used the Adam optimizer [40] with a learning rate of 5e-4. The inner-loop learning rate was 1e-2, and the number of inner-loop steps was 5. In addition, computing the weights for each task required a scaling factor, according to (8). T is a scaling hyperparameter that needs to be tuned depending on the dataset, the type of feature extractor, the size of the matrix (number of classes), etc. Note that T → ∞ implies an equally weighted sum, recovering the original meta-update. For TCMAML, we searched over candidate values of T.

Baselines. Our work aims to decrease the misconfidence that occurs in various few-shot learning algorithms. However, most few-shot learning studies have focused on increasing few-shot learning accuracy, evolving methods via sophisticated data augmentation techniques [41]-[43], transductive evaluation [44], [45], or self-supervised learning strategies [46], [47]. Due to analytical complexity, we had to choose representative methods for our experiments, especially those with open-source code available. Among the chosen methods, MetaOptNet [36] shows few-shot performance comparable to that of state-of-the-art algorithms in an inductive setting with a simple architecture.
In addition, few works have considered miscalibration in a few-shot learning framework, which makes comparative analysis difficult.

1) Evaluation
Our main concern is calibration error rather than classification performance. A model is well-calibrated when its decisions match its predicted uncertainty: the correctness of a perfectly calibrated model matches its confidence level (the largest value of the softmax output). For example, the model is miscalibrated (overconfident) when its confidence exceeds its empirical accuracy. The expected calibration error (ECE) and maximum calibration error (MCE) are defined as

ECE = Σ_{m=1}^{M} (|B_m| / n) |acc(B_m) − conf(B_m)|,  MCE = max_{m} |acc(B_m) − conf(B_m)|,

where acc(B_m) and conf(B_m), respectively, are the empirical accuracy and average confidence of the set B_m, and n is the number of samples. Partitioning the range [0, 1] into M segments of identical length, we collect each sample's prediction confidence, which is bounded by 0 and 1, and assign the sample to one of the M partitions according to its confidence level. Here B_m, referred to as a bin, is the subset of samples whose confidence falls into the m-th partition. Thus, a high calibration error implies that the model's confidence does not match its classification ability: high ECE means misconfidence on average, and high MCE means misconfidence for a certain subset of samples. Refer to [48] for a detailed derivation.

We chose ECE and MCE as evaluation metrics because we seek a model whose confidence matches its predictions. Even if a model shows poor classification performance on the target examples, we consider it well-calibrated if its calibration error is low: its predictions and predictive confidence are aligned. Another view of uncertainty estimation considers model uncertainty, often evaluated with proper scoring rules such as NLL (negative log-likelihood) and the Brier score [49]. However, such metrics are tied to model uncertainty, which implies that a model with low prediction error is likely to have a low uncertainty estimate [13]. In FSC, classification accuracy is usually very low and differs largely depending on the meta-learning algorithm and backbone architecture. Therefore, we focus on lowering the gap between accuracy and confidence instead of reducing the prediction error.
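ECE and MCE over equal-width confidence bins can be computed as follows (a minimal sketch; the helper name and toy data are ours):

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=10):
    """Expected / Maximum Calibration Error over M equal-width bins."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    # Bin index in [0, M-1]; confidence 1.0 falls into the last bin.
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece, mce = 0.0, 0.0
    for m in range(n_bins):
        mask = bins == m
        if not mask.any():
            continue
        gap = abs(corr[mask].mean() - conf[mask].mean())
        ece += mask.mean() * gap        # |B_m| / n weighting
        mce = max(mce, gap)
    return ece, mce

# Calibrated toy data: within each bin, accuracy equals average confidence.
conf = np.concatenate([np.full(4, 0.75), np.full(20, 0.95)])
correct = np.concatenate([[1, 1, 1, 0], np.r_[np.ones(19), 0]])
ece, mce = ece_mce(conf, correct)
```

An overconfident model, e.g., confidence 0.9 with only half the predictions correct, would instead yield ECE = MCE = 0.4.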

2) Calibration Improvements
In addition to MAML [7] and ProtoNet [2], we compared with ABML [18], BMAML [17], and MetaOptNet [36]. ABML (amortized Bayesian meta-learning) and BMAML (Bayesian MAML) are Bayesian methods for few-shot learning, both based on the MAML algorithm. MetaOptNet is an optimization-based model in which, during the adaptation step, a linear classifier is trained on the embeddings of the support set and returns classification scores on the query set. Table 1 summarizes the error rates on the mini-ImageNet and CUB-200 experiments. In most experiments, the task calibration method (TC-) exhibited a significantly lower error rate than its counterpart; for example, TCProtoNet outperformed ProtoNet. Even the Bayesian methods, when equipped with task calibration (TCABML and TCBMAML), further reduced the miscalibration rate.
It is well known that Bayesian methods are capable of calibration. As for ABML [18], if the parameters sampled from the approximate posterior effectively classify the query set, the posterior is updated in the direction of producing point estimators; otherwise, the approximate posterior produces a diverse parameter set, which results in uncertain predictions. Although our method does not deal with a parameter distribution, the idea of mitigating overconfident learning on an uncertain task is similar. Whereas the Bayesian methods implicitly lead to calibration via parameter uncertainty, our task calibration directly targets the support-query discrepancy. Notably, in almost every experiment, the task calibration method yielded consistent accuracy without a decrease in classification performance (refer to Section IV-B4). In contrast, as noted in [18], ABML often produced poor test accuracy.
We observed that different learning strategies strongly influence a model's calibration ability, and that our method orthogonally improves the diverse learning algorithms. Specifically, ProtoNet and TCProtoNet have better calibration ability than the other algorithms. ProtoNet updates its parameters using the distance between the support prototypes and the query set; it therefore warps the embedding space toward reducing the support-query discrepancy. Table 2 shows the calibration results for the dataset shift experiment. Although some experiments produced large miscalibration rates, task calibration consistently reduced them. In terms of accuracy, TCMAML is much worse than TCProtoNet and TCMetaOptNet (refer also to Table 3). The general belief that a network with lower training error tends to be overconfident [48] is not always true; in some cases, a sufficient amount of prediction capability is needed for better calibration.

4) Classification Accuracy
Calibrated models generally sacrifice classification ability for the sake of accurate risk prediction; however, our method does not degrade classification accuracy. In Table 3, we display the test accuracy for every experiment. ProtoNet-based methods outperform MAML-based methods, and do so by a large margin in the dataset shift experiments. MetaOptNet is implemented with a ResNet-12 backbone, which explains its high accuracy in the mini-ImageNet and CUB-200 experiments.
Unfortunately, TCMetaOptNet shows an accuracy drop beyond the confidence interval in the CUB-200 and dataset shift experiments. We suppose this happens because MetaOptNet itself has a high miscalibration error and TCMetaOptNet reduces that error significantly (refer to Table 1 and Table 2). Although largely improving calibration can cause a drop in accuracy, a calibrated model should be preferred when reliability is required, even at the cost of some accuracy.

1) Temperature Scaling: A Classic Calibration Approach
Temperature scaling (TS) [48] is a calibration method that is widely used in general classification; it learns a temperature parameter to control confidence and functions as post-processing on the validation set. However, in a few-shot learning setup, the train, validation, and test sets all have non-overlapping classes, and unseen classes are given in future tasks. Existing calibration methods such as TS find the best parameter for known classes, which is not applicable in FSC. This is why, in Fig. 6, +TS produced somewhat random results: a few results are good, but the rest are extremely poor. Note that +TS has the same accuracy as its counterpart, because TS is a post-processing method on the validation set and does not change the prediction label (it changes only the smoothness of the predictive distribution).
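For reference, TS amounts to dividing the logits by a single learned temperature T before the softmax. The sketch below fits T by a simple grid search over validation negative log-likelihood; the original work uses gradient-based optimization, and the function names here are ours:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, T):
    # Negative log-likelihood of the temperature-scaled predictions.
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 151)):
    # Grid-search the single temperature T that minimizes validation NLL.
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])
```

Because dividing by a positive T preserves the argmax, accuracy is unchanged; only the sharpness of the predictive distribution moves, which is exactly why T fitted on seen validation classes can fail on the unseen classes of FSC.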

2) TCMAML with Linear Scaling
In every experiment so far, we computed w_τ using a softmax function over the similarity scores. While the softmax function uses exponential scaling (ES), normalizing by linear scaling (LS) is also possible; in that case, no scaling factor is needed.
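As a sketch, the two weighting schemes might look as follows. The exact linear normalization used in this experiment is not reproduced in this excerpt, so task_weights_ls shows one plausible form; in both cases a smaller inter-class similarity score s_τ (a better-defined task) receives a larger weight:

```python
import numpy as np

def task_weights_es(scores, T=1.0):
    # Exponential scaling: softmax over negated similarity scores, so
    # tasks with high inter-class similarity (ill-defined tasks) get
    # small weights. T is the scaling factor (temperature).
    z = -np.asarray(scores, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def task_weights_ls(scores):
    # Linear scaling: normalize without an exponential, so no scaling
    # factor is needed. Here we take positive complements of the
    # scores and divide by their sum (one plausible form).
    s = np.asarray(scores, dtype=float)
    c = s.max() - s + 1e-8  # smaller score -> larger weight
    return c / c.sum()
```

Both schemes produce weights that sum to one over the meta-batch; ES lets the scaling factor T control how sharply well-defined tasks are favored, while LS has no such knob.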

3) Inter-Class Support Embeddings
In Section III-C, we claimed that h_{ϕ_τ} is determined by the relation between the support set and the query set, because the support-adapted parameter ϕ_τ embeds the queries. To further verify this, we experimented with a different embedding method that uses only the support set for computing the inter-class similarity scores (TCMAML-SPT). Table 4 summarizes the results. As expected, using only the support set did not capture the distribution mismatch: because the adaptation process fits ϕ_τ to the support set, the support examples are well classified and no longer reflect the support-query discrepancy. Consequently, TCMAML-SPT incorrectly estimated task difficulty, failed to calibrate MAML effectively, and produced several poor results.

4) Reverse TCMAML
What would happen if TCMAML computed w_τ as a normalized +s_τ value, i.e., placing large weights on ill-defined tasks and small weights on well-defined tasks? We hypothesized that this would degrade calibration in the reverse direction of TCMAML. However, Table 5 shows an unexpected result: TCMAML-R still had a lower miscalibration rate than MAML on 1-shot tasks. Our conjecture is that assigning larger weights to difficult tasks moves the meta-parameters toward solving these tasks better. Those tasks then receive smaller similarity scores than in previous iterations and consequently smaller weights. This alternation produces inconsistent calibration results.

Table 6 shows the invariance of our method to changes in backbone architecture and resolution (image size). Note that the convolutional backbones (e.g., Conv-4 and Conv-6) are paired with image size 84 and that the residual backbones (e.g., ResNet-10 and ResNet-18) are paired with image size 224. Recall that in Section IV-B2, we only presented results on the Conv-4 backbone with image size 84, except for MetaOptNet and TCMetaOptNet. The results in Table 6 demonstrate that TCMAML is consistently better than MAML for Conv-6, ResNet-10, and ResNet-18. In addition, deeper backbones tend to have higher miscalibration rates, and TCMAML is more likely to improve them in these cases.

D. COMPARISON WITH BAYESIAN METHOD
Bayesian meta-learning models are well known to capture uncertainty and calibrate the model, but task calibration can still improve them. In Table 1, we showed that task calibration on both ABML and BMAML further decreased the calibration error rate. In this section, we compare our method with a Bayesian method, ABML [18]. ABML approximates the posterior distribution with K SGD steps and obtains the variational parameters {μ^{(K)}, σ^{2(K)}}. This variational distribution should capture uncertainty; that is, if the model is uncertain, it should produce a large standard deviation parameter, so that the prediction probability can differ largely depending on the posterior samples.
To determine whether ABML really captures uncertainty, we tracked the σ^{2(K)} values during the training of ABML. However, contrary to our expectations, the variational standard deviation parameters were nearly 0 (see Fig. 7(a)). This implies that posterior sampling produces similar samples and outputs, i.e., it reduces the model to a point estimator. Moreover, noting that ABML sees 100 tasks per epoch, the σ^{2(K)} values hardly differed across the 100 tasks. This observation implies that ABML does not consider the uncertainty of each task but rather captures global model uncertainty.
In contrast, TCMAML shows a large difference in similarity scores across the 100 tasks (see Fig. 7(b)). We interpret this as TCMAML capturing the uncertainty of each task in a more sophisticated manner than ABML.
Furthermore, we did not implement ABML in the dataset shift experiments because of its memory and computation inefficiency. As a Bayesian method, ABML is trained by marginalizing over model parameters. In implementation, this requires significant computational overhead because several models must be held in memory and receive gradients in parallel. During the meta-training of ABML, 5 ensemble networks run in parallel, and during meta-testing, 10 ensemble networks are employed. In Fig. 7(c), TCMAML and ABML show an enormous difference in GPU memory usage, even though both are MAML-based methods. We also found that ABML takes a long time to compute during the meta-train phase. Because of this inefficiency, we could not apply ABML to a ResNet-18 model, the backbone for the dataset shift experiments: ABML with ResNet-18 exceeds the capacity of our single GPU and also requires an excessive amount of time to converge.

E. CORRUPTED META-LEARNING
Our task calibration method is robust against data corruption. Meta-training with corrupted data is challenging for the meta-learner because the corruption introduces large noise that impedes the accumulation and transfer of meta-knowledge. Nevertheless, robustness is crucial because the model should still learn transferable meta-knowledge even on corrupted tasks. An uncertainty-aware method such as task calibration can prevent noisy learning and achieve robustness [50].
To this end, we trained MAML and ProtoNet under data corruption by manipulating the task sampling process in the meta-train phase. Specifically, we corrupted tasks to include random Gaussian images with some fixed probability p. That is, a corrupted task consisted of 5 clean support images from the mini-ImageNet train set and 15 corrupted query images whose pixel values were drawn from N(0, 1).
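A minimal sketch of this corruption scheme, where the function and parameter names are ours and clean_images stands in for the mini-ImageNet train split:

```python
import numpy as np

def sample_task(clean_images, p=0.25, n_support=5, n_query=15,
                image_shape=(3, 84, 84), rng=None):
    # Sample one class slice of a few-shot task. With probability p,
    # the query set is replaced by Gaussian noise images ~ N(0, 1)
    # while the support set stays clean, as in the corruption
    # experiment above.
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(clean_images), n_support + n_query, replace=False)
    support = clean_images[idx[:n_support]]
    query = clean_images[idx[n_support:]]
    if rng.random() < p:
        query = rng.standard_normal((n_query,) + image_shape)
    return support, query
```

Under this scheme the support-query mismatch of a corrupted task is maximal, which is precisely the situation the similarity-score weighting is designed to downweight.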
We show in Table 7 that MAML is already robust against this attack due to informative feature reuse [51] and overfitting to the uncorrupted images; thus, there was no improvement with TCMAML. However, a metric-based model such as ProtoNet is fragile because this type of model learns directly by matching the query examples to the support set. If the support and query sets are heterogeneous, ProtoNet is easily damaged by unwanted updates. Therefore, we observed that TCProtoNet gained considerable robustness. In Table 7, when 25% of tasks were corrupted (p = 0.25), the ProtoNet performance dropped to 61.38%. Not surprisingly, 50% corruption (p = 0.5) led to a further decrease, 24.01%p lower than when trained without corruption. TCProtoNet managed to exceed its counterpart, reaching 63.45% and 48.65% for 25% and 50% corruption, respectively.

V. CONCLUSION
In this study, we asserted that few-shot image classification problems should be examined by carefully estimating task difficulty. We found that FSC contains a large discrepancy between the support and the query set, although typical meta-learning algorithms do not consider this mismatch. During meta-training, the distribution mismatch causes the model to have incorrect confidence, which causes a calibration problem. Therefore, we modified the meta-update rule based on similarity scores, and a simple experiment verified that giving different weights to the tasks actually affects the model's prediction confidence, which consequently helps calibration. The calibration results verified reliable decision-making without degrading the model's classification ability. Although our method modifies the meta-training process, it would be helpful in the future to develop a post-processing method that could enhance calibration in the meta-test phase.

APPENDIX A REIMPLEMENTATION OF ABML, BMAML, AND METAOPTNET
A. ABML
ABML [18] is a Bayesian approach that estimates parameter uncertainty in meta-learning. We reimplemented the ABML algorithm because 1) open-source code was not available, 2) the authors only reported the mini-ImageNet 5-way 1-shot experiment result, and 3) our experimental setup was slightly different from theirs. Therefore, we implemented ABML based on our setup. First, we sampled 4 tasks within an episode, as we did in the TCMAML experiments. Second, the authors of [18] did not use data augmentation, but we used the same data loaders, including the same data augmentation scheme, as in our other experiments. Third, we used a first-order approximation of derivatives, as in TCMAML. For the other setups, such as the number of ensemble networks to train and validate, the learning rates, and the KL-divergence reweighting factor, we followed the original study (Table 8).
We implemented TCABML in a manner similar to that of TCMAML. After adaptation to a task, the model learns task-specific parameters, which are the variational parameters of the approximate posterior distribution. Given these parameters, it samples network parameters M times and uses them to predict the query examples. From each sample we can obtain feature vectors, as in computing the class-wise features. We then compute the similarity scores for the M samples and take the largest of the M scores. We repeat this process for every task in a meta-batch and then compute the weights in the meta-update. We use a scaling factor T = 1/2 for mini-ImageNet and T = 2 for CUB-200.
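A rough sketch of this max-over-samples score follows. The precise class-wise feature and similarity definitions live in Section III-C (not reproduced in this excerpt), so cosine similarity between class-mean feature vectors is used here purely as an illustration:

```python
import numpy as np

def tcabml_similarity(sample_features, labels, n_way):
    # sample_features: array of shape (M, N, D) holding feature
    # vectors for N query examples under M posterior samples.
    # For each sample, compute class-wise mean features, then the
    # mean pairwise cosine similarity between class means; finally
    # return the largest score over the M samples.
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    scores = []
    for feats in sample_features:
        means = [feats[labels == c].mean(axis=0) for c in range(n_way)]
        pair = [cosine(means[i], means[j])
                for i in range(n_way) for j in range(i + 1, n_way)]
        scores.append(float(np.mean(pair)))
    return max(scores)
```

Taking the maximum over the M posterior samples is a conservative choice: a task is treated as ill-defined if any sampled network finds its classes hard to separate.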

B. BMAML
BMAML [17] is also a Bayesian method based on the MAML algorithm. It likewise holds ensemble networks (denoted "particles"), each updated by SVGD steps. We did not implement BMAML in the dataset shift experiments for the same efficiency reason as ABML. As with TCABML, TCBMAML was implemented by obtaining the feature vectors from each particle and taking the largest value over the 5 training particles. Most setups followed those of ABML and TCABML, e.g., 4 tasks per episode, data augmentation, and the first-order approximation. We used a scaling factor T = 1 for every experiment.

C. METAOPTNET
Recent studies have achieved excellent few-shot results by using a linear classifier that solves a convex optimization problem; the authors of [36] named it MetaOptNet. Specifically, we implemented MetaOptNet-SVM, which uses a multi-class SVM as the linear classifier. MetaOptNet-SVM learns the parameters of the SVM by optimizing on the embeddings of the support examples. The learned SVM parameters produce score logits for each query example. TCMetaOptNet uses these logit vectors when computing the class-wise average vectors, as we did in the TCProtoNet implementation.
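A minimal sketch of the class-wise averaging of the SVM score logits; the function name and the exact averaging are our assumptions:

```python
import numpy as np

def classwise_logit_means(logits, labels, n_way):
    # Average the SVM score logits of the query examples belonging to
    # each class, yielding one class-wise average vector per class.
    # These vectors then play the role of the class features when
    # computing the inter-class similarity scores.
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    return np.stack([logits[labels == c].mean(axis=0) for c in range(n_way)])
```

The resulting (n_way, n_way) matrix of average logit vectors can be fed to the same pairwise similarity computation used for TCProtoNet's class-wise features.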
We followed the protocol from [36], which used 15 support examples in meta-training, and then evaluated using 1-shot and 5-shot tasks in meta-testing. We trained for 30 epochs, where 1 epoch contained 1000 episodes. We set the scaling factor as T = 0.1. MetaOptNet produces highly uncorrelated logit vectors, due to the optimization process of the SVM. Therefore, we used a small scaling factor to create the weight variations between tasks. Also, we used the ResNet-12 architecture for every experiment, including the dataset shift.

APPENDIX B RELIABILITY DIAGRAMS
Reliability diagrams [52] visually present model calibration by plotting empirical accuracy against average prediction confidence, which is the largest value of the output distribution. Fig. 8 shows the reliability diagrams comparing MAML, ABML, and TCMAML. TCMAML has its bars closer to the y = x line than MAML and ABML, which implies more reliable predictions. Furthermore, the CUB experiments show slightly different results: MAML is underconfident in its predictions, with the empirical accuracy higher than the confidence. A model that is not sufficiently confident can cause another type of problem. As opposed to overconfidence, low confidence often occurs for ambiguous images with low accuracy. This situation is vulnerable to adversarial attacks or image perturbations [53], which can readily decrease the overall accuracy because of the model's low confidence. In this case, a model that can accurately tell easy tasks apart is needed. TCMAML again succeeds in calibration with sufficient confidence on such tasks.
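A reliability diagram can be computed by binning predictions by confidence, as in this sketch (function and variable names are ours):

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    # Bin predictions by confidence (the max predicted probability)
    # and return the per-bin average confidence and empirical
    # accuracy; bars near the y = x line indicate a well-calibrated
    # model, and NaN marks an empty bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    conf_bins, acc_bins = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            conf_bins.append(confidences[mask].mean())
            acc_bins.append(correct[mask].mean())
        else:
            conf_bins.append(np.nan)
            acc_bins.append(np.nan)
    return np.array(conf_bins), np.array(acc_bins)
```

Plotting acc_bins as bars against the bin centers reproduces the diagrams in Fig. 8; the per-bin gap between confidence and accuracy is also the quantity that miscalibration metrics aggregate.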