Robust Meta-Representation Learning via Global Label Inference and Classification

Few-shot learning (FSL) is a central problem in meta-learning, where learners must efficiently learn from few labeled examples. Within FSL, feature pre-training has recently become an increasingly popular strategy to significantly improve generalization performance. However, the contribution of pre-training is often overlooked and understudied, with limited theoretical understanding of its impact on meta-learning performance. Further, pre-training requires a consistent set of global labels shared across training tasks, which may be unavailable in practice. In this work, we address the above issues by first showing the connection between pre-training and meta-learning. We discuss why pre-training yields a more robust meta-representation and connect the theoretical analysis to existing works and empirical results. Secondly, we introduce Meta Label Learning (MeLa), a novel meta-learning algorithm that learns task relations by inferring global labels across tasks. This allows us to exploit pre-training for FSL even when global labels are unavailable or ill-defined. Lastly, we introduce an augmented pre-training procedure that further improves the learned meta-representation. Empirically, MeLa outperforms existing methods across a diverse range of benchmarks, in particular under a more challenging setting where the number of training tasks is limited and labels are task-specific. We also provide an extensive ablation study to highlight its key properties.


Introduction
Deep neural networks have facilitated transformative advances in machine learning in various areas [e.g. 5,19,22,29,41,58]. However, state-of-the-art models typically require labeled datasets of extremely large scale, which are prohibitively expensive to curate. When training data is scarce, neural networks often overfit, which degrades performance significantly. Few-shot learning (FSL) aims to address this loss in performance by developing algorithms and architectures capable of learning from few labeled samples.
Meta-learning [23,64] is a popular class of algorithms created for tackling FSL. Broadly, meta-learning seeks to learn transferable knowledge over many FSL tasks, and to apply such knowledge to novel ones. For instance, Model Agnostic Meta Learning (MAML) [14] learns a prior over the model initialization that is suitable for fast adaptation. Existing meta-learning methods for tackling FSL may be loosely classified into three categories: optimization [e.g. 4,14,70], metric learning [e.g. 59,60,65], and model-based methods [e.g. 20,47,55]. The diversity of existing strategies poses a natural question: can we derive any "meta-insights" from them to facilitate the design of future methods?
Among the existing methods, several trends have emerged for designing robust few-shot meta-learners. Chen et al. observed that data augmentation and deeper networks significantly improve generalization performance [6]. These observations have since been widely adopted [e.g. 2,32,62]. On the other hand, network pre-training has also become ubiquitous [e.g. 12,53,67,70] and dominates state-of-the-art models. Sidestepping the task structure and episodic training of meta-learning, pre-training learns (initial) model parameters by merging all FSL tasks into one "flat" dataset of labeled samples, followed by standard multi-class classification. The model parameters may be further fine-tuned to improve performance.
Despite its popularity, the limited theoretical understanding of pre-training leads to diverging interpretations of existing methods. Most works consider pre-training as nothing but a standard pre-processing step, and attribute the observed performance almost exclusively to their respective algorithmic and network design choices [e.g. 55,71,73]. However, extensive empirical evidence suggests that pre-training is crucial for model performance [68,70]. Tian et al. demonstrated that simply learning task-specific linear classifiers over the pre-trained representation outperforms a number of meta-learning strategies [62]. Wertheimer et al. further showed that earlier FSL methods may also benefit from pre-training, resulting in improved performance [70].
In this work, we show that pre-training directly relates to meta-learning by minimizing an upper bound on the meta-learning loss. In particular, we show that pre-training achieves a smaller expected error and enjoys a better convergence rate compared to its meta-learning counterpart. More broadly, we connect pre-training to conditional meta-learning [8,67], which has favorable theoretical properties including tighter bounds. Our results provide a principled justification of why pre-training yields a robust meta-representation for FSL, and of the associated performance improvement.
Motivated by this result, we propose an augmentation procedure for pre-training in order to improve representation learning for FSL. The augmentation procedure quadruples the number of training classes by considering rotations as novel classes and classifying them jointly. This significantly increases the size of the training data and leads to robust representations. We empirically demonstrate that the augmentation procedure consistently performs better across different benchmarks.
The standard FSL setting [e.g. 4,14,42] assumes access to a collection of tasks (i.e. the meta-training set) as training data. To perform pre-training, meta-training tasks must be merged into a flat dataset (see Sec. 2.3 for a formal definition), which implicitly assumes access to some notion of global labels shared across all tasks. However, global labels may be non-existent or inaccessible, such as when each task is independently labeled with only local labels. This renders naive task merging and pre-training infeasible (see Fig. 1a). Independent task annotation is a more realistic and general assumption, capturing scenarios where training tasks are collected organically from different sources rather than generated synthetically from a base dataset. Practical scenarios where naive task merging is infeasible include non-descriptive labels across tasks (e.g. numerical labels) or concept overlap (e.g. sea animals vs. mammals) among different task labels.
To tackle independent task annotation, we propose Meta Label Learning (MeLa), a novel algorithm that automatically infers a notion of latent global labels consistent with local task constraints. The inferred labels enable us to exploit pre-training for FSL, and to bridge the gap between experimental settings with or without access to global labels. Empirically, we demonstrate that MeLa is competitive with training on oracle labels.
For experiments, we introduce a new generalized FSL (GFSL) setting. In addition to independent task annotation, we also adopt a fixed-size meta-training set and enforce no repetition of samples across tasks. This challenging setting evaluates how efficiently meta-learning algorithms generalize from a limited number of tasks, and prevents the algorithms from trivially uncovering task relations by implicitly matching identical samples across tasks. We empirically show that MeLa performs robustly in both standard and GFSL settings, and clearly outperforms state-of-the-art models in the latter.
We summarize the main contributions below:
• We prove that pre-training relates to meta-learning as a loss upper bound. Consequently, minimizing the pre-training loss is a viable proxy for tackling meta-learning problems. Additionally, we identify meta-learning regimes where pre-training offers a clear improvement with respect to sample complexity. This theoretical analysis provides a principled explanation for pre-training's empirical advantage.
• We propose MeLa, a general algorithm for inferring latent global labels from meta-training tasks. It allows us to exploit pre-training when global labels are absent or ill-defined.
• We propose an augmented pre-training procedure for FSL and a GFSL experimental setting.
• Extensive experiments demonstrate the robustness of MeLa. Detailed ablations provide a deeper understanding of the model.

Extension of [68]. This paper is an extended version of [68] with the following contributions in addition to those of the original work: i) a deeper theoretical insight into the role of pre-training from the perspective of the risk (rather than the empirical risk as in [68]), and quantifying its benefit in terms of sample complexity, ii) the augmented training procedure for FSL, iii) the GFSL experimental setting, iv) significantly more empirical evidence to support the proposed algorithm.

Background
We formalize FSL as a meta-learning problem and review related methods. We also discuss the pre-training procedure adopted by many FSL methods.

Few-shot Learning using Meta-learning
FSL [13] considers a meta-training set of tasks $\mathcal{T} = \{(S_t, Q_t)\}_{t=1}^{T}$, with support set $S_t = \{(x_j, y_j)\}_{j=1}^{n_s}$ and query set $Q_t = \{(x_j, y_j)\}_{j=1}^{n_q}$ sampled from the same distribution. Typically, $S_t$ and $Q_t$ contain a small number of samples $n_s$ and $n_q$, respectively. We denote by $\mathcal{D}$ the space of datasets of the form $S_t$ or $Q_t$.
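The meta-learning formulation for FSL aims to find the best base learner $\mathrm{Alg}(\theta, \cdot) : \mathcal{D} \to \mathcal{F}$ that takes as input support sets $S$ and outputs predictors $f = \mathrm{Alg}(\theta, S)$, such that predictions $y = f(x)$ generalize well on the corresponding query sets $Q$. The base learner is meta-parametrized by $\theta \in \Theta$. Formally, the meta-learning objective for FSL is

$$\min_{\theta \in \Theta} \; \mathbb{E}_{(S,Q) \in \mathcal{T}} \; \mathcal{L}\big(\mathrm{Alg}(\theta, S), Q\big), \tag{1}$$

where $\mathbb{E}_{(S,Q) \in \mathcal{T}}$ denotes the empirical average over the meta-training set $\mathcal{T}$, and the task loss

$$\mathcal{L}(f, D) = \mathbb{E}_{(x,y) \in D} \; \ell(f(x), y) \tag{2}$$

is the empirical risk of a predictor $f$ over a dataset $D$ with respect to an inner loss $\ell$. Eq. (1) is sufficiently general to describe most existing methods. For instance, model-agnostic meta-learning (MAML) [14] parameterizes a model $f_\theta : \mathcal{X} \to \mathcal{Y}$ as a neural network, and $\mathrm{Alg}(\theta, D)$ performs one (or more) steps of gradient descent minimizing the empirical risk of $f_\theta$ on $D$. Formally, given a step-size $\eta > 0$,

$$\mathrm{Alg}(\theta, D) = f_{\theta'} \quad \text{with} \quad \theta' = \theta - \eta \, \nabla_\theta \mathcal{L}(f_\theta, D). \tag{3}$$

Clearly, the base learner $\mathrm{Alg}(\theta, \cdot)$ is key to model performance, and various strategies have been explored. Our proposed method is most closely related to meta-representation learning [4,15,32,48], which parametrizes the base learner as $\mathrm{Alg}(\theta, D) = w(g_\theta(D)) \, g_\theta(\cdot)$, separating it into a global feature extractor $g_\theta : \mathcal{X} \to \mathbb{R}^m$ and a task-adaptive classifier $w : \mathcal{D} \to \{f : \mathbb{R}^m \to \mathcal{Y}\}$. This results in the optimization problem

$$\min_{\theta \in \Theta} \; \mathbb{E}_{(S,Q) \in \mathcal{T}} \; \mathcal{L}\big(w(g_\theta(S)), g_\theta(Q)\big), \tag{4}$$

where $g_\theta(D) \triangleq \{(g_\theta(x), y) \mid (x, y) \in D\}$ is the embedded dataset. Eq. (4) specializes (1) by learning a feature extractor $g_\theta$ shared (and fixed) among tasks. Only the classifier returned by $w(\cdot)$ adapts to the current task, in contrast to having the entire model $f_\theta : \mathcal{X} \to \mathcal{Y}$ adapted (e.g. Eq. (3) for MAML). While this may appear to restrict model adaptability, [48] has demonstrated that meta-representation learning matches MAML's performance. Moreover, they showed that feature reuse, rather than adapting the representation to the task at hand, is the dominant contributor to generalization performance. The task-adaptive classifier $w(\cdot)$ may take various forms, including nearest neighbor [59], ridge regression classifier [4], embedding adaptation with transformer models [73], and Wasserstein distance metric [74]. In particular, the ridge regression estimator

$$w_{\mathrm{ridge}}(D) = \arg\min_{W} \; \mathbb{E}_{(x,y) \in D} \, \|W x - \mathrm{OneHot}(y)\|^2 + \lambda \|W\|_F^2, \tag{5}$$

where $\|\cdot\|_F$ is the Frobenius norm and $\lambda > 0$ a regularization weight, admits a differentiable closed-form solution and is computationally efficient for optimizing (4).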
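To make the ridge regression base learner concrete, the following is a minimal NumPy sketch of its closed-form solution. It assumes one-hot regression targets and a mean squared error; the function names `ridge_classifier` and `predict` and the regularization weight `lam` are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def ridge_classifier(features, labels, num_classes, lam=1.0):
    """Closed-form ridge regression onto one-hot targets (illustrative sketch).

    features: (n, m) array of embedded support samples g_theta(x)
    labels:   (n,) integer local labels in [0, num_classes)
    Returns a weight matrix W of shape (num_classes, m).
    """
    n, m = features.shape
    targets = np.eye(num_classes)[labels]               # (n, k) one-hot targets
    gram = features.T @ features / n + lam * np.eye(m)  # (m, m), always invertible
    rhs = features.T @ targets / n                      # (m, k)
    # Solve (X^T X / n + lam I) W^T = X^T Y / n without forming an explicit inverse.
    return np.linalg.solve(gram, rhs).T

def predict(W, features):
    """Predict the local label with the highest regression score."""
    return np.argmax(features @ W.T, axis=1)
```

Because the solution is a linear solve, gradients with respect to the embedded features can flow through it, which is what makes this base learner convenient inside a meta-learning objective.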

Conditional Meta-Learning
Conditional formulations of meta-learning [8,67] extend Eq. (1) by considering base learners of the form $\mathrm{Alg}(\tau(Z), S)$, where the meta-parameters $\theta = \tau(Z)$ are conditioned on some "contextual" information $Z \in \mathcal{Z}$ about the task $S$. Assuming each task in the meta-training set $\mathcal{T}$ to be equipped with such contextual information, (1) can be re-expressed as

$$\min_{\tau : \mathcal{Z} \to \Theta} \; \mathbb{E}_{(S,Q,Z) \in \mathcal{T}} \; \mathcal{L}\big(\mathrm{Alg}(\tau(Z), S), Q\big), \tag{6}$$

namely the problem of learning a function $\tau : \mathcal{Z} \to \Theta$ which, given some contextual information $Z$ in a suitable space $\mathcal{Z}$, returns a good task-specific meta-parameter $\theta = \tau(Z)$. While the contextual information could encode virtually any information available on individual tasks (e.g. a textual meta-description of the task/dataset), most recent work on the topic focuses on the case where $Z$ is the task's support set itself, namely $Z = S$, since this is always available by construction.
The conditional formulation seeks to capture complex (e.g. multi-modal) distributions of meta-training tasks, and uses a unique base learner tailored to each one. In particular, [55,66,72] directly learn data-driven mappings from target tasks to meta-parameters, and [27] conditionally transforms feature representations based on a metric space trained to capture inter-class dependencies. Alternatively, [26] considers a mixture of hierarchical Bayesian models over the parameters of meta-learning models in order to condition on target tasks. In [67], Wang et al. showed that conditional meta-learning can be interpreted as a structured prediction problem and proposed a method leveraging recent advances in the latter field. From a more theoretical perspective, Denevi et al. [8,9] proved that conditional meta-learning is theoretically advantageous compared to unconditional approaches by incurring smaller excess risk and being less prone to negative transfer. As we will discuss in Sec. 3, conditional meta-learning is closely related to our theoretical analysis on feature pre-training.

Feature Pre-training
Feature pre-training has been widely adopted in the recent meta-learning literature [e.g. 2,6,39,44,51,67,70,71,73,74], and is arguably one of the key contributors to the performance of state-of-the-art models. Instead of directly learning the feature extractor $g_\theta$ by optimizing (4), pre-training first learns a feature extractor via standard supervised learning.
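Formally, the meta-training set $\mathcal{T}$ is "flattened" into $D_{\mathrm{global}}$ by merging all tasks:

$$D_{\mathrm{global}} = D(\mathcal{T}) = \{(x_i, y_i)\}_{i=1}^{N} = \bigcup_{(S,Q) \in \mathcal{T}} (S \cup Q), \tag{7}$$

where we have re-indexed the $(x_i, y_i)$ samples from $i = 1$ to $N$ (the cumulative number of points from all support and query sets) to keep the notation uncluttered. Pre-training then learns the embedding function $g_\theta$ on $D_{\mathrm{global}}$ using the standard cross-entropy loss $\ell_{\mathrm{ce}}$ for multi-class classification:

$$(W_N^{\mathrm{pre}}, \theta_N^{\mathrm{pre}}) = \arg\min_{\theta, W} \; \mathcal{L}(W g_\theta, D_{\mathrm{global}}) = \arg\min_{\theta, W} \; \mathbb{E}_{(x,y) \in D_{\mathrm{global}}} \; \ell_{\mathrm{ce}}(W g_\theta(x), y), \tag{8}$$

where $W$ is the linear classifier over all classes. After pre-training, the feature extractor is either fixed [e.g. 55,62,67,73,74] or further adapted [e.g. 2,3,17,51] via meta-learning.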
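The flattening step can be sketched in a few lines of Python. This is only an illustration under the assumption that every sample already carries a global label; the function name `flatten_tasks` and the tuple layout of a task are hypothetical.

```python
def flatten_tasks(tasks):
    """Merge all support and query sets into one flat labeled dataset.

    tasks: list of (support, query) pairs, each a list of (x, global_label).
    Returns the inputs, their integer class indices, and the sorted class list.
    """
    flat = []
    for support, query in tasks:
        flat.extend(support)
        flat.extend(query)
    # Map every distinct global label to a class index shared across tasks.
    classes = sorted({label for _, label in flat})
    index = {label: i for i, label in enumerate(classes)}
    inputs = [x for x, _ in flat]
    targets = [index[label] for _, label in flat]
    return inputs, targets, classes
```

The resulting `inputs` and `targets` can be passed to any standard multi-class training loop with a cross-entropy loss. The mapping in `index` is exactly what becomes unavailable when tasks only carry local labels.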
There is limited theoretical understanding and consensus on the effect of pre-training in FSL. In [2,55,71], pre-training is only considered a standard pre-processing step for encoding the raw input, and model performance is predominantly attributed to the proposed meta-learning algorithm.
In [17], the authors similarly argued that meta-trained features are fundamentally better than pre-trained ones, observing that adapting the pre-trained features with several base learners resulted in worse performance compared to the meta-learned features. In contrast, several works have empirically demonstrated that pre-training contributes significantly towards performance. [62] first showed that combining the pre-trained features with suitable base learners already outperforms various meta-learning methods, while [12] observed that pre-training dominates top entries for the 2021 Meta-learning Challenge.

Pre-training as Meta-learning
In this section, we characterize how feature pre-training relates to meta-learning as a loss upper bound. More precisely, we show that pre-training induces a special base learner with its corresponding meta-learning loss upper bounded by the cross-entropy loss $\ell_{\mathrm{ce}}$. Consequently, pre-training already produces a meta-representation suitable for FSL, matching the empirical results from [62,73]. In addition, we show that pre-training incurs a smaller risk compared to its meta-learning counterpart, and more generally induces a conditional formulation that exploits contextual information for more robust learning.

Notation
We consider a few-shot classification setting with a total of $C$ classes (global labels). Denote by $\mu$ the meta-distribution sampling distributions (a.k.a. tasks) $\rho$, from which we sample support and query sets $(S, Q)$ for each task. Each task distribution $\rho$ is associated with $k \le C$ class labels $\rho_Y = \{y_\rho^{(1)}, \dots, y_\rho^{(k)}\}$.

Meta-learning Expected Error
We first define the expected error incurred by a meta-learning algorithm solving (4):

$$\mathcal{E}(\theta, \mu) = \mathbb{E}_{\rho \sim \mu} \; \mathbb{E}_{(S,Q) \sim \rho} \; \mathcal{L}\big(w(g_\theta(S)), g_\theta(Q)\big). \tag{9}$$

This is the meta-learning risk incurred by a meta-parameter $\theta$, namely the error incurred by training the classifier via $w(g_\theta(S))$ (e.g. Eq. (5)) and testing it on the query set $g_\theta(Q)$, averaged over $(S, Q)$ pairs sampled from tasks $\rho$, which in turn are sampled from the meta-distribution $\mu$. The risk is the ideal error we wish to minimize. Let $\theta_T$ denote the meta-parameters learned by the algorithm minimizing (4), where we recall $T$ as the number of tasks in the meta-training set $\mathcal{T}$. By making suitable assumptions on the meta-parameter space and following standard arguments from statistical learning theory (e.g. using Rademacher complexity [57]), it is possible to guarantee that (see e.g. [1,18])

$$\mathcal{E}(\theta_T, \mu) \le \min_{\theta \in \Theta} \mathcal{E}(\theta, \mu) + O(1/\sqrt{T}), \tag{10}$$

namely that the risk incurred by the meta-learning algorithm becomes closer to that of the ideal meta-learning parameters as the number $T$ of observed tasks grows. Here the term $O(1/\sqrt{T})$ denotes how fast the learning error converges to the ideal solution as a function of the size of the meta-training set. The big-O notation abstracts away quantities related to different parametrizations of $\Theta$.

Global Label Selection
Let us consider a special FSL scenario where global labels are available to the model (in contrast to the standard setting where only local labels are available, see Sec. 2). Since we have access to global labels, we can design a new algorithm that learns a single global multi-class linear classifier $W$ at the meta-level (i.e. shared across all tasks), and simply selects the required rows $W[S_Y]$ when tackling a task, where $S_Y$ denotes the set of global labels of the classes present in $S$. More formally, we can define a special base learner called global label selector (GLS) such that

$$\mathrm{GLS}(\theta, W, S) = W[S_Y] \, g_\theta(\cdot).$$

Illustrated in Fig. 1b, this "algorithm" does not solve an optimization problem on the dataset $S$, but only selects the subset of rows of $W$ corresponding to the classes present in $S$ as the task-specific classifier. Similar to (9), we define the meta-risk for GLS as

$$\mathcal{E}_{\mathrm{GLS}}(W, \theta, \mu) = \mathbb{E}_{\rho \sim \mu} \; \mathbb{E}_{(S,Q) \sim \rho} \; \mathcal{L}\big(\mathrm{GLS}(\theta, W, S), Q\big),$$

the error incurred by using the meta-representation $g_\theta$ and the global linear classifier $W$ to tackle the meta-learning problem associated with $\mu$.
Analogously to standard meta-learning (where we only learn $\theta$), since $W$ and $\theta$ are now both shared across all tasks, we may learn them jointly by solving the following minimization problem:

$$\min_{W, \theta} \; \mathbb{E}_{(S,Q) \in \mathcal{T}} \; \mathcal{L}\big(\mathrm{GLS}(\theta, W, S), Q\big). \tag{11}$$

This strategy, to which we refer as meta-GLS, learns both the representation and the linear classifier at the meta-level, with the sole task-specific adaptation process being the selection of rows of $W$ using the global labels. Following the same reasoning for obtaining (10), we can analogously conclude that

$$\mathcal{E}_{\mathrm{GLS}}(W_T, \theta_T, \mu) \le \min_{W, \theta} \mathcal{E}_{\mathrm{GLS}}(W, \theta, \mu) + O(1/\sqrt{T}), \tag{12}$$

with $(W_T, \theta_T)$ being the minimizer of (11). Hence, the error of meta-learning the GLS parameters would decrease with the same rate as that of meta-learning $\theta$ only.
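The row selection performed by GLS is simple enough to sketch directly. The snippet below is an illustration only: it assumes that global labels are integer row indices into `W`, and the helper names `gls` and `gls_predict` are hypothetical.

```python
import numpy as np

def gls(W, support_global_labels):
    """Select the rows of the global classifier W for the classes in a task."""
    task_classes = np.unique(support_global_labels)  # S_Y, sorted global labels
    return W[task_classes], task_classes

def gls_predict(W, support_global_labels, query_features):
    """Classify embedded query samples among the task's classes only."""
    W_task, task_classes = gls(W, support_global_labels)
    scores = query_features @ W_task.T               # (n_query, k)
    return task_classes[np.argmax(scores, axis=1)]
```

No optimization is run on the support set here. Its only role is to reveal which global classes the task contains, which is why GLS needs global labels and cannot be applied to novel classes.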

GLS finds a good meta-representation.
As intuition suggests, learning a global $W$ shared among multiple tasks (rather than having each classifier $w(g_\theta(S))$ access exclusively the task's training data) can be very advantageous for generalization. This is evident when the (global) classes are separable for a meta-representation $g_\theta$. In this case, for any inner algorithm $w(\cdot)$, we have that

$$\min_{W} \; \mathcal{E}_{\mathrm{GLS}}(W, \theta, \mu) \le \mathcal{E}(\theta, \mu), \tag{13}$$

namely that, given the same representation, finding a global classifier is more favorable than solving each task in isolation. Therefore, we conclude that solving meta-GLS provides a good representation for the standard meta-learning problem.
However, as mentioned above, it would appear that meta-GLS and standard meta-learning have similar rates with respect to the number T of observed tasks and that therefore the former does not offer any advantage in practice.In the following we will see that pre-training can be interpreted as indirectly tackling meta-GLS in a more sample-efficient manner, hence justifying its performance in practice.

Expected Error for Pre-training
We now show that pre-training offers a strategy to obtain a pair of GLS parameters more efficiently than meta-GLS, under mild assumptions.
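Assumption 1. The meta-distribution $\mu$ samples tasks $\rho$. Sampling from each $\rho$ is performed as follows:
1. For each $j \in \{1, \dots, k\}$ and class $y_\rho^{(j)}$, the support inputs $x_i^{(j)}$ are sampled from a conditional distribution $\pi(x \mid y = y_\rho^{(j)})$ shared across all tasks. All generated pairs are collected in the support set $S = \{(x_i^{(j)}, y_\rho^{(j)})\}_{i,j}$.
2. The query set $Q$ is sampled i.i.d. from $\pi_\rho(x, y) = \pi(x \mid y) \, \mathrm{Unif}_\rho(y)$, with $\mathrm{Unif}_\rho$ the uniform distribution over the labels in $\rho_Y$.

In essence, the assumption characterizes the standard process of constructing meta-training tasks for FSL. In particular, let $\pi_\mu(x, y)$ be the marginal probability of observing $(x, y)$ in the meta-training tasks, i.e. first sampling a task $\rho$ from $\mu$, followed by sampling a class $y$ uniformly by $\mathrm{Unif}_\rho(\cdot)$ and finally $x$ by $\pi(\cdot \mid y)$. It then follows that sampling a dataset $D_{\mathrm{global}}$ from $\pi_\mu$ is equivalent to sampling a meta-training set $\mathcal{T}$ from $\mu$ and flattening it into $D(\mathcal{T})$ according to the pre-training procedure described in (7). Let $\mathcal{L}(W, g_\theta, \pi_\mu) = \mathbb{E}_{(x,y) \sim \pi_\mu} \big[\ell_{\mathrm{ce}}(W g_\theta(x), y)\big]$ be the global multi-class classification risk of $W g_\theta(\cdot)$. Since pre-training amounts to minimizing $\mathcal{L}(W g_\theta, D_{\mathrm{global}})$ on the dataset $D_{\mathrm{global}}$ sampled i.i.d. from $\pi_\mu$, we again apply the reasoning adopted for (10) to conclude that

$$\mathcal{L}(W_N^{\mathrm{pre}}, g_{\theta_N^{\mathrm{pre}}}, \pi_\mu) \le \min_{W, \theta} \mathcal{L}(W, g_\theta, \pi_\mu) + O(1/\sqrt{N}), \tag{14}$$

where $(W_N^{\mathrm{pre}}, \theta_N^{\mathrm{pre}})$ is the minimizer of the pre-training objective (8) and $N$ is the number of samples in $D_{\mathrm{global}}$. If $N \gg T$, which is typically the case, this implies that pre-training converges much faster than meta-GLS to its corresponding ideal risk.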

Pre-training and GLS
Given the discussion above, it appears that, from a statistical viewpoint, pre-training enjoys better error rates than meta-learning given the same data. However, the two methods solve in principle two different problems, and it remains to be shown that pre-training offers a similar advantage when applied in the meta-learning context. The following result relates the two problems to each other.

Theorem 1. Under Assumption 1, let $\pi_\mu(x, y)$ be the marginal distribution of observing $(x, y)$ in the meta-training set. Then, for any (global) classifier $W$,

$$\mathcal{E}_{\mathrm{GLS}}(W, \theta, \mu) \le \mathcal{L}(W, g_\theta, \pi_\mu). \tag{15}$$

Moreover, if the global classes are separable,

$$\min_{W, \theta} \; \mathcal{E}_{\mathrm{GLS}}(W, \theta, \mu) = \min_{W, \theta} \; \mathcal{L}(W, g_\theta, \pi_\mu). \tag{16}$$

The result shows that the GLS error is upper bounded by the global multi-class classification error. Hence, minimizing the global multi-class classification error also indirectly minimizes the meta-learning risk. In practice, this implies that pre-training implicitly learns a meta-representation suitable for FSL.
Additionally, by applying (15) to the pre-training solution and then combining (14) and (16), we have

$$\mathcal{E}_{\mathrm{GLS}}(W_N^{\mathrm{pre}}, \theta_N^{\mathrm{pre}}, \mu) \le \min_{W, \theta} \; \mathcal{E}_{\mathrm{GLS}}(W, \theta, \mu) + O(1/\sqrt{N}).$$

Comparing the above rate with that of meta-GLS, we conclude that, given exactly the same data ($\mathcal{T}$ for meta-GLS and $D(\mathcal{T})$ for pre-training), pre-training achieves a much smaller error than meta-GLS.
To our knowledge, this is a novel and surprising result.
Given the relation between GLS and standard meta-learning that we highlighted in Sec. 3.3, Thm. 1 provides a strong theoretical argument in favor of adopting pre-training in meta-learning settings where the number of points per task is small in comparison to the number of tasks, so that $N \gg T$, as is the case in FSL.
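As a rough numerical illustration of the gap between the two rates, the sketch below compares $1/\sqrt{T}$ with $1/\sqrt{N}$ while ignoring the constants hidden in the big-O notation. The task count and the number of samples per task are made-up values, and `rate_gap` is a hypothetical helper that assumes samples are not repeated across tasks.

```python
import math

def rate_gap(num_tasks, samples_per_task):
    """Compare the meta-GLS rate 1/sqrt(T) with the pre-training rate 1/sqrt(N).

    Assumes N = T * samples_per_task, i.e. no sample is shared between tasks.
    Returns both rates and their ratio.
    """
    n_total = num_tasks * samples_per_task
    meta_rate = 1.0 / math.sqrt(num_tasks)
    pre_rate = 1.0 / math.sqrt(n_total)
    return meta_rate, pre_rate, meta_rate / pre_rate

# Illustrative values: 1000 tasks with 100 samples each.
meta_rate, pre_rate, ratio = rate_gap(1000, 100)
```

Under these assumptions the ratio equals the square root of the number of samples per task, so the advantage depends on how many samples each task contributes to the flattened dataset.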

Connection to Conditional Meta-Learning
More generally, we observe that GLS is also an instance of conditional meta-learning: the global labels of the task provide additional contextual information about the task to facilitate model learning. Global labels directly reveal how tasks relate to one another and in particular whether any classes to be learned are shared across tasks. GLS thus simply maps the global labels of tasks to task classifiers via $W[S_Y]$. In contrast, unconditional approaches (e.g. R2D2 [4], ProtoNet [42]) learn classifiers by minimizing some loss over support sets, losing out on the contextual information provided by global labels. As discussed in Sec. 2.3, the benefits offered by global labels have been extensively validated empirically.
In addition to our result, [8,9] also proved that conditional meta-learning is advantageous over the unconditional formulation by incurring a smaller excess risk, especially when the meta-distribution of tasks is organized into distant clusters. We refer readers to the original papers for a detailed discussion. In practice, global labels provide a clustering of task samples for free and improve regularization by enforcing each cluster (denoted by global label $y_\rho^{(j)}$) to share classifier parameters $W[y_\rho^{(j)}]$ across all tasks. This provides a further explanation of why pre-training yields a robust meta-representation with strong generalization performance.

Leveraging Pre-training in Practice
The discussion above suggests that pre-training should be sufficient when meta-training and meta-test data are sampled from the same distribution. However, practical FSL scenarios assume that the meta-testing set shares no class labels with the meta-training set, since the goal of meta-learning is precisely to generalize to novel classes unseen during training. For these practical scenarios, extensive evidence indicates that the pre-trained representation is also robust for learning novel classes by simply replacing the GLS selector with a regular classifier $w(g_\theta(S))$ [12,61]. Formally, the connection between meta-training and meta-testing classes may also be captured by the assumption that they share a common representation, the theoretical setting analyzed in [10]. Moreover, while pre-training might offer a powerful initial representation $\theta$, it may be advisable to further improve $\theta$ by directly optimizing (4) using the desired classifier to tackle novel classes. The general strategy of pre-training followed by what we refer to as meta fine-tuning is extensively used in state-of-the-art methods [e.g. 53,73,74]. Empirical results suggest that careful meta fine-tuning outperforms standalone pre-training. We investigate this aspect empirically in Sec. 4.3 and Sec. 5.

Methods
In this section, we propose three practical algorithms motivated by our theoretical analysis. In Sec. 4.1, we introduce an augmentation procedure for pre-training to further improve representation learning in image-based tasks. In Sec. 4.2, we tackle the scenario where global labels are absent by automatically inferring a notion of global labels. Lastly, we introduce a meta fine-tuning procedure in Sec. 4.3 to investigate how much meta-learning could improve the pre-trained representation.

Augmented Pre-training for Image-based Tasks
In general, pre-training is a standard process with well-studied techniques for improving the final learned representation.Many of these techniques, including data augmentation for image-based tasks [6], auxiliary losses [39] and model distillation [62], are also effective for FSL (i.e. the learned representation is suitable for novel classes during meta-testing).In particular, we may interpret data augmentation techniques as increasing N in (14), thus improving the error incurred by pre-training and consequently the learned representation g θ .
Beyond standard augmentations (e.g. random cropping and color jittering) investigated in [6], we further propose an augmented procedure for pre-training via image rotation. For every class $y_i$ in the original dataset, we create three additional classes by rotating all images of class $y_i$ by $r \in \{90^\circ, 180^\circ, 270^\circ\}$, respectively. All rotations are multiples of $90^\circ$ such that they can be implemented efficiently by basic operations (e.g. flip and transpose) and prevent pre-training from learning any trivial features from visual artifacts produced by arbitrary rotations [16]. Pre-training is then performed normally on the augmented dataset. Additionally, standard augmentations are also applied on the augmented dataset.
The augmented dataset quadruples the number of samples and classes compared to the original dataset.According to (14), pre-training on the augmented dataset may yield a more robust representation.Further, we also hypothesize that the quality of the representation also depends on the number of classes available in the pre-training dataset, since classifying more classes requires learning increasingly discriminating representations.Our experiments show that 1) augmented pre-training consistently outperforms the standard one, and 2) quality of the learned representation depends on both the dataset size and the number of classes available for training.
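The rotation-based class augmentation can be sketched with NumPy as follows. This is a minimal illustration assuming square images stored as an `(n, H, W, C)` array and integer class labels; the function name `rotation_augment` and the label layout `label + r * num_classes` are hypothetical choices, not necessarily those used by the authors.

```python
import numpy as np

def rotation_augment(images, labels, num_classes):
    """Create rotated copies of every image and treat each rotation as a new class.

    images: (n, H, W, C) array with H == W
    labels: (n,) integer array with values in [0, num_classes)
    Returns 4n images and labels over 4 * num_classes classes.
    """
    aug_images, aug_labels = [], []
    for r in range(4):  # r quarter turns: 0, 90, 180, 270 degrees
        # Rotate in the spatial plane only, leaving batch and channel axes intact.
        aug_images.append(np.rot90(images, k=r, axes=(1, 2)))
        aug_labels.append(labels + r * num_classes)
    return np.concatenate(aug_images), np.concatenate(aug_labels)
```

Standard augmentations such as random cropping would then be applied on top of the returned dataset during pre-training.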

Meta Label Learning
The ability to exploit pre-training crucially depends on access to global labels. However, it is problematic to assume easy access to global labels. As discussed in Sec. 1, global labels may be unavailable or inaccessible in practical applications, when meta-training tasks are collected and annotated independently. As we will illustrate with the experiment in Sec. 5.4, different tasks may present conflicting labels over the same set of data based on different task requirements. This leads to ill-defined global labels and makes pre-training not directly applicable. Therefore, we consider the more general setting where only local labels from each task are known. This setting was also adopted by most earlier meta-learning methods [e.g. 4,14,32,59,65]. In the local label setting, we propose Meta Label Learning (MeLa) in order to automatically infer a notion of latent global labels across tasks. Naturally, the inferred labels enable pre-training and bridge the gap between the experimental settings with and without global labels.

Algorithm 1 MeLa
Input: meta-training set $\mathcal{T} = \{(S_t, Q_t)\}_{t=1}^{T}$
1. Meta-learn an initial representation $g_\theta^{\mathrm{sim}}$ by optimizing (4).
2. Infer global clusters $G$ with LearnLabeler (Alg. 2), using $g_\theta^{\mathrm{sim}}$ as the feature map.
3. Assign global labels using $G$ and pre-train to obtain $g_\theta^{\mathrm{pre}}$.
4. Meta fine-tune $g_\theta^{\mathrm{pre}}$ to obtain $g_\theta^{*}$.
Return $g_\theta^{*}$
Alg. 1 outlines the general strategy for learning a few-shot model using MeLa. We first meta-learn an initial representation $g_\theta^{\mathrm{sim}}$. Secondly, we cluster all task samples using $g_\theta^{\mathrm{sim}}$ as a feature map while enforcing local task constraints. The learned clusters are returned as inferred global labels. Using the inferred labels, we can apply pre-training to obtain $g_\theta^{\mathrm{pre}}$, which may be further fine-tuned using meta-learning objectives to derive the final few-shot model $g_\theta^{*}$. We present in Sec. 4.3 a simple yet effective meta fine-tuning procedure.
For learning $g_\theta^{\mathrm{sim}}$, we directly optimize (4) using ridge regression (5) as the base learner. We use ridge regression for its computational efficiency and good performance. Using $g_\theta^{\mathrm{sim}}$ as a base for a similarity measure, the labeling algorithm takes as input the meta-training set and outputs a set of clusters as global labels. The algorithm consists of a clustering routine for sample assignment and centroid updates, and a pruning routine for removing small clusters.
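Clustering. The clustering routine leverages local labels for assigning task samples to appropriate global clusters and enforcing task constraints. We observe that for any task, the local labels describe two constraints: 1) samples sharing a local label must be assigned to the same global cluster, while 2) samples with different local labels must not share the same global cluster. To meet constraint 1, we assign all samples $\{x_i^{(j)}\}_{i=1}^{n}$ of a local class $y_\rho^{(j)}$ to a single global cluster by

$$v^* = \arg\min_{v \in \{1, \dots, V\}} \left\| \frac{1}{n} \sum_{i=1}^{n} g_\theta^{\mathrm{sim}}(x_i^{(j)}) - g_v \right\|^2, \tag{17}$$

with $n$ the number of samples of the local class in the task and $V$ being the current number of centroids.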
We apply (17) to all classes $y_\rho^{(1)}, \dots, y_\rho^{(k)}$ within a task. If multiple local classes map to the same global label, we simply discard the task to meet constraint 2. Otherwise, we proceed to update the centroid $g_{v^*}$ and sample count $N_{v^*}$ for the matched clusters using

$$g_{v^*} \leftarrow \frac{N_{v^*} \, g_{v^*} + \bar{g}^{(j)}}{N_{v^*} + 1}, \qquad N_{v^*} \leftarrow N_{v^*} + 1, \tag{18}$$

where $\bar{g}^{(j)} = \frac{1}{n} \sum_{i=1}^{n} g_\theta^{\mathrm{sim}}(x_i^{(j)})$ is the mean embedding of the local class.

Pruning. We also introduce a strategy for pruning small clusters. We model the sample count of each cluster as a binomial distribution $N_v \sim B(T, p)$. We set $p = 1/V$, assuming that each cluster is equally likely to be matched by a local class of samples. Any cluster with sample count below the following threshold is discarded,

$$\bar{N}_v - q \sqrt{\mathrm{Var}(N_v)}, \tag{19}$$

where $\bar{N}_v$ is the expectation of $N_v$, $\mathrm{Var}(N_v)$ the variance, and $q$ a hyper-parameter controlling how aggressive the pruning is.

Algorithm 2 LearnLabeler
Input: embedding model $g_\theta^{\mathrm{sim}}$, meta-training set $\mathcal{T} = \{S_t, Q_t\}_{t=1}^{T}$, number of classes in a task $k$
Initialization: sample tasks from $\mathcal{T}$ to initialize clusters $G = \{g_v\}_{v=1}^{V}$
While $|G|$ has not converged:
    For each task in $\mathcal{T}$: match its $k$ local classes to clusters using (17); if the matched clusters are unique, update them using (18), otherwise discard the task
    Prune the clusters whose sample count falls below the threshold (19)
Return $G$

Alg. 2 outlines the full labeling algorithm. We first initialize a large number of clusters by setting their centroids to the mean class embeddings of random classes in $\mathcal{T}$. For $V$ initial clusters, $V/k$ tasks are needed since each task contains $k$ classes and could initialize as many clusters. The algorithm then alternates between clustering and pruning to refine the clusters and estimate the number of clusters jointly. The algorithm terminates and returns the current clusters $G$ when the number of clusters does not change from the previous iteration. Using clusters $G$, local classes from the meta-training set can be assigned global labels with nearest neighbor matching using (17). For tasks that fail to map to $k$ unique global labels, we simply exclude them from the pre-training process.
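The alternation between clustering and pruning can be sketched as below. This is a simplified illustration, not the authors' implementation. It assumes that each task is already summarized by the mean embedding of each of its local classes, that a matched cluster is updated by a running mean with its count increased by one, and that counts start at one for every centroid. All function names are hypothetical.

```python
import numpy as np

def prune_threshold(num_tasks, num_clusters, q):
    """Threshold mean - q * std for a count modeled as Binomial(T, p), p = 1/V."""
    p = 1.0 / num_clusters
    mean = num_tasks * p
    std = np.sqrt(num_tasks * p * (1.0 - p))
    return mean - q * std

def match_task(class_means, centroids):
    """Nearest-centroid match per local class; None if two classes collide."""
    diffs = class_means[:, None, :] - centroids[None, :, :]
    matched = (diffs ** 2).sum(axis=-1).argmin(axis=1)
    if len(set(matched.tolist())) != len(matched):
        return None  # different local classes must not share a global cluster
    return matched

def learn_labeler(task_class_means, init_centroids, q=3.0, max_iters=50):
    """task_class_means: list of (k, m) arrays, one row per local class mean."""
    centroids = np.array(init_centroids, dtype=float)
    num_tasks = len(task_class_means)
    for _ in range(max_iters):
        counts = np.ones(len(centroids))  # assumption: each centroid counts once
        updated = centroids.copy()
        for class_means in task_class_means:
            matched = match_task(class_means, updated)
            if matched is None:
                continue  # discard tasks that violate the local constraints
            for mean, v in zip(class_means, matched):
                updated[v] = (counts[v] * updated[v] + mean) / (counts[v] + 1)
                counts[v] += 1
        keep = counts >= prune_threshold(num_tasks, len(centroids), q)
        # Stop once the number of clusters is stable. Also stop if pruning would
        # remove every cluster, a safeguard added for this sketch.
        if keep.all() or not keep.any():
            return updated
        centroids = updated[keep]
    return centroids
```

Global labels can then be read off by calling `match_task` with the returned centroids. Tasks for which it returns `None` are excluded from pre-training, mirroring the exclusion rule described above.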
The key difference between Alg. 2 and the classical K-means algorithm [36] is that the proposed clustering algorithm exploits local information to guide the clustering process, while K-means is fully unsupervised. We will show in the experiments that enforcing local constraints is necessary for learning a robust meta-representation.
Alg. 2 also indirectly highlights how global labels, if available, offer valuable information about the meta-training set. In addition to revealing precisely how input samples relate to one another across tasks, global labels provide an overview of the meta-training set, including the desired number of clusters and their sizes. In contrast, Alg. 2 needs to estimate both properties when only local labels are given.

Meta Fine-Tuning
As discussed in Sec. 3, while pre-training already yields a robust meta-representation for FSL, GLS, the base learner associated with pre-training, is inapplicable for meta-testing when novel classes are presented. It is thus desirable to adapt the pre-trained representation by directly optimizing (4), such that the new meta-representation better matches the base learner intended for novel classes. We call this additional training meta fine-tuning, which is adopted by several state-of-the-art FSL models [33,67,73,74].
For meta fine-tuning, existing works suggest that model performance depends crucially on preserving the pre-trained representation. In particular, [33,55,67,73] all keep the pre-trained representation fixed and only learn a relatively simple transformation on top for the new base learners. Additionally, [17] showed that meta fine-tuning the entire representation model using MetaOptNet [31] or R2D2 [4] leads to worse performance compared to standard meta-learning, negating the advantages of pre-training completely.
Given the observations above, we present a simple residual architecture that preserves the pre-trained embeddings and allows adaptation for the new base learner. Formally, we consider a residual parameterization of the fine-tuned meta-learned embedding $g^*_\theta$, in which $g^{\mathrm{pre}}_\theta$ is the pre-trained representation and $h$ is a learnable function (e.g. a small fully connected network). We again use (5) as the base learner and optimize (4) directly. Our experiments show that the proposed fine-tuning process achieves results competitive with more sophisticated base learners, indicating that the pre-trained representation is the predominant contributor to good test performance.
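For concreteness, one common way to write such a residual adapter is given below. This form is an assumption consistent with the description (a learnable $h$ whose input and output dimensions match the feature dimension), not a transcription of the original equation.

```latex
% Assumed residual form: the adapter h acts on the pre-trained features
% and its output is added back to them.
g^{*}_{\theta}(x) \;=\; g^{\mathrm{pre}}_{\theta}(x) \;+\; h\!\left(g^{\mathrm{pre}}_{\theta}(x)\right)
```

Under this form, $h \equiv 0$ recovers the pre-trained embedding exactly, which is one way to see how the architecture preserves the pre-trained representation while still allowing adaptation.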

Experiments
We evaluate MeLa on various benchmark datasets and compare it with existing methods. The experiments are designed to address the following questions:
• How does MeLa compare to existing methods in generalization performance? Additionally, we introduce the more challenging GFSL setting in Sec. 5.2.
• How do different model components (e.g. pre-training, meta fine-tuning) contribute to generalization performance?
• Does MeLa learn meaningful clusters? Can MeLa handle conflicting task labels?
• Given the importance of pre-training, how can we improve the quality of the pre-trained representation?
• How robust is MeLa to hyper-parameter choices?
Variants of mini/tiered-ImageNet. We introduce several variants of mini/tiered-ImageNet to better understand MeLa and, more broadly, the impact of dataset configuration on pre-training. Specifically, we create mini-60, which consists of 640 classes and 60 samples per class. The base dataset of mini-60 contains the same number of samples as the base dataset of miniImageNet, though with more classes and fewer samples per class. Mini-60 is deliberately constructed so that the classes of the meta-train, validation and test sets of miniImageNet are contained in the classes of the meta-train, validation and test sets, respectively, of mini-60, enabling a fair comparison of the test performance of models trained on each dataset in turn. We designed mini-60 to investigate the behavior of MeLa when encountering a dataset with a high number of base classes and a low number of samples per base class. We also use mini-60 to explore how the data diversity present in the training data affects the learned representation. Analogous to mini-60, we also introduce tiered-780 as a variant of tieredImageNet, where we take the total number of samples in tieredImageNet and calculate the number of samples over the full 1000 ImageNet classes while avoiding meta-test set overlap between the two datasets.

Meta-Dataset. Meta-Dataset [63] is a meta-learning classification benchmark combining 10 widely used datasets: ILSVRC-2012 (ImageNet) [54], Omniglot [30], Aircraft [38], CUB200 [69], Describable Textures (DTD) [7], QuickDraw [28], Fungi [56], VGG Flower (Flower) [43], Traffic Signs [24] and MSCOCO [34].
We use Meta-Dataset to construct several challenging experiment scenarios, including learning a unified model for multiple domains and learning from tasks with conflicting labels.

Experiment Settings
The standard FSL setting [3,14,59,70,73] assumes that a meta-distribution of tasks is available for training. This translates to meta-learners having access to an exponential number of tasks synthetically generated from the underlying dataset, a scenario unrealistic for practical applications. Recent works additionally assume access to global labels in order to leverage pre-training, in contrast with earlier methods that assume access to only local labels. We will highlight such differences when comparing different methods.

GFSL Setting. This is a more challenging and realistic FSL setting. Specifically, we only allow access to local labels, since global labels may be inaccessible or ill-defined. In addition, we employ a no-replacement sampling scheme when synthetically generating tasks from the underlying dataset¹. First, this sampling protocol limits the meta-training set to a fixed size, which is a standard assumption for most machine learning problems. The fixed size also enables us to evaluate the sample efficiency of different methods. Secondly, no-replacement sampling prevents MeLa and other meta-learners from trivially learning task relations, a key objective of meta-learning, by matching identical samples across tasks. For instance, an identical sample appearing in multiple tasks would allow MeLa to trivially cluster local classes. Lastly, the sampling process reflects any class imbalance in the underlying dataset, which might present a more challenging problem.
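The no-replacement protocol can be sketched as follows. This is an illustrative sketch only: the function name `sample_tasks_without_replacement`, the uniform choice among eligible classes, and the stopping rule are assumptions, and the exact procedure used in the experiments may differ.

```python
import random

def sample_tasks_without_replacement(samples_by_class, k, n_per_class, seed=0):
    """Generate k-way tasks so that no sample appears in more than one task.

    samples_by_class: dict mapping a (hidden) global class to its sample ids.
    Each task only exposes local labels 0..k-1.
    """
    rng = random.Random(seed)
    pool = {c: list(xs) for c, xs in samples_by_class.items()}
    for xs in pool.values():
        rng.shuffle(xs)
    tasks = []
    while True:
        # A class is eligible while it still has enough unused samples.
        eligible = [c for c, xs in pool.items() if len(xs) >= n_per_class]
        if len(eligible) < k:
            break
        classes = rng.sample(eligible, k)
        task = {}
        for local_label, c in enumerate(classes):
            # pop() removes the samples from the pool, so they are never reused.
            task[local_label] = [pool[c].pop() for _ in range(n_per_class)]
        tasks.append(task)
    return tasks

data = {c: [f"{c}_{i}" for i in range(6)] for c in "abcd"}
tasks = sample_tasks_without_replacement(data, k=2, n_per_class=3)
used = [x for t in tasks for xs in t.values() for x in xs]
```

Because every sample is consumed at most once, the number of tasks is bounded by the dataset size, which is what makes the meta-training set finite in this setting.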

Performance Comparison in Standard Setting
We compare MeLa to a diverse group of existing methods on mini- and tieredImageNet in Tab. 1. We separate the methods into those requiring global labels and those that do not. We note that the two groups of methods are not directly comparable, since global labels provide a significant advantage to meta-learners, as discussed previously. The method groupings are intended to demonstrate the effect of pre-training on generalization performance.

Tab. 1 clearly shows that "global-labels" methods leveraging pre-training generally outperform "local-labels" methods, with the exception of MeLa. We highlight that the re-implementation of ProtoNet in [70] benefits greatly from pre-training, outperforming the original by over 10% across the two datasets. Similarly, while RFS and R2D2 both learn a fixed representation and only adapt the classifier based on each task, RFS's pre-trained representation clearly outperforms R2D2's meta-learned representation. We further note that state-of-the-art methods such as DeepEMD and FEAT are heavily reliant on pre-training and perform drastically worse in the GFSL setting, as we will discuss in Sec. 5.4.

[Table 1, partially recovered rows]
(method name not recovered)   51.9 ± 0.2   68.7 ± 0.2   65.5 ± 0.6   80.2 ± 0.4
MetaOptNet [32]               62.6 ± 0.6   78.6 ± 0.5   66.0 ± 0.7   81.5 ± 0.6
Shot-free [49]                59.0 ± n/a   77.6 ± n/a   63.5 ± n/a   82.6 ± n/a
MeLa (pre-train only)         64.5 ± 0.4   81.5 ± 0.…

In the local-labels category, MeLa outperforms existing methods thanks to its ability to still exploit pre-training using the inferred labels. MeLa achieves about 4% improvement over the next best method in all settings. Across both categories, MeLa obtains performance competitive with state-of-the-art methods such as FRN, FEAT and DeepEMD despite having no access to global labels. This indicates that MeLa is able to infer meaningful clusters to substitute for global labels and obtains performance similar to methods having access to global labels. We will provide further quantitative results on the clustering algorithm in Sec. 5.6.

Performance Comparison in Generalized Setting
We evaluate a representative set of few-shot learners under GFSL. For this setting, we also introduce two new experimental scenarios using Meta-Dataset to simulate task heterogeneity.
In the first scenario, we construct the meta-training set from Aircraft, CUB and VGG Flower, which we simply denote by "Mixed". Tasks are sampled independently from one of the three datasets. For meta-testing, we sample 1500 tasks from each dataset and report the average accuracy. The chosen datasets are intended for fine-grained classification of aircraft models, bird species and flower species respectively. Thus the meta-training tasks share the broad objective of fine-grained classification, but are sampled from three distinct domains. A key challenge of this scenario is to learn a unified model across multiple domains, without any explicit knowledge about them or the global labels within each domain. The results show that MeLa outperforms all baselines under the GFSL setting. Notably, MeLa improves over the baselines by a large margin of 10%, including the state-of-the-art models FEAT, FRN and DeepEMD, which perform on par with MeLa in Tab. 1. FEAT and DeepEMD in particular perform noticeably worse, indicating the methods' reliance on a pre-trained representation and the difficulty of meta-learning robust representations from scratch with complex base learners. FRN is designed to also work without pre-training, and outperforms FEAT and DeepEMD as expected.
In the second scenario, we consider meta-training tasks with heterogeneous objectives, leading to conflicting task labels and consequently ill-defined global labels. For the Aircraft dataset, each sample from the base dataset has three labels associated with it, namely variant, model and manufacturer², which form a hierarchy. We sample tasks based on each of the three labels and create a meta-training set containing three different task objectives: classifying fine-grained differences between model variants, classifying different airplanes, and classifying different airplane manufacturers. To differentiate it from the original dataset, we refer to this meta-training set as H-Aircraft. The training data is particularly challenging given the competing goals across different tasks: a learner is required to recognize fine-grained differences between airplane variants, while being able to identify general similarities within the same manufacturer. The training data also exhibits class imbalance. For instance, the dataset is dominated by samples from Boeing and Airbus, and the meta-training set reflects that in the GFSL setting.
Tab. 3 shows that MeLa outperforms all baselines for H-Aircraft. To approximate the oracle performance when ground truth labels are given, we optimize a supervised semantic softmax loss [52] over the hierarchical labels. Specifically, we train the (approximate) oracle to minimize a multi-task objective combining individual cross-entropy losses over the three labels. MeLa performs competitively against the oracle, indicating the robustness of the proposed labeling algorithm in handling ill-defined labels and class imbalance.
The experimental results suggest that MeLa performs robustly in both the standard and GFSL settings. In contrast, baseline methods perform noticeably worse in the latter, due to the absence of pre-training and limited training data.
Connection to theoretical results. We comment on the empirical results so far in relation to our theoretical analysis. The empirical results strongly indicate that pre-training produces robust meta-representations for FSL by exploiting contextual information from global labels. This is consistent with our observation that pre-training would achieve a smaller error than its meta-learning counterpart. On the other hand, the results also validate our hypothesis that the pre-trained representation can be further improved, since the pre-trained representation is not explicitly optimized for handling novel classes. In particular, FEAT, FRN, DeepEMD and MeLa all outperform the pre-trained representation from [62] by further adapting it.

Ablations on Pre-training
Given the significance of pre-training for final performance, we investigate how rotation data augmentation and data configuration impact the performance of the pre-trained representation. For dataset configuration, we focus on the effects of dataset size and the number of classes present in the dataset.
Rotation-Augmented Pre-training. In Sec. 4.1, we proposed to increase both the size and the number of classes in a dataset via input rotation. By rotating the input images by multiples of 90°, we quadruple both the size and the number of classes in a dataset. In Tab. 4, we compare the performance of standard pre-training against the rotation-augmented one, for multiple datasets. We use the inferred labels from MeLa for pre-training.
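The quadrupling of samples and classes can be made concrete with a small sketch. The label mapping used here, `label + r * num_classes` for a rotation by `r` quarter turns, is a hypothetical choice for illustration; the text only states that each rotation of a class is treated as a new class. Images are plain nested lists to keep the example self-contained.

```python
def rot90_cw(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def rotation_augment(dataset, num_classes):
    """Return every image under 0, 90, 180 and 270 degree rotations.

    Each (class, rotation) pair receives its own label, so the result has
    4x the samples and 4x the classes of the input (assumed label layout).
    """
    out = []
    for img, label in dataset:
        rotated = img
        for r in range(4):
            out.append((rotated, label + r * num_classes))
            rotated = rot90_cw(rotated)
    return out

dataset = [([[1, 2], [3, 4]], 0), ([[5, 6], [7, 8]], 1)]
aug = rotation_augment(dataset, num_classes=2)
```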
The results suggest that rotation-augmented pre-training consistently improves the quality of the learned representation. It achieves over 3% improvement on both miniImageNet and H-Aircraft, while obtaining about 0.5% on tieredImageNet. The results indicate that rotation augmentation works best on smaller datasets with fewer classes. As the dataset increases in size and diversity, the additional augmentation has less impact on the learned representation.

Effects of Class Count. We further evaluate the effects of increasing the number of classes in a dataset while keeping the dataset size fixed. For this, we compare the performance of miniImageNet and tieredImageNet with their respective variants mini-60 and tiered-780.
Tab. 4 suggests that, given a fixed-size dataset, having more classes improves the quality of the learned representation compared to having more samples per class. We hypothesize that classifying more classes leads to more discriminative and robust features, while the standard $\ell_2$ regularization applied during pre-training prevents overfitting despite having fewer samples per class. Overall, the experiments suggest that pre-training is a highly scalable process where increasing either data diversity or dataset size will lead to a more robust representation for FSL. In particular, the number of classes in the dataset appears to play a more significant role than the dataset size.

Ablations on The Clustering Algorithm
The crucial component of MeLa is Alg. 2, which infers a notion of global labels and allows pre-training to be exploited in the GFSL setting. We perform several ablation studies to better understand the proposed clustering algorithm.
The Effects of No-replacement Sampling. We study the effects of no-replacement sampling, since it affects both the quality of the similarity measure through $g^{\mathrm{sim}}_\theta$ and the number of tasks available for inferring global clusters. The results are shown in Tab. 5.
In Tab. 5, clustering accuracy is computed by assigning the most frequent ground truth label in each cluster as the desired target.Percentage of tasks clustered refers to the tasks that map to k unique clusters by Alg. 2. The clustered tasks satisfy both constraints imposed by local labels and are used for pre-training.
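The clustering accuracy described above corresponds to what is often called cluster purity. A minimal sketch, with the hypothetical helper name `clustering_accuracy`, is:

```python
from collections import Counter, defaultdict

def clustering_accuracy(cluster_ids, true_labels):
    """Fraction of samples whose ground truth label equals the most
    frequent ground truth label of the cluster they were assigned to."""
    by_cluster = defaultdict(list)
    for c, y in zip(cluster_ids, true_labels):
        by_cluster[c].append(y)
    # For each cluster, count the samples carrying its majority label.
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return correct / len(true_labels)

# Cluster 0 holds labels a, a, b (majority a); cluster 1 holds b, b.
acc = clustering_accuracy([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"])
```

Note that this measure does not penalize splitting one ground truth class over several clusters, so it is best read together with the inferred cluster count.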
For both sampling processes, MeLa achieves comparable performance across all three datasets. This indicates the robustness of Alg. 2 in inferring suitable labels for pre-training, even when task samples do not repeat across tasks. It also shows that Alg. 2 is not trivially matching identical samples across tasks, but relies on $g^{\mathrm{sim}}_\theta$ for estimating sample similarity. We note that mini-60 is particularly challenging under no-replacement sampling, with only 384 tasks in the meta-training set over 640 ground truth classes.
Effects of Pruning Threshold. In Alg. 2, the pruning threshold is controlled by the hyper-parameter $q$. We investigate how different values of $q$ affect the number of clusters estimated by the labeling algorithm and the corresponding test accuracy.
The results suggest that MeLa is robust to a wide range of $q$ and obtains representations similar to those produced by the ground truth labels. While it is possible to replace $q$ by directly guessing the number of clusters in Alg. 2, we note that tuning $q$ is more convenient, since appropriate values of $q$ appear to empirically concentrate within a much narrower range than the possible numbers of global clusters present in a dataset.

Inferred Labels vs. Oracle Labels. From Tabs. 5 and 6, we observe that it may be unnecessary to fully recover the oracle labels (when they exist). For mini-60, MeLa inferred 463 clusters over 640 classes, which implies mixing of the oracle classes. However, the inferred labels still perform competitively against the oracle labels, suggesting the robustness of the proposed method. The results also suggest that we may improve the recovery of the ground truth labels by sampling more tasks from the meta-distribution.

The Importance of Local Constraints. The clustering process enforces consistent assignment of task samples given their local labels. To understand the importance of enforcing these constraints, we consider an ablation study where Alg. 2 is replaced with the standard K-means algorithm. The K-means algorithm is fully unsupervised and ignores any local constraints. We initialize the K-means algorithm with 64 clusters for miniImageNet and 351 clusters for tieredImageNet, the true numbers of classes in the respective datasets. Tab. 7 indicates that enforcing local constraints is critical for generalization performance during meta-testing. In particular, test accuracy drops by over 5% for tieredImageNet when the K-means algorithm ignores local task constraints. Among the two constraints, we note that (17) appears to be the more important one, since nearly all tasks automatically match $k$ unique clusters in our experiments (see tasks clustered in Tab. 5).

Domain Inference for multi-domain tasks. In addition to inferring global labels, we may further augment Alg. 2 to infer the different domains present in a meta-training set, if we assume that all samples within a task belong to a single domain. Given this assumption, two global clusters are connected if they both contain samples from the same task. This is illustrated in Fig. 3a. Consequently, the inferred clusters form an undirected graph with multiple connected components, each representing a domain. We apply the above algorithm to the multi-domain scenario from Sec. 5.4, where the meta-training set consists of the Aircraft, CUB and VGG datasets.
Fig. 3b visualizes the inferred domains in the multi-domain scenario. For each inferred cluster, we project its centroid to a 2-dimensional point using UMAP [40]. Each connected component is assigned a different color. Despite some mis-clustering within each domain, we note that Alg. 2 clearly separates the three domains present in the meta-training set and recovers them perfectly. Domain inference is important for the multi-domain scenario, as it enables domain-specific pre-training. Recent works [e.g. 11,33,35] on Meta-Dataset have shown that combining domain-specific representations into a universal representation is empirically more advantageous than training on all domains together. Lastly, we remark that multi-domain meta-learning is also crucial for obtaining robust representations suitable for a wider range of novel tasks, including cross-domain transfer.
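The connected-component step of domain inference can be sketched with a small union-find structure. The input format (one list of global cluster indices per task) and the function name `infer_domains` are assumptions made for illustration.

```python
def infer_domains(task_clusters):
    """Group global clusters into domains.

    task_clusters: list of tasks, each a list of the global cluster indices
    that the task's local classes were matched to. Clusters co-occurring in
    a task are connected; each connected component is one inferred domain.
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for clusters in task_clusters:
        first = clusters[0]
        find(first)  # register clusters of single-class tasks as well
        for c in clusters[1:]:
            union(first, c)

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    return sorted(sorted(g) for g in groups.values())

# Clusters 0-2 co-occur in tasks, as do clusters 3-6; the two groups never mix.
domains = infer_domains([[0, 1], [1, 2], [3, 4], [5, 6], [6, 3]])
```

A single mis-clustered task that links two domains would merge their components, so in practice some filtering of weak connections may be needed; the text does not specify this.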

Conclusion
In this work we focused on the role played by pre-training in meta-learning applications, with particular attention to few-shot learning problems. Our analysis was motivated by the recent popularity of pre-training as a key stage in most state-of-the-art FSL pipelines. We first investigated the benefits of pre-training from a theoretical perspective. We showed that in some settings this strategy enjoys significantly better sample complexity than pure meta-learning approaches, hence offering a justification for its empirical performance and wide adoption in practice.
We then observed that pre-training requires access to global labels of the classes underlying the FSL problem. This might not always be possible, due to phenomena like heterogeneous labeling (i.e. multiple labelers having different labeling strategies) or contextual restrictions like privacy constraints. We proposed Meta Label Learning (MeLa) as a strategy to address this concern. We compared MeLa with state-of-the-art methods on a number of tasks, including well-established standard benchmarks as well as new datasets we designed to capture the above limitations on task labels. We observed that MeLa is always comparable to or better than previous approaches, and very robust to the lack of global labels or the presence of conflicting labels.
More broadly, our work provides a solid foundation for understanding existing FSL methods, in particular the vital contribution of pre-training towards generalization performance. We also demonstrated that pre-training scales well with dataset size and data diversity, which in turn leads to more robust few-shot models. Future research may focus on further theoretical understanding of pre-training and on better pre-training processes.
We observe that for any $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$,
$$\mathbb{E}_{(x,y)\sim\pi_\mu}\, g(x, y) = \mathbb{E}_{\rho\sim\mu}\, \mathbb{E}_{(x,y)\sim\pi_\rho}\, g(x, y). \qquad (21)$$
Given Assumption 1, $S$ and $Q$ are sampled independently by the task $\rho$. In particular, the marginal $\rho_Q$ of $\rho$ with respect to $S$ corresponds to $\pi_\rho$. Similarly, we denote by $\rho_S$ the distribution over support sets obtained by marginalizing out the query set. We report one remark following the assumption above.
Remark 1. For any task $\rho$ and any algorithm $D \mapsto f_D$ returning functions $f_S : \mathcal{X} \to \mathbb{R}^k$, we have [...]
We can now apply (21), where we take $g(x, y) = \ell_{ce}(W\psi_\theta(x), y)$. Then,
$$\mathbb{E}_{\rho\sim\mu}\, \mathbb{E}_{(x,y)\sim\pi_\rho}\, \ell_{ce}(W\psi_\theta(x), y) = \mathbb{E}_{(x,y)\sim\pi_\mu}\, \ell_{ce}(W\psi_\theta(x), y) = \mathcal{L}_{ce}(W, \theta, \pi_\mu),$$
which concludes the proof for (15).
• SGD: SGD with an initial learning rate of 0.05, weight decay factor of 0.0005 and momentum of 0.9
• AdamW: AdamW [37] with learning rate of 0.0001 and weight decay factor of $10^{-6}$.
For each optimization algorithm we use the torch multi-step learning rate scheduler which anneals the learning rate of the optimization algorithm by γ = 0.1 at selected epochs in lr_decay_epochs.
• MultiStepLR: Learning rate annealing scheduler, which multiplies the learning rate by γ at the beginning of epochs in the list lr_decay_epochs.
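The effect of the multi-step schedule on the learning rate can be written in closed form. The sketch below is a plain-Python illustration rather than the PyTorch scheduler itself, and the milestone values `[60, 80]` are hypothetical, since the actual `lr_decay_epochs` are not listed here.

```python
from bisect import bisect_right

def multistep_lr(base_lr, epoch, lr_decay_epochs, gamma=0.1):
    """Learning rate at `epoch`: base_lr is multiplied by gamma once for
    every milestone in lr_decay_epochs that has already been reached."""
    num_decays = bisect_right(sorted(lr_decay_epochs), epoch)
    return base_lr * gamma ** num_decays

lr_decay_epochs = [60, 80]  # hypothetical milestones
schedule = [multistep_lr(0.05, e, lr_decay_epochs) for e in (0, 59, 60, 85)]
```

With the SGD setting above, the learning rate stays at 0.05 until the first milestone and is then reduced tenfold at each milestone.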
Augmentation. We use two instances of augmentation (and one option of no augmentation):
• DataAug: Data augmentation where we use a pipeline of
1. Random cropping using a shape of 84 × 84 with padding of 8
2. Color jittering with the PyTorch arguments (brightness=0.4, contrast=0.4, saturation=0.4)
3. Random horizontal flipping of the image.

Residual Adapters for Meta Fine-Tuning. In Sec. 4.3, we introduced a residual adapter for meta fine-tuning. The learnable network $h$ is a three-layer MLP.
• ResNet{12/18}+ResFC: MLP with residual connection and layer normalization applied to the output. Both the input and output dimensions are the same as the feature representation from either the ResNet12 or ResNet18 backbone.

Learning the Similarity Measure (RepLearn).
For training the embedding $g^{\mathrm{sim}}_\theta$, when given a task $D = (S \cup Q)$ we one-hot encode the outputs and scale them using $f(y) = 2y - 1$. We obtain the classifier $w(g^{\mathrm{sim}}_\theta(S))$ using (5) on the embedded support set $g^{\mathrm{sim}}_\theta(S)$ (we add a column of ones to the embeddings for a bias term) with a regularization strength of $\lambda_{\mathrm{MetaLS}} = 0.001$. As the inner loss we use $\ell = \ell_{\mathrm{FS}}$, where $\ell_{\mathrm{FS}}$ is the few-shot loss using the mean-squared error inner loss (5). We train for a fixed number of epochs, where each epoch is a full sweep over the meta-train set in the GFSL setting, or some predefined number of tasks $T_{\mathrm{tasks}}$ in the standard setting. The number of tasks in each batch is set to 1. We use the meta-validation set for early stopping and model selection.
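The target encoding and the bias column described above can be illustrated as follows. The helper names `encode_targets` and `add_bias_column` are hypothetical, and plain lists stand in for tensors to keep the sketch self-contained.

```python
def encode_targets(labels, k):
    """One-hot encode local labels in {0, ..., k-1}, then scale each entry
    with f(y) = 2y - 1 so that targets lie in {-1, +1}."""
    return [[2 * int(j == y) - 1 for j in range(k)] for y in labels]

def add_bias_column(embeddings):
    """Append a constant 1 to every embedding so that a linear classifier
    fitted on the result includes a bias term."""
    return [list(e) + [1.0] for e in embeddings]

targets = encode_targets([0, 2], k=3)
features = add_bias_column([[0.5, 2.0]])
```

The symmetric $\pm 1$ targets are a natural fit for a squared-error inner loss, since they are centered around zero.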
Global Label Inference (LearnLabeler). Given a trained backbone $g^{\mathrm{sim}}_\theta$, we use the clustering algorithm of Sec. 4.2 with the hyper-parameters $q$ and $K_{\mathrm{init}}$, where $q$ is the pruning aggression parameter

Figure 1: (a) Colored squares represent samples. Tasks A and B can be "merged" meaningfully using global labels, but not local ones. (b) A global classifier can be used as local classifiers given the indices $Y$ of the intended classes to predict.
[...] the empirical average over the meta-training set $\mathcal{T}$. The task loss $\mathcal{L} : \mathcal{F} \times \mathcal{D} \to \mathbb{R}$ is the empirical risk of the learner $f$ over query sets, based on some inner loss $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, where $\mathcal{Y}$ is the space of labels, [...] $\{1, \dots, C\}$. Denote by $\rho_Y = \{y^{(1)}_\rho, \dots, y^{(k)}_\rho\}$ the corresponding subset of $\{1, \dots, C\}$. Given a matrix $W \in \mathbb{R}^{C \times m}$ and a vector $Y \in \{1, \dots, C\}^k$ of indices, we denote by $W[Y] = W[\rho_Y] \in \mathbb{R}^{k \times m}$ the submatrix of $W$ obtained by selecting the rows corresponding to the unique indices $\rho_Y$ in $Y$. Lastly, given a dataset $D = (x_i, y_i)_{i=1}^{n}$, we denote by $D_Y \in \{1, \dots, C\}^n$ the vector with entries corresponding to the labels $y_i$.

Figure 3: (a) The coloured clusters (red, green, blue and yellow) are connected since they contain samples from the same task. Domains can be inferred by computing the connected components of the inferred clusters. (b) UMAP visualization of the three inferred domains from the 5-shot Mixed dataset containing Aircraft, CUB and VGG. Circles are the means (using the pre-trained features) of the instances in each task, averaged per local class, while triangles are the learned centroids; all vectors are embedded using UMAP. The three domains are recovered perfectly.

4. Normalization: (channel-wise) using the ImageNet sample channel mean and standard deviation for ImageNet-type datasets, min-max scaling to [−1, 1] for the Mixed and H-Aircraft datasets
• RotateAug: Rotation-class augmentation as laid out in Sec. 4.1, together with DataAug
• None: No augmentation.

Backbone Architecture. We use ResNets [22] for the backbone throughout the experiments:
• ResNet12: ResNet with block sequence [1, 1, 1, 1], using adaptive average pooling, drop-blocks for the final 2 ResNet layers and a drop rate of 0.1, with output dimension being 640
• ResNet18: ResNet with block sequence [1, 1, 2, 2], using adaptive average pooling, drop-blocks for the final 2 ResNet layers and a drop rate of 0.1, with output dimension being 640

Table 1: Test accuracy of meta-learning models on miniImageNet and tieredImageNet.

Table 2: Test accuracy on Aircraft, CUB and VGG Flower (Mixed dataset). A single model is trained for each method over all tasks.

Table 4: Test accuracy comparison between pre-trained representations: standard vs. rotation-augmented

Table 5: The effects of no-replacement sampling on the clustering algorithm

Table 6: Test accuracy (pre-train only) and cluster count for various pruning thresholds, 5-shot setting