Meta-learning Amidst Heterogeneity and Ambiguity

Meta-learning aims to learn a model that can handle multiple tasks generated from an unknown but shared distribution. However, typical meta-learning algorithms assume the tasks to be similar enough that a single meta-learner is sufficient to aggregate the variations in all aspects. In addition, there has been little consideration of uncertainty when limited information is given as context. In this paper, we devise a novel meta-learning framework, called Meta-learning Amidst Heterogeneity and Ambiguity (MAHA), that outperforms previous works in prediction based on its ability to identify tasks. Through extensive experiments in regression and classification, we demonstrate the validity of our model, which turns out to be robust to both task heterogeneity and ambiguity.


Introduction
Although deep learning models have shown remarkable performance in various domains, they have consistently been criticized for their sensitivity to the amount of data [8,39,9,56,21]. Despite all the publicly available data, the data scarcity issue is still not negligible. In many cases, the actual data worth analyzing is quite limited for many different reasons, for example, concerns about data privacy [36] and noisy data with anomalies [50]. Along with transfer learning, few-shot learning, and multi-task learning, meta-learning has recently been highlighted as a way to overcome this deficiency through its adaptive behavior using only a few data points [60,23].
Meta-learning aims to handle multiple tasks by efficiently organizing the acquired knowledge. However, typical algorithms have been assessed under strong assumptions that lack representative power in real-world scenarios. Among the many issues raised [58,32], we mainly focus on the following two assumptions. First, the tasks are assumed to be similar enough that a single meta-learner is sufficient to aggregate the variations in all aspects. This implies that there has been little effort to compactly abstract notions within heterogeneity, one of the essential factors characterizing human intelligence, which is advantageous in decision-making for querying the information associated with the problem at hand. In addition, there has been little consideration of uncertainty when identifying a particular task from a few data points. It is therefore not easy to analyze or transfer the acquired knowledge of the model, which is critical in growing AI industries such as medical diagnosis [1,5,62] and autonomous vehicles [26,52,6], because a certain level of interpretability is required for greater safety.
In this respect, we hypothesize that disentanglement in the task representation is advantageous, as frequently appears in studies that analyze the inherent factors of variation within a dataset. The goal is to i) uncover the distinctive properties as a tool for interpretability and ii) explicitly separate the dataset into several clusters that would have been detrimental if trained altogether. However, as a trade-off for interpretability, the overconfident nature of deep learning may strictly assign tasks to certain clusters without considering ambiguity, which requires additional treatment to cope with anomalies. To this end, we propose a new meta-learning framework, Meta-learning Amidst Heterogeneity and Ambiguity (MAHA), that performs robustly against the following two hurdles. Task heterogeneity: there is no clear discrimination between tasks sampled from faraway modes of the task distribution [64,67,68]. Task ambiguity: too few data points are given to infer the task identity [13,49]. Specifically, we devise a pre-task built upon the neural processes [15,16,25] to obtain a well-clustered and interpretable representation. Then, agglomerative clustering is applied to the representation without any external knowledge such as the number of clusters, and a separate model is trained for each cluster. Please refer to Figure 7 for the overall training process of MAHA.
To summarize, the main contributions of this paper are fourfold:
• We propose a simple yet powerful architecture design for the neural processes to better leverage the latent variables and to be applicable in classification. (See Section 5.1)
• We resolve the information asymmetry in the neural processes and construct well-clustered and interpretable representations. (See Section 5.2)
• We validate MAHA through both regression and classification, with experimental results demonstrating its ability to cope with heterogeneity and ambiguity. (See Section 6)
• We devise an additional regularization term for the low-shot regime that distills obtainable knowledge from the relatively abundant training samples and variations. (See Appendix B)

Related work
Gradient-based meta-learning, represented by MAML [12], aims to learn prior parameters that can quickly adapt to a given task through several gradient steps. It consists of an inner-loop for task adaptation and an outer-loop for the meta-update over tasks. Many variants have emerged to balance generalization and customization in a task-adaptive manner. From the generalization perspective, [13,27] suggested probabilistic extensions through a hierarchical Bayesian model and Stein variational gradient descent (SVGD) [37]. In addition, [49] conducted the inner-loop on a low-dimensional latent embedding space, and [70] proposed a meta-regularization built on information theory. From the customization perspective, [35] divided the parameters into two categories, one of which is shared across tasks while the other can be modulated task-specifically. [74] introduced layer-wise adaptive units, and [64,67,68,69] considered auxiliary networks that modulate the initial parameters before the inner-loop.
The family of neural processes, also known as contextual meta-learning, is devised to imitate the flexibility of the Gaussian Process [44] while resolving its scalability issue. Rather than explicitly modeling a kernel to conduct Bayesian inference as in [65], it learns an implicit kernel directly from data, which overcomes the design restrictions. Task-specific information is extracted from a subset of the data through an encoder and then aggregated for use in the decoder to predict the corresponding outputs of the remaining data. Starting from the conditional neural process (CNP) [15], which was built solely on a deterministic path, the neural process (NP) [16] adds a stochastic path. The attentive neural process (ANP) [25] further applies an attention mechanism to resolve the underfitting issue in NP by enlarging the locally adaptive behavior. More complex modules, such as a graph structure [38] and recurrent neural networks [30,53], were further considered to capture dependencies on latent variables and complex temporal dynamics.
However, many problems remain unsolved. First, the neural processes still rely on a complex feature extractor to enable task-specific modulation, which requires various regularization techniques with additional hyperparameters [47]. Furthermore, whereas the neural processes can obtain an explicit task representation, the existing approaches have investigated little regarding interpretability. Finally, the performance analysis has mainly focused on regression [31,25,53,57,19], and some models are not even directly applicable to classification [34].

Problem setting
Let C = {C_x, C_y} be the context set and T = {T_x, T_y} be the target set, where both C and T are sampled from the same task T ∼ p(T). A common goal in meta-learning is to devise an algorithm for the model f(·) that appropriately uses the model parameter θ to obtain the task-specific parameter φ according to the input-output pairs in C such that, given T_x, T_y can be accurately estimated with high confidence. For example, in MAML [12], a task-specific parameter can be computed by a gradient step φ = θ − α·∇_θ L(f(C_x; θ), C_y). On the other hand, in CNP [15], θ and φ no longer share the same parameter space. Here, the model parameter is divided into an encoder and a decoder part θ = {θ_enc, θ_dec}, and the task-specific parameter is computed as the encoder output φ = f_enc(C; θ_enc). Hereafter, we omit θ for brevity.
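The two adaptation schemes above can be contrasted concretely. Below is a minimal numpy sketch of the MAML-style inner-loop step on a toy linear model; the model form, learning rate, and data are hypothetical placeholders, not the paper's setup.

```python
import numpy as np

def inner_step(theta, C_x, C_y, alpha=0.01):
    """One MAML-style inner-loop step on a toy linear model
    f(x) = theta[0]*x + theta[1], using the closed-form gradient
    of the mean squared error. `alpha` is a hypothetical inner-loop
    learning rate."""
    pred = theta[0] * C_x + theta[1]
    err = pred - C_y
    grad = np.array([np.mean(2 * err * C_x), np.mean(2 * err)])
    return theta - alpha * grad  # task-specific parameter phi

# toy context set drawn from y = 2x + 1
C_x = np.linspace(-1, 1, 10)
C_y = 2 * C_x + 1
theta = np.zeros(2)
phi = inner_step(theta, C_x, C_y, alpha=0.5)
```

In CNP, by contrast, φ would instead be the output of an encoder applied to C, with no gradient step at adaptation time.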
For model training, the model parameters are iteratively updated using batches. Here, each batch is constructed from multiple tasks that are characterized by way and shot. If there are N classes, each containing K input-output pairs, we call it an N-way K-shot problem. In classification, the class labels are shuffled whenever a task instance is created, which encourages a meta-learning algorithm to learn how to classify images even when an unseen configuration of classes occurs.
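The N-way K-shot episode construction with label shuffling can be sketched as follows; the dataset layout and class names here are hypothetical placeholders.

```python
import numpy as np

def sample_episode(dataset, n_way, k_shot, rng):
    """Build one N-way K-shot task with shuffled class labels.

    `dataset` maps a class name to an array of examples (a hypothetical
    layout). Class identities are re-assigned labels 0..N-1 per episode,
    so the model must infer the class structure from the examples alone.
    """
    classes = rng.choice(list(dataset), size=n_way, replace=False)
    xs, ys = [], []
    for new_label, cls in enumerate(classes):  # labels reshuffled each episode
        idx = rng.choice(len(dataset[cls]), size=k_shot, replace=False)
        xs.append(dataset[cls][idx])
        ys.append(np.full(k_shot, new_label))
    return np.concatenate(xs), np.concatenate(ys)

rng = np.random.default_rng(0)
data = {c: rng.normal(size=(20, 4)) for c in "abcde"}
x, y = sample_episode(data, n_way=3, k_shot=5, rng=rng)
```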
Preliminary: (Attentive) Neural Process
In Figure 2, we summarize how the basic family of neural processes has evolved in terms of the graphical model. The encoder comprises a deterministic path and a stochastic path computing the task-specific parameter φ = {r, z} of the variational distributions, which we denote by q(r|{X, Y}) = N(r, 0) and q(z|{X, Y}) = N(µ_z, 0.1 + 0.9·sigmoid(ω_z)). Here, {X, Y} denotes a set of input-output pairs, and a reparameterization trick is applied at the end of the stochastic path for differentiable non-centered parameterization [28].
For both paths, NP is constructed by r = MeanPool_shot(rFF([X, Y])) and [µ_z, ω_z] = rFF(MeanPool_shot(rFF([X, Y]))), where MeanPool_(·) is a mean-pooling operation along the subscripted dimension, rFF(·) can be any row-wise feedforward layer, such as a Multi-Layer Perceptron (MLP), and [·] denotes concatenation. On the other hand, ANP exploits multihead attention, connecting T_x to r in the graphical model, and self-attention, both of which are proposed in [61]. As in NP, the value of z is the same for every shot of T_x; however, based on the attention score with each element of X, r is now computed in a shot-dependent manner.
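A minimal numpy sketch of the NP deterministic path may make this concrete: each (x, y) pair is embedded by a row-wise feedforward layer and then mean-pooled over the shot dimension. The weights here are random placeholders; the stochastic path would apply a further rFF to the same pooled summary to produce µ_z and ω_z.

```python
import numpy as np

def rff(x, W, b):
    """Row-wise feedforward layer (a one-layer MLP with ReLU), applied per shot."""
    return np.maximum(x @ W + b, 0.0)

def np_encoder(C_x, C_y, W, b):
    """NP-style deterministic path: embed each (x, y) pair with rFF,
    then mean-pool along the shot dimension, yielding a representation r
    that is invariant to the ordering of the context points."""
    pairs = np.concatenate([C_x, C_y], axis=-1)  # [shot, d_x + d_y]
    h = rff(pairs, W, b)                         # [shot, d_r]
    return h.mean(axis=0)                        # MeanPool over shots -> [d_r]

rng = np.random.default_rng(0)
C_x, C_y = rng.normal(size=(5, 1)), rng.normal(size=(5, 1))
W, b = rng.normal(size=(2, 8)), np.zeros(8)
r = np_encoder(C_x, C_y, W, b)
```

The mean-pooling is what makes the encoder permutation-invariant over shots, which the Set Transformer later generalizes.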
Then, conditioned on the encoder outputs r and z together with the target input T_x, the decoder computes the parameters of the predictive distribution on the target output T_y, e.g., [µ_Ty, ω_Ty] = rFF([T_x, r, z]), where the predictive distribution is expressed as p(T_y|T_x, r, z) = N(µ_Ty, 0.1 + 0.9·softplus(ω_Ty)).
Eventually, relying on variational inference, one can obtain a loss function that approximates the negative ELBO by replacing the intractable p(z|C) with the variational distribution q(z|C) following [16]: log p(T_y|T_x, C) ≥ E_{q(z|T)}[log p(T_y|T_x, r, z)] − KL(q(z|T) ‖ q(z|C)). As a result, based on the Kolmogorov extension and de Finetti theorems, the neural processes become a stochastic process that satisfies exchangeability and consistency [16]. However, when trained with the deterministic path, the neural processes with latent variables are empirically shown to have difficulty capturing the variability of the stochastic process [31], the causes of which are investigated and resolved in Section 5.2.
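A numpy sketch of this objective, assuming diagonal Gaussian distributions throughout; all statistics (means and standard deviations) are placeholders that an encoder/decoder would normally produce.

```python
import numpy as np

def gaussian_kl(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, diag(sig_q^2)) || N(mu_p, diag(sig_p^2)) ), summed over dims."""
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sig_p ** 2) - 0.5)

def np_loss(mu_T, sig_T, T_y, q_T, q_C):
    """Approximate negative ELBO: Gaussian negative log-likelihood of the
    target outputs plus the KL between the posterior q(z|T) and the
    approximate prior q(z|C). `q_T` and `q_C` are (mean, std) tuples."""
    nll = np.sum(0.5 * np.log(2 * np.pi * sig_T ** 2)
                 + (T_y - mu_T) ** 2 / (2 * sig_T ** 2))
    kl = gaussian_kl(*q_T, *q_C)
    return nll + kl

# toy example: perfect prediction variance 1, posterior equal to prior
loss = np_loss(mu_T=np.zeros(3), sig_T=np.ones(3), T_y=np.zeros(3),
               q_T=(np.zeros(4), np.ones(4)), q_C=(np.zeros(4), np.ones(4)))
```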
Method
This section describes our algorithm MAHA, whose primary focus is to devise a pre-task that copes with task heterogeneity and ambiguity in meta-learning. We first introduce the encoder-decoder pipeline of MAHA, namely FELD, whose effects are examined by substituting the corresponding components within NP in Section 6. Then, a dimension-wise pooling and an auto-encoding structure are proposed to obtain a well-clustered and interpretable representation. Finally, the training process of MAHA is described, which applies to both regression and classification.

Encoder-decoder pipeline
Flexible Encoder Although the attention mechanism proposed in ANP was key to resolving the underfitting in NP, it gives r less incentive to focus on the task identity that is shared across shots.
As a result, in Figure 3, ANP appears to strongly fit the given input-output pairs, which leads to a wiggly prediction. Particularly under task heterogeneity and ambiguity, where the prediction space is prone to be highly variable, the wiggly prediction of ANP leads to poor generalization performance (see Figure 8). Therefore, the graphical model of NP is instead adopted in MAHA since its latent variables are shot-independent. Then, based on the analysis in [10], the problematic underfitting is dealt with by substituting the encoder with the flexible and permutation-invariant Set Transformer (ST) [33]. Note that the Set Transformer can incorporate the rFF(·) and MeanPool_shot(·) in the encoder of NP. See Appendix A for a more detailed explanation of the modules in the Set Transformer.
Figure 3: Qualitative comparison between NP, ANP, and NP with the flexible encoder (NP+FE) on functions generated from a Gaussian Process. The shaded areas correspond to ±2 standard deviations. The prediction of ANP turns out to be wiggly, while NP and NP+FE are relatively smooth, following Occam's razor. Note that a quantitative comparison can be found in Table 1.
Linear Decoder We avoid using a complex decoder such as [41] and apply feature-wise linear modulation to the target input T_x. Inspired by [72], we composite the latent variables using a skip connection. Among the many normalization techniques, layer normalization [3] is applied since its statistics are computed independently for each batch instance, so that only z can capture the heterogeneity, in accordance with the pooling proposed in Section 5.2.
The decoder computes [µ_Ty, ω_Ty] in regression, or the logit in classification, from g(T_x). Here, g(·) denotes any feature extractor, LN(·) indicates layer normalization, and the transpose operation T permutes the last two dimensions of the tensor. This is aligned with previous approaches [4,51,66,20] that weaken the decoder to allow the latent variables to be appropriately leveraged. It also relates to studies on few-shot classification [18,47] where each column of W is computed from the shots within the same way. However, when accompanied by the pooling in Section 5.2, the columns are no longer independent of one another and share information across ways. For NP and ANP trained on functions generated from a GP, we illustrate the weight norm of the decoding layer right behind the latent variables in Figure 5. The sparsely-coded decoder implies the redundancy of the stochastic path due to the component collapsing behavior referred to in [40,24]. This phenomenon can be explained by the information preference problem [7,73], where the information flow is concentrated on the deterministic path with a tendency to ignore the stochastic path.
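A hedged numpy sketch of such a linear decoder: the latent variables modulate the extracted features linearly, with a skip connection and layer normalization. The exact composition in the paper may differ; g, the weight shapes, and W_out are illustrative placeholders.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Layer normalization: statistics computed per instance over features,
    so normalization never mixes information across batch instances."""
    return (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + eps)

def linear_decoder(T_x, r, z, g, W_out):
    """Sketch of a linear decoder: feature extraction, a skip connection
    with the deterministic representation r, feature-wise linear modulation
    by the stochastic representation z, then a single linear output layer
    producing [mu_Ty, omega_Ty] (regression) or logits (classification)."""
    h = g(T_x)                  # [shot, d] extracted features
    h = layer_norm(h + r) * z   # skip connection + feature-wise modulation
    return h @ W_out            # [shot, d_out]

rng = np.random.default_rng(0)
g = lambda x: np.tanh(x @ rng.normal(size=(1, 8)))  # placeholder feature extractor
T_x = rng.normal(size=(10, 1))
r, z = rng.normal(size=8), rng.normal(size=8)
out = linear_decoder(T_x, r, z, g, rng.normal(size=(8, 2)))
```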
To handle this information asymmetry, several solutions have been proposed in studies on generative models, such as KL annealing schedules [4,14] and expressive posterior approximations [48,29], but these are generally not robust to changes in model architecture. Instead, we propose a simple method to avoid redundancy of the stochastic path by encouraging it to acquire multi-modality within heterogeneity and ambiguity.
Dimension-wise pooling We explicitly capture the distinct variations within the information flow by pooling each path across a different dimension, batch for r and way for z: r ← MeanPool_batch(r) and [µ_z, ω_z] ← MeanPool_way([µ_z, ω_z]). Then, the deterministic representation r becomes identical not only across shot, but also across batch.
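The pooling can be sketched directly on tensors; the shapes below (batch, way, feature dimensions) are hypothetical.

```python
import numpy as np

def dimension_wise_pool(r, mu_z, omega_z):
    """Dimension-wise pooling sketch: the deterministic representation r is
    mean-pooled across the batch dimension, while the parameters of the
    stochastic path are mean-pooled across the way dimension.
    Assumed shapes: r is [batch, d_r]; mu_z, omega_z are [batch, way, d_z]."""
    r_bar = np.broadcast_to(r.mean(axis=0, keepdims=True), r.shape)           # identical across batch
    mu_bar = np.broadcast_to(mu_z.mean(axis=1, keepdims=True), mu_z.shape)    # shared across ways
    om_bar = np.broadcast_to(omega_z.mean(axis=1, keepdims=True), omega_z.shape)
    return r_bar, mu_bar, om_bar

rng = np.random.default_rng(0)
r = rng.normal(size=(4, 8))
mu_z, omega_z = rng.normal(size=(4, 5, 8)), rng.normal(size=(4, 5, 8))
r_bar, mu_bar, om_bar = dimension_wise_pool(r, mu_z, omega_z)
```

After pooling, r can only carry what is common to the whole batch, so any cross-task variation must flow through z, as the text argues next.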
Then, whenever it is insufficient to handle all variations across tasks within the same batch, i.e., facing task heterogeneity, the model should resort to the stochastic representation z since the deterministic representation only captures the average properties. On the other hand, the stochastic representation z allows the different ways to share information and becomes class-invariant. We illustrate how the latent variables r and z are computed in Figure 6. Note that the value of way is set to 1 in regression, so the pooling on z is negligible.
Auto-encoding structure Empirically, we observe that the KL collapse [4,2,55,73] does not occur whenever the pooling operations are used (see Appendix D). This implies that the posterior q(z|T) does not simply converge to the approximate prior q(z|C), so the decoder becomes dependent on the stochastic path. However, there is still an incentive for r to be underutilized during decoding because it is inferred from the small set C rather than the large set T [22], and neural networks exploiting set representations are known to perform poorly in the low-shot regime [11,71], i.e., facing task ambiguity.
We therefore adopt the conditional auto-encoding structure [54] on top of the dimension-wise pooling to cope with the lack of training samples. As a result, a loss function is derived that differs from Equation 2 in i) whether the pooling operations are used and ii) which set is used to compute the deterministic representation, which result from the dimension-wise pooling and the auto-encoding structure, respectively.

Training process
See Figure 7. Initially, the dimension-wise pooling and the auto-encoding structure proposed in Section 5.2 are used along with FELD to minimize the loss function in Equation 5. Next, agglomerative clustering is applied to the disentangled representation from the stochastic path to estimate the number of clusters with the highest purity value. Finally, for each cluster, a separate FELD is trained from scratch using Equation 2, where the tasks are no longer uniformly sampled but statistically skewed based on the ratio of heterogeneous tasks within the cluster. At evaluation, the FELD corresponding to the closest cluster, measured by the Euclidean distance to the cluster centers, is exploited.
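The clustering and evaluation-time routing steps can be sketched as follows. This is a minimal bottom-up (centroid-linkage) agglomerative clustering written in plain numpy; the paper estimates the number of clusters via purity, whereas here the target count is passed in directly for brevity.

```python
import numpy as np

def agglomerative(points, n_clusters):
    """Minimal bottom-up agglomerative clustering sketch (centroid linkage):
    repeatedly merge the two clusters whose centroids are closest, until
    `n_clusters` remain. Returns lists of point indices per cluster."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(points[clusters[a]].mean(axis=0)
                                   - points[clusters[b]].mean(axis=0))
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)  # merge the closest pair
    return clusters

def assign(z, centers):
    """Evaluation-time routing: pick the model of the closest cluster center."""
    return int(np.argmin(np.linalg.norm(centers - z, axis=1)))

# two well-separated blobs standing in for stochastic-path embeddings
rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
clusters = agglomerative(pts, 2)
```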

Experiment
We first experiment on benchmark datasets that frequently appear in meta-learning and investigate the role of the encoder-decoder pipeline (FELD) by gradually adjusting NP. These datasets are generally regarded as homogeneous, so MAHA is equivalent to FELD when assuming a single cluster, as noted in Section 5.3. After that, MAHA is evaluated on heterogeneous datasets following the experimental setting of [67] with the dimension-wise pooling and the auto-encoding structure of Section 5.2, whose roles are examined in both a quantitative and qualitative manner. Please refer to Appendix C for details on the data split, architecture design, and hyperparameter search.
Overall, we aim to answer the following three questions:
• Does MAHA outperform the previous baselines in terms of prediction? (See Tables 1 to 5)
• What are the benefits of using the flexible encoder and the linear decoder? (See Section 6.1)
• How do the dimension-wise pooling and the auto-encoding structure contribute to obtaining well-clustered representations within heterogeneity? (See Section 6.2)
Gaussian Process Following the basic neural processes [15,16,25], we consider functions generated from a GP with the squared exponential kernel k(x, x′) = σ^2 exp(−0.5(x − x′)^2 / l^2). The experimental results in Table 1 show that although ANP performs better than NP in terms of flexibility, this dominance no longer holds once NP is equipped with the flexible encoder. However, a degradation in performance is observed when using the linear decoder in NP. This is empirical evidence that NP relies strongly on the complexity of the decoder in regression, whereby the model is prone to ignore the latent variables [7,73].
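Sampling such GP tasks is straightforward; a numpy sketch with the squared exponential kernel follows, using illustrative hyperparameters (σ = l = 1).

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, length=1.0):
    """Squared exponential kernel k(x, x') = sigma^2 * exp(-0.5 (x - x')^2 / l^2)."""
    return sigma ** 2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length ** 2)

def sample_gp(x, rng, sigma=1.0, length=1.0, jitter=1e-6):
    """Draw one function from a zero-mean GP prior at the inputs x;
    jitter on the diagonal keeps the covariance numerically PSD."""
    K = se_kernel(x, x, sigma, length) + jitter * np.eye(len(x))
    return rng.multivariate_normal(np.zeros(len(x)), K)

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = sample_gp(x, rng)  # one task; context/target sets are subsampled from (x, y)
```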

Homogeneous dataset
By exploiting the flexible encoder to obtain latent variables that are informative enough on their own, such that the (shallow) linear decoder suffices for prediction, FELD performs better than any other model with the (deep) conventional decoder. We find that the Set Transformer is a particularly good choice, whose improvement cannot be matched by simply stacking MLPs. Moreover, it is noticeable that FELD outperforms NP+FE despite its decreased model capacity.
Mini-ImageNet, Tiered-ImageNet A similar tendency can be observed in classification. We consider mini-ImageNet [63] and tiered-ImageNet [46], which are frequently used large-scale datasets for few-shot image classification. For mini-ImageNet, we follow the split of [45], which assigns 64 classes to the meta-train set, 16 classes to the meta-valid set, and 20 classes to the meta-test set. For tiered-ImageNet, 608 classes are first grouped into 34 higher-level nodes, which are divided into 20, 6, and 8 nodes to construct the meta-train, meta-valid, and meta-test sets. We use the features provided by [49], which are obtained by pre-training a deep residual network in a supervised manner as in [17,42,43]. However, unlike [43,49], the meta-valid set is used for early stopping and hyperparameter search but not to update the parameters. In Tables 2 and 3, accuracy on mini-ImageNet and tiered-ImageNet is reported. We collect the scores of various baselines that use either convolutional networks or deep residual networks and do not exploit any data augmentation, for a fair comparison. While NP performs no better than a random guess when following [15], NP+LD achieves a score comparable to recent models in gradient-based meta-learning, verifying the validity of the linear decoder in classification. FELD achieves even better performance than the state-of-the-art, which is remarkable in the sense that the attention modules in the Set Transformer cannot be fully utilized in the low-shot regime.

Heterogeneous dataset
Sine & Polynomial To verify the performance on a family of functions, we experiment on toy 1D regression as in [64,67,68]. In particular, we follow the exact setting of [67], where each task is randomly chosen to be one of the following one-dimensional functions, with coefficients uniformly sampled from the prefixed intervals summarized in Appendix C.1: (sine) y = A_s sin(B_s x) + C_s, (line) y = A_l x + B_l, (quad) y = A_q x^2 + B_q x + C_q, (cubic) y = A_c x^3 + B_c x^2 + C_c x + D_c. A small number of data points are given as context, requiring the model to appropriately interpolate and extrapolate in a highly variable prediction space. In Table 4, the MSE over 4000 tasks is presented with 95% confidence intervals. In general, all the gradient-based meta-learning algorithms are outperformed by the neural processes, and a noticeable gain is again observed by solely exploiting the encoder-decoder pipeline, FELD. By adjusting FELD to MAHA via task clustering, and MAHA to MAHA* via knowledge distillation, a monotonic improvement is observed.
In Figure 8, we illustrate the interpolation and extrapolation of MAHA in comparison to ANP. As noted in Section 5.1, ANP mainly focuses on fitting the context points and performs poorly in predicting target outputs whose inputs are located farther away from those of the context points. This tendency can be observed during both interpolation and extrapolation, leading to a wiggly prediction with significant variance. By contrast, MAHA can correctly infer the functional shape, which is confirmed by a consistently low variance. In Figure 9, for the 1-shot setting, the mean value of the variational distribution q(z|C) is visualized through t-SNE [59]. Without external knowledge, such as the true number of clusters, the embeddings become interpretable when using both the dimension-wise pooling and the auto-encoding structure. The distinct datasets are no longer clearly discriminated without either of them, which is quantitatively demonstrated by the estimated purity values in the bottom table. Note that the validity of these methodologies stands out particularly in the low-shot regime, which implies the difficulty of task identification within ambiguity. The tendency can also be observed in the performance measures presented in Table 5. Compared to the 1-shot setting, where a noticeable gain is achieved by task clustering, in the 5-shot setting there is almost no difference between FELD and MAHA. This is because the models can clearly identify the tasks regardless of whether the pooling or the auto-encoding structure is used, as demonstrated by the high purity values. Accordingly, the knowledge distillation, which is fundamentally devised to appropriately regularize the model within ambiguity, shows a worthwhile improvement from MAHA to MAHA*, particularly in the 1-shot setting. Eventually, MAHA (and MAHA*) beats all the previous works by a fairly large margin and achieves state-of-the-art performance.

Conclusion
This paper proposes a new meta-learning framework, MAHA, that performs robustly amidst heterogeneity and ambiguity. We aim to disentangle the stochastic representation via the dimension-wise pooling and the auto-encoding structure, built on the newly devised encoder-decoder pipeline, to better leverage the latent variables. With the multi-step training process, comprehensive experiments are conducted on regression and classification. In the end, we argue that the proposed model captures the task identity with lower variance, leading to a noticeable improvement in performance. A potential limitation of MAHA is the additional computational cost from the flexible encoder composed of multiple attention modules. However, by applying it orthogonally to existing work, its compatibility and necessity are empirically verified. An interesting direction for future work would be to apply our model to reinforcement learning. In particular, training a policy directly from well-clustered representations for sample-efficient exploration seems promising in environments with sparse rewards.

Broader Impact
When training meta-learning models, a customization process arises based on the problem at hand. If one does not use the benchmark datasets that frequently appear in academia, it becomes unclear to what extent distinct datasets should be combined when expecting the model to be versatile across every possible task generation. MAHA, in this respect, can guide a human in analyzing and clustering the available data into separate clusters. Moreover, MAHA mainly benefits future AI industries where only limited communication between decentralized servers is available, as it can infer the global context even with a small amount of information. As a result, we do not expect any negative societal impacts, but we believe that MAHA carries many implications for more realistic scenarios.

Figure 1 :
Figure 1: Heterogeneity and ambiguity occurring in the task distribution. These are not independent concepts; rather, the ambiguity naturally follows from the heterogeneity.

Figure 2 :
Figure 2: Graphical model of the related baselines. Circles denote random variables, whereas diamonds denote deterministic variables. Shaded variables are observed during the test phase, and every in-between edge is implemented as a neural network.

Figure 4 :
Figure 4: Prediction on the output distribution. The superscript b indicates the b-th batch instance.

Figure 7 :
Figure 7: MAHA. K is the number of estimated clusters such that the meta-train set S^tr = ⋃_{k=1}^{K} S^tr_k.

Figure 8 :
Figure 8: Qualitative comparison of ANP and MAHA on various function types. The context points are selected from 40% of the entire domain for extrapolation.

Figure 9
Figure 9: t-SNE of µ_z from q(z|C) and the estimated purity values.

Table 1 :
MSE on Gaussian Process

Table 5 :
Accuracy on multi-dataset