Learning to Balance Local Losses via Meta-Learning

Standard training of deep neural networks relies on a fixed, global loss function. For more effective training, dynamic loss functions have recently been proposed. However, a dynamic global loss function is not flexible enough to differentially train the layers of complex deep neural networks. In this paper, we propose a general framework that learns to adaptively train each layer of a deep neural network via meta-learning. Our framework leverages the local error signals from layers and identifies which layer needs to be trained more at every iteration. The proposed method also improves the local loss function with our minibatch-wise dropout and a cross-validation loop that alleviates meta-overfitting. Experiments show that our method achieves competitive performance compared to state-of-the-art methods on popular benchmark datasets for image classification: CIFAR-10 and CIFAR-100. Surprisingly, our method enables training deep neural networks without skip-connections using dynamically weighted local loss functions.


I. INTRODUCTION
Deep neural networks have recently achieved state-of-the-art performance in various applications and tasks. A prototypical way to train a neural network is to optimize the model parameters with respect to a fixed global loss function attached to the top layer, for which all the layers are optimized. In a classification task, for instance, the cross-entropy loss after a softmax function is a common choice. Such fixed global loss functions are often suboptimal: the fixed loss is agnostic to tasks, datasets, and the architectures of neural networks. So, a variety of loss functions, e.g., variants of cross-entropy and softmax losses, have been proposed to achieve better performance: Focal loss [1], Large-margin-softmax loss [2], and Angular-softmax [3]. The standard training uses global loss functions applied to the top layer, and the global error signals are back-propagated from the top layer to the bottom layer sequentially [4]. This becomes problematic as neural networks grow deeper, causing vanishing/exploding gradient problems, and necessitates architectural treatments such as skip-connections [5] and the Rectified Linear Unit [6].
Alternatively, dynamic loss functions have been studied through various adaptive learning strategies. Machine teaching [7]–[9] aims at identifying the smallest subset of training data for efficient training. Curriculum learning [10], [11] and self-paced learning [12], [13] maximize learning efficiency by learning samples in a meaningful order, such as an easy-to-hard order. These approaches reduce to a dynamic loss function that adjusts the importance of samples depending on the learning progress. From this perspective, combined with meta-learning (learning to learn), optimization-based frameworks have been proposed. Learning to teach (L2T) [14]–[16], meta-weight-net [17], and learn-to-optimize [18] adjust the loss function or the weights for samples to improve the performance of the learner model on meta datasets, which play a role similar to a validation set for hyperparameter tuning. This line of work achieved remarkable improvements, but it mainly focuses on global loss functions, meaning that the approaches do not differentially train different layers depending on their states or the learning stage.

FIGURE 1. Overall architecture. Our framework learns to dynamically balance local losses ($L_{local}$) to optimize a global loss ($L_{global}$) via meta-learning. A local loss block attached to a hidden layer captures local error signals and allows identifying a specific layer or sub-network to train. The meta-network dynamically generates the weights of local losses leveraging the local error signals from all the local loss blocks. A learner network is trained by minimizing the total loss ($L_{total}$), which is the sum of the global loss and the weighted local losses, i.e., $L_{total} = L_{global} + \sum_{l=1}^{L} v^{(l)} L^{(l)}_{local}$.

Separately from adaptive learning strategies, local error signals and internal states of neural networks have been investigated for efficient learning. Deeply supervised nets [19] minimize the classification error as well as a local loss function, called a companion function, at each hidden layer, which alleviates the vanishing gradient problem. In knowledge distillation, soft pseudo-labels, features, or predictions before the last few layers have proven more effective than hard pseudo-labels (predictions from teachers) [20], [21]. Also, local error signals are often sufficient to train neural networks [22], which opens the door to parallel layer-wise model training. Local loss functions leveraging internal states are useful for accelerating learning and improving representation power.
Inspired by the adaptive learning strategies and local loss approaches, we propose a framework that dynamically balances local losses to optimize a global loss, differentially training the layers of a neural network via meta-learning. In contrast to the global loss at the top layer, a local loss attached to a hidden layer captures local error signals and enables training a specific layer or sub-network. To generate the weights of the local losses, our framework learns a small network called the meta-network using meta-learning. The meta-network dynamically balances the local losses leveraging the local loss values captured by the local loss blocks, as shown in Figure 1. In this paper, we use two types of local losses: the similarity matching loss and the prediction loss [22]. The similarity matching loss measures the correlation between labels and features at an intermediate layer. The prediction loss is the cross-entropy loss between labels and the predictions of a local classifier at the intermediate layer. The goal of the proposed method is to learn to dynamically adjust the weights of the local losses to train a neural network more effectively. Our contributions are as follows:
• We propose a novel training strategy to differentially train layers by dynamically adjusting the weights of local losses to improve the global loss.
• We propose a minibatch-wise dropout for local loss blocks to obtain richer and more robust hidden representations that are correlated with labels.
• The proposed framework is a meta-learning paradigm. We study a bi-level optimization algorithm equipped with a cross-validation inner loop to address meta-overfitting.
• Lastly, our experiments demonstrate that the proposed method achieves competitive performance on popular benchmark datasets compared to strong baselines. Surprisingly, the proposed method is capable of training deep neural networks without skip-connections.

II. RELATED WORK
There is a rich body of work on loss functions that improve the generalization ability of models. Early work explored manually designed fixed loss functions that are more effective than the standard ones, e.g., the cross-entropy loss. Several such loss functions have been proposed: the Large-margin-softmax loss [2] and the Angular-softmax loss [3] use angular similarity to impose larger angular separability between features, and the Focal loss [1] puts larger weights on hard negative examples to achieve an effect similar to hard negative mining. Subsequent approaches instead study dynamic loss functions, which learn (or adjust) loss functions during training or adapt the loss to the learning phase [23]–[27]. For example, 'learning to teach' frameworks [14]–[16] train a student model with the guidance of a teacher model that dynamically outputs different loss functions. These approaches achieved notable improvements, but they focus on global loss functions, which prevents the training scheme from differentially handling layers using local error signals from each layer.

So, another line of work considers local loss functions. Local classifiers [28] and local loss functions [29]–[31] have been studied to train deep neural networks. A recent work [22] proposes local loss functions that capture local error signals, enabling distributed training of a huge neural network by layer-wise training. In addition to the prediction loss from local classifiers, the similarity matching loss [22] measures the distance between cosine similarity matrices obtained from the labels and from the feature maps at an intermediate layer. Similar local cost functions are studied in the literature [32], [33].

Meta-learning aims at learning to learn models more effectively as learning experience accumulates and at generalizing the learned methods to new tasks. Meta-learning approaches have addressed the following problems: optimization-based methods for fast adaptation to new tasks [34], hyper-parameter optimization [35]–[37], black-box methods to predict gradients without access to the target models [38], learning loss functions for efficient training [14], [15], [27], distance metric learning [39], [40], few-shot learning [39], [41], transfer learning/knowledge distillation [42], optimal initialization for neural networks [43], [44], and adaptive sample reweighting for imbalanced and noisy datasets [17]. In this work, inspired by dynamic loss functions and local loss approaches, we propose a new training strategy that dynamically balances local losses to optimize the global loss. In contrast to previous works, our method learns a meta-network to adjust the weights of local losses and explicitly optimizes the performance on the meta-dataset with respect to the global loss via meta-learning.

III. METHOD
We introduce the details of our method, called Meta Local Loss Learning. The goal of the framework is to learn to train a neural network with multiple local loss functions. Our framework trains a learner network with a global loss and multiple local losses balanced by a meta-network. The meta-network adjusts the weights of the local loss functions to inform the learner network which layer is more important at each learning stage. We cast this as a meta-learning problem. In this section, we describe the local loss blocks, the meta-network, our meta-learning formulation, and the algorithm.

A. LOCAL LOSS BLOCKS
We introduce local loss blocks to locally optimize neural networks. A local loss block is a shallow neural network that takes an input from an intermediate layer and outputs a local loss.
Local training with shallow sub-networks is often effective in avoiding the vanishing/exploding gradient problem in deep neural networks. To directly train intermediate layers rather than backpropagating from the top layer, [22] used the Prediction Loss $L_{pred}$ and the Similarity Matching Loss $L_{sim}$. The Prediction Loss $L_{pred}$ is a cross-entropy loss measured between labels and the predictions from a local classifier. The Similarity Matching Loss $L_{sim}$ measures the Frobenius norm between cosine-similarity matrices derived from the labels and the feature maps in a minibatch. The Similarity Matching Loss is given as
$$L_{sim} = \big\| \mathrm{Sim}(g(H)) - \mathrm{Sim}(Y) \big\|_F^2, \qquad (1)$$
where $H \in \mathbb{R}^{n \times d_h}$ is the feature map matrix from each layer and $Y \in \mathbb{R}^{n \times d_y}$ is a one-hot encoded label matrix. $n$ denotes the number of samples, and $d_h$ and $d_y$ are the dimensionalities of the features and the one-hot encoded labels. $g(\cdot)$ is a small network to transform features; in our experiments a convolutional layer is used. $\mathrm{Sim}(X)$ is the cosine similarity matrix of a minibatch $X$, i.e., $\mathrm{Sim}(X)_{ij} = \tilde{x}_i^{\top}\tilde{x}_j / (\|\tilde{x}_i\|\,\|\tilde{x}_j\|)$, where $\tilde{X}$ is the sample-wise mean-centered matrix of $X$ and $\tilde{x}_i$ denotes its $i$-th row.
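As a concrete illustration of (1), here is a minimal PyTorch-style sketch (our own, not the authors' code); the helper names and the treatment of the labels as a float one-hot matrix are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(x: torch.Tensor) -> torch.Tensor:
    """Sim(X): pairwise cosine similarities between the rows of the
    sample-wise mean-centered matrix (n x d -> n x n)."""
    x = x - x.mean(dim=0, keepdim=True)   # sample-wise mean-centering
    x = F.normalize(x, dim=1)             # unit-norm rows
    return x @ x.t()

def similarity_matching_loss(h: torch.Tensor, y_onehot: torch.Tensor,
                             g: torch.nn.Module) -> torch.Tensor:
    """|| Sim(g(H)) - Sim(Y) ||_F^2 for one minibatch."""
    feat = g(h).flatten(start_dim=1)      # transform features, then flatten to n x d_h
    diff = cosine_similarity_matrix(feat) - cosine_similarity_matrix(y_onehot.float())
    return (diff ** 2).sum()              # squared Frobenius norm
```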

1) MINIBATCH-WISE DROPOUT
We observed that the Similarity Matching Loss in (1) does not encourage abundant features. For instance, if a single feature in $H$ is highly correlated with $Y$, it alone can yield high fidelity with respect to the Similarity Matching Loss. To address this, we propose a new Similarity Matching Loss with minibatch-wise dropout. The minibatch-wise dropout removes the same dimensions across all the samples in a minibatch. Thereby, the similarity matrices can still be accurately computed, while model training benefits from dropout by obtaining robust and abundant features. This is more effective than the standard (sample-wise) dropout for the Similarity Matching Loss; more discussion and experimental results are in Section IV-D. The Similarity Matching Loss with minibatch-wise dropout can be written as
$$L_{sim} = \big\| \mathrm{Sim}(g(HR)) - \mathrm{Sim}(Y) \big\|_F^2,$$
where $R$ is a diagonal matrix whose main diagonal entries are drawn from a Bernoulli distribution, i.e., $R = \mathrm{diag}(r)$, $r = [r_1, \ldots, r_{d_h}]$, $r_i \sim \mathrm{Bernoulli}(p)$. For more details, see Figure 2.

FIGURE 2. Local loss block with minibatch-wise dropout. A local loss block computes the local loss $L_{local}$ by comparing the feature map $H$ with the ground truth labels. To compute the prediction loss $L_{pred}$ and the similarity matching loss $L_{sim}$, the feature map $H$ is transformed by the small networks $g$ and $\psi$. To achieve rich and robust features, the minibatch-wise dropout is applied before $g$. The minibatch-wise dropout drops the same dimension across all the samples in a minibatch, whereas the standard dropout masks different dimensions across samples. Compared to the standard dropout, the minibatch-wise dropout is more effective for $L_{sim}$, which is a minibatch-wise loss function.

The prediction loss is the cross-entropy between labels and the predictions of a local classifier $\psi$ using the features from an intermediate layer. As in [22], we use a single-layer perceptron. The prediction loss can be written as
$$L_{pred} = L_{ce}(\psi(H), Y),$$
where $L_{ce}(\cdot, \cdot)$ is the standard cross-entropy. Now, a local loss $L_{local}$ attached to the $l$-th layer is defined as a convex combination of the two losses $(L_{sim}, L_{pred})$:
$$L^{(l)}_{local} = \gamma L^{(l)}_{sim} + (1-\gamma) L^{(l)}_{pred},$$
where $\gamma \in (0, 1)$ is the ratio between the two losses. In our experiments, we used $\gamma = 0.99$. This ratio can also be learned; for more discussion, refer to Section V-C.
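To make the difference from standard dropout concrete, below is a minimal sketch (ours) of applying the mask $R$: one Bernoulli draw per feature dimension, shared by every sample in the minibatch. The inverted-dropout rescaling is an assumption; the paper only specifies the product $HR$.

```python
import torch

def minibatch_wise_dropout(h: torch.Tensor, p: float, training: bool = True) -> torch.Tensor:
    """Drop the SAME feature dimensions for every sample in the minibatch.

    h: features of shape (n, d_h) or (n, c, height, width); the mask is drawn
       once over the feature/channel dimension and broadcast over samples.
    p: dropout rate (probability of zeroing a dimension).
    """
    if not training or p == 0.0:
        return h
    mask_shape = (1, h.shape[1]) + (1,) * (h.dim() - 2)
    r = torch.bernoulli(torch.full(mask_shape, 1.0 - p, device=h.device))
    return h * r / (1.0 - p)  # rescaling keeps the expected activation unchanged

# Standard dropout would instead draw an independent mask of shape h.shape,
# masking different dimensions for different samples.
```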

B. META-NETWORK
The local losses from all the local loss blocks are combined by a meta-network. The meta-network learns to balance the local losses with a global loss to improve the generalization power of the learner network. We combine the global loss $L_{global}(f_\theta(x), y)$, which is a standard cross-entropy loss, with the weighted sum of the local losses. The total loss is
$$L_{total} = L_{global}(f_\theta(x), y) + \sum_{l=1}^{L} v^{(l)} L^{(l)}_{local}, \qquad (6)$$
where $v^{(l)}$ is the weight generated by the meta-network for the $l$-th local loss. In our experiments, a simple network with a fully-connected layer is used as the meta-network, as shown in Figure 1, which also depicts the overall structure of our framework including the learner network $f$ and the meta-network. Learning the meta-network can be cast as a meta-learning problem.
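The sketch below (our own illustration, not the released code) shows one way such a meta-network could map the vector of local loss values to softmax weights and how (6) is assembled; the hidden size of 100 and the layer normalization follow the description in Section IV-A, while the ReLU activation and detaching the meta-network input are our assumptions.

```python
import torch
import torch.nn as nn

class MetaNetwork(nn.Module):
    """Maps the L local loss values to L weights v^(l) that sum to one."""
    def __init__(self, num_blocks: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_blocks, hidden),
            nn.LayerNorm(hidden),      # layer normalization before the activation
            nn.ReLU(),
            nn.Linear(hidden, num_blocks),
            nn.Softmax(dim=-1),        # weights sum to one
        )

    def forward(self, local_losses: torch.Tensor) -> torch.Tensor:
        return self.net(local_losses)

def total_loss(global_loss: torch.Tensor, local_losses: torch.Tensor,
               meta_net: MetaNetwork) -> torch.Tensor:
    """L_total = L_global + sum_l v^(l) * L_local^(l)  (Eq. (6))."""
    v = meta_net(local_losses.detach())   # weights generated from the local error signals
    return global_loss + (v * local_losses).sum()
```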

C. META-NETWORK TRAINING
We here propose a meta-learning formulation to train the meta-network. Consider a classification task with a whole training dataset $D$. We divide $D$ into a training dataset $D_{tr}$ and a meta-dataset $D_{meta}$ to train the learner network and the meta-network, respectively; $D_{tr}$ and $D_{meta}$ can be viewed as a support set and a query set in meta-learning. Concretely, $D_{tr}$ is the whole training dataset except for $D_{meta}$, that is, $D = D_{tr} \cup D_{meta}$ and $D_{tr} \cap D_{meta} = \emptyset$. Given the learner network parameters $\theta$ and the meta-network parameters $\phi$, our meta-learning formulation is the bi-level problem
$$\min_{\phi} \; \sum_{(x, y) \in D_{meta}} L_{global}\big(f_{\theta^*(\phi)}(x), y\big) \quad \text{s.t.} \quad \theta^*(\phi) = \arg\min_{\theta} \sum_{(x, y) \in D_{tr}} L_{total}\big(f_{\theta}(x), y; \phi\big),$$
where $L_{total}(\cdot, \cdot; \phi)$ is the total loss in (6), i.e., the weighted sum of the global loss and the local losses using the weights $\{v^{(l)}\}$ from the meta-network. The optimal parameters $\theta^*$ and $\phi^*$ are obtained by minimizing the following losses:
$$\theta^*(\phi) = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L_{total}\big(f_{\theta}(x^{tr}_i), y^{tr}_i; \phi\big), \qquad (10)$$
$$\phi^* = \arg\min_{\phi} \frac{1}{M} \sum_{i=1}^{M} L_{global}\big(f_{\theta^*(\phi)}(x^{meta}_i), y^{meta}_i\big), \qquad (12)$$
where $N$ and $M$ are the numbers of samples in the training data $D_{tr}$ and the meta-data $D_{meta}$, respectively, and $(x^{tr}_i, y^{tr}_i)$ and $(x^{meta}_i, y^{meta}_i)$ are samples from $D_{tr}$ and $D_{meta}$. The intuition is the following. In (10), given the meta-network ($\phi$), the learner network ($\theta$) minimizes the weighted losses on $D_{tr}$. Since the optimal parameters of the learner network depend on $\phi$, we denote them by $\theta^*(\phi)$. Then (12) trains the meta-network ($\phi$) by improving the performance, on the meta-data $D_{meta}$, of the learner network that was trained under the meta-network.

1) CROSS VALIDATION TO OVERCOME META-OVERFITTING
Meta-learning often suffers from meta-overfitting [41], [45]. For instance, in our setting, if the meta-dataset $D_{meta}$ is small, the meta-network can overfit to $D_{meta}$, resulting in erroneous weights for the local losses, which hinders the training of the learner network. Training the meta-network with k-fold cross validation alleviates meta-overfitting and provides a virtually larger meta-dataset compared to training on a single split. To be specific, the whole training dataset $D$ is split into $k$ disjoint folds, and each fold in turn serves as the meta-dataset while the remaining folds serve as the training dataset.

TABLE 1. Accuracy (%) on CIFAR-10 and CIFAR-100. We run 3 times and show the mean with the standard deviation. We compare the proposed method with the cross-entropy (CE) loss, the smooth 0-1 loss (Smooth), the large-margin softmax loss (L-Softmax), the focal loss (FL), the dynamic loss function (L2T-DLF), meta-weight-net (MW-Net), and the local error signal function (LES). We reproduce the baselines for detailed experimental results (standard deviations) and for the experiments with a deeper ResNet (ResNet-56), except for Smooth, L-Softmax, and L2T-DLF. Our framework performs on average 1.3% higher than the best baseline. Considering the small average standard deviation (0.06%) of our framework, the performance gain of our method is significant.

2) META-LEARNING OPTIMIZATION ALGORITHM
We introduce a training algorithm for our Meta Local Loss Learning. Algorithm 1 first applies a one-step adaptation $\tilde{\theta}$ from $\theta$ by minimizing the total loss in (6) on a minibatch of the training dataset $D_{tr}$. After the adaptation, $\tilde{\theta}$ is used to measure the quality of $\phi_t$ by evaluating the performance of the learner network with $\tilde{\theta}$ on a minibatch of the meta-dataset $D_{meta}$. Note that $\tilde{\theta}$ never replaces the learner parameters; only its computational graph is used to learn the meta parameters $\phi$. The gradients for the meta parameters $\phi$ are calculated $K$ times in the cross-validation loop, which allows more accurate updates of the meta parameters. Using the updated meta parameters $\phi_{t+1}$, the weights of the local losses are generated. More discussion of the variants of our algorithm and the ablation study are in Section IV-D and Section V-B.
Algorithm 1. Meta Local Loss Learning (the inner loop "for k = 0 to K − 1" is the cross-validation loop).
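Since only a fragment of Algorithm 1 is reproduced here, the following PyTorch-style sketch reflects our reading of the procedure: a one-step adaptation θ̃ on a training minibatch, a meta-update of φ on a meta minibatch inside the K-fold cross-validation loop, and then the learner update. The learner interface (returning logits and a vector of local losses), the use of torch.func, and the inner learning rate are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def meta_train_step(learner, meta_net, learner_opt, meta_opt, folds, inner_lr=0.1):
    """One outer iteration of (a sketch of) Algorithm 1.

    `folds` is a list of K pairs ((x_tr, y_tr), (x_meta, y_meta)) from the
    k-fold split; `learner(x)` is assumed to return (logits, local_losses)
    where local_losses is a 1-D tensor with one entry per local loss block."""
    for (x_tr, y_tr), (x_me, y_me) in folds:                  # cross-validation loop
        params = dict(learner.named_parameters())

        # ---- one-step adaptation: theta_tilde(phi), differentiable w.r.t. phi ----
        logits, locals_ = functional_call(learner, params, (x_tr,))
        v = meta_net(locals_)                                 # weights of local losses
        l_total = F.cross_entropy(logits, y_tr) + (v * locals_).sum()
        grads = torch.autograd.grad(l_total, list(params.values()), create_graph=True)
        theta_tilde = {n: p - inner_lr * g
                       for (n, p), g in zip(params.items(), grads)}

        # ---- meta update: global loss of theta_tilde on the meta minibatch ----
        logits_me, _ = functional_call(learner, theta_tilde, (x_me,))
        meta_opt.zero_grad()
        F.cross_entropy(logits_me, y_me).backward()
        meta_opt.step()

        # ---- learner update with the refreshed meta-network (no graph to phi) ----
        logits, locals_ = learner(x_tr)
        with torch.no_grad():
            v = meta_net(locals_)
        l_total = F.cross_entropy(logits, y_tr) + (v * locals_).sum()
        learner_opt.zero_grad()
        l_total.backward()
        learner_opt.step()
```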

IV. EXPERIMENTS
In this section, we evaluate the benefits of the proposed method on CIFAR-10 and CIFAR-100 [26]. First, we study the performance gain of our Meta Local Loss Learning compared to several strong baselines including Large-Margin Softmax, Focal Loss, learning to teach, and local training strategy. Second, we show that our framework is capable of training deep neural networks without skip-connections leveraging local loss functions. Third, we perform the ablation study to evaluate the contribution of each component in our framework such as our meta-learning formulation, cross-validation loop in the optimization algorithm, and minibatch-wise dropout. Finally, we also evaluated our framework on other datasets such as STL-10 [46] and Fashion-MNIST [47].

A. EXPERIMENTAL SETUP
In our experiments, the popular ResNet architecture [5] is used, including ResNet-8, ResNet-20, ResNet-32, and ResNet-56, to verify applicability from shallow to deep neural networks.
The same hyper-parameters (e.g., batch sizes, learning rates, and the optimizer) are used in all experiments; the only differences from the baselines are the learning method and the loss functions. We average three independent runs across different random seeds. In our experiments, the average standard deviation of our framework is 0.0014. Standard deviations of the classification accuracy are provided in Table 1.

1) LOCAL LOSS BLOCKS
Local loss blocks compute the similarity matching loss and the prediction loss at a hidden layer. As in Figure 1, a local loss block has a small network $g(\cdot)$ to compute the similarity matching loss $\| \mathrm{Sim}(g(HR)) - \mathrm{Sim}(Y) \|_F^2$; in our experiments, two convolutional layers are used. We apply the minibatch-wise dropout $R$ before the convolutional layers with a dropout rate in $\{0.1, 0.2, 0.3\}$. Between the two convolutional layers, we use Batch Normalization [48] and Leaky ReLU [49]. To calculate the prediction loss $L_{ce}(\psi(H), Y)$, we generate a prediction from the local classifier $\psi(\cdot)$, which consists of an average pooling and one linear layer.
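A minimal sketch of a local loss block as described (two convolutional layers with batch normalization and Leaky ReLU for $g$, and an average-pooling plus linear local classifier $\psi$); channel sizes, kernel sizes, and the inline helpers are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _sim(x: torch.Tensor) -> torch.Tensor:
    """Cosine similarity matrix of the sample-wise mean-centered rows."""
    x = x - x.mean(dim=0, keepdim=True)
    x = F.normalize(x, dim=1)
    return x @ x.t()

class LocalLossBlock(nn.Module):
    """Computes (L_sim, L_pred) from an intermediate feature map H."""
    def __init__(self, in_ch: int, num_classes: int, p_drop: float = 0.1):
        super().__init__()
        self.p_drop = p_drop                      # minibatch-wise dropout rate
        self.g = nn.Sequential(                   # feature transform g(.)
            nn.Conv2d(in_ch, in_ch, 3, padding=1),
            nn.BatchNorm2d(in_ch),
            nn.LeakyReLU(),
            nn.Conv2d(in_ch, in_ch, 3, padding=1),
        )
        self.psi = nn.Sequential(                 # local classifier psi(.)
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_ch, num_classes),
        )

    def forward(self, h, y, y_onehot):
        if self.training and self.p_drop > 0:     # drop the same channels for all samples
            r = torch.bernoulli(torch.full((1, h.shape[1], 1, 1),
                                           1.0 - self.p_drop, device=h.device))
            h_drop = h * r
        else:
            h_drop = h
        l_sim = ((_sim(self.g(h_drop).flatten(start_dim=1))
                  - _sim(y_onehot.float())) ** 2).sum()
        l_pred = F.cross_entropy(self.psi(h), y)
        return l_sim, l_pred
```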
2) META-NETWORK
The meta-network learns to generate the weights of the local losses. In our experiments, the meta-network consists of a fully-connected layer with a hundred hidden nodes, and layer normalization [50] is applied before the activation function. The meta-network takes the concatenation of the local losses and generates the weights through a softmax function to reweight the local losses for training the learner network; therefore, the weights of the local losses sum to one. The meta-network is shared among all of the layers and is updated during meta-training.

TABLE 2. Accuracy (%) of ResNet architectures without skip-connections. We run 3 times and average the accuracy, computing the standard deviation. The first column, denoted by CE (with s-c), uses the standard ResNet. The remaining columns are results from ResNets without skip-connections. All the baselines show significant degradation, whereas our method shows competitive performance without skip-connections and even outperforms CE (with s-c) in some cases: ResNet-8 and ResNet-20 on CIFAR-10, and ResNet-8 on CIFAR-100. Note that without skip-connections LES becomes unstable and many runs failed, so we do not include its standard deviations.

B. EVALUATION ON THE CLASSIFICATION TASK
We compare our method with various baselines. The first group of baselines is fixed global losses: the cross-entropy (CE) loss [51], the smooth 0-1 loss (Smooth) [52], the large-margin softmax (L-Softmax) loss [2], and the Focal Loss (FL) [1]. We also include dynamic global losses, namely the L2T-DLF method [14] and the Meta-Weight-Net (MW-Net) method [17]. Lastly, the closest baseline, the local error signal (LES) function [22], is compared. For methods whose code is not publicly available, we use the performance reported in the literature, so some experimental results with larger networks, e.g., ResNet-56, are not available. Table 1 shows that our method achieves the best performance, and the average gap to the second-highest baseline is 1-2%. Considering the small standard deviation of each method's performance over three independent runs, the improvement is significant. LES, which optimizes each layer using layer-wise local error signals, provides good parallelization but shows poor generalization performance, especially as the learner network gets deeper. This indicates that training with only layer-wise local losses has a limitation, whereas our method leverages all the internal states (local loss values) from the other hidden layers.

C. LEARNING DEEP NEURAL NETWORKS WITHOUT SKIP-CONNECTIONS
We experiment with an architecture that is challenging to optimize in order to highlight the advantage of our framework. It is known that skip-connections are essential for training deep neural networks due to the vanishing/exploding gradient problem. We compare our framework with the baselines using neural networks without skip-connections. In this experiment, all the settings are the same except for the learning method or loss functions. In Table 2, CE (with s-c) denotes the cross-entropy with skip-connections, and the other columns are results from neural networks without skip-connections, which are identical to the counterpart ResNets except for the skip-connections. Surprisingly, the proposed method is capable of training 'deep' neural networks with no skip-connections, whereas the baselines show poor performance. Among the models without skip-connections, our method achieves the best performance. With ResNet-8 and ResNet-20 on CIFAR-10, and with ResNet-8 on CIFAR-100, our framework even outperforms the cross-entropy with skip-connections, denoted by 'CE (with s-c)'.

D. ABLATION STUDIES AND ANALYSIS
We perform ablation studies to analyze the contribution of the features of the proposed method. We investigate the gain from end-to-end training, meta-learning, cross-validation loop, and minibatch-wise dropout.

1) END-TO-END TRAINING
The prior work [22] provides a way to optimize each layer independently using a layer-wise local loss without a global forward/backward pass. Since the method does not use a global loss over the entire network, training is prone to be suboptimal, and the disadvantage becomes evident when the network gets deeper. In contrast, our framework leverages both a global loss and the local losses from all the layers and trains the learner network in an end-to-end fashion. Table 3 shows that the method that learns with the global loss and all the local losses (denoted by w/o meta) outperforms the cross-entropy (CE) and the local error signal (LES) [22].

2) TRAINING VIA META-LEARNING
Our framework learns the weights of local losses via meta-learning. We observed that a naive adoption of meta-learning (w/ meta) suffers from meta-overfitting: the weights for local losses generated by the meta-network hinder the training of the learner network. Since meta-datasets are small, the meta-network trained by bi-level optimization tends to have poor generalization ability. Instead, (w/o meta), which trains with fixed weights, e.g., one for all the local losses, performs better than the meta-network without the CV loop (w/o CV).

TABLE 3. Ablation study. We compare our framework with the following: (w/o meta) global and local losses without the meta-network, (w/o CV) all losses with the meta-network but without the CV loop, (meta+CV) meta-network and cross-validation loop, (meta+CV+D.out) standard dropout with meta-network and cross-validation loop, and (meta+CV+mD.out) all the proposed features including minibatch-wise dropout.

3) CROSS VALIDATION
The proposed method trains the learner network by meta-learning, where the meta-network is updated on the meta-dataset. If we split the dataset $D$ into a training dataset $D_{tr}$ and a meta-dataset $D_{meta}$, the training data for the learner network and the meta-network become smaller than $D$, which is the training data for (w/o meta). The smaller meta-dataset leads to overfitting of the meta-network (or of meta-learning), known as meta-overfitting. To overcome this, we insert a 5-fold cross-validation loop into our optimization algorithm. We also observed that the learner network can be updated inside the cross-validation loop along with the meta-network, which allows more updates for the learner network and makes training more efficient with respect to wall-clock time. Table 3 demonstrates that our method with cross-validation (meta+CV) shows a performance gain compared to CE, LES, (w/o meta), and (w/o CV). On CIFAR-10, the average accuracy is 0.8835 without cross-validation and 0.9320 with cross-validation (a gap of 0.0485); on CIFAR-100, it is 0.5765 and 0.7007, respectively (a gap of 0.1242). The gap on CIFAR-100 is thus larger than on CIFAR-10; a difference of almost 0.12 is a large performance gap in this classification task.

4) MINIBATCH-WISE DROPOUT
We proposed a minibatch-wise dropout (denoted by mD.out) to obtain richer and more robust representations without interfering with the Similarity Matching Loss function that is a minibatch-wise loss function. In Table 3, the training with the minibatch-wise dropout (mD.out) achieved the best results in all different settings whereas the standard dropout (D.out) mostly shows degradation. The consistent improvement in Table 3 supports the effectiveness of our minibatch-wise dropout method.

5) DYNAMIC WEIGHTS FOR LOCAL LOSSES
We analyze the weights for local losses generated by the meta-network. In Figure 4(b), we observe that the weight for the local loss of the bottom block (block1) is larger than those of the others (block2 and block3) at the early stage of training, implying that the meta-network focuses more on the bottom layers early on. After about 200 iterations, the weight of block1 decreases and the weight of block3 (the top layers) increases. At the end of training, all the weights converge to the same value. More discussion is provided in Section V-D.

E. ADDITIONAL EXPERIMENTS
In this section, we report additional experiments on two other datasets, STL-10 [46] and Fashion-MNIST [47].

1) CLASSIFICATION TASK ON STL-10
We evaluate our method on the classification task on the STL-10 dataset; the results are in Table 4. STL-10 is an image recognition dataset that consists of 5,000 labeled training images and 8,000 test images over 10 classes. The resolution of STL-10 images is 96 × 96, and no data augmentation is used. We compare our framework with the cross-entropy (CE), the Focal loss (FL), and Meta-Weight-Net (MW-Net). Our framework outperforms the second-highest baseline, CE, by 2.57% on average.

TABLE 4. Accuracy (%) on the STL-10 dataset. We compare our framework with the cross-entropy (CE), Focal Loss (FL), and meta-weight-net (MW-Net). Our framework achieves on average 2.56% higher accuracy than the second-highest baseline.

2) CLASSIFICATION TASK ON FASHION-MNIST
Fashion-MNIST is a dataset of 70,000 grayscale images in 10 categories of clothing, with 60,000 training images and 10,000 test images. It is a drop-in replacement for the MNIST dataset and a slightly more challenging problem than regular MNIST. Table 5 shows that our framework achieves higher accuracy than the cross-entropy (CE) and Meta-Weight-Net (MW-Net) on ResNet-10 and ResNet-18.

TABLE 5. Accuracy (%) on the Fashion-MNIST dataset. We compare our framework with the cross-entropy (CE) and meta-weight-net (MW-Net).

V. DISCUSSION
In this section, we provide detailed experimental results and additional discussions about our method. This includes: 1) Cross-Validation, 2) Exponential Smoothing, 3) Balancing losses within a layer, and 4) Analysis of dynamic weights for local losses.

A. CROSS VALIDATION
In the main paper, we discussed that addressing meta-overfitting is crucial to achieving improvement, and we integrate cross-validation into our optimization algorithm. There are two different ways to apply the cross-validation loop in the algorithm. One approach is to update the learner network and the meta-network in every fold, as in the algorithm in the main paper. This allows the meta-network to utilize all the training data, but each meta-network update may not be accurate since it is made immediately based on only one fold of data. As an alternative, we here propose a more accurate update scheme for the meta-network (denoted by CV²) in Algorithm 2: the algorithm accumulates the gradients over the folds and updates the meta parameters φ outside of the cross-validation loop, i.e., only once per iteration of the outer loop. Algorithm 2 then performs gradient averaging, exponential smoothing, and the learner update $\theta_{t+1} \leftarrow \theta_t - \alpha \frac{1}{n} \sum_{(x, y) \in B_{tr}} \nabla_{\theta} L_{total}(\theta \,|\, x, y, \tilde{v}_{t+1})$. In Table 6, we compare these two cross-validation algorithms (meta+CV, meta+CV²). Although the meta+CV² algorithm uses more accurate local loss weights, it takes K times more computation to update the learner network once. Since the two versions of CV do not show a big difference in performance, we employ the former cross-validation algorithm (meta+CV) in the main paper.
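A minimal sketch (ours) of the CV² variant: the meta gradient is averaged over the K folds and φ is updated once per outer iteration. It reuses the interface assumed in the sketch of Algorithm 1 above.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def meta_step_cv2(learner, meta_net, meta_opt, folds, inner_lr=0.1):
    """Accumulate the meta gradient over all K folds, then update phi once."""
    meta_opt.zero_grad()
    for (x_tr, y_tr), (x_me, y_me) in folds:
        params = dict(learner.named_parameters())
        logits, locals_ = functional_call(learner, params, (x_tr,))
        v = meta_net(locals_)
        l_total = F.cross_entropy(logits, y_tr) + (v * locals_).sum()
        grads = torch.autograd.grad(l_total, list(params.values()), create_graph=True)
        theta_tilde = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
        logits_me, _ = functional_call(learner, theta_tilde, (x_me,))
        # gradient averaging over the K folds
        (F.cross_entropy(logits_me, y_me) / len(folds)).backward()
    meta_opt.step()        # phi is updated only once per outer iteration
    learner.zero_grad()    # discard stray learner gradients before its own update
```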

B. EXPONENTIAL SMOOTHING
We briefly mentioned in the main paper another technique for stable optimization, namely exponential smoothing. To reduce the variance of the loss weights set by the meta-network, we propose a different estimation algorithm (denoted by R.avg) in Algorithm 2 with exponential smoothing. Our preliminary experiments show that exponential smoothing helps stabilize the optimization, but the improvement in final accuracy is limited; so for simplicity, the final version of our algorithm in the main paper does not include it. The smoothing is
$$\tilde{v}^{(l)}_t = \alpha\, \tilde{v}^{(l)}_{t-1} + (1 - \alpha)\, v^{(l)}_t,$$
where $\alpha \in [0, 1]$ is a hyperparameter, the smoothing factor that forms a convex combination of the previous weights $\tilde{v}^{(l)}_{t-1}$ and the current outputs $v^{(l)}_t$ of the meta-network, $t \in \{1, \ldots, T\}$ indexes the training iterations, and $l \in \{1, \ldots, L\}$ indexes the layers. For the first iteration, we use the raw outputs of the meta-network, i.e., $\tilde{v}^{(l)}_0 = v^{(l)}_0$. We believe that the exponential smoothing prevents drastic changes in the loss weights caused by a potentially meta-overfitted meta-network. The exponential smoothing corresponds to line 11 of Algorithm 2, and the ablation study for it is in Table 6.
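The smoothing step in isolation, as a tiny sketch; the assignment of α to the previous weights follows our reconstruction of the equation above.

```python
import torch

def smooth_weights(v_prev: torch.Tensor, v_curr: torch.Tensor, alpha: float) -> torch.Tensor:
    """Exponential smoothing of the local-loss weights:
    v_tilde_t = alpha * v_tilde_{t-1} + (1 - alpha) * v_t.
    For the first iteration, pass the raw meta-network output as v_prev."""
    return alpha * v_prev + (1.0 - alpha) * v_curr
```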

C. BALANCING LOSSES WITHIN A LAYER
We mentioned that a local loss $L^{(l)}_{local}$ attached to the $l$-th layer is defined as a convex combination of the two losses,
$$L^{(l)}_{local} = \gamma L^{(l)}_{sim} + (1 - \gamma) L^{(l)}_{pred},$$
where $\gamma \in (0, 1)$ is the ratio between the two losses. Similar to the weights $v^{(l)}$ for the local losses, the ratio $\gamma$ can also be learned by a meta-network. We conduct preliminary experiments with another meta-network denoted by HNet. In this setting, the weighted loss within a local loss block is given as
$$L^{(l)}_{local} = h^{(l)} L^{(l)}_{sim} + \big(1 - h^{(l)}\big) L^{(l)}_{pred},$$
where $h^{(l)}$ is the ratio generated by HNet. HNet is also trained by meta-learning, and the two meta-networks generate the weights $v^{(l)}$ and $h^{(l)}$ in this setting. We found that $h^{(l)}$ generated by HNet stays close to 0.99, the value used for $\gamma$ in Section III-A; the tendency of $h^{(l)}$ is shown in Figure 3. Since the dynamic weight $h^{(l)}$ from HNet does not improve performance compared to the manually set $\gamma = 0.99$, we use the constant $\gamma = 0.99$ in the final version to save computational cost.
In Table 7, the ablation study for the prediction loss shows that our framework with the combination of the prediction loss and similarity matching loss is more effective than our framework without the prediction loss.

TABLE 7. Ablation study for the prediction loss. We denote our framework without the prediction loss as 'w/o prediction loss' and our full framework as 'ours'.

D. DYNAMIC WEIGHT ANALYSIS
The meta-network of our framework generates the weights for the local losses at each iteration. Figure 4 shows that the dynamic weights for the layers change differentially. The overall tendency is that the weight of the local loss at the bottom block is larger than that of the top block at the early stage of training, and after the early stage, the weight of the top block becomes larger than the weights of the other blocks. At the last stage, all the weights converge to the uniform weights, i.e., 1/N. Different tasks show different lengths of stages, and the transitions happen at different iterations. A difficult task (e.g., CIFAR-100, Figure 4(b)) shows a longer early stage with a larger weight on the bottom block, whereas on an easy task (e.g., CIFAR-10, Figure 4(a)) our framework has a short early stage.

FIGURE 4. Dynamic weights for local losses. We observe that the weight of the bottom block is larger than that of the top block at the early stage. At the mid-stage, the weight of the top block is the largest, until all weights converge to the uniform weights. The length of the early stage differs depending on the task; training on CIFAR-10 in Figure 4(a) has a short early stage compared to the one on CIFAR-100 in Figure 4(b).

VI. CONCLUSION
We present a novel framework that dynamically balances local losses to optimize a global loss. The proposed method differentially trains important layers or sub-networks at different learning stages by leveraging local error signals. Experiments on benchmark datasets demonstrate that our training strategy provides consistent improvements over baselines without any additional data or architectural changes. The ablation study shows the effectiveness of our cross-validation inner loop and minibatch-wise dropout in addressing meta-overfitting. We also show that, in contrast to the baselines, our method successfully trains deep neural networks without skip-connections. We believe that our training strategy helps researchers explore diverse architectures that have been less studied due to poor optimization results.
We now discuss a few limitations of our method. To reduce the overhead of the local loss blocks, we attach them to only a subset of the hidden layers; we believe that the selection of hidden layers for local loss blocks can be further optimized and automated. Another limitation is that the computational cost slightly increases since our framework is trained via meta-learning. More advanced and efficient meta-learning techniques could be studied to reduce the computational overhead while maintaining competitive performance. We leave these directions for future work.