Resource-Efficient Multi-Task Deep Learning Using a Multi-Path Network

Multi-task learning (MTL) improves learning efficiency compared to its single-task counterpart by performing multiple tasks at the same time. Owing to this nature, it can achieve generalized performance and alleviate overfitting. However, it does not efficiently support resource-aware inference from a single trained architecture. To address this issue, we aim to build a learning framework that minimizes the cost of inferring tasks under different memory budgets. To this end, we propose a multi-path network with a self-auxiliary learning strategy. The multi-path structure contains task-specific paths in a backbone network, where a lower-level path predicts earlier with a smaller number of parameters. To alleviate the performance degradation from earlier predictions, a self-auxiliary learning strategy is presented. The self-auxiliary tasks convey task-specific knowledge to the main tasks to compensate for the performance leak. We evaluate the proposed method on an extensive set of multi-task learning scenarios, including multiple tasks learning, hierarchical learning, and curriculum learning. The proposed method outperforms existing multi-task learning competitors in most scenarios by a margin of about 1% ~ 2% accuracy on average while consuming 30% ~ 60% less computational cost.


I. INTRODUCTION
Deep learning has achieved great success in domains such as computer vision [1] and natural language processing [2]. A well-designed deep architecture is generally deployed for a single task [3], such as object detection [4], pose estimation [5], and image segmentation [6]. Despite the success in various fields, it is well known that this approach requires heavy computational resources, especially when we address multiple tasks. This generally stems from the common practice of specializing a network for a single task. If the number of tasks grows, the number of required parameters increases accordingly. As a result, given a large number of tasks to be performed, it becomes an intolerable burden for resource-constrained devices.
To overcome this limitation, one strategy is to learn tasks jointly, which is referred to as multi-task learning (MTL) [7]. MTL executes multiple tasks at the same time and is more efficient than its single-task counterpart. Because it exploits the knowledge of multiple different tasks, it can be employed in various fields [8]-[10], including natural language processing [11], [12]. Despite its practical benefits, there are some unavoidable issues. First, we may encounter task interference during training when different tasks share the same network. This is because, unlike in single-task learning, knowledge from multiple different tasks is learned using the same set of parameters, decreasing learning efficiency. Second, if a different computation budget is required, we usually need to define a new network, which introduces additional training effort. In other words, when we learn a network tailored to a specific budget, it is difficult to directly transfer the learned network to another one for a new budget. Therefore, it would be desirable to perform more flexible inference under different computation budgets within a single learned network.
Many existing works try to solve one of the above two challenges. Some recent works [13]-[15] have made efforts to reduce the negative impact between tasks to prevent performance degradation. [13], [16] propose selection modules so that relevant tasks are encouraged to share their features while irrelevant tasks are disentangled. [14] utilizes task-specific attention in a single shared backbone, and [15] reparameterizes existing convolution modules into shared and task-specific modules. However, the above works do not perform budget-aware inference when different budgets are required; thus, they need to produce multiple individual networks for different costs. Although [17] enables cost-aware inference by building a nested structure containing multiple networks of different sizes, it mainly focuses on a single task, not multiple tasks.
In this work, we develop a new network containing different inference paths, called a multi-path network, which learns multiple tasks simultaneously under different computation costs. The proposed architecture is structured hierarchically and consists of multiple paths at different hierarchy levels. The multiple paths correspond to internal networks of different sizes in the architecture (see Figure 1). A path at a lower-level (resp., higher-level) hierarchy enables earlier (resp., later) prediction with a smaller (resp., larger) number of parameters. The network possesses a nested structure such that the internal network of an earlier path is a subset of the internal networks of later paths. Thus, the knowledge of a lower-level path is shared by a higher-level path. This is different from the common practice of attaching multiple output branches at the end of a network [7]. We note here that some tasks are performed early with a partial set of parameters, and the other tasks can occupy the rest of the parameters. Consequently, tasks do not fully share the entire set of parameters, allowing different learning and inference flows and mitigating the strong influence (connection) between tasks. Besides, the multi-path network, containing internal networks of diverse sizes, can meet the requirements of different computation costs. Thus, it reduces the effort of designing and training additional networks from scratch. Note also that tasks of different natures can be efficiently addressed in the proposed approach; for instance, it can apply lower-level paths to easier tasks and higher-level paths to harder tasks. In addition, to avoid the performance degradation that can occur when predicting early with a small number of parameters and to boost the performance of all target tasks, a self-auxiliary learning strategy is introduced through knowledge distillation [18]. The multi-path network receives rich task-specific knowledge from the self-auxiliary tasks, which improves the representation of the target tasks.
Additionally, to our knowledge, there is no comprehensive study of multi-task learning from various views. Existing works in the multi-task learning literature have not been evaluated over a diverse set of scenarios, so their analyses might be limited. In this work, we study the proposed multi-task learning method under a wide range of scenarios, including multiple tasks learning, hierarchical learning, and curriculum learning (see Section IV). For those scenarios, we use three benchmark datasets: CIFAR-10, CIFAR-100 [19], and Celeb-A [20]. We show from the experiments that the proposed method performs better than other approaches in most scenarios while effectively minimizing the harmful interference between tasks. We also analyze the computation cost of the compared methods to verify that the proposed method is resource-efficient.

FIGURE 1. A graphical illustration of the proposed learning framework based on a multi-path network and self-auxiliary tasks. The network is composed of hierarchically constructed subnetworks and their inference paths, where the lower-level path (Path A) branches out at an earlier layer to perform Task A and the higher-level path (Path B) outputs at the end of the network to perform Task B. We also have auxiliary tasks corresponding to the main tasks, which are called self-auxiliary tasks. The self-auxiliary tasks (with their pre-trained networks) transfer task-specific knowledge to the main network to boost the performance of all tasks.
In summary, the main contributions of the proposed method are threefold:
• The proposed hierarchical structure can efficiently handle multiple tasks, including hierarchical and curriculum learning tasks.
• The proposed network contains multiple internal networks of different sizes and can address various computation budgets from a single training phase.
• The self-auxiliary learning strategy mitigates the performance leak by sharing task-specific information with the multi-path structure.
The organization of this paper is as follows. We introduce related work in multi-task learning in Section II. In Section III, we explain the proposed multi-path structure and self-auxiliary learning strategy. In Section IV, we demonstrate the effectiveness of our work in comparison with other methods. Finally, the conclusion of this work is discussed in Section V.

II. RELATED WORK
Multi-task learning. The goal of multi-task learning (MTL) is to learn multiple tasks jointly [7]. MTL can be classified into two main categories. The first category introduces multiple individual networks, in proportion to the number of tasks, with some constraints between the networks [21], [22]. [21] proposes a cross-stitch module that enables sharing task knowledge between individual networks. Likewise, [22] fuses the features from the different tasks to train them jointly. The second category learns multiple tasks using a single shared architecture [11], [17], [23]-[26]. [11], [23] utilize auxiliary tasks to enrich the feature representation. [24] suggests a soft-attention module for learning task-specific features. [25] attempts to hold important parameters for each task by iterative pruning. [17] constructs a resource-aware structure that compresses the network by splitting each channel into separate groups. Additionally, there are many studies that solve multi-task learning from an optimization point of view, such as Pareto optimal solutions [27], multi-objective optimization [28], task priority [29], and gradient surgery [30].
In this work, we are particularly interested in learning with a single shared structure since it is memory efficient. While most existing methods need to be trained from scratch if another computation budget is needed, the proposed method does not require extra training effort thanks to the internal networks in the proposed architecture. Besides, task interference can be mitigated in the proposed method by allowing different learning and inference flows for tasks.

Knowledge distillation. Knowledge distillation (KD) is a method that transfers knowledge from a large-scale (teacher) network to a small-scale (student) network [31]. The student network is trained to mimic the teacher network by minimizing the difference between their predictions at the last layer [18], intermediate layers [32], [33], attention maps [34], and so on. Knowledge distillation has been applied to face recognition [35], style transfer [36], and natural language processing [37], [38] to compress networks while maintaining performance. KD is used not only in single-task learning but also in multi-task learning [12], [39]. [12] suggests using KD when addressing multi-task learning in natural language processing. [39] employs KD to handle an imbalance problem while optimizing multiple losses.
Apart from the previous works, we deploy KD to boost performance in multi-task learning while achieving resource efficiency. To overcome the performance degradation that can arise from early predictions in the multi-path network, we distill knowledge from self-auxiliary tasks with pre-trained networks to learn the subnetworks in the proposed architecture.

III. METHOD
The proposed method aims to provide predictions at different computation costs from a single trained architecture while alleviating the destructive interference that can arise when learning different tasks. To achieve this goal, we propose a learning framework based on a multi-path structure with a self-auxiliary learning strategy. The multi-path structure performs multiple tasks under different paths. Self-auxiliary learning assists the main tasks to compensate for the performance loss of earlier paths. We introduce the multi-path network in Section III-B and the self-auxiliary learning strategy in Section III-C. Additionally, the notations used in this work are listed in Table 1.

A. MULTI-TASK LEARNING
Multi-task learning (MTL) trains multiple tasks jointly, which can improve learning efficiency and generalization performance. In this work, we define a task as learning an attribute of a sample [40], [41], learning each level of a hierarchically constructed dataset (in hierarchical classification) [17], [42], or learning a group of samples of different difficulty (in curriculum learning) [43], but the definition is not limited to these.

TABLE 1. Notations used in this work.
W_task^(i): set of task-specific parameters of the i-th task.
W_main^(i): set of parameters of the i-th task assigned to the main task.
W_aux^(i): set of parameters of the i-th task assigned to the auxiliary task.
L_main^(i): loss of the i-th task for the main task.
L_aux^(i): loss of the i-th task for the auxiliary task.
In the conventional MTL method [7], the set of datasets is denoted as {D^(i)}_{1≤i≤t}, where t is the number of tasks and i ∈ {1, 2, ..., t} is the task index. For the i-th task, the dataset D^(i) consists of the set of data samples X^(i) and the corresponding labels Y^(i). The set of network parameters W is a collection of the shared parameters W_sh and the task-specific parameters {W_task^(i)}. Given a collection of tasks, the loss function L is defined as follows:

$\mathcal{L} = \sum_{i=1}^{t} H\big(g(X^{(i)}; W_{sh}, W^{(i)}_{task}),\, Y^{(i)}\big)$,  (1)

where g(·) denotes the shared network and H(·) denotes a task loss (e.g., cross entropy). Note that the task-specific classifiers for multiple tasks occupy a small number of parameters in W. There are studies [24], [25] considering how to distribute the parameters of each task within W_sh to overcome the limitation of learning with a single fixed architecture. [24] introduces feature-level attention modules applied to the shared parameters for different tasks. In [25], the shared parameters are divided into task-specific disjoint groups. However, [24] requires an additional number of parameters, and it may be difficult for [25] to address many tasks because all tasks are learned independently.
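As a reference point, the following is a minimal PyTorch sketch of the conventional hard-parameter-sharing formulation in Eq. (1); the layer sizes, module names, and task heads are illustrative assumptions rather than the exact architecture used in this paper.

```python
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    """Conventional MTL: one shared trunk (W_sh) and one small head per task ({W_task^(i)})."""
    def __init__(self, num_classes_per_task):
        super().__init__()
        self.shared = nn.Sequential(                       # W_sh, used by every task
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.heads = nn.ModuleList(                        # task-specific classifiers {W_task^(i)}
            nn.Linear(32, c) for c in num_classes_per_task)

    def forward(self, x, task_id):
        return self.heads[task_id](self.shared(x))

def mtl_loss(model, task_batches):
    """Eq. (1): sum of per-task losses computed through the shared network."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(model(x, i), y) for i, (x, y) in enumerate(task_batches))

# Example: two tasks with 2 classes each on 32x32 RGB inputs.
model = HardSharingMTL([2, 2])
out = model(torch.randn(4, 3, 32, 32), task_id=0)   # logits for task 0
```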

B. MULTI-PATH NETWORK
Multi-task learning generally addresses different tasks within a single shared architecture, where the mixed knowledge of different tasks can negatively influence one another. Moreover, when different memory budgets or networks of different sizes are required, it is common to define individual networks corresponding to the requirements. To handle both problems, the proposed method is built on a multi-path structure (see Figure 2). The structure contains multiple different paths constructed hierarchically in a way that an earlier path entails a smaller number of parameters and a later path requires a larger number of parameters. Note that the hierarchical structure of the multi-path network indicates a nested network structure in such a way that an internal network corresponding to an earlier prediction path shares its parameters (and knowledge) with other internal networks corresponding to later prediction paths. Hence, different inference paths from the multi-path network may avoid negative task interference because tasks do not fully share the entire network (i.e., partial sharing can reduce the chance of interference due to different learning and inference flows). Besides, due to the different sizes of internal networks, the proposed method can handle diverse computation requirements.
The set of parameters for the i-th path (or the i-th internal network) is W_main^(i). The set of entire parameters of the multi-path network, W_main, is the same as the parameter set of the last (t-th) path, and the internal networks are nested:

$W^{(1)}_{main} \subset W^{(2)}_{main} \subset \cdots \subset W^{(t)}_{main} = W_{main}$.  (2)

Given the proposed network, the loss function for the i-th task is defined as

$\mathcal{L}^{(i)}_{main} = H\big(f(X^{(i)}; W^{(i)}_{main}),\, Y^{(i)}\big)$,  (3)

where H(·) denotes a loss function (e.g., cross entropy) and f(·) is the proposed multi-path network. By learning the multi-path network, we can perform tasks under different paths (and different computation costs). Moreover, if the overall structures of the networks g and f are the same, the number of parameters for a task in the conventional MTL method is equal to the number of parameters for the last task in the proposed method. Thus, the proposed method can contain existing multi-task learning approaches sharing a single network by adjusting the branches. It is a more general framework that covers existing MTL approaches and can produce subnetworks of lower computation costs.
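As a rough sketch of the nested multi-path structure behind Eqs. (2)-(3), the toy module below runs only the first stages of the backbone for a lower-level path and all stages for the highest-level path, so the parameter sets of the paths are nested. The stage widths, branch positions, and stage count are illustrative assumptions, not the ResNet/MobileNetV2 configurations used in the experiments.

```python
import torch
import torch.nn as nn

class MultiPathNet(nn.Module):
    """Nested multi-path network: path i predicts task i after the first i+1 stages."""
    def __init__(self, num_classes_per_task, widths=(32, 64, 128)):
        super().__init__()
        chans = [3] + list(widths)
        # Stage j is shared by every path with index >= j (nested parameter sets W_main^(i)).
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(chans[j], chans[j + 1], 3, stride=2, padding=1),
                          nn.BatchNorm2d(chans[j + 1]), nn.ReLU())
            for j in range(len(widths)))
        # One lightweight classifier branch per path.
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(widths[i], c))
            for i, c in enumerate(num_classes_per_task))

    def forward(self, x, path):
        for stage in self.stages[:path + 1]:   # lower paths run fewer stages
            x = stage(x)
        return self.branches[path](x)

# Path 0 is the cheapest (earliest) prediction; path 2 uses the whole backbone.
net = MultiPathNet(num_classes_per_task=[2, 2, 2])
logits_easy = net(torch.randn(4, 3, 32, 32), path=0)
```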

C. SELF-AUXILIARY LEARNING
We adopt knowledge distillation [18] to transfer knowledge from one network to another (see Figure 2). Since early prediction with fewer parameters may incur performance degradation, the presented self-auxiliary learning strategy is designed to overcome this problem. Assume that we have pre-trained networks g^(i) for the self-auxiliary tasks (denoted as Aux i in Figure 2), obtained before learning the multi-path network. In the proposed method, the multi-path network receives task-specific knowledge from the pre-trained networks using the distillation loss L_aux^(i) as follows:

$\mathcal{L}^{(i)}_{aux} = \mathrm{KL}\Big(u\big(g^{(i)}(X^{(i)}; W^{(i)}_{aux})/T\big) \,\big\|\, u\big(f(X^{(i)}; W^{(i)}_{main})/T\big)\Big)$,  (4)

where KL(·||·) denotes the Kullback-Leibler divergence, u(·) is the softmax function, and W_aux^(i) denotes the set of parameters of the i-th self-auxiliary task. The temperature T controls the amount of transferred knowledge. We adopt self-auxiliary networks g^(i) with the same capacity as the main network f. At inference time, the auxiliary networks are not used, meaning that the self-auxiliary learning tasks do not require additional parameters during inference.
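A minimal sketch of the distillation term in Eq. (4) is given below, following the standard temperature-scaled KL formulation of [18]; whether the usual T^2 gradient rescaling is applied is not stated in the paper, so it is omitted here.

```python
import torch.nn.functional as F

def self_aux_loss(student_logits, teacher_logits, T=2.0):
    """Eq. (4)-style loss: KL( u(teacher/T) || u(student/T) ).
    teacher_logits come from the frozen pre-trained auxiliary network g^(i);
    student_logits come from the corresponding path of the multi-path network f."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)          # softened teacher distribution
    log_p_student = F.log_softmax(student_logits / T, dim=1)  # softened student log-probs
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```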

D. TOTAL LOSS
The loss of each task is computed by combining the losses from the multi-path network (Eq. (3)) and self-auxiliary learning (Eq. (4)). The multi-path network learns the set of entire parameters W_main by minimizing the total loss function across all tasks, which is defined as

$\mathcal{L}_{total} = \sum_{i=1}^{t} \big(\mathcal{L}^{(i)}_{main} + \alpha\, \mathcal{L}^{(i)}_{aux}\big)$,  (5)

where α is a balancing factor between the two losses L_main^(i) and L_aux^(i). The parameters are updated by gradient descent, W_main ← W_main − η ∂L_total/∂W_main, where η is the learning rate.
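Combining Eqs. (3)-(5), one possible training step for W_main could look like the sketch below; it assumes the `MultiPathNet` and `self_aux_loss` sketches above and a list of frozen per-task teacher networks, which are illustrative stand-ins for the actual implementation.

```python
import torch
import torch.nn as nn

def train_step(net, teachers, task_batches, optimizer, alpha=1.0, T=2.0):
    """One update of W_main with L_total = sum_i (L_main^(i) + alpha * L_aux^(i)), Eq. (5)."""
    ce = nn.CrossEntropyLoss()
    total = torch.zeros(())
    for i, (x, y) in enumerate(task_batches):        # one mini-batch per task
        logits = net(x, path=i)                      # Eq. (3): prediction from the i-th path
        with torch.no_grad():
            teacher_logits = teachers[i](x)          # frozen self-auxiliary network g^(i)
        total = total + ce(logits, y) + alpha * self_aux_loss(logits, teacher_logits, T)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()                                 # W_main <- W_main - eta * dL_total/dW_main
    return float(total)
```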

IV. EXPERIMENTS

A. SETUP
Datasets. We used five datasets: CIFAR-10 and CIFAR-100 [19], split CIFAR-10, split CIFAR-100, and Celeb-A [20]. CIFAR-10 and split CIFAR-10 contain 50,000 training and 10,000 test images of size 32×32. Split CIFAR-10 originates from CIFAR-10, but its classes (and the corresponding samples) are partitioned into several groups (see below for more details). CIFAR-100 and split CIFAR-100 have 50,000 training and 10,000 test images at the same resolution as CIFAR-10. We constructed split CIFAR-100 similarly to split CIFAR-10. Celeb-A is a face image dataset composed of 40 attributes and has 162,770 training and 39,829 test images at 218×178 resolution.
Scenarios. We constructed a range of scenarios to analyze the method from diverse perspectives, including multiple tasks learning, hierarchical learning, and curriculum learning. Multiple tasks learning handles different classes (tasks), and the corresponding datasets are split CIFAR-10 and split CIFAR-100. We split CIFAR-10 and CIFAR-100 into five disjoint tasks so that each task includes 2 and 20 consecutive classes, respectively. In addition, to show the robustness against task order in multi-task learning, we conducted experiments with randomly produced task orders on split CIFAR-100. In hierarchical learning, we learned both coarse- and fine-class tasks on CIFAR-100, which consists of two levels of hierarchy: 20 coarse-classes and 100 fine-classes. To introduce another hierarchical level of the dataset, in this work, we further grouped the 20 coarse-classes into 3 super-classes. Accordingly, we renamed the coarse-class as the intermediate-class and the fine-class as the sub-class. Since the coarse-class task includes less information than the fine-class task, the internal network with fewer parameters is assigned to the coarse-class task. In curriculum learning, we presume that the number of samples indicates the difficulty of a task, because it is well known that learning becomes easier (resp., harder) when many (resp., few) samples are given. To conduct this scenario, we allocated easy, normal, and hard tasks (attributes) from Celeb-A (see Figure 5). We used a subset of the dataset containing eight attributes out of 40 to ease the learning procedure while almost preserving the distribution of the original dataset.
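As one concrete way of constructing the split CIFAR-10 tasks described above (five disjoint groups of two consecutive classes each), the torchvision-based sketch below filters the training set by class label; the exact split utilities used by the authors are not specified, so this is only an illustration.

```python
from torch.utils.data import Subset
from torchvision import datasets, transforms

def split_cifar10_tasks(root="./data", num_tasks=5):
    """Partition CIFAR-10 into `num_tasks` disjoint tasks of consecutive classes."""
    tfm = transforms.ToTensor()
    full = datasets.CIFAR10(root, train=True, download=True, transform=tfm)
    classes_per_task = 10 // num_tasks            # 2 classes per task for CIFAR-10
    tasks = []
    for t in range(num_tasks):
        keep = set(range(t * classes_per_task, (t + 1) * classes_per_task))
        idx = [i for i, y in enumerate(full.targets) if y in keep]
        tasks.append(Subset(full, idx))           # task t contains only its own classes
    return tasks
```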
Compared methods. We compared six algorithms in the experiments: the proposed method, Hard parameter sharing [7] as a baseline, NestedNet [17], PackNet [25] with batch normalization, MTAN [24], and AdaShare [44]. Similar to ours, most of the compared methods do not increase the size of the network regardless of the number of tasks.
While NestedNet updates the set of parameters up to the i-th task, PackNet does not update the parameters allocated to the previous tasks. MTAN performs task-specific feature-level attention, and AdaShare suggests a task-adaptive sharing approach that determines which layers to share between tasks while minimizing resource usage. We used the same settings, such as the network configuration and optimizer, for all the compared approaches and trained them from scratch.

B. IMPLEMENTATION DETAILS
We built our method on ResNet and MobileNetV2 backbone architectures. While ResNet was used for the CIFAR datasets, MobileNetV2 (with multiplier 0.5) was deployed for ImageNet. Note that we did not downsize the CIFAR images at the input layer for MobileNetV2, since the size of CIFAR is much smaller than that of ImageNet.

TABLE 2. Results of the multiple tasks learning scenario for split CIFAR-10 and CIFAR-100 on ResNet-18. Classification accuracy, the ratio of the number of parameters, and FLOPs of the compared methods are provided. Bold font shows the best accuracy or the least computation cost for each dataset, and underline gives the second best accuracy or the second least computation cost.

We applied the SGD optimizer with Nesterov momentum of 0.9 for both ResNet and MobileNetV2. The proposed network was trained with an initial learning rate of 0.1, and the learning rate was decayed by a factor of 10 when the training loss converged. Batch sizes of 128 for CIFAR-10 and CIFAR-100 and 64 for Celeb-A were used in all experiments. For the proposed method, we constructed five branches in the multiple tasks learning scenario and three branches in the other scenarios. The branches take 25%, 36%, 58%, 79%, and 100% of the number of parameters in the multiple tasks learning scenario, and around 30%, 75%, and 100% of the number of parameters in the other scenarios, respectively. The proposed method was implemented with the PyTorch library [45].
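For reference, the optimizer setup described above (SGD with Nesterov momentum 0.9 and an initial learning rate of 0.1, decayed by a factor of 10 when the training loss converges) could be approximated as follows; the plateau patience is an assumption, since the exact convergence criterion is not stated.

```python
import torch

def make_optimizer(net):
    """SGD with Nesterov momentum as described above; the scheduler approximates
    the 'decay by 10x when the training loss converges' rule with a plateau criterion."""
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1,
                                momentum=0.9, nesterov=True)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=10)
    return optimizer, scheduler

# Usage per epoch: scheduler.step(epoch_training_loss)
```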

C. MULTIPLE TASKS LEARNING
For the first multiple tasks learning scenario, we used split CIFAR-10 and split CIFAR-100. We partitioned the classes into five disjoint groups for both CIFAR-10 and CIFAR-100, where each group has 2 successive classes for CIFAR-10 and 20 successive classes for CIFAR-100. Accordingly, we assigned the groups (tasks) to the branches in order of class labels. We compared with Baseline and the major competitors sharing a similar idea to ours: PackNet, NestedNet, MTAN, and AdaShare. For self-auxiliary learning, we set the temperature T to 2, and α was set to 1 to distill knowledge from the self-auxiliary tasks to the main tasks. Table 2 shows the results of the scenario for three measures: accuracy, the number of parameters, and FLOPs. First, for split CIFAR-10, Baseline and MTAN show good performance, but their computation cost (the number of parameters and FLOPs) is not efficient. Overall, the proposed method is comparable to the other methods while consuming a smaller number of parameters. For split CIFAR-100, which is more challenging than split CIFAR-10 due to the larger number of classes, the proposed method shows similar performance to Baseline, MTAN, and AdaShare. When it comes to the computation cost, the compared methods have twice as many parameters as the proposed method on average. Note that there is no significant performance drop in our method compared to them (1% ∼ 2% drop on average). The self-auxiliary learning strategy enriches the task-specific knowledge of the multi-path architecture. Without the self-auxiliary learning strategy, however, a performance leak can occur since the tasks predicted early consume only a subset of the parameters. Consequently, the proposed method can maintain performance while requiring less computation.
Furthermore, we demonstrated the proposed method under different randomly shuffled task orders, as shown in Figure 3. Note that the number of parameters used to perform tasks in our multi-path network does not change from the split CIFAR-100 experiment above. The results in the figure indicate that even if the task order is different, the proposed method performs better than the other competitors, PackNet and NestedNet, with a smaller number of parameters. In particular, the proposal achieved the highest accuracy for four out of five tasks in these experiments, showing its excellence and robustness against task order.

D. HIERARCHICAL LEARNING
In this scenario, we learn the class hierarchy from the coarsest to the finest level in the proposed multi-path network. We used CIFAR-100 and constructed three levels of hierarchy (super-, intermediate-, and sub-classes) as mentioned earlier (see Table 4 for more detail). We learn the super-classes with the lowest-level path, the intermediate-classes with the intermediate-level path, and the sub-classes with the highest-level path, respectively, so that the network learns the hierarchical structure of the dataset. The proposed approach was compared with Baseline, NestedNet, PackNet, MTAN, and AdaShare, which can address class hierarchy. In this scenario, we used the same backbone architectures as in the previous scenario. We set the temperature T to 2 and α to 1.

TABLE 3. Results of the hierarchical scenario using two backbones: (a) ResNet-18 and (b) MobileNetV2. We also provide computation information such as the ratio of the number of parameters. Bold indicates the best accuracy or the least computation; underline indicates the second best accuracy or the second least computation.

Table 3 shows the classification accuracy and computation cost of the compared methods for the super-, intermediate-, and sub-classes of the CIFAR-100 dataset. As we can see in Table 3, the proposed method outperforms the other methods on ResNet-18 and on MobileNetV2 with multiplier 0.5. Interestingly, although the competitors require a larger number of parameters than the proposed method, the performance of ours is higher than that of the other approaches. We also performed another experiment to demonstrate the effectiveness of the self-auxiliary task: "ours w/o distillation" in Table 3 denotes the accuracy of the proposed method, with respect to different numbers of parameters, without applying distillation (described in Section III-C). The proposed method without distillation does not suffer from a significant performance drop compared to ours with distillation. The proposal without distillation still gives competitive performance compared to the other methods on both architectures while consuming a smaller number of parameters. In addition, Table 3 reports that the proposed method without self-auxiliary learning achieves the best performance on MobileNetV2 compared to the other methods. The results indicate that the proposed hierarchical structure enables learning the hierarchical knowledge better than the compared methods: at a lower-level path, the coarser knowledge is learned, and at a higher-level path, richer knowledge is attained using the coarser knowledge, resulting in a performance gain.

E. CURRICULUM LEARNING
In the curriculum learning scenario, we experimented on Celeb-A, which has attributes (tasks) of different distributions. For this scenario, we made three tasks according to difficulty: easy, normal, and hard tasks, where the difficulty is determined by the number of samples per task, as described in the setup. Accordingly, the proposed multi-path network contains three branches such that we performed the easy tasks in the early branch, the normal tasks in the middle branch, and the hard tasks in the last branch, respectively. The temperature and α were set to 2 and 0.9 for both backbones.
The results of the curriculum learning scenario are summarized in Figure 4. For ResNet-18, the proposed method achieves the best or second-best performance in six tasks out of eight, and for MobileNetV2, the proposed method maintains the best or second-best accuracy among all tasks. Multiple inference paths for tasks of different difficulty allow the proposed method to distribute the importance of parameters for each task, which can reduce the negative interference between tasks and thus yields better results than the compared methods. When it comes to parameter usage, our approach is superior to Baseline, PackNet, NestedNet, MTAN, and AdaShare on average. Ours requires less than half the number of parameters of Baseline, MTAN, and AdaShare in the easy task on both architectures. Figure 4 reports the ratio of parameters and the average accuracy in the curriculum learning scenario. As shown in Figure 4 (a), the proposed method requires the smallest number of parameters, while MTAN consumes the largest number of parameters among the compared methods. PackNet and NestedNet also use a smaller amount of parameters compared to Baseline. However, they require a larger number of parameters compared to ours while showing poorer performance, as shown in Figure 4. Figure 5 shows the accuracy of individual tasks. It can be seen that the performance of ours is similar or superior to most methods on both architectures. Note that the required resources are reported in Figure 4 (a).

V. CONCLUSION
In this work, we have proposed a multi-task learning framework to resolve the combined problem of performing tasks under different computation costs and avoiding task interference. The proposed method is realized by a multi-path network with self-auxiliary learning. The multi-path network is constructed hierarchically to perform tasks under different levels of hierarchy (or prediction paths), resulting in the mitigation of negative interference. Furthermore, since the multi-path structure is composed of internal networks of different sizes, diverse memory budgets can be handled without extra training effort. To boost the performance of the network, self-auxiliary learning has been presented, which supplements the main tasks with task-specific knowledge. The proposed method has been demonstrated under an extensive set of experimental scenarios and has shown its efficiency with respect to both task performance and computational cost compared to other multi-task learning approaches. In future work, we will investigate the scalability of the method for more challenging computer vision tasks beyond classification.