A Novel Orthogonality Loss for Deep Hierarchical Multi-Task Learning

In this paper, a novel loss function is proposed to measure the correlation among different learning tasks and to select useful feature components for each classification task. First, a knowledge map is used to organize the affiliation relationships between objects in the natural world. Second, a novel loss function, the orthogonality loss, is proposed to make the deep features more discriminative by removing useless feature components. Furthermore, to prevent the extracted features from becoming too divergent and causing over-fitting, which would reduce network performance, an orthogonal distribution regularization term is added to constrain the distribution of network parameters. Finally, the proposed orthogonality loss is applied in a multi-task network structure to learn more discriminative deep features and to evaluate the validity of the proposed loss function. The results show that, compared with a traditional deep convolutional neural network and a multi-task network without the orthogonality loss, the multi-task network based on the orthogonality loss performs significantly better on image classification.


I. INTRODUCTION
In recent years, image classification [1]-[8] has become more and more widely used in field exploration and daily life owing to the rapid development of deep learning, and it also receives much attention in optical machine learning, such as [9] and [10]. Currently, the best tool for feature extraction is the deep convolutional neural network. Deep convolutional neural networks [3]-[8] not only extract edge information in shallow layers but also learn high-level feature representations, which become more abstract and closer to human cognitive behavior as the semantic information deepens. The multi-task network [11]-[14], derived from deep learning, has gradually attracted attention. The different tasks of a multi-task network assist one another and are trained at the same time, but each task has its own independent loss function. However, in a multi-task network, the joint training of the hidden-layer parameters cannot make feature extraction and classification completely matched, and useless feature components may have a negative influence on the final classification result. Therefore, to make the extracted features match each classification task, we must first solve the following problems.
The associate editor coordinating the review of this manuscript and approving it for publication was Jun Wang.
The first problem is how to guide the classification task at each level to assist the other classification tasks. Imitating human learning experience, identifying thousands of object classes in the real world is a gradual process from coarse level to fine level. When they are young, people may identify only coarse classes, such as birds, cars, and plants. As the brain matures, the identifiable targets are refined in the process of learning common sense. For example, birds include parrots, sparrows, etc., and vehicles include buses and cars. The recognition of the coarse-grained genera is considered the high-level task, and under each genus there are many subtasks that identify fine-grained classes. Therefore, the knowledge [15]-[17] acquired in the high-level classification task can be transferred to the new task of identifying fine-grained species, which helps to separate fine-grained object classes. At the same time, learning the new tasks leads to a better understanding of the high-level classification task. The multi-task scheme constructed in this paper simulates the human learning process, using deep convolutional neural networks as the hidden layers to simulate the human brain for feature extraction. In addition, a hierarchical tree classifier [18]-[20] is leveraged as the task-related output layer for progressive classification; the interrelated classification steps, from easy to difficult, constitute the different learning tasks. Based on these progressive relationships, this paper establishes a knowledge map of the database to guide network learning during training. Just as humans learn from books or predecessors, this additional information guides the classification tasks to assist each other.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The second problem is that the classifier and the feature extraction network are not completely matched. In the multi-task network structure, the feature representation is shared by the multi-level tree classifiers because the weight parameters in the hidden layers are shared. However, the classification tasks at different levels of the task-related output layer (the hierarchical tree classifier) do not need exactly the same features. For example, when identifying coarse-grained object classes, recognizing a steering wheel or wheels helps the network accurately identify the car category. When identifying fine-grained object classes, however, these common features tend to place SUVs and vans in the same category, and we expect the network to pay more attention to their unique feature components, such as appearance and shape, which can be used to distinguish them. Therefore, when performing different classification tasks in the multi-task network, we need to distinguish the common feature components from the unique feature components. To this end, a novel loss function, the orthogonality loss, is proposed for feature selection in the multi-level hierarchical classifier. The orthogonality loss uses the cosine similarity from metric learning [21]-[23] to measure the similarity of multi-level deep features. When a vector is decomposed, with respect to another vector, into a projection component and an orthogonal component, the length of the projection component relative to the length of the original vector gives the magnitude of the cosine similarity. If the cosine similarity between the coarse-level and fine-level deep feature vectors approaches 0, the two vectors are orthogonal, and the overlapping features (projection components) are filtered out, leaving only the useful feature components of each (orthogonal components). This allows the proposed network structure to obtain more discriminative deep features for the different classification tasks.
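The decomposition into a projection component and an orthogonal component can be illustrated with a minimal sketch. The helper names (`dot`, `cosine_similarity`, `decompose`) and the two-dimensional example vectors are assumptions made for illustration; they are not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both assumed nonzero)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def decompose(u, v):
    """Split u into a component along v and a component orthogonal to v."""
    scale = dot(u, v) / dot(v, v)
    projection = [scale * b for b in v]
    orthogonal = [a - p for a, p in zip(u, projection)]
    return projection, orthogonal


# Hypothetical fine-level and coarse-level feature vectors.
fine = [3.0, 4.0]
coarse = [1.0, 0.0]

projection, orthogonal = decompose(fine, coarse)
similarity = cosine_similarity(fine, coarse)
```

In this toy case the projection is the shared part of the two features and the orthogonal remainder is the part unique to the fine feature; the cosine similarity equals the length of the projection divided by the length of `fine`.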
The last problem is how to prevent the network from overfitting. After feature selection is added to the loss function, the features may become too divergent, which causes overfitting. To prevent this, an orthogonal distribution regularization term is added to the orthogonality loss to constrain the feature distribution, so that the partial distribution of the deep features under the same learning task is required to be close to the overall distribution. This regularization is necessary because overfitting may prevent the network from reaching the global optimum.
Based on the above discussion, the orthogonality loss proposed in this paper is used in a multi-task network to select useful feature components for the different classification tasks and thereby improve the accuracy of image recognition. The rest of the paper is arranged as follows: Section II introduces the related work, Section III introduces the network structure and the orthogonality loss, Section IV presents the experimental results and explanations, and Section V draws the conclusions.

II. RELATED WORK
To extract rich and vivid deep features, a deep convolutional neural network is generally trained. With the advancement of deep learning theory [1]-[8] and hardware since the earliest time-delay neural network (TDNN) [1] and LeNet-5 [2], convolutional neural networks have developed rapidly and have been widely used in computer vision [24], natural language processing [25] and optical imaging [26], [27]. A deep convolutional neural network is a deep network structure with convolution operations and representation learning ability, consisting of convolutional layers, pooling layers, nonlinear activation layers and fully connected layers. This kind of network extracts invariant features through parameter sharing and sparse connections between convolution kernels. Among these networks, Resnet [7] is widely used because of its significant advantages in recognition accuracy and computational cost. However, the N-way softmax classifier cannot account for the imbalance of similarity between classes.
With the development of classification technology [28]-[36], the application of multi-task networks [37]-[40] in recent years has helped to address this problem. The most common way of applying multi-task learning to deep convolutional neural networks is the hidden-layer parameter sharing mechanism first proposed in [41], in which the hidden-layer parameters are shared but the task-related output layers are independent. This technique has developed rapidly and has been widely used in natural language processing [37], facial landmark detection [39] and object detection [42]. In the multi-task network of [43], a deep convolutional neural network serves as the hidden layers for parameter sharing and extracts the high-level representation of the object image, and a tree classifier is used as the task-related output layer for the different classification tasks. Moreover, the label tree construction method of [13] and the ontology tree of [44] can guide the tree classifier for multi-task classification.
However, guiding multi-task classification through a knowledge map in a multi-task network [45] is the approach closest to human behavior. The knowledge map [46], a novel knowledge structure and retrieval technology in the era of big data, has gradually revealed its advantages in various aspects and has received extensive attention. It was originally proposed by Google in 2012 to improve the capabilities of search engines [46]. Moreover, its powerful semantic processing ability makes it one of the key technologies in the development and application of artificial intelligence [47]. In a multi-task network, the knowledge map is applied to construct a two-layer semantic structure [19] representing the relationships between objects in the real world, which guides the transmission of information between the different classification tasks of the multi-task network and guides backpropagation for gradient updates.
The orthogonal transformation of images is widely used in image feature extraction [48], image enhancement [49], image restoration [50] and image classification [51]. The orthogonal matching pursuit (OMP) algorithm [52] has also been applied successfully in image fusion, and the Go-CNN network of [53] can learn the foreground and background of the features. Therefore, orthogonal transforms can be used to extract and distinguish image features. In addition, geometrically, the projection of one vector onto another is their common part, and the orthogonal component is the unique part. If two nonzero feature vectors are orthogonal, they are linearly independent and share no common component. Therefore, orthogonal transforms can also be utilized for feature selection. To achieve end-to-end learning, a loss function is usually used to update the network parameters so as to reduce the gap between the predicted value and the true value. Classic losses include the hinge loss (used for SVMs) [54], the softmax cross-entropy loss (for classification and feature extraction tasks) [55], and the contrastive loss (proposed by LeCun for the Siamese network) [56]. The most fundamental criterion for a loss function is that it expresses the ultimate goal of the model. Therefore, network performance can be improved by optimizing the loss function.
Based on these observations, this paper proposes a loss function optimization method for multi-task networks. First, the softmax loss of each classification task is preserved. On this basis, an orthogonality term is added to perform feature selection, so that the feature vectors of each sub-classifier and of its parent-node classifier are orthogonal and the extracted features are relatively independent. Orthogonal distribution regularization is also added to constrain the distribution of the features, so that the overall distribution of the feature vectors of each sub-task stays close to the distribution of its parent task. As a result, the orthogonality loss yields, through feature selection, a multi-task network with higher classification accuracy and better robustness.

A. KNOWLEDGE MAP
During training, completely ignoring the similarity between different classes makes it difficult to reach the global optimum. Knowledge maps are widely used in large-scale classification tasks because they can efficiently organize large numbers of object classes in a coarse-to-fine fashion. In this paper, based on the taxonomic knowledge of object classes in the real world, Fashion-60 [43] and Caltech-UCSD Birds-200-2011 [57] are each organized into a two-layer semantic structure consisting of coarse-grained genera and fine-grained classes. The Fashion-60 database contains 60 classes of clothes (including dress, shoes, etc.). The two-layer knowledge map constructed with reference to the functional relationship of each item is shown in Fig. 1: according to the function of each item, the 60 fine classes, which represent 60 specific classes of clothing, are assigned to 5 coarse-grained genera. For the Caltech-UCSD Birds-200-2011 database, a knowledge map of 200 species of birds, constructed with reference to the natural taxonomic relationships of birds, is shown in Fig. 2; it contains 10 coarse-grained genera, representing 10 kinds of birds, and 200 fine classes, representing the specific classes of birds under each genus. The knowledge map, consisting of this two-layer semantic structure, is used to guide the hierarchical tree classifier in the multi-level classification tasks. Each tree structure constitutes a learning task; a tree classifier is constructed to take the inter-class relations into account, and this knowledge-map-guided tree classifier replaces the traditional softmax classifier. The proposed objective function helps to efficiently update the weight parameters of both the classifier and the base deep network, making the gradient distribution under the same task more uniform.
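A two-layer knowledge map of this kind can be represented as a simple mapping from each coarse-grained genus to its fine-grained classes. The sketch below is only illustrative: the class names and the grouping are hypothetical and do not reproduce the actual maps of Fig. 1 or Fig. 2.

```python
# Hypothetical two-layer knowledge map: coarse genus -> fine classes.
# The names and grouping are placeholders, not the paper's actual map.
KNOWLEDGE_MAP = {
    "shoes": ["slipper", "boot"],
    "dress": ["evening_dress", "sundress"],
}


def build_parent_index(knowledge_map):
    """Invert the map so that every fine class points to its single parent."""
    parent_of = {}
    for coarse, fine_classes in knowledge_map.items():
        for fine in fine_classes:
            if fine in parent_of:
                # A tree structure requires exactly one parent per leaf.
                raise ValueError(f"{fine!r} is assigned to more than one genus")
            parent_of[fine] = coarse
    return parent_of


parent_of = build_parent_index(KNOWLEDGE_MAP)
leaf_counts = {coarse: len(fines) for coarse, fines in KNOWLEDGE_MAP.items()}
```

The inverted index gives the coarse label for any fine label, which is what a tree classifier needs in order to route an image from the coarse task to the matching fine task.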

B. ORTHOGONALITY LOSS
In this paper, multi-task classification is divided into two classification tasks. The classifier at each level corresponds to a different task, so the deep features each requires should be very different. According to the knowledge map, when dealing with the coarse-level classification task, the network is expected to pay more attention to the feature components common to all the fine-grained classes under the same learning task and to ignore their unique feature components; a specific classifier is then trained to recognize those fine-grained classes. The fine classification tasks are more difficult, because the similarity between fine classes under the same coarse class is much higher. Therefore, we expect the network to focus on the unique feature components of each fine class and to ignore their common feature components, so that the visual similarity between classes is suppressed and the network has a better chance of distinguishing the images.
Based on the multi-task classification network, an orthogonality loss is proposed in this paper for feature selection. The orthogonality loss performs feature selection as follows: an image is randomly selected from the training set for feature extraction, and the features are then fed to the different task-related output layers. Under the guidance of the loss functions of the different classification tasks, the features of the task-related output layers are divided into coarse-level deep features and fine-level deep features. The coarse-level deep features are used for the coarse classification task, which determines the coarse class to which the image belongs; the fine-level deep features are used for the fine classification task, which determines the fine class, under that coarse class, to which the image belongs. As shown in Fig. 3, the spatial projection of the coarse classifier feature and the fine classifier feature onto each other is the overlapping feature component. This component contains what the coarse-level and fine-level deep features share. In the coarse-level classification task, the overlapping component of the coarse-level deep feature contains some unique feature components of individual fine classes, which hinders grouping images of the same coarse class together. Similarly, in the fine-level classification task, the overlapping component of the fine classifier feature contains some features common to the coarse class, which hinders separating the fine classes that belong to the same coarse class. Therefore, we want the coarse-level deep feature and the fine-level deep feature to be orthogonal in space, so that the overlapping component approaches 0. We achieve this by adding the feature selection objective to the loss function and measuring the size of the overlapping component with the orthogonality loss.
So, the network can automatically complete feature selection and distinguish the two feature vectors spatially to help different classifiers remove features that are useless to their own classification tasks.

C. THE STRUCTURE OF THE MULTI-TASK NETWORK
Based on the above understanding of the orthogonality loss, the structure of the network is described in this section, as shown in Fig. 4. Given an input image, one classifier performs the coarse classification task to determine the coarse class to which the image belongs; the corresponding fine-grained classifier is then selected to determine the fine class, under that coarse class, to which the object image belongs. The hidden layers of the two classification tasks are deep convolutional neural networks whose parameters are shared. In this paper, the hidden layers of the network are used for feature extraction. After the features pass through a fully connected layer FC6, they enter the task-related output layer, which is also called the tree classifier. The tree classifier consists of different sub-classifiers that have the same structure but do not share weight parameters; the same structure means that each contains two fully connected layers and a softmax layer for its classification task. Therefore, the feature vectors produced by the task-related output layers, i.e., the coarse classifier and the fine classifier, can differ. As shown in Fig. 4, suppose the input image is $X$. CNNs with shared weights are used for the two tasks, and the features are then extracted through the same FC6 layer, so all processing up to this point is identical and the extracted features are the same. However, the sub-classifiers have independent fully connected layers whose parameters are not shared, so the features extracted by FC7 and fc7 are different. Thus, in the coarse-level classification task the coarse classifier feature is $f_g(x)$, and in the fine classification task the fine classifier feature is $f_s(x)$. $f_g(x)$ and $f_s(x)$ are then used to calculate the orthogonality loss of the network.
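The coarse-then-fine routing described above can be sketched as follows. This is a toy illustration under stated assumptions: the sub-classifiers are stand-in linear scorers with hypothetical weights rather than the two fully connected layers and softmax layer of the actual network, and all names are invented for the example.

```python
def argmax(scores):
    """Index of the largest score (first one on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def linear_scorer(weights):
    """Return a toy classifier that scores a feature with one weight row per class."""
    def score(feature):
        return [sum(w * x for w, x in zip(row, feature)) for row in weights]
    return score


def predict_hierarchical(shared_feature, coarse_classifier, fine_classifiers):
    """Pick the coarse class first, then apply only that class's fine classifier."""
    coarse_index = argmax(coarse_classifier(shared_feature))
    fine_scores = fine_classifiers[coarse_index](shared_feature)
    return coarse_index, argmax(fine_scores)


# Hypothetical weights: 2 coarse classes; 2 and 3 fine classes under them.
coarse_classifier = linear_scorer([[1.0, 0.0], [0.0, 1.0]])
fine_classifiers = [
    linear_scorer([[1.0, 1.0], [2.0, 0.0]]),
    linear_scorer([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]),
]

shared_feature = [0.2, 0.9]  # stands in for the shared FC6 output
prediction = predict_hierarchical(shared_feature, coarse_classifier, fine_classifiers)
```

Both scorers receive the same shared feature, mirroring the shared hidden layers, while each keeps its own weights, mirroring the unshared sub-classifiers.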
Assuming that $N$ training images are input, the orthogonality loss function is formulated as (1), where $k$ represents the number of coarse classes and $f_1, f_2, \ldots, f_k$ represent the $k$ fine classification tasks (only one fine classifier structure is drawn in Fig. 4). $f_g(x)$ represents the coarse classifier features of the $N$ images, and $f_s(x)$ represents the fine classifier features of the $N$ images. The trace of $f_s(x) f_g^T(x)$ is the sum, over the $N$ images, of the dot products of the coarse classifier features and the fine classifier features. When $\mathrm{Tr}[f_s(x) f_g^T(x)]$ approaches 0, the corresponding coarse-grained and fine-grained deep feature vectors tend to be orthogonal. α is a hyperparameter whose magnitude determines the influence of the orthogonality loss on the network parameters during backpropagation.
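The trace term can be made concrete with a small sketch. Since the exact form of (1) is not reproduced here, the code assumes the penalty is α times the absolute value of the trace; that form, the function names, and the example features are assumptions for illustration only.

```python
def row_dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def trace_of_product(f_s, f_g):
    """Tr[f_s f_g^T] for two N x D feature matrices given as lists of rows.

    The i-th diagonal entry of f_s f_g^T is the dot product of the i-th
    fine feature and the i-th coarse feature, so the trace is their sum.
    """
    return sum(row_dot(rs, rg) for rs, rg in zip(f_s, f_g))


def orthogonality_penalty(f_s, f_g, alpha):
    """Assumed form: alpha * |Tr[f_s f_g^T]| (the absolute value is an assumption)."""
    return alpha * abs(trace_of_product(f_s, f_g))


# Hypothetical features for N = 2 images with D = 2 dimensions.
fine = [[1.0, 0.0], [0.0, 2.0]]
coarse_orthogonal = [[0.0, 3.0], [4.0, 0.0]]   # each row orthogonal to its fine row
coarse_overlapping = [[1.0, 1.0], [1.0, 1.0]]  # each row overlaps its fine row

penalty_orthogonal = orthogonality_penalty(fine, coarse_orthogonal, alpha=2.5)
penalty_overlapping = orthogonality_penalty(fine, coarse_overlapping, alpha=2.5)
```

The penalty vanishes when every coarse and fine feature pair is orthogonal and grows with their overlap. Because the trace is a signed sum, positive and negative dot products could in principle cancel, so a per-image penalty may be a safer choice in practice.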
Under the guidance of the orthogonality loss function, $f_g(x)$ and $f_s(x)$ tend to be orthogonal, so that the coarse classifier features and fine classifier features of the network become more discriminative. Then $f_g(x)$ and $f_s(x)$ are passed to the softmax layers for classification, and the coarse classification result and the fine classification result are obtained respectively. Combining the two results determines the fine class, under the coarse class, to which the object image belongs. Different tasks have different losses, so the classification loss is composed of the gap between the predicted value and the coarse class label in the coarse classification task and the gap between the predicted value and the fine class label in the fine classification task. In Fig. 4, softmax loss 1 denotes the former and softmax loss 2 denotes the latter. They are measured by the loss function (2), where $g$ represents the coarse class and $s$ represents the fine class. $1\{y^{(i)} = j\}$ is the indicator function: it equals 1 if $y^{(i)} = j$ and 0 otherwise. $X$ represents the deep features obtained from the input image, $\theta_g$ and $\theta_s$ represent the model parameters of the coarse and fine classifiers respectively, and $k_g$ and $k_s$ are the numbers of coarse-grained and fine-grained classes respectively. When the loss function (2) approaches 0, the predicted value approaches the true value.
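As a minimal sketch of the two softmax losses, the code below computes the standard softmax cross-entropy for the coarse task and the fine task and adds them. Averaging over the $N$ images and weighting the two tasks equally are assumptions, since the exact form of (2) is not reproduced here; the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the true class; the indicator picks one term."""
    return -math.log(softmax(logits)[label])


def two_level_softmax_loss(coarse_logits, coarse_labels, fine_logits, fine_labels):
    """Sum of the mean coarse loss and the mean fine loss (equal weights assumed)."""
    n = len(coarse_labels)
    loss_g = sum(cross_entropy(z, y) for z, y in zip(coarse_logits, coarse_labels)) / n
    loss_s = sum(cross_entropy(z, y) for z, y in zip(fine_logits, fine_labels)) / n
    return loss_g + loss_s


# One image, 2 coarse classes and 4 fine classes, all scores equal.
uniform_loss = two_level_softmax_loss([[0.0, 0.0]], [0], [[0.0, 0.0, 0.0, 0.0]], [3])
```

With uniform scores the loss is $\log k_g + \log k_s$, and it approaches 0 as the predicted probability of the true class approaches 1 in both tasks.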

D. ORTHOGONAL DISTRIBUTION REGULARIZATION
With large-scale training data, feature selection using the orthogonality loss separates $f_g(x)$ and $f_s(x)$. However, after several training iterations the features may over-fit the requirements of the different learning tasks, making $f_g(x)$ and $f_s(x)$ too divergent. Therefore, an orthogonal distribution regularization term is constructed in this paper to limit the distribution of the parameters. According to the laws of natural systems, there are fixed relationships between things. The knowledge map constructed in this paper is therefore a fixed tree structure, and the corresponding tree classifier also has a fixed composition: there is a fixed number of fine-grained classes for each coarse-grained class. As shown in Fig. 5, there are $k$ parent nodes (coarse genera) and $N$ fine-grained child nodes (object classes). The $i$-th parent node contains $S_i$ leaf nodes ($S_1, S_2, \ldots, S_k$ are different). There is thus a correspondence between parent nodes and leaf nodes, and leaf nodes are assigned to the same parent node according to what they have in common. We therefore construct a distribution model to keep the deep features from becoming too divergent, in which $F$ represents the parameter distribution of the coarse classifier features, which follows a normal distribution with mean 0 and variance $\frac{1}{\gamma_1} D_g$, and $f_s$ represents the parameter distribution of the fine classifier features, whose mean is the parameter of its parent node and whose variance is $\frac{1}{\gamma_2} D_s$. For example, the feature points of $f_{slipper}$ and $f_{boot}$ lie close to $F_{shoes}$. When feature selection is applied, this constraint makes each partial distribution tend toward the overall distribution regardless of how the fine classifier features are separated from the coarse classifier features, and keeps the distribution from becoming too divergent through over-fitting to the classification tasks.
Accordingly, as in (4), orthogonal distribution regularization is added to the loss function so that the network learns this constraint itself, where $f_s$ represents the fine classifier feature of the network, $F_{parent}$ represents the classifier feature of the corresponding parent (coarse) node, and β represents the influence factor of the orthogonal distribution regularization on the multi-task network.
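One plausible reading of this regularizer is sketched below. It assumes the term is β times the summed squared Euclidean distance between each fine classifier feature and its parent feature, which is what a Gaussian prior centered on the parent gives up to constants when the covariance is isotropic. The exact form of (4) is not reproduced here, so the formula, the function names, and the example values are all assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distribution_regularizer(fine_features, parent_feature, beta):
    """Assumed form: beta * sum_i ||f_s_i - F_parent||^2."""
    return beta * sum(squared_distance(f, parent_feature) for f in fine_features)


# Hypothetical features echoing the slipper / boot / shoes example.
f_slipper = [1.0, 2.0]
f_boot = [3.0, 2.0]
F_shoes = [2.0, 2.0]

regularization = distribution_regularizer([f_slipper, f_boot], F_shoes, beta=0.5)
```

Under this assumed form the penalty is zero only when every fine feature coincides with its parent, so it pulls the features of sibling classes toward the parent while the orthogonality loss pushes coarse and fine features apart.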

IV. EXPERIMENT
Datasets: Two image databases are used to validate the orthogonality loss function, and a knowledge map is constructed for each database: (1) Fashion-60, containing 60 costume classes and 5 coarse-grained classes; (2) Caltech-UCSD Birds-200-2011, containing 200 fine-grained classes and 10 coarse-grained classes.
Experiment Environment: The experiments were performed on a GeForce GTX 1080 GPU. The learning rate was set to 0.01 and multiplied by 0.1 every 40 epochs.
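The step schedule for the learning rate can be written as a one-line function. The sketch assumes epochs are counted from 0 and that the decay is applied at epochs 40, 80, and so on; the function and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.1, step=40):
    """Step schedule: start at base_lr and multiply by decay every `step` epochs.

    Epochs are assumed to be counted from 0.
    """
    return base_lr * decay ** (epoch // step)


schedule = [learning_rate(epoch) for epoch in (0, 39, 40, 85)]
```

Epochs 0 to 39 use 0.01, epochs 40 to 79 use 0.001, and epochs 80 to 119 use 0.0001.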
The Basic Architecture of Hierarchical Deep Network: We used Resnet-18 as the feature extraction network and a tree classifier as the task-related output layers to build the multi-task network. The loss function is the combination of the softmax loss function and the orthogonality loss function. When performing multi-label classification on a database, we fuse the softmax losses of the two layers.
Compared Baseline Models: We propose an orthogonality loss function to improve the classification performance of the multi-task network, and there are several baseline models to compare with. One kind is the traditional deep learning network, such as Alexnet [3], VGG-19 [5] or Resnet-18 [7]. The other is the standard multi-task network model with Resnet-18, which we take as the baseline. In the experiments, we trained the standard multi-task network with the orthogonality loss function added to verify the effectiveness of the proposed method. Compared with the network without the added loss function, there is some extra computational cost, but it is very low: network training is slightly slower, and the testing time does not change.

A. EXPERIMENT WITH FASHION-60
In this section, we compare our proposed method with multiple baseline methods on the Fashion-60 database, which contains 60 fine-grained classes and 5 coarse-grained classes. First, we observe the effect of the influence factor α on the experimental results when the orthogonality loss function is added. Then an appropriate value of α is selected to compare the experimental results with the baseline. Finally, we check whether there is any improvement after the orthogonal distribution regularization is added.

1) THE VALUE OF α
First, we need to select an appropriate α for training. We therefore selected 14 different values of α between 0.001 and 6. The experimental results are shown in Fig. 6. As can be seen from the figure, the network performs best when α is 2.5. When the influence factor is small, the effect of the orthogonality loss function on the network is not obvious. When the value is increased beyond the optimum, the performance of the network gradually decreases, which indicates that an overly large value lets the orthogonality loss dominate and harm the original performance. Therefore, we choose α = 2.5 to train the network and compare it with the baseline network.

2) COMPARISON WITH STATE-OF-THE-ART METHODS
The accuracies of several methods on Fashion-60 are shown in Table 1. The results show that the multi-task network based on the orthogonality loss classifies better than both the traditional deep convolutional neural networks and the multi-task network without the hierarchical orthogonality loss. This result indicates that the orthogonality loss function proposed in this paper effectively performs feature selection, making the features obtained in the multi-task network more in line with the task requirements.
Subsequently, we compare the per-class accuracy of the baseline and of the multi-task network based on the orthogonality loss with α = 2.5. The comparison on the coarse-level classification task is shown in Fig. 7. The red bars represent the accuracy of the multi-task network based on the orthogonality loss, and the blue bars represent the accuracy of the baseline method. Fig. 7 shows that, in the coarse classification task, the recognition accuracy of every coarse class improves after the orthogonality loss is added. These results suggest that the orthogonality loss guides the multi-task network to eliminate the useless feature components in the coarse-grained deep features. For the fine classification task, we extract the fine-grained classes with the most obvious changes, as shown in Fig. 8. As can be seen from Table 1, the overall recognition accuracy of the multi-task network improves when the orthogonality loss is added. However, Fig. 8 shows that, although the recognition accuracy of most fine-grained classes improves after the orthogonality loss is added, there are also some classes whose recognition performance is degraded. The recognition accuracy varies greatly between classes: after the orthogonality loss is added, the lower accuracy of some fine classes is improved, while the higher accuracy of some other fine classes declines. The reason may be that the orthogonality loss reduces the gap between the accuracies of the fine classification tasks, moving the network toward a global optimum that improves the overall performance of the fine classification tasks by raising the lower fine classification accuracies.

3) ORTHOGONAL DISTRIBUTION REGULARIZATION (ODP)
Finally, we observe whether there is any improvement after the orthogonal distribution regularization is added. The parameter β in (4) takes the same value as α in (1). The orthogonal distribution regularization term (4) is a regularization term added to the orthogonality loss function (1) to limit the distribution of the feature parameters and prevent overfitting. Therefore, (4) and (1) are given the same influence factor on the multi-task network in order to balance restriction and discrimination. As shown in Table 2, we compared three methods. Compared with the multi-task network with center loss added, our proposed method achieves some improvement in recognition accuracy. According to the data in this table, the performance of the multi-task network on Fashion-60 improves after the orthogonal distribution regularization is added. The experimental results indicate that the orthogonality loss function effectively performs feature selection, making the features obtained in the multi-task network more in line with the task requirements, and that the added orthogonal distribution regularization effectively limits the distribution of the parameters to avoid overfitting of the network.

B. FURTHER EXPERIMENT ON CALTECH-UCSD BIRDS-200-2011
To further verify the effectiveness of the algorithm, we conducted experiments on Caltech-UCSD Birds-200-2011, which contains 200 fine classes and 10 coarse classes.

1) THE VALUE OF α
Similarly, we select 14 different values of α between 0.001 and 6. The experimental results are shown in Fig. 9. As can be seen from the figure, the network performs best when α is 2. When the influence factor α is small, the effect of the orthogonality loss function on the network is not stable, and when the value increases beyond a certain range, the performance of the network gradually decreases. Compared with the experimental results on Fashion-60, the optimal value of α is different, but for both databases the overall trend of the effect of α on network performance is the same. These further results show that the best value of α may differ between databases but that its influence on network performance follows a regular pattern: the loss has little effect when α is too small and is counterproductive when α is too large. A balance point between the orthogonality loss and the softmax loss must therefore be found. For a new dataset, we suggest first selecting a value of α between 1 and 4.5 for training and checking whether the network performance improves. Because databases are diverse, further experiments with values between 0.001 and 10 can help to confirm the best value of α.
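The suggested selection procedure can be sketched as a simple search over candidate values. The candidate lists below are hypothetical (the 14 values used in the experiments are not listed in the text), and `toy_accuracy` is a stand-in for training the network with a given α and measuring its accuracy.

```python
def select_alpha(candidates, evaluate):
    """Return the candidate alpha with the highest evaluated accuracy."""
    best_alpha, best_accuracy = None, float("-inf")
    for alpha in candidates:
        accuracy = evaluate(alpha)
        if accuracy > best_accuracy:
            best_alpha, best_accuracy = alpha, accuracy
    return best_alpha, best_accuracy


def toy_accuracy(alpha):
    """Hypothetical accuracy curve peaking at alpha = 2; replace with real training."""
    return 0.80 - 0.01 * (alpha - 2.0) ** 2


# Hypothetical grids: a first pass in [1, 4.5], then a wider pass in [0.001, 10].
FIRST_PASS = [1.0, 2.0, 3.0, 4.5]
WIDE_PASS = [0.001, 0.01, 0.1, 0.5] + FIRST_PASS + [6.0, 10.0]

first_alpha, first_accuracy = select_alpha(FIRST_PASS, toy_accuracy)
wide_alpha, wide_accuracy = select_alpha(WIDE_PASS, toy_accuracy)
```

Running the narrow pass first keeps the number of training runs small; the wide pass is only a check that no better value lies outside the suggested range.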

2) COMPARISON WITH THE BASELINE
In addition, we compare the algorithm proposed in this paper with the traditional deep convolutional neural networks and a multi-task network without the orthogonality loss; the experimental results are shown in Table 3. As can be seen from Table 3, the multi-task network based on the orthogonality loss classifies better than the other two types of networks on Caltech-UCSD Birds-200-2011. This further supports the claim that the orthogonality loss function proposed in this paper effectively performs feature selection, making the features obtained in the multi-task network more discriminative.

3) ORTHOGONAL DISTRIBUTION REGULARIZATION (ODP)
After the orthogonal distribution regularization is added, the performance of the multi-task network also improves. The parameter β again takes the same value as α. As shown in Table 4, the experimental results on Caltech-UCSD Birds-200-2011 further indicate that the orthogonality loss function effectively performs feature selection, making the features obtained in the multi-task network more in line with the task requirements, and that the added orthogonal distribution regularization effectively limits the distribution of the parameters to avoid over-fitting of the network.

V. CONCLUSION
In this paper, a novel loss function, the orthogonality loss, is proposed to achieve feature selection in a multi-task network structure, which helps to improve image classification. The orthogonality loss guides the multi-task network to extract more specific deep features for the different classification tasks and improves the overall classification performance of the whole network. An orthogonal distribution regularization term is also added to limit the distribution of the parameters and reduce the risk of over-fitting. Finally, the results of the classification experiments on Fashion-60 and Caltech-UCSD Birds-200-2011 demonstrate the effectiveness of the proposed algorithm. It is worth noting that, for a new dataset, this method first requires building a knowledge map with a two-layer semantic structure, which is used to guide the hierarchical tree classifier in performing the multi-level classification tasks.

YINCHENG HUO received the bachelor's degree in electronic information engineering from Northwestern Polytechnical University, Xi'an, China, in 2018, where he is currently pursuing the master's degree in signal and information processing. His current research interests include speech signal processing and deep learning about object detection.