Representation Learning With Dual Autoencoder for Multi-Label Classification

Multi-label classification deals with the problem that an object may be associated with one or more labels, a more difficult task due to the complex nature of multi-label data. The crucial problem of multi-label classification is learning more robust, higher-level feature representations, which can remove unhelpful feature attributes from the input space prior to training. In recent years, deep learning methods based on autoencoders have achieved excellent performance in multi-label classification thanks to their powerful representation learning ability and fast convergence. However, most existing autoencoder-based methods rely only on a single autoencoder model, which poses challenges for multi-label feature representation learning and fails to measure similarities between data spaces. To address this problem, in this paper we propose a novel representation learning method with a dual autoencoder for multi-label classification. Compared with existing autoencoder-based methods, the proposed method can capture different characteristics and more abstract features from data through the serial connection of two different types of autoencoders. More specifically, the Reconstruction Independent Component Analysis (RICA) algorithm in a sparse autoencoder is first trained on patches of all training and test data to learn robust global feature representations. Secondly, taking the output of RICA as input, a stacked autoencoder with manifold regularization (SAMR) is introduced to improve the quality of the learned multi-label features. Comprehensive experiments on several real-world datasets demonstrate the effectiveness of the proposed approach against several competing state-of-the-art methods.


I. INTRODUCTION
Recent years have witnessed many approaches to the problem that one object may be associated with a set of labels, commonly framed as the multi-label classification problem [1]. Different from binary and multi-class classification in the single-label setting, the intrinsic multi-label nature of most real datasets represents the world more faithfully [2]-[4]. In addition, multi-label learning has a wide range of applications in news classification [5], image processing [6], and other fields [7]. For example, a scenery image in the MS COCO dataset [8] may contain a car, a person, the sky, a boat, and so on, and can therefore be regarded as a multi-label image. One news article can be classified into several topics such as politics and sport when, for instance, it reports that an athlete is running for president. (The associate editor coordinating the review of this manuscript and approving it for publication was Chun-Hao Chen.)
Traditional multi-label classification methods, such as problem transformation and algorithm adaptation methods, either transform multi-label classification into single-label classification or extend a specific algorithm to multi-label data [9]. For example, classifier chains methods [10], [11] build each binary classifier in the chain on the previous predictions, transforming multi-label classification into a chain of binary classification problems. The Multi-Label k-Nearest Neighbor (ML-kNN) method [12] adapts the kNN algorithm to multi-label classification, where the maximum a posteriori principle is utilized to predict the label information of each instance.
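To make the chain idea concrete, the following is a minimal sketch, not the implementation of [10], [11] (which use stronger base classifiers): each link in the chain is a simple ridge-regression scorer whose input is augmented with the labels handled by the earlier links.

```python
import numpy as np

def fit_chain(X, Y, lam=1.0):
    """Train one ridge-regression scorer per label; each link in the chain
    also sees the ground truth of all previous labels as extra features."""
    n, q = Y.shape
    models = []
    Z = X
    for j in range(q):
        A = np.hstack([Z, np.ones((n, 1))])          # add a bias column
        w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y[:, j])
        models.append(w)
        Z = np.hstack([Z, Y[:, [j]]])                # append label j for the next link
    return models

def predict_chain(models, X):
    """Predict labels in chain order, feeding each prediction forward."""
    n = X.shape[0]
    Z = X
    preds = []
    for w in models:
        A = np.hstack([Z, np.ones((n, 1))])
        y = (A @ w > 0.5).astype(float)              # threshold the ridge score
        preds.append(y)
        Z = np.hstack([Z, y[:, None]])               # predicted label becomes a feature
    return np.stack(preds, axis=1)
```

The essential point is the feature augmentation step: label j is predicted from the original features plus the outputs for labels 1..j-1, which is how the chain exploits label dependencies.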
Considering the shortcomings caused by neglecting label correlations, many correlation-enabling methods have attempted to improve the generalization ability of the learning system in recent decades [13]. Tahir et al. [14] proposed to integrate stacked Spectral Regression based Kernel Discriminant Analysis (SRKDA) with ML-kNN, which performs dimensionality reduction and multi-label classification simultaneously; both the correlation and high-dimensionality problems can be tackled in this method. Alali and Kubat [15] proposed to reduce error propagation and prune unnecessary label dependencies with a classifier-stacking method, whose stacking structure fulfills controlled label-correlation exploitation by pruning uncorrelated outputs. However, the main problem of these methods is their limited ability to learn feature representations for multi-label classification: the raw input data used directly in the learning system may contain unhelpful features and deteriorate classification performance. Owing to the tremendous success of deep learning in feature representation learning, there have already been efforts to devote deep methods to multi-label classification. Wang et al. [16] combined recurrent neural networks (RNNs) with deep convolutional neural networks (CNNs) for multi-label image classification, where a joint image-label embedding is learned to model label co-occurrence dependency in an end-to-end way. Wang et al. [17] proposed a label graph superimposing method based on a graph convolutional network (GCN) for multi-label recognition; a knowledge graph is superimposed onto the statistical graph for label-correlation learning, and lateral connections are used for label-feature correlation modeling. However, these methods often suffer from a lack of labeled data, which is expensive and laborious to obtain in the real world.
Recently, autoencoder-based models have achieved strong performance owing to their powerful representation learning ability and fast convergence [18]. Yeh et al. [19] proposed a canonical correlated autoencoder based on deep neural networks for multi-label classification, where a joint feature and label embedding is performed to better relate the feature and label domains. Huang et al. [20] proposed a two-encoding-layer autoencoder that shares knowledge through the second encoding weight matrix; representation learning and multi-label learning are jointly optimized with the autoencoder to improve multi-label classification performance. However, these autoencoder-based methods rely only on a single autoencoder model, which poses challenges for multi-label feature representation learning and fails to measure similarities between data spaces.
To address these problems, we propose a novel Representation Learning method with Dual Autoencoder for multi-label classification (RLDA for short), which captures different characteristics and more abstract features from data through the serial connection of two different types of autoencoders. Specifically, the Reconstruction Independent Component Analysis (RICA) algorithm in a sparse autoencoder is first trained on patches of all training and test data for robust global feature learning. Then, taking the output of RICA as input, a stacked autoencoder with manifold regularization (SAMR for short) is applied to improve the quality of the multi-label feature representations. Finally, we obtain the new feature representations for multi-label classification by serially connecting the two different types of autoencoders. Extensive experiments on several real-world datasets demonstrate the effectiveness of the proposed RLDA compared with other state-of-the-art methods. The main contributions of this paper are summarized as follows:
• We propose a novel representation learning method called RLDA, which extracts different characteristics and more abstract features from data through the serial connection of two different types of autoencoders for multi-label classification.
• The RICA algorithm and a stacked autoencoder with manifold regularization (SAMR) are introduced to learn more discriminative and abstract features, which can discover latent knowledge in the raw input data for multi-label learning.
• Comprehensive experiments on four real-world datasets show that our method outperforms state-of-the-art models and validate its effectiveness.
The remainder of this paper is organized as follows. Preliminary knowledge used in the proposed method is reviewed in Section II, and details of the proposed RLDA method are provided in Section III. Experimental results and analysis on four real-world datasets are presented in Section IV, followed by related work in Section V. Finally, our conclusions are summarized in Section VI.

II. PRELIMINARIES

A. AUTOENCODER
The autoencoder model [21] is an unsupervised feature representation learning model, which aims to learn an approximate representation of the input through its encoder and decoder layers. The autoencoder has become one of the most successful deep neural networks and has recently been actively adopted for multi-label classification. Given an input $x \in \mathbb{R}^{m}$, the autoencoder attempts to learn an approximate output $h_{W,b}(x) \approx x$. Specifically, the model usually contains one encoder layer and one decoder layer. The encoder maps the input to one or more hidden layers, and the decoder maps the hidden layers back to the output $\hat{x}$. For an autoencoder with a single hidden layer, the encoder and decoder can be represented as (1) and (2):

$$\xi = f(W_1 x + b_1) \tag{1}$$

$$\hat{x} = g(W_2 \xi + b_2) \tag{2}$$

where $W_1 \in \mathbb{R}^{k \times m}$ and $W_2 \in \mathbb{R}^{m \times k}$ are the weight matrices, $b_1 \in \mathbb{R}^{k \times 1}$ and $b_2 \in \mathbb{R}^{m \times 1}$ are the bias vectors, $\xi \in \mathbb{R}^{k \times 1}$ is the output of the hidden layer, and $f$ and $g$ are the nonlinear activation functions of the encoder and decoder, respectively. For succinctness, the original input data are denoted as $\{x_i\}_{i=1}^{n}$, so the reconstruction error can be expressed as $\sum_{i=1}^{n} \|\hat{x}_i - x_i\|_2^2$. The crucial problem of the autoencoder model is to minimize this reconstruction error over the parameters $W_1$, $W_2$, $b_1$, and $b_2$, as shown in (3):

$$\min_{W_1, W_2, b_1, b_2} \sum_{i=1}^{n} \left\| g\big(W_2 f(W_1 x_i + b_1) + b_2\big) - x_i \right\|_2^2 \tag{3}$$

B. RECONSTRUCTION INDEPENDENT COMPONENT ANALYSIS

The Reconstruction Independent Component Analysis (RICA) model [22] aims to extract sparse representations of whitened or non-whitened unlabeled data by learning a set of linearly independent basis features that represent the input accurately. Given an input $x$, the latent representation is obtained through the weight matrix $W$ as shown in (4):

$$z = W x \tag{4}$$

The optimization of RICA's objective function is represented as (5):

$$\min_{W} \sum_{t=1}^{n} \left\| W x^{(t)} \right\|_1 + \frac{\lambda}{n} \sum_{t=1}^{n} \left\| W^{\top} W x^{(t)} - x^{(t)} \right\|_2^2 \tag{5}$$

Compared with the objective function of the autoencoder above, a reconstructive penalty, the second term in (5), is added so that the model scales up to over-complete features.
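The encode-decode-minimize loop of (1)-(3) can be sketched in a few lines of NumPy; the sigmoid activations, layer sizes, and random data below are illustrative assumptions, and no training step is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, W1, b1, W2, b2):
    """Eqs. (1)-(2): encode the input to the hidden layer, then decode it back."""
    xi = sigmoid(W1 @ x + b1)        # (1) hidden representation, xi in R^k
    x_hat = sigmoid(W2 @ xi + b2)    # (2) reconstruction, x_hat in R^m
    return xi, x_hat

def reconstruction_error(X, W1, b1, W2, b2):
    """Eq. (3): summed squared reconstruction error over all inputs."""
    return sum(np.sum((autoencoder_forward(x, W1, b1, W2, b2)[1] - x) ** 2)
               for x in X)

# example shapes: m = 5 input units, k = 3 hidden units (arbitrary choices)
m, k = 5, 3
W1, b1 = rng.normal(size=(k, m)), np.zeros(k)
W2, b2 = rng.normal(size=(m, k)), np.zeros(m)
X = rng.uniform(size=(10, m))        # 10 toy samples in [0, 1]
err = reconstruction_error(X, W1, b1, W2, b2)
```

Training would minimize `err` with respect to the four parameter arrays, e.g. by gradient descent via back-propagation.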
Since the objective function of RICA has no analytic solution, the gradient of the reconstruction cost is derived with the back-propagation idea for optimization. The gradient with respect to $W^{\top}$ is transposed into a gradient with respect to $W$, and the final gradient with respect to $W$ is shown as (6):

$$\nabla_W = \frac{2\lambda}{n} \sum_{t=1}^{n} \left( W x^{(t)} \big(W^{\top} W x^{(t)} - x^{(t)}\big)^{\top} + W \big(W^{\top} W x^{(t)} - x^{(t)}\big) x^{(t)\top} \right) \tag{6}$$

C. MANIFOLD REGULARIZATION

Manifold regularization constructs a graph connecting similar observations for unsupervised or semi-supervised learning, and label information propagates through the graph from labeled nodes to unlabeled ones by finding the minimum-energy configuration [23]. In our proposed method, we incorporate manifold learning as a regularization term in the autoencoder, enforcing that neighbors lie in the same local structure of the representation space. Given the input $\{x_1, \ldots, x_n\}$, the adjacency matrix $K$ is defined as (7):

$$K_{ij} = \begin{cases} 1, & x_i \in NN(k, x_j) \ \text{or} \ x_j \in NN(k, x_i) \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

where $NN(k, x_j)$ is the set of $k$ nearest neighbors of $x_j$ and $k$ is a hyper-parameter; the distance between $x_i$ and $x_j$ is measured by the cosine distance. Let $D$ be the diagonal matrix with $D = \mathrm{diag}(\sum_j K_{ij})$, and let the Laplacian matrix be $L = D - K$. The manifold regularization term can then be written as (8):

$$\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} K_{ij} \left\| h_i - h_j \right\|_2^2 = \mathrm{tr}\big(H^{\top} L H\big) \tag{8}$$

where $h_i$ is the learned representation of $x_i$ and $H = [h_1, \ldots, h_n]^{\top}$.
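A small NumPy sketch of (7)-(8), assuming a symmetrized k-NN graph under cosine distance (the symmetrization rule is our assumption, as the text does not spell it out):

```python
import numpy as np

def knn_adjacency(X, k):
    """Eq. (7): symmetric k-NN adjacency built from cosine distance."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    dist = 1.0 - Xn @ Xn.T                       # cosine distance matrix
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbour
    K = np.zeros_like(dist)
    for i in range(len(X)):
        nbrs = np.argsort(dist[i])[:k]           # k nearest neighbours of x_i
        K[i, nbrs] = 1.0
    return np.maximum(K, K.T)                    # symmetrise the graph

def laplacian(K):
    """L = D - K with D = diag(sum_j K_ij)."""
    return np.diag(K.sum(axis=1)) - K

def manifold_term(H, L):
    """Eq. (8): 0.5 * sum_ij K_ij ||h_i - h_j||^2 = tr(H^T L H)."""
    return np.trace(H.T @ L @ H)
```

The trace identity in (8) is what makes the regularizer cheap: the pairwise sum over all neighbour pairs collapses to one matrix product with the Laplacian.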

III. METHODOLOGY
In this section, firstly, some important notations used in this paper are listed in TABLE 1. Then, the whole framework of our proposed RLDA is presented in detail.

A. OVERALL ARCHITECTURE
The proposed representation learning method with dual autoencoder is a deep neural network that learns more robust, higher-level feature representations for multi-label classification. As shown in Fig. 1, the method contains two different types of models forming a dual autoencoder, each with its own strengths for extracting different characteristics of the input data. Specifically, there are two stages in the proposed method: (1) the Reconstruction Independent Component Analysis (RICA) algorithm in a sparse autoencoder is trained on patches for global feature learning; (2) based on the results of stage (1), a stacked autoencoder with manifold regularization (SAMR) is applied to improve the quality of the multi-label feature representations. After training, softmax regression is used to predict the label set of each test instance from the learned feature representations. The details of the two stages of RLDA are given below.

B. GLOBAL FEATURE LEARNING WITH RICA

The first stage of the proposed method is the RICA model, which learns a latent feature representation subspace from the original input data. Given the training dataset $\{x^{(t)}\}_{t=1}^{n}$, the objective function of RICA is defined as (9):

$$\min_{W} \frac{\lambda}{n} \sum_{t=1}^{n} \left\| W^{\top} W x^{(t)} - x^{(t)} \right\|_2^2 + \sum_{t=1}^{n} \sum_{j=1}^{k} f\big(W_j x^{(t)}\big) \tag{9}$$

In our work, $f(x) = \sqrt{(W x^{(t)})^2 + \varepsilon}$ is used to implement the $L_1$ regularization in (9) so that the reconstruction term can be scaled, where $\varepsilon = 0.1$ is a small constant that prevents $(W x^{(t)})^2 + \varepsilon$ from being numerically close to zero. The objective function can therefore be defined as (10):

$$\min_{W} \frac{\lambda}{n} \sum_{t=1}^{n} \left\| W^{\top} W x^{(t)} - x^{(t)} \right\|_2^2 + \sum_{t=1}^{n} \sum_{j=1}^{k} \sqrt{\big(W_j x^{(t)}\big)^2 + \varepsilon} \tag{10}$$

The partial derivative of $L$ with respect to $W$, after transposing the gradient with respect to $W^{\top}$, is shown as (11):

$$\frac{\partial L}{\partial W} = \frac{2\lambda}{n} \sum_{t=1}^{n} \left( W x^{(t)} \big(W^{\top} W x^{(t)} - x^{(t)}\big)^{\top} + W \big(W^{\top} W x^{(t)} - x^{(t)}\big) x^{(t)\top} \right) + \sum_{t=1}^{n} \frac{W x^{(t)}}{\sqrt{\big(W x^{(t)}\big)^2 + \varepsilon}} \, x^{(t)\top} \tag{11}$$

Based on the above derivatives, $W$ is optimized and the output $z^{(t)} = W x^{(t)}$ is fed as input into the stacked autoencoder with manifold regularization.
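The objective (10) and its gradient (11) can be sketched directly in NumPy. In this sketch samples are stored as the columns of `X`, the square root is applied element-wise, and the analytic gradient can be verified against finite differences:

```python
import numpy as np

def rica_objective(W, X, lam=0.05, eps=0.1):
    """Eq. (10): reconstruction penalty plus smoothed-L1 sparsity.
    X is m x n with samples as columns; W is k x m."""
    n = X.shape[1]
    R = W.T @ (W @ X) - X                        # reconstruction residuals
    recon = lam / n * np.sum(R ** 2)
    sparse = np.sum(np.sqrt((W @ X) ** 2 + eps)) # smoothed L1 over all units
    return recon + sparse

def rica_gradient(W, X, lam=0.05, eps=0.1):
    """Eq. (11): analytic gradient of (10) with respect to W."""
    n = X.shape[1]
    Z = W @ X                                    # latent outputs, one column per sample
    R = W.T @ Z - X
    g_recon = 2 * lam / n * (Z @ R.T + (W @ R) @ X.T)
    g_sparse = (Z / np.sqrt(Z ** 2 + eps)) @ X.T
    return g_recon + g_sparse
```

In practice `rica_objective` and `rica_gradient` would be handed to a gradient-based optimizer such as L-BFGS to learn `W`.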
C. STACKED AUTOENCODER WITH MANIFOLD REGULARIZATION

The stacked autoencoder with manifold regularization (SAMR) is used to improve the quality of the multi-label feature representations, and generalized eigendecomposition is used to optimize the model parameters and learn higher-level feature representations. The main intuition behind handling the labels is to transform the multi-label task into a multi-class task, converting each (instance, labels) pair into a set of (instance, label) pairs, each containing exactly one label. The manifold regularization mapping is denoted $f(x) = Mx$, where $M$ is the transformation weight matrix. The loss function with manifold regularization is shown as (12):

$$\arg\min_{M} \sum_{i=1}^{n} \left\| x_i - M x_i \right\|_2^2 + \gamma \, \mathcal{R}(M) \tag{12}$$

where $\gamma$ is a tuning parameter and $\mathcal{R}(M)$ is the manifold regularization term. In addition, as mentioned in (8), the local-geometry-preserving term can be defined as $\frac{1}{2} \sum_{i,j} K_{ij} \| M x_i - M x_j \|_2^2$. Integrating this term into (12), the optimization problem can be expressed as (13):

$$\arg\min_{M} \sum_{i=1}^{n} \left\| x_i - M x_i \right\|_2^2 + \frac{\gamma}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} K_{ij} \left\| M x_i - M x_j \right\|_2^2 \tag{13}$$

According to the identity in (8), (13) can be rewritten as (14):

$$\arg\min_{M} \left\| X - M X \right\|_F^2 + \gamma \, \mathrm{tr}\big(M X L X^{\top} M^{\top}\big) \tag{14}$$

where $L$ is the Laplacian matrix and $X = [x_1, \ldots, x_n]$ stores the samples as columns.
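For intuition, when the map is an unconstrained linear matrix M, setting the gradient of (14) to zero gives the closed-form solution M = XXᵀ(XXᵀ + γ·XLXᵀ)⁻¹. The sketch below solves this regularized system directly; this is a simplification of SAMR (the full model is a stacked, nonlinear autoencoder optimized via generalized eigendecomposition), and a small ridge term is our addition for numerical stability:

```python
import numpy as np

def solve_samr_linear(X, L, gamma=1e-3, ridge=1e-8):
    """Closed-form minimiser of the linear version of Eq. (14):
         ||X - M X||_F^2 + gamma * tr(M X L X^T M^T)  (+ tiny ridge on M).
    Setting the gradient to zero gives
         M = X X^T (X X^T + gamma * X L X^T + ridge * I)^{-1}.
    X is m x n with samples as columns; L is the n x n graph Laplacian."""
    G = X @ X.T
    S = G + gamma * (X @ L @ X.T) + ridge * np.eye(X.shape[0])
    # S and G are symmetric, so M = G S^{-1} = (S^{-1} G)^T
    return np.linalg.solve(S, G).T
```

The ridge term makes `S` safely invertible; at the returned `M` the gradient of the (ridge-augmented) objective vanishes, which is easy to check numerically.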

D. PREDICTION
After the feature representations are learned through the connection of the dual autoencoders, softmax regression is introduced to predict the multiple labels of each test instance. More specifically, following the strategy adopted by other methods [20], the probability of an instance belonging to each label is estimated first; then all label probabilities are sorted in descending order and the difference between each pair of adjacent probabilities is calculated. Finally, labels are assigned according to the maximum difference: the labels ranked before the maximum gap are taken as the predicted labels of the instance. The whole process of the proposed RLDA model is summarized in Algorithm 1.
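The max-gap thresholding step can be sketched as follows, assuming the per-label probabilities for one instance are already available:

```python
import numpy as np

def assign_labels(probs):
    """Sort one instance's per-label probabilities in descending order,
    find the largest gap between adjacent probabilities, and predict
    every label ranked before that gap."""
    order = np.argsort(probs)[::-1]              # label indices, most probable first
    sorted_p = probs[order]
    gaps = sorted_p[:-1] - sorted_p[1:]          # differences between neighbours
    cut = int(np.argmax(gaps)) + 1               # number of labels to keep
    pred = np.zeros_like(probs, dtype=int)
    pred[order[:cut]] = 1                        # labels before the max gap
    return pred
```

This rule adapts the number of predicted labels per instance instead of using a fixed global threshold: a large drop in probability marks where the relevant labels end.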

Algorithm 1 Representation Learning With Dual
Autoencoder for Multi-Label Classification (RLDA)

IV. EXPERIMENTS
In this section, we conduct extensive experiments to evaluate the performance of the proposed method. The details of the datasets are presented first. Secondly, the compared methods and experimental settings are introduced in detail. Then the classification results and observations for the proposed RLDA and the competing methods are given. Finally, the properties and parameter sensitivity of RLDA are analyzed on selected datasets.

A. DATASETS
The datasets, including enron, medical, Corel5k, and Corel16k001, are selected from Mulan [24], an open-source Java library for multi-label learning.

B. COMPARED METHODS
We compare our proposed RLDA with the following baseline methods:
• Learning multi-label scene classification (Binary Relevance, BR for short) [25]. It fits multi-label data with n independent binary classifiers, one per label.
• A lazy learning approach to multi-label learning (ML-KNN) [12]. It is based on the traditional k-nearest neighbor method and introduces the maximum a posteriori principle to determine the label set of an unseen instance.
• Random k-Labelsets (RAKEL) [26]. It is based on random label space projection, and a set of Label Powerset classifiers is trained on an ensemble of k random label subsets for multi-label classification.
• Stacked Denoising Autoencoders (SDA) [27]. In this method, higher-level features are first extracted by a stacked denoising autoencoder to form the code layer. The features are then combined with the labels to construct a new feature space. Finally, Bayesian Multinomial Regression (BMR) is adopted for classification on the new feature space.
• Manifold regularized discriminative feature selection for multi-label learning (MDFS) [28]. Manifold regularization is introduced to generate low-dimensional representations from the original input data, exploiting local and global label correlations; feature selection is then conducted for discriminative feature representation learning using $\ell_{2,1}$-norm regularization.
• Supervised representation learning for multi-label classification (SERL) [20]. It introduces a two-encoding-layer autoencoder trained in a supervised manner to learn global feature representations for multi-label classification; softmax regression is utilized to incorporate label knowledge by being jointly optimized with the autoencoder, improving performance.
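As a point of reference for the simplest baseline above, Binary Relevance can be sketched with one independent ridge scorer per label; this is illustrative only, since the actual BR baseline in the experiments uses LIBSVM base classifiers.

```python
import numpy as np

def fit_br(X, Y, lam=1.0):
    """Binary Relevance: one independent ridge 'classifier' per label column.
    X is n x d features; Y is an n x q binary label matrix."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])         # add a bias column
    M = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return M                                             # one weight column per label

def predict_br(M, X):
    """Score each label independently and threshold at 0.5."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return (A @ M > 0.5).astype(int)
```

The defining property, and the weakness that correlation-aware methods address, is that the columns of `M` are fit completely independently, so no label dependency is ever modeled.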

C. EXPERIMENT SETTINGS
There are three hyper-parameters in the proposed method: the tuning parameters λ and γ, and the number of nodes k in the embedding layer. We set λ = 0.05, γ = 1e-3, and k = 100 for all datasets after cross-validation. LIBSVM with a linear kernel is used as the base classifier for all compared methods except SDA. Moreover, for ML-KNN, K is set to 10. For RAKEL, the ensemble size and label-subset size k are set to 2c and 3, respectively, where c is the number of labels. For SDA, the number of nodes k in the embedding layer is set to 100, consistent with our proposed method. For MDFS, the parameters β and γ are set to 1 and 100, as configured in their source code [28]. For SERL, the trade-off parameters α, β, and k are set to 15, 0.005, and 100, respectively, according to [20].

D. EXPERIMENTAL RESULTS
The ranking-based evaluation metrics RankingLoss, Coverage, and MacroAUC, and the classification metrics Accuracy, F1-score, and MacroF1 are adopted to compare RLDA with the other methods comprehensively. The ranking and classification results are reported in TABLE 3-10, with the best results marked in bold. We make the following observations from the experimental results:
• Among the ranking-based evaluation metrics, RLDA performs best on the enron dataset and achieves the best Coverage and MacroAUC on the Corel5k and Corel16k001 datasets. Even on RankingLoss, RLDA obtains competitive results on all four datasets.
• Among the classification evaluation metrics, the proposed RLDA performs better than BR, which indicates that autoencoder-based methods outperform the standard multi-classifier approach: RLDA can extract more robust feature representations for multi-label classification, whereas BR learns independent classifiers over overlapping classes and may fail to learn discriminative features, deteriorating multi-label classification performance.
• RLDA outperforms ML-KNN and RAKEL, which shows that deep methods can learn more abstract feature representations than shallow architectures for multi-label classification.
• The proposed RLDA outperforms SDA, MDFS, and SERL, which indicates that the serial connection of two different autoencoders, capturing different features, is better than stacking a single type of autoencoder for multi-label classification.
• Neural-network-based methods (e.g., SDA, MDFS, and SERL) deliver relatively good results compared with problem transformation and algorithm adaptation methods (e.g., BR, ML-KNN, and RAKEL) in most cases, which demonstrates the feature representation learning ability of neural networks in multi-label classification.
• Overall, across all datasets, the proposed RLDA performs best in terms of Accuracy, F1-score, and MacroF1 compared with the state-of-the-art methods, validating the effectiveness of the proposed method.

E. COMPARISON WITH SINGLE AUTOENCODER
To verify the effectiveness of the proposed method, especially the serial connection of two different types of autoencoders, we compare RLDA with RICA alone and SAMR alone.
2 https://github.com/jiazhang-ml/MDFS
The results on medical and Corel5k are listed in TABLE 7. The performance of the proposed RLDA is clearly better than that of RICA alone or SAMR alone, which demonstrates that combining two different types of autoencoders captures more powerful and abstract feature representations than a single autoencoder in multi-label classification.

V. RELATED WORK
Multi-label classification has been extensively researched and used in many applications such as text categorization [29], music categorization [30], and semantic classification of images [31]. Multi-label classification methods can be divided into two main groups: problem transformation and algorithm adaptation methods [9]. Problem transformation methods solve the multi-label learning problem by transforming it into other well-established scenarios. For example, Binary Relevance [25] and Classifier Chains [32] transform multi-label learning tasks into binary classification tasks. The Calibrated Label Ranking method [33] transforms the multi-label learning problem into a label ranking problem based on pairwise comparison. Mencía et al. [34] proposed the Quick Weighted Multi-label Learning (QWML) method, which transforms a class ranking into a bipartite prediction by introducing an artificial thresholding class with QWeighted voting, reducing computational costs. The Random k-labelsets method [35] learns an ensemble of multi-label classifiers over k random label subsets, which improves computational efficiency and predictive performance compared with traditional label powerset methods.
Algorithm adaptation methods adapt existing single-label classification algorithms to multi-label data [9]. For example, Rastin et al. [36] proposed a prototype weighting method to adapt the distance measure of the ML-kNN method [12]; the prototype weights are adjusted by gradient ascent to maximize the macro-F1 measure as the objective function. Kouchaki et al. [37] designed multi-label random forest (MLRF) models for tuberculosis resistance classification and mutation ranking in medicine. Wu et al. [38] combined the Ranking support vector machine and Binary Relevance with robust Low-rank learning (RBRL), which enjoys the advantages and tackles the disadvantages of Rank-SVM and BR. Xuan et al. [39] developed a Bayesian nonparametric model for multi-label learning, which learns both low-dimensional label and instance embeddings without fixing the number of dimensions. Zhang et al. [40] proposed a fully associative ensemble learning method for hierarchical multi-label classification, which builds a multi-variable regression model between the global and local predictions of all nodes.
Recently, representation learning methods have achieved encouraging results in multi-label classification. For example, Huang et al. [41] proposed to learn a label-specific data representation for each class label in a sparse stacking way, exploiting both second-order and high-order label correlations for multi-label classification. Zhang et al. [42] proposed a hierarchical and transparent representation learning method to express semantic information for accurate paper-reviewer recommendation as multi-label classification. Ye et al. [43] introduced a dynamic graph convolutional network that projects raw input into category-aware representations with a semantic attention module; the final category representations are utilized for multi-label image recognition. Gong et al. [44] proposed a hierarchical graph transformer method for multi-label text classification, where a multi-layer transformer structure and the hierarchical relationships of the labels are used for feature representation learning at different levels. On the other hand, autoencoder-based methods have attracted much attention for their powerful representation learning ability and fast convergence. For example, Huang et al. [20] designed a two-encoding-layer autoencoder model for multi-label learning, where knowledge is shared through softmax regression for performance improvement. Law and Ghosh [45] introduced a stacked autoencoder that learns a discriminative, reduced input representation of the multi-label data. Cheng et al. [46] proposed a kernel extreme learning machine autoencoder for learning the associations between features in the input space; since the extreme learning machine autoencoder involves no iterative process, this method reduces computational complexity while improving classification performance.

VI. CONCLUSION
In this paper, we propose a representation learning method with dual autoencoder (RLDA), which learns richer feature representations through the serial connection of two different types of autoencoders for multi-label classification. In the proposed method, Reconstruction Independent Component Analysis (RICA) is introduced in the first stage for robust global feature representation learning. The stacked autoencoder with manifold regularization (SAMR) is then applied in the second stage to extract more powerful feature representations. Extensive experiments on four real-world datasets demonstrate the effectiveness of the proposed method compared with other competing methods.
This study demonstrates the effectiveness of the proposed RLDA for multi-label classification. However, how to determine the types, the number, and the connection patterns of autoencoders remains a challenge. In future work, we will try to add other types of autoencoders to discover more characteristics of the data, and multiple types of autoencoders will be connected both in series and in parallel to extract more abstract feature representations for multi-label classification.
YI ZHU received the B.S. degree from Anhui University, the M.S. degree from the University of Science and Technology of China, and the Ph.D. degree from the Hefei University of Technology. He is currently an Assistant Professor with the School of Information Engineering, Yangzhou University, China. His research interests include data mining and knowledge engineering.
YANG YANG received the B.S. degree from Nanjing Tech University. He is currently pursuing the degree with the School of Information Engineering, Yangzhou University, China. His research interests include knowledge engineering and recommendation systems.

YUNHAO YUAN received the Ph.D. degree in pattern recognition and intelligence system from the Nanjing University of Science and Technology (NUST), China, in 2013. He is currently an Associate Professor with the School of Information Engineering, Yangzhou University. He is the author or coauthor of more than 60 scientific articles. His research interests include pattern recognition, machine learning, multimedia search, and information fusion. He received two National Scholarships from the Ministry of Education, China, and an Outstanding Ph.D. Thesis Award from NUST.
RUNMEI ZHANG received the B.S. degree from Huaibei Normal University, and the M.S. and Ph.D. degrees from the Hefei University of Technology. She is currently a Professor with the School of Mechanical and Electrical Engineering, Anhui Jianzhu University, China. Her research interests include data mining, complex system modeling, and architecture digitization. VOLUME 9, 2021