An Effective Semi-Supervised Multi-Label Least Squares Twin Support Vector Machine

Multi-label twin support vector machine (MLTSVM), an effective multi-label classifier based on the twin support vector machine (TSVM), has been widely studied and applied due to its excellent classification performance. However, the classical MLTSVM has two disadvantages: (a) MLTSVM needs to solve a series of quadratic programming problems (QPPs), which slows down its learning. (b) For multi-label learning problems, it is very difficult to obtain all labels of all samples; in practice, the available datasets contain only a small number of labeled samples and a large number of partially labeled and unlabeled samples, yet MLTSVM can only use the expensive labeled samples and ignores the cheap unlabeled and partially labeled ones. To address these drawbacks, we propose a novel semi-supervised multi-label least squares twin support vector machine, called SS-MLLSTSVM. Firstly, to speed up training, SS-MLLSTSVM introduces the least squares idea into each sub-classifier of MLTSVM, so that each sub-classifier only needs to solve a system of linear equations instead of a QPP. Secondly, SS-MLLSTSVM makes full use of the geometric information in unlabeled and partially labeled samples by introducing a manifold regularization term into each sub-classifier. Experimental results on benchmark datasets show that, compared with existing multi-label classification algorithms, our SS-MLLSTSVM has better classification performance.


I. INTRODUCTION
TSVM [1], proposed by Jayadeva et al. in 2007, solves the binary classification problem. Because of its high learning speed and good generalization performance, it has been widely studied and applied, and many improvements of TSVM have been proposed [2]-[18].
The above improvements can only solve single-label learning problems, not multi-label learning problems, in which each sample may simultaneously belong to multiple labels. Multi-label learning problems are common [19]-[22]. To date, there are two types of methods for solving the multi-label learning problem: problem transformation and algorithm adaptation. The problem transformation method solves the multi-label learning problem by transforming it into one or more single-label problems, such as label powerset (LP) [23], binary relevance (BR) [24], random k-labelsets (RAKEL) [25], classifier chains (CC) [26], calibrated label ranking (CLR) [27], and so on. The algorithm adaptation method extends an existing single-label learning algorithm to handle multi-label learning problems, such as ranking support vector machine (Rank-SVM) [28], collective multi-label classifier (CML) [29], multi-label k-nearest neighbor (ML-KNN) [30], multi-label decision tree (ML-DT) [31], backpropagation for multi-label learning (BPMLL) [32], and so on.
In order to extend TSVM to the multi-label learning problem, Chen et al. proposed the multi-label twin support vector machine (MLTSVM) in 2016 [33]. Compared with other traditional multi-label classification algorithms, MLTSVM has better generalization performance. Thereafter, many improvements of MLTSVM have been presented, such as the KNN-based multi-label twin support vector machine with priority of labels (PKNN-MLTSVM) [34], the structural least squares twin support vector machine for multi-label learning (ML-SLSTSVM) [35], and so on.
However, classical MLTSVM and its improvements share two disadvantages: (a) MLTSVM needs to solve a series of quadratic programming problems (QPPs), which slows down its learning. (b) For multi-label learning problems, it is very difficult to obtain all labels of all samples; in practice, the available datasets contain only a small number of labeled samples and a large number of partially labeled and unlabeled samples, yet MLTSVM can only use the expensive labeled samples and ignores the cheap unlabeled and partially labeled ones. To address these drawbacks, we propose a novel semi-supervised multi-label least squares twin support vector machine, called SS-MLLSTSVM. Firstly, to speed up training, SS-MLLSTSVM introduces the least squares idea into each sub-classifier of MLTSVM, so that each sub-classifier only needs to solve a system of linear equations instead of a QPP. Secondly, SS-MLLSTSVM makes full use of the geometric information in unlabeled and partially labeled samples by introducing a manifold regularization term into each sub-classifier. Experimental results on benchmark datasets show that, compared with existing multi-label classification algorithms, our SS-MLLSTSVM has better generalization performance.
The structure of this article is as follows: Section 2 introduces related works, including TSVM, the least squares twin support vector machine (LSTSVM), and MLTSVM. In Section 3, SS-MLLSTSVM is proposed, including the linear case, the nonlinear case and the decision function. Section 4 presents the experimental results and analysis of our proposed algorithm on benchmark datasets. Section 5 concludes the article.

II. RELATED WORKS
For the binary classification problem, the training set is denoted as $T = \{(x_i, y_i) \mid i = 1, \ldots, m\}$, where $x_i \in \mathbb{R}^n$ is a training sample and $y_i \in \{+1, -1\}$ is the label of $x_i$. For convenience, we collect the positive training samples in $A \in \mathbb{R}^{m_1 \times n}$ and the negative training samples in $B \in \mathbb{R}^{m_2 \times n}$, where $m = m_1 + m_2$ is the total number of training samples.

A. TSVM
The goal of TSVM is to seek the following two nonparallel hyperplanes:

$$f_+(x) = w_+^T x + b_+ = 0, \quad f_-(x) = w_-^T x + b_- = 0.$$

The primal problems of TSVM are as follows:

$$\min_{w_+, b_+, \xi_-} \ \frac{1}{2}\|A w_+ + e_+ b_+\|^2 + c_+ e_-^T \xi_- \quad \text{s.t.} \ -(B w_+ + e_- b_+) + \xi_- \geq e_-, \ \xi_- \geq 0, \tag{2}$$

$$\min_{w_-, b_-, \xi_+} \ \frac{1}{2}\|B w_- + e_- b_-\|^2 + c_- e_+^T \xi_+ \quad \text{s.t.} \ (A w_- + e_+ b_-) + \xi_+ \geq e_+, \ \xi_+ \geq 0, \tag{3}$$

where $c_\pm$ are the penalty parameters, $\xi_\pm$ are the slack variables, and $e_\pm$ are vectors of ones of the proper dimensions. The dual problems of (2) and (3) are as follows:

$$\max_{\alpha} \ e_-^T \alpha - \frac{1}{2}\alpha^T E (F^T F)^{-1} E^T \alpha \quad \text{s.t.} \ 0 \leq \alpha \leq c_+ e_-, \tag{4}$$

$$\max_{\gamma} \ e_+^T \gamma - \frac{1}{2}\gamma^T F (E^T E)^{-1} F^T \gamma \quad \text{s.t.} \ 0 \leq \gamma \leq c_- e_+, \tag{5}$$

where $\alpha$ and $\gamma$ are the Lagrange multipliers, $E = [B \ e_-]$ and $F = [A \ e_+]$. The two nonparallel hyperplanes can be obtained by solving the dual problems (4) and (5):

$$[w_+^T \ b_+]^T = -(F^T F)^{-1} E^T \alpha, \quad [w_-^T \ b_-]^T = (E^T E)^{-1} F^T \gamma.$$

B. LSTSVM
Similar to TSVM, LSTSVM also seeks two nonparallel hyperplanes. However, LSTSVM solves the following two QPPs to obtain them:

$$\min_{w_+, b_+, \xi_-} \ \frac{1}{2}\|A w_+ + e_+ b_+\|^2 + \frac{c_+}{2}\xi_-^T \xi_- \quad \text{s.t.} \ -(B w_+ + e_- b_+) + \xi_- = e_-, \tag{8}$$

$$\min_{w_-, b_-, \xi_+} \ \frac{1}{2}\|B w_- + e_- b_-\|^2 + \frac{c_-}{2}\xi_+^T \xi_+ \quad \text{s.t.} \ (A w_- + e_+ b_-) + \xi_+ = e_+. \tag{9}$$

Different from the primal problems (2) and (3) of TSVM, LSTSVM replaces the inequality constraints with equality constraints and the 1-norm of the slack variables $\xi_\pm$ with the square of the 2-norm.
By substituting the equality constraints into the objective functions in (8) and (9), we obtain two unconstrained QPPs. Supposing $G = [B \ e_-]$ and $H = [A \ e_+]$, their solutions are

$$[w_+^T \ b_+]^T = -\left(G^T G + \frac{1}{c_+} H^T H\right)^{-1} G^T e_-, \quad [w_-^T \ b_-]^T = \left(H^T H + \frac{1}{c_-} G^T G\right)^{-1} H^T e_+.$$

C. MLTSVM
For the multi-label learning problem, the training set is denoted as $T = \{(x_i, y_i) \mid i = 1, \ldots, m\}$, where $x_i \in \mathbb{R}^n$ is a training sample and $y_i = \{y_{i1}, \ldots, y_{ik}, \ldots, y_{iK}\}$ with $y_{ik} \in \{+1, -1\}$, $1 \leq k \leq K$; $m$ is the total number of training samples and $K$ is the total number of labels. MLTSVM seeks the following $K$ hyperplanes:

$$f_k(x) = w_k^T x + b_k = 0, \quad k = 1, \ldots, K.$$

Denote the samples belonging to the $k$th class by $A_k$ and the samples not belonging to the $k$th class by $B_k$. To obtain the $k$th hyperplane, the primal problem of MLTSVM is as follows:

$$\min_{w_k, b_k, \xi_{B_k}} \ \frac{1}{2}\|A_k w_k + e_{A_k} b_k\|^2 + c_k e_{B_k}^T \xi_{B_k} + \frac{\lambda_k}{2}\left(\|w_k\|^2 + b_k^2\right) \quad \text{s.t.} \ -(B_k w_k + e_{B_k} b_k) + \xi_{B_k} \geq e_{B_k}, \ \xi_{B_k} \geq 0, \tag{16}$$

where $c_k$ is the penalty parameter, $\xi_{B_k}$ is the slack variable, $\lambda_k$ is the regularization parameter, and $e_{A_k}$ ($e_{B_k}$) are vectors of ones of the proper dimensions.
By introducing the Lagrange function and using the Karush-Kuhn-Tucker (KKT) optimality conditions, the dual problem of (16) can be obtained as follows:

$$\max_{\alpha_{B_k}} \ e_{B_k}^T \alpha_{B_k} - \frac{1}{2}\alpha_{B_k}^T G (H^T H + \lambda_k I_k)^{-1} G^T \alpha_{B_k} \quad \text{s.t.} \ 0 \leq \alpha_{B_k} \leq c_k e_{B_k},$$

where $H = [A_k \ e_{A_k}]$, $G = [B_k \ e_{B_k}]$, $I_k$ is an identity matrix of proper dimensions, and $\alpha_{B_k}$ is the Lagrange multiplier. By solving the dual problem, we can obtain

$$[w_k^T \ b_k]^T = -(H^T H + \lambda_k I_k)^{-1} G^T \alpha_{B_k}.$$
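To make the computational advantage of the least squares formulation concrete, the following minimal numpy sketch (our own illustration, not the authors' code) solves the two LSTSVM linear systems from Section II-B directly. All names are ours, and the small ridge term `eps` is an added assumption for numerical stability.

```python
import numpy as np

def lstsvm_fit(A, B, c_pos=1.0, c_neg=1.0, eps=1e-8):
    """Closed-form LSTSVM: solve two linear systems instead of two QPPs.

    A: (m1, n) positive samples, B: (m2, n) negative samples.
    Returns (w_pos, b_pos), (w_neg, b_neg) for the two hyperplanes.
    """
    H = np.hstack([A, np.ones((A.shape[0], 1))])   # H = [A e+]
    G = np.hstack([B, np.ones((B.shape[0], 1))])   # G = [B e-]
    I = eps * np.eye(H.shape[1])                   # small ridge for stability

    # u+ = -(G^T G + (1/c+) H^T H)^{-1} G^T e-   (first LSTSVM solution)
    u_pos = -np.linalg.solve(G.T @ G + H.T @ H / c_pos + I,
                             G.T @ np.ones(B.shape[0]))
    # u- =  (H^T H + (1/c-) G^T G)^{-1} H^T e+   (second LSTSVM solution)
    u_neg = np.linalg.solve(H.T @ H + G.T @ G / c_neg + I,
                            H.T @ np.ones(A.shape[0]))
    return (u_pos[:-1], u_pos[-1]), (u_neg[:-1], u_neg[-1])
```

Each `np.linalg.solve` call is a single $(n+1) \times (n+1)$ linear system, which is the source of the speed advantage over iteratively solving QPPs.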

III. SS-MLLSTSVM
Consider the semi-supervised multi-label learning problem with training set $T = \{(x_i, y_i) \mid i = 1, \ldots, u\}$, where $x_i \in \mathbb{R}^n$ is a training sample and $y_i = \{y_{i1}, \ldots, y_{ik}, \ldots, y_{iK}\}$ is the label sequence of $x_i$, with

$$y_{ik} = \begin{cases} +1, & \text{if } x_i \text{ belongs to the } k\text{th class}, \\ -1, & \text{if } x_i \text{ does not belong to the } k\text{th class}, \\ 0, & \text{uncertain}, \end{cases} \quad 1 \leq k \leq K,$$

where $u$ is the total number of training samples, including labeled, unlabeled and partially labeled samples, and $K$ is the total number of labels.
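As a concrete illustration of this encoding (our own sketch; the toy values and names are hypothetical), a dataset with $u$ samples and $K$ labels can be stored as a $u \times K$ matrix over $\{+1, -1, 0\}$:

```python
import numpy as np

# Hypothetical toy example: u = 4 samples, K = 3 labels.
# +1: has the label, -1: does not have it, 0: unknown.
Y = np.array([
    [+1, -1, +1],   # fully labeled sample
    [+1,  0, -1],   # partially labeled: label 2 is unknown
    [ 0,  0,  0],   # unlabeled sample
    [-1, +1,  0],   # partially labeled
])

# For the k-th sub-classifier, A_k / B_k use only the certain entries;
# rows with y_ik == 0 still contribute to the manifold term.
k = 0
A_k_idx = np.where(Y[:, k] == +1)[0]
B_k_idx = np.where(Y[:, k] == -1)[0]
```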

A. SEMI-SUPERVISED LEARNING FRAMEWORK
To solve semi-supervised learning problems, Belkin et al. proposed a manifold regularization framework, whose objective function is expressed as follows:

$$\min_{f \in H_k} \ \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_H^2 + \gamma_I \|f\|_M^2, \tag{20}$$

where $f$ is the classification function to be solved, $H_k$ is the reproducing kernel Hilbert space (RKHS), $l$ is the number of labeled samples, and $\gamma_A$, $\gamma_I$ are trade-off parameters. The first part $V$ is the loss function over the labeled samples, the second part $\|f\|_H^2$ is a regularization term used to control the complexity of the classifier, and the third part $\|f\|_M^2$ is a manifold regularization term, which reflects the intrinsic manifold structure of the data distribution.

B. MODEL 1) LINEAR CASE
For the label $k$, SS-MLLSTSVM seeks a hyperplane

$$f_k(x) = w_k^T x + b_k = 0.$$

The second part $\|f\|_H^2$ of (20) can be expressed as

$$\|f\|_H^2 = \frac{1}{2}\left(\|w_k\|^2 + b_k^2\right).$$

The third part $\|f\|_M^2$ of (20) can be expressed as

$$\|f\|_M^2 = \frac{1}{2} f_k^T L f_k, \quad f_k = [f_k(x_1), \ldots, f_k(x_u)]^T,$$

where $L = D - W$ is the Laplacian matrix of the whole training set. $W$ is defined as follows:

$$W_{ij} = \begin{cases} 1, & \text{if } x_i \text{ and } x_j \text{ are } k\text{-nearest neighbors}, \\ 0, & \text{otherwise}, \end{cases}$$

and $D$ is the diagonal matrix defined by $D_{ii} = \sum_{j=1}^{u} W_{ij}$.

For the $k$th label, denote by $A_k$ the samples known to belong to the $k$th class ($y_{ik} = +1$), by $B_k$ the samples known not to belong to it ($y_{ik} = -1$), and by $M$ the matrix of all $u$ training samples. The primal optimization problem of the linear SS-MLLSTSVM is:

$$\min_{w_k, b_k, \xi_{B_k}} \ \frac{1}{2}\|A_k w_k + e_{A_k} b_k\|^2 + \frac{c_{k1}}{2}\xi_{B_k}^T \xi_{B_k} + \frac{c_{k2}}{2}\left(\|w_k\|^2 + b_k^2\right) + \frac{c_{k3}}{2}(M w_k + e b_k)^T L (M w_k + e b_k)$$
$$\text{s.t.} \ -(B_k w_k + e_{B_k} b_k) + \xi_{B_k} = e_{B_k}, \tag{26}$$

where $c_{ki}$ ($i = 1, 2, 3$) are the penalty parameters, $e$ is a vector of ones of dimension $u$, and $L$ is the Laplacian matrix of the whole training set. It can be observed from the optimization problem (26) that (a) similar to LSTSVM, SS-MLLSTSVM replaces the inequality constraints of MLTSVM with equality constraints and the 1-norm of the slack variables $\xi_{B_k}$ of MLTSVM with the square of the 2-norm; (b) unlike MLTSVM and LSTSVM, SS-MLLSTSVM adds a manifold regularization term in order to make full use of the information in partially labeled and unlabeled samples. Therefore, SS-MLLSTSVM can effectively solve the semi-supervised multi-label problem.

The Lagrange function of (26) is as follows:

$$\mathcal{L} = \frac{1}{2}\|A_k w_k + e_{A_k} b_k\|^2 + \frac{c_{k1}}{2}\xi_{B_k}^T \xi_{B_k} + \frac{c_{k2}}{2}\left(\|w_k\|^2 + b_k^2\right) + \frac{c_{k3}}{2}(M w_k + e b_k)^T L (M w_k + e b_k) + \alpha^T\left(B_k w_k + e_{B_k} b_k - \xi_{B_k} + e_{B_k}\right).$$

Using the KKT conditions and denoting $H = [A_k \ e_{A_k}]$, $G = [B_k \ e_{B_k}]$, $J = [M \ e]$ and $u_k = [w_k^T \ b_k]^T$, we can obtain

$$\left(H^T H + c_{k2} I + c_{k3} J^T L J\right) u_k = -G^T \alpha, \tag{28}$$
$$c_{k1} \xi_{B_k} = \alpha. \tag{29}$$

Combining (28) and (29) with the equality constraint, we can obtain

$$u_k = -c_{k1}\left(H^T H + c_{k1} G^T G + c_{k2} I + c_{k3} J^T L J\right)^{-1} G^T e_{B_k}.$$

2) NONLINEAR CASE
In this section, we extend the linear SS-MLLSTSVM to the nonlinear case using the approximate kernel-generated surface. For the nonlinear case, SS-MLLSTSVM constructs $K$ approximate kernel-generated surfaces

$$K(x^T, T^T) w_k + b_k = 0, \quad k = 1, \ldots, K,$$

where $K(\cdot, \cdot)$ is a suitable kernel function and $T$ is the matrix of all training samples.
Similar to the linear case, the second part and the third part in (20) can respectively be expressed as

$$\|f\|_H^2 = \frac{1}{2}\left(\|w_k\|^2 + b_k^2\right), \quad \|f\|_M^2 = \frac{1}{2}\left(K(M, T^T) w_k + e b_k\right)^T L \left(K(M, T^T) w_k + e b_k\right).$$

The primal optimization problem of the nonlinear SS-MLLSTSVM is as follows:

$$\min_{w_k, b_k, \xi_{B_k}} \ \frac{1}{2}\|K(A_k, T^T) w_k + e_{A_k} b_k\|^2 + \frac{c_{k1}}{2}\xi_{B_k}^T \xi_{B_k} + \frac{c_{k2}}{2}\left(\|w_k\|^2 + b_k^2\right) + \frac{c_{k3}}{2}\left(K(M, T^T) w_k + e b_k\right)^T L \left(K(M, T^T) w_k + e b_k\right)$$
$$\text{s.t.} \ -(K(B_k, T^T) w_k + e_{B_k} b_k) + \xi_{B_k} = e_{B_k}. \tag{35}$$

The Lagrange function of (35) can be constructed in the same way as in the linear case. Using the KKT conditions, we obtain conditions (37) and (38) analogous to (28) and (29). Combining (37) and (38) and denoting $E = [K(A_k, T^T) \ e_{A_k}]$, $F = [K(B_k, T^T) \ e_{B_k}]$, $J = [K(T, T^T) \ e]$ and $u_k = [w_k^T \ b_k]^T$, we can obtain

$$u_k = -c_{k1}\left(E^T E + c_{k1} F^T F + c_{k2} I + c_{k3} J^T L J\right)^{-1} F^T e_{B_k}.$$

C. DECISION FUNCTION
For a new sample $x$, if its distance to the $k$th kernel-generated surface,

$$d_k(x) = \frac{\left|K(x^T, T^T) w_k + b_k\right|}{\|w_k\|},$$

is less than or equal to a given value $\varepsilon_k$, $k = 1, \ldots, K$, the sample $x$ is assigned the $k$th label; in the linear case, $d_k(x) = |w_k^T x + b_k| / \|w_k\|$ is used instead.
To choose a proper $\varepsilon_k$, we apply the simple and effective strategy used in MLTSVM [33].
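To summarize Section III, the following numpy sketch (our own illustration under the assumptions above, not the authors' code) trains one linear SS-MLLSTSVM sub-classifier: it builds the binary k-NN graph Laplacian over all samples, solves the single linear system for $u_k$, and thresholds the distance to the hyperplane. All function and variable names are ours.

```python
import numpy as np

def knn_laplacian(X, n_neighbors=5):
    """Binary symmetric k-NN adjacency W and Laplacian L = D - W."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    np.fill_diagonal(d2, np.inf)                         # no self-neighbors
    W = np.zeros_like(d2)
    nn = np.argsort(d2, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(X.shape[0]), n_neighbors)
    W[rows, nn.ravel()] = 1.0
    W = np.maximum(W, W.T)                               # symmetrize
    return np.diag(W.sum(1)) - W

def fit_subclassifier(X, y_k, c1, c2, c3, n_neighbors=5):
    """One linear SS-MLLSTSVM sub-classifier for label k.

    X: (u, n) all training samples; y_k: (u,) entries in {+1, -1, 0}.
    Returns (w_k, b_k) from the closed-form linear system.
    """
    ones = lambda Z: np.ones((Z.shape[0], 1))
    A, B = X[y_k == +1], X[y_k == -1]
    H = np.hstack([A, ones(A)])       # H = [A_k e]
    G = np.hstack([B, ones(B)])       # G = [B_k e]
    J = np.hstack([X, ones(X)])       # J = [M e], all samples incl. unlabeled
    L = knn_laplacian(X, n_neighbors)
    lhs = H.T @ H + c1 * (G.T @ G) + c2 * np.eye(J.shape[1]) + c3 * (J.T @ L @ J)
    u = -c1 * np.linalg.solve(lhs, G.T @ np.ones(B.shape[0]))
    return u[:-1], u[-1]              # w_k, b_k

def predict_label_k(X_new, w_k, b_k, eps_k):
    """Assign label k where the distance to the hyperplane is <= eps_k."""
    d = np.abs(X_new @ w_k + b_k) / np.linalg.norm(w_k)
    return d <= eps_k
```

Note that the unlabeled rows of `X` enter only through `J` and `L`, i.e. through the manifold regularization term, exactly as in (26).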

IV. EXPERIMENTS
In this section, we present the classification results of our proposed SS-MLLSTSVM on multiple datasets. We compare SS-MLLSTSVM with BPMLL [32], Rank-SVM [28] and MLTSVM on multi-label benchmark datasets. All algorithms are implemented in MATLAB R2017b, and the experiments are run on an Intel Core i3 processor with 4 GB of RAM.

A. BENCHMARK DATASETS
In the experiments, we use five common multi-label datasets: flags, birds, emotions, yeast and scene. The datasets cover multiple fields, including image, audio, music and biology. The details of the datasets are listed in Table 1. In addition, in order to investigate the semi-supervised classification ability of our proposed algorithm, we use 50% of each dataset as labeled samples and treat the remaining samples as unlabeled.

B. EVALUATION CRITERIA
In the experiments, in order to evaluate the performance of the algorithm, we use 7 common evaluation metrics, including Hamming loss, average precision, coverage, one error, ranking loss, balanced accuracy and Kappa. Next, we will introduce the 7 evaluation metrics in detail.
Let $m$ be the total number of samples and $K$ the total number of labels. $Y_i$ and $\bar{Y}_i$ respectively denote the relevant and irrelevant label sets of sample $x_i$. The function $f(x, y)$ returns the confidence of $y$ being a correct label of sample $x$, and $rank(x, y, f)$ returns the rank of $y$ when the labels $\{y_1, \ldots, y_K\}$ are sorted in descending order of $f(x, \cdot)$. For the $k$th label, $TP_k$ is the number of samples that belong to the $k$th label and are predicted correctly; $TN_k$ is the number of samples that do not belong to the $k$th label and are not predicted to belong to it; $FP_k$ is the number of samples that do not belong to the $k$th label but are predicted to belong to it; $FN_k$ is the number of samples that belong to the $k$th label but are not predicted to belong to it.
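As an illustration of these definitions (our own sketch, with hypothetical variable names), the per-label confusion counts can be computed from a ground-truth matrix and a prediction matrix with entries in $\{+1, -1\}$:

```python
import numpy as np

def confusion_counts(Y_true, Y_pred, k):
    """TP_k, TN_k, FP_k, FN_k for label k; matrices hold +1/-1 entries."""
    t, p = Y_true[:, k] == +1, Y_pred[:, k] == +1
    TP = np.sum(t & p)
    TN = np.sum(~t & ~p)
    FP = np.sum(~t & p)
    FN = np.sum(t & ~p)
    return TP, TN, FP, FN
```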

1) HAMMING LOSS
Hamming loss measures the proportion of misclassified labels:

$$\text{HammingLoss} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{K} \left| h(x_i) \, \Delta \, Y_i \right|,$$

where $h(x_i)$ is the predicted label set of sample $x_i$ and $\Delta$ denotes the symmetric difference of two sets.
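With $\{+1, -1\}$ label matrices, this reduces to the fraction of disagreeing entries (a minimal sketch, names ours):

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of label entries where prediction and truth disagree."""
    return np.mean(Y_true != Y_pred)   # averages over both samples and labels
```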

2) COVERAGE
Coverage measures how far, on average, we need to go down the ranked label list to cover all relevant labels of a sample:

$$\text{Coverage} = \frac{1}{m} \sum_{i=1}^{m} \max_{y \in Y_i} rank(x_i, y, f) - 1.$$
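A minimal sketch of this computation (our illustration; assumes each sample has at least one relevant label):

```python
import numpy as np

def coverage(Y_true, F):
    """Y_true: (m, K) in {+1, -1}; F: (m, K) confidence scores f(x_i, y_k)."""
    ranks = (-F).argsort(axis=1).argsort(axis=1) + 1   # rank 1 = top score
    worst = np.array([ranks[i][Y_true[i] == +1].max() for i in range(len(F))])
    return np.mean(worst - 1)
```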

3) ONE ERROR
One error measures the proportion of samples whose top-ranked predicted label is not in the true label set:

$$\text{OneError} = \frac{1}{m} \sum_{i=1}^{m} \left[\!\left[ \arg\max_{y} f(x_i, y) \notin Y_i \right]\!\right],$$

where $[\![\cdot]\!]$ equals 1 if the condition holds and 0 otherwise.
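A minimal sketch (our illustration, names ours):

```python
import numpy as np

def one_error(Y_true, F):
    """Fraction of samples whose highest-scoring label is not relevant."""
    top = F.argmax(axis=1)
    return np.mean(Y_true[np.arange(len(F)), top] != +1)
```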

4) RANKING LOSS
Ranking loss measures the proportion of label pairs that are ordered incorrectly:

$$\text{RankingLoss} = \frac{1}{m} \sum_{i=1}^{m} \frac{\left| \{ (y', y'') \in Y_i \times \bar{Y}_i \mid f(x_i, y') \leq f(x_i, y'') \} \right|}{|Y_i| \, |\bar{Y}_i|}.$$
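A minimal sketch (our illustration; samples with no relevant or no irrelevant labels are skipped):

```python
import numpy as np

def ranking_loss(Y_true, F):
    """Average fraction of (relevant, irrelevant) pairs ranked in the wrong order."""
    losses = []
    for y, f in zip(Y_true, F):
        rel, irr = f[y == +1], f[y == -1]
        if len(rel) and len(irr):                     # skip degenerate samples
            losses.append(np.mean(rel[:, None] <= irr[None, :]))
    return np.mean(losses)
```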

5) AVERAGE PRECISION
Average precision measures, for each relevant label $y \in Y_i$, the proportion of relevant labels ranked at or above $y$:

$$\text{AvgPrec} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{\left| \{ y' \in Y_i \mid rank(x_i, y', f) \leq rank(x_i, y, f) \} \right|}{rank(x_i, y, f)}. \tag{47}$$
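A minimal sketch of (47) (our illustration; assumes each sample has at least one relevant label):

```python
import numpy as np

def average_precision(Y_true, F):
    """Higher is better; a perfect ranking of every relevant label gives 1.0."""
    scores = []
    for y, f in zip(Y_true, F):
        ranks = (-f).argsort().argsort() + 1          # 1-based ranks
        rel_ranks = np.sort(ranks[y == +1])
        # the j-th best relevant label has exactly j relevant labels at or above it
        scores.append(np.mean(np.arange(1, len(rel_ranks) + 1) / rel_ranks))
    return np.mean(scores)
```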

6) BALANCED ACCURACY
Balanced accuracy measures the classification performance of the classifier on imbalanced datasets:

$$\text{BalancedAccuracy} = \frac{1}{K} \sum_{k=1}^{K} \frac{TPR_k + TNR_k}{2},$$

where $TPR_k = \frac{TP_k}{TP_k + FN_k}$ and $TNR_k = \frac{TN_k}{TN_k + FP_k}$.
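A minimal sketch (our illustration; the averaging over labels matches the macro-averaged form reconstructed above):

```python
import numpy as np

def balanced_accuracy(Y_true, Y_pred):
    """Mean over labels of (TPR_k + TNR_k) / 2; robust to label imbalance."""
    accs = []
    for k in range(Y_true.shape[1]):
        t, p = Y_true[:, k] == +1, Y_pred[:, k] == +1
        tpr = np.sum(t & p) / max(np.sum(t), 1)
        tnr = np.sum(~t & ~p) / max(np.sum(~t), 1)
        accs.append((tpr + tnr) / 2)
    return np.mean(accs)
```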

7) KAPPA
The Kappa coefficient measures the consistency between the predicted results of the classifier and the actual results:

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed agreement between predictions and ground truth and $p_e$ is the agreement expected by chance; for the $k$th label,

$$p_o^k = \frac{TP_k + TN_k}{m}, \quad p_e^k = \frac{(TP_k + FP_k)(TP_k + FN_k) + (FN_k + TN_k)(FP_k + TN_k)}{m^2}. \tag{51}$$
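A minimal sketch of Cohen's kappa for one label (our illustration; averaging `kappa_per_label` over all $K$ labels to obtain a single score is our assumption about the aggregation):

```python
import numpy as np

def kappa_per_label(Y_true, Y_pred, k):
    """Cohen's kappa for label k from the binary confusion counts."""
    m = len(Y_true)
    t, p = Y_true[:, k] == +1, Y_pred[:, k] == +1
    TP, TN = np.sum(t & p), np.sum(~t & ~p)
    FP, FN = np.sum(~t & p), np.sum(t & ~p)
    p_o = (TP + TN) / m
    p_e = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / m**2
    return (p_o - p_e) / (1 - p_e)
```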

C. PARAMETER SETTING
The parameters of the classifiers have an important impact on classification performance. We use 5-fold cross-validation to select the optimal parameters. The parameters of each algorithm are set as follows. For BPMLL, the number of hidden neurons is set to 20% of the input dimension, and the number of training epochs is 100. For Rank-SVM, the kernel parameter and the penalty parameter $c$ are selected from $\{2^{-6}, \ldots, 2^{0}, \ldots, 2^{6}\}$. For MLTSVM, the penalty parameter $c_k$ and the regularization parameter $\lambda_k$ are selected from $\{2^{-6}, \ldots, 2^{0}, \ldots, 2^{6}\}$. For SS-MLLSTSVM, the penalty parameter $c_{k1}$ and the regularization parameters $c_{k2}$ and $c_{k3}$ are selected from $\{2^{-6}, \ldots, 2^{0}, \ldots, 2^{6}\}$.
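A minimal sketch of this selection procedure (our illustration; `fit` and `score` are hypothetical user-supplied callables, e.g. a training routine and an average-precision scorer):

```python
import itertools
import numpy as np

GRID = [2.0 ** p for p in range(-6, 7)]           # {2^-6, ..., 2^6}

def select_params(X, Y, fit, score, n_folds=5):
    """Pick (c1, c2, c3) maximizing the mean cross-validation score."""
    folds = np.array_split(np.random.permutation(len(X)), n_folds)
    best, best_score = None, -np.inf
    for c1, c2, c3 in itertools.product(GRID, repeat=3):
        scores = []
        for i in range(n_folds):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            model = fit(X[train], Y[train], c1, c2, c3)
            scores.append(score(model, X[test], Y[test]))
        if np.mean(scores) > best_score:
            best, best_score = (c1, c2, c3), np.mean(scores)
    return best
```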

D. RESULTS
The classification results of BPMLL, Rank-SVM, MLTSVM and our SS-MLLSTSVM on the benchmark datasets are presented in this subsection. In the experiments, we use 5-fold cross-validation to evaluate these algorithms. The mean and standard deviation over 20 rounds of 5-fold cross-validation for each metric are listed in Tables 2 to 8. From Tables 2 and 3, we can observe that our SS-MLLSTSVM is superior to all other multi-label classifiers in terms of average precision and balanced accuracy. However, from Tables 4 to 8, we can observe that no algorithm is superior to all others on all datasets in terms of coverage, Hamming loss, one error, ranking loss and Kappa. Further, we use the Friedman test to evaluate each algorithm statistically. The Friedman statistic is

$$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j} R_j^2 - \frac{k(k+1)^2}{4} \right],$$

where $R_j = \frac{1}{N} \sum_{i} r_i^j$, $r_i^j$ is the rank of the $j$th algorithm on the $i$th dataset, $k$ is the number of classifiers, and $N$ is the number of datasets. Because $\chi_F^2$ is undesirably conservative, we apply the better statistic

$$F_F = \frac{(N - 1)\chi_F^2}{N(k - 1) - \chi_F^2},$$

which follows an F-distribution with $(k - 1)$ and $(k - 1)(N - 1)$ degrees of freedom. For coverage, Hamming loss, one error, ranking loss and Kappa, we list the ranks of the different multi-label classifiers in Tables 9 to 13. From Tables 9 to 13, we can see that the average rank of our SS-MLLSTSVM is lower than those of the other algorithms; in other words, our SS-MLLSTSVM has better classification performance for these 5 metrics. We present the training time of each algorithm in Table 14.
From Table 14, we can observe that, although SS-MLLSTSVM needs to calculate the Laplacian matrix of the whole training set, it still has a higher learning speed than the other algorithms.
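As an illustration of the Friedman test described above (our own sketch; the rank matrix is a hypothetical placeholder, not the paper's results):

```python
import numpy as np

def friedman_stats(ranks):
    """ranks: (N datasets, k algorithms) matrix of per-dataset ranks r_i^j."""
    N, k = ranks.shape
    R = ranks.mean(axis=0)                       # average rank of each algorithm
    chi2_F = 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
    return chi2_F, F_F

# Hypothetical example: 5 datasets, 4 algorithms (rank 1 = best).
ranks = np.array([[1, 3, 4, 2],
                  [1, 4, 3, 2],
                  [2, 3, 4, 1],
                  [1, 4, 2, 3],
                  [1, 3, 4, 2]], dtype=float)
print(friedman_stats(ranks))
```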

E. PARAMETERS ANALYSIS
In this subsection, we investigate the influence of the parameters $c_{k1}$, $c_{k2}$ and $c_{k3}$ on the classification performance of SS-MLLSTSVM. The results are shown in Figures 1 to 7. From Figures 1 to 7, we can observe that (a) the classification metrics of SS-MLLSTSVM vary dramatically with the parameters, which means the parameters have a great influence on the classification performance of SS-MLLSTSVM; (b) for Hamming loss, balanced accuracy and Kappa, the parameter $c_{k1}$ has a strong influence while $c_{k2}$ and $c_{k3}$ have a weak influence, whereas there is no obvious difference among $c_{k1}$, $c_{k2}$ and $c_{k3}$ for the other metrics. Therefore, selecting proper parameters can help build a more reasonable classifier and improve the classification performance.

V. CONCLUSION
In this article, we propose a semi-supervised multi-label learning algorithm, named SS-MLLSTSVM. SS-MLLSTSVM introduces the least squares idea into each sub-classifier of MLTSVM to improve learning speed, and makes full use of the geometric information in unlabeled and partially labeled samples to improve generalization performance. The experimental results on the benchmark datasets indicate that, compared with popular multi-label classifiers, our SS-MLLSTSVM has better classification performance, especially on datasets that contain a large number of partially labeled and unlabeled samples. High-dimensional data also greatly affect classification performance; therefore, feature reduction for multi-label learning will be the focus of our future research.