LP-MLTSVM: Laplacian Multi-Label Twin Support Vector Machine for Semi-Supervised Classification

In machine learning jargon, multi-label classification refers to a task in which multiple mutually non-exclusive class labels are assigned to a single instance. Generally, the lack of sufficient labeled training data demanded by a classification task is addressed by an approach known as semi-supervised learning, which extracts the classification decision rules by utilizing both labeled and unlabeled data. Current semi-supervised learning methods, however, are unable to classify multi-label data accurately. Therefore, with the goal of generalizing state-of-the-art semi-supervised approaches to multi-label data, this paper proposes a novel two-stage method for multi-label semi-supervised classification. The first stage determines the label(s) of the unlabeled training data by means of a smooth graph constructed using manifold regularization. In the second stage, thanks to the capability of the twin support vector machine to relax the classical SVM requirement that the hyperplanes be parallel, we employ it to establish a multi-label classifier called LP-MLTSVM. In the experiments, this classifier is applied to benchmark datasets. The simulation results substantiate that, compared to existing multi-label classification algorithms, LP-MLTSVM shows superior performance in terms of the Hamming loss, average precision, coverage, ranking loss, and one-error metrics.


I. INTRODUCTION
Classification is a well-known task in the field of machine learning. Traditionally, machine learning techniques are divided into supervised, unsupervised and semi-supervised.
In supervised learning, one or more labels are assigned to each given data point by the intervention of a supervisor [1]. In supervised learning techniques, a model is constructed and trained on the features of the training data to predict the class labels of unseen data. Classification and regression algorithms are supervised learning algorithms. In classification, the output of the classifier is a discrete value from a predefined finite set [2], while in regression the output of the regressor is a continuous value [1], [3]. Support vector machine (SVM) [4], twin support vector machine (TWSVM) [5], and neural networks [6] are well-known examples of supervised learning. These types of learning have been used in a wide range of applications like pattern recognition [7] and text categorization [8].
With respect to the number of class labels of each data point, classification tasks are divided into single-label and multi-label. For instance, spam email filtering is a single-label classification problem where each instance has a single label. In this problem, once the classifier learns the features of a spam email, it will be able to distinguish spam from non-spam emails [9].
The multi-label learning paradigm involves a process in which several labels selected from a label set are assigned to a data instance. A variety of applications utilize multi-label learning, music categorization [10] and image annotation [11] to name a few. In practice, a piece of music or an image can be associated with several topics.
Learning from multi-label data has recently gained significant attention [12]. These efforts can be categorized into algorithm adaptation, problem transformation, and ensemble methods. The algorithm adaptation strategy extends single-label classifiers to multi-label data classification; Rank-SVM [13], BPMLL [14], MLRBF [15], MLIBLR [16], and SS-MLLSTSVM [17] belong to this category. The problem transformation strategy transforms a multi-label problem into a set of single-label sub-problems; a sub-classifier is then constructed for each sub-problem using the aforementioned single-label classifiers. Binary relevance (e.g., one-against-all) [18] and calibrated label ranking (e.g., one-against-one) [19] are representatives of this category. The ensemble strategy constructs a multi-label classifier by combining several single-label classifiers; generally, this strategy provides the most efficient solution for multi-label problems. Boosting-based methods such as AdaBoost [20] and ensembles of classifier chains [3], [21] are examples of this strategy.
Unsupervised learning algorithms deal with unlabeled data [22]. These techniques seek to recognize meaningful resemblances and patterns among data. Data clustering is a major goal of unsupervised learning: data which lie in a cluster bear maximum similarity among themselves and, at the same time, maximum dissimilarity with other clusters. K-means [23] and K-medoids [24] are two examples of this technique.
Semi-supervised learning (SSL) combines the two aforementioned learning approaches (i.e., supervised and unsupervised learning) [25]. The principal idea behind constructing a classifier by means of SSL is to exploit unlabeled data, which are abundant, alongside the few available labeled data [26]. The SSL technique establishes a model in which not only are instance labels employed, but meaningful patterns among instances are also utilized [27]. One alternative to this type of learning is simply neglecting unlabeled data [28]; however, that approach may result in overfitting [29]. As already mentioned, the objective of SSL is to utilize all available data for the construction of the intended model [26]. Currently, there are a variety of applications that deal with data exhibiting both multi-label and semi-supervised traits. Video surveillance [30], [31] and protein 3D structure prediction [32] are two well-known examples of such applications.
Generally, SSL is conducted through either inductive [33] or transductive [27] approaches. Since SSL involves both labeled and unlabeled training data, two main goals need to be achieved. The first goal is to predict the labels of test data, and the second is to predict the labels of the unlabeled training data. These two goals are reached by means of inductive and transductive learning, respectively. Let the labeled and unlabeled data be denoted by {(x_i, y_i)}_{i=1}^{L} and {x_j}_{j=L+1}^{L+U}, respectively. Inductive SSL seeks to train a function f : x → y in such a way that this function shows better performance in predicting unseen data compared to the case where the labeled data are solely exploited. Analogous to supervised learning, an established way to assess the performance of a semi-supervised classifier is to use test data {(x_k, y_k)}_{k=1}^{m} that do not participate in the training phase.
Transductive SSL trains a function f : x_{1:L+U} → y_{1:L+U} so that the labels of the unlabeled training instances can be predicted by means of this function. It is worth mentioning that the function f can only be employed on the training data set and cannot predict any data outside of this set.
Semi-supervised algorithms are grounded on the cluster assumption and the manifold assumption in leveraging unlabeled data. The cluster assumption implies that similar data have the same class label, so the decision boundary passes through a low-density region. According to the manifold assumption, data can be represented by a Graph Laplacian; this graph is utilized by a classifier that assigns the same label to similar instances. Almost all SSL algorithms are based on one or both assumptions. For example, maximum-margin semi-supervised learning methods such as the transductive support vector machine (TSVM) [34], semi-supervised SVM (S3VM) [35], LapSVM [36], and LapTSVM [37] leverage the cluster assumption. Graph-based semi-supervised classification methods like label propagation [38] and manifold regularization [39] adopt the manifold assumption.
Graph-based semi-supervised learning [40] schemes are mainly transductive, which limits their use in practical applications where predicting unseen instances is required. To tackle this limitation, recent graph-based multi-label semi-supervised schemes follow an inductive approach, among whose applications are image retrieval [41] and web spam identification [42].
The promising performance of MLTSVM [43] has led to its widespread application in the development of multi-label classifiers. MLTSVM, however, entails a number of drawbacks. It ignores unlabeled data during the classifier construction process, and consequently it shows favorable performance only in cases where data samples have a limited number of labels. Moreover, this construction process involves solving quadratic programming problems, which decelerates learning. Therefore, we propose a novel two-stage method called Laplacian Multi-Label Twin Support Vector Machine (LP-MLTSVM) that addresses the aforementioned drawbacks by utilizing both labeled and unlabeled data. In its first stage, leveraging a Graph Laplacian, an undirected weighted graph is constructed whose vertices are the training instances and whose edge weights reflect the similarity of the corresponding vertices. The second stage constructs sub-classifiers exploiting MLTSVM and a manifold regularizer [44]. In other words, the major contributions of the proposed method to address the shortcomings of MLTSVM are:
• Utilizing both labeled and unlabeled data by predicting a label (or labels) for each unlabeled instance, rather than neglecting unlabeled instances.
• Constructing sub-classifiers exploiting MLTSVM and a manifold regularizer.
• Employing successive over-relaxation (SOR) to enhance the learning speed.
Fig 1 exhibits the steps of the proposed method. The simulation results show that, compared to the existing multi-label classification algorithms, LP-MLTSVM shows superior performance in terms of Hamming loss, average precision, coverage, ranking loss, and one-error.
The rest of this paper is structured as follows: Section II presents the related works. The proposed scheme is elaborated in Section III. Section IV is devoted to results and discussion. Finally, Section V concludes the paper.

II. RELATED WORKS
Since the proposed method is based on SVM, this section briefly introduces SVM and some of its extensions.

A. SUPPORT VECTOR MACHINE (SVM)
Assume T_c = {(x_i, y_i) | i = 1, ..., m} is a set of training instances of a binary classifier, where x_i ∈ R^n and y_i ∈ {−1, 1}. Let m_1 denote the number of samples labeled +1 and, similarly, m_2 the number of samples with label −1, so that m = m_1 + m_2. Accordingly, matrices A ∈ R^{m_1×n} and B ∈ R^{m_2×n} are constructed such that their i-th rows represent data samples labeled +1 and −1, respectively.
The original SVM [4] separates the two sets of data denoted by A and B by establishing two parallel hyperplanes, obtained from the following equations:

w^T x + b = +1,    w^T x + b = −1,

where b ∈ R, w ∈ R^n, and e is an appropriate vector of ones. This approach seeks to establish a trade-off between the misclassification error and the decision margin. This trade-off can be formulated as the following optimization problem:

min_{w, b, ξ}  (1/2) ||w||^2 + C Σ_{i=1}^{m} ξ_i
s.t.  y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., m,     (4)

where ξ_i and C are the slack variables and a user-defined penalty factor, respectively. The Wolfe dual of (4) can be expressed as

max_α  Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j x_i^T x_j
s.t.  Σ_{i=1}^{m} α_i y_i = 0,  0 ≤ α_i ≤ C.     (5)

Let α* = (α*_1, ..., α*_m) be the optimal solution of (5). The Karush-Kuhn-Tucker (KKT) conditions yield a hyperplane represented as

f(x) = Σ_{i=1}^{N_SV} α*_i y_i x_i^T x + b* = 0.

Here x_i and α*_i are the data samples used as support vectors and their respective multipliers, and N_SV denotes the number of support vectors holding 0 < α_i < C. A new sample is labeled +1 or −1 according to the sign of f(x).
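As a concrete illustration of the KKT-derived decision function above, the following minimal sketch evaluates f(x) from a set of support vectors; the support vectors, multipliers, and bias are hypothetical toy values, not the output of an actual SVM solver.

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b):
    """Evaluate f(x) = sum_i alpha_i * y_i * <x_i, x> + b over the support vectors."""
    return sum(a * y * np.dot(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b

# Hypothetical toy support set: one support vector per class.
svs = np.array([[1.0, 1.0], [-1.0, -1.0]])
alphas = np.array([0.5, 0.5])
labels = np.array([+1, -1])
b = 0.0

# The sign of f(x) gives the predicted class.
print(np.sign(svm_decision(np.array([2.0, 2.0]), svs, alphas, labels, b)))    # +1 side
print(np.sign(svm_decision(np.array([-2.0, -2.0]), svs, alphas, labels, b)))  # -1 side
```

In a real SVM, the α*_i and b* would come from solving the dual problem (5); here they are fixed by hand purely to show how the decision function is assembled.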

B. TWIN SUPPORT VECTOR MACHINE (TWSVM)
TWSVM [5] has been proposed for binary data classification. It relaxes the requirement of the classical SVM that the hyperplanes be parallel: it establishes two non-parallel hyperplanes in such a way that each of them is as close as possible to one class and as far as possible from the other. Unlike the standard SVM, which solves one large quadratic programming problem (QPP), TWSVM solves two reduced-dimensional QPPs, and accordingly it performs faster than SVM. Indeed, each of these reduced problems follows the formulation of the standard SVM, where each data sample appears in the constraints of only one problem. The two hyperplanes are

w_+^T x + b_+ = 0,     (8)
w_−^T x + b_− = 0,     (9)

where w_+, w_− ∈ R^n, b_+, b_− ∈ R, and e represents a vector of ones of proper size. The construction of the hyperplanes in TWSVM involves solving the following optimization problems:

min_{w_+, b_+, ξ}  (1/2) ||A w_+ + e_+ b_+||^2 + C_1 e_−^T ξ
s.t.  −(B w_+ + e_− b_+) + ξ ≥ e_−,  ξ ≥ 0,     (10)

min_{w_−, b_−, η}  (1/2) ||B w_− + e_− b_−||^2 + C_2 e_+^T η
s.t.  (A w_− + e_+ b_−) + η ≥ e_+,  η ≥ 0,     (11)
where C_1, C_2 ≥ 0 are the error penalty parameters and e_−, e_+ are vectors of ones of proper size. Introducing the Lagrange multiplier vectors α and β and applying some algebra to (10) and (11) yields the dual problems

max_α  e_−^T α − (1/2) α^T G (H^T H)^{−1} G^T α,  s.t.  0 ≤ α ≤ C_1 e_−,     (12)
max_β  e_+^T β − (1/2) β^T H (G^T G)^{−1} H^T β,  s.t.  0 ≤ β ≤ C_2 e_+,     (13)

with the following relations:

H = [A  e_+],    G = [B  e_−].

The non-parallel hyperplanes (8), (9) are obtained from the solutions α, β of (12), (13) by

[w_+^T, b_+]^T = −(H^T H)^{−1} G^T α,    [w_−^T, b_−]^T = (G^T G)^{−1} H^T β.

Depending on which of the two hyperplanes it is closer to, a new data point x is labeled +1 or −1, that is,

class(x) = arg min_{k ∈ {+,−}}  |w_k^T x + b_k| / ||w_k||,

where |·| gives the distance of the point x from the plane w_k^T x + b_k = 0.

C. MULTI-LABEL CLASSIFICATION
Let us consider a multi-label classification problem in R^n consisting of K classes [45]. Suppose X ⊂ R^n and Y are the input space and label set, respectively. A multi-label task involves constructing a decision function h(·) : X → 2^Y according to the training samples, where x_j ∈ X denotes a training sample and y_j ∈ Y is its associated label vector, expressed as y_j = [y_{j1}, ..., y_{jk}, ..., y_{jK}]^T, with y_{jk} = 1 if the j-th sample carries label k and y_{jk} = −1 otherwise. In this problem, K non-parallel hyperplanes represented as f_k(x) = w_k^T x + b_k = 0 are established such that the k-th hyperplane is as close as possible to the samples of class k while being as far as possible from the samples of the other classes. Here, w_k and b_k are the normal vector and bias term of the k-th hyperplane, respectively. The classifier h(x) ⊆ Y predicts the label set of an unseen instance x ∈ X.
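Returning to the TWSVM decision rule described above (assign a point to the class of its nearer non-parallel hyperplane), the rule can be sketched as follows; the two hyperplanes are hypothetical hand-picked values, not the result of solving (10)-(13).

```python
import numpy as np

def twsvm_predict(x, w_pos, b_pos, w_neg, b_neg):
    """Assign +1 or -1 according to which non-parallel hyperplane x is closer to."""
    d_pos = abs(np.dot(w_pos, x) + b_pos) / np.linalg.norm(w_pos)
    d_neg = abs(np.dot(w_neg, x) + b_neg) / np.linalg.norm(w_neg)
    return +1 if d_pos <= d_neg else -1

# Hypothetical hyperplanes: x2 = 0 for the +1 class, x2 = 2 for the -1 class.
w_pos, b_pos = np.array([0.0, 1.0]), 0.0
w_neg, b_neg = np.array([0.0, 1.0]), -2.0

print(twsvm_predict(np.array([1.0, 0.3]), w_pos, b_pos, w_neg, b_neg))  # 1
print(twsvm_predict(np.array([0.0, 1.8]), w_pos, b_pos, w_neg, b_neg))  # -1
```

The multi-label case generalizes this rule to K hyperplanes, assigning each label whose proximal hyperplane is sufficiently close.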
D. MANIFOLD REGULARIZATION (MR)
Let (x_1, ..., x_L, x_{L+1}, ..., x_{L+U}) be the mixture of labeled and unlabeled data, where x_i ∈ R^n and the data are drawn from a marginal probability distribution P_x. In MR, the data are assumed to lie on a Riemannian manifold [39], such that geometrically closer data points are more likely to share the same label. Generally, the data are stored in a matrix M ∈ R^{(L+U)×n}. Applying MR requires representing the training samples as a weighted graph whose vertices are the samples [43]. The adjacency matrix of this graph is denoted by W, whose elements capture the similarity of the training samples. Accordingly, the element in the i-th row and j-th column is calculated as

w_ij = exp( −||x_i − x_j||^2 / (2σ^2) ),     (20)

where ||x_i − x_j|| denotes the Euclidean distance between samples x_i and x_j, and σ is the bandwidth parameter that controls the decreasing rate of the weight.
The Graph Laplacian L is obtained as L = D − W, where D and W are two (L+U) × (L+U) matrices. W is the symmetric adjacency matrix constructed according to (20), and D is the diagonal degree matrix whose elements are defined as D_ii = Σ_{j=1}^{L+U} w_ij. In the Graph Laplacian, nodes are sorted in such a way that the labeled nodes come first, followed by the unlabeled ones. This matrix can then be partitioned into sub-matrices as follows:

L = [ L_ll  L_lu ; L_ul  L_uu ].

According to a Graph Laplacian L, a prediction function f = [f_l^T, f_u^T]^T can be articulated over the graph nodes. On the labeled instances, f takes the respective labels of these instances; the value on each unlabeled instance is a weighted average of its neighbors, which can be obtained from the following optimization problem:

min_f  f^T L f   s.t.  f_l = y_l.     (24)

The function f generates values in the range [−1, 1]. By defining the threshold value as 0, the generated values of this function can be mapped to the labels of the unlabeled instances. The optimization problem (24) can be reduced to

min_{f_u}  f_u^T L_uu f_u + 2 f_l^T L_lu f_u,

and solving this problem yields

f_u = −L_uu^{−1} L_ul f_l.

For the sake of clarification, an example is provided. Let the graph of Fig 2 represent training data whose nodes are numbered from left to right. Among these nodes, only those numbered 1 and 7 have labels, +1 and −1 respectively, and the other nodes are unlabeled. By moving the nodes associated with labeled instances to the front of this graph, the order of nodes becomes (1, 7, 2, 3, 4, 5, 6). Applying (26) and (27) then yields the predicted labels of the unlabeled nodes. This example shows that in the first stage of the proposed algorithm, the labels of the unlabeled training data are predicted by leveraging a Graph Laplacian. In the second stage, utilizing TWSVM, a multi-label semi-supervised classifier called LP-MLTSVM is established through which the labels of unseen data can be predicted.
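The first-stage label propagation just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: a chain of seven one-dimensional points stands in for the Fig 2 example, with the endpoints labeled +1 and −1, and the harmonic solution f_u = −L_uu^{−1} L_ul f_l fills in the rest.

```python
import numpy as np

def gaussian_weights(X, sigma):
    """Adjacency W with w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), as in Eq. (20)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def propagate_labels(W, labeled_idx, y_labeled):
    """Solve f_u = -L_uu^{-1} L_ul y_l on the Graph Laplacian L = D - W."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    u = [i for i in range(n) if i not in labeled_idx]          # unlabeled node indices
    Luu = L[np.ix_(u, u)]
    Lul = L[np.ix_(u, labeled_idx)]
    f_u = -np.linalg.solve(Luu, Lul @ np.asarray(y_labeled, float))
    return u, f_u

# Hypothetical 1-D chain of 7 points; only the two endpoints carry labels +1 and -1.
X = np.arange(7, dtype=float)[:, None]
u, f_u = propagate_labels(gaussian_weights(X, sigma=1.0), [0, 6], [+1, -1])
print(np.round(f_u, 3))  # interior nodes interpolate between the endpoint labels
```

Thresholding f_u at 0, as in the text, converts the propagated real values into labels for the unlabeled nodes.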

E. SS-MLLSTSVM
The semi-supervised multi-label least squares twin support vector machine (SS-MLLSTSVM) [17] employs the least squares concept to increase the learning speed of the MLTSVM. Unlike MLTSVM, it uses manifold regularization to fully utilize both labeled and unlabeled data. SS-MLLSTSVM can be expressed as the following optimization problem, where c_ki are the penalty parameters, ξ_{B_k} is the slack variable, and L is the Laplacian matrix. The samples belonging to the k-th class are denoted by A_k and the samples not belonging to the k-th class by B_k. The linear SS-MLLSTSVM is extended to the nonlinear case using an approximate kernel-generating surface; the optimization problem of the nonlinear SS-MLLSTSVM is formulated analogously, where K(·,·) is a suitable kernel function.

III. LAPLACIAN MULTI-LABEL TWIN SUPPORT VECTOR MACHINE FOR SEMI-SUPERVISED CLASSIFICATION
TWSVM supposes that each sample can take only one label [5]; in practice, however, samples can have multiple labels. This paper extends TWSVM to multi-label problems.

A. LINEAR LP-MLTSVM
Inspired by TWSVM, the proposed scheme employs a squared loss function and a hinge loss function as the error terms in (32) and (33), where A_k represents the data samples carrying label k and Ā_k the remaining data samples. The decision functions of these problems can be expressed as

f_k(x) = w_k^T x + b_k,    f̄_k(x) = w̄_k^T x + b̄_k.

Also, the regularization terms ||f_k||^2_H and ||f̄_k||^2_H are defined accordingly, and the manifold regularizations are

||f_k||^2_M = (M w_k + e b_k)^T L (M w_k + e b_k),    ||f̄_k||^2_M = (M w̄_k + e b̄_k)^T L (M w̄_k + e b̄_k),

where w_ij is an element of the data adjacency matrix in which a higher similarity between x_i and x_j leads to a larger w_ij, L is the Graph Laplacian, M ∈ R^{(L+U)×n} consists of all labeled and unlabeled data, and e is a vector of ones of appropriate size. For a kernel function k(·,·) associated with a reproducing kernel Hilbert space H_k, the decision function can be obtained by minimizing

min_{f ∈ H_k}  Σ_i V(x_i, y_i, f) + γ_H ||f||^2_H + γ_M ||f||^2_M,     (40)

where f is the unknown decision function, V represents a loss function on the labeled data, γ_H is the weight of ||f||^2_H and controls the complexity of f in the reproducing kernel Hilbert space, γ_M is the weight of ||f||^2_M and controls the complexity of the function in the intrinsic geometry of the marginal distribution, and ||f||^2_M penalizes f along the Riemannian manifold M. Based on (40), the primal problems of linear LP-MLTSVM can be written as

Equation (44) can be rewritten in terms of the matrix H^T H. The resulting problem (49) is a convex optimization problem that can be solved by a quadratic programming technique to yield α; it is relaxed as (50). Predicting a label for an unseen data point x ∈ R^n then involves measuring its distance to each proximal hyperplane: if a new sample x is close enough to the proximal hyperplane of class k, the corresponding label is assigned to this sample.
We summarize the steps of the successive over-relaxation (SOR) [48] employed in LP-MLTSVM in Algorithm 1. This algorithm obtains the α used in the LP-MLTSVM classifier, as presented in Algorithm 2.

Input:
The penalty parameter C_k, the relaxation factor ω ∈ (0, 2), and the matrix Q defined by (50).

Output:
The optimal solution α_k for the problem.
1: Initialize i = 0 and start with any α_k^0 ∈ R^{L+U}
2: Split Q = L + D + L^T, where L is the strictly lower triangular part and D is the diagonal part of Q
3: repeat
4: Compute α^{i+1} = α^i + ω Δα, where Δα is given by the SOR update rule, and increment i
5: until ||α^{i+1} − α^i|| < 10^{−6}
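As an illustration of the SOR sweep in Algorithm 1, the following sketch performs a coordinate-wise Gauss-Seidel-style update with relaxation factor ω and clips each coordinate into the box [0, C] (a Mangasarian-style SOR for the box-constrained dual min ½αᵀQα − eᵀα, 0 ≤ α ≤ C, which is assumed here). The matrix Q and penalty C below are hypothetical.

```python
import numpy as np

def sor_qp(Q, C, omega=1.0, tol=1e-6, max_iter=1000):
    """SOR for: min 0.5 a^T Q a - e^T a  subject to  0 <= a <= C.

    Sweeps coordinates in order, taking a relaxed step against the
    gradient and clipping into the box, until iterates stop changing.
    """
    n = Q.shape[0]
    a = np.zeros(n)
    for _ in range(max_iter):
        a_old = a.copy()
        for j in range(n):
            grad_j = Q[j] @ a - 1.0                      # partial derivative at coordinate j
            a[j] = min(max(a[j] - omega * grad_j / Q[j, j], 0.0), C)
        if np.linalg.norm(a - a_old) < tol:              # stopping rule of Algorithm 1
            break
    return a

# Hypothetical well-conditioned Q; the unconstrained minimizer is Q^{-1} e.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
alpha = sor_qp(Q, C=10.0)
print(np.round(alpha, 4))  # ≈ [0.2857 0.8571], matching Q^{-1} e
```

With a tight box (e.g., C = 0.5) the same sweep converges instead to the clipped KKT point, which is the behavior the dual of (49)-(50) relies on.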

B. NONLINEAR LP-MLTSVM
Now we extend the linear LP-MLTSVM to the nonlinear case. As in the linear case, the loss terms V_k(x_i, y_j, f_k) and V̄_k(x_i, y_j, f̄_k) can be expressed as (32) and (33). The decision function is written in terms of a nonlinear mapping φ(·) from the low-dimensional input space to a higher-dimensional Hilbert space H. According to Hilbert space theory, w_k and w̄_k can be expressed as linear combinations of the mapped training samples. For the nonlinear case, LP-MLTSVM constructs K approximate kernel-generating surfaces:

K(x^T, M^T) λ_k + b_k = 0,  k = 1, ..., K,

where K(·,·) is a suitable kernel function. By means of the kernel matrix K and the relevant coefficients λ_k, λ̄_k, the regularization terms ||f_k||^2_H and ||f̄_k||^2_H can be expressed in kernel form. For manifold regularization, ||f_k||^2_M can be written on the basis of the Graph Laplacian as (59). Thus, the nonlinear optimization problems are expressed as (60). Defining the Lagrangian corresponding to problem (60), the dual problem can be formulated as (62). From (62), we obtain (63), and combining (63) and (64) leads to (65). Introducing the augmented vector ρ_k = [λ_k, b_k]^T, (65) can be rewritten compactly, so the Wolfe dual of problem (60) is formulated as (68). Once the vector ρ_k is obtained from (68), a new data point x ∈ R^n may be assigned to class k in a manner similar to the linear case.

IV. EXPERIMENTAL RESULTS
This section is devoted to the evaluation of the proposed scheme. We compare it with MLTSVM, Rank-SVM, BPMLL, and SS-MLLSTSVM using synthetic and real datasets. All synthetic data are generated by Mldatagen according to some predefined parameters, such as the generating shape (hyperspheres or hypercubes), the numbers of relevant, irrelevant, and redundant features, the number of instances, and the number of labels. We add 5% label noise to each instance to make the learning task more arduous. Table 1 summarizes the specifications of the synthetic datasets. The real datasets Emotion, Yeast, Scene, Medical, Flags, and Birds are widely used for evaluating multi-label learning methods; all real-world datasets are summarized in Table 2. It is worth mentioning that these datasets are obtained from UCI [49].
The RBF kernel, expressed as K(x_i, x_j) = exp( −||x_i − x_j||^2 / (2σ^2) ), is utilized to evaluate the proposed scheme. Its only parameter is σ.
Moreover, the parameters C, λ, and γ used by this scheme need to be determined optimally. Ten-fold cross-validation is employed to determine the parameters: the best value of each parameter is selected from the range {2^i | i = −5, ..., 5}. It should be mentioned that this procedure is applied to all datasets, and the best parameter values obtained are reported in Table 3. For Rank-SVM, the kernel function parameter and penalty parameter C are selected from {2^{−6}, ..., 2^0, ..., 2^6}. For BPMLL, the number of hidden neurons is {5%, 10%, ..., 25%} of the number of input neurons, and the number of training epochs is set to 100. For SS-MLLSTSVM, the penalty parameters and regularization parameters are selected from {2^{−6}, ..., 2^0, ..., 2^6}.
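The parameter search just described can be sketched as a generic k-fold grid search. This is illustrative only: the scoring function below is a hypothetical stand-in for training a classifier on the training folds and evaluating it on the held-out fold.

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Split n sample indices into k roughly equal folds."""
    idx = np.random.RandomState(seed).permutation(n)
    return np.array_split(idx, k)

def grid_search(score_fn, grid, n_samples, k=10):
    """Return the grid value maximizing the mean k-fold score of score_fn(param, train, test)."""
    folds = kfold_indices(n_samples, k)
    best, best_score = None, -np.inf
    for param in grid:
        scores = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(score_fn(param, train, test))
        mean = float(np.mean(scores))
        if mean > best_score:
            best, best_score = param, mean
    return best, best_score

# Grid {2^i | i = -5..5}, with a hypothetical scorer that peaks at C = 2^0 = 1.
grid = [2.0 ** i for i in range(-5, 6)]
best, _ = grid_search(lambda p, tr, te: -abs(np.log2(p)), grid, n_samples=100)
print(best)  # 1.0
```

In the actual experiments, the scorer would fit LP-MLTSVM on the training folds and return a validation metric such as Hamming loss (negated, so that larger is better).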
Our algorithm code is written in MATLAB 2013 and run on a PC with an Intel Core i5 processor and 2GB RAM. Hamming loss, average precision, coverage, ranking loss, and one-error are the comparison metrics.

A. HAMMING loss (Hloss)
Hamming loss calculates the fraction of instance-label pairs that are misclassified between the predicted label set h(x) and the ground-truth set Y, where Δ stands for the symmetric difference of the two sets.
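The metric can be computed as the mean of the element-wise symmetric difference between binary label matrices; the small sketch below uses hypothetical predictions (two instances, three labels).

```python
import numpy as np

def hamming_loss(Y_pred, Y_true):
    """Fraction of misclassified instance-label pairs: |symmetric difference| / (n * K)."""
    Y_pred, Y_true = np.asarray(Y_pred, bool), np.asarray(Y_true, bool)
    return float(np.mean(Y_pred ^ Y_true))  # XOR marks each mismatched pair

# Hypothetical 2 instances x 3 labels: exactly one wrong pair out of six.
print(hamming_loss([[1, 0, 1], [0, 1, 0]],
                   [[1, 0, 0], [0, 1, 0]]))  # 0.1666...
```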
A smaller value of Hloss indicates better performance of a scheme. Table 4 and Fig 3 represent the performance of the proposed scheme in terms of Hamming loss.

B. AVERAGE PRECISION (Avepre)
Average precision evaluates the average fraction of relevant labels ranked above a particular relevant label y ∈ Y_i, as formulated in (71), shown at the bottom of the next page. A larger value of this metric indicates better performance of the scheme. The performance of the proposed scheme in terms of Avepre is demonstrated in Table 5 and Fig 4.
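The definition can be sketched as follows: for each relevant label, take the fraction of labels ranked at or above it that are also relevant, then average over relevant labels and over instances. The score vector below is a hypothetical one-instance example.

```python
import numpy as np

def average_precision(scores, Y):
    """Mean, over relevant labels, of the fraction of equally-or-higher-ranked relevant labels."""
    total = 0.0
    for s, y in zip(np.asarray(scores, float), np.asarray(Y, bool)):
        order = np.argsort(-s)                          # labels ranked by decreasing score
        rank = np.empty_like(order)
        rank[order] = np.arange(1, len(s) + 1)          # rank 1 = top-ranked label
        rel = rank[y]                                   # ranks of the relevant labels
        total += np.mean([(rel <= r).sum() / r for r in rel])
    return total / len(Y)

# Hypothetical instance: relevant labels sit at ranks 1 and 3 -> (1/1 + 2/3) / 2.
print(average_precision([[0.9, 0.2, 0.5]], [[1, 1, 0]]))  # ≈ 0.8333
```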

C. COVERAGE (Cov)
Coverage evaluates how far we need, on average, to go down the ranked list of labels in order to cover all the possible labels of the instance: The smaller this metric is, the higher the performance of the algorithm will be. Table 6 and Fig 5 represent the performance of the algorithm in terms of this metric.
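Operationally, coverage is the (zero-based) depth of the worst-ranked relevant label, averaged over instances; the sketch below uses a hypothetical one-instance score vector.

```python
import numpy as np

def coverage(scores, Y):
    """Average depth down the ranked label list needed to cover all relevant labels."""
    covs = []
    for s, y in zip(np.asarray(scores, float), np.asarray(Y, bool)):
        order = np.argsort(-s)                  # labels ranked by decreasing score
        rank = np.empty_like(order)
        rank[order] = np.arange(1, len(s) + 1)
        covs.append(rank[y].max() - 1)          # deepest relevant rank, zero-based
    return float(np.mean(covs))

# Hypothetical instance: deepest relevant label sits at rank 3 -> coverage 2.
print(coverage([[0.9, 0.2, 0.5]], [[1, 1, 0]]))  # 2.0
```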

D. RANKING LOSS (Rloss)
Ranking loss evaluates the average fraction of label pairs that are reversely ordered. Let ȳ be the complementary set of y in Y. The smaller the value of this metric, the better the performance of the algorithm. Table 7 and Fig 6 represent the performance of the algorithm in terms of this metric.
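Concretely, for each instance the metric counts the (relevant, irrelevant) label pairs in which the irrelevant label is scored at least as high, normalized by the number of such pairs; the example values are hypothetical.

```python
import numpy as np

def ranking_loss(scores, Y):
    """Average fraction of (relevant, irrelevant) label pairs that are reversely ordered."""
    losses = []
    for s, y in zip(np.asarray(scores, float), np.asarray(Y, bool)):
        rel, irr = s[y], s[~y]                           # scores of y and its complement
        pairs = [(r <= i) for r in rel for i in irr]     # True = inverted pair
        losses.append(np.mean(pairs) if pairs else 0.0)
    return float(np.mean(losses))

# Hypothetical instance: one of the two (relevant, irrelevant) pairs is inverted.
print(ranking_loss([[0.9, 0.2, 0.5]], [[1, 1, 0]]))  # 0.5
```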

E. ONE-ERROR (Oerr)
One-error evaluates the number of times the top-ranked label is not in the set of proper labels of the instance. The smaller the value of this metric, the better the performance of the algorithm. Table 8 and Fig 7 illustrate the performance of the algorithm in terms of this metric.
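The metric reduces to checking, per instance, whether the argmax of the score vector indexes a relevant label; the two-instance example below is hypothetical.

```python
import numpy as np

def one_error(scores, Y):
    """Fraction of instances whose top-ranked label is not a relevant label."""
    scores, Y = np.asarray(scores, float), np.asarray(Y, bool)
    top = scores.argmax(axis=1)                          # index of top-ranked label
    return float(np.mean(~Y[np.arange(len(Y)), top]))    # 1 where the top label is wrong

# Hypothetical case: first instance's top label is relevant, second's is not.
print(one_error([[0.9, 0.1], [0.8, 0.3]],
                [[1, 0], [0, 1]]))  # 0.5
```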

F. DISCUSSION
The LP-MLTSVM algorithm suffers from a high time complexity in the training phase because it needs to construct a Graph Laplacian to determine the labels of the unlabeled instances. This shortcoming, however, does not make the approach impractical, as the learning phase needs to be conducted only once.
According to the results provided in Table 4, we can observe that the decision boundary established by the proposed algorithm passes only through the low-density region of the feature space and does not cross the unlabeled data instances. It is imperative that the weights assigned to the edges of the graph be smooth, without abrupt changes, since the weights reflect the similarity between instances. Equation (20) implies that if two instances are connected by a high-weight edge, these two instances share the same labels.
Construction of a competent graph depends on a deep insight into the problem domain as well as on defining appropriate distance functions and the parameters employed in Table 3. A label predictor function needs to be designed in such a way that 1) the Euclidean distance is not considered, and 2) the whole graph is smooth.
In the proposed algorithm, these two criteria are satisfied by employing MR as indicated in (58) and (59), which results in the higher accuracy reported in Table 4. We list the ranks of the different multi-label classification methods in terms of Hamming loss, average precision, coverage, ranking loss, and one-error in Tables 9 to 13, respectively. We can observe that the proposed LP-MLTSVM outperforms the other algorithms in all metrics.
According to the average results provided in Tables 9 to 13, the proposed method demonstrates higher performance in all metrics except average precision and ranking loss.
We conduct a Bonferroni-Dunn analysis to determine whether there is a significant difference between the proposed approach and the compared ones in terms of the comparison metrics. Tables 14 to 18 present the results of this analysis, according to which we can conclude that the proposed approach significantly outperforms the others in all metrics except coverage.

V. CONCLUSION
This paper aimed to leverage a large amount of unlabeled data along with a limited number of labeled data to increase the classifier's precision. The main motivation for employing semi-supervised learning stems from the fact that the amount of available labeled data is generally limited, while unlabeled data are prevalent. Accordingly, inspired by TWSVM, this paper proposes a semi-supervised learning scheme called LP-MLTSVM for the classification of multi-label data.
This scheme provides a classification model with a significant degree of precision, which can classify data more precisely than previous works. The proposed scheme is grounded on manifold theory via the Graph Laplacian. Training data constitute the vertices of this graph, and the weight of an edge between two vertices reflects their similarity. In other words, the more similar the two vertices are, the higher the weight of their connecting edge.
We applied the proposed scheme to several standard datasets to compare its performance with MLTSVM, Rank-SVM, BPMLL, and SS-MLLSTSVM based on the performance metrics. The evaluation results demonstrate the outstanding performance of LP-MLTSVM compared with other works. As future work, this scheme might be extended to structural learning problems.