Transductive Nonnegative Matrix Tri-Factorization

Nonnegative matrix factorization (NMF) decomposes a nonnegative matrix into the product of two lower-rank nonnegative matrices. Since NMF learns a parts-based representation, it has been widely used as a feature learning component in many fields. However, standard NMF algorithms ignore the training labels as well as the unlabeled data in the test domain. In this paper, we propose a transductive nonnegative matrix tri-factorization method (T-NMTF) to simultaneously exploit the label information of training examples and the statistical structure of features in the test domain. Different from standard NMF, nonnegative matrix tri-factorization (NMTF) decomposes a nonnegative matrix into the product of three lower-rank nonnegative matrices, and thus provides a flexible framework for transducing the discriminative information of training examples to test examples. In particular, the proposed T-NMTF projects both training and test examples into a unified subspace, and encourages the coefficients of training examples to be close to their label vectors. Since training and test examples are assumed to be identically distributed, it is reasonable to expect the learned coefficients of test examples to approximate their label vectors well. To estimate the T-NMTF parameters, we develop an efficient multiplicative update rule and prove its convergence. In addition, we propose a manifold regularized T-NMTF (MT-NMTF) algorithm that exploits the local geometric structure of the dataset to boost discriminant power. Experimental results on face recognition demonstrate the effectiveness of both T-NMTF and MT-NMTF.


I. INTRODUCTION
Nonnegative matrix factorization (NMF) [1] decomposes a nonnegative matrix into the product of two lower-rank nonnegative matrices. Due to the non-negativity constraints imposed on both factor matrices, NMF learns a parts-based representation, for which there is psychological and physiological evidence in the human brain [2]. Moreover, NMF, like Principal Component Analysis (PCA) [3] and Singular Value Decomposition (SVD) [4], is a powerful dimension reduction method because the obtained factor matrices reduce the storage requirement and save the computational cost of subsequent processing. Based on them, Lai et al. [5] proposed a unified sparse learning framework for dimensionality reduction.
The associate editor coordinating the review of this manuscript and approving it for publication was Chao Tong.
In [6], a discriminative low-rank preserving projection (DLRPP) algorithm was introduced to learn an optimal projection matrix for data dimensionality reduction. In [7], a structured optimal graph based sparse feature extraction method was proposed for dimensionality reduction. Among these methods, NMF enforces the resulting matrix factors to be nonnegative, which has attracted a large amount of attention; it has been widely used in many fields such as information retrieval [1], pattern recognition [8], data mining [9], and computational biology [10].
Most NMF methods focus on the 2-factor factorization. In this framework, both theoretical and algorithmic aspects of NMF have been extensively studied. On the theoretical side, Donoho and Stodden [11] studied the uniqueness of the NMF solution and concluded, based on convex duality, that the NMF solution is not unique unless one of the factor matrices is orthogonal or non-overlapping. Vavasis [12] proved that exact NMF, which exactly reconstructs the dataset, is NP-hard. On the algorithmic side, Lee and Seung proposed an effective multiplicative update rule (MUR) to solve the optimization problem in NMF in their seminal work [13]. To resolve the convergence issues of MUR, the alternating nonnegative least squares (ANLS) framework was introduced with good convergence properties [14], [15]. Nonnegative matrix tri-factorization (NMTF), which seeks a 3-factor decomposition, has become an emerging tool for co-clustering. NMTF was first proposed in [16] to co-cluster the rows and columns of an input data matrix. Under orthogonality constraints, NMTF was shown to have a unique decomposition and can be efficiently solved by a new multiplicative update algorithm [17], [18]. Due to its encouraging empirical results, it has been further investigated and extended for many applications. Wang et al. [19]-[21] decoupled the NMTF and studied its fast implementation by constraining the factor matrices to be cluster indicators. Chakraborty and Sycara [22] proposed a graph regularized NMTF for community detection that incorporates social relations and user generated content in the decomposition. Li et al. [23] introduced a symmetric NMTF to explore the dual relation and the bilateral information between samples and features.
Since both NMF and NMTF were proposed for unsupervised learning, the decomposition takes into account neither the label information in the training set nor the unlabeled data in the test set. To make use of label information, a series of semi-supervised NMF methods were proposed [24]-[29]. Zafeiriou et al. [24] proposed the discriminant NMF method (DNMF) by incorporating Fisher's criterion in NMF and applied it to frontal face verification. Guan et al. [25] proposed manifold regularized discriminative NMF (MD-NMF) to exploit more effective margin-based discriminative information for subsequent data representation. To make use of unlabeled test data, several transductive NMF methods that decompose the training and test examples simultaneously were proposed [30]-[32]. Cho and Saul [30] proposed to use a pre-trained SVM classifier to learn the discriminative support vectors and apply them in the decomposition. Liu et al. [31] introduced a label matrix (obtained from the given labels) to constrain the decomposition, which cannot handle the case where only one label is given for each class. However, how to incorporate training labels and unlabeled test data in NMTF remains underexplored.
In this paper, we propose a new transductive nonnegative matrix tri-factorization method (T-NMTF) to simultaneously utilize the labels of training examples and exploit the statistical structure of test examples in the framework of NMTF. Different from previous transductive learning methods [33]-[35], T-NMTF transduces the discriminative information of training examples to test examples within a unified nonnegative subspace. The remainder of this paper is organized as follows: Section 2 briefly reviews NMF and its variants. Section 3 explains T-NMTF in detail. We compare T-NMTF with other semi-supervised NMF methods in Section 4. Section 5 validates the effectiveness of the proposed methods on face recognition, and Section 6 concludes this paper.

II. RELATED WORKS
A. REVIEW OF MORE NMF METHODS
NMF decomposes a nonnegative matrix into the product of two nonnegative matrices. The study of NMF has two popular branches: the first concerns the uniqueness of the decomposition, and the second concerns NMF with label information. For the first aspect, as shown by many works, the decomposition itself is not unique. For example, given the decomposition V = WH, for any positive diagonal matrix D, the decomposition V = (WD)(D^{-1}H) also satisfies the nonnegativity constraints, since both WD ≥ 0 and D^{-1}H ≥ 0. To remove this ambiguity and drive the decomposition in the intended direction, many methods impose task-oriented constraints on the decomposition and achieve better performance.
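This scaling ambiguity is easy to verify numerically. The toy snippet below (our own example, not from the paper) shows that rescaling the factors by a positive diagonal matrix leaves the product unchanged while preserving nonnegativity:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy nonnegative factorization V = W @ H.
W = rng.random((6, 3))
H = rng.random((3, 8))
V = W @ H

# Any positive diagonal matrix D yields an equally valid factor pair
# (W D, D^{-1} H): both factors stay nonnegative and the product still
# reconstructs V, so the plain NMF solution is not unique.
D = np.diag([0.5, 2.0, 4.0])
W2, H2 = W @ D, np.linalg.inv(D) @ H

assert np.all(W2 >= 0) and np.all(H2 >= 0)
print(np.allclose(W2 @ H2, V))  # True
```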
Ding et al. [16] studied the uniqueness of NMF and proposed nonnegative matrix tri-factorization (NMTF); in their study, several deliberately designed constraints are imposed to drive the factorization toward a unique and intended decomposition. Based on NMTF, Chen et al. [36] proposed orthogonal NMTF for collaborative filtering and addressed the sparsity and scalability issues of user-item matrix factorization very well. Ding et al. [16] introduced the bi-orthogonal 3-factor NMF for clustering and showed its ability to simultaneously cluster the rows and columns of an input data matrix. Chakraborty and Sycara [22] proposed a graph regularized NMTF for community detection that incorporates social relations and user generated content in the decomposition. Wang et al. [20] decoupled the NMTF and studied its fast implementation by constraining the factor matrices to be cluster indicators. However, all these NMTF methods ignore label information in the decomposition. Li et al. [37] proposed an NMTF for sentiment classification with lexical prior knowledge. Ma et al. [38] introduced orthogonal NMTF for semi-supervised document clustering. As these works show, with prior knowledge or given label information, NMTF can achieve better performance. To better utilize label information in NMF, many semi-supervised NMF methods have been studied. Zafeiriou et al. [24] proposed the discriminant NMF method (DNMF) by incorporating Fisher's criterion in NMF and applied it to frontal face verification. Guan et al. [25] proposed manifold regularized discriminative NMF (MD-NMF) to exploit more effective margin-based discriminative information for subsequent data representation. These discriminative NMF methods perform better than standard NMF; however, they do not exploit the statistical structure of test examples even when the latter are available during training.
Several transductive NMF methods have been studied recently. Cho and Saul [30] proposed to use a pre-trained SVM classifier to learn the discriminative support vectors and apply them in the decomposition. Liu et al. [31] introduced a label matrix (obtained from the given labels) to constrain the decomposition. The works of [30] and [31] decompose the training and test examples simultaneously to improve the performance on testing. However, both Cho and Saul [30] and Liu et al. [31] fail to handle the case where only one labeled example is given for each class.
Ding et al. [16] studied the relationship between 2-factor and 3-factor matrix factorization in depth and provided a systematic analysis of 3-factor NMF: unconstrained 3-factor NMF is equivalent to unconstrained 2-factor NMF, and only constrained 3-factor NMF makes a difference. Chen et al. [36] and Ding et al. [16] introduced orthogonal 3-factor NMF and showed that the orthogonality in the decomposition brings better performance in both collaborative filtering and document clustering. Chakraborty and Sycara [22] studied graph regularized 3-factor NMF and concluded that an informative graph with additional social relations greatly helps community detection in social networks. To acquire a more robust NMF representation for more accurate measurement, a novel unsupervised Nonnegative Adaptive Feature Extraction (NAFE) algorithm was proposed in [39] by integrating sparsity constrained nonnegative matrix factorization (NMF), representation learning, and adaptive reconstruction weight learning into a unified model. For nonnegative matrix factorization with label information, Zafeiriou et al. [24] proposed discriminant NMF (DNMF) to incorporate Fisher's criterion in NMF. Specifically, the within-class scatter and between-class scatter are constructed from the given labels, and the final decomposition is forced to approximate the two scatters. Guan et al. [25] proposed manifold regularized discriminative NMF (MD-NMF) to preserve more effective margin-based discriminative information in the NMF subspace. Cho and Saul [30] pre-trained a classifier with the labeled examples to learn the support vectors and proposed to use the support vectors to guide the decomposition. Liu et al. [31] constructed a label matrix from the given label information to constrain the NMF; in the decomposition, the label matrix enforces the coefficients of two examples with the same label to approximate each other. Chen et al. [40] proposed to apply NMF to data clustering; then Shao et al. [41] applied it to high dimensional data clustering, and Hu et al. [42] modified NMF to make it more stable. Kannan et al. [43] proposed to apply NMF to recommender systems, and Yu et al. [44] parallelized it. Nie et al. [45] proposed to apply NMF to missing data recovery. Recently, Pecli et al. [46] have provided a comparative study.
Concept Factorization (CF) is another variant of NMF that represents each cluster by a linear combination of data points and each data point by a linear combination of the cluster centers [47]. Zhang et al. [48] proposed a Robust Flexible Auto-weighted Local-coordinate Concept Factorization (RFA-LCF) for unsupervised clustering by integrating robust flexible CF, robust sparse local-coordinate coding and adaptive weighting learning into a unified model. To improve the representation and clustering abilities, a Deep Self-representative Concept Factorization Network (DSCF-Net) framework was proposed in [49] by integrating robust deep concept factorization, deep self-expressive representation and adaptive locality preserving feature learning into a unified framework. To improve data representations by enhancing robustness to outliers and noise, Ren et al. [50] proposed a joint robust factorization and projective dictionary learning (J-RFDL) model that performs robust representation in a factorized compressed space. Similar to our work, Zhang et al. [51] recently proposed a joint label prediction based Robust Semi-Supervised Adaptive Concept Factorization (RS2ACF) framework, which utilizes the class information of labeled data and propagates it to unlabeled data by jointly learning an explicit label indicator for the unlabeled data. However, the RS2ACF framework uses a fixed label-constraint matrix to guide the decomposition; that is, the label-constraint matrix is given beforehand and does not change during the optimization. In contrast, our proposed T-NMTF relaxes this constraint and uses the label indicator matrix instead. Among general transductive methods, the transductive support vector machine (TSVM) [33] exploits unlabeled test examples during training, and the spectral graph transducer (SGT) [34] can be viewed as a transductive version of the k nearest neighbor (KNN) classifier. The SGT problem is relaxed to a convex formulation and solved globally and optimally via the spectral method.
However, TSVM and SGT are designed for binary classification, and thus they are unsuitable for multi-class cases. Liu et al. [35] further studied transductive learning in the multi-class setting. Furthermore, some novel transductive frameworks have been proposed and have achieved effectiveness in classification tasks. Zhang et al. [52] proposed an enhanced semi-supervised classification approach termed Nonnegative Sparse Neighborhood Propagation (SparseNP) to ensure that the output soft labels of points are sufficiently sparse, discriminative, robust to noise and probabilistic. Jia et al. [53] proposed a new transductive label propagation method, termed Adaptive Neighborhood Propagation (Adaptive-NP), based on joint L2,1-norm regularized sparse coding, for semi-supervised classification. In [54], an adaptive transductive label propagation approach was proposed via joint discriminative clustering on manifolds for representing and classifying high-dimensional data. In [55], to acquire more accurate predictions in classification, a triple matrix recovery-based robust auto-weighted label propagation framework (ALP-TMR) was proposed; it introduces a TMR mechanism to remove noise or mixed signs from the estimated soft labels and improves robustness to noise and outliers in the steps of assigning weights and predicting labels simultaneously.

III. PRELIMINARIES
A. NMF
The standard NMF minimizes the squared Euclidean distance between the examples V ∈ R^{m×n}_+ and the product of two lower-rank matrices W ∈ R^{m×r}_+ and H ∈ R^{r×n}_+, i.e.,

min_{W≥0, H≥0} ||V − WH||_F^2,

where r ≪ min{m, n} denotes the reduced dimensionality and ||·||_F signifies the Frobenius norm. NMF performs poorly in some pattern recognition tasks because it completely ignores the labels of training examples.
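As a concrete reference point, the classical multiplicative updates of Lee and Seung [13] for this objective can be sketched as follows (the function name, random initialization, fixed iteration count, and the small eps guard are our choices, not the paper's):

```python
import numpy as np

def nmf(V, r, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for min_{W,H >= 0} ||V - WH||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

# The objective is non-increasing under these updates.
V = np.abs(np.random.default_rng(1).random((20, 30)))
W, H = nmf(V, r=5)
print(np.linalg.norm(V - W @ H))
```

Each update keeps the factors nonnegative because it multiplies the current factor element-wise by a ratio of nonnegative matrices.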

B. NMTF
In contrast to NMF, nonnegative matrix tri-factorization (NMTF) decomposes a nonnegative matrix into the product of three lower-rank nonnegative matrices, i.e.,

min_{W≥0, D≥0, H≥0} ||V − WDH||_F^2,

where W ∈ R^{m×r}_+, D ∈ R^{r×r}_+ and H ∈ R^{r×n}_+ are the three lower-rank nonnegative factors. W is the basis matrix and D denotes the middle matrix; in our method, the product of D and H serves as the coefficient matrix. Since the matrix D can absorb the different scales of V, W and H, NMTF provides a more flexible framework for data representation than NMF. However, NMTF shrinks to the standard NMF without additional constraints on the factor matrices.
Ding et al. [16] proposed bi-orthogonal NMTF, which enforces the columns of W and the rows of H to be orthogonal, i.e.,

min_{W≥0, D≥0, H≥0} ||V − WDH||_F^2  s.t. W^T W = I, H H^T = I,

where I signifies the identity matrix. The orthogonality constraints over W and H prevent NMTF from shrinking to NMF and make bi-orthogonal NMTF a powerful tool for simultaneously clustering both the rows and columns of V.

IV. TRANSDUCTIVE NONNEGATIVE MATRIX TRI-FACTORIZATION
Assume both training and test examples belong to c known classes and there is at least one training example in each class. T-NMTF constructs a label vector y ∈ R^c_+ for each training example v as follows:

y_i = 1 if v belongs to the i-th class, and y_i = 0 otherwise. (4)

Based on definition (4), it is easy to construct a label matrix Y_l ∈ R^{c×l}_+ for the l training examples. T-NMTF aims to predict the label matrix Y_u ∈ R^{c×u}_+ for the u test examples. As mentioned above, NMTF provides a flexible framework for developing new data representation methods. T-NMTF follows the idea of NMTF and decomposes the concatenated examples V = [V_l, V_u] into the product of three matrices, i.e.,

min_{W≥0, D≥0, H≥0} ||[V_l, V_u] − W D [H_l, H_u]||_F^2, (5)

while requiring the coefficients of the training examples to match their label vectors:

D H_l = Y_l. (6)

Combining (5) and (6), we have the objective of T-NMTF

min_{W≥0, D≥0, H≥0} ||V − WDH||_F^2  s.t. D H_l = Y_l. (7)

Since this mixed inequality and equality constrained problem is not easy to solve, we rewrite eq. (7), based on the Lagrangian multiplier method [56], as the following inequality constrained problem

min_{W≥0, D≥0, H≥0} ||V − WDH||_F^2 + λ||D H_l − Y_l||_F^2, (8)

where λ > 0 is the Lagrangian multiplier of the constraint (6). When (6) is approximately satisfied, the coefficient matrix C = DH actually approximates the label vectors: for each example, the entries of its coefficient reflect the possibility that it belongs to each class. The larger an entry is, the greater the possibility that the example belongs to the corresponding class. Therefore, T-NMTF infers the label vectors of test examples from their coefficients as follows:

(Y_u)_{ij} = 1 if i = arg max_k (C_u)_{kj}, and (Y_u)_{ij} = 0 otherwise, (9)

where Y_u is defined as in (4), C_u = D H_u, and X_{ij} signifies the (i, j)-th element of X. From (9), it is easy to predict labels for the test examples.
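The label-matrix construction (4) and the prediction rule (9) can be sketched as follows (the helper names are ours, and labels are assumed to be integers 0..c−1):

```python
import numpy as np

def label_matrix(labels, c):
    """Build Y_l in R^{c x l}: column j is the one-hot label vector of example j (eq. 4)."""
    Y = np.zeros((c, len(labels)))
    Y[labels, np.arange(len(labels))] = 1.0
    return Y

def predict(D, H_u):
    """Predict test labels from the coefficients C_u = D H_u by a column-wise
    argmax, i.e. the hard assignment of eq. (9)."""
    C_u = D @ H_u
    return C_u.argmax(axis=0)

# Example: three training examples with labels [0, 1, 0] over c = 2 classes.
Y_l = label_matrix(np.array([0, 1, 0]), 2)
print(Y_l)
```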

A. MULTIPLICATIVE UPDATE RULE
Although (8) is jointly non-convex with respect to W, D, and H, it is separately convex with respect to each of them. Due to the separability of the squared Frobenius norm and the concatenation of V, based on the Lagrangian multiplier method [56], we have the Lagrangian function of f(W, D, H) as

L = ||V − WDH||_F^2 + λ||D H_l − Y_l||_F^2 − ⟨ϕ, W⟩ − ⟨φ, D⟩ − ⟨θ, H_l⟩ − ⟨ψ, H_u⟩, (10)

where ϕ ≥ 0, φ ≥ 0, θ ≥ 0 and ψ ≥ 0 are the Lagrangian multipliers of the constraints W ≥ 0, D ≥ 0, H_l ≥ 0 and H_u ≥ 0, respectively, and ⟨·, ·⟩ signifies the inner product. Assume (W, D, H_l, H_u) is a stationary point of the constrained optimization problem (8). According to the K.K.T. conditions [56], the stationary point satisfies the primal conditions

∂L/∂W = 0, ∂L/∂D = 0, ∂L/∂H_l = 0, ∂L/∂H_u = 0, (11)

and the complementary slackness conditions

ϕ ∘ W = 0, φ ∘ D = 0, θ ∘ H_l = 0, ψ ∘ H_u = 0. (12)

By substituting (11) into (12) and using simple algebra, we have

(W D H H^T D^T − V H^T D^T) ∘ W = 0,
(W^T W D H H^T − W^T V H^T + λ(D H_l − Y_l) H_l^T) ∘ D = 0,
(D^T W^T W D H − D^T W^T V + λ[D^T(D H_l − Y_l), 0]) ∘ H = 0. (13)

From (13), we get the following multiplicative update rules (MUR):

W ← W ∘ (V H^T D^T) ⊘ (W D H H^T D^T), (14)
D ← D ∘ (W^T V H^T + λ Y_l H_l^T) ⊘ (W^T W D H H^T + λ D H_l H_l^T), (15)
H ← H ∘ (D^T W^T V + λ[D^T Y_l, 0]) ⊘ (D^T W^T W D H + λ[D^T D H_l, 0]), (16)

where ∘ and ⊘ denote element-wise multiplication and division, and [X, 0] pads X with zero columns for the test examples. T-NMTF alternately updates W, D, and H until the objective value does not change.
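The update rules (14)-(16) can be sketched in NumPy as follows (a minimal implementation of our reconstruction of the rules; the function name, random initialization, fixed iteration count, and the small eps guard against division by zero are our own choices):

```python
import numpy as np

def t_nmtf(V, Y_l, lam=1.0, iters=500, eps=1e-9, seed=0):
    """MUR for min ||V - W D H||_F^2 + lam * ||D H_l - Y_l||_F^2,
    where V = [V_l, V_u] stacks training and test examples column-wise.
    The subspace dimension equals the number of classes c."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    c, l = Y_l.shape
    W = rng.random((m, c))
    D = rng.random((c, c))
    H = rng.random((c, n))
    for _ in range(iters):
        W *= (V @ H.T @ D.T) / (W @ D @ H @ H.T @ D.T + eps)        # (14)
        Hl = H[:, :l]
        D *= (W.T @ V @ H.T + lam * Y_l @ Hl.T) / (
            W.T @ W @ D @ H @ H.T + lam * D @ Hl @ Hl.T + eps)      # (15)
        # Pad the label-fitting terms with zero columns for the test examples.
        neg = np.hstack([D.T @ Y_l, np.zeros((c, n - l))])
        pos = np.hstack([D.T @ D @ H[:, :l], np.zeros((c, n - l))])
        H *= (D.T @ W.T @ V + lam * neg) / (
            D.T @ W.T @ W @ D @ H + lam * pos + eps)                # (16)
    return W, D, H
```

Labels of the u = n − l test columns are then read off from D @ H[:, l:] via a column-wise argmax, as in (9).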
The main time complexity of MUR is spent on (14), (15), and (16). Since (14) involves the product DH three times, DH can be computed in O(nc^2) time before applying (14) and then reused. The update rule (14) is the same as that in [13], and it decreases the objective function f(W, D, H) with both D and H fixed. The following Proposition 1 and Proposition 2 prove that (15) and (16) also decrease the objective function.

Proposition 1: With W and H fixed, the update rule (15) does not increase the objective J(D).
Proof: Denote by D' the matrix produced by (15). Our goal is to prove J(D') ≤ J(D). To this end, we construct an auxiliary function Z(D', D) for J(D') satisfying Z(D', D) ≥ J(D') and Z(D, D) = J(D). According to [57], the quadratic terms of J admit the upper bounds (21) and (22), from which Z(D', D) ≥ J(D') follows. It is easy to verify that D' minimizes Z(·, D), so J(D') ≤ Z(D', D) ≤ Z(D, D) = J(D).

Proposition 2: With W and D fixed, the update rule (16) does not increase the objective J(H).
Proof: For convenience of taking derivatives, we divide H into two parts, H_l and H_u. Given W and D, the objective can be written separately with respect to H_l and H_u. The update rule (16) for H_u decreases J(H_u) according to [6], so it suffices to prove that the update rule (16) for H_l decreases J(H_l). To this end, we rewrite J(H_l) accordingly and assume (16) updates H_l to H_l'. We need to prove J(H_l') ≤ J(H_l). Following the proof of Proposition 1, we construct an auxiliary function Z(H_l', H_l) for J(H_l'). By combining (27), (28), and (29), we know that Z(H_l', H_l) ≥ J(H_l'). It is easy to verify that H_l' minimizes Z(·, H_l) and satisfies the stationarity condition (30). This completes the proof.
V. MANIFOLD REGULARIZED T-NMTF (MT-NMTF)
MT-NMTF augments T-NMTF with a manifold regularization over the coefficient function f,

R(f) = ∫_{v∈M} ||∇_M f(v)||^2 dP_v, (32)

where the integral is taken over the probability distribution P_v. However, since both the manifold M and the marginal distribution P_v are unknown in practice, we use an empirical estimate instead. According to the manifold regularization theory [25], regularization (32) can be approximated by using the graph Laplacian of the examples V. MT-NMTF constructs an adjacency graph G whose vertexes are examples and whose edge weights S_ij reflect the extent to which two examples are close. By using the '0-1 weighting' scheme, the entries of S are defined as

S_ij = 1 if v_j ∈ N(v_i) or v_i ∈ N(v_j), and S_ij = 0 otherwise, (33)

where N(v) denotes the set of k nearest neighbors of v.
In our work, k is set to 5, as we find the method works well in most cases under this setting. With the edge weight matrix defined, regularization (32) can be approximated by f L f^T, where L = T − S is the graph Laplacian of G and the i-th diagonal element of T is defined as

T_ii = Σ_j S_ij. (34)

Since the basis W contains c independent components, the penalty (32), by summing the c regularizations and using simple algebra, is equivalent to

Σ_{i=1}^{c} f_i L f_i^T = Tr(H L H^T), (35)

where f_i is the function corresponding to the i-th component, i.e., f_i = H_i denotes the i-th row of H. Like T-NMTF, MT-NMTF constructs a label vector for each training example and expects the coefficient of each training example to approximate its label vector. By incorporating (35) into (5), we obtain the objective of MT-NMTF

min_{W≥0, D≥0, H≥0} ||V − WDH||_F^2 + λ||D H_l − Y_l||_F^2 + β Tr(H L H^T), (36)

where β trades off the manifold regularization. Similar to T-NMTF, the objective function g(W, D, H) is jointly non-convex with respect to W, D and H, but convex with respect to each of them. Therefore, we can develop a MUR-based algorithm for solving MT-NMTF by using the Lagrangian multiplier method. Since the manifold regularization is independent of both W and D, we keep their update rules consistent with (14) and (15) and focus on the update rule of H. To this end, similar to (10), we construct the Lagrangian function of g(W, D, H) as

L_g = g(W, D, H) − ⟨ψ, H⟩,

where ψ denotes the Lagrangian multiplier for the constraint H ≥ 0. Assume H is a stationary point of the constrained optimization problem (36) with both W and D fixed. According to the K.K.T. conditions [56], the stationary point satisfies the primal condition

2 D^T W^T (WDH − V) + 2λΨ + 2β H L − ψ = 0, (37)

where Ψ = [D^T(D H_l − Y_l), 0], and the slackness condition

ψ ∘ H = 0. (38)

By substituting (37) into (38) and using simple algebra, we have

(D^T W^T W D H + λ Ψ^+ + β H T) ∘ H = (D^T W^T V + λ Ψ^- + β H S) ∘ H, (39)

where Ψ^+ = [D^T D H_l, 0] and Ψ^- = [D^T Y_l, 0] represent the positive and negative parts of Ψ, respectively.
From (39), we obtain the following multiplicative update rule for H:

H ← H ∘ (D^T W^T V + λ Ψ^- + β H S) ⊘ (D^T W^T W D H + λ Ψ^+ + β H T). (40)

It can be easily proved that (40) does not increase the objective function g(W, D, H). We omit the proof due to space limitations.
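The graph construction (33)-(34) and the H-update (40) can be sketched together as follows (helper names and the explicit λ factor reflect our reading of the rules; the O(n^2) pairwise-distance computation is written for clarity, not efficiency):

```python
import numpy as np

def knn_laplacian(V, k=5):
    """0-1 kNN adjacency S (eq. 33), diagonal degree matrix T with
    T_ii = sum_j S_ij (eq. 34), and graph Laplacian L = T - S,
    over the columns of V (one example per column)."""
    n = V.shape[1]
    d2 = ((V[:, :, None] - V[:, None, :]) ** 2).sum(axis=0)  # squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]   # k nearest, excluding the point itself
        S[i, nbrs] = 1.0
    S = np.maximum(S, S.T)                  # edge if either point is a kNN of the other
    T = np.diag(S.sum(axis=1))
    return S, T, T - S

def mt_nmtf_h_step(V, W, D, H, Y_l, S, T, lam=1.0, beta=0.1, eps=1e-9):
    """One multiplicative H-update of MT-NMTF in the form of (40)."""
    c, n = H.shape
    l = Y_l.shape[1]
    neg = np.hstack([D.T @ Y_l, np.zeros((c, n - l))])          # Psi^-
    pos = np.hstack([D.T @ D @ H[:, :l], np.zeros((c, n - l))]) # Psi^+
    return H * (D.T @ W.T @ V + lam * neg + beta * H @ S) / (
        D.T @ W.T @ W @ D @ H + lam * pos + beta * H @ T + eps)
```

Note that the Laplacian term splits naturally into the nonnegative parts H S (numerator) and H T (denominator), which is what keeps the update multiplicative.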
In the proposed MT-NMTF, the local geometry is expected to help transduce the discriminative information from training examples to test examples. This is because neighboring examples are expected to have close encodings, so their coefficients are also expected to be close to one another. Since the coefficient of each example approximates its label vector under the T-NMTF framework, neighboring examples are likely to produce the same label vectors. In this sense, MT-NMTF enhances the clustering performance on unlabeled test examples by exploiting the local geometric structure of the whole dataset. Our experiments validate the effectiveness of both T-NMTF and MT-NMTF.

VI. EXPERIMENT
In this section, we conducted several experiments to validate the effectiveness of both T-NMTF and MT-NMTF on six popular face image datasets, ORL [58], FERET [59], YALE [60], UMIST [61], JAFFE [62] and MIT CBCL [63], by comparing them with NMF [1], NMTF [16], CNMF [31], and NMF-α [30]. The Cambridge ORL dataset [64] consists of 400 images collected from 40 subjects. Ten images were collected from each subject with varying lighting, facial expressions and facial details (with or without glasses). The FERET dataset [59] contains 13,539 photos in total taken from 1,565 subjects. We randomly selected 100 individuals, each with 7 photos. The YALE dataset [60] contains 165 frontal view face photos of 15 subjects. Eleven photos were taken from each subject with varying facial expressions (smiling or sad) and configurations. The UMIST dataset [61] contains 575 face photos collected from multiple subjects. At least 19 and at most 48 photos were taken from each subject in varying poses with both profile and frontal views. The JAFFE dataset [62] contains 213 face photos, while the MIT CBCL dataset [63] contains 3240 face photos. All photos were cropped to a fixed-size pixel array and reshaped into a long vector. Table 1 summarizes the datasets used. The recognition performance is evaluated by both the accuracy (AC) and the normalized mutual information (NMI) of the test examples. Refer to [8], [9], [64] for more details of AC and NMI. In this experiment, both T-NMTF and MT-NMTF output labels using (9), while NMF, NMTF, CNMF, and NMF-α output labels using K-means so as to obtain their best performance.
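For reproducibility, AC and NMI can be computed as follows (a dependency-free sketch with helper names of our choosing; the brute-force cluster matching is practical only for the small class counts used in these experiments):

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred, c):
    """AC: fraction of correct labels under the best one-to-one relabeling of
    the predicted clusters (brute force over all c! permutations)."""
    best = 0.0
    for perm in permutations(range(c)):
        mapped = np.array([perm[p] for p in y_pred])
        best = max(best, (mapped == y_true).mean())
    return best

def nmi(y_true, y_pred):
    """Normalized mutual information I(U;V) / sqrt(H(U) H(V))."""
    n = len(y_true)
    cont = np.zeros((y_true.max() + 1, y_pred.max() + 1))
    for t, p in zip(y_true, y_pred):
        cont[t, p] += 1                      # contingency table of the two labelings
    pu, pv = cont.sum(1) / n, cont.sum(0) / n
    pij = cont / n
    mask = pij > 0
    mi = (pij[mask] * np.log(pij[mask] / np.outer(pu, pv)[mask])).sum()
    hu = -(pu[pu > 0] * np.log(pu[pu > 0])).sum()
    hv = -(pv[pv > 0] * np.log(pv[pv > 0])).sum()
    return mi / np.sqrt(hu * hv)
```

A perfect clustering yields AC = 1 and NMI = 1 regardless of how the cluster indices are permuted.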

A. FACE RECOGNITION
To study the effectiveness of both T-NMTF and MT-NMTF, we first randomly selected one example from each individual for training and used the remaining examples for testing. To eliminate the effect of randomness, the trial was repeated ten times and the average AC and NMI on the test examples were used to evaluate the recognition performance. We varied the number of individuals in this experiment to study its effectiveness for multi-class clustering tasks. Figures 1 and 2 give the average ACs and NMIs versus the number of classes of T-NMTF, MT-NMTF, CNMF, and NMF on the ORL and FERET datasets, respectively. They show that CNMF performs close to NMF because it degenerates to NMF in this case (see Section 2.1). However, both T-NMTF and MT-NMTF overcome this deficiency and significantly boost recognition performance by transducing the discriminative information of training examples to test examples.

To sufficiently mine the discriminant power of CNMF, we randomly selected two images from each individual for training in the following experiments and varied the number of classes from 2 to 10. We repeated the trial ten times and evaluated both T-NMTF and MT-NMTF by comparing them with NMF, NMTF, CNMF, and NMF-α in terms of both average AC and average NMI. Figures 3 to 6 show the average ACs and NMIs versus the number of classes on the ORL, FERET, YALE and UMIST datasets, respectively. Figures 3 and 4 show that T-NMTF significantly outperforms NMF, NMTF, CNMF and NMF-α on both the ORL and FERET datasets. Figures 5(a) and 6(a) show that T-NMTF is superior to NMF, NMTF, CNMF, and NMF-α in terms of AC on both the YALE and UMIST datasets. Figure 5(b) shows that the performance of T-NMTF is comparable to both CNMF and NMF-α in terms of NMI because the YALE dataset suffers seriously from noise caused by occlusion and varying profiles. However, the following section shows that T-NMTF performs even better as more labeled examples become available. The fifth and sixth rows of Figure 11 show that the test examples do help NMF and NMTF learn good representations.
The fourth row of Figure 11 gives further comparisons. Table 2 shows the average ACs and NMIs of K-means, NMF, NMTF, CNMF, NMF-α and T-NMTF versus the number of classes on the MIT CBCL face database. As the MIT CBCL face database contains 3240 images, we randomly chose 30% of the samples from each class as the labeled set. The table shows that these methods tend to perform worse as the number of classes increases, probably because clustering data from more categories is relatively more challenging. Furthermore, the semi-supervised CNMF, NMF-α and T-NMTF deliver better results than the unsupervised NMF, NMTF and K-means methods, and the proposed T-NMTF method is superior to the other methods in this experiment. These results show the superiority of our proposed T-NMTF, probably because it can transduce the discriminative information of training examples to test examples.

B. EFFECT OF SIZE OF TRAINING SET
In this section, we study the effect of the size of the training set. We randomly selected different numbers of training images from each individual and fixed the number of classes at 10. To eliminate the effect of randomness, we repeated this trial ten times and compared the average ACs and NMIs of the different methods. Since NMTF performs comparably with NMF, we did not include it in this experiment. Figure 15 gives both the ACs and NMIs of T-NMTF versus the parameter λ on the ORL, FERET, YALE, and UMIST datasets. It shows that our T-NMTF model is stable over a wide range of the trade-off parameter λ on all the tested datasets. Throughout this experiment, we set the trade-off parameter λ = 1.

C. EFFECT OF TRADE-OFF PARAMETERS
In T-NMTF (8), the critical trade-off parameter λ balances the reconstruction error of all examples and the distance between the label vectors of the training examples and their corresponding coefficients. Here, we evaluate its effect on our T-NMTF model by cross-validating λ over the set {10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^{2}, 10^{3}}. Following Section 6.1, we randomly selected ten individuals from each of the ORL, FERET, YALE, and UMIST datasets, and randomly selected two images per individual for training following Section 6.2. To eliminate the effect of randomness, we repeated this trial ten times and evaluated the performance of T-NMTF in terms of both average AC and average NMI.
In MT-NMTF (36), another critical parameter β controls the weight of the manifold regularization. According to the above analysis, the T-NMTF framework is insensitive to the trade-off parameter λ, and we therefore fixed λ = 1 and studied the effect of β on the performance of MT-NMTF in terms of both AC and NMI. In this experiment, we cross-validated β over the set {10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10}. We randomly selected two images per individual for training and reported the average AC and NMI as the final result. Figure 16 gives the cross-validation results on the four datasets. It shows that MT-NMTF is stable over a wide range of β from 10^{-5} to 1. Therefore, we fixed β = 0.1 in our experiments, and the experimental results are encouraging.

To further evaluate the methods, we partitioned each dataset into two subsets, a training set and a test set, and further partitioned the training set into a labeled set whose labels are known and an unlabeled set whose labels are unknown. T-NMTF was conducted on both the labeled set and the unlabeled set to learn the subspace. The evaluation was conducted by recognizing the test examples. For each test example x, its label vector is inferred from the coefficient h* = arg min_{h≥0} ||x − Wh||_2^2, where W stands for the learned basis.
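The out-of-sample coding step above can be sketched as follows. The paper does not specify a solver for the nonnegative least-squares problem, so the multiplicative iteration below is our own choice of solver:

```python
import numpy as np

def nnls_code(x, W, iters=1000, eps=1e-9):
    """Multiplicative solver for min_{h >= 0} ||x - W h||_2^2, used to encode
    an unseen test example x over the learned nonnegative basis W.
    (This solver is our choice; any NNLS routine would do.)"""
    h = np.full(W.shape[1], 1.0)
    WtW, Wtx = W.T @ W, W.T @ x   # precompute; both are nonnegative here
    for _ in range(iters):
        h *= Wtx / (WtW @ h + eps)
    return h
```

The predicted label of x is then read off from the coefficient, e.g. via arg max over D @ h, consistent with the prediction rule (9).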
In this experiment, we randomly selected p = 5, 6, and 10 images for each individual from the ORL, YALE, and UMIST datasets, respectively, to form the training set; the remaining images composed the test set. Next, we randomly assigned labels to 1 to p training images of each individual.
To eliminate the effect of randomness, we repeated the trial ten times and report the average AC and NMI as the final result. Since the FERET dataset only contains seven images for each individual, it was not included in this experiment. Figures 17 to 19 give the average AC and NMI of T-NMTF, MT-NMTF, NMF, CNMF, and NMF-α on the test examples of the ORL, YALE, and UMIST datasets, respectively. These results show that the performance of both T-NMTF and MT-NMTF is significantly superior to the other methods, and this superiority increases with the percentage of labeled examples in the training set.

GUOHUA DONG is currently pursuing the Ph.D. degree with the College of Computer, National University of Defense Technology. Her current research interests include image retrieval, cross-modal retrieval, hashing methods, and graph models.
ZHIGANG LUO (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from the National University of Defense Technology, Changsha, China, in 1981, 1993, and 2000, respectively. He is currently a Professor with the College of Computer, National University of Defense Technology. His current research interests include machine learning, computer vision, and bioinformatics.