Sparse Deep Nonnegative Matrix Factorization

Nonnegative matrix factorization is a powerful technique to realize dimension reduction and pattern recognition through single-layer data representation learning. Deep learning, however, with its carefully designed hierarchical structure, is able to combine hidden features to form more representative features for pattern recognition. In this paper, we proposed sparse deep nonnegative matrix factorization models to analyze complex data for more accurate classification and better feature interpretation. Such models are designed to learn localized features or generate more discriminative representations for samples in distinct classes by imposing $L_1$-norm penalty on the columns of certain factors. By extending one-layer model into multi-layer one with sparsity, we provided a hierarchical way to analyze big data and extract hidden features intuitively due to nonnegativity. We adopted the Nesterov's accelerated gradient algorithm to accelerate the computing process with the convergence rate of $O(1/k^2)$ after $k$ steps iteration. We also analyzed the computing complexity of our framework to demonstrate their efficiency. To improve the performance of dealing with linearly inseparable data, we also considered to incorporate popular nonlinear functions into this framework and explored their performance. We applied our models onto two benchmarking image datasets, demonstrating our models can achieve competitive or better classification performance and produce intuitive interpretations compared with the typical NMF and competing multi-layer models.


I. INTRODUCTION
Nonnegative matrix factorization (NMF)is a powerful dimension reduction and pattern recognition technique in data analysis [1], which has been widely used in diverse areas such as document clustering [4], [5], [6], face recognition [7], [8] and microarray data analysis [9], [10].It decomposes a nonnegative matrix X into the product of two low-rank nonnegative factor matrices W and H (i.e., X ≈ W H with W ≥ 0, H ≥ 0), generating natural interpretations in many cases.Moreover, NMF has been extended into a number of variant forms, allowing for various sparse [11], [12], [13] or regularized models [14], [15], most of which demonstrate distinct advantages in local feature extraction or data representation learning.
For sparse variants of NMF, Hoyer [11] found that the original NMF does not always obtain part-based features.Thus, they proposed new sparse models to explicitly control the degree of sparseness of W or H, and the new models can learn part-based features that cannot be revealed by the original NMF.However, Kim and Park [12] proposed that if strong sparsity constraints are imposed, some important information for gene selection in microarray analysis may be lost.They therefore established a varied version of sparse NMF model, in which, L 1 -norm penalty was adopted to constrain the sparseness of columns of W or H and this constriction was added as one term to the objective function.This model has been demonstrated to be able to realize biclustering for cancer subtype discovery and gene expression analysis.Peharz and Pernkopf [13] presented another sparse NMF version by adopting L 0 -norm penalty to limit the exact number of zeros in W or H. Besides, graph-regularized NMF versions have also been explored.For example, Cai et al. [15] proposed a graph-regularized NMF by incorporating prior information of samples into the typical NMF.This helps to keep the original topological structure of data after being projected into a subspace and usually leads to better clustering results.
Moreover, NMF has also been extended for data fusion and combinatorial patterns extraction when analyzing multiple data.For example, Zhang et al. [16] developed a joint nonnegative matrix factorization (jNMF) technique to integrate multi-dimensional genomics data for the discovery of combinatorial patterns to reveal phenomena that would have been ignored with only a single type of data.This model provides a way to reveal the homogeneous relationships among different data.Furthermore, Zhang et al. extended the jNMF to both sparse and graph-regularized version [17].Unlike jNMF, Yang and Michailidis [18] proposed integrative NMF (iNMF) model, besides the homogeneous effects, they also considered the heterogeneous effects of different type of data.Lastly, Žitnik and Zupan [19] proposed a matrix factorization-based data fusion method DFMF to integrate relationships of heterogeneous datasets and describe the observed system from various views to reveal hidden associations.
Although there have been extensive variants of NMF, most of them remain to be single-layer models.Deep learning is becoming increasingly popular and has been demonstrated to be powerful in learning data representation [31], [32], [33], [34], [35].However, typical deep learning is rather complicated, which requires complex structures, needs empirical skills to tune a lot of parameters and lacks explicitly theoretical foundations.Recently, several new frameworks (e.g., Deep PCA (DPCA) [36], PCANet [37] and gcForest [38]) have been proposed to attempt to tackle these issues and provide alternative views to deep learning.DPCA [36] performs a two-layer Zero-phase Component Analysis (ZCA) whitening plus PCA process to learn hierarchical features and obtain corresponding representations for face recognition.PCANet [37] contains a two-layer architecture, where PCA is employed in each layer to learn multistage filterbanks.The deep architecture is followed by binary hashing and block histograms for indexing and pooling to generate deep features for more accurate classification.gcForest [38] assembles different type of random forests at each layer to learn deep features to realize comparable classification performance with deep neural network or convolutional neural network, without establishing complex structures and tuning extensive parameters like traditional deep learning.Inspired by the success of them, multilayer factorizations are attractive to break down the complex problem hierarchically into multiple simple ones.Along these lines, Multi-layer NMF [20], [21] and Deep Semi-NMF [39] have been proposed recently.The general idea of them is by stacking one-layer NMF or Semi-NMF [30] into multiple layers to learn hierarchical relationships among features or hierarchical projections.However, these methods do not well reflect [21] or even ignore the sparse structure hidden in the complex data [20], [39].
To this end, we developed sparse deep NMF models and explored their effectiveness in multiple aspects.In the new models, the first layer is responsible for breaking down the original data into multiple initial basis.Then the factorization in the second layer is to generate meaningful relationships among all the basis in the first layer to form relatively complex features.Again, the decomposition in the third or higher layer is to learn relationships among features from the former layer to combine them selectively.This process is repeated until to the highest layer.In the end, besides all relationship matrices, a data representation matrix will be automatically yielded.During the factorization in each layer, different sparsity constraints were added to localize partial features or generate more discriminative representations for samples in different classes.The key idea is that by thoroughly learning the hidden basis as well as the additional relationships across layers, we can obtain a high level data representation matrix and yield intuitive interpretations for features generated in each layer.
We summarized the main contributions of this paper as follows: • We proposed a series of generalized sparse deep NMF models by extending single-layer algorithm into multiple layers, each of which is designed to generate matrices satisfying certain sparsity requirements.• We also considered linearly inseparable data by incorporating nonlinear functions into the deep NMF models with different ways.• We adopted the Nesterov's accelerated gradient descent algorithm [25] to accelerate the optimization process with convergence rate O( 1 k 2 ), which is much faster than traditional gradient descend algorithm with convergence rate O( 1 k ).The organization of this paper is as follows.In section 2, we introduced highly related works and explained their characteristics.In section 3, we proposed a series of diverse models as well as their corresponding optimization algorithms.
Besides, we explored the effectiveness of incorporation of different nonlinear functions into these models.In section 4, we employed two benchmarking datasets, ORL and PIEpose 27.0, to demonstrate the properties and performance of our models.We showed that the ORL data (consisting of only 400 images) with a relative small number of samples are not sufficient to train a deep model.With the PIE-pose 27.0 data consisting of 2856 images, we conducted extensive experiments and compared them with other NMF variants to demonstrate the effectiveness of our models.We also tested how the structure of deep models, the number of layers and the number of subbasis in hidden layers affect the performance of our models.Finally, we concluded this paper in section 5.

II. RELATED WORK
In this section, we introduce three related studies including two deep frameworks of matrix factorization models and the Nesterov's accelerated gradient descent algorithm for solving the typical NMF.Specially,

A. Multi-layer NMF
The Multi-layer NMF model [21] extends the typical NMF explicitly to multiple layers in the following format: where L is the number of layers.When L = 1, it reduces to the typical NMF model.Ahn et al. [20] first proposed a multi-layer NMF model coupled with three nonlinear layers.They developed a strategy called up-propagation algorithm to jointly update all weight matrices and demonstrated its role in extracting hierarchical features.Song et al. [21] proposed one multi-layer NMF model and adopted non-smooth NMF (nsNMF) [22] to solve the typical NMF in each layer.nsNMF utilizes one sparse matrix S to apply sparsity constraint to standard NMF: S = (1−θ)I(k)+ θ k ones(k).k is the number of bases in corresponding layer, and θ is parameter for smoothing effect, in the range of 0 to 1. I(k) is identity matrix of size k × k with all components of 1s.nsNMF smoothes a matrix by multiplying it with S. In Multi-layer NMF, the author smoothed H matrix by multiplying S and H during iterations as H = SH.Then W would be sparse to compensate the loss of sparsity.They applied their model to the Reuters-21578 collection dataset and showed its superiority in classification and feature hierarchies interpretations compared with the singlelayer nsNMF.

B. Deep Semi-NMF
Compared to the typical NMF, Semi-NMF does not require the original data and the basis matrix to be nonnegative.It extends the applicable range of NMF, but generates basis matrices that hardly show parts-of-whole characteristics intuitively [30].Deep Semi-NMF [39] is an extension of Semi-NMF by considering the matrix factorization in a multi-layer manner.It is defined in the following format, This method first pre-trains each layer, then fine-tunes the whole system with the outcomes of the pre-training.In general, without nonnegative constraint, W ± l can be directly updated by the least square algorithm.Nonnegative H + l is updated in a multiplicative manner with the help of auxiliary function like that in [30].Deep Semi-NMF turns out to be useful in learning hierarchical projections, from the original data space to various subspaces spanned by hidden attributes (e.g., face expression, illumination angle for face image data).Thus, it can cluster samples according to their hidden attributes in corresponding layers.
Deep Semi-NMF does not require the input data and the basis matrices to be nonnegative.Thus, it is hard to see the part-of-whole phenomenon because mutually offset occurs when the basis matrices are combined.Sparsity constraints were not considered either.Besides, the learning process that extracts the hierarchical projections from raw data space to multiple hidden attributes subspace automatically is too abstract to understand.For face images, attributes like face expression, photographing angle etc are easy to be observed but the hidden attributes of many other kinds of data are hardly to know.
On the other hand, Ahn et al. [20] described an interesting multiple decomposition for nonnegative matrix, but neglected detailed analysis for its structure.Besides, sparsity structure was not considered.As for the algorithm, they updated the feature matrices jointly without layer-wise initialization, so the results may vary greatly from one to another experiment.Song et al. [21] chose the proper number of initial basis vectors through single-layer nsNMF.They adopted nsNMF and attempted to control the sparsity level of one overall matrix by tuning parameter θ, which cannot generate a column-wise sparse structure to localize features for each sample.Moreover, in their experiments, results obtained by θ = 0 are better than that obtained by θ = 0, implying that the sparsity in their research does not work.For the optimization strategies, they adopted traditional multiplicative update algorithm for nsNMF of each layer and then stacked it into two layers.They did not fine-tune the whole system to reduce the total reconstruction error.This workflow is simple but may suffer from the problem of low convergence.Moreover, it is sensitive to initial solutions, making it rather unstable.
However, by restricting the sparsity of each column of certain matrices, sparse structure in the original data is well explored in our sparse deep NMF models.Specifically, we considered the sparsity problem of all basis matrices W l to extract local features.We also tried to solve a sparse coding problem by constricting the sparsity of representation matrices H l , representing the original data with as fewer basis vectors as possible while maintaining the discriminative ability of representations for samples in distinct classes.As for the optimization strategy, initialization is first proceeded for each layer, at the beginning of which, NNSVD [40] was adopted to generate a good pair of initialization factors.Then we finetuned the whole system to reduce the total reconstruction error.Nesterov's accelerated gradient descent algorithm [25] was adopted to alleviate the problem of numerical instability and low convergence rate.We conducted a lot of numerical tests to optimize our model's structure.Popular nonlinear functions as well as the way to be incorporated was explored by extensive experiments.

C. NeNMF
Nesterov's accelerated gradient descent algorithm [25] was adopted by Guan et al. to develop an efficient NMF solver NeNMF [26].It has been demonstrated to be able to alleviate the problem of numerical instability and low convergence rate, frequently accounted by other NMF algorithms.Here, we gave a brief introduction to NeNMF.The spirit of NeNMF was also adopted in our algorithms.It is known that NMF is a nonconvex optimization problem and it is impractical to obtain the optimal solution.Thus, block coordinate descent method [42] is popular to seek for a local optimum.Given an initial pair of W 1 and H 1 , the block coordinate descent method alternatively solves F , (4) until convergence, where t is the iteration step.NeNMF employs Nesterov's accelerated gradient descent algorithm to solve (3) and (4).Take H for instance, at each iteration, H is updated by the projected gradient method, and it performed on a chosen search point.The step size is determined by the Lipschitz constant not by the time consuming line search.Due to the convexity of (3) with respect to H and the continuity of the corresponding gradient, the convergence rate is O( 1 k 2 ) after k steps in iteration, superior to conventional gradient descend algorithm whose convergence rate is O( 1 k ).In particular, two sequences (i.e., H k and Y k ) are constructed and updated alternatively in the following way, where According to [25], the combination coefficient is carefully updated in each iteration step as follows, Based on the Lagrange multiplier method, the Karush-Kuhn-Tucker (K.K.T.) conditions of ( 5) are as follows, and ⊗ is the Hadamard product.By solving (8), we can obtain the update formula where the operator P (X) projects all the negative entries to zeros.By alternatively updating H k , Y k+1 and α k+1 with ( 9), ( 6) and ( 7) respectively until convergence, the optimal solution of (3) can be reached.As ( 4) and ( 3) are symmetrical, the update for W t+1 will be obtained in a similar way.NeNMF converges fast eventually with above optimization strategy.

III. SPARSE DEEP NMF
Similar to the general multi-layer NMF framework, sparse deep NMF models factorize a nonnegative matrix into L + 1 nonnegative ones: To make it more intuitive, we can split the equation into the following formula: where g can be a linear or nonlinear function if necessary.All of the matrices above are required to be nonnegative.For the decomposition in the first layer, W 1 is responsible for storing initial sub-basis matrix, the columns of which should be sufficient enough to learn as much information as possible.
Then the further factorization on H 1 is to learn relationships among various sub-basis vectors in W 1 to combine them and form meaningful and decipherable features.Further decomposition on H l (l = 2, 3, ..., L − 1) serves as the similar function with the decomposition of H 1 .With this hierarchical learning structure, a sequence of sub-basis matrices W l (l = 1, 2, ..., L) are generated.Complex features of raw data are extracted by additionally combining those sub-basis matrices.Meanwhile, once the complex features are learned, a high level data representation will be more representative for samples, leading to more accurate classification.
In our sparse deep NMF models, we thoroughly considered the sparse structure hidden in complex data.We first imposed L 1 -norm penalty onto each column of W in each layer (denoted as SDNMF/L).We also considered to impose L 1norm penalty onto each column of H (denoted as SDNMF/R), which helps to solve sparse coding problem by approximating raw vector with as fewer bases as possible.SDNMF/L only imposes L 1 -norm penalty on each W l (l = 1, 2, ..., L) whereas SDNMF/R only imposes the sparsity of H l .Next, we imposed sparse constraints on both W l and H l .One model from this thought requires all factors to be sparse to generate sparse sub-bases and representations (denoted as SDNMF/RL1).Intuitively, for a decomposition of X ≈ W H, if the W is compelled to be sparse, H tends to be smooth.We strengthened this tendency by controlling the L 1 -norm of each column of W l while imposing L 2 -norm constraints onto final H L (denoted as SDNMF/RL2).Also, we inspected the performance of model DNMF.It is one case of our models where there isn't any sparsity constraint on any factors.
1) SDNMF/L: Specifically, SDNMF/L is formulated as follows: where W l (:, j) Here, we adopted the square of W l (:, j) in case that a severe penalty would cause the loss of useful information.Assume that x is one column vector of input data X; h is the corresponding coefficient vector in H 1 .For W 1 , taking it as a linkage between the input vector x and the corresponding representation vector h in the first layer, the more sparse one of its column w k is, the fewer links there will be between h k (the k-th element in h) and x.This means that each element in h will respond selectively to all the elements in x, or each of them will be only activated by the certain members of x, resulting partial information of x being captured by w k .This is consistent with the desire of extracting localized features for each sample.The sparsity constraints for W 2 aims at capturing local information in H 1 containing the rough relationships among all w k in W 1 , helping to selectively combine certain w k to form meaningful and distinguishable features.The sparsity of W in upper layer functions similarly with W 2 .
Here we took g(x) = x to illustrate the algorithm simply and nonlinear models later.It first pre-trains the model in a layer-wise way.For the l-th layer, it optimizes the following model: where H 0 = X, ξ l is a row vector with all components equal one.Tr((ξ l W l ) T (ξ l W l )) is a variant of W l (:, j) 2 1 due to the nonnegativity of W . Different optimization strategy can be used to optimize it.For example, projected gradient (PG) method [45] takes advantage of the Armijo line search to estimate the optimal step size along the projection direction for solving each subproblem.However, the Armijo rule is a timeconsuming search strategy, making PG inefficient.We can also consider the projected NLS method (PNLS) proposed by Berry et al. [28].However, it is an approximate approach and cannot guarantee the convergence.
Previous studies demonstrated that NeNMF can overcome the problem of low convergence rate and numerical instability [26].Luckily, SDNMF/L shares the common necessary properties required by Nesterov's optimal algorithm as NeNMF.In other words, the objective function F (W l , H l ) is convex with respect to H l or W l respectively, and the corresponding gradient is Lipschitz continous.Thus, we adopted Nesterov's accelerated gradient descent algorithm with the following parameters related to W l and H l to solve (11): and We illustrated the pre-training procedure (Algorithm 1 in TABLE I), which alternatively minimizes one matrix factor with another one fixed until convergence.Particularly, in the l-th layer, suppose there have been t pairs of W l and H l , to get the next iteration point H t+1 l , Subalgorithm 1 solves a sub-problem by constructing two sub-sequences like that in problem (5) and updates them until convergence.W t+1 l is obtained similarly with Subalgorithm 2. This process proceeds alternatively until convergence.
After pre-training each layer separately, the algorithm finetunes the system with initial points of all W l , H l to reduce the total reconstruction error as follows: Subalgorithm 2: Optimal gradient method for W l Input: W t l , H t l (Ψ t l−1 need to be input when fine-tuning the system) Update rule for W during the fine-tuning process For W l , the subproblem in fine-tune process is equivalent to the following one, where H l is the reconstruction of H l and H l ≈ W l+1 H l+1 .The objective function is convex with respect to W , and the gradient is Lipschitz continuous with Lipschitz constant Thus, this problem can be solved by Subalgorithm 2, where the F (W, H t l ) should be substituted with the one in (15), and H l , LC W l should be substituted with Update rule for H during the fine-tuning process For H l , the subproblem is formulated as follows, The objective function above is convex with respect to H and its gradient is Lipschitz continuous with the constant LC H l = Ψ T l Ψ l 2 , where Ψ l = Ψ l−1 W l .We utilized Subalgorithm 1 to update H l , where the objective function F (W t l , H) should be substituted with the one in (17), the W t l and the Lipschitz constant should be substituted with Ψ l and LC H l = Ψ T l Ψ l 2 respectively.Finally, we proposed an algorithm for SDNMF/L in TABLE II based on the above analysis.

Algorithm complexity analysis
During the initialization process, to update H l , we need to compute LC H t l in subalgorithm 1.Without loss of generality, we let l = 1.The complexity of computing , where m is the number of rows of data X; k 1 denotes the number of sub-basis in the first layer.The complexity of computing H k in subalgorithm 1 is O(mnk 1 )+I inner •(nk 2 1 ), where I inner denotes the number of iterations in subalgorithm 1.Thus, the complexity to get H t+1 During the fine-tuning process in Algorithm 2, we updated each factor while keeping the others fixed and the process began from W 1 .Similar to the initialization, the complexity of tuning the two factors in each layer is O(mR 2 + nR 2 + mnR) + f inner • (mR 2 + nR 2 ), where f inner is the iteration number in corresponding sub-algorithm to yield each factor.
In a word, the computing complexity of SDNMF/L is L Empirically, both I inner and f inner are very small.If we ignore the inner iteration in both processes, the computing complexity is the same as Deep Semi-NMF [39].

Convergence analysis
According to [43], nonlinear Gauss-Seidel method under convex constraints will converge to the critical point (where the gradient is zero) eventually if two conditions are satisfied.For two-block coordinate problem, if the objective function of each subproblem is convex and the sequence generated by the algorithm has limit points, then every limit point is a critical point of the objective function.However, in a m-block (m ≥ 3) coordinate problem, the conditions must be that the objective function of f (x 1 , x 2 , ..., x m ) is strictly quasi-convex with respect to m − 2 variables (or blocks), and the sequence generated by the algorithm has limit points.
In our framework, when L ≥ 3, we have m ≥ 4. To analyze the convergence of SDNMF/L model, we consider the existence of the limit points of the sequence generated by SDNMF/L first.The fact that the objective is decreasing under the sequence supports that they are in a level set of C SDNMF/L .C SDNMF/L is continuous, so this level set is closed.If this level set is unbounded, then C SDNMF/L is unbounded on this level set, contradicting with the definition of level set.Thus, the level set is a bounded and closed set (compact set).Correspondingly, the generated sequence within a compact set has limit points.Although we cannot demonstrate the final convergence since the strict quasi-convexity of C SDNMF/L is hard to prove, it does decrease after each iteration, and eventually converges to a local optima.
2) SDNMF/R: Here, we considered to control the L 1 -norm of columns of each H l to deal with a sparse coding problem.It helps to represent the raw data with as fewer bases as possible while maintaining the ability to discriminate samples of different classes.Specifically, SDNMF/R is formulated as follows, Similarly, we adopted an initialization procedure like SDNM-F/L using Algorithm 1 to accelerate the optimization, and then we fine-tuned all W l and H l to reduce the total reconstruction error.During the fine-tuning process, for a specific l, the following two subproblems were considered to finetune W l and H l : where  20) can be solved by Subalgorithm 1 by substituting the objective function, W and Lipschitz constant with the one in (20), Ψ l = Ψ l−1 W l and L H l = Ψ T l Ψ l 2 +λ l e T l e l 2 , respectively.Thus, similar to SDNMF/L, with corresponding parameters in Algorithm 2 replaced, SDNMF/R will find its solution.
3) SDNMF/RL1: We next considered to control the sparsity of both W and H.It generates sparse basis matrices as well as a sparse representation.
To solve (21), initialization for each layer is also necessary.For a specific l, the problem is: where 1 2 λ l tr((e l H l ) T (e l H l )) is considered only when l = L. Obviously, the objective function of ( 22) is convex with respect to W or H separately and the respective Lipschitz constant is easy to calculate.Given the fact, back to Algorithm 1, the solution for ( 22) is ready to obtain with parameters related to W and H substituted with the following ones: H : where λ l e T l e l H and λ l • e T l e l 2 in ( 24) are removed if l = L, meaning that we just controled the sparsity of H L to obtain a high-level final presentation.After pre-training each layer separately, we fine-tuned the weights of all layers as well as the final representation with initial approximation points of W l , H L to reduce the total reconstruction error of (21).In particular, for a specific l and a factor matrix W l , let's consider the model in (15) where, By replacing the corresponding parameters in Subalgorithm 2 with equation ( 25), a new point for W l in the fine-tuning process will be obtained.For H L , consider the model (20), where (26) Similarly, with the above given parameters, a new point for H L can also be obtained using Subalgorithm 1, indicating that SDNMF/RL1 can also be optimized with Algorithm 2.
4) SDNMF/RL2: The above SDNMF/RL1 is devoted to find sparse basis as well as a sparse representation of raw data.Intuitively, for a decomposition of X ≈ W H, if W is compelled to be sparse, H will tend to be smooth.Here, we tried to strengthen this trend by controlling the L 1 -norm of each column of all weights while exerting L 2 -norm constraints onto final H (denoted as SDNMF/RL2).
The optimization procedure, either initialization or fine-tuning, is similar to SDNMF/RL1 as long as the e T L e L in ( 24) and ( 26) are replaced by identity matrix I in the optimization of (27).

5) Nonlinear function:
In this part, we took g(x) to be a nonlinear function and considered four choices: tanh, root, sigmoid, softplus.Due to the incorporation of nonlinear functions, it is different from linear system to solve nonlinear system, but the process was still divided into initialization and the fine-tuning steps.In the initialization, at first, we decomposed X = W 1 H 1 , before using H 1 into the next layer, we projected it in an element-wise way with the nonlinear function.Similarly, all H l (l = 2, 3, ..., L − 1) will be nonlinearly projected before making it as the input of the next layer.As for H L , we can project it or just leave it unchanged, resulting in two ways of incorporation.After projection, the strategy to solve H l−1 = W l H l is the same as that in Table I.
However, with nonlinear structure, Lipschitz constant for the gradient of nonlinear functions is hard to obtain, which means that Nesterov's optimal method may be not suitable to fine-tune the system.Thus, we fine-tuned the system with a traditional project gradient descend algorithm.The gradient for each factor is computed as follows: ∂f We updated H L first with (29), which can be computed according to the chain rule.Then, we updated W l , l ∈ {2, ..., L} with (30).As for H l , l ∈ {1, 2, ..., L − 1}, it is approximated directly by the nonlinear projection of the product of W l+1 and H l+1 , i.e., H l ≈ g −1 (W l+1 H l+1 ).For W 1 , we updated it with Subalgorithm 2 in Table I.

A. Data
We applied our models (DNMF, SDNMF/R, SDNMF/L, SDNMF/RL1, SDNMF/RL2) onto two benchmarking image datasets: ORL and PIE-pose 27.0, and compared them with other NMF-related models.Specifically, we compared our models with single layer models: the projected gradient NMF (PgNMF), NeNMF, and deep matrix factorization models: Deep Semi-NMF and Multi-layer models.Generally, a large number of samples are required to train a deep model to learn the complex features.Otherwise, it may not be able to outperform single layer models.We tested our models firstly on data ORL.It contains 40 subjects and each subject owns 10 images of equal size 112×92 under different conditions.Thus, the data size of ORL is 10304 × 400.With these two datasets, we first investigated the structure of our models to figure out the proper number of layers as well as the number of sub-bases in the hidden layers.

B. Evaluation criteria
In this article, we adopted three measures including normalized mutual information (NMI) [46], error rate(ER) [47] and naive precision (NP).Given the standard class partition C * and the obtained class partition C, we first constructed a confusion matrix N whose rows correspond to the classes in C * and columns correspond to the classes in C. Let N ij denotes the number of samples overlapped by the i-th real and j-th obtained classes.NMI is defined as , where |c| is the number of classes in C; N i . is the sum of i-th row of N ; N .j is the sum of j-th column of N .ER is defined as follows, where the Z and Z * are the indicator matrix for C and C * .NP is computed as follows: , where c * i denotes the number of objects in i-th classes of C * .

C. Results of ORL data
The ORL data has relative small number of samples, which is insufficient to train a deep model.Thus, we initially set L = 2.We chose k 2 = 80, 120, 160 and k 1 = 100, 200, 300, 400, 500, 600, 700, 800.In order to see the role of an extra layer, we fix k 2 while making k 1 change, that is, for each k 2 we ran our models with k 1 varying from 100 to 800. Figure 1 shows the NMI and ER of all models under k 2 = 160, from which we select appropriate value of k 1 for all models.For example, we chose k 1 = 300 for DNMF and SDNMF/R, k 1 = 700 for SDNMF/L, k 1 = 500 for SDNMF/RL2.Similarly, we chose appropriate k 1 for k 2 =80,120 for each model including Deep Semi-NMF.For Deep Semi-NMF, we fixed k 1 = 700 according to its performance across all tested values.After selection, we compared our sparse deep models with Deep Semi-NMF and single layer models (Figure 2).Each of boxes in both Figure 1 and 2 represents the distribution of 200 precision scores.They were generated by 20 computing experiments on ORL and K-means was applied 10 times on the representation matrix of each experiment for robust classification.Results show that the classification ability of our models outperforms Deep Semi-NMF, but do not always perform better than PgNMF and NeNMF.These results indeed confirm our consideration that ORL data does not contains enough samples to train a deep model sufficiently.They are randomly chosen from exponential distributions and then ranked to make sure that the value in lower layer is larger than that in higher layer.To make a roughly fair comparison, the number of sub-bases in the last layer is fixed to be 40.Generally, the classification results with k 1 = 300 and k 1 = 600 are consistent (Figure 4).To strengthen the ability of learning necessary information in the first layer, we chose k 1 = 600.As for k 2 , we tested it from 40, 80, 120, 160, 200 (Figure 5).It shows that, besides DNMF (It is one case of our models where there is no sparsity constraint on any W l or H l ), the precision of four models rise as k 2 grows at first and reach the climax under k 2 = 120 or k 2 = 160 and then fall down.Let k 2 = 160, an extra layer is added to the existing  models and we compared three-layer models with two-layer ones (Figure 6).It shows that, most of our models, except DNMF, reach their best with layer size= [600,160].Given that DNMF does not generate better results against our other sparse deep models, we did not extend all models to threelayer level for PIE-pose 27.0 data.
2) Role of nonlinear function: In this section, we incorporated nonlinear functions (tanh, root, sigmoid, softplus) into our models to achieve more powerful representation ability.After inspection, only root function was left.We took g(x) = √ x and incorporated it into the original linear models with L = 2 and layersize=[600,160].Adding nonlinear functions means that, before using H as the input of the next layer or as the final usage for classification, we projected it nonlinearly first.Since L = 2, we tried two ways to incorporate the root function into our models.One was that we only projected H 1 , leaving H 2 unchanged.The other way was that we projected both H 1 and H 2 .The steps to solve the nonlinear models are similar to those for linear models.We first initialized each layer and fine-tuned the whole system.We compared the performance of each nonlinear model with corresponding linear models (Figure 7).It shows that there are significant improvements of DNMF, SDNMF/L, SDNMF/RL2 in terms of NMI and ER after inducing g(x) = √ x.Specifically, for DNMF, the changing trend of NMI and ER of its linear model are even reversed by nonlinear function.As for SDNMF/R, its learning ability is strengthened by projecting H 1 but damaged by projecting both H 1 and H 2 .Maybe it is inappropriate to project H 2 with g(x) = √ x since, according to the design and corresponding results of our experiments, the elements of H 2 are smaller than 1 so that the projection of g(x) = √ x will enlarge all the elements and make H 2 more smooth, undermining the original sparsity generated by the our initial sparse constriction.For SDNMF/RL1, we required sparsity on both W and H.This may cause the loss of important information and cannot realize a good approximation.This situation might be deteriorated by projecting H 1 or H 1 and H 2 with the root function.We compared our models with PgNMF, NeNMF, Deep Semi-NMF and Multi-layer NMF (θ = 0.001) in terms of NMI, NP and ER (Table III).It turns out that SDNMF/L outperforms PgNMF, NeNMF and Multi-layer NMF in terms of NP and NMI.Advantages against the single-layer NMF mean that the deep decompositions of SDNMF/L works better to analyze data whereas advantages against Multi-layer NMF mean that both of the sparse strategy and the optimization strategies of SDNMF/L play important roles in generating a higher level data representation for more accurate classification.Sparsity constraints for all the columns of W helps to extract localized features and learn class-specific deep features, which in turn generates more distinctive representations for classification.Also, the fine-tuning process in the optimization approximates original data more closely.What's more, the performance of Multilayer NMF varies greatly with respect to parameter Θ.We tested it under multiple choices and some of the results were shown in Appendix.As for the comparison with Deep Semi-NMF, there is a gap between linear SDNMF/L and Deep Semi-NMF.This is reasonable since the learning ability is limited by the nonnegativity of our models whereas Semi-NMF is more likely to succeed without that constriction on basis matrices.Nevertheless, by adopting root function, the learning ability of SDNMF/L rises to a new level, generating more representative coefficient matrices than Deep Semi-NMF to distinguish samples of different classes.The ER of SDNMF/L is higher than those of the compared models.We see that sometimes multiple (up to 5) classes would be clustered into the same one by linear SDNMF/L, which rarely occurs when incorporating the root function (especially in the second way).Moreover, the learning ability of nonlinear SDNMF/RL2, SDNMF/R as well as DNMF is also greatly strengthened by such a nonlinear function.
3) Feature hierarchies: SDNMF/L can not only yield a good classification on data PIE-pose 27.0, but also learns feature hierarchies to show parts-of-whole characteristics intuitively.This benefits from our sparsity constraints on the columns of W . Sparse W 1 helps to learn part-based information for each sample and sparse W 2 selectively combine the initial basis in the first layer to generate composite basis and form relatively complex features.Again, W 3 selectively combine the composite basis in the second layer to show discriminative features for samples in distinct classes.As all the features being extracted, a high level and more meaningful representations for each class will be automatically obtained to realize accurate classification and feature interpretation.Figure 8 shows how a three-layer linear SDNMF/L model learns feature hierarchies.In Figure 8a, the left part contains 10 samples of class 36 and the right part are coefficients (some columns of H 3 ) for all samples of class 36, from which we knew that samples of class 36 are mainly reconstructed by the 7th composite basis in W 1 W 2 W 3 (the first face image in Figure 8d).Since it is the combination result of the composite bases in W 1 W 2 in the second layer, we got the top 5 elements of the 7th column of W 3 (coefficient in Figure 8c) and located the corresponding composite basis in W 1 W 2 (the 5 face images in the left in Figure 8c).To find the related initial sub-basis in the first layer, we chose the first image in Figure 8c, which has the largest coefficient and located its column number in W 1 W 2 (column 13).Then we ranked the 13th column of W 2 and get the top 5 elements (coefficient in Figure 8b) and located the corresponding columns of the W 1 (the 5 images in the left in Figure 8b).

V. CONCLUSION
In this paper, we proposed sparse deep nonnegative matrix factorization models satisfying different sparsity requirements.By extending the original NMF into multi-layer models, they can learn data in a hierarchical way.Model structure optimization was implemented for different datasets.We explored the incorporation of nonlinear functions into these models.We adopted the Nesterov's accelerated algorithm to accelerate the computing process during optimization.We evaluated the time complexity of our algorithm framework and showed that it is comparable to Deep Semi-NMF.We demonstrated that our models are able to learn more discriminative representations to realize competitive classification accuracy compared with other NMF models.Besides, they can also generate intuitive interpretations for the features extracted across all layers, which is not applicable for single layer NMF or Deep Semi-NMF.
We would like to close this paper by listing some possible research directions.Our models performed differently even on the same dataset, there must be some underlying reasons that need to be explored.Another concern is why some nonlinear functions performs well while others not.Besides, although the complexity of our model are comparable to other NMF variants, we need to search for more efficient optimization strategies to deal with the increasing big data.

Fig. 1 .
Fig. 1.Comparison of two-layer models with layersize=[k 1 , 160] in terms of NMI (a) and ER (b).Each sub-figure shows the results of Deep Semi-NMF and our models for comparison.

Fig. 3 .
Fig. 3. Classification precision of our linear models on PIE-pose 27.0 dataset in terms of NP with different layer number L= 2, 3, 4. The number of sub-bases in hidden layers was randomly extracted from certain exponential distributions.Then they were ranked to satisfy the relationship k lowerlayer > k higherlayer .For example, for SDNMF/L with L = 3 and layersize = [k 1 , k 2 , k 3 ], the value of k 1 , k 2 were drawn from exponential distributions, with k 1 > k 2 , k 3 is fixed to be 40.The experiments for SDNMF/L under each layer size was repeated 50 times, generating 50 NMI values.The rest experiments for the other models were done in the same way.

Fig. 7 .
Fig. 7. Comparison among linear and nonlinear models.Square 1 means that we only added nonlinear function onto H 1 , whereas Square 2 means that both H 1 and H 2 are projected by the nonlinear function.The classification precision in terms of NMI of DNMF, SDNMF/R, SDNMF/L, SDNMF/RL2 are shown in (a) and (b), respectively.The results of SDNMF/RL1 in terms of NP, NMI, ER are shown in (c).

Fig. 8 .
Fig. 8. Hierarchical feature interpretation by a linear SDNMF/L model with layersize=[193,141,40].(a) 10 samples of class 36 and the representation matrix of class 36.Certain bases as well as corresponding coefficients in the first, second and third layer are shown in (b), (c) and (d), respectively.

TABLE II Algorithm 2 :
Optimal algorithm for SDNMF/L Input: X ∈ R m×n , the size of each layer Output: Weight matrices W l and representation matrices H l (∀ l) is one matrix with all unit elements whose • 2 is the length of ξ 1 .Thus, the complexity of initialization in the first layer isI out • (O(mk 2 1 + nk 2 1 + mnk 1 ) + I inner • O(mk 2 1 + nk 2 1)), where I out is the number of iterations in Algorithm 1.Let R = max{k 1 , k 2 , ..., k L }, the complexity of initialization for each layers is I out (O(mR 2 +nR 2 +mnR)+I inner •O(mR 2 + nR 2 )).
(19)can be solved by Subalgorithm 2. The objective function, H and Lipschitz constant should be substituted with the one in(19), H l and LC It was randomly split into a training set with size of 10304 × 320 and a test set with size of 10304 × 80.The second dataset is PIE-pose 27.0.It contains 68 subjects and each subject owns 42 images of equal size of 32 × 32 under each condition.Thus, the size of this data is 1024 × 2856.In our experiments, PIE-pose 27.0 was randomly split into seven independent pairs of training and test sets for cross validation, where the size of each training set was 1024 × 2448 and the size for test set was 1024 × 408.

TABLE III COMPARISON
OF NMF-RELATED MODELS ON PIE-POSE 27 DATA