Convolutional Subspace Clustering Network With Block Diagonal Prior

Standard subspace clustering methods are based on self-expressiveness in the original data space, which states that a data point in a subspace can be expressed as a linear combination of other points. However, real data in raw form are usually not well aligned with the linear subspace model. Therefore, it is crucial to obtain a proper feature space for performing high-quality subspace clustering. Inspired by the success of Convolutional Neural Networks (CNNs) in extracting powerful features from visual data, and by the block diagonal prior for learning a good affinity matrix from self-expression coefficients, in this paper we propose a jointly trainable feature extraction and affinity learning framework with the block diagonal prior, termed the Convolutional Subspace Clustering Network with Block Diagonal prior (ConvSCN-BD). We solve the joint optimization problem in ConvSCN-BD via an alternating minimization algorithm, which updates the parameters of the convolutional modules and the self-expression coefficients with stochastic gradient descent, and updates the other variables with closed-form solutions. In addition, we derive the connection between the block diagonal prior and the subspace structured norm, and reveal that imposing the block diagonal prior on the affinity matrix essentially incorporates the feedback information from spectral clustering. Experiments on three benchmark datasets demonstrate the effectiveness of our proposal.


I. INTRODUCTION
In many problems across computer vision and pattern recognition, we need to deal with high-dimensional datasets, such as images, videos, and text. Such high-dimensional data can often be well approximated by a union of low-dimensional subspaces, corresponding to multiple classes or categories [1]. For example, under Lambertian reflectance, the face images of one subject with a fixed pose and varying lighting conditions lie in a linear subspace of dimension up to 9 [2]. Similarly, the feature point trajectories associated with a rigidly moving object in a video lie in a union of subspaces of dimension up to 3 [3], and the images of each handwritten digit with different variations also lie in a low-dimensional subspace [4].

(The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.)

Grouping the data points
into their respective subspace leads to an important problem, called subspace clustering. In the past decade, subspace clustering has found many applications in image representation and compression [5], motion segmentation [3], and temporal video segmentation [6], etc.
A. SUBSPACE CLUSTERING

Problem 1 (Subspace Clustering): Given a data matrix X ∈ ℝ^{D×N} whose columns are drawn from a union of k subspaces {S_i}_{i=1}^{k}, where subspace S_i contains N_i data points and Σ_{i=1}^{k} N_i = N, the goal of subspace clustering is to segment the columns of X into their corresponding subspaces.
In the past decade, subspace clustering has received a lot of attention and a number of methods have been developed. Among them, the most popular methods are based on spectral clustering, e.g., [7]-[16], which divide the problem into two steps. In the first step, an affinity matrix A = [a_ij] is built from the data, where a_ij measures the similarity between the i-th and j-th data points. It is expected that a_ij is nonzero only for data points lying in the same subspace (which is referred to as the subspace-preserving property [1], [6]). In the second step, the segmentation of the data is obtained by applying spectral clustering to the affinity matrix A. The typical methods for building the affinity matrix are based on the self-expression model, which states that a data point in a subspace can be expressed as a linear combination of other data points [6], [8], i.e., x_j = Σ_{i≠j} c_ij x_i + e_j, where e_j is used to tolerate errors. To find a subspace-preserving solution, a proper regularizer is usually imposed on the coefficients, leading to the following problem:

  min_{c_j} ‖c_j‖ + (λ/2) ‖x_j − Σ_{i≠j} c_ij x_i‖²_2, for j = 1, …, N,

where λ > 0 is a tradeoff parameter and ‖·‖ is a properly chosen norm on the coefficients, such as the ℓ1 norm used in sparse representation [8] or the nuclear norm used in low-rank representation [9]. If we arrange the N coefficient vectors {c_1, c_2, …, c_N} into the coefficient matrix C = [c_1, c_2, …, c_N] ∈ ℝ^{N×N}, then we can turn the problem into the compact formulation

  min_C ‖C‖ + (λ/2) ‖X − XC‖²_F, s.t. diag(C) = 0,

where the constraint diag(C) = 0 is sometimes used to avoid the trivial solution C = I. Once the coefficient matrix C has been learned from the data, the affinity matrix is usually induced from C via A = ½(|C| + |Cᵀ|), and spectral clustering is used to find the segmentation of the data.

(VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/)
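The two-step pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration with a toy coefficient matrix of our own choosing (and without the final k-means step), not any paper's implementation:

```python
import numpy as np

def affinity_from_coefficients(C):
    """Affinity A = (|C| + |C^T|) / 2 induced by self-expression coefficients."""
    return 0.5 * (np.abs(C) + np.abs(C.T))

def spectral_embedding(A, k):
    """k eigenvectors of the Laplacian L = D - A for the k smallest eigenvalues."""
    L = np.diag(A.sum(axis=1)) - A
    _, eigvecs = np.linalg.eigh(L)      # eigh returns eigenvalues in ascending order
    return eigvecs[:, :k]

# Toy coefficient matrix: two perfectly separated groups of 3 points each.
C = np.zeros((6, 6))
C[:3, :3] = 0.5
C[3:, 3:] = 0.5
np.fill_diagonal(C, 0.0)                # diag(C) = 0 rules out the trivial C = I

A = affinity_from_coefficients(C)
Q = spectral_embedding(A, k=2)
# Rows of Q coincide for points in the same block, so any standard clustering
# of the rows (e.g., k-means) recovers the segmentation.
```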

B. BLOCK DIAGONAL PRIOR
The block diagonal prior [17]- [20] means that the affinity matrix learned from the data has a block diagonal structure consisting of k connected components, corresponding to the data points in the k subspaces. Given a data matrix X ∈ ℝ^{D×N} drawn from a union of k subspaces {S_1, S_2, …, S_k}, the coefficient matrix C is block diagonal if C has the following form:

  C = diag(C_1, C_2, …, C_k),

where C_i ∈ ℝ^{N_i×N_i} and the nonzero entries in C_i correspond to pairs of data points from the same subspace S_i.
Recently, a few works have incorporated the block diagonal prior into subspace clustering [17], [18], and demonstrated that the clustering performance can be improved by adding the block diagonal prior on the coefficient matrix or the affinity matrix. For example, [17] seeks a coefficient matrix that is exactly k-block diagonal by enforcing a hard Laplacian constraint; [18] directly seeks a block diagonal representation to construct the affinity matrix; in addition, [19] incorporates the block diagonal prior into Low-Rank Representation [9] in an implicit feature space, leading to promising subspace clustering performance.
Traditional subspace clustering methods usually work in the original data space; however, due to the lack of feature extraction, their potential applications have been seriously limited. To address this problem, attempts to jointly perform feature extraction and subspace clustering have been carried out. For example, hand-crafted features such as SIFT [21] and HOG [22], together with a two-layer auto-encoder network, have been used before performing subspace clustering [23]; in [24], a stacked convolutional auto-encoder network is adopted to jointly perform feature extraction and subspace clustering; in [25], a stacked convolutional auto-encoder network is used in a generative adversarial network framework for subspace clustering. While these works show that proper feature extraction is usually crucial for subspace clustering, more effort is needed to explore a systematic scheme for feature-extraction-based subspace clustering.

C. OUR CONTRIBUTIONS
In this paper, we propose a novel subspace clustering framework, called Convolutional Subspace Clustering Network with Block Diagonal prior (ConvSCN-BD), in which the block diagonal prior is integrated into a jointly trainable, convolutional-feature-based self-expression model. By doing so, the block diagonal prior helps to learn a better affinity matrix, and the affinity matrix in turn provides informative feedback to refine the self-expression coefficients. Specifically, our contributions can be highlighted as follows.
• We propose a jointly trainable framework for subspace clustering, in which the convolution feature extraction and the affinity learning with the block diagonal prior are jointly optimized.
• We propose an alternating minimization algorithm to solve the joint optimization problem in ConvSCN-BD, in which the parameters of the convolutional feature extraction module and the coefficients of the self-expression model are updated via stochastic gradient descent, while the other variables are updated with closed-form solutions.
• We conduct experiments on three benchmark datasets to demonstrate the effectiveness of the proposed joint optimization framework.

Paper Outline: The remainder of this paper is organized as follows. Section II reviews the relevant work. Section III presents our proposed ConvSCN-BD and the training strategy. Section IV presents experiments with discussions, and Section V concludes the paper.

II. RELATED WORK
In this section, we review relevant previous work on the block diagonal prior in subspace clustering. For clarity, we group it into two categories: a) subspace clustering with the block diagonal prior in the original space; b) subspace clustering with the block diagonal prior in a feature space.

A. SUBSPACE CLUSTERING WITH BLOCK DIAGONAL PRIOR IN ORIGINAL SPACE

Definition 1 (k-Block Diagonal Matrix): A matrix A ∈ ℝ^{N×N} is k-block diagonal if it has k connected components.
For a block diagonal affinity matrix A, the number of connected components relates to the spectral properties of the corresponding Laplacian matrix [26]:

Theorem 1 [26]: Let A be an affinity matrix. The multiplicity of the eigenvalue 0 of the corresponding Laplacian matrix L_A equals the number of connected components (blocks) in A.
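This spectral property is easy to verify numerically. The following NumPy snippet (a toy example of our own, not part of the paper) builds a 3-block affinity matrix and counts the zero eigenvalues of its Laplacian:

```python
import numpy as np

# A block diagonal affinity with k = 3 connected components (complete subgraphs).
blocks = [np.ones((4, 4)), np.ones((3, 3)), np.ones((5, 5))]
A = np.zeros((12, 12))
start = 0
for B in blocks:
    n = B.shape[0]
    A[start:start + n, start:start + n] = B
    start += n
np.fill_diagonal(A, 0.0)                # no self-loops

L = np.diag(A.sum(axis=1)) - A          # graph Laplacian L_A = D - A
eigvals = np.linalg.eigh(L)[0]          # ascending order
num_zero = int(np.sum(np.abs(eigvals) < 1e-8))
# Theorem 1: the multiplicity of eigenvalue 0 equals the number of blocks (3 here).
```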
Let λ_i(L_A), i = 1, …, N, be the eigenvalues of L_A in decreasing order. By Theorem 1, A is k-block diagonal if and only if the k smallest eigenvalues λ_{N−k+1}(L_A), …, λ_N(L_A) are all zero, i.e., rank(L_A) = N − k. In [17], the block diagonal prior is introduced by imposing the Laplacian Constraint (LC) rank(L_A) = N − k, which is added directly in the self-expression model as follows:

  min_C ‖C‖ + (λ/2) ‖X − XC‖²_F, s.t. diag(C) = 0, C ∈ K,

where K is the space of k-block-diagonal matrices, which is defined as

  K = {C ∈ ℝ^{N×N} : rank(L_A) = N − k, A = ½(|C| + |Cᵀ|)}.

While the explicit constraint on the rank of the graph Laplacian is a natural way to encode the block diagonal prior, it is hard to optimize due to its NP-hard nature.
To address this issue, Lu et al. [18] relax the rank constraint into an ℓ1-type regularizer on the smallest eigenvalues of the graph Laplacian matrix, which is defined as follows.
Definition 2 (k-Block Diagonal Regularizer): Let A ∈ ℝ^{N×N} be an affinity matrix computed from the coefficient matrix, e.g., A = ½(|C| + |Cᵀ|). The k-block diagonal regularizer is defined as the sum of the k smallest eigenvalues of L_A, i.e.,

  ‖A‖_κ = Σ_{i=N−k+1}^{N} λ_i(L_A),

where λ_i(L_A) denotes the i-th eigenvalue of L_A in decreasing order.
Obviously, if the affinity matrix A is k-block diagonal, then ‖A‖_κ = 0; in turn, if ‖A‖_κ = 0, the affinity matrix A has at least k connected components (blocks). As ‖A‖_κ measures the degree to which A is block diagonal, it can be used as a regularizer to learn a block diagonal representation. In [18], the block diagonal representation is formulated as follows:

  min_C (1/2) ‖X − XC‖²_F + λ ‖C‖_κ, s.t. diag(C) = 0, C ≥ 0, C = Cᵀ,

where ‖C‖_κ is the block diagonal regularizer on the coefficient matrix and λ > 0 is a tradeoff parameter.
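The behavior of ‖A‖_κ can be checked numerically. In this NumPy sketch (with toy affinities of our own choosing), the regularizer vanishes for a perfectly 2-block-diagonal affinity and becomes strictly positive once a cross-block link is added:

```python
import numpy as np

def bd_regularizer(A, k):
    """||A||_kappa: sum of the k smallest eigenvalues of L_A (Definition 2)."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals = np.linalg.eigh(L)[0]      # ascending order
    return eigvals[:k].sum()

# Perfectly 2-block-diagonal affinity: the regularizer vanishes.
A_block = np.zeros((6, 6))
A_block[:3, :3] = 1.0
A_block[3:, 3:] = 1.0
np.fill_diagonal(A_block, 0.0)

# A cross-block link connects the graph, so only one eigenvalue stays at zero
# and the 2-block regularizer becomes strictly positive.
A_noisy = A_block.copy()
A_noisy[0, 5] = A_noisy[5, 0] = 0.3
```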

B. SUBSPACE CLUSTERING WITH BLOCK DIAGONAL PRIOR IN FEATURE SPACE
Typical subspace clustering methods in the original space suffer from the basic linearity assumption, which requires that each sample can be linearly reconstructed from the other samples; however, this assumption may not hold in many practical applications. To address this issue, the kernel trick is employed to yield a latent feature space for subspace clustering; e.g., polynomial and Gaussian kernels are used to map the original data into a high-dimensional feature space implicitly [27]-[30]. Recently, the block diagonal prior has been used for subspace clustering in a latent space that is implicitly built via the kernel trick. For example, [19] develops a low-rank representation based approach with the block diagonal prior in the implicit feature space, which defines the optimization problem as follows:

  min_C ‖C‖_* + (λ/2) ‖φ(X) − φ(X)C‖²_F + γ ‖A‖_κ,

where λ > 0 and γ > 0 are trade-off parameters, φ(·) is the feature map induced by a chosen kernel function (such as a polynomial or Gaussian kernel), ‖A‖_κ is the block diagonal regularizer, which encourages the affinity matrix A = ½(|C| + |Cᵀ|) to be k-block diagonal, and ‖·‖_* denotes the nuclear norm (i.e., the sum of the singular values of a matrix).
While the aforementioned subspace clustering methods in latent feature spaces bring promising performance improvements, there is no guarantee that the data in the latent feature space induced by predefined kernels will align with the union-of-subspaces model. In addition, choosing a proper kernel function is not trivial.
Recently, a few attempts have explored subspace clustering with deep neural networks to handle the non-linearity that linear subspace clustering methods suffer from in the original space. For example, in [23], hand-crafted features (e.g., SIFT or HOG) and a sparse self-expression model are combined via a fully connected deep auto-encoder network; in [31], a non-linear subspace clustering (NSC) method is proposed for image clustering via a neural network; similarly, in [32], a trained Self-Organizing Map (SOM) neural network is proposed for subspace clustering; in [24], Deep Subspace Clustering Networks (DSC-Nets) are introduced to tackle the non-linearity in subspace clustering, where the raw data are non-linearly mapped with convolutional auto-encoders into a feature space in which the self-expression model is conducted; in [25], a deep adversarial framework is adopted for subspace clustering. Although these methods have reported impressive performance improvements, the block diagonal prior has not been considered.
In this work, we incorporate the block diagonal prior into DSC-Nets [24] and develop a trainable framework for joint convolutional feature extraction and affinity learning with the block diagonal prior. As illustrated in Figure 1, it consists of mainly three modules: i) a stacked convolutional encoder module, which is used to extract convolutional features; ii) a stacked convolutional decoder module, which is used with the encoder module to initialize the convolutional module; iii) a self-expression module with the block diagonal prior, which is used to learn a better affinity matrix from the self-expression coefficients. During training, we make use of the block diagonal prior to help learn a better affinity matrix from the self-expression coefficients, which in turn feeds information back to the self-expression model.

III. OUR PROPOSAL: CONVOLUTIONAL SUBSPACE CLUSTERING NETWORK WITH BLOCK DIAGONAL PRIOR (CONVSCN-BD)
In this section, we present our proposal ConvSCN-BD for subspace clustering, which jointly learns the convolutional feature and the affinity matrix.

A. NETWORK FORMULATION
As illustrated in Figure 1, our network consists of a convolution feature extraction module and a self-expression module with block diagonal prior. Once the affinity is constructed, the segmentation of the data can be obtained by applying spectral clustering.

1) FEATURE EXTRACTION MODULE
The feature extraction module is a basic component of our proposed ConvSCN-BD, which is used to extract features from the raw data that are suitable for subspace clustering. To handle the non-linearity in subspace clustering, we employ convolutional auto-encoders comprising multiple convolutional layers.
Let X = {x_1, x_2, …, x_N} denote the set of input samples, and Z = {z_1, z_2, …, z_N} denote the set of corresponding latent representations learned by the feature extraction module. For clarity, we denote the parameters of the convolutional feature extraction module as θ, and thus the learned convolutional feature is z_j = ϕ(x_j, θ), where ϕ(·) refers to the mapping from the input data to the convolutional feature. To train the convolutional feature extraction module, we form a convolutional auto-encoder architecture as in [24] and pre-train the network by minimizing the following reconstruction loss:

  L_0 = (1/2N) Σ_{j=1}^{N} ‖x_j − x̂_j‖²_2,   (9)

where X̂ = {x̂_1, x̂_2, …, x̂_N} is the output of the convolutional decoder and N is the number of images in the training set. Once the convolutional feature extraction module is trained, we add the self-expression model with the block diagonal prior on top of the feature, which is obtained by concatenating the feature maps in the output layer of the convolutional encoder, as illustrated in Figure 1.

2) SELF-EXPRESSION MODULE IN FEATURE SPACE WITH BLOCK DIAGONAL PRIOR
As mentioned above, the self-expressiveness property is the foundation of subspace clustering; that is, a data point can be represented as a linear combination of all the other points. In this paper, we propose to seek the block diagonal representation in the convolutional feature space by solving the following optimization problem:

  min_C (1/2) ‖C‖²_F + (λ/2) ‖Z − ZC‖²_F + γ ‖A‖_κ,   (10)

where Z = ϕ(X, θ) is the matrix of concatenated convolutional features, λ > 0 is a tradeoff parameter, the second term ‖Z − ZC‖²_F is to tolerate errors, and ‖A‖_κ is the block diagonal regularizer in Definition 2 with A = ½(|C| + |Cᵀ|).
Regarding the optimization with the block diagonal prior term ‖A‖_κ, we have the following result.
Theorem 2 [18]: Let λ_i(·) denote the i-th eigenvalue of a matrix in decreasing order, let A ∈ ℝ^{N×N} be an affinity matrix, and let L_A = D − A be its Laplacian matrix. Then we have:

  Σ_{i=N−k+1}^{N} λ_i(L_A) = min_W ⟨L_A, W⟩, s.t. 0 ⪯ W ⪯ I, trace(W) = k.   (11)

To understand the block diagonal prior ‖A‖_κ, we denote W = QQᵀ, where Q ∈ ℝ^{N×k} and the columns of Q are of unit ℓ2 norm and orthogonal to each other, i.e., Qᵀ Q = I. It is easy to check that W = QQᵀ is a feasible solution. Then, we reformulate (11) as follows:

  Σ_{i=N−k+1}^{N} λ_i(L_A) = min_{QᵀQ=I} trace(Qᵀ L_A Q) = min_{QᵀQ=I} (1/2) Σ_{i,j} a_ij ‖q^(i) − q^(j)‖²_2,   (12)

where q^(i) and q^(j) are the row vectors of matrix Q.
On the other hand, given the affinity matrix A = ½(|C| + |Cᵀ|), we apply spectral clustering to solve the following problem:

  min_{Q∈Q} trace(Qᵀ L_A Q),   (13)

where Q is the segmentation matrix, and Q = {Q ∈ {0,1}^{N×k} : Q1 = 1, rank(Q) = k} is the set of all valid segmentation matrices. In practice, as the search over all Q ∈ Q is combinatorial, to make the problem tractable, we usually relax the constraint Q ∈ Q to Qᵀ Q = I and thus solve the problem as follows:

  min_{QᵀQ=I} trace(Qᵀ L_A Q),   (14)

which can be solved effectively by eigenvalue decomposition.
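The relaxed problem (14) reduces to an eigenvalue decomposition. The following NumPy sketch (with a random symmetric affinity, purely illustrative) confirms that the k bottom eigenvectors attain the minimum trace value, which equals the sum of the k smallest eigenvalues of the Laplacian:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((8, 8))
A = 0.5 * (M + M.T)                      # random symmetric affinity (illustrative)
np.fill_diagonal(A, 0.0)
L = np.diag(A.sum(axis=1)) - A           # graph Laplacian L_A

k = 3
eigvals, eigvecs = np.linalg.eigh(L)     # ascending eigenvalue order
Q = eigvecs[:, :k]                       # minimizer of trace(Q^T L Q) s.t. Q^T Q = I
min_trace = np.trace(Q.T @ L @ Q)        # = sum of the k smallest eigenvalues
```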
It is now clear that the optimal solution Q of (12) is the (relaxed) segmentation matrix of spectral clustering in (14). Moreover, by substituting A = ½(|C| + |Cᵀ|) into the objective function in (14), we have trace(Qᵀ L_A Q) = ‖C‖_Q, which links the block diagonal prior ‖A‖_κ to the subspace structured norm ‖C‖_Q in [14].

B. ALTERNATING MINIMIZATION ALGORITHM FOR SOLVING THE JOINT OPTIMIZATION PROBLEM IN CONVSCN-BD
In this subsection, we present an alternating minimization algorithm to solve the joint optimization problem in ConvSCN-BD.
For clarity, we formulate the whole joint optimization problem in ConvSCN-BD (as illustrated in Figure 1) as follows:

  min_{θ,C} L_0 + (1/2) ‖C‖²_F + (λ/2) ‖Z − ZC‖²_F + γ ‖A‖_κ, s.t. Z = ϕ(X, θ), A = ½(|C| + |Cᵀ|),   (15)

where θ refers to the parameters of the convolutional feature extraction module.
To remedy the difficulty of solving this problem, we introduce an auxiliary variable J with a penalty term (β/2) ‖C − J‖²_F, where β > 0, and thus relax the problem as follows:

  min_{θ,C,J} L_0 + (1/2) ‖C‖²_F + (λ/2) ‖Z − ZC‖²_F + (β/2) ‖C − J‖²_F + γ ‖A‖_κ, s.t. A = ½(|J| + |Jᵀ|),   (16)

where Z denotes ϕ(X, θ) for clarity.
Owing to the intermediate term (β/2) ‖C − J‖²_F, we can solve the joint optimization problem in (16) by alternately solving the following two subproblems:
• Given J, we solve for C and θ from the subproblem

  min_{θ,C} L_0 + (1/2) ‖C‖²_F + (λ/2) ‖Z − ZC‖²_F + (β/2) ‖C − J‖²_F.   (17)

This problem can be solved via stochastic gradient descent (SGD) [33].
• Given θ and C, we solve for J from the subproblem

  min_J (β/2) ‖C − J‖²_F + γ ‖A‖_κ, s.t. A = ½(|J| + |Jᵀ|).   (18)

The solution to this problem is given in the following subsection.

1) UPDATING FOR θ AND C
When J is given (or initialized), we can update θ and C by solving subproblem (17). Note that the objective function in (17) is differentiable with respect to both θ and C, and that, thanks to the pre-training of the convolutional module, we can assume a good initialization for θ. We therefore update θ and C with stochastic gradient descent [33].
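As a simplified illustration of this step, the following NumPy sketch freezes the features Z (i.e., holds θ fixed) and runs plain full-batch gradient descent on C for the smooth objective of (17); the dimensions, the step size, the zero initialization of J, and the use of full-batch gradient descent instead of SGD are our own assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((5, 20))        # fixed features, standing in for phi(X, theta)
J = np.zeros((20, 20))                  # auxiliary variable from the previous step
lam, beta, lr = 1.0, 1.0, 0.01

def loss(C):
    """Smooth objective of (17) with theta frozen (L_0 is then a constant)."""
    return (0.5 * np.linalg.norm(C)**2
            + 0.5 * lam * np.linalg.norm(Z - Z @ C)**2
            + 0.5 * beta * np.linalg.norm(C - J)**2)

C = np.zeros((20, 20))
losses = [loss(C)]
for _ in range(200):
    grad = C + lam * Z.T @ (Z @ C - Z) + beta * (C - J)   # gradient w.r.t. C
    C -= lr * grad
    losses.append(loss(C))
# The objective decreases monotonically for a small enough step size.
```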

2) UPDATING FOR J
When C is given (or initialized), we can update J by solving subproblem (18). By using Theorem 2, we reformulate subproblem (18) as follows:

  min_{J,W} (β/2) ‖C − J‖²_F + γ ⟨L_A, W⟩, s.t. A = ½(|J| + |Jᵀ|), 0 ⪯ W ⪯ I, trace(W) = k,   (19)

which we solve by alternating between W and J.
• Given J (and C), the problem for solving W reduces to

  min_W ⟨L_A, W⟩, s.t. 0 ⪯ W ⪯ I, trace(W) = k.   (20)

As discussed in (12), W = QQᵀ is a feasible solution, where Q is defined as in (12), and thus the optimal W can be computed indirectly via Q, which is the optimal solution of the following problem:

  min_{QᵀQ=I} trace(Qᵀ L_A Q).   (21)

The closed-form solution of Q is given by the k eigenvectors associated with the k smallest eigenvalues of the graph Laplacian L_A.
• Given W, the problem for solving J is:

  min_J (β/2) ‖C − J‖²_F + γ ⟨L_A, W⟩, s.t. A = ½(|J| + |Jᵀ|).   (22)

Note that, except for the intermediate term, the other terms in the objective of (22) depend only on |J|; furthermore, the elements of the optimal J have the same signs as those of C. The solution of (22) can thus be computed as J = Ĵ ⊙ sign(C), where ⊙ is the Hadamard product and Ĵ is the solution of the following problem:

  min_{Ĵ≥0} (β/2) ‖|C| − Ĵ‖²_F + γ ⟨Ĵ, D_W⟩,   (23)

which is equivalent to:

  min_{Ĵ≥0} (β/2) ‖Ĵ − (|C| − (γ/β) D_W)‖²_F,   (24)

where D_W = diag(W)1ᵀ − W. As derived in [19], the closed-form solution of (24) is given as:

  Ĵ = max(|C| − (γ/β) D_W, 0).   (25)
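The two closed-form updates can be sketched in NumPy as follows. The exact form of the J-update is a reconstruction of the closed form cited from [18], [19] and should be read as an assumption; the matrix sizes and parameter values are illustrative only:

```python
import numpy as np

def update_W(J, k):
    """W-update (21): W = Q Q^T, with Q the k eigenvectors of L_A
    for its k smallest eigenvalues, where A = (|J| + |J^T|) / 2."""
    A = 0.5 * (np.abs(J) + np.abs(J.T))
    L = np.diag(A.sum(axis=1)) - A
    Q = np.linalg.eigh(L)[1][:, :k]     # ascending eigenvalue order
    return Q @ Q.T

def update_J(C, W, gamma, beta):
    """J-update sketch of (25): J = max(|C| - (gamma/beta) D_W, 0) * sign(C),
    with D_W = diag(W) 1^T - W (reconstructed closed form, an assumption)."""
    D_W = np.outer(np.diag(W), np.ones(W.shape[0])) - W
    J_hat = np.maximum(np.abs(C) - (gamma / beta) * D_W, 0.0)
    return J_hat * np.sign(C)

rng = np.random.default_rng(2)
C = rng.standard_normal((10, 10))
np.fill_diagonal(C, 0.0)                # diag(C) = 0
W = update_W(C, k=2)
J = update_J(C, W, gamma=0.1, beta=1.0)
```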

C. TRAINING PROCEDURE FOR CONVSCN-BD
In this subsection, we provide a two-stage training procedure for the proposed ConvSCN-BD. In the first stage, we pre-train the stacked convolutional layers to initialize our ConvSCN-BD, and in the second stage, we fine-tune the whole network to jointly optimize the convolutional feature extraction and the affinity learning with the block diagonal prior.
• In the pre-training stage, we first pre-train the stacked convolutional auto-encoder network using the loss L_0 defined in (9). Then, we add the self-expression module and pre-train the stacked convolutional auto-encoder network together with the self-expression module, using the loss function L_0 + λL_1, where L_1 = (1/2) ‖Z − ZC‖²_F (together with the Frobenius regularizer (1/2) ‖C‖²_F on the coefficients, as in (10)).
• In the fine-tuning stage, we use the loss function

  L = L_0 + λL_1 + βL_2,   (26)

where L_2 = (1/2) ‖C − J‖²_F, to jointly update the parameters of the convolutional module and the coefficients C of the self-expression module, where J is initialized as the pre-trained coefficient matrix C. Then, given C, we obtain the affinity matrix A immediately by updating W and J in subproblem (19) with their closed-form solutions. To summarize, the total loss function of our proposed ConvSCN-BD is as follows:

  L = L_0 + λL_1 + βL_2 + γL_3,   (27)

where L_3 = ‖A‖_κ. For clarity, we list the procedure for training the whole ConvSCN-BD in Algorithm 1.

Algorithm 1 Procedure for Training ConvSCN-BD
Require: Input data, tradeoff parameters, maximum iterations T_max and T_0; t = 1.
1) Pre-train the convolutional module.
2) Pre-train the convolutional module together with the self-expression module, and initialize J with C.
3) while t ≤ T_max do
   a) Fix J^(t) and C^(t); update W^(t) via the solution of (21).
   b) Fix W^(t); update J^(t) via the solution of (24).
   c) Update C and θ by minimizing loss (27) via SGD with back-propagation for T_0 iterations, and set t ← t + 1.
4) end while
5) Run spectral clustering.
Ensure: trained ConvSCN-BD and Q.
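Algorithm 1 can be sketched end-to-end in NumPy if we replace the convolutional module with fixed features Z and SGD with full-batch gradient steps. Everything here (dimensions, parameter values, the simplified J-update) is an illustrative assumption rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
# Fixed "features" standing in for the frozen convolutional encoder output:
# two rank-1 groups of 8 columns each, mimicking two 1-dimensional subspaces.
Z = np.hstack([rng.standard_normal((6, 1)) @ rng.standard_normal((1, 8)),
               rng.standard_normal((6, 1)) @ rng.standard_normal((1, 8))])
N, k = Z.shape[1], 2
lam, beta, gamma, lr, T0 = 1.0, 1.0, 0.05, 0.005, 50

C = np.zeros((N, N))
J = C.copy()
for t in range(20):
    # a) W-update: spectral step on A_J (solution of (21)).
    A = 0.5 * (np.abs(J) + np.abs(J.T))
    L = np.diag(A.sum(axis=1)) - A
    Q = np.linalg.eigh(L)[1][:, :k]
    W = Q @ Q.T
    # b) J-update: closed-form sketch for (24).
    D_W = np.outer(np.diag(W), np.ones(N)) - W
    J = np.maximum(np.abs(C) - (gamma / beta) * D_W, 0.0) * np.sign(C)
    # c) C-update: T0 gradient steps on the smooth part of the loss.
    for _ in range(T0):
        grad = C + lam * Z.T @ (Z @ C - Z) + beta * (C - J)
        C -= lr * grad
        np.fill_diagonal(C, 0.0)        # keep diag(C) = 0

A = 0.5 * (np.abs(C) + np.abs(C.T))     # final affinity for spectral clustering
```

After the loop, A would be passed to spectral clustering as in step 5) of Algorithm 1.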

IV. EXPERIMENTAL EVALUATIONS
To evaluate the performance of our proposed ConvSCN-BD, we conduct experiments on two face benchmark datasets: the Extended Yale B and ORL, and one handwritten digit dataset: MNIST.
As baselines, we select the following algorithms. For subspace clustering in the original space: Low Rank Representation (LRR) [9], Low Rank Subspace Clustering (LRSC) [34], Sparse Subspace Clustering (SSC) [8], and Efficient Dense Subspace Clustering (EDSC) [35]. For subspace clustering in a latent space: Kernel Sparse Subspace Clustering (KSSC) [28], SSC with pre-trained convolutional auto-encoder features (AE+SSC), EDSC with pre-trained convolutional auto-encoder features (AE+EDSC), Subspace Clustering by Block Diagonal Representation (BDR) [18], and Deep Subspace Clustering Networks (DSC-Net) [24]. The results of these algorithms are reported in Tables 1 and 2. For the convolutional layers, the kernel stride is 2 in both the horizontal and vertical directions, and the non-linear activation is the Rectified Linear Unit (ReLU). In addition, we set the learning rate to 1.0 × 10⁻³ in all our experiments.

A. EXPERIMENTS ON EXTENDED YALE B
The Extended Yale B dataset [36] consists of 38 subjects with 2432 images in total. With approximately 64 frontal face images per subject taken under varying lighting conditions, the samples of each subject span a linear subspace of dimension up to nine [2]. In this experiment, we use a three-layer convolutional encoder and a three-layer deconvolutional decoder, with a self-expressive layer for the block diagonal affinity learning. The convolutional kernel sizes are 5-3-3. The experimental results are presented in Table 1: for n ∈ {10, 15, 20, 25, 30, 35, 38}, our ConvSCN-BD achieves a lower clustering error than all the baseline methods. In particular, as shown in Figure 2, as training proceeds, the coefficient matrix exhibits an increasingly clear block diagonal structure.
To show the convergence behavior, we report the clustering error and each part of the cost function during training on Extended Yale B with n = 10. As shown in Figure 3, the clustering error and the costs L, L_0, L_1, and L_2 decrease rapidly. In subfigure (d), there is a dramatic gap at the epoch where the clustering error becomes stable, which suggests a feasible criterion for stopping the training of our ConvSCN-BD.
To evaluate the sensitivity of the performance to the tradeoff parameters, we conduct experiments on Extended Yale B under varying parameter settings, and show the clustering accuracy with respect to the tradeoff parameters λ and β, respectively, in Figure 4. With λ in [0.2, 0.5] and β in [65, 67], our proposed ConvSCN-BD yields satisfactory performance.

B. EXPERIMENTS ON ORL
The ORL dataset [37] is more challenging for subspace clustering than the Extended Yale B dataset. It consists of 40 subjects with 400 face images in total, where each subject has 10 face images taken under varying lighting conditions, with different facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses).
In this experiment, each original face image is downsampled from 112 × 92 to 32 × 32. The kernel size in the convolutional layers is reduced to 3 × 3, and the channels are 3-3-5-5-3-3. For the tradeoff parameters, we set λ = 1 and β = 64 for our ConvSCN-BD. In the fine-tuning stage, we set T_0 = 10 and T_max = 940.
Experimental results are listed in Table 2. Due to the non-linearity and the small sample size per subject, the traditional linear subspace clustering methods suffer on this dataset. Benefiting from both the non-linear mapping and the powerful feature extraction of neural networks, methods based on convolutional layers obtain better performance, and our ConvSCN-BD yields the best results again.

C. EXPERIMENTS ON MNIST
The MNIST dataset contains grey-scale images of the handwritten digits 0-9, with 70000 images of size 28 × 28 in total. Since methods based on spectral clustering, such as SSC, LRR, KSSC, and DSC-Net, cannot be applied to the whole dataset, we use 10000 samples to evaluate the performance, in which each digit has 1000 randomly sampled images. For the network, we use six convolutional layers for feature extraction and a self-expressive layer for affinity learning. The convolutional kernel sizes are 5-3-3-3-3-5 and the channels are 10-20-30-30-20-10. For the hyper-parameters, we set λ = 1 and β = 16 for the intermediate term ‖C − J‖²_F, T_0 = 10, and T_max = 300.
We report the clustering results in Table 2. Methods based on convolutional neural networks, such as DSC-Net and our ConvSCN-BD, outperform the methods using the original samples. As can be seen, our ConvSCN-BD boosts the performance by incorporating the block diagonal prior and feature extraction for subspace clustering, and achieves the best results.

V. CONCLUSION
We have presented a trainable framework, called Convolutional Subspace Clustering Network with Block Diagonal prior (ConvSCN-BD), in which the convolutional feature extraction and the affinity learning are jointly optimized with the help of the block diagonal prior. Specifically, we solved the joint optimization problem with an alternating minimization algorithm that alternately updates the parameters of the convolutional module, the self-expression model, and the affinity matrix. By exploiting the block diagonal prior, we can learn a better affinity matrix from the self-expression coefficients and also feed useful information from spectral clustering back to the self-expression model. We conducted experiments on three benchmark image datasets and verified the effectiveness of our proposal.
As future work, we will explore more sophisticated strategies to exploit the feedback information from spectral clustering, more efficient ways to solve the optimization problem, and more applications of the proposed approach.