Multi-View Spectral Clustering With Incomplete Graphs

Traditional multi-view learning usually assumes that each instance appears in all views. However, in real-world applications, it is common that a number of instances have missing samples in some views. How to effectively cluster this kind of partial multi-view data has attracted much attention. In this paper, we propose an incomplete multi-view clustering method, namely Multi-view Spectral Clustering with Incomplete Graphs (MSCIG), which connects the processes of spectral embedding and similarity matrix completion to achieve better clustering performance. Specifically, MSCIG recovers the missing entries of each similarity matrix based on multiplications of a common representation matrix and the corresponding view-specific representation matrix, and in turn learns these representation matrices based on the completed similarity matrices. Besides, MSCIG adopts the $p$-th root integration strategy to incorporate the losses of multiple views, which characterizes the contributions of different views. Moreover, we develop an iterative algorithm with proved convergence to solve the resultant problem of MSCIG, which updates the common representation matrix, view-specific representation matrices, similarity matrices, and view weights alternately. We conduct extensive experiments on 9 benchmark datasets to compare the proposed algorithm with existing state-of-the-art incomplete multi-view clustering methods. Experimental results validate the effectiveness of the proposed algorithm.


I. INTRODUCTION
With the development of data collection techniques, in many practical applications such as image retrieval and cross-language document categorization, data appear in multiple modalities or naturally come from multiple sources; such data are named multi-view data. To deal with multi-view data effectively, multi-view learning has become a hot research area in recent decades [1]-[6]. A common assumption adopted by conventional multi-view learning methods is that each data point appears in all views. However, in real applications, it is not uncommon that a number of instances are missing some view representations, which results in partial multi-view data. For example, in web image retrieval, there may be images with no text descriptions, and due to invalid URLs, the images themselves may be inaccessible. Another example is cross-language document categorization, where it is often the case that a document is translated into several but not all languages. Since every view of this kind of data often suffers from missing samples, traditional single-view or multi-view clustering methods may fail to obtain the clustering results of all data points directly. Therefore, how to effectively cluster partial multi-view data has become an important research direction in recent years.

(The associate editor coordinating the review of this manuscript and approving it for publication was Corrado Mencar.)
Early research focused on the clustering of incomplete two-view data. The work in [7] requires at least one complete view, and proposes a method which constructs a full kernel on an incomplete view with the help of the complete view. To deal with data with two incomplete views, the work in [8] factorizes view-specific instances and complete instances into a common learned latent subspace to obtain their homogeneous and comparable representations. Based on [8], the algorithms proposed in [9], [10] introduce adaptive graph regularization terms to capture local structure in the common representation learning. The works proposed in [11]-[13] adopt a strategy similar to [8] to design incomplete multi-view clustering algorithms. One limitation of this strategy is that it requires each instance to be complete or to appear in only one view. However, for data with more than two incomplete views, it is common that a number of instances are present in more than one view but not in all views.
To cluster multi-view data with arbitrary incomplete views, several algorithms are designed based on weighted matrix factorization. The method proposed in [14] first fills in the missing samples with the average values of the present samples on each view independently, and then learns a common representation matrix based on weighted nonnegative matrix factorization, which assigns smaller weights to these filled samples. The approach proposed in [15] introduces a zero-one diagonal matrix for each view to eliminate the influence of its missing samples, and learns a common representation matrix based on semi-nonnegative matrix factorization. Moreover, the methods proposed in [16], [17] introduce zero-one weight matrices for multiple views to distinguish their present elements from missing elements, and thus they can deal with a more general incomplete problem. Although these matrix factorization-based methods adopt different norms, constraints, or regularization terms, they are essentially linear methods, which limits their ability to disclose the non-linear structure of data. To address this issue, several kernel-based and graph-based methods have been proposed. The work [18] recovers the missing entries of incomplete kernels by measuring both between-view and within-view relationships among kernel values. The work [19] extends multiple kernel k-means and considers mutual kernel completion to integrate kernel imputation and representation learning. The approach proposed in [20] first fills in the missing entries of graph matrices with the averages of the columns, and then uses a co-training strategy to recover the representations of the missing samples of each view and learn a common representation matrix. The method proposed in [21] jointly performs graph construction and common representation learning based on incomplete view representations. The work in [22] performs representation learning and clustering simultaneously with multiple incomplete similarity matrices.
The approach proposed in [23] first fills in the missing entries of the similarity matrices with the averages of the corresponding certain entries, and then learns view weights to combine them into a common graph matrix.
In this paper, we propose Multi-view Spectral Clustering with Incomplete Graphs (MSCIG) to handle the incomplete multi-view clustering problem. Different from previous graph-based methods [20], [23], which adopt separate steps to fill in the missing entries of similarity matrices and learn a common representation matrix, the proposed MSCIG aims to achieve better clustering performance by seamlessly integrating the processes of imputation and representation learning, so that the low-dimensional representations guide the imputation of the missing graph elements, and the completed graphs in turn influence the subsequent representation learning. Concretely, MSCIG employs the common representation matrix to take part in the imputation of the missing entries of the similarity matrices. To enable the imputation to explore the diverse information of the views, a view-specific representation matrix is introduced to help the imputation of the missing entries of each similarity matrix, and a regularization term is designed to maximize the alignment between the common representation matrix and each view-specific representation matrix. To characterize the contributions of multiple views, MSCIG adopts the p-th root integration strategy to incorporate the losses of multiple views. To solve the resultant non-convex problem of MSCIG, we introduce another problem with explicit view weight factors, and we propose an iterative and alternating algorithm for optimization. In the proposed algorithm, the representation learning, the imputation of the missing graph entries, and the view weight learning are performed alternately until convergence. Fig. 1 displays the flowchart of MSCIG. The clustering performance of the proposed algorithm is evaluated by a systematic experimental study on nine multi-view datasets. The experimental results indicate that our proposed algorithm achieves better clustering performance than the compared state-of-the-art incomplete multi-view clustering methods.
The rest of the paper is organized as follows. Section II introduces the background. Sections III and IV introduce the formulation and the optimization algorithm of the proposed MSCIG, respectively. Section V gives some analysis of the proposed algorithm. Experimental results on nine multi-view datasets are displayed in Section VI. Finally, we conclude the paper in Section VII.

II. BACKGROUND
Throughout the paper, matrices are written as boldface uppercase letters, and vectors as boldface lowercase letters. For a matrix M = [m_ij], its i-th row is denoted by m_i. The transpose, the Frobenius norm, and the trace of a matrix M are denoted by M^T, ||M||_F, and tr(M), respectively. The 2-norm of a vector m_i is denoted by ||m_i||.

A. PROBLEM SETTING
A dataset with n instances can be represented by X = [x_1; ...; x_n] ∈ R^{n×d}, where x_i ∈ R^{1×d} is the i-th instance. Suppose the instances are collected from V views, so each instance has V view samples, i.e., x_i = {x_i^(1), ..., x_i^(V)}. For an incomplete multi-view dataset, every view may suffer from missing samples. To distinguish the present samples from the missing samples of each view, we introduce an indicator matrix M ∈ {0,1}^{V×n}, whose (v,i)-th element m_vi is defined as follows:

m_vi = 1 if x_i^(v) is present, and m_vi = 0 otherwise.   (1)

Incomplete multi-view clustering aims to group {x_i}_{i=1}^n into C clusters by utilizing all available views.

B. CLASSIC NORMALIZED CUT AND MOTIVATION
Given a single-view dataset X = [x_1; ...; x_n] ∈ R^{n×d}, we can construct a graph similarity matrix W = [w_ij] ∈ R^{n×n} to measure the relationship between each pair of data points, where w_ij reflects the similarity between x_i and x_j. Based on W, the Laplacian matrix L ∈ R^{n×n} is calculated by L = D − W, where D is the diagonal degree matrix with d_ii = Σ_j w_ij. The classic normalized cut can be relaxed into the following spectral problem:

min_F tr(F^T D^{-1/2} L D^{-1/2} F)  s.t. F^T F = I_C,   (2)

where F ∈ R^{n×C} denotes the relaxed variable. Since L = D − W, problem (2) is equivalent to

min_F tr(F^T (I_n − D^{-1/2} W D^{-1/2}) F)  s.t. F^T F = I_C.   (3)

Problem (3) is further equivalent to

max_F tr(F^T A F)  s.t. F^T F = I_C,   (4)

where A ∈ R^{n×n} is the normalized similarity matrix, computed by A = D^{-1/2} W D^{-1/2}. In (4), the normalized similarity matrix A is complete. For an incomplete V-view dataset, we can construct {A^(v)}_{v=1}^V on each view independently, but each A^(v) may suffer from missing information. Inspired by the objective of (4), in the next section we propose a method which predicts the missing values of {A^(v)}_{v=1}^V and learns a common representation matrix F simultaneously.
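As a concrete illustration of the pipeline above, the following sketch builds a normalized similarity matrix and solves the relaxed trace problem (4). It assumes a simple Gaussian kernel for W, which is only a stand-in for the paper's actual similarity construction.

```python
import numpy as np

def normalized_similarity(X, sigma=1.0):
    """Build W with a Gaussian kernel, then normalize: A = D^{-1/2} W D^{-1/2}."""
    sq = np.sum(X**2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
    W = np.exp(-dist2 / (2 * sigma**2))
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def spectral_embedding(A, C):
    """Solve max_{F^T F = I_C} tr(F^T A F): the top-C eigenvectors of A."""
    vals, vecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    return vecs[:, -C:]              # eigenvectors of the C largest eigenvalues
```

The relaxed solution F is the basis of the dominant eigenspace of A; a final k-means on the rows of F recovers discrete clusters, as in standard spectral clustering.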

III. PROPOSED METHOD
To explore the complementary and diverse information of multiple views, we construct a graph similarity matrix on each view independently to reflect the similarity between each pair of view samples. The v-th view graph matrix is denoted by A^(v) = [a^(v)_ij] ∈ R^{n×n}, where a^(v)_ij measures the similarity between x_i^(v) and x_j^(v). Since the v-th view may suffer from missing samples, a^(v)_ij is computed as follows:

a^(v)_ij = sim(x_i^(v), x_j^(v)) if m_vi = m_vj = 1, and a^(v)_ij = NaN otherwise,   (5)

where sim(·, ·) can be a traditional similarity computation method such as [24], [25]. NaN denotes ''not a number'' and can be regarded as an invalid value, which means the information of a^(v)_ij is inaccessible. To eliminate the influence of NaN in the subsequent calculation, we define an indicator matrix Φ^(v) = [φ^(v)_ij] ∈ {0,1}^{n×n}, where φ^(v)_ij = 1 if a^(v)_ij ≠ NaN and φ^(v)_ij = 0 otherwise. The elements of A^(v) with φ^(v)_ij = 1 are certain, and the rest are uncertain.
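A minimal sketch of this partial similarity construction follows; it again assumes a Gaussian kernel as the sim(·,·) function and takes a 0/1 presence vector for the view. The returned mask plays the role of the indicator that separates certain entries from uncertain (NaN) ones.

```python
import numpy as np

def partial_similarity(Xv, present, sigma=1.0):
    """a_ij = sim(x_i, x_j) if both view samples are present, NaN otherwise.
    Returns A (with NaNs at uncertain entries) and the 0/1 certainty mask Phi."""
    sq = np.sum(Xv**2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Xv @ Xv.T, 0)
    A = np.exp(-dist2 / (2 * sigma**2))
    Phi = np.outer(present, present).astype(float)  # 1 iff both samples present
    A[Phi == 0] = np.nan
    return A, Phi
```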
To integrate the processes of similarity matrix completion and clustering, based on these incomplete similarity matrices {A^(v)}_{v=1}^V we aim to learn a cluster indicator matrix F = [f_1; ...; f_n] ∈ R^{n×C}, where f_i is the label vector of the i-th instance x_i, and its c-th element f_ic satisfies the following constraint:

f_ic = 1/sqrt(n^(c)) if x_i belongs to the c-th cluster, and f_ic = 0 otherwise,   (6)

where n^(c) is the number of instances belonging to the c-th cluster (c = 1, ..., C). Based on F, the similarity information can be reflected by FF^T. Therefore, the optimization problem can be written as

min_{F, {S^(v)}} Σ_{v=1}^V ||S^(v) − FF^T||_F^2
s.t. Φ^(v) ⊙ S^(v) = Φ^(v) ⊙ A^(v), S^(v) ≥ 0, F satisfies (6),   (7)

where Φ^(v) ∈ {0,1}^{n×n} distinguishes the certain elements of A^(v) from its uncertain elements, and ⊙ denotes the element-wise product between two matrices. The constraint Φ^(v) ⊙ S^(v) = Φ^(v) ⊙ A^(v) forces S^(v) to keep the certain elements of A^(v). However, solving the optimization problem (7) under such a constraint on F is NP-hard. Thus, we relax the constraint on F to F^T F = I_C, and the resultant optimization problem becomes

min_{F, {S^(v)}} Σ_{v=1}^V ||S^(v) − FF^T||_F^2
s.t. Φ^(v) ⊙ S^(v) = Φ^(v) ⊙ A^(v), S^(v) ≥ 0, F^T F = I_C.   (8)

The nonnegative constraint S^(v) ≥ 0 is imposed to ensure that S^(v) is a similarity matrix. Although (8) connects the processes of graph imputation and common representation learning, its performance can still be improved for the following reasons: 1) all uncertain elements of {S^(v)}_{v=1}^V are estimated based on the corresponding elements of FF^T, which limits the flexibility of {S^(v)}_{v=1}^V to exploit view-specific information; 2) different views are treated equally and play equally important roles in the learning of F, which may cause performance degeneration when there are unreliable views.
To design a more reasonable model, we introduce view-specific representation matrices {F^(v) ∈ R^{n×C}}_{v=1}^V, and define the loss of the v-th view as

J(S^(v), F^(v), F) = ||S^(v) − F^(v) F^T||_F^2 + λ R(F^(v), F),   (9)

where λ > 0 is a balance parameter and R(F^(v), F) is a regularization term which enforces F^(v) and F to approximate each other. In this paper, we utilize

R(F^(v), F) = ||F^(v) − F||_F^2.   (10)

For this regularization term, we give an explanation via the following proposition.
Proposition 1: Under the constraints (F^(v))^T F^(v) = I_C and F^T F = I_C, minimizing R(F^(v), F) is equivalent to maximizing tr((F^(v))^T F), which makes F^(v) and F maximally align with each other.

Proof: Since tr(I_C) = C is a constant, we have ||F^(v) − F||_F^2 = tr((F^(v))^T F^(v)) + tr(F^T F) − 2 tr((F^(v))^T F) = 2C − 2 tr((F^(v))^T F), and we complete the proof.

After proposing the loss functions of multiple views, a rough and simple integration strategy to obtain the multi-view formulation is to sum them up directly, i.e., Σ_{v=1}^V J(S^(v), F^(v), F). However, this strategy ignores the varying importance of the views. Since suitable view weights are usually difficult to predetermine, inspired by [26] we adopt the p-th root integration strategy with 0 < p ≤ 1 to generate the common objective function, i.e., Σ_{v=1}^V J(S^(v), F^(v), F)^p, which implicitly and automatically assigns suitable weights to the views according to their losses. As a result, the proposed MSCIG can be formulated as follows:

min_ϒ (Σ_{v=1}^V J(S^(v), F^(v), F)^p)^q
s.t. Φ^(v) ⊙ S^(v) = Φ^(v) ⊙ A^(v), S^(v) ≥ 0, (F^(v))^T F^(v) = I_C, F^T F = I_C,   (11)

where ϒ = {{S^(v), F^(v)}_{v=1}^V, F} collects all variables, and q = 1/p is a parameter. The model of MSCIG in (11) admits the following advantages: 1) the objective function well targets the ultimate goal, i.e., clustering, by integrating imputation and spectral embedding into a unified learning framework; 2) the formulation utilizes {F^(v)}_{v=1}^V to help F complete the uncertain elements of each S^(v), which enables {S^(v)}_{v=1}^V to capture view-specific information in their estimated elements; 3) the formulation weights different views automatically, which makes views with smaller losses play more important roles in the learning of the common representation matrix F; 4) the method does not require a complete view or an instance present in all views, which differs from some previous methods.
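The implicit weighting behavior of the p-th root strategy can be illustrated numerically. The closed form below (view weights proportional to the p-th power of the losses, with the effective weight given by a negative power when 0 < p < 1) follows the standard self-weighting derivation and is offered as an assumption where the paper's exact formula is not reproduced here.

```python
import numpy as np

def view_weights(losses, p):
    """Implicit self-weighting of the p-th root strategy (0 < p <= 1):
    alpha_v = J_v^p / sum_u J_u^p, and the effective weight is
    omega_v = alpha_v^gamma with gamma = (p - 1)/p, so when p < 1 the
    views with smaller losses receive larger weights."""
    J = np.asarray(losses, dtype=float)
    gamma = (p - 1.0) / p
    alpha = J**p / np.sum(J**p)
    omega = alpha**gamma if p < 1 else np.ones_like(J)
    return alpha, omega
```

For instance, with losses [1, 4] and p = 0.5, the first (smaller-loss) view receives a larger effective weight, while p = 1 reduces to the unweighted sum.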

IV. OPTIMIZATION ALGORITHM
When 0 < p < 1, the p-th root integration strategy makes problem (11) difficult to solve directly. In this paper, we obtain the solution of problem (11) by solving the following problem with a newly introduced variable vector α = [α_1, ..., α_V]^T:

min_Ω Σ_{v=1}^V α_v^γ J(S^(v), F^(v), F)
s.t. Φ^(v) ⊙ S^(v) = Φ^(v) ⊙ A^(v), S^(v) ≥ 0, (F^(v))^T F^(v) = I_C, F^T F = I_C, α ≥ 0, α^T 1_V = 1,   (12)

where Ω = {{S^(v), F^(v)}_{v=1}^V, F, α} collects all variables of problem (12), and γ = (p−1)/p is a parameter. In (12), the explicit v-th view weight factor can be calculated as ω^(v) = α_v^γ, which differs from (11), where no view weight factors are explicitly defined.

A. UPDATING RULES
To solve the problem (12), we design an effective algorithm that updates four groups of variables iteratively and alternately.
1) UPDATE {F^(v)}_{v=1}^V AND FIX OTHERS
With {S^(v)}_{v=1}^V, F, and α fixed, the subproblems w.r.t. {F^(v)}_{v=1}^V are separable, and thus each F^(v) can be updated by solving the following problem:

min_{F^(v)} ||S^(v) − F^(v) F^T||_F^2 + λ ||F^(v) − F||_F^2  s.t. (F^(v))^T F^(v) = I_C.   (13)

Considering the constraint (F^(v))^T F^(v) = I_C and removing the constant terms, problem (13) is equivalent to the following problem:

max_{F^(v)} tr((F^(v))^T B^(v))  s.t. (F^(v))^T F^(v) = I_C,   (14)

where B^(v) = S^(v) F + λ F. To solve problem (14), we introduce the following proposition to obtain its closed-form solution.

Proposition 2: Suppose the compact SVD of matrix B^(v) is B^(v) = U^(v) Σ^(v) (V^(v))^T. Then the optimal F^(v) for problem (14) is

F^(v) = U^(v) (V^(v))^T.   (15)

VOLUME 8, 2020
Proof: Let the full SVD of B^(v) be B^(v) = Ū^(v) Σ̄^(v) (V̄^(v))^T, where Ū^(v) ∈ R^{n×n}, Σ̄^(v) ∈ R^{n×C}, and V̄^(v) ∈ R^{C×C}. Therefore, the objective of (14) can be represented as

tr((F^(v))^T B^(v)) = tr(Z^(v) Σ̄^(v)) = Σ_{i=1}^C z^(v)_ii σ^(v)_ii,   (16)

where Z^(v) = (V̄^(v))^T (F^(v))^T Ū^(v), and σ^(v)_ii and z^(v)_ii are the (i,i)-th elements of Σ̄^(v) and Z^(v), respectively. Since Z^(v) (Z^(v))^T = I_C, each row of Z^(v) has unit norm, so |z^(v)_ii| ≤ 1. Since σ^(v)_ii is a singular value, σ^(v)_ii ≥ 0. As a result, we arrive at

tr((F^(v))^T B^(v)) ≤ Σ_{i=1}^C σ^(v)_ii.   (17)

The equality in (17) holds when all z^(v)_ii = 1. Therefore, the optimal F^(v) for problem (14) can be written as

F^(v) = Ū^(v) [I_C; 0] (V̄^(v))^T.   (18)

Since (18) is based upon the full SVD of B^(v), it can be rewritten as F^(v) = U^(v) (V^(v))^T based on the compact SVD, which completes the proof.

2) UPDATE {S^(v)}_{v=1}^V AND FIX OTHERS
With α, F, and {F^(v)}_{v=1}^V fixed, the relations among multiple views are decoupled. By removing the constant terms, each S^(v) can be updated individually by solving the following problem:

min_{S^(v)} ||S^(v) − F^(v) F^T||_F^2  s.t. Φ^(v) ⊙ S^(v) = Φ^(v) ⊙ A^(v), S^(v) ≥ 0.   (19)

To solve problem (19), we introduce a matrix E^(v) ∈ R^{n×n} calculated by E^(v) = max(F^(v) F^T, 0). Then the optimal solution to problem (19) can be written in the following form:

s^(v)_ij = a^(v)_ij if φ^(v)_ij = 1, and s^(v)_ij = e^(v)_ij otherwise.   (20)

From (20), we can see that the optimal S^(v) duplicates A^(v) on its certain entries, and the uncertain elements are imputed by the corresponding elements of E^(v).
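Both closed-form updates can be sketched in a few lines: `procrustes_update` solves the linear trace maximization of Proposition 2 via the compact SVD, and `impute_similarity` applies the split solution of (20) (certain entries kept, uncertain entries filled from the thresholded product).

```python
import numpy as np

def procrustes_update(B):
    """Optimal F for max_{F^T F = I_C} tr(F^T B): compact SVD of B, F = U V^T
    (Proposition 2)."""
    U, _, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ Vt

def impute_similarity(A, Phi, Fv, F):
    """Closed-form S update (20): keep the certain entries of A (mask Phi),
    fill the uncertain entries from E = max(F^(v) F^T, 0)."""
    E = np.maximum(Fv @ F.T, 0)
    return np.where(Phi == 1, A, E)
```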

3) UPDATE F AND FIX OTHERS
With {S^(v), F^(v)}_{v=1}^V and α fixed, by considering the constraint F^T F = I_C and removing the constant terms, problem (12) w.r.t. F can be transformed into

max_F tr(F^T B)  s.t. F^T F = I_C,   (21)

where B = Σ_{v=1}^V α_v^γ (S^(v) F^(v) + λ F^(v)). Let the compact SVD of B be B = U_C Σ_C V_C^T, with U_C ∈ R^{n×C}, Σ_C ∈ R^{C×C}, and V_C ∈ R^{C×C}. According to Proposition 2, the optimal F for problem (21) can be obtained by

F = U_C V_C^T.   (22)

4) UPDATE α AND FIX OTHERS
With {S^(v), F^(v)}_{v=1}^V and F fixed, we obtain the losses J(S^(v), F^(v), F) of the multiple views by (9) accordingly. Then problem (12) w.r.t. α can be rewritten in the following form:

min_α Σ_{v=1}^V α_v^γ J(S^(v), F^(v), F)  s.t. α ≥ 0, α^T 1_V = 1.   (23)

When γ = 0, i.e., p = 1, the p-th root integration strategy degenerates into the simple additive integration strategy, and we can set all α_v = 1/V. When γ < 0, i.e., 0 < p < 1, the Lagrangian function of problem (23) can be written as L_μ = Σ_v α_v^γ J(S^(v), F^(v), F) − μ(α^T 1_V − 1), where μ is the Lagrange multiplier. Setting the derivative of L_μ with respect to α_v to zero and combining the constraint α^T 1_V = 1, the closed-form solution of problem (23) can be written as follows:

α_v = J(S^(v), F^(v), F)^p / Σ_{u=1}^V J(S^(u), F^(u), F)^p.   (24)

According to the above four steps, we alternately update {F^(v)}_{v=1}^V, {S^(v)}_{v=1}^V, F, and α, and repeat these procedures iteratively until the objective of problem (12) converges. Since we adopt an alternating and iterative updating strategy, a reasonable initialization is important. We initialize all α_v = 1/V, which treats all views equally. For each S^(v), we initialize its element s^(v)_ij in the following way:

s^(v)_ij = a^(v)_ij if φ^(v)_ij = 1, and s^(v)_ij = θ_ij otherwise,   (25)

where θ_ij is initialized by the average of the corresponding certain elements of {a^(u)_ij}_{u=1}^V. Based on the initialized {S^(v)}_{v=1}^V, we obtain a fusion graph by adding them up directly, i.e., Σ_{v=1}^V S^(v), and then initialize F by solving the following problem:

max_F tr(F^T (Σ_{v=1}^V S^(v)) F)  s.t. F^T F = I_C.   (26)

In this paper, we solve problem (26) by an iterative algorithm proposed in [27]. The entire optimization procedure for problems (11) and (12) is summarized in Algorithm 1.
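Putting the four updates and the initialization together, a rough end-to-end sketch of the alternating scheme might look as follows. The update formulas here are one plausible reading of Algorithm 1 (with `lam` standing for λ and `p` for the integration parameter) under symmetric similarity inputs; this is not the authors' reference implementation.

```python
import numpy as np

def mscig(A_list, Phi_list, C, p=0.5, lam=0.01, n_iter=50):
    """Sketch of the alternating optimization. A_list: symmetric partial
    similarity matrices with NaNs at uncertain entries; Phi_list: 0/1 masks."""
    V, n = len(A_list), A_list[0].shape[0]
    gamma = (p - 1.0) / p
    alpha = np.full(V, 1.0 / V)                      # equal initial weights
    # Initialize S: certain entries from A, uncertain from cross-view averages.
    mean = np.nan_to_num(np.nanmean(np.array(A_list), axis=0))
    S = [np.where(Phi == 1, np.nan_to_num(A), mean)
         for A, Phi in zip(A_list, Phi_list)]
    # Initialize F from the fused graph (top-C eigenvectors).
    _, vecs = np.linalg.eigh(sum(S))
    F = vecs[:, -C:]
    Fv = [F.copy() for _ in range(V)]
    for _ in range(n_iter):
        w = alpha**gamma if p < 1 else np.ones(V)    # explicit view weights
        for v in range(V):                           # F^(v): Procrustes step
            U, _, Vt = np.linalg.svd(S[v] @ F + lam * F, full_matrices=False)
            Fv[v] = U @ Vt
        for v in range(V):                           # S^(v): impute uncertain entries
            E = np.maximum(Fv[v] @ F.T, 0)
            S[v] = np.where(Phi_list[v] == 1, np.nan_to_num(A_list[v]), E)
        B = sum(w[v] * (S[v] @ Fv[v] + lam * Fv[v]) for v in range(V))
        U, _, Vt = np.linalg.svd(B, full_matrices=False)
        F = U @ Vt                                   # common representation
        J = np.array([np.linalg.norm(S[v] - Fv[v] @ F.T)**2
                      + lam * np.linalg.norm(Fv[v] - F)**2 for v in range(V)])
        alpha = J**p / np.sum(J**p)                  # closed-form weight update
    return F, S, alpha
```

A final k-means on the rows of the returned F would yield the cluster assignments.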

V. ANALYSIS
In this section, we first analyze the convergence behavior of the proposed Algorithm 1, and then analyze its computational complexity.
A. CONVERGENCE BEHAVIOR
Proposition 3: Algorithm 1 will monotonically decrease the objective of problem (12) until it converges to a stationary point of problem (12).
Proof: Denote the updated F^(v), S^(v), F, and α in one iteration as F̃^(v), S̃^(v), F̃, and α̃, respectively. Since F̃^(v), S̃^(v), F̃, and α̃ are the optimal solutions to their corresponding subproblems, Algorithm 1 decreases the objective of problem (12) alternately and iteratively. Since the objective of problem (12) is lower bounded by 0, Algorithm 1 will converge. Denote the converged F^(v), S^(v), F, and α as F̂^(v), Ŝ^(v), F̂, and α̂, respectively. At convergence, {{F̂^(v), Ŝ^(v)}_{v=1}^V, F̂, α̂} satisfies the KKT conditions of problem (12), and thus it is a stationary point of problem (12).
Proposition 4: The sequence {{F^(v), S^(v)}_{v=1}^V, F} generated by Algorithm 1 will monotonically decrease the objective of problem (11) in each iteration, which makes the converged {{F̂^(v), Ŝ^(v)}_{v=1}^V, F̂} at least a stationary point of problem (11).
Proof: According to Proposition 3, we obtain the inequality Σ_v α_v^γ J(F̃^(v), S̃^(v), F̃) ≤ Σ_v α_v^γ J(F^(v), S^(v), F). Considering (24) and γ = (p−1)/p, we can conclude that

Σ_v J(F^(v), S^(v), F)^{p−1} J(F̃^(v), S̃^(v), F̃) ≤ Σ_v J(F^(v), S^(v), F)^p.   (27)

On the other hand, we define the function g(x) = x^p, whose supergradient is g'(x) = p x^{p−1}. When 0 < p < 1, g(x) is a concave function on its domain x > 0. Based on the definition of the supergradient of a concave function, the following inequality holds:

g(x̃) ≤ g(x) + g'(x)(x̃ − x).   (28)

Replacing x̃ and x with J(F̃^(v), S̃^(v), F̃) and J(F^(v), S^(v), F), respectively, we can infer that

J(F̃^(v), S̃^(v), F̃)^p ≤ J(F^(v), S^(v), F)^p + p J(F^(v), S^(v), F)^{p−1} (J(F̃^(v), S̃^(v), F̃) − J(F^(v), S^(v), F)).   (29)

Summing (29) over all views and applying (27), we arrive at

Σ_v J(F̃^(v), S̃^(v), F̃)^p ≤ Σ_v J(F^(v), S^(v), F)^p,   (30)

which shows that {{F^(v), S^(v)}_{v=1}^V, F} monotonically decreases the objective of problem (11).
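The supergradient inequality used in this proof can be checked numerically for g(x) = x^p with 0 < p < 1:

```python
import numpy as np

# For concave g(x) = x^p (0 < p < 1) on x > 0, the supergradient inequality
# g(xt) <= g(x) + g'(x) * (xt - x) holds for all x, xt > 0; this is the step
# that transfers the weighted decrease to the p-th power objective.
p = 0.5
g = lambda x: x**p
gprime = lambda x: p * x**(p - 1)
rng = np.random.default_rng(0)
for _ in range(1000):
    x, xt = rng.uniform(0.01, 10, size=2)
    assert g(xt) <= g(x) + gprime(x) * (xt - x) + 1e-12
```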
According to Proposition 3, ,F,α} is a stationary point of the problem (12). By replacing α v according to (24), where C collects the constraints of ϒ in (11). When 0 < p 1, the problem (31) and the problem (11) are equivalent.
Therefore, {{F̂^(v), Ŝ^(v)}_{v=1}^V, F̂} satisfies the KKT conditions of problem (11).

B. COMPUTATIONAL COMPLEXITY
In this subsection, we analyze the computational complexity of the proposed Algorithm 1.
In the initialization, the computational complexity of the construction of {A^(v)}_{v=1}^V is O(n^2 d), and the computational complexity of the initialization of F is O(n^2 Cτ), where τ is the number of iterations of the algorithm proposed in [27] used to solve (26).

In each iteration, the computational complexity to update {F^(v)}_{v=1}^V and {S^(v)}_{v=1}^V is upper bounded by O(n^2 CV); the computational complexity to update F is O(n^2 CV + C^3); and the computational complexity to update α is upper bounded by O(n^2 V + nCV).

Since C ≪ n, the overall computational complexity of Algorithm 1 is O(n^2 (d + Cτ + CVt)), where t is the number of iterations of the proposed Algorithm 1.

VI. EXPERIMENT
In this section, we conduct experiments to evaluate the performance of the proposed algorithm. First, we evaluate the effectiveness of MSCIG by comparing its clustering performance with several baselines. Then, we present experimental results on the convergence behavior. Finally, we study the parameter sensitivity.

1) DATASETS
• 3Sourse includes 416 news stories collected from three online news sources: BBC, Reuters, and The Guardian.
The stories belong to 6 classes: 104 business stories, 60 entertainment stories, 54 health stories, 49 politics stories, 89 sport stories, and 60 tech stories. Since some news stories may not be reported by all sources, the three views have 352, 302, and 294 present samples, respectively.
• BBCSport: in [29], each raw document is split into 1-3 segments by merging consecutive paragraphs, and this process ensures that each segment has at least 200 words. Each segment is assigned to at most one view, and the three views have 519, 531, and 513 present samples, respectively.
• BBC consists of 2225 news documents [28] belonging to 5 classes: 510 business documents, 386 entertainment documents, 417 politics documents, 511 sport documents, and 401 tech documents. In [29], each raw document is split into 1-3 segments, and each segment is assigned to one view. Its three views have 1828, 1832, and 1845 present samples, respectively.
• Forest includes multi-temporal remote sensing data of a forested area [31]. It is composed of 523 data points belonging to 4 classes, i.e., 'Sugi' forest, 'Hinoki' forest, 'Mixed deciduous' forest, and 'Other' non-forest land. Each data point has 9 ASTER image band features and 18 predicted spectral value features.

For the six complete datasets (Dermatology, Forest, WebKB, Yale, MSRC-v1, and Digits), we randomly remove view samples x_i^(v) with a given probability, and we ensure that each data point x_i has at least one view sample x_i^(v) remaining. In the experiments, we tune the probability from 10% to 50% with a step of 10% on these six datasets, and we regard the probability as the incomplete example ratio (IER) of all views of the dataset.

2) BASELINES AND EXPERIMENTAL ENVIRONMENT
In the experiments, we compare the proposed MSCIG with six state-of-the-art methods: Multiple Incomplete views Clustering (MIC) [14], Multi-view Learning with Incomplete Views (MVL-IV) [16], Incomplete Multi-modality Grouping (IMG) [9], Doubly Aligned Incomplete Multi-view Clustering (DAIMC) [15], Incomplete Multiple Kernel K-means Algorithm with Mutual Kernel Completion (IMKK-MKC) [19], and Perturbation-oriented Incomplete multi-view Clustering (PIC) [23]. Since IMG is originally designed for incomplete two-view data, we extend it by [22] so that the extended version can deal with data with arbitrary incomplete views. For a fair comparison, all compared methods apply k-means on their corresponding common representations to extract the clustering results. All experiments are conducted in MATLAB 2016a on a laptop with an Intel(R) Core(TM) i7-7500U CPU (2.9 GHz) and 8.0 GB RAM, running the Windows 10 operating system.

3) PARAMETER DETERMINATION
In the experiments, we determine all hyper-parameters by grid search, and record the clustering results with the best tuned parameters. For the baselines, we download the source codes from the authors' websites and determine the search ranges of their parameters according to the corresponding papers. For the proposed MSCIG, we construct the partial similarity matrices based on [24], tune the parameter p from 0.1 to 0.9 with a step of 0.2, and tune the parameter λ in the range of {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1}. In the construction of the partial similarity matrices, we set the neighbor number to 15 for 3Sourse, BBC, and BBCSport, and we fix the neighbor number to 5 for Dermatology, Forest, WebKB, Yale, MSRC-v1, and Digits. For all compared algorithms which adopt an iterative optimization strategy, we adopt a stopping criterion defined on the relative change of f(t), the objective value in the t-th iteration.

TABLE 1. ACC (%) and NMI (%) comparisons on 3 datasets. STD (%) is in the parentheses. The first highest score is in bold. Symbols '•/ /•' denote that MSCIG is better/tied/worse than the corresponding method by the paired t-test with confidence level 0.05, respectively.
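For concreteness, a relative-change stopping test of the kind described might be implemented as follows; the tolerance 1e-4 is an assumed value, since the paper's exact threshold is not reproduced here.

```python
def converged(f_prev, f_curr, tol=1e-4):
    """Relative-change stopping test on successive objective values f(t-1), f(t).
    The threshold `tol` is an assumed value, not taken from the paper."""
    return abs(f_prev - f_curr) / max(abs(f_curr), 1e-12) <= tol
```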

4) EVALUATION METRIC
The clustering performance is measured in terms of accuracy (ACC) and normalized mutual information (NMI). On the datasets 3Sourse, BBC, and BBCSport, we run each compared method 5 times independently. On each of Dermatology, Forest, WebKB, Yale, MSRC-v1, and Digits, we create incomplete datasets with different IERs and repeat each setting 5 times independently. We report the average result with the standard deviation (STD). Table 1 displays the clustering results of the seven compared methods with respect to both ACC and NMI on the three incomplete multi-view datasets. Table 2 and Table 3 show the ACC and NMI comparisons of all seven compared methods on the six datasets in the incomplete multi-view setting with respect to different incomplete example ratios. According to the results, we have the following observations. As shown by the clustering results in Table 1 and the win/tie/loss counts in the last rows of Table 2 and Table 3, the proposed MSCIG consistently achieves performance better than or comparable to that of the other methods in terms of both ACC and NMI. This may be because the integration of missing element imputation and representation learning enables the completed similarity matrices to measure the relationships between pairs of view samples more accurately, and the common representation matrix to better reflect the underlying clustering structure of the data.

C. CLUSTERING PERFORMANCE
MIC achieves the worst results on the datasets 3Sourse, BBCSport, and BBC, and its performance degenerates more significantly than that of the other methods as the IER increases on the datasets MSRC-v1, Digits, Yale, and Forest. This might be because MIC simply utilizes the feature average of each view to fill the missing samples, which leads to a deviation, especially when the IER is large.
Comparing the performance of the matrix factorization-based methods MVL-IV, IMG, and DAIMC, each of them achieves better results on several datasets but performs worse on others. This can be attributed to the fact that these methods adopt different norms, regularization terms, and constraints, which makes them good at clustering some datasets but bad at grouping others.
PIC achieves results comparable to MSCIG in a few cases but performs significantly worse on some datasets such as BBCSport, BBC, and WebKB, indicating that using the average of certain elements to fill the corresponding missing entries is not always an advisable way to deal with incomplete similarity matrices. On the datasets BBCSport, BBC, and Yale, IMKK-MKC achieves better performance than the matrix factorization-based methods, and the possible reason is that IMKK-MKC can utilize non-linear information. On the other hand, IMKK-MKC achieves the worst performance on Digits in some cases, and this might be because the construction of its kernels ignores the local structure of the data.
With the increase of IER, each method tends to achieve worse clustering performance in most cases. An exception is the clustering results on WebKB. Since both views of WebKB are sparse, the lack of certain samples may make the methods fail to achieve reasonable clustering results. This also possibly explains why the STDs of MVL-IV, PIC, and MSCIG are high in some cases.

D. CONVERGENCE ANALYSIS
In Section V-A, we have proved that the proposed Algorithm 1 will monotonically decrease the objective values of both (11) and (12) in each iteration until convergence.
In this subsection, we verify the convergence behavior of Algorithm 1 by conducting experiments on the datasets 3Sourse, BBCSport, Dermatology, and Forest. On Dermatology and Forest, we set IER = 30%. For MSCIG, we fix p = 0.9 and λ = 10^{-2}. We plot the curves of the objective values of (11) and (12) in Fig. 2. As we can see, on these four datasets, both the objective values of (11) and (12) decrease as the iteration round increases and converge to fixed values within a few iterations.

TABLE 2. ACC (%) comparisons on 6 datasets with different IERs. STD (%) is in the parentheses. The first highest score is in bold. Symbols '•/ /•' denote that MSCIG is better/tied/worse than the corresponding method by the paired t-test with confidence level 0.05, respectively. The win/tie/loss counts are reported in the last row.
E. PARAMETER SENSITIVITY
As seen from the results, we have the following observations. On 3Sourse, MSCIG achieves acceptable performance over a wide range of parameters. On the other three datasets, varying λ has a larger effect on the performance of MSCIG than varying p. Besides, MSCIG has different optimal parameters on the 4 datasets, indicating that the optimal parameters are data dependent. These 4 datasets have different optimal combinations of p and λ because their data characteristics are different.

F. SUMMARY OF EXPERIMENTAL RESULTS
In this section, we verify the effectiveness of the proposed MSCIG by comparing it with six state-of-the-art incomplete multi-view clustering methods. The experiments are conducted on nine datasets, and both the ACC and NMI results demonstrate the advantage of MSCIG. Besides, we plot the objective value curves of (11) and (12) on four datasets. The results show that Algorithm 1 monotonically decreases the objective values of both (11) and (12) until convergence and converges quickly. Moreover, we conduct experiments on four datasets to study the influence of the parameters on the performance of MSCIG. From the results, we can see that the optimal parameters are data dependent, and how to determine them remains an open problem.

TABLE 3. NMI (%) comparisons on 6 datasets with different IERs. STD (%) is in the parentheses. The first highest score is in bold. Symbols '•/ /•' denote that MSCIG is better/tied/worse than the corresponding method by the paired t-test with confidence level 0.05, respectively. The win/tie/loss counts are reported in the last row.

VII. CONCLUSION
In this paper, we propose MSCIG to cluster incomplete multi-view data, which seamlessly integrates the processes of graph imputation and spectral embedding to achieve better clustering performance. To solve the resultant problem of MSCIG, we design an optimization algorithm with proved convergence, which updates the common representation matrix, view-specific representation matrices, similarity matrices, and view weights alternately and iteratively. The proposed MSCIG is evaluated on nine real-world datasets, and the experimental results demonstrate its effectiveness. Inspired by [34], in the future we plan to design a new regularization term on the view-specific representations to accommodate the inconsistency among views and further improve the performance of MSCIG. We also want to improve the proposed method by taking the correlations among similarity matrices into account. Besides, extending the proposed MSCIG to incomplete multi-view semi-supervised classification is an interesting task.

DONGYUN YI received the B.S. degree from Nankai University, Tianjin, China, and the M.S. and Ph.D. degrees from the National University of Defense Technology, Changsha, China. He was a Visiting Researcher with the University of Warwick, Coventry, U.K., in 2008. He is currently a Professor with the College of Science, National University of Defense Technology. His current research interests include statistics, systems science, and data mining.