Fast Multi-view Clustering via Ensembles: Towards Scalability, Superiority, and Simplicity

Despite significant progress, there remain three limitations to the previous multi-view clustering algorithms. First, they often suffer from high computational complexity, restricting their feasibility for large-scale datasets. Second, they typically fuse multi-view information via one-stage fusion, neglecting the possibilities in multi-stage fusions. Third, dataset-specific hyperparameter-tuning is frequently required, further undermining their practicability. In light of this, we propose a fast multi-view clustering via ensembles (FastMICE) approach. Particularly, the concept of random view groups is presented to capture the versatile view-wise relationships, through which the hybrid early-late fusion strategy is designed to enable efficient multi-stage fusions. With multiple views extended to many view groups, three levels of diversity (w.r.t. features, anchors, and neighbors, respectively) are jointly leveraged for constructing the view-sharing bipartite graphs in the early-stage fusion. Then, a set of diversified base clusterings for different view groups are obtained via fast graph partitioning, which are further formulated into a unified bipartite graph for final clustering in the late-stage fusion. Notably, FastMICE has almost linear time and space complexity, and is free of dataset-specific tuning. Experiments on 22 multi-view datasets demonstrate its advantages in scalability (for extremely large datasets), superiority (in clustering performance), and simplicity (to be applied) over the state-of-the-art. Code available: https://github.com/huangdonghere/FastMICE.


INTRODUCTION
Clustering analysis has been a fundamental yet challenging research topic in knowledge discovery and data mining [1]. It aims to partition a set of data samples into a certain number of homogeneous groups, each of which is referred to as a cluster. Among the various subtopics in clustering analysis, multi-view clustering (MVC) has recently gained a considerable amount of attention due to its advantage in fusing common and complementary information from multiple views (or data sources) to enhance the clustering performance [2]. Despite the proposals of many MVC algorithms, there are still three crucial questions (with regard to scalability, information fusion, and hyperparameter tuning, respectively) that remain to be addressed.
First, how to enable MVC for very large-scale datasets? Though a large quantity of MVC algorithms have been developed in recent years, the high computational complexity is still a major hurdle for many of them to be applied in large-scale scenarios. In previous MVC works, there are several often-encountered complexity bottlenecks, such as affinity graph construction, graph partitioning, and some other expensive matrix computations. The affinity graph construction is a basic step in many MVC algorithms, which formulates the sample-wise relationship by computing an N × N affinity matrix and generally takes O(N²d) time and O(N²) space, where N is the number of samples and d is the dimension. The graph partitioning (typically by spectral clustering) is another computationally expensive step in many MVC algorithms [3], [4], [5], [6], [7], [8], which often requires singular value decomposition (SVD) and takes O(N³) time and O(N²) space. In particular, spectral clustering serves as a key partitioning step in many MVC algorithms, such as multi-view spectral clustering [3], [8], multi-view subspace clustering [6], [9], and multi-view graph learning [4], [5], which, together with some other expensive matrix computations, contributes to the O(N³) complexity bottleneck in these MVC algorithms [3], [4], [5], [6], [7], [8].
Second, in which stage should multi-view information be fused? An essential task of MVC is to fuse the information from multiple views for robust clustering. The difference in various MVC algorithms is typically reflected by in which stage and how they conduct the multi-view fusion. A naive strategy is to directly concatenate the features from multiple views and then perform some single-view clustering algorithm on the concatenated features, which in practice is rarely adopted as it neglects the rich and complementary information across multiple views. In the MVC literature, the most widely-adopted strategy may be the early fusion [3], [8], [9], [10], which typically fuses the multi-view information in a unified clustering model via optimization or some heuristics (as illustrated in Fig. 1(a)). Besides the early fusion, another popular strategy is the late fusion [6], [11], [12], [13], which first obtains multiple base clusterings by performing the clustering process on each view separately and then fuses these base clusterings into a more robust clustering result in the final stage (as illustrated in Fig. 1(b)). While most of the existing MVC algorithms adopt either early fusion or late fusion, it is surprising that few of them have gone beyond the single-stage fusion to explore the rich possibilities and potential benefits hidden in the multi-stage fusion formulation.
Third, is the dataset-specific fine-tuning necessary? The regularization hyperparameters or some other types of hyperparameters are often involved in previous MVC algorithms to adjust the influences of different terms (or components) [2], where dataset-specific fine-tuning is frequently required to seek the proper values of these hyperparameters in a probably extensive trial-and-error manner. However, unlike supervised or semi-supervised learning [14], [15], in the unsupervised situations it may be arguable whether the ground-truth labels can be used for guiding the fine-tuning process. Without the fine-tuning guided by partial or even all ground-truth labels, the practicality of these MVC algorithms may be significantly weakened. Moreover, when the number of the tuning-intensive hyperparameters goes to three or even more, their tuning costs (typically via grid-search) might become very expensive on large datasets (as reported in Table 7), which gives rise to the critical question of whether the dataset-specific tuning can be eliminated while maintaining robust clustering performance.
More recently, several efforts have been made to deal with some of the above issues. In single-view clustering, it has proved to be an effective strategy to represent large-scale data samples via a set of anchors (also known as landmarks or representatives) [16], [17], which can substantially facilitate the computation of the graph construction and partitioning for large-scale datasets. When it goes from single-view to multi-view, the anchor-based formulation still shows its promising ability [18], [19], [20], [21], but also faces a series of new challenges, ranging from multi-view anchor selection to multi-view information fusion. Typically, Li et al. [18] selected a set of anchors by performing k-means on the concatenated multi-view features. With this unified anchor set, a bipartite graph is built between data samples and anchors on each view, and then multiple bipartite graphs are fused into a consensus graph for the final clustering [18]. Wang et al. [21] optimized a unified loss to learn a set of unified anchors and a bipartite graph for all views. However, considering the diverse characteristics of multiple views, a single set of unified anchors may not sufficiently capture the rich and complementary information of all views. Different from the pursuit of a single anchor set for all views [18], [21], Kang et al. [19] learned a set of anchors on each view, and then built a bipartite graph on each view separately. Then the multiple bipartite graphs are combined into a unified graph for final clustering. However, since each anchor set on a view is learned independently of other views, the cross-view information is inherently neglected in the construction of each single-view graph, which may lead to a degraded capacity of multi-view expressiveness. Despite the progress, these methods [18], [19], [20], [21] either jointly construct a single anchor set for all views [18], [20], [21] or separately construct a single anchor set for each view [19], which, however, neglects the possibilities hidden between single and all, and dwells in the single-stage fusion strategy without the ability to capture the view-wise relationships in multiple stages. Furthermore, the requirement of dataset-specific hyperparameter-tuning in many of them [18], [19], [20], [22] also poses a practical hurdle for their real-world applications.

Fig. 2. Illustration of the hybrid early-late fusion strategy on an example of three views. With the benefits brought in by the random view groups, the key questions arise as to how to diversify them, how to fuse them, and especially how to ensure the clustering robustness while maintaining the scalability for extremely large datasets.
To jointly address the above-mentioned issues, in this paper, we propose a fast multi-view clustering via ensembles (FastMICE) approach. Different from previous MVC approaches that tend to work at either single views or all views in each stage, this paper first presents the concept of random view groups, which serves as the basic form of our flexible view organizations. Specifically, with each view group consisting of a random number of view members, the multiple views can be extended to many random view groups for investigating the view-wise relationships in a diversified manner. Based on random view groups, a hybrid early-late fusion strategy is devised to enable efficient and robust fusions at multiple stages (as shown in Fig. 2). In the early stage, multiple fusions are simultaneously performed in multiple view groups, where three levels of diversity, namely, feature-level diversity, anchor-level diversity, and neighborhood-level diversity, are jointly leveraged to construct a set of view-sharing bipartite graphs. By efficient bipartite graph partitioning, a set of diversified base clusterings are obtained, which are further utilized to construct a unified bipartite graph for achieving the final clustering in the late-stage fusion. It is noteworthy that our FastMICE approach has almost linear time and space complexity, and is capable of producing high-quality clustering results without dataset-specific tuning. Extensive experiments are conducted on 22 real-world multi-view datasets, including 10 general-scale datasets and 12 large-scale datasets, which demonstrate the scalability (for extremely large datasets), the superiority (in clustering performance), and the simplicity (to be applied) of our approach. For clarity, the contributions of this work are summarized below.
• This paper for the first time, to our knowledge, presents the concept of random view group to capture the versatile view-wise relationships, which extends multiple views to many random view groups and may significantly benefit the clustering robustness while maintaining high efficiency.
• A hybrid early-late fusion strategy is devised, which breaks through the conventional single-stage fusion paradigm and enables the multi-stage fusions to jointly explore different levels of the multi-view information.
• A novel large-scale MVC approach termed FastMICE is proposed, whose advantages are three-fold: (i) it has almost linear time and space complexity and is feasible for very large-scale datasets; (ii) it is able to achieve superior clustering performance over the state-of-the-art approaches as confirmed by extensive experimental results; (iii) it is simple to apply, where no dataset-specific hyperparameter-tuning is required across various multi-view datasets.
The remainder of the paper is organized as follows. Section 2 reviews the related works on multi-view clustering and ensemble clustering. Section 3 describes the overall framework of our FastMICE approach. Section 4 reports the experimental results. Finally, Section 5 concludes this paper.

RELATED WORK
In this paper, we propose a new large-scale MVC approach termed FastMICE, which involves both MVC and ensemble clustering (EC). In this section, the related works on MVC and EC will be reviewed in Sections 2.1 and 2.2, respectively.

Multi-view Clustering
In recent years, many MVC methods have been developed from different technical perspectives [2]. In spite of the difference in their specific models, they typically share a common and essential task, that is, how to fuse the information from multiple views. A straightforward strategy is to concatenate the features from all views and then perform single-view clustering on the concatenated features, which, however, ignores the multi-view complementarity and is rarely adopted. Besides feature concatenation, according to their fusion stage, most of the existing MVC methods can be classified into two categories, i.e., the early fusion methods [3], [9], [10] and the late fusion methods [6], [11], [12].
Early fusion is probably the most widely-adopted fusion strategy in MVC [3], [8], [9], [10], which typically formulates the information of all views in a unified optimization or heuristic model (as shown in Fig. 1(a)). In early fusion, the information of each view can be given via different representations, such as the original features [9], transition probability matrix [3], K-nearest-neighbor (K-NN) graph [10], and so forth. Xia et al. [3] constructed a shared low-rank transition probability matrix by exploiting multiple transition probability matrices from multiple views, and then performed spectral clustering on the shared matrix for final clustering. Zhang et al. [9] conducted multi-view subspace clustering by minimizing a self-expressive loss on each view with the tensorized low-rank constraint. Xie et al. [23] extended the tensorized multi-view subspace clustering by further incorporating a local structure constraint. Liang et al. [5], [10] performed graph fusion on multiple K-NN graphs from multiple views with cross-view consistency and inconsistency jointly modeled.
Late fusion is another popular fusion strategy in recent years [6], [11], [12], [13], which first builds a base clustering on each view (often separately) and then fuses them at the partition-level to obtain a consensus clustering (as shown in Fig. 1(b)). Wang et al. [12] built the base clusterings by performing kernel k-means clustering on each view, and learned a consensus clustering by maximizing alignment between the consensus clustering and the base clusterings. Kang et al. [6] performed spectral clustering on the subspace representation of each view to obtain a corresponding base clustering, and learned a consensus clustering by minimizing the distance between a unified cluster indicator matrix and the multiple base clusterings [6]. These MVC methods [3], [6], [8], [9], [10], [11], [12], [13] seek to fuse the multi-view information in different stages and through different techniques. Yet surprisingly, most of them perform the fusion in a single-stage manner (either in the early stage or in the late stage), which lack the ability to go beyond the single-stage fusion to investigate more possibilities in multi-stage fusions. Besides the limitation in their fusion strategy, another limitation is that many of them still suffer from quadratic or cubic computational complexity, which makes them almost infeasible for large-scale datasets. Recently some large-scale MVC methods have been proposed, among which the anchor-based methods have been one of the representative categories [18], [19], [20], [21]. However, in terms of anchor selection, these anchor-based methods either learn a unified anchor set for all views [18], [20], [21] or learn a separate anchor set for each view [19]. In terms of fusion stage, they still dwell in the single-stage fusion strategy.
Moreover, for most of the previous MVC methods, including the general-scale methods [3], [6], [8], [9], [10], [11], [12] and the large-scale methods [18], [19], [20], [22], their requirement of datasetspecific hyperparameter-tuning also poses a major hurdle for their real-world applications.

Ensemble Clustering
The purpose of EC is to combine multiple base clusterings into a better and more robust consensus clustering [24], [25], [26], [27], [28], [29], [30], [31]. In the final stage of our FastMICE approach, multiple base clusterings are fused into a unified clustering result, which can be viewed as an EC process. Therefore, in this section, we will also review the related works on EC.
Previous EC methods can mostly be classified into three categories, i.e., the pair-wise co-occurrence based methods [24], [25], the median partition based methods [26], [27], and the graph partitioning based methods [17], [29], [30], [32]. The pair-wise co-occurrence based methods typically construct a co-association matrix by considering the pair-wise co-occurrence relationship in base clusterings, and then perform some clustering algorithm on the co-association matrix to obtain the consensus clustering. Fred and Jain [24] proposed the evidence accumulation clustering method which imposes hierarchical agglomerative clustering on the co-association matrix. Huang et al. [25] refined the co-association matrix by an entropy based local weighting strategy and presented the locally weighted evidence accumulation method. The median partition based methods treat the EC problem as an optimization problem, which aims to find a median clustering by maximizing the similarity between this clustering and the base clusterings. Topchy et al. [26] cast the EC problem as a maximum-likelihood problem and solved it via the EM algorithm. Huang et al. [27] formulated the EC problem as a binary linear programming problem and solved it via the factor graph model. The graph partitioning based methods represent the multiple base clusterings as a graph structure and obtain the consensus clustering by partitioning this graph. Strehl and Ghosh [32] considered the concept of hyper-edge and presented three graph partitioning algorithms. Ren et al. [29] took into account the importance of the objects and devised three graph partitioning based consensus functions for weighted-object ensemble clustering.
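To make the pair-wise co-occurrence idea concrete, the following Python sketch (our own illustration, not the exact implementation of [24] or [25]) builds a co-association matrix from a list of base clusterings, where entry (i, j) is the fraction of base clusterings that place samples i and j in the same cluster:

```python
import numpy as np

def co_association(base_clusterings):
    """Evidence-accumulation style co-association matrix: entry (i, j) is the
    fraction of base clusterings that put samples i and j in the same cluster."""
    base = np.asarray(base_clusterings)            # shape (M, N): M label vectors
    M, N = base.shape
    A = np.zeros((N, N))
    for labels in base:
        A += (labels[:, None] == labels[None, :])  # 1 where i and j co-clustered
    return A / M
```

Hierarchical agglomerative clustering (as in evidence accumulation clustering [24]) can then be applied with 1 − A serving as a distance matrix.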
Despite the progress of these EC works [24], [25], [26], [27], [28], [29], [30], [31], most of them are devised for single-view datasets and lack the consideration of multi-view scenarios. Recently Tao et al. [33] proposed a multi-view ensemble clustering (MVEC) method, which learns a consensus clustering from the multiple co-association matrices built in multiple views with low-rank and sparse constraints. Tao et al. [7] further incorporated marginalized denoising autoencoders into MVEC, and presented a marginalized multi-view ensemble clustering (M²VEC) method. However, in MVEC and M²VEC, the base clusterings in different views are generated separately, without leveraging the multi-view complementarity in their ensemble generation. Moreover, the high computational complexity also restricts their applications in large-scale scenarios.

PROPOSED FRAMEWORK
In this section, we describe the proposed FastMICE approach in detail. Specifically, the notations are introduced in Section 3.1. The formation of random view groups is provided in Section 3.2. The view-sharing bipartite graph is described in Section 3.3. The generation of the diversified base clusterings is introduced in Section 3.4. Then the base clusterings are fused into a unified clustering via a highly efficient consensus function in Section 3.5. The time and space complexity of FastMICE is analyzed in Section 3.6.

TABLE 1: Summary of notations.
View_v : The v-th view in the dataset
X_v ∈ R^(N×d_v) : Data matrix associated with View_v
VG : The set of view groups
M : # of view groups
VG^(m) : The m-th view group
p : # of anchors for each view group
p̄^(m) : # of anchors for a view member
K : # of nearest neighbors for each view group
K̄^(m) : # of nearest neighbors for a view member
G^(m) : The view-sharing bipartite graph for VG^(m)
π^(m) : The m-th base clustering in Π
Π : The set of base clusterings
C : The set of all clusters in Π
C_i : The i-th cluster in C
k_c : # of clusters in C
G : A bipartite graph between X and C
B : Cross-affinity matrix of graph G
G_s : A small graph with C as the node set
E_s : Affinity matrix of graph G_s

Notations
For a multi-view dataset, each data sample can be represented by features from different views. Thus, the multi-view dataset can be denoted as X = {X_1, X_2, · · · , X_V}, where X_v ∈ R^(N×d_v) is the data matrix of the v-th view, V is the number of views, and d_v is the dimension of the v-th view. For convenience of the later view group formation, we denote the set of V views as V = {View_1, View_2, · · · , View_V}. Note that X_v ∈ R^(N×d_v) is the data matrix associated with the v-th view (i.e., View_v). For clarity, the notations used throughout the paper are given in Table 1.
The purpose of MVC is to fuse the information of multiple views for enhanced clustering. In large-scale multi-view scenarios, where the data size N can be very large, it becomes a critical challenge how to robustly fuse the multi-view information while ensuring high efficiency and practicality, which will be the focus of our following sections.

Early-Stage View Group Formation
An essential task of MVC is to fuse the information of multiple views for a robust clustering result. The previous MVC algorithms differ from each other mainly in their fusion stages and fusion techniques [6], [8], [10], [12], [13], but they generally have two characteristics in common. First, they tend to perform the multi-view fusion in a single stage, either early or late. Second, they often implicitly comply with a single-or-all paradigm, where each stage of them involves either a single view or all views. For example, in the late fusion algorithms [6], [12], [13], each base clustering is constructed on a single view independently of other views, while the final fusion process utilizes the base clusterings from all views in a one-shot manner. However, the vast middle ground between single-views-independently and all-views-together has rarely been explored by previous works.
Different from the conventional single-or-all paradigm, in this section, we present the concept of random view groups, each of which encapsulates a random number of views and serves as a basic unit for the view-wise diversification and the hybrid early-late fusion in our FastMICE framework. Formally, let VG^(m) be the m-th view group, with a randomly selected subset of views, and V^(m) be the number of selected views in VG^(m). The number of views in each view group can be randomly chosen in the range of [V_min, V_max], where V_min and V_max are respectively the lower bound and the upper bound of the number of selected views such that 1 ≤ V_min ≤ V_max ≤ V. Each selected view in the view group is called a view member. To enhance the diversity of view groups, we set the lower bound V_min = 1 and the upper bound V_max = V, through which each view group can have at least one view member and at most V view members. Thus, the m-th view group VG^(m) can be formed by randomly selecting V^(m) views from the set of all views, denoted as VG^(m) = {View^(m)_1, View^(m)_2, · · · , View^(m)_{V^(m)}}, where View^(m)_v is the v-th view member and d^(m)_v is the dimension of this view member. By performing the randomization process repeatedly, a set of random view groups can be obtained, that is, VG = {VG^(1), VG^(2), · · · , VG^(M)}, where M is the number of the generated random view groups. As each view group leads to the generation of a base clustering, the number of view groups is also the number of base clusterings, also known as the ensemble size.
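The view group formation described above can be sketched as follows (the function name is our own illustrative choice; the paper's default bounds V_min = 1 and V_max = V are used as defaults):

```python
import random

def generate_view_groups(num_views, M, v_min=1, v_max=None, seed=None):
    """Form M random view groups, each containing a random number of distinct
    view members drawn without replacement from the V available views."""
    if v_max is None:
        v_max = num_views  # default bounds V_min = 1, V_max = V
    rng = random.Random(seed)
    groups = []
    for _ in range(M):
        size = rng.randint(v_min, v_max)                     # group size V^(m)
        groups.append(sorted(rng.sample(range(num_views), size)))
    return groups
```

Each returned group is a random subset of view indices; occasional duplicate groups are acceptable by design, since the subsequent feature-, anchor-, and neighborhood-level randomization further diversifies them.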
It is worth mentioning that the early-fusion MVC methods (as shown in Fig. 1(a)) can be viewed as a special instance of our view group based formulation with all views selected into a single view group by setting V_min = V_max = V and M = 1. Similarly, the late fusion MVC methods (as shown in Fig. 1(b)) can also be viewed as a special instance of our view group based formulation with each view being a single view group by setting V_min = V_max = 1 and M = V.

View-Sharing Bipartite Graph Construction
With multiple random view groups obtained, our next goal is to perform the early-stage fusion in the view groups. Note that the early-stage fusion is not meant to achieve a single optimal solution in a one-shot manner. Instead, it performs fusions on multiple view groups, and builds multiple view-sharing bipartite graphs with multi-level diversities. Besides the diversity, to enable the scalability for very large datasets, the efficiency is another of our key concerns during the early-stage multi-fusion process.
Particularly, in each view group, the bipartite graph structure is exploited to formulate the information of multiple view members. In recent years, the bipartite graph structure has shown its advantage in handling large-scale datasets [16], [17], [20], [21]. From the perspective of topology, the bipartite graph is built between the N data samples and a set of p anchors (or representatives), typically with p ≪ N for large-scale datasets. From the perspective of matrix notation, the bipartite graph can be represented by an N × p cross-affinity matrix with its (i, j)-th entry being the affinity between the i-th sample and the j-th anchor, which can also be regarded as encoding the data samples via this small set of anchors.
To enhance the diversity of the bipartite graph construction in multiple view groups, we simultaneously leverage three levels of diversification, corresponding to the featurelevel, the anchor-level, and the neighborhood-level, respectively.
In the feature-level diversification, we first perform random feature sampling on each view member in a view group. Let τ^(m)_v denote the feature sampling ratio for the v-th view member in the m-th view group (i.e., View^(m)_v). To inject the diversity, the sampling ratio for each view member is randomly chosen in the range of [τ_min, τ_max], where τ_min and τ_max are respectively the lower and the upper bounds of the sampling ratio such that 0 < τ_min ≤ τ_max ≤ 1. By performing feature sub-sampling with a randomized sampling ratio, we can obtain a subset of features for each view member. For View^(m)_v, its data matrix after random feature sampling can be denoted as X̃^(m)_v ∈ R^(N×d̃^(m)_v), where d̃^(m)_v = ⌈τ^(m)_v · d^(m)_v⌉ is the reduced dimension, and ⌈·⌉ obtains the ceiling of a real value.
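A minimal sketch of this feature-level diversification, assuming NumPy data matrices (the bound values passed in below are illustrative, not the paper's prescribed defaults):

```python
import numpy as np

def sample_features(X, tau_min=0.2, tau_max=1.0, rng=None):
    """Randomly sub-sample columns (features) of a view member's data matrix,
    with the sampling ratio itself drawn uniformly from [tau_min, tau_max]."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    tau = rng.uniform(tau_min, tau_max)            # randomized sampling ratio
    d_reduced = int(np.ceil(tau * d))              # ceiling, as in the text
    cols = rng.choice(d, size=d_reduced, replace=False)
    return X[:, cols]
```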
After random feature sampling, we proceed to construct a bipartite sub-graph on each view member, and then combine these bipartite sub-graphs into a unified view-sharing bipartite graph, where a set of p anchors are required and the K-NN sparsification is performed. In prior works, the anchor set for multiple views can be obtained by performing k-means clustering on the concatenated features of all views [18] or by optimizing some objective function to learn consensus anchors [21]. However, on the one hand, they typically aim to find a set of anchors suitable for all views, without sufficient consideration to view-specific characteristics. On the other hand, their anchor selection or learning process may also be computationally expensive when facing very large datasets.
Instead of pursuing a set of consensus anchors, we distribute the task of finding p anchors to the multiple view members. Without prior knowledge, we expect the multiple view members in the same view group to contribute equally. Specifically, each view member is expected to make a contribution of p̄^(m) = p/V^(m) anchors. Similarly, in terms of the neighborhood, each view member is expected to contribute K̄^(m) = K/V^(m) nearest neighbors. Thereafter, on each view member in VG^(m), we aim to build a bipartite sub-graph between N samples and p̄^(m) anchors, with each sample connected to K̄^(m) nearest anchors.
For the m-th view group VG (m) , the anchor selection on the v-th view member V iew ) is performed via the hybrid representative selection strategy [17], which efficiently obtains a set ofp (m) anchors, denoted as where a Then we define the bipartite sub-graph for the view member V iew (m) v as follows: where L are the left and right node sets of the bipartite sub-graph, respectively, and B is the cross-affinity matrix. An edge between two nodes exists if and only if one node is a data sample, another node is an anchor, and this anchor is one of the sample'sK (m) -nearest anchors. Formally, the (i, j)-th entry of the cross-affinity matrix B wherex (m) v,i denotes the i-th sample in this view member (with reduced dimension), Sim(·) computes the similarity between two vectors, and N K (x) denotes the set of Knearest anchors of sample x. Note that Sim(·) can be any similarity measure. Typically, we utilize the Gaussian kernel similarity, which maps the Euclidean distance to a similarity measure via a Gaussian kernel. Thus, with each sample linked toK (m) nearest anchors, the cross-affinity matrix B (m) v can be represented as a sparse matrix with only N ·K (m) non-zeros entries, whose sparsity can significantly benefit the later matrix computations.
For the v-th view member in the m-th view group, a bipartite sub-graph G^(m)_v is thereby constructed, with each row of its cross-affinity matrix being a low-dimensional (sparse) feature vector for a data sample. As there are V^(m) view members in VG^(m), each data sample can be represented (or encoded) by the totally p̄^(m)·V^(m) anchors, which leads to the view-sharing bipartite graph for the entire view group, that is, G^(m) = {L^(m), R^(m), B^(m)}, where L^(m) = X is the left node set of the bipartite graph, and R^(m) = A^(m)_1 ∪ · · · ∪ A^(m)_{V^(m)} is the right node set, which is the union of the anchor sets of the V^(m) view members from VG^(m). The cross-affinity matrix for G^(m) is defined as the horizontal concatenation B^(m) = [B^(m)_1, B^(m)_2, · · · , B^(m)_{V^(m)}], with each sub-matrix normalized to unit norm, so as to adjust multiple bipartite sub-graphs into similar scales.
The time complexity of building the view-sharing bipartite graph G^(m) mainly comes from the anchor selection and the cross-affinity matrix construction. For the v-th view member in VG^(m), the anchors are selected via the hybrid representative selection strategy [17], whose cost, together with the K̄^(m)-NN based construction of the sparse cross-affinity matrix, keeps the construction of a bipartite sub-graph (for a view member) almost linear in the number of samples N.

Ensemble Generation in View Groups
In this section, we describe the ensemble generation, i.e., the generation of a set of diversified base clusterings, based on the view-sharing bipartite graphs in multiple view groups.
Specifically, with a view-sharing bipartite graph constructed for each view group, a set of M view-sharing bipartite graphs, ranging from G^(1) to G^(M), can be built for the M view groups. For the base clustering in each view group, the number of clusters, say, k^(m), will be randomly selected in the range of [k_min, k_max], where k_min and k_max are respectively the lower bound and the upper bound of the cluster number. In the following, we proceed to describe the fast partitioning of the m-th view-sharing bipartite graph into k^(m) clusters.
For the m-th view group, there are totally N + p̄^(m)·V^(m) nodes in the view-sharing bipartite graph G^(m), with N data samples in the left node set and p̄^(m)·V^(m) anchors in the right node set. By treating G^(m) as a general graph, its full affinity matrix E^(m) ∈ R^((N+p̄^(m)V^(m))×(N+p̄^(m)V^(m))) can be represented as E^(m) = [0, B^(m); (B^(m))ᵀ, 0]. To partition this graph via conventional spectral clustering [34], the generalized eigen-decomposition problem L^(m)h = λD^(m)h should be solved, where L^(m) and D^(m) are respectively the Laplacian matrix and the degree matrix of E^(m). Rather than solving this problem on the full graph, the bipartite structure allows the eigen-decomposition to be equivalently transferred to a much smaller graph whose node set consists of only the anchors, with the eigenvectors of the small graph efficiently mapped back to the eigenvectors h of the full graph [35]. Then, the k^(m) eigenvectors will be stacked as a matrix with each row treated as a new feature vector, upon which the k-means discretization can be performed to obtain the m-th base clustering with O(N(k^(m))²t) time, where t is the number of k-means iterations. Since k^(m) and k are generally at similar scales, the time complexity of generating the base clustering from a view-sharing bipartite graph is O(N(k²t + K² + Kk) + p³), and the space complexity is O(N(k + K)).
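The small-graph trick can be sketched as follows. This is a simplified rendering rather than the exact eigenvector transfer of [35]: the per-eigenvector rescaling of the exact mapping is omitted, and the sample-side embedding rows are simply re-normalized before the k-means discretization:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.cluster.vq import kmeans2

def partition_bipartite(B, k, seed=0):
    """Cluster N samples given an N x p bipartite cross-affinity matrix B by
    eigen-decomposing the small p x p graph E_s = B^T D^{-1} B instead of the
    full (N+p) x (N+p) graph."""
    d = np.asarray(B.sum(axis=1)).ravel()          # sample-side degrees
    Dinv = diags(1.0 / np.maximum(d, 1e-12))
    Es = (B.T @ Dinv @ B).toarray()                # small p x p affinity matrix
    ds = Es.sum(axis=1)
    Dsh = np.diag(1.0 / np.sqrt(np.maximum(ds, 1e-12)))
    _, vecs = np.linalg.eigh(Dsh @ Es @ Dsh)       # normalized small Laplacian
    V = vecs[:, -k:]                               # top-k eigenvectors
    U = Dinv @ B @ V                               # map anchors' embedding to samples
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    _, labels = kmeans2(U, k, minit='++', seed=seed)
    return labels
```

The eigen-decomposition here is performed on a p × p matrix instead of the full (N + p) × (N + p) graph, which is the source of the efficiency gain.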

Late-Stage Consensus Function
By partitioning the view-sharing bipartite graph of the m-th view group, we obtain a base clustering with k^(m) clusters. Then the set of base clusterings generated for the M view groups can be represented as Π = {π^(1), π^(2), · · · , π^(M)}, where π^(m) = {C^(m)_1, · · · , C^(m)_{k^(m)}} is the m-th base clustering, and C^(m)_i is the i-th cluster in π^(m). Each base clustering consists of a certain number of clusters. For convenience, we represent the set of clusters in all base clusterings as C = {C_1, C_2, · · · , C_{k_c}}, where C_i is the i-th cluster, and k_c = Σ^M_{m=1} k^(m) is the total number of clusters in all base clusterings.
To jointly formulate the information of multiple base clusterings, we construct a unified bipartite graph by treating both the data samples and the base clusters as graph nodes, denoted as G = {L, R, B}, where L = X is the left node set with N data samples, R = C is the right node set with k_c base clusters, and B ∈ R^{N × k_c} is the cross-affinity matrix. A link between two nodes exists if and only if one of them is a data sample and the other one is a base cluster that contains the sample. Thus, the (i, j)-th entry of the cross-affinity matrix B can be defined as

b_{ij} = 1 if x_i ∈ C_j, and b_{ij} = 0 otherwise.

Note that each data sample belongs to one and only one cluster in a base clustering. With a total of M base clusterings, each data sample will be linked to exactly M clusters in the unified bipartite graph. That is, in each row of B, there are exactly M non-zero entries. Therefore, B is a sparse matrix with N · M non-zero entries. Similar to the partitioning of the view-sharing bipartite graph G^{(m)}, the eigen-decomposition problem of the unified bipartite graph G can also be tackled by conducting the eigen-decomposition on a smaller graph G_s with an affinity matrix E_s = B^\top D^{-1} B, whose computation takes O(NM^2) time, where D is a diagonal matrix with its (i, i)-th entry being the sum of the i-th row in B.
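As a concrete illustration, the construction above can be sketched in Python (NumPy/SciPy assumed; the function and variable names are ours, not the paper's):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, diags

def unified_bipartite(base_clusterings):
    """Stack M base clusterings (each an array of N integer labels)
    into the N x k_c cross-affinity matrix B of the unified bipartite
    graph (b_ij = 1 iff sample i lies in base cluster j), then form
    the small-graph affinity E_s = B^T D^{-1} B in O(N M^2) time."""
    N = len(base_clusterings[0])
    blocks = []
    for labels in base_clusterings:
        k_m = int(labels.max()) + 1
        blocks.append(csr_matrix((np.ones(N), (np.arange(N), labels)),
                                 shape=(N, k_m)))
    B = hstack(blocks).tocsr()                 # exactly M non-zeros per row
    d = np.asarray(B.sum(axis=1)).ravel()      # row sums (all equal to M)
    Es = (B.T @ diags(1.0 / d) @ B).toarray()  # k_c x k_c consensus graph
    return B, Es
```

Because every row of B holds exactly M ones, B stays extremely sparse and E_s is only k_c x k_c, which is what makes the late-stage consensus cheap.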
To obtain the final clustering result with k clusters, solving the eigen-decomposition of the graph G_s takes O(k_c^3) time, after which the k-means discretization on the resulting eigenvectors yields the final clusters.

Complexity Analysis
In the following, we analyze the time and space complexity of the proposed FastMICE approach. In large-scale scenarios, the number of anchors p may range from hundreds to thousands, which is much smaller than the data size N but much larger than the number of clusters k or the number of nearest neighbors K. With k, K, M, V ≪ p ≪ N, the overall time complexity of the FastMICE approach can be written as O(NMp^{1/2}V^{1/2}), which is linear in the data size N.

Space Complexity
This section analyzes the space complexity of FastMICE. The construction of a view-sharing bipartite graph takes O(N(K + V)) space, while its partitioning takes O(N(k + K)) space. Since the base clusterings can be generated in a serial manner, the space complexity of generating multiple base clusterings remains O(N(k + K + V)). The consensus function takes O(N(k + M)) space. Thus, the overall space complexity of FastMICE is O(N(k + K + V + M)), which is also linear in the data size N.

EXPERIMENTS
In this section, we evaluate the proposed FastMICE approach against the state-of-the-art MVC approaches on a variety of general-scale and large-scale multi-view datasets. The experiments are conducted on a PC with an Intel i5-6600 CPU and 16GB of RAM.

Benchmark Datasets
In the experiments, 22 real-world multi-view datasets are used, including 10 general-scale datasets and 12 large-scale datasets, as shown in Table 2.

Baseline Methods and Experimental Settings
Our proposed FastMICE method is experimentally compared against ten MVC methods, which are listed below.
Among these baseline methods, BMVC, LMVSC, SMVSC, and FPMVS-CAG are four large-scale MVC methods; MVEC, M²VEC-km, and M²VEC-spec are three EC based MVC methods; and AMGL, SwMC, and FPMVS-CAG are three tuning-free MVC methods. For the baseline methods as well as the proposed method, if the distance metric can be customized, the cosine distance will be adopted for the document datasets, such as Movies, BBCSport, and Citeseer. Otherwise, their suggested distance (mostly the Euclidean distance) will be adopted.
For each of the baseline methods, if dataset-specific tuning is needed, each of its hyperparameters will be tuned in the range of {10^{-5}, 10^{-4}, ..., 10^{5}}, unless a tuning range is specifically suggested by the corresponding paper. To avoid the expensive or even unaffordable computational cost of hyperparameter tuning on the entire large-scale datasets, for all the baseline methods except M²VEC-km and M²VEC-spec, if N > 10,000, the tuning will be conducted on a random subset of 10,000 samples. For M²VEC-km and M²VEC-spec, whose computational costs rapidly increase with larger data sizes, if N > 1,000, the tuning will be conducted on a random subset of 1,000 samples.
Note that the proposed FastMICE method does not require dataset-specific tuning. Although there exist several parameters in FastMICE, these parameters can safely be set to common values (or randomized in common ranges) across various datasets. Specifically, the feature sampling ratio τ is randomized in a common range, and the number of base clusterings M = 20, the number of anchors p = min{1,000, N}, and the number of nearest neighbors K = 5 are used in the experiments on all benchmark datasets.
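A minimal sketch of how such per-group settings could be drawn, assuming the common defaults stated above (M = 20, p = min{1000, N}, K = 5); since the text does not give the exact randomization intervals, the ranges for τ and k^{(m)} below are illustrative placeholders:

```python
import random

def draw_group_settings(N, M=20, rng=None):
    """Draw hyperparameters for M view groups using the common
    defaults from the text: M = 20 base clusterings, p = min{1000, N}
    anchors, K = 5 nearest neighbors. The sampling ranges for the
    feature ratio tau and the cluster number k^(m) are placeholders."""
    rng = rng or random.Random(0)
    p, K = min(1000, N), 5
    return [{'tau': rng.uniform(0.2, 0.8),   # placeholder range for tau
             'k': rng.randint(2, 30),        # placeholder range for k^(m)
             'p': p, 'K': K}
            for _ in range(M)]
```

The point of the sketch is that nothing here depends on the dataset at hand, which is what "free of dataset-specific tuning" means in practice.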

Performance Comparison and Analysis
In this section, we experimentally compare our FastMICE method against ten baseline MVC methods. The comparison results w.r.t. NMI, ARI, ACC, and PUR are reported in Tables 3, 4, 5, and 6, respectively.
As shown in Table 3, FastMICE achieves the best performance w.r.t. NMI on 19 out of the 22 datasets. Although the MVEC method yields a higher NMI score than FastMICE on the BBCSport dataset, FastMICE outperforms or significantly outperforms MVEC on all the other datasets. In comparison with the four large-scale baseline methods, namely BMVC, LMVSC, SMVSC, and FPMVS-CAG, our FastMICE method outperforms them on all benchmark datasets except Movies and NUS-WIDE.
Further, we report the average ranks (across the 22 datasets) of the proposed method and the ten baseline methods in Table 3. Note that, for a dataset, if four MVC methods are computationally feasible and seven MVC methods are computationally infeasible due to the out-of-memory error, then the seven infeasible methods will equally rank in the fifth position on this dataset. As can be seen in Table 3, the proposed FastMICE method achieves an average rank of 1.23, which significantly outperforms the second best method with an average rank of 4.41. Similar advantages of the proposed method over the baselines can also be observed in Tables 4, 5, and 6. In terms of ARI, our FastMICE method achieves the best scores on 18 out of the 22 datasets, with an average rank of 1.27. In terms of ACC, our FastMICE method likewise achieves the best scores on the majority of the datasets.

Influence of Ensemble Size M
In this section, we test the influence of the ensemble size M, which corresponds to the number of base clusterings and also the number of random view groups in FastMICE. The three EC based baseline methods, namely, MVEC, M²VEC-km, and M²VEC-spec, similarly involve the ensemble size M, which will also be tested in the comparison. The performances of FastMICE and the three baselines are illustrated in Fig. 3. Note that the three EC based baseline methods are not computationally feasible for datasets larger than NH-4660, so their curves are absent in the corresponding sub-figures. As shown in Fig. 3, the proposed FastMICE method yields consistently high-quality clustering performance with varying ensemble sizes. When compared with the other EC based methods, FastMICE achieves overall better performance than the baselines on the benchmark datasets. Empirically, a relatively larger ensemble size is beneficial. In our experiments, we use the ensemble size M = 20 on all benchmark datasets.

Influence of Number of Anchors p
In this section, we test the influence of the number of anchors p in the FastMICE method. As can be seen in Fig. 4, the proposed FastMICE method shows consistent performance as the number of anchors goes from 100 to 1400. Empirically, a larger number of anchors can be beneficial on most of the datasets, especially on the large-scale ones such as YTF-200 and YTF-400, probably due to the fact that a larger number of anchors may better reflect the overall structure of the data. As the number of anchors cannot exceed the number of original samples, in our experiments, we use the number of anchors p = min{1000, N } on all benchmark datasets.

Influence of Number of Nearest Neighbors K
In this section, we test the influence of the number of nearest neighbors K in the FastMICE method, which corresponds to the number of (nearest) anchors that are linked to each data sample. Specifically, we illustrate the performance of the proposed FastMICE method as the number of nearest neighbors goes from 1 to 10 in Fig. 5. As shown in Fig. 5, a moderate value of K can often be beneficial on most of the benchmark datasets. Empirically, it is suggested that the number of nearest neighbors be set in the range of [4,8]. In our experiments, we use the number of nearest neighbors K = 5 on all benchmark datasets.
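To illustrate what linking each sample to its K nearest anchors amounts to, here is a dense NumPy sketch that builds an N x p cross-affinity matrix with exactly K non-zeros per row; the Gaussian weighting and the mean-distance bandwidth heuristic are our illustrative choices, not necessarily the paper's exact scheme:

```python
import numpy as np

def knn_anchor_graph(X, anchors, K=5):
    """Link each of the N samples to its K nearest anchors,
    producing an N x p cross-affinity matrix with K non-zeros
    per row (Gaussian-kernel weights; illustrative choice)."""
    # squared Euclidean distances between samples and anchors (N x p)
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :K]        # K nearest anchors per sample
    nearest = np.take_along_axis(d2, idx, axis=1)
    sigma = np.mean(np.sqrt(nearest))          # simple bandwidth heuristic
    w = np.exp(-nearest / (2 * sigma ** 2 + 1e-12))
    B = np.zeros_like(d2)
    np.put_along_axis(B, idx, w, axis=1)       # scatter weights into B
    return B
```

With K fixed at a small constant (K = 5 in the experiments), B stays sparse regardless of N, which is central to the linear space complexity discussed earlier.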

Influence of Random View Groups
In this section, we test the influence of the random view groups in our framework. As discussed in Section 3.2, the previous early fusion methods can be regarded as a special instance of our view group formation with all views in a single group, while the late fusion methods can be regarded as a special instance with a single view in each group. We therefore compare our FastMICE method using random view groups against the variants using a single view in each group and using all views in one group. As shown in Fig. 6, the use of random view groups leads to substantial improvements on most of the datasets. Although using all views in one group yields performance comparable to that of random view groups on the Yale dataset, on most of the other datasets FastMICE with random view groups outperforms or significantly outperforms the two variants, which verifies the benefits brought by the random view groups.
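The three view-group formation modes compared above can be sketched as follows (Python standard library only; the function name and the cyclic view assignment for the late-fusion variant are our assumptions):

```python
import random

def form_view_groups(n_views, M, mode="random", rng=None):
    """Form M view groups over n_views views. 'early' puts all views
    in one group (classic early fusion), 'late' uses a single view per
    group (classic late fusion), and 'random' draws a random non-empty
    subset of views for each group, generalizing both extremes."""
    rng = rng or random.Random(0)
    groups = []
    for m in range(M):
        if mode == "early":
            groups.append(list(range(n_views)))
        elif mode == "late":
            groups.append([m % n_views])      # cycle through single views
        else:
            size = rng.randint(1, n_views)    # random group size
            groups.append(sorted(rng.sample(range(n_views), size)))
    return groups
```

Viewed this way, random view groups interpolate between the two classic fusion strategies while injecting the group-level diversity that the ensemble exploits.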

Execution Time
In this section, we evaluate the time costs of the FastMICE method and the other MVC methods on the benchmark datasets. Note that for the proposed FastMICE method, no dataset-specific hyperparameter tuning is needed. For the baseline methods, if dataset-specific hyperparameter tuning is required, the time costs of tuning and running are reported separately. As shown in Table 7, more than half of the baseline methods cannot go beyond the ALOI dataset due to the computational complexity bottleneck. Although the large-scale baseline methods, including BMVC, LMVSC, SMVSC, and FPMVS-CAG, have shown their scalability on some larger datasets, they still encounter heavy computational burdens or even the out-of-memory error when processing the YTF-200 or YTF-400 datasets. Remarkably, on the YTF-200 dataset with 286,006 samples, our FastMICE method consumes only 263.81 seconds of running time, while the LMVSC, SMVSC, and FPMVS-CAG methods consume 3,184.79, 74,063.30, and 32,974.03 seconds, respectively. On the YTF-400 dataset with 398,191 samples, FastMICE is the only computationally feasible method, which demonstrates its clear advantage in scalability for very large-scale datasets.
To summarize, as shown in Tables 3, 4, 5, 6, and 7, the proposed FastMICE method is capable of yielding highly competitive clustering performance over the state-of-the-art, while showing advantageous scalability on very large-scale multi-view datasets.

CONCLUSION AND FUTURE WORK
In this paper, we propose a new large-scale MVC approach termed FastMICE, which features scalability (for extremely large-scale datasets), superiority (in clustering performance), and simplicity (to be applied without dataset-specific tuning). In particular, different from previous approaches that mostly adopt some type of single-stage fusion strategy, this paper presents a hybrid early-late fusion strategy based on random view groups. A large number of random view groups are first formed to serve as flexible view organizations for investigating the view-wise relationships. Then, three levels of diversity, i.e., the feature-level, anchor-level, and neighborhood-level diversity, are jointly leveraged to explore the rich and versatile information in the random view groups and thereby enable the highly efficient construction of the view-sharing bipartite graphs. By fast partitioning of these view-sharing bipartite graphs from different view groups, a set of diversified base clusterings can be generated, which are further formulated into a unified bipartite graph for achieving the final clustering result.
It is noteworthy that our FastMICE approach has almost linear time and space complexity, and is able to perform robustly and accurately on various general-scale and large-scale datasets without requiring dataset-specific hyperparameter tuning. Extensive experimental results on 22 multi-view datasets have demonstrated the superiority of our FastMICE approach over the state-of-the-art. In future work, the concept of random view groups and the diversification-and-fusion strategy may also be investigated for more MVC tasks, such as the incomplete MVC task [43] and the deep MVC task [44], [45], so as to promote their robustness while ensuring scalability for very large datasets.