Adaptive Regularized Semi-Supervised Clustering Ensemble



I. INTRODUCTION
Clustering, one of the unsupervised learning methods, aims to split data into several disjoint groups so that data in the same group are more similar to each other than to data from different groups. Despite the success of clustering methods in exploring the underlying structure of data, they are sensitive to parameter settings, i.e., their clustering performance is neither robust nor stable. Additionally, with the availability of large-scale data collection devices, we are facing a huge amount of data, most of which are high-dimensional. This high dimensionality poses a challenge, since processing such data requires prohibitively expensive hardware. In this setting, the random subspace method provides a possible alternative by randomly selecting features to explore the underlying structure of data, which bypasses the hardware requirement. Besides, inspired by ensemble supervised learning methods, recent years have witnessed the development of clustering ensembles, which consist of two steps: the generation of clustering solutions and the fusion of clustering solutions. In the first step, diverse information about the structure of data is expected to be provided by adopting various methods, such as random feature subspaces, random sample subspaces, random initialization or random feature transformations. Many researchers have focused on generating diverse base clustering solutions. For example, Ye et al. [1] handle data with a limited number of labeled samples by designing an improved semi-supervised K-means clustering method. Yu et al. [2] combine constraint weights and ensemble member weights to distinguish the contribution differences between pairwise constraints and ensemble members. Rathore et al. [3] fuse random projection and fuzzy c-means into a semi-supervised clustering ensemble framework designed for handling high-dimensional data. Li et al. [4] consider labeled data and pairwise constraints in a hybrid constrained semi-supervised clustering scheme for generating base clustering solutions. In the second step, clustering ensemble methods adopt a consensus method to integrate multiple base solutions in order to obtain a better clustering partition. For example, Strehl and Ghosh [5] formulate the clustering ensemble problem as an optimization problem, which is solved based on a hyper-graph model for the fusion of clustering solutions. Yu et al. [6] design a Gaussian distribution-based cluster structure and then find the most representative unified cluster structures by using distribution-based distances to facilitate the clustering process. Bai et al. [7] propose a weighted consensus measure based on information entropy to evaluate the clustering quality.
The associate editor coordinating the review of this manuscript and approving it for publication was Chao Tong.
Although these clustering ensemble methods have achieved satisfactory performance, they seldom consider the following issues: 1) how to fully exploit the prior information provided by experts, expressed as must-link and cannot-link constraints, and 2) how to design a better fusion strategy that integrates all the clustering solutions into a solution that is more robust and stable than each base clustering solution. To achieve these two goals, we propose an adaptive regularized semi-supervised clustering ensemble framework, referred to as ARSCE. Specifically, we first generate multiple random feature subspaces and then perform transformations on these feature subspaces while considering pairwise constraints. Secondly, we apply a clustering algorithm to the transformed features to generate diverse clustering solutions. Thirdly, we design a new fusion strategy that integrates these clustering partitions into a unified clustering solution by assigning suitable weights to the clustering ensemble members. To evaluate the effectiveness of our method, we conduct extensive experiments on multiple real-world benchmark data sets. Experimental results indicate that ARSCE achieves better or at least comparable performance compared with related counterparts, which verifies its effectiveness.
The contributions of this work are summarized as follows: 1) We propose a transformation that works in random feature subspaces while considering pairwise constraints in order to find a clustering-friendly space, where clustering solutions are generated by traditional clustering methods. 2) We design a strategy that fuses all the clustering solutions into a unified clustering solution by adaptively assigning a weight to each clustering ensemble member. 3) We perform experiments over multiple real-world data sets; experimental results verify the effectiveness of our proposed ARSCE in selecting informative constraints and reducing constraint redundancies.
The remainder of this work is organized as follows. Section II reviews related work on random subspace methods, semi-supervised clustering and clustering ensemble. Section III describes our proposed method in detail. Section IV presents and analyzes the experimental results. Section V concludes this work and describes possible future directions.

II. RELATED WORK
In this section, we review the related literature on random subspace methods, semi-supervised clustering and clustering ensemble.
Random subspace approaches show a remarkable advantage in reducing the correlation between different base classifiers. In practice, several ensemble classification methods are based on random subspaces. For example, Ho [8] adopts pseudo-randomly selected subsets of feature vectors to construct multiple trees in randomly chosen subspaces. To deal with high-dimensional data, Yu et al. [9] propose a graph-based semi-supervised dimension reduction scheme in random subspaces and perform semi-supervised linear classification in the random feature subspaces.
Semi-supervised clustering approaches have been proposed to improve the clustering performance by using prior knowledge in the form of label constraints such as "cannot-link" and "must-link". Many semi-supervised clustering methods have been proposed in the past years. Specifically, semi-supervised maximum margin clustering [10] extends the maximum margin framework in supervised learning to clustering and penalizes violations of the given pairwise constraints, which shows promising performance. In [11], Anand et al. utilize pairwise constraints as supervised information for mean shift clustering, where data are projected into a high-dimensional kernel space and pairwise constraints are then imposed on them via a linear transformation. In [12], Fang et al. integrate low-rank representation together with Gaussian fields into a unified framework, allowing pairwise constraints to guide the construction of the affinity matrix. Liu et al. [13] combine K-means clustering and linear discriminant analysis, in which the latter aims to find, via dimensionality reduction, a space where the clustering task can achieve satisfactory performance. Wang et al. [14] utilize constraint neighborhood projections to mitigate the issue of constraint conflicts while reducing the required number of labeled data. Xiong et al. [15] propose to iteratively select both must-link and cannot-link pairwise constraints for semi-supervised clustering using an active learning strategy. Huang et al. [16] combine pairwise constraints and constraint projections to achieve both sample and feature constraint projections. Chang and Chen [17] adopt discriminative random fields to evaluate the consistency between the results obtained by a certain clustering approach and the supervised information derived from pairwise constraints. Wang et al. [18] obtain the labels of all unlabeled data by propagating pairwise constraints, where supervision information helps to adjust a weight matrix used as a regularization term imposed on the objective function of nonnegative matrix factorization. Yang et al. [19] use prior knowledge about pairwise constraints to improve community detection performance in a semi-supervised learning framework, where the network topology is integrated with supervised information.
To mitigate these issues, clustering ensemble methods have been proposed to fuse multiple clustering solutions into a unified solution. Specifically, Yu et al. [20] adopt random transformations in both the sample and feature spaces with a hybrid strategy. Mimaroglu and Aksehirli [21] design a divisive clustering ensemble scheme that combines high-quality clustering solutions into a final result and does not require input parameters. Yang and Jiang [22] propose a sampling-based clustering ensemble approach by adopting boosting and bagging strategies for the global and the local constitutions, respectively. Yu and Wong [23] perturb the input to generate perturbed data, on which partitions are generated by using Neural Gas as the base clustering method. Iam-On et al. [24] design a new link-based method that improves the clustering quality by discovering unknown entries with clusters in an ensemble. Huang et al. [25] design an ensemble-driven clustering scheme by estimating the clustering uncertainty and locally weighting the co-association matrix. Huang et al. [26] perform ensemble clustering by computing trajectories of random walkers over sparse graph representations, based on which consensus functions help to fuse clustering solutions. Topchy et al. [27] formulate clustering ensemble as a maximum-likelihood problem and rewrite the consensus function using a mutual information criterion. Liu et al. [28] prove the theoretical equivalence between spectral clustering ensemble and weighted k-means. Yousefnezhad et al. [29] adopt wisdom-of-crowds concepts such as diversity, independence, decentralization and aggregation to enhance the clustering ensemble. Yu et al. [30] integrate random subspaces, constraint propagation and the normalized cut algorithm into a semi-supervised clustering framework. Yang and Jiang [31] propose a bi-weighted ensemble scheme for time series data clustering via HMM-based K-models to reduce the dependency on initialization while automatically performing model selection. Yu et al. [32] propose a double selection-based semi-supervised clustering method that works on both features and samples for tumor clustering, in which pairwise constraints serve as supervised information. Yu et al. [33] design an adaptive semi-supervised clustering ensemble method that fully exploits pairwise constraints via affinity propagation.
Additionally, researchers have made efforts on clustering ensemble methods that need to trade off between quality and diversity. Shi et al. [34] transfer the relationship between quality and diversity learnt in a source domain to a target domain based on certain optimization objective functions. For better performance of spectral clustering over large-scale data with very limited resources, Huang et al. [35] convert a sparse sub-matrix into a bipartite graph and use transfer cut to obtain the clustering result.

III. PROPOSED METHODOLOGY
In this section, we describe the adaptive regularized semi-supervised clustering ensemble method (ARSCE) in detail. The framework of our method is shown in Figure 1 and is divided into three main stages. First, we randomly select features from all the feature candidates to form a series of subspaces. Second, we conduct weighted constraint selection and constraint mapping in these subspaces to improve the clustering quality. Third, we design a scheme to integrate the clustering solutions generated in each subspace into a more robust clustering solution.

A. PROBLEM FORMULATION
Suppose there is a data set $X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{m \times n}$, where $n$ and $m$ are the number of samples and the number of features, respectively. Semi-supervised clustering ensemble aims to cluster $X$ into $k$ groups so that pairs of samples in the same group are more similar than pairs of samples from different groups, while satisfying as many pairwise constraints as possible.
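The requirement of satisfying as many constraints as possible can be made concrete with a small sketch. The function name `constraint_satisfaction` and the representation of constraints as index pairs are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def constraint_satisfaction(labels, must_link, cannot_link):
    """Return the fraction of pairwise constraints satisfied by a partition.

    labels: cluster label of each sample.
    must_link / cannot_link: lists of (i, j) sample-index pairs (hypothetical format).
    """
    labels = np.asarray(labels)
    # A must-link pair is satisfied when both samples share a cluster.
    ml_ok = sum(bool(labels[i] == labels[j]) for i, j in must_link)
    # A cannot-link pair is satisfied when the samples are in different clusters.
    cl_ok = sum(bool(labels[i] != labels[j]) for i, j in cannot_link)
    total = len(must_link) + len(cannot_link)
    return (ml_ok + cl_ok) / total if total else 1.0

# Toy example: three of the four constraints hold for this partition.
labels = [0, 0, 1, 1, 2]
must_link = [(0, 1), (2, 4)]
cannot_link = [(0, 2), (3, 4)]
score = constraint_satisfaction(labels, must_link, cannot_link)
```

Such a score only measures agreement with the constraints; it says nothing about how similar the samples inside each group are, which is why the formulation asks for both properties at once.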

B. THE DETAILED PROCEDURE OF OUR METHOD
1) THE GENERATION OF RANDOM FEATURE SUBSPACES
Given $m$ features, we generate multiple subspaces by randomly selecting $\lfloor m\tau \rfloor$ features one by one, where $\tau$ is the sampling rate and $\lfloor \cdot \rfloor$ denotes the greatest integer less than or equal to a number. This selection procedure is repeated $B$ times, which leads to $B$ random subspaces, denoted as $S = \{S_1, S_2, \ldots, S_B\}$. Specifically, the index $Idx$ of a feature selected into a subspace is determined by Eq. (1), where $\rho$ denotes a random variable sampled from the uniform distribution on the interval between 0 and 1.
We adopt sampling without replacement so that no feature is selected more than once in a random subspace. Meanwhile, we avoid generating two identical feature subspaces whose features overlap perfectly. The subspaces obtained in this way provide various perspectives for exploring the structural information of data while ensuring the diversity of information, which plays a key role in ensemble learning.
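The subspace generation described above can be sketched as follows. This is a minimal illustration under stated assumptions: it draws indices with NumPy's `Generator.choice` rather than with the paper's Eq. (1), and the function name `random_feature_subspaces` and the `seed` parameter are hypothetical.

```python
import math
import numpy as np

def random_feature_subspaces(m, tau, B, seed=0):
    """Draw B distinct feature-index subsets, each of size floor(m * tau)."""
    rng = np.random.default_rng(seed)
    size = int(math.floor(m * tau))
    if size < 1:
        raise ValueError("sampling rate too small for the number of features")
    if B > math.comb(m, size):
        raise ValueError("not enough distinct subspaces of this size")
    seen = set()
    subspaces = []
    while len(subspaces) < B:
        # Sampling without replacement: no feature repeats inside one subspace.
        idx = tuple(sorted(rng.choice(m, size=size, replace=False).tolist()))
        # Reject a draw whose features coincide exactly with an earlier subspace.
        if idx not in seen:
            seen.add(idx)
            subspaces.append(np.array(idx))
    return subspaces

# Example: 5 subspaces of 6 features each, drawn from 20 features.
subspaces = random_feature_subspaces(m=20, tau=0.3, B=5)
```

Rejecting duplicate subsets is cheap when the number of possible subsets is much larger than $B$, which is the typical case for high-dimensional data.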

2) THE WEIGHTED CONSTRAINT SELECTION AND PROJECTION
Since the generated random subspaces contain different features for exploring the underlying structure of the data manifold, each of them has its own preferable constraint set.
In this setting, we perform the pairwise constraint selection for each subspace accordingly. When conducting constraint selection and projection, the following assumptions are made: 1) a must-link constraint indicates a smaller distance between the corresponding data points, and 2) a cannot-link constraint indicates a larger distance between the corresponding data points. In the constraint selection process, we consider that a must-link constraint with a large distance or a cannot-link constraint with a small distance plays an important role in the clustering, and we should penalize these two cases. For a given random feature subspace, the objective function that combines semi-supervised clustering and the search for a clustering-friendly space into a unified framework can be defined as follows:

s.t. $P^{T} P$
where $P$ is the partition matrix, $X_{S_i}$ is the data matrix in the $i$-th feature subspace, and $F_i$ denotes a feature transformation from the random subspace to a clustering-friendly space. $L$ denotes the Laplacian matrix. $C$ and $M$ denote the cannot-link constraint set and the must-link constraint set, respectively. $\gamma_{i,i}$ and $\theta_{i,i}$ denote the penalty weights for violating must-link constraints and cannot-link constraints, respectively. $1(x, y)$ is an indicator function, which equals 1 if $x = y$ and 0 otherwise. Generally, an ideal transformation $F_i$ enjoys the following properties: 1) it allows a better clustering performance, and 2) it meets as many pairwise constraints as possible.
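The two assumptions behind the constraint projection (must-link pairs should be close, cannot-link pairs should be far apart) can be illustrated with a simple score for a candidate linear transformation. This sketch is not the objective of Eq. (2): the function `constraint_penalty`, the use of squared Euclidean distances, and the scalar weights `gamma` and `theta` are all assumptions chosen for illustration, and samples are stored as rows here.

```python
import numpy as np

def constraint_penalty(X_sub, F, must_link, cannot_link, gamma=1.0, theta=1.0):
    """Score a linear map F; a lower value agrees better with the constraints.

    X_sub: (n, d) samples restricted to one random feature subspace (rows are samples).
    F: (d, r) hypothetical linear transformation.
    """
    Z = X_sub @ F  # samples in the transformed space

    def sq_dist(i, j):
        diff = Z[i] - Z[j]
        return float(diff @ diff)

    # Must-link pairs that remain far apart increase the score.
    ml_cost = sum(sq_dist(i, j) for i, j in must_link)
    # Cannot-link pairs that end up far apart decrease the score.
    cl_gain = sum(sq_dist(i, j) for i, j in cannot_link)
    return gamma * ml_cost - theta * cl_gain

X_sub = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0]])
must_link = [(0, 1)]
cannot_link = [(0, 2)]
score_identity = constraint_penalty(X_sub, np.eye(2), must_link, cannot_link)
# Keeping only the first feature collapses the must-link pair onto one point.
F_proj = np.array([[1.0], [0.0]])
score_proj = constraint_penalty(X_sub, F_proj, must_link, cannot_link)
```

In this toy case the projection onto the first feature scores lower than the identity map, because it removes the distance between the must-link pair while preserving the distance between the cannot-link pair.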

3) THE INTEGRATION OF AFFINITIES WITH THE FUSION OF DIFFUSION
Suppose that we have obtained multiple clustering-friendly spaces. We then compute the corresponding affinity graphs and adopt a regularized ensemble diffusion to fuse the similarity information by optimizing Eq. (3), where $\lambda$ is a trade-off parameter used to adjust the distribution of the learned weights and $I$ is an identity matrix of appropriate size. $\beta = \{\beta_1, \ldots, \beta_M\}$ is a vector of weights whose $v$-th element $\beta_v$ is the weight of the $v$-th affinity graph. $w^{v}_{ij}$ denotes the weight of the edge that connects the $i$-th and the $j$-th data points in the graph of the $v$-th subspace. $A$ is the final similarity, which not only captures the structure of data but also leverages the complementary information in multiple clustering-friendly spaces. The first term in Eq. (3) evaluates the smoothness of the weighted tensor product graph. The motivation is that if data points $x_i$ and $x_j$ are similar to each other and data points $x_k$ and $x_l$ are similar to each other, there should be a small difference between $A_{ki}$ and $A_{lj}$. The second term is used to preserve the self-similarity in each learnt affinity graph.
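To make the role of the weights $\beta_v$ concrete, the sketch below builds one affinity graph per transformed space and combines them with fixed weights. This is a deliberately simplified stand-in, not the regularized ensemble diffusion of Eq. (3), which learns the weights and the final similarity jointly. The Gaussian kernel, the zero diagonal, and the function names `gaussian_affinity` and `fuse_affinities` are assumptions.

```python
import numpy as np

def gaussian_affinity(Z, sigma=1.0):
    """Dense Gaussian-kernel affinity between the rows of Z, with zero diagonal."""
    sq = np.sum(Z ** 2, axis=1)
    # Squared Euclidean distances, clipped at zero against round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def fuse_affinities(graphs, beta):
    """Weighted sum of affinity graphs, with weights rescaled to sum to one."""
    beta = np.asarray(beta, dtype=float)
    if np.any(beta < 0) or beta.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    beta = beta / beta.sum()
    return sum(b * W for b, W in zip(beta, graphs))

# Two hypothetical transformed spaces for the same six samples.
rng = np.random.default_rng(0)
Z1 = rng.normal(size=(6, 2))
Z2 = rng.normal(size=(6, 3))
W1, W2 = gaussian_affinity(Z1), gaussian_affinity(Z2)
A = fuse_affinities([W1, W2], beta=[1.0, 3.0])
```

A fixed weighted sum treats every edge of a graph with the same weight; the diffusion-based fusion described above additionally propagates similarity along the graph structure.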
After optimizing Eq. (3), we obtain the similarity $A \in \mathbb{R}^{n \times n}$ that integrates all the discriminative and informative features in the clustering-friendly spaces. Then we compute the normalized Laplacian matrix $L_{sys}$ together with its first $k$ eigenvectors, denoted as $U = \{u_1, \ldots, u_k\} \in \mathbb{R}^{n \times k}$, and apply k-means clustering to $U$ as formulated in Eq. (5), where $C \in \mathbb{R}^{k \times k}$ and $P$ denote the cluster center matrix and the partition matrix, respectively. The detailed pseudo-code of how to optimize our proposed method ARSCE is given in Algorithm 1.
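The final step, spectral embedding followed by k-means, can be sketched as below. The sketch assumes that the normalized Laplacian is the symmetric form $I - D^{-1/2} A D^{-1/2}$ and that the first $k$ eigenvectors are those with the smallest eigenvalues; the farthest-point initialization of k-means is also a choice made here for determinism, not something stated in the paper.

```python
import numpy as np

def spectral_embedding(A, k):
    """Eigenvectors of I - D^{-1/2} A D^{-1/2} for the k smallest eigenvalues."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    return vecs[:, :k]

def kmeans(U, k, n_iter=100):
    """Lloyd iterations with a deterministic farthest-point initialization."""
    centers = [U[0]]
    for _ in range(1, k):
        d2 = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[int(d2.argmax())])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Keep the old center when a cluster receives no points.
        new_centers = np.array([
            U[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy similarity: two groups of three samples, weakly connected to each other.
A = np.full((6, 6), 0.01)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
U = spectral_embedding(A, k=2)
labels = kmeans(U, k=2)
```

In this sketch the center matrix has shape $k \times k$, matching $C \in \mathbb{R}^{k \times k}$ in the text, because the $k$ centers live in the $k$-dimensional embedding.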

IV. EXPERIMENTS
In this section, we evaluate the effectiveness of our method on multiple real-world data sets in terms of normalized mutual information (NMI). The statistical descriptions of these data sets are shown in Table 1. Among these data sets, 8 are collected from the GENE repository. Besides, we use 2 data sets collected from the UCI machine learning repository and another 5 data sets collected from the ASU data repository.

A. EVALUATION CRITERION
We adopt the normalized mutual information (NMI) to evaluate the quality of a clustering result. In its definition, $Y$ and $\hat{Y}$ are the ground-truth labels and the clustering results, respectively. $c$ denotes the number of clusters. $t_l$ and $\hat{t}_h$ denote the numbers of samples in the $l$-th ground-truth class and the $h$-th cluster, respectively. $t_{l,h}$ denotes the number of samples in the intersection of the $l$-th ground-truth class and the $h$-th cluster. Generally, a larger NMI value indicates a better clustering performance.
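As an illustration of how the counts $t_l$, $\hat{t}_h$ and $t_{l,h}$ enter the measure, the sketch below computes NMI from a contingency table. It assumes the widely used square-root normalization, i.e. the mutual information divided by the geometric mean of the two entropies; whether the paper uses exactly this normalization cannot be confirmed from the text, and the function name `nmi` is hypothetical.

```python
import numpy as np

def nmi(y_true, y_pred):
    """NMI with geometric-mean normalization (an assumed, common variant)."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    n = y_true.size
    classes, t_idx = np.unique(y_true, return_inverse=True)
    clusters, p_idx = np.unique(y_pred, return_inverse=True)
    # Contingency table: entry (l, h) counts samples in class l and cluster h.
    table = np.zeros((classes.size, clusters.size))
    np.add.at(table, (t_idx, p_idx), 1)
    t_l = table.sum(axis=1)   # class sizes
    t_h = table.sum(axis=0)   # cluster sizes
    nz = table > 0            # empty cells contribute nothing
    mutual = np.sum(table[nz] * np.log(n * table[nz] / np.outer(t_l, t_h)[nz]))
    h_true = -np.sum(t_l * np.log(t_l / n))
    h_pred = -np.sum(t_h * np.log(t_h / n))
    if h_true == 0 or h_pred == 0:
        # Degenerate case: a single class or a single cluster.
        return 1.0 if h_true == h_pred else 0.0
    return float(mutual / np.sqrt(h_true * h_pred))

# NMI ignores how clusters are named, so a relabeled perfect partition scores 1.
perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the measure depends only on the contingency counts, it can compare a clustering with ground-truth classes without first matching cluster labels to class labels.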

C. EXPERIMENTAL ANALYSIS
1) THE EFFECTS OF PARAMETERS
In this part, we first explore the effects of the sampling rate on clustering performance in terms of normalized mutual information (NMI), where the sampling rate determines the number of features in each subspace. This experiment is conducted over six data sets in Table 1, namely Alizadeh-2000-v3, Armstrong-2002-v2, lung disease, lymphoma, mfeat and nci9. Here the sampling rate varies in the range between 0.1 and 0.5. Figure 2 illustrates the effects of the sampling rate on clustering performance. From this figure, we observe that, in general, performance initially improves as the sampling rate increases, which means that more informative features are selected to facilitate the clustering. However, once the sampling rate reaches a certain value, the clustering performance shows an obvious downward trend. A possible reason is that redundant features are selected in this setting, which has negative effects on clustering. In the majority of cases, the optimal sampling rate falls in the range between 0.3 and 0.4, while for the data set nci9 the optimal sampling rate is in the range between 0.2 and 0.3. In other words, different data sets have their respective preferable sampling rates, and the sampling rate needs to be chosen carefully for each data set. From the perspective of feature selection, we consider it necessary to explore a more reasonable strategy for constructing the random feature subspaces by selecting more informative features. Such a strategy would allow generating multiple diverse clustering partitions with satisfactory performance.
In the following, we explore the effects of pairwise constraints on clustering performance by increasing the percentage of pairwise constraints. Generally speaking, a larger percentage of pairwise constraints means that more supervised information is available to drive clustering methods towards a better clustering performance. Figure 3 illustrates the effects of pairwise constraints on performance over six data sets in Table 1. From this figure, we find that as the number of available pairwise constraints increases, the performance shows upward trends of different degrees. It means that these pairwise constraints provide effective supervised information, which contributes to the clustering process when finding a clustering-friendly space.
In addition, we explore the effects of the number of ensemble members on clustering performance with respect to normalized mutual information (NMI), which is shown in Figure 4. From this figure, we find that the performance shows an upward trend as the number of ensemble members increases. It means that more ensemble members provide more informative and complementary information for a better clustering. However, once the number of ensemble members reaches a certain value, the performance improves only slightly and the gain keeps diminishing. In this setting, we need to trade off between the performance improvement and the computation cost, since a larger number of ensemble members implies a larger computation cost and running time.
Besides, we study the effects of $\lambda$ on clustering performance in terms of normalized mutual information (NMI), which is shown in Figure 5. In this exploration, we vary the value of $\lambda$ in the range between 0.1 and 0.9. From Figure 5, we notice that when $\lambda$ increases, the clustering performance rises quickly before reaching its peak and then declines to different extents. Apart from this observation, we find that different data sets have their respective preferable $\lambda$. It indicates that our proposed method is sensitive to $\lambda$, which is used to control the distribution of the weights of the affinity graphs for the newly learnt spaces. In general, we find that in the majority of cases the optimal value of the trade-off parameter $\lambda$ is in the range between 0.4 and 0.6, except on the data set nci9, for which the optimal $\lambda$ is in the range between 0.6 and 0.8. We conjecture that the distribution of data samples over the clusters in nci9 differs from those of the other data sets. As a result, the preferable value of $\lambda$ needs to be selected for each data set to adjust the weights of the learnt affinity graphs for a better performance.
Next, we analyze the comparative results obtained by recent semi-supervised clustering ensemble approaches and our proposed one. The counterparts include the neural gas-based clustering ensemble algorithm (NGCE [36]), the random K-means-based clustering ensemble algorithm (RSKE [37]), the bagging-based K-means clustering ensemble algorithm (BAGKE [38]), the hierarchical clustering ensemble algorithm (HCCE [39]), the exhaustive and efficient clustering ensemble algorithm through constraint propagation (E2CPE [40]), the incremental semi-supervised clustering ensemble algorithm (ISSCE [30]), and the double weighting semi-supervised ensemble clustering algorithm (DCECP [2]). The comparison results are shown in Table 2, where standard deviations smaller than 0.001 are not reported. From this table, we have the following observations:
1) E2CPE achieves better performance than NGCE, RSKE, BAGKE and HCCE, since it uses constraint propagation to leverage the supervised information, which helps to guide the clustering process. It indicates the effectiveness of pairwise constraints in boosting the clustering quality.
2) Both constraint weighting and constraint projection weighting transform feature subspaces into a clustering-friendly space, where high-quality clustering solutions with enough diversity are obtained. This is reflected by the fact that ISSCE and DCECP achieve much better performance than E2CPE on the majority of data sets.
3) Our proposed method achieves the best or at least comparable performance on all the data sets, which indicates the benefit of an adaptive clustering ensemble that assigns proper weights to base clustering solutions and combines them into a better clustering partition. In other words, it verifies the effectiveness of fusion via diffusion.

V. CONCLUSION
In this paper, we propose a novel constraint-selection based clustering ensemble. First, we design a scheme to learn effective features that are beneficial to clustering and meet the prior clustering constraints. Second, we propose to fuse all the clustering solutions by using fusion via diffusion rather than a voting mechanism. We conduct experiments on multiple real-world benchmark data sets and compare our method with other recent related algorithms. Experimental results demonstrate the effectiveness of our proposed method ARSCE. In the future, we shall explore the feasibility of using deep neural networks to learn informative and discriminative features. Besides, we shall explore new strategies for integrating multiple clustering solutions for a better performance. We will further improve ARSCE in terms of its sensitivity to the sampling rate. Meanwhile, we shall extend ARSCE to real-world applications in computer vision.

FIGURE 1. The overall framework of our proposed adaptive regularized semi-supervised clustering ensemble.

VOLUME 8, 2020

Algorithm 1 Adaptive Regularized Semi-Supervised Clustering Ensemble (ARSCE)
Require: input data matrix $X$, the number of subspaces $B$, the sampling rate $\tau$, trade-off parameters $\alpha$, $\delta$, $\gamma$, $\theta$, $\beta$ and $\lambda$;
Ensure: the clustering partition matrix $P$.
1: Generate $B$ random feature subspaces via Eq. (1) for the $i$-th feature subspace;
2: while $i \le B$ do
3:   Compute the feature transformation matrix $F_i$ via optimizing Eq. (2);
4:   Compute the affinities for the transformed data $X_{S_i} F_i$ in a new space;
5:   Denote the edge weight matrix as $W$;
6: end while
7: Compute the final similarity via optimizing Eq. (3);
8: Compute the corresponding Laplacian matrix $L_{sys}$;
9: Compute the first $k$ eigenvectors $U = \{u_1, \ldots, u_k\}$;
10: Perform clustering on $U$ by using K-means via optimizing Eq. (5);

FIGURE 2. The effects of sampling rate on the clustering performance in terms of normalized mutual information (NMI).

FIGURE 3. The effects of pairwise constraints on the clustering performance in terms of normalized mutual information (NMI).

FIGURE 4. The effects of the ensemble member numbers on the clustering performances in terms of normalized mutual information (NMI).

FIGURE 5. The effects of λ on the clustering performances in terms of normalized mutual information (NMI).

TABLE 1. The statistical information of the data sets used to evaluate the clustering performance.

TABLE 2. Comparison of semi-supervised clustering ensemble methods on the data sets in Table 1 in terms of normalized mutual information (NMI).