Gene Expression Data Analysis Using Feature Weighted Robust Fuzzy c-Means Clustering

Abstract: Clustering of gene expression data has proven very useful in various applications, e.g., identifying the natural structure inherent in gene expression, understanding gene functions, mining relevant information from noisy data, and understanding gene regulation. In all these applications, genes, i.e., features, play a crucial role in characterizing the data into different groups. These features may be relevant, irrelevant, or redundant, and they contribute differently to the clustering process. This paper presents a novel approach that accounts for the effect of features during the clustering process. In the proposed method, the fuzzy c-means objective function is modified using a weighted Euclidean distance between the features with a monotonically decreasing function. The monotonically decreasing function helps control the features' contribution during the clustering process to partition the data into more relevant clusters. The proposed approach is validated, and its performance is reported in terms of various clustering performance measures on different standard datasets. These clustering performance measures have also been compared with multiple state-of-the-art methods.


I. INTRODUCTION
CLUSTERING is an unsupervised learning method used to organize unlabelled data samples into groups according to their underlying properties, based on some similarity measure [1]-[4]. In real applications, the data samples are unlabelled and follow different distributions, which makes it challenging to categorize them into meaningful clusters. Various clustering algorithms have been developed in the literature to address these challenges. Hierarchical clustering (HC) organizes data into a hierarchical structure according to the proximity matrix, and the results are generally depicted by a binary tree, or dendrogram [5]. The most successful applications of HC are in neuroimaging and bioinformatics [6]. In model-based clustering, data points in different clusters are assumed to be generated by different probability distributions [7], and the number of clusters of a given dataset is identified by estimating the parameters using maximum likelihood (ML) or expectation-maximization (EM) [8]. The major flaw of these algorithms is that they converge slowly and are sensitive to the initial parameters [9]. Partition-based clustering, in which the dataset is divided into several subsets, is well known for its simplicity and efficiency on large-scale datasets. K-medoids [10] and K-means (KM) [11] are among the efficient partition-based algorithms. In K-medoids, each cluster is represented by a medoid, i.e., its most representative data point, which has two main advantages: first, it places no limitation on the type of attributes; second, because the medoids are chosen from the predominant data points within a cluster, the method is less sensitive to outliers. In contrast, KM is a centroid-based method that works only for numerical attributes and can be affected by a single outlier [12].
Ruspini [13] and Bezdek [14] have presented fuzzy versions of the KM algorithm, in which each data point can belong to more than one cluster with a distinct membership degree. The fuzzy versions are widely used in real applications, where the data generated by the system are uncertain, imprecise, and vague. Such data are difficult to handle with KM and K-medoids because these methods provide a hard partition of the data, whereas fuzzy clustering-based methods provide a soft partition [15]-[19]. The major problem with these algorithms remains the same, i.e., all features are treated as equally important during the clustering process; the algorithms are therefore easily affected by outliers and have difficulty finding meaningful clusters in the dataset.
In the literature, various feature-weighted clustering algorithms have been studied. Huang et al. [20] have presented a weighted KM (WKM) algorithm in which variable weights are multiplied with the dissimilarity measure. These variable weights are updated iteratively, and unimportant variables are identified as those with smaller weights and eliminated. Jing et al. [21] modified the WKM by adding an entropy term to the objective function. The entropy term helps minimize the within-cluster dispersion and maximize the negative entropy on the current data partition in each iteration. In [22], the authors have presented sparse KM by putting an $L_1$ constraint on the feature weights. The $L_1$ constraint forces some feature weights to become zero. To improve the learning performance of the FCM, Wang et al. [23] have presented the weighted FCM (WFCM), in which variable weights are multiplied with the dissimilarity measure, similar to the WKM. The variable weights in WFCM are learned through the gradient descent technique by following the approach presented in [24]. In [25], the authors have presented two different feature-weighted FCM algorithms. The first approach adds an $L_2$ constraint on the feature weights, whereas the second approach adds a discriminant exponent on the feature weights. In [26], a new approach has been presented, which automatically computes individual feature weights and simultaneously reduces the dataset's irrelevant features. Although these variants of feature-weighted clustering perform well, they are unable to find the optimal number of clusters with consistent performance in the data.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Clustering has become a crucial tool for gene expression data analysis, where it helps to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition [27]. Several clustering algorithms have been studied for gene expression datasets in the literature. In [28], Sherlock has analyzed various clustering approaches for large-scale gene expression data and discussed the advantages and drawbacks of the algorithms. Smet et al. [29] have presented an adaptive quality-based clustering of gene expression profiles to determine the actual number of clusters. In [30], the authors have applied the fuzzy c-means clustering method to investigate trends in patient conditions, which doctors can use for disease diagnosis. Kannan et al. [31] have proposed an effective fuzzy c-means by incorporating the membership function of fuzzy c-means for cancer subtypes in the gene expression cancer database. Herwig et al. [32] have presented a sequential k-means algorithm with additional refinements that can handle high-throughput data on the order of hundreds of thousands of data items measured on hundreds of variables. In [33], a kernel-based clustering method for gene selection with gene expression data has been discussed, in which weights are learned iteratively while optimizing the clustering objective function. Handhayani and Hiryanto [34] have presented intelligent kernel k-means for clustering gene expression data without prior knowledge of the clusters.
Variable weighting of features in gene expression data clustering helps identify meaningful clusters with improved performance. It also overcomes the problem of equal weighting of features in KM, K-medoids, and FCM-based algorithms. Cluster-dependent feature weights have two main advantages: first, they guide the clustering process to partition the data into more meaningful clusters; second, they can be used in the subsequent steps of a learning system to improve its learning behavior. One more advantage of feature-dependent clustering is that the weighted features simulate more dimensions for identifying meaningful clusters when the data are sparse and contain several outliers.
The rest of the paper is organized as follows: Section II presents the robust FCM with feature discrimination. Experiments and comparisons with state-of-the-art methods are discussed in Section III. Finally, Section IV concludes the paper.
The major contributions are briefly summarized as follows:
(a) Features of data play a crucial role in characterizing them into different groups. The effect of these features is incorporated in cluster identification by simultaneously weighting them in an unsupervised manner. As shown in (1), the objective function of FCM is modified by the monotonically decreasing function, which controls the weights of the features during the clustering process.
(b) The weighted feature helps to eliminate the problem of trivial solutions during the clustering process since each feature in the clusters is multiplied by different weights based on the dispersion of the feature in the cluster.
(c) The weighted feature helps to simulate more dimensions for identifying meaningful clusters when data is sparse and contains more outliers.

II. PROPOSED METHODOLOGY: ROBUST FCM WITH FEATURE DISCRIMINATION
The features in the data may be relevant or irrelevant, but they contribute differently to the clustering process. Accounting for these different contributions improves clustering performance and helps identify the optimal number of clusters in the data. In this section, the fuzzy c-means objective function is modified using a weighted Euclidean distance between the features with a monotonically decreasing function, which controls the contribution of the features during the clustering process. The monotonically decreasing function also avoids trivial solutions during the clustering process since the feature weight is always finite. The proposed approach is described as follows. Let us consider a data matrix $x = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n}$, where $p$ and $n$ are the number of features and the number of samples, respectively. To group the data matrix $x$ into $c$ clusters, the objective function $F$ in (1) is minimized, where $U = [\mu_{ij}]$ is a $c \times n$ fuzzy partition matrix, $\mu_{ij}$ is the membership degree of the $j$th sample in the $i$th cluster, $m$ is any real number greater than 1, $V = [v_{ik}]$ is a $c \times p$ matrix containing the cluster centers, $W = [w_{ik}]$ is a $c \times p$ weight matrix, $w_{ik}$ is the weight of the $k$th feature in the $i$th cluster, and $\eta_i$ is the input parameter used to control the feature weights.
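To make the notation concrete, the following minimal Python sketch computes a feature-weighted squared Euclidean distance between every sample and every cluster center. It assumes the common form $d_{ij}^2 = \sum_k w_{ik}(x_{jk} - v_{ik})^2$; the exact weighting used in (1), including the monotonically decreasing function, is not reproduced here. The function name `weighted_sq_dist` is hypothetical, and samples are stored as rows (the transpose of the $p \times n$ data matrix) purely for convenience.

```python
def weighted_sq_dist(X, V, W):
    """Return D with D[i][j] = sum_k W[i][k] * (X[j][k] - V[i][k]) ** 2.

    X: n samples, each a list of p feature values (rows are samples here,
       i.e., the transpose of the p x n data matrix used in the text).
    V: c cluster centers, each of length p.
    W: c weight vectors, each of length p (assumed form of the weighting).
    """
    return [
        [sum(w * (x - v) ** 2 for w, x, v in zip(w_i, x_j, v_i)) for x_j in X]
        for v_i, w_i in zip(V, W)
    ]


X = [[0.0, 0.0], [1.0, 10.0], [2.0, -5.0]]  # n = 3 samples, p = 2 features
V = [[0.0, 0.0], [2.0, 0.0]]                # c = 2 cluster centers
W = [[1.0, 0.0], [0.5, 0.5]]                # cluster 0 ignores the second feature
D = weighted_sq_dist(X, V, W)               # c x n matrix of weighted distances
```

With a zero weight on the second feature, the first cluster measures distance along the first feature only, which is the sense in which a feature can be irrelevant to one cluster and relevant to another.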
The objective function proposed in (1) defines a constrained nonlinear optimization problem whose solution is unknown. The main objective is to minimize $F$ with respect to $U$, $V$, and $W$ using alternating optimization. In the alternating optimization, we first fix $U = \hat{U}$ and $W = \hat{W}$ and minimize $F$ with respect to $V$. Then we fix $U = \hat{U}$ and $V = \hat{V}$ and minimize $F$ with respect to $W$. Afterward, we fix $V = \hat{V}$ and $W = \hat{W}$ and minimize $F$ with respect to $U$. First, for fixed $U$ and $W$, the objective function is minimized with respect to $V$; from (3), the cluster center can be obtained using (4). Two cases arise depending on the value of $w_{ik}$.
Case 1: If $w_{ik} = 0$, the $k$th feature is completely irrelevant relative to the $i$th cluster. Hence, for any value of $v_{ik}$, the values of this feature will not contribute to the overall weighted distance computation. Therefore, in this case, any random value can be chosen for $v_{ik}$.
Case 2: If $w_{ik} \neq 0$, the $k$th feature has some relevance to the $i$th cluster, and (4) is written as (5).
Then, for given $V = \hat{V}$, the constrained optimization problem in (1) is converted into an unconstrained minimization problem using the Lagrange multiplier technique, as given in (6), where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ and $\delta = [\delta_1, \delta_2, \ldots, \delta_k]$ are the vectors containing the Lagrange multipliers. If $(\hat{U}, \hat{W}, \hat{\alpha}, \hat{\delta})$ is an optimal point of the Lagrangian in (6), then the gradient with respect to these variables vanishes, which gives (7)-(10). Equations (7) and (8) can be simplified to (11) and (12). Substituting (11) in (9) gives (13) and, similarly, substituting (12) in (10) gives (14). By simplifying (13) and (11), we obtain the fuzzy partition, i.e., the membership of the $j$th sample in the $i$th cluster, as given in (15). Similarly, from (14) and (12), we obtain the weight of the $k$th feature in the $i$th cluster, as given in (16).
The choice of $\eta_i$ in (16) is essential to the performance of the proposed approach since it reflects the importance of the second term relative to the first term in (1). The control parameter $\eta_i$ is used to control the feature weights: since $\eta_i$ is positive, the weight $w_{ik}$ given in (16) is inversely proportional to the total dispersion of the $i$th cluster over all features in the data. A small value of this term yields a large $w_{ik}$, i.e., the $k$th feature is more important in the $i$th cluster. If $\eta_i$ in (16) is too large, the second term will dominate, and all features in cluster $i$ will be relevant and assigned equal weights of $1/p$. The parameter $\eta_i$ in (16) is computed by (17) at iteration $t$, where $K$ is a constant and the superscript $(t-1)$ denotes values from iteration $(t-1)$. The proposed approach is briefly summarized in Algorithm 1.
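The role of $\eta_i$ described above can be illustrated with a small Python sketch. Since (16) is not reproduced here, the sketch assumes a weight update of the SCAD1 type [25], in which each weight equals $1/p$ plus a correction that grows when a feature's dispersion is below the cluster's average dispersion and that is scaled by $1/(2\eta_i)$. The function name `feature_weights`, the clipping of negative weights, and the toy numbers are all assumptions made for illustration.

```python
def feature_weights(sq_dev, um, eta):
    """Assumed SCAD1-type weight update for a single cluster.

    sq_dev[j][k]: squared deviation (x_jk - v_ik)^2 of sample j on feature k.
    um[j]: membership of sample j in the cluster, raised to the fuzzifier m.
    eta: positive control parameter of the cluster.
    """
    p = len(sq_dev[0])
    raw = []
    for k in range(p):
        # Positive when feature k is less dispersed than the average feature.
        corr = sum(u * (sum(d) / p - d[k]) for u, d in zip(um, sq_dev))
        raw.append(max(1.0 / p + corr / (2.0 * eta), 0.0))  # clip negative weights
    total = sum(raw)
    return [w / total for w in raw]  # renormalize so the weights sum to 1


# Feature 0 is compact within the cluster, feature 1 is widely dispersed.
sq_dev = [[0.01, 1.0], [0.04, 4.0], [0.09, 2.25]]
um = [1.0, 1.0, 1.0]
w_moderate_eta = feature_weights(sq_dev, um, eta=10.0)
w_large_eta = feature_weights(sq_dev, um, eta=1e6)
```

Under these assumptions, a moderate `eta` gives the compact feature the larger weight, while a very large `eta` drives both weights toward $1/p = 0.5$, matching the behavior described in the text.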

III. EXPERIMENTAL RESULTS AND ANALYSIS
This section presents the performance of robust FCM with feature discrimination on binary and multi-class gene expression datasets. The proposed approach is also investigated on other biological datasets, as shown in Table II. The proposed approach and the state-of-the-art methods are validated on the same platform, and the detailed experiments are explained in the following subsections.

Algorithm 1 Robust FCM With Simultaneous Feature Discrimination
1: Give the number of clusters $c$, set the initial iteration number $t = 0$, randomly initialize the cluster centers $V^{(0)}$, randomly initialize the feature weights $W^{(0)}$, and randomly initialize $\eta_0$
2: Repeat
3: Update the feature weight matrix $W$ using (16)
4: Update the partition matrix $U$ using (15)
5: Update the cluster center matrix $V$ using (5)
6: Update the parameter $\eta_i$ using (17)
7: $t = t + 1$
8: Until the objective function reaches a local minimum or the centers stabilize
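The loop structure of Algorithm 1 can be sketched in Python as follows. This is not the authors' implementation: because (5) and (15)-(17) are not reproduced here, the sketch substitutes update rules of the SCAD1 type [25] (FCM-style memberships on weighted distances, fuzzy-weighted means as centers, weights clipped at zero and renormalized, and $\eta_i$ set to $K$ times the ratio of cluster dispersion to the sum of squared weights). The names `memberships` and `fit`, the optional `init` argument, the fixed iteration count, and the equal initial weights are assumptions.

```python
import random


def memberships(D, m):
    """FCM-style memberships from a c x n matrix of weighted squared distances."""
    c, n = len(D), len(D[0])
    U = [[0.0] * n for _ in range(c)]
    for j in range(n):
        zeros = [i for i in range(c) if D[i][j] <= 1e-12]
        if zeros:  # sample sits on a center: share membership among those centers
            for i in zeros:
                U[i][j] = 1.0 / len(zeros)
            continue
        inv = [(1.0 / D[i][j]) ** (1.0 / (m - 1.0)) for i in range(c)]
        for i in range(c):
            U[i][j] = inv[i] / sum(inv)
    return U


def fit(X, c, m=2.0, K=1.0, iters=30, seed=0, init=None):
    """Alternating updates in the order of Algorithm 1 (assumed update formulas)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    V = [list(v) for v in (init if init is not None else rng.sample(X, c))]
    W = [[1.0 / p] * p for _ in range(c)]  # equal initial weights (assumption)
    eta = [1.0] * c                        # initial control parameters (assumption)

    def sq_dev(i, j):
        return [(X[j][k] - V[i][k]) ** 2 for k in range(p)]

    def distances():
        return [[sum(w * d for w, d in zip(W[i], sq_dev(i, j))) for j in range(n)]
                for i in range(c)]

    U = memberships(distances(), m)        # initial partition from V and W
    for _ in range(iters):
        for i in range(c):                 # step 3: feature weights
            raw = [1.0 / p] * p
            for j in range(n):
                d = sq_dev(i, j)
                for k in range(p):
                    raw[k] += U[i][j] ** m * (sum(d) / p - d[k]) / (2.0 * eta[i])
            raw = [max(w, 0.0) for w in raw]
            W[i] = [w / sum(raw) for w in raw]
        U = memberships(distances(), m)    # step 4: fuzzy partition
        for i in range(c):                 # step 5: cluster centers
            um = [U[i][j] ** m for j in range(n)]
            if sum(um) > 0.0:
                V[i] = [sum(u * x[k] for u, x in zip(um, X)) / sum(um) for k in range(p)]
        D = distances()
        for i in range(c):                 # step 6: control parameters
            num = sum(U[i][j] ** m * D[i][j] for j in range(n))
            eta[i] = max(K * num / sum(w * w for w in W[i]), 1e-12)
    return U, V, W


X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.2], [5.0, 5.0], [5.2, 5.1], [5.1, 4.9]]
U, V, W = fit(X, c=2, init=[[0.0, 0.0], [5.0, 5.0]])
labels = [max(range(2), key=lambda i: U[i][j]) for j in range(len(X))]
```

The toy data contain two well-separated groups, and the centers are seeded with one point from each group so that the outcome does not depend on the random initialization. A convergence test on the objective value or on the change in centers, as in step 8, could replace the fixed iteration count.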

A. Performance Measure and Cluster Validation
The performance of the proposed approach is analysed in terms of various clustering performance measures, namely accuracy rate (AR), Rand index (RI) [41], fuzzy Rand index (FRI) [42], and normalized mutual information (NMI) [43]. The AR in (18) is $AR = \frac{1}{n}\sum_{j=1}^{c} n(c_j)$, where $n(c_j)$ is the number of data points correctly retrieved in cluster $j$ and $n$ is the total number of data points. A large value of AR represents better clustering performance. The RI defined in (19) measures the similarity between two clustering partitions. Let $c$ denote the true clustering and $c^*$ the clustering obtained through the clustering algorithm. Over all pairs of points $(x_i, x_j)$, $f_1$ is the number of pairs that are in the same cluster in both $c$ and $c^*$, $f_2$ is the number of pairs that are in the same cluster in $c$ and in different clusters in $c^*$, $f_3$ is the number of pairs that are in different clusters in $c$ and in the same cluster in $c^*$, and $f_4$ is the number of pairs that are in different clusters in both $c$ and $c^*$, so that $RI = (f_1 + f_4)/(f_1 + f_2 + f_3 + f_4)$. NMI measures the information a sample contributes to making the correct classification decision, and its value always lies between 0 and 1. As shown in Table I, the performance of the proposed approach is better in terms of AR, RI, and NMI, except for the colon and lung cancer datasets, where its NMI is smaller than that of the state-of-the-art methods. These clustering performance measures indicate that the proposed clustering process with feature discrimination provides a better partition of the dataset, whether the dataset is high-dimensional or sparse. The above clustering performance measures are supervised, i.e., the number of clusters $c$ is known. However, it is generally unknown in clustering.
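As a worked illustration of AR and RI, the following Python sketch computes both from a list of true labels and a list of predicted cluster labels. The pair counts $f_1$ to $f_4$ follow the description above. For AR, the sketch assumes that the points "correctly retrieved" in a cluster are those belonging to the cluster's majority class; this matching rule and the function names are assumptions, and other matching schemes are possible.

```python
from collections import Counter
from itertools import combinations


def rand_index(true_labels, pred_labels):
    """Rand index (f1 + f4) / (f1 + f2 + f3 + f4) over all pairs of points."""
    f1 = f2 = f3 = f4 = 0
    for a, b in combinations(range(len(true_labels)), 2):
        same_true = true_labels[a] == true_labels[b]
        same_pred = pred_labels[a] == pred_labels[b]
        if same_true and same_pred:
            f1 += 1  # together in both partitions
        elif same_true:
            f2 += 1  # together in the truth, split by the algorithm
        elif same_pred:
            f3 += 1  # split in the truth, merged by the algorithm
        else:
            f4 += 1  # apart in both partitions
    return (f1 + f4) / (f1 + f2 + f3 + f4)


def accuracy_rate(true_labels, pred_labels):
    """AR with majority-class matching per cluster (an assumed matching rule)."""
    correct = 0
    for cluster in set(pred_labels):
        members = [t for t, q in zip(true_labels, pred_labels) if q == cluster]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(true_labels)


true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 0, 0, 0, 0]
ri = rand_index(true_labels, pred_labels)     # f1=4, f2=2, f3=3, f4=6 -> 10/15
ar = accuracy_rate(true_labels, pred_labels)  # 2 + 3 correct out of 6 -> 5/6
```

Both measures are invariant to how the predicted clusters are numbered, which is why the example uses predicted labels that are swapped relative to the true ones.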
Therefore, to show the effectiveness of the proposed approach, we have used different cluster validity indices, i.e., Partition Coefficient (PC) [44], Xie and Beni's (XB) Index [45], Classification Entropy (CE) [46], and Dunn's Index (DI) [47], computed as defined in the cited references. The value of PC always lies in $[1/c, 1]$: a PC close to 1 represents the best partition, whereas the closer it is to $1/c$, the fuzzier the partition becomes. The CE measures the fuzziness of the cluster partition; smaller values of CE represent a better partition. In XB, the numerator measures the compactness of the fuzzy partition and the denominator measures the separation between clusters, so smaller values of XB indicate a better partition. DI measures the compactness of well-separated clusters; a large value of DI represents better clustering.
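Two of these indices depend only on the fuzzy partition matrix and are easy to sketch in Python. The code below assumes the standard definitions $PC = \frac{1}{n}\sum_i\sum_j \mu_{ij}^2$ and $CE = -\frac{1}{n}\sum_i\sum_j \mu_{ij}\log\mu_{ij}$ with the natural logarithm; the definitions used in the paper are not reproduced here and may differ, for example in the base of the logarithm. The function names are hypothetical.

```python
import math


def partition_coefficient(U):
    """PC = (1/n) * sum of squared memberships; U is a c x n partition matrix."""
    n = len(U[0])
    return sum(u * u for row in U for u in row) / n


def classification_entropy(U):
    """CE = -(1/n) * sum of u * ln(u), with 0 * ln(0) taken as 0."""
    n = len(U[0])
    return -sum(u * math.log(u) for row in U for u in row if u > 0.0) / n


crisp = [[1.0, 0.0], [0.0, 1.0]]  # every sample belongs fully to one cluster
fuzzy = [[0.5, 0.5], [0.5, 0.5]]  # every sample is split evenly (c = 2)

pc_crisp, ce_crisp = partition_coefficient(crisp), classification_entropy(crisp)
pc_fuzzy, ce_fuzzy = partition_coefficient(fuzzy), classification_entropy(fuzzy)
```

The two toy partitions mark the extremes: the crisp partition attains the upper bound of PC and zero entropy, while the maximally fuzzy partition attains the lower bound $1/c$ of PC and the largest entropy for two clusters.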
The cluster validity indices, i.e., PC, CE, XB, and DI, are computed and compared with state-of-the-art methods, as given in Table III. The cluster validity measures show that the proposed approach can provide the optimal clustering result with improved clustering performance in terms of AR, RI, FRI, and NMI.

B. Effect of Parameter Variation on Clustering Performance
In this subsection, we investigate the effect of parameter variation on the performance of the proposed approach and of the state-of-the-art methods. The proposed approach is validated on small, high-dimensional, and sparse datasets and compared with seven different state-of-the-art methods, i.e., K-means (KM) [11], entropy weighted K-means (EWKM) [21], agglomerative fuzzy K-means (AFKM) [35], fuzzy c-means (FCM) [15], simultaneous clustering and attribute discrimination (SCAD1 and SCAD2) [25], and feature-reduction FCM (FRFCM) [26]. The KM algorithm gives equal importance to all features because it cannot discriminate between relevant and irrelevant features. This problem is eliminated by the EWKM, in which different feature weights are assigned during the clustering. However, the weight values depend on the initial cluster centers; if these change, the algorithm yields different performance and may fail to converge. In the AFKM, the entropy helps to find the optimal clusters, but the problem remains the same, i.e., all features have equal importance during the clustering. FCM produces a soft partition of the data, but it still treats all features equally during the clustering process. In SCAD, each feature has a different weight; however, the data partition is not consistent. In the proposed approach, the monotonically decreasing function controls the weights of the features during the clustering and helps find the optimal number of partitions. The weight and fuzzy partition control parameters of the EWKM and AFKM are set heuristically; for SCAD, the weight control parameters are updated automatically in each iteration, but the fuzzy partition is not consistent. In FRFCM, the weight control parameters discard the unimportant features but are unable to preserve the actual clusters of high-dimensional datasets.
In the proposed approach, the weight control parameter is updated automatically in each iteration, and an optimal partition is obtained due to the relevant feature weight assignment during the clustering process. The computational complexity of the proposed approach is also computed and compared with state-of-the-art methods, as shown in Table IV. In the proposed approach, the hyperparameter $m \in (1, 3]$ and the variable $K \in [1, 10]$ are varied to find the best partition of the data. As shown in Figs. 1 and 2, for the IRIS and ovarian cancer datasets, the AR, RI, and NMI values are plotted for five different values of $K$ at the fixed hyperparameter $m = 2$. In the case of SCAD2, the weight exponent $q$ is varied over $q \in [2, 10]$, whereas in the case of SCAD1, the variable $K$ is varied over $K \in [10, 100]$; $m = 2$ in both cases. For the EWKM and AFKM, the variable $K$ is varied over $K \in [10, 100]$, and for the FRFCM, the variable $K$ is varied over $K \in [1, 10]$ with $m = 2$. Figs. 1 and 2 show that the performance of the proposed approach is better and more stable than that of the state-of-the-art methods. In Fig. 3, the partition coefficient is also plotted for the ALLAML, lymphoma, and lung datasets under different parameter values, which shows that the proposed approach is able to provide the optimal clusters in these datasets.

IV. CONCLUSION
This paper has presented a novel algorithm for clustering high-dimensional data with improved performance. In this approach, the objective function of FCM is changed by adding the monotonically decreasing function, which controls the weights of the features during the clustering process and helps in identifying a better clustering structure of the data. The major advantage of the proposed approach is that the clustering performance is consistent and insensitive to the initial cluster centers due to the different feature weights assigned to each cluster in the clustering process. The clustering performance is compared with various state-of-the-art methods, and the comparison shows that the proposed approach partitions the data with improved performance. In the future, the correlation between the features can be considered to minimize the effect of redundant features, and the proposed approach can be extended to mixed attributes.