A Novel Cluster Prediction Approach Based on Locality-Sensitive Hashing for Fuzzy Clustering of Categorical Data

This paper addresses the problem of fuzzy clustering for categorical data. During the last two decades, many attempts have been made to extend the k-means algorithm to categorical data, owing to its simplicity and efficiency. However, since k-means-like algorithms are local optimization methods, their clustering results are highly sensitive to initialization. In this paper, we propose to use Locality-Sensitive Hashing (LSH) to reduce the dimensionality of categorical data and predict the initial fuzzy clusters in the low-dimensional space. Unlike existing cluster initialization optimization methods, which create only crisp initial clusters, the proposed method predicts 'high quality' fuzzy clusters at the initialization step before proceeding in the k-means-like fashion. The numerical results show that the proposed method yields relatively accurate results on 16 UCI datasets and outperforms all other related approaches in terms of both crisp and fuzzy clustering effectiveness.


I. INTRODUCTION
Unsupervised learning is an important research branch in the machine learning field, especially in the context of big data. While supervised learning requires labeling the data for training learning models, unsupervised learning aims at directly exploring the correlations or hidden information from the unlabeled data [1]. In the era of big data, the data labeling task is extremely costly and time-consuming, and unsupervised learning has recently gained increasing attention [2,3].
Clustering is one of the most important techniques in unsupervised learning [4]. Basically, the aim of clustering is to find naturally formed groups in the data so that similar objects are placed in the same cluster while dissimilar objects are placed in different clusters [5]. Cluster analysis approaches can be classified into two categories: hierarchical clustering seeks to build a hierarchy of clusters, while flat clustering seeks to find distinct clusters in the data [4]. In practice, flat clustering techniques are more prominent for their intelligibility and lower complexity. Several families of algorithms can handle the flat clustering problem, such as Expectation-Maximization (EM) [6], Genetic Algorithms (GA) [7], Self-Organizing Map (SOM) [8], Cuckoo search [9], and k-means-like algorithms [4,10,11,12]. Among them, k-means-like algorithms use a representation of each cluster center and seek to minimize the total distance between objects and their nearest cluster representations [4]. However, for special data types such as categorical data, the cluster representation must be defined differently. Note that categorical data is the kind of data that cannot be represented by numerical values, such as the shape or color of an object. For that reason, the center of gravity of a categorical cluster cannot be calculated directly. In that case, medoids [13], modes [14], representatives [15,16], or centers [17] can be used instead of the means used for numerical data. However, with different kinds of cluster representations, the distance measure between objects and representations must be defined appropriately [18].
Because the clusters produced by k-means-like algorithms are crisp, the labels of objects lying near the cluster boundaries might be inaccurately represented. This weakness can be addressed by fuzzy clustering approaches [19,20]. In the context of fuzzy clustering, all objects share their membership with all clusters in such a way that the membership degrees are inversely proportional to the distances between the objects and the corresponding cluster representations. Several k-means-like algorithms can tackle the fuzzy clustering problem for categorical data effectively, namely Fuzzy k-modes [21] and Fuzzy k-representatives [16]. However, their results are still not stable because k-means-like algorithms are locally optimal methods [4].
It is obvious that a good initial state can lead to better optimization for k-means-like algorithms [4,22,23]. For crisp categorical data clustering, several attempts have been made to predict the initial clusters at the initialization stage to improve the efficiency of clustering algorithms [14,22,23,24]. Recently, in [24], we proposed using the Locality-Sensitive Hashing (LSH) technique to predict the initial clusters for categorical clustering, which yields better results than other initialization methods without consuming considerable computation. In this paper, we further extend the LSH-based approach to predicting initial fuzzy clusters for fuzzy clustering of categorical data, and apply it to a newly developed fuzzy clustering algorithm called Fk-centers [17]. Note that most previously proposed initialization methods only aimed to generate crisp clusters at the initial stage of fuzzy clustering.
Briefly, the contributions of this paper can be summarized as follows:
• We first develop a new LSH-based initialization technique to predict initial fuzzy clusters for categorical data, and then incorporate it into the Fk-centers algorithm, resulting in the so-called LSHFk-centers algorithm.
• We design a series of comprehensive experiments for testing the clustering effectiveness of the proposed method compared with state-of-the-art fuzzy categorical data clustering methods.
The rest of the paper is organized as follows: Section II states the categorical data fuzzy clustering problem, reviews the literature, and presents the research background. The principle of the proposed method is detailed in Section III. After that, Section IV presents the experimental designs and results. Finally, Section V gives the conclusion and possible future work.

A. PROBLEM STATEMENT
The important notations of the fuzzy clustering problem are defined as follows. Denote X = {x_1, ..., x_N} as the dataset comprising N categorical objects to be fuzzily clustered into k fuzzy clusters. Each categorical object x_i in the dataset X is a vector of D categorical values, x_i = [x_{i1}, ..., x_{iD}] (i = 1, ..., N), where D is the number of attributes or dimensions of dataset X. Let A_d (d = 1, ..., D) denote the domain of the d-th attribute, which contains all unique categorical values of that attribute; the unique categorical values of different attributes are distinct and independent of each other.
Note that k is the target number of fuzzy clusters in the dataset X, and a fuzzy cluster is a group of objects, each of which shares its membership degrees with all fuzzy clusters. Let u_{ij} be the membership degree of object x_i (i = 1, ..., N) to the j-th cluster (j = 1, ..., k), such that the greater the value of u_{ij}, the stronger the linkage between object x_i and the j-th cluster. The membership degrees satisfy the following constraints:

u_{ij} \in [0, 1], \quad \sum_{j=1}^{k} u_{ij} = 1 \quad (i = 1, \ldots, N)    (1)

Then, the membership matrix U = [u_{ij}]_{N \times k} fully describes the clustering status of all objects with respect to all clusters.
To find the optimal values of the membership matrix U, different evaluation functions can be applied, such as minimizing the total distance between objects within the same cluster or maximizing the total distance between objects in different clusters. For k-means-like algorithms, each cluster is represented by a representation, and the algorithms seek to minimize the total distance from objects to their nearest representations. Therefore, the set of representations is also one of the outputs of fuzzy clustering algorithms.

B. DISSIMILARITY MEASURES FOR CLUSTER ANALYSIS
Dissimilarity measures support cluster formation by quantifying how similar or dissimilar objects in the same or different clusters are. For metric spaces, the basic Euclidean metric can be applied with high efficiency. However, the situation becomes more complex when dealing with categorical data because categorical values are discrete and unordered. The simplest technique to measure the dissimilarity between two categorical objects is the overlap function, which counts the number of mismatched values over all attributes and uses that count as the distance:

dis(x_i, x_j) = \sum_{d=1}^{D} \delta_o(x_{id}, x_{jd}), \quad \delta_o(a, b) = \begin{cases} 0 & \text{if } a = b \\ 1 & \text{otherwise} \end{cases}    (2)

The overlap function is simple and easy to deploy. However, it cannot work effectively in some cases, such as subdividing a domain into multiple clusters of similar categorical values [17]. In such cases, a context-based dissimilarity measure can show better performance by providing a richer value range. Distance Learning Dissimilarity for Categorical Data (DILCA) is the first context-based dissimilarity measure [25]. More specifically, DILCA defines the distance between two values a_{di}, a_{dj} of categorical attribute A_d, denoted δ(a_{di}, a_{dj}), based on how the values of the other attributes within the context of A_d are distributed over the data objects, where the context of A_d consists of the attributes that are highly correlated with A_d. Formally,

\delta(a_{di}, a_{dj}) = \sqrt{ \frac{ \sum_{A \in \mathrm{context}(A_d)} \sum_{a \in A} \big( \Pr(a_{di} \mid a) - \Pr(a_{dj} \mid a) \big)^2 }{ \sum_{A \in \mathrm{context}(A_d)} |A| } }    (3)

where Pr(a_{di} | a) is the conditional probability of a_{di} given the value a, and N_d denotes the number of categorical values in A_d.
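The overlap function above can be sketched in a few lines (a minimal illustration; the function name is ours):

```python
def overlap_distance(x, y):
    """Overlap dissimilarity: count the attributes on which two
    categorical objects disagree (0 per match, 1 per mismatch)."""
    return sum(a != b for a, b in zip(x, y))
```

For example, `overlap_distance(["red", "round", "small"], ["red", "square", "small"])` returns 1, because the objects differ only on the second attribute. Note that every mismatch contributes the same unit cost, which is exactly the limitation that context-based measures such as DILCA address.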

C. FUZZY K-MEANS
Fuzzy k-means (Fk-means for short) is the basic approach for conducting fuzzy cluster analysis on numeric data. Because all k-means-like algorithms for fuzzy clustering are based on Fk-means, we briefly review it as follows.
The center of gravity (centroid) is used to represent each cluster. Denote C = {c_1, ..., c_k} as the set of k centroids, where each centroid c_j is a vector in the same metric space as the data objects. The membership degrees of each object are computed from the reciprocals of the distances between that object and all cluster centroids:

u_{ij} = \left[ \sum_{t=1}^{k} \left( \frac{d(x_i, c_j)}{d(x_i, c_t)} \right)^{\frac{2}{\alpha - 1}} \right]^{-1}    (4)

where α is the parameter controlling the degree of fuzziness, with α ∈ (1, +∞); as α approaches 1, the fuzzy cluster analysis model reduces to crisp cluster analysis.
Fk-means recalculates the centroids of all clusters as fuzzily weighted centers of gravity whenever the membership matrix is updated:

c_j = \frac{ \sum_{i=1}^{N} u_{ij}^{\alpha} \, x_i }{ \sum_{i=1}^{N} u_{ij}^{\alpha} }    (5)

Note that the fuzziness parameter α is also used when updating the centroids, determining the different magnitudes of influence of objects near to and far from the cluster centroids.
To evaluate the convergence of the Fk-means algorithm, the total fuzzy distance from all objects to all cluster centroids is used:

P(U, C) = \sum_{j=1}^{k} \sum_{i=1}^{N} u_{ij}^{\alpha} \, d(x_i, c_j)    (6)

Technically, a lower value of P(U, C) indicates a better clustering outcome.
The above three equations are the foundational principles of Fk-means, and the algorithm can be briefly described in the following four steps:
• Step 1: Randomly select k objects and let them be the initial centroids of the k clusters.
• Step 2: Calculate/recalculate the membership matrix U following equation (4).
• Step 3: Recalculate the centroids C following equation (5).
• Step 4: Repeat Step 2 and Step 3 until the value P(U, C) (see equation (6)) converges or the maximum number of iterations is reached.
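The four steps above can be sketched compactly in NumPy. This is an illustrative sketch only (Euclidean distance, function and parameter names are ours), not the paper's implementation:

```python
import numpy as np

def fuzzy_kmeans(X, k, alpha=2.0, iter_max=100, tol=1e-6, seed=0):
    """A compact Fk-means sketch following Steps 1-4 (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]   # Step 1: random initial centroids
    prev_cost = np.inf
    for _ in range(iter_max):
        # Pairwise object-centroid distances; tiny epsilon avoids division by zero.
        dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U = dist ** (-2.0 / (alpha - 1.0))             # Step 2: memberships, eq. (4)
        U /= U.sum(axis=1, keepdims=True)
        W = U ** alpha
        C = (W.T @ X) / W.sum(axis=0)[:, None]         # Step 3: centroids, eq. (5)
        cost = float((W * dist).sum())                 # objective P(U, C), eq. (6)
        if abs(prev_cost - cost) < tol:                # Step 4: convergence check
            break
        prev_cost = cost
    return U, C
```

Each row of the returned membership matrix sums to 1, matching constraint (1).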

D. K-MODES AND FUZZY K-MODES
k-modes was the first attempt to handle crisp clustering of categorical data, and fuzzy k-modes (Fk-modes) is its extension for fuzzy clustering. k-modes and Fk-modes use the categorical vector whose values have the highest frequencies on each attribute as the cluster's mode. Let m_j = [m_{j1}, ..., m_{jD}] be the mode of the j-th cluster; each component of m_j is given by

m_{jd} = \arg\max_{v \in A_d} \sum_{i=1}^{N} u_{ij}^{\alpha} \, \mathbb{1}(x_{id} = v)    (7)

where 1(·) is the indicator function (for crisp k-modes, u_{ij} ∈ {0, 1}). Therefore, the modes of k-modes and Fk-modes are categorical vectors, which implies that the membership degree formula in equation (4) can be used.
When using modes, there is a high probability that the information of values with high frequencies will be lost, especially for domains with many unique categorical values [15,16]. San et al. [15] proposed k-representatives to deal with this problem by capturing the probabilities of all unique categorical values in the so-called representatives. Let r_j = [r_{j1}, ..., r_{jD}] be the representative of the j-th cluster. A representative is defined as

r_{jd} = \{ (v, \mathrm{Pro}_j(v)) \mid v \in A_d \}    (8)

where Pro_j(v) is the relative frequency of the categorical value v within the j-th cluster. It is clear that representatives are not categorical vectors, so equation (2) cannot be used for calculating the distance from an object to a representative. San et al. [15] also introduced a simple function for this task:

d(x_i, r_j) = \sum_{d=1}^{D} \big( 1 - \mathrm{Pro}_j(x_{id}) \big)    (9)

To extend k-representatives to fuzzy clustering of categorical data, Kim et al. [16] proposed fuzzy centroids, which expand representatives as follows:

\tilde{c}_{jd} = \{ (v, \mathrm{FPro}_{jd}(v)) \mid v \in A_d \}    (10)

where FPro_{jd}(v) is the probability of categorical value v in the j-th cluster on the d-th attribute in terms of fuzzy membership:

\mathrm{FPro}_{jd}(v) = \frac{ \sum_{i=1}^{N} u_{ij}^{\alpha} \, \mathbb{1}(x_{id} = v) }{ \sum_{i=1}^{N} u_{ij}^{\alpha} }    (11)

Continuing the line of work on k-representatives, Chen et al. [26] introduced a kernel combination of the uniform distribution and the observed distribution to better represent a cluster. In detail, Pro_j(v) in equation (8) is replaced by the kernel probability estimate KPro_j(v), defined as

\mathrm{KPro}_j(v) = \lambda_j \frac{1}{N_d} + (1 - \lambda_j) \, \mathrm{Pro}_j(v)    (12)

where λ_j is the smoothing parameter adjusting the contribution of the uniform distribution to the j-th center. With Least Squares Cross-Validation (LSCV) [27], the optimal value of λ_j can be learned statistically from the distribution of categorical values on the attribute, expressed in terms of Frequency_j(v), the frequency of appearance of the categorical value v in the j-th cluster.
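The kernel probability estimate KPro can be illustrated as follows, assuming the common mixture form KPro_j(v) = λ_j/N_d + (1 − λ_j)·Pro_j(v), where N_d is the number of values in the domain; the function name and signature are ours:

```python
from collections import Counter

def kernel_prob(cluster_values, domain, lam):
    """Kernel-smoothed probability of each categorical value in a cluster:
    a mixture of the uniform distribution over the domain and the observed
    relative frequencies, weighted by the smoothing parameter lam."""
    counts = Counter(cluster_values)
    total = len(cluster_values)
    return {v: lam / len(domain) + (1 - lam) * counts[v] / total for v in domain}
```

With `lam=0` the estimate reduces to the observed frequencies (plain Pro_j), and with `lam=1` it becomes the uniform distribution; LSCV would pick a value in between based on the data.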

G. PREVIOUS WORK: FUZZY K-CENTERS
Recently, Nguyen et al. [17] proposed a fuzzy clustering algorithm called Fk-centers, which extends the center to the f-center to make it workable for fuzzy clustering. In particular, the smoothing parameter in equation (13) is modified to match the concept of fuzzy clusters by replacing the crisp frequency with FFrequency_j(v), the fuzzy frequency of the categorical value v in the j-th fuzzy cluster:

\mathrm{FFrequency}_j(v) = \sum_{i=1}^{N} u_{ij}^{\alpha} \, \mathbb{1}(x_{id} = v)    (14)

Because the structure of f-centers is the same as that of centers, the dissimilarity measures can be reused. As such, in this paper the f-center is selected for representing fuzzy clusters. Moreover, we focus on enhancing the performance of the Fk-centers algorithm by optimizing its initialization process.

H. LOCALITY-SENSITIVE HASHING (LSH)
LSH is a well-known dimension reduction method widely used for approximate nearest neighbor search in high-dimensional big data [28,29,30]. Technically, LSH uses a family of locality-sensitive hash functions to project high-dimensional data into a lower-dimensional space while seeking to preserve the similarity relations of the original space.
Regarding the principle of LSH, a family of l hash functions H = [h_1, ..., h_l] is called (R, cR, p_1, p_2)-sensitive if, for any objects x_i and x_j:
• if d(x_i, x_j) ≤ R, then Pr[h(x_i) = h(x_j)] ≥ p_1 for every h ∈ H;
• if d(x_i, x_j) ≥ cR, then Pr[h(x_i) = h(x_j)] ≤ p_2 for every h ∈ H,
with c > 1 and p_1 > p_2. Several kinds of hash functions can have good (R, cR, p_1, p_2)-sensitivity, such as random projection functions, random hyperplane functions, or threshold functions.
For fast traversal and hash value storage, we build a hash table to store the hash values of all objects in the data. In detail, each locality-sensitive hash function gives a binary value for an input object vector. Therefore, l hash functions can give at most 2^l different hash values in total. When the number of objects N is much greater than 2^l (N ≫ 2^l), numerous objects share the same hash value. The objects having the same hash value are stored in the same bucket in the hash table, and the key of a bucket is the common hash value of the objects in it. Interestingly, the keys can be used to determine the similarity of objects in different buckets by taking the Hamming distances between them.
Certainly, building the hash table is fast because the hash functions can be conducted independently. In this research, we take advantage of the similarity of buckets of LSH to predict the initial fuzzy clusters for the fuzzy clustering problem.
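The bucket-building step described above can be sketched as follows (a minimal illustration; function names are ours, and each hash function is assumed to return one bit):

```python
from collections import defaultdict

def build_hash_table(objects, hash_funcs):
    """Group object indices into buckets keyed by their l-bit hash values.
    Each hash function returns one bit, so there are at most
    2 ** len(hash_funcs) distinct bucket keys."""
    table = defaultdict(list)
    for idx, obj in enumerate(objects):
        key = tuple(h(obj) for h in hash_funcs)
        table[key].append(idx)   # indices accumulate in ascending order per bucket
    return table
```

Because each hash function is applied independently per object, the table is built in a single pass over the data, and the tuple keys can later be compared by Hamming distance exactly as described above.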

III. THE PROPOSED METHOD
The outline of the proposed method is shown in Figure 1, which includes multiple processes grouped into two main stages: clustering initialization and clustering iteration. In this section, these two stages are detailed following the flows in Figure 1.

A. CLUSTERING INITIALIZATION
1) Dissimilarity matrix creation
The first process is building the dissimilarity matrix for each categorical attribute. Particularly, the dissimilarity matrix of a domain is formed by the pairwise distances of all unique categorical values in that domain:

M_d = [\delta(a_{di}, a_{dj})]_{N_d \times N_d}, \quad i, j = 1, \ldots, N_d    (15)

Table 1 shows an example of the dissimilarity matrix for the d-th categorical attribute with N_d different categorical values. Because the dissimilarity measure is symmetric and the distance of a categorical value to itself is zero, we only need to calculate N_d(N_d − 1)/2 values in total.
In this study, we recommend using context-sensitive dissimilarity measures such as the association-based similarity measure [31], distance learning for categorical attributes based on context information [32], and Distance Learning Dissimilarity for Categorical Data (DILCA) [25]. Specifically, DILCA is used in our implementation. These matrices not only support predicting the clusters but also serve as dictionaries for fast access during the clustering process.

2) LSH hash function creation
We are inspired by the threshold hash function, which selects a threshold in a chosen dimension to subdivide that dimension into two domains with high inner locality sensitivity. In this research, we propose to subdivide the domain of each categorical attribute into two subdomains.
First, the dissimilarity matrices from the previous step are converted into complete undirected graphs so that we can subdivide them more easily. An example of an undirected graph for the d-th attribute is shown in Figure 2.
Second, the Stoer-Wagner algorithm [33] is utilized to find the maximum cut on each graph, which aims to separate the categorical values of a domain into two locality-sensitive groups.
Third, let A_{d0} and A_{d1} be the two subsets of A_d created from the maximum cut in the previous step (A_{d0} ∪ A_{d1} = A_d, A_{d0} ∩ A_{d1} = ∅). Thus, the hash function h_d for the d-th attribute can be formed as:

h_d(x_i) = \begin{cases} 0 & \text{if } x_{id} \in A_{d0} \\ 1 & \text{if } x_{id} \in A_{d1} \end{cases}    (16)

Fourth, because the attributes affect the clustering performance differently [26,34], we propose selecting the l hash functions with the highest inner-separation scores for building the hash table. Because the values in the dissimilarity matrices are normalized by the DILCA principle, a convenient method for selecting the well-separated maximum cuts is to evaluate the average weights of the edges in the cuts.
Denote MC_d as the set of edges in the maximum cut of the d-th attribute. The average weight can be calculated as:

\gamma_d = \frac{1}{|MC_d|} \sum_{\{a_{di}, a_{dj}\} \in MC_d} \mathrm{Weight}(\{a_{di}, a_{dj}\})    (17)

where Weight({a_{di}, a_{dj}}) = δ(a_{di}, a_{dj}) is the weight of edge {a_{di}, a_{dj}} in the graph of the d-th dissimilarity matrix. Then, the l hash functions with the highest average maximum cut weights are selected for building the hash table:

H = [h_{(1)}, \ldots, h_{(l)}] \quad \text{subject to} \quad \gamma_{(1)} \geq \cdots \geq \gamma_{(l)} \geq \gamma_i \ \text{for every unselected } h_i    (18)

where γ_i is the average maximum cut weight of hash function h_i.

3) Hash table building
After the hash functions are generated, we can simply create a hash table of multiple buckets so that objects with the same hash value are put into the same bucket. The common hash value of the objects in a bucket is also used as the key of that bucket. Because there are l hash functions, there are at most 2^l different hash values that the objects can have. In our proposed method, the buckets are used to predict the initial clusters; therefore, the number of buckets 2^l should be greater than the number of clusters k. To achieve fast traversal of the hash table, we use an inverted file structure that stores the object indices of each bucket in the same segment in ascending order, which supports fast merging of one bucket into another [35,36].
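The threshold-style hash function and the average-cut-weight selection (equations (17) and (18)) can be illustrated as follows. This is a sketch under the assumption that the max cut has already been computed; all function names are ours:

```python
def make_hash_function(subset_one):
    """Binary hash for one attribute: 1 if the value lies in one side of the cut."""
    return lambda value: int(value in subset_one)

def average_cut_weight(part_a, part_b, delta):
    """Average dissimilarity over all edges crossing the cut (cf. eq. (17))."""
    weights = [delta(a, b) for a in part_a for b in part_b]
    return sum(weights) / len(weights)

def select_attributes(cut_scores, l):
    """Pick the l attributes whose cuts have the highest average weight (cf. eq. (18))."""
    return sorted(cut_scores, key=cut_scores.get, reverse=True)[:l]
```

A well-separated domain yields a high average cut weight, so `select_attributes` favors attributes whose values split cleanly into two dissimilar groups.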

4) Fuzzy initial clusters prediction
It is clear that the largest buckets are potentially located near the natural clusters in the dataset because they densely contain objects with the same locality-sensitive scores. For that reason, we propose using the k largest buckets as the core buckets of the k initial clusters. Denote B*_1, ..., B*_k as the k largest buckets in hash table H; the objects belonging to these buckets fully belong to the corresponding clusters:

u_{ij} = 1 \ \text{if } x_i \in B^*_j, \quad u_{it} = 0 \ \text{for all } t \neq j    (19)

The objects in the smaller, remaining buckets share their membership degrees with the k initial clusters based on the distances from the buckets they belong to to the core buckets. At this stage, cluster representations have not yet been formed, but we can approximate the distance from an object to a core bucket by the Hamming distance between the corresponding bucket keys:

d(x_i, B^*_j) = \mathrm{Hamming}\big( \mathrm{Key}(B_i), \mathrm{Key}(B^*_j) \big)    (20)

where B_i is the bucket that holds categorical object x_i and Key(·) returns the key of a bucket.
Next, if we consider the core buckets as representations, equation (4) can be applied to calculate the membership degrees of the objects outside the core buckets.
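This Hamming-based membership prediction can be sketched as follows, plugging the key distances into the form of equation (4); the function names and the crisp tie-handling for objects already inside a core bucket are our assumptions:

```python
def hamming(key_a, key_b):
    """Hamming distance between two equal-length bucket keys."""
    return sum(x != y for x, y in zip(key_a, key_b))

def predict_memberships(obj_key, core_keys, alpha=2.0):
    """Approximate fuzzy memberships of an object to the k core buckets by
    plugging Hamming distances between bucket keys into equation (4)."""
    d = [hamming(obj_key, k) for k in core_keys]
    if 0 in d:  # the object sits in a core bucket: crisp membership (eq. (19))
        return [1.0 if di == 0 else 0.0 for di in d]
    inv = [di ** (-2.0 / (alpha - 1.0)) for di in d]
    total = sum(inv)
    return [v / total for v in inv]
```

For instance, an object whose bucket key is at Hamming distance 1 from one core bucket and 3 from another receives memberships 0.9 and 0.1 with `alpha=2.0`, reflecting the inverse-distance sharing described above.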

5) LSH-based initial cluster prediction algorithm
Algorithm 1 LSH-based initial cluster prediction algorithm
Require: X, k, D, A_d (1 ≤ d ≤ D), l
Ensure: k initial fuzzy clusters (membership matrix U)
1: Create the dissimilarity matrix for each categorical attribute A_d (1 ≤ d ≤ D) using the DILCA dissimilarity measure.
2: Convert the dissimilarity matrices into dissimilarity graphs.
3: Find the maximum cut of each dissimilarity graph using the Stoer-Wagner algorithm.
4: Calculate the average weight of the edges in each cut using equation (17) and select the l attributes with the greatest values.
5: Generate the hash functions via equation (16) and build the hash table.
6: Select the k largest buckets as core buckets and compute the membership matrix U via equations (19), (20), and (4).

We summarize the processes of the proposed initial cluster prediction method in Algorithm 1. Note that the initial clusters are represented by the membership matrix U instead of a set of cluster representations C. For our method to work effectively, the hash table must contain more non-empty buckets than the number of clusters k. However, because the bucket sizes are unpredictable, selecting a suitable number of hash functions l is tricky. Ideally, our method works when the number of non-empty buckets equals the number of clusters k, but in practice we suggest selecting l so that the maximum number of buckets is twice the number of clusters k, i.e., l ≈ log_2(k) + 1. When the number of non-empty buckets is smaller than k, we can increase l by 1 or simply proceed with empty core bucket(s).

B. CLUSTERING ALGORITHM
In this section, we summarize our proposed method (LSHFk-centers) in Algorithm 2, with iter_max and ε being the parameters controlling the convergence of the algorithm.

Algorithm 2 LSH-based fuzzy k-centers clustering algorithm (LSHFk-centers)
Require: X, k, D, A_d (1 ≤ d ≤ D), l, iter_max, ε
Ensure: the k f-centers C of the k clusters and the membership matrix U that locally minimize the value P(U, C) (equation (6)).
1: Find the initial membership matrix U by Algorithm 1.
2: Calculate/recalculate the f-centers C from U.
3: Recalculate the membership matrix U following equation (4).
4: Repeat Step 2 and Step 3 until the change in P(U, C) is smaller than ε or iter_max iterations are reached.

IV. EXPERIMENTS AND RESULTS
In this section, we describe the details and results of designed experiments that aim to show the clustering performance of our proposed method compared to its original method and other state-of-the-art approaches.

2) Testing computer and datasets
Our method and all related works are implemented in the Python programming language, and the experiments are carried out on a high-end computer cluster with Intel Xeon G-6240M 2.6 GHz CPUs (16 cores × 4). Therefore, each algorithm can run 64 times in parallel on the 64 cores of a node.
Next, we select 16 common UCI categorical datasets that have been widely used in categorical data clustering, classification, and mixed-data analysis [43]. The details of these datasets are shown in Table 2.

3) Evaluation metrics
First, to assess the crisp clustering ability of our method, we use the purity score, which is the most commonly used clustering evaluation metric [44]. Denote G = {G_1, ..., G_k} as the set of k output clusters of a clustering algorithm and Ĝ = {Ĝ_1, ..., Ĝ_k} as the set of ground-truth clusters of the corresponding dataset. The purity metric matches each output cluster to its best-fitting ground-truth cluster and then counts the number of correct assignments:

\mathrm{Purity}(G, \hat{G}) = \frac{1}{N} \sum_{j=1}^{k} \max_{1 \leq t \leq k} |G_j \cap \hat{G}_t|    (21)

Note that we can simply convert the fuzzy membership matrix into a crisp membership matrix by assigning each object to its cluster of highest membership:

u'_{ij} = \begin{cases} 1 & \text{if } j = \arg\max_{1 \leq t \leq k} u_{it} \\ 0 & \text{otherwise} \end{cases}    (22)

Second, for evaluating fuzzy clustering effectiveness, the fuzzy silhouette (FSilhouette) score is utilized [45]. FSilhouette is a weighted average of the silhouette scores of all objects, where each weight is the amplitude of the object's membership degrees:

\mathrm{FS} = \frac{ \sum_{i=1}^{N} (\mu_{p,i} - \mu_{q,i})^{\alpha} \, s_i }{ \sum_{i=1}^{N} (\mu_{p,i} - \mu_{q,i})^{\alpha} }    (23)

where μ_{p,i} and μ_{q,i} are the first and second largest values of u_{ij} for 1 ≤ j ≤ k, respectively, and s_i is the crisp silhouette score of x_i [45].
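The purity score can be computed with a few lines of Python (a minimal sketch; the function name is ours, and a fuzzy membership matrix would first be converted to crisp labels via a per-row argmax):

```python
from collections import Counter

def purity(pred_labels, true_labels):
    """Purity score: match each output cluster to its majority ground-truth
    class and return the fraction of correctly assigned objects."""
    clusters = {}
    for p, t in zip(pred_labels, true_labels):
        clusters.setdefault(p, []).append(t)
    correct = sum(max(Counter(members).values()) for members in clusters.values())
    return correct / len(pred_labels)
```

For example, `purity([0, 0, 1, 1], [0, 0, 0, 1])` returns 0.75: the first cluster is pure, while the second contains one mislabeled object.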
Finally, to estimate the complexities of all compared methods, the average total clustering time (including initialization time) of each method is measured.

D. CLUSTERING RESULTS
First, the prediction accuracy of the proposed method is visualized when applied to the UCI soybean-small dataset. This dataset is the most commonly used categorical dataset for demonstration because of its simplicity. In detail, Figure 4.a shows the ground-truth labels and the obtained buckets: B_14, B_15, B_12, and B_11 are selected as core buckets, while the objects in B_6 must share their membership degrees with the core buckets. As a result, in Figure 4.b, because the Hamming distance between buckets B_6 and B_14 is 1, which is the smallest, the membership degrees of the objects in B_6 to the core bucket (cluster) B_14 are the highest. Intuitively, our prediction model gives relatively accurate results compared to the ground-truth labels: it predicts perfectly accurate results for cluster 3 and cluster 4, and reaches an overall accuracy of 87.2% (41/47 objects).

FIGURE 5. Purity scores on UCI datasets
Second, because fuzzy clustering algorithms can solve the crisp clustering problem by assigning objects to their nearest clusters, we can compare the crisp clustering performance of our proposed method with that of its competitors. Figure 5 shows such comparison results as the average purity scores and standard deviations of all compared methods on the 16 testing datasets with α = 1.1. In this experiment, each method is run 64 times with different random seed numbers. As a result, our method and FSBC achieve the best average purity score of 0.582. However, because our method gives consistent results over different runs, LSHFk-centers becomes the most stable method with a standard deviation of 0. FE-k-means can also achieve a standard deviation of 0, but its average purity score is lower than that of LSHFk-centers.

E. COMPLEXITY ANALYSIS
Finally, the complexity comparison is shown in Figure 6. Note that these clustering time measurements are extracted from the same experiment that provides the FSilhouette scores in Table 3; hence, we can clearly show the trade-off between speed and accuracy. It is clear that the genetic algorithms take much more clustering time than the k-means-like algorithms because they maintain up to 4 × k chromosomes. Moreover, the membership chromosome-based methods such as NSGA, MOFCentroids, and MaOFCentroids are even more time-consuming because of the extra processes needed to convert the membership chromosomes into cluster representations for evaluation. Additionally, FSBC has crisp clustering accuracy comparable to that of our proposed method, but it takes about 3.54 times more computation time than LSHFk-centers (115.43 seconds vs. 32.59 seconds). However, because of the trade-off between accuracy and complexity, LSHFk-centers is nearly 2 times slower than its original method Fk-centers, with average clustering times of 32.59 seconds and 16.83 seconds, respectively. In the same manner, simple methods such as Fk-means, FE-k-means, k-modes, k-reps, and Fk-modes run fast, but their clustering accuracy scores are relatively low.

V. CONCLUSION
In this paper, we introduced LSHFk-centers, the first approach to predict initial fuzzy clusters for the fuzzy clustering of categorical data. Particularly, in the initialization stage, we create hash functions based on the separations of the categorical values of each attribute. After that, an LSH hash table of multiple buckets of similar objects is established using the selected hash functions. LSHFk-centers selects the k largest buckets as the core buckets of the fuzzy clusters, and the objects of the remaining buckets share their membership degrees with the core buckets. As evidence, our fuzzy cluster prediction model by itself reaches 87.2% accuracy against the ground-truth labels on the soybean-small dataset. Moreover, the proposed method is 8% more accurate than its original method (Fk-centers) in terms of fuzzy clustering effectiveness. Combined with the advantages of centers, our clustering algorithm outperforms all other related works in terms of clustering effectiveness in both crisp and fuzzy clustering evaluations. However, due to the extra process of building the LSH hash table, our method takes considerably more time than its original method Fk-centers.
We recommend that future research focus on the following two directions to increase the performance of the LSHFk-centers algorithm:
• Investigate the usefulness of different dissimilarity measures for the construction of the hash table. Perhaps a different dissimilarity measure can give better accuracy than DILCA while requiring fewer computations.
• In this study, each LSH hash function is based on the properties of a single attribute only. Future work can use multiple-attribute hash functions to achieve higher locality-sensitive factors.
For the sake of reproducibility and open science, we have published the source code of LSHFk-centers in the PyPI repository at https://pypi.org/project/lshkcenters/ .