FOCT: Fast Overlapping Clustering for Textual data

Text clustering is used to extract specific information from textual data and even to categorize text by topic and sentiment. Because of the inherent overlap in textual documents, overlapping clustering algorithms have become a suitable approach for text analysis. However, state-of-the-art algorithms are not fast enough to analyse large volumes of textual data within tolerable time limits. In this research, we propose a text clustering algorithm, FOCT, a fast overlapping extension of SOM, one of the best algorithms for clustering textual data. We apply several heuristics that exploit special characteristics of textual data to establish a very fast overlapping clustering algorithm. We use fast methods to represent document vectors, to compute the similarity between documents and neurons, and to update the weights of neurons. In our algorithm, each document can belong to one or more neurons, which matches the overlapping nature of many real documents. We analyse the efficiency of the proposed algorithm against the k-means, OKM, SOM and OSOM clustering approaches and experimentally demonstrate that it runs 12 to 690 times faster, and that the overlap size of FOCT clusters is closer to the overlap size of the original data. The quality of the clusters is also measured by four different internal and external evaluation criteria, on which FOCT clusters show up to 64% better quality.


I. INTRODUCTION
Overlapping clustering algorithms are unsupervised methods for knowledge extraction from data. These algorithms group a set of objects so that each object can belong to one or more groups [1]. Overlapping clustering algorithms are important because a large number of real-world datasets (such as textual data, computational biology data, and social networks) contain innate overlap. In this paper, we specifically target overlapping text clustering methods, which are used for tasks such as grouping similar articles, news items or tweets, analysing customer feedback and comments, and finding meaningful implicit subjects in documents. In addition to the challenges of usual clustering tasks, such as evaluation, parametrization and scale-up, overlapping clustering algorithms face their own issues, including high computational complexity and the impact of overlap size on the number of clusters [2]; N'Cir et al. highlight slow processing speed as one of the most critical challenges of overlapping clustering algorithms on big volumes of data [2]-[4].
Self-Organizing Map (SOM) is one of the common algorithms for clustering textual data. It is based on artificial neural networks (ANN) and generally maps high-dimensional data into a low-dimensional space (typically two-dimensional). It has been experimentally demonstrated that SOM is one of the most suitable algorithms for clustering textual data. In text clustering, the inputs are usually high-dimensional sparse vectors, while the SOM output is often a two-dimensional map that can be understood and interpreted by humans. In addition, SOM is relatively resistant to noisy documents. These features make SOM suitable for text-based applications such as digital libraries [5], [6].
Given the importance of text clustering (where categories often overlap) and the suitability of SOM for text clustering, an overlapping clustering algorithm based on SOM that is both suitable and fast for textual data can be effective for extracting knowledge from big textual datasets.
The basic SOM algorithm does not consider the overlap existing between clusters. The first overlapping version of SOM, called OSOM, was introduced by Cleuziou; in OSOM, an instance can belong to one or more neurons [7]. However, Cleuziou conducted all experiments on non-textual datasets. In this paper, we experimentally analyse the efficiency of OSOM and show that it is extremely slow for text clustering.
In this paper, we propose a new Fast Overlapping Clustering algorithm for Textual data (FOCT), which is an overlapping extension of SOM. FOCT takes advantage of a set of fast methods and techniques to appropriately represent document vectors, accelerate the computation of the similarity between documents and neurons, and speed up the process of updating the weights of neurons. The efficiency of our algorithm is compared with the k-means, OKM (Overlapping K-Means) [8], SOM [9] and OSOM [7] algorithms. We experimentally demonstrate that FOCT is fast and efficient for overlapping clustering of textual data.
The rest of this paper is organized as follows: Section II presents relevant literature. In Section III, our suggested algorithm is introduced. Our evaluation measures, data and feature extraction method, and experiment conditions and their results are addressed in Section IV. Finally, Section V concludes the paper.

II. LITERATURE REVIEW
N'Cir et al., in Chapter 8 of the book "Partitional Clustering Algorithms", reviewed the most important overlapping clustering algorithms. They introduced a categorisation of overlapping clustering methods based on their conceptual approach [2]; a summary of these main categories is shown in Table 1.

Table 1. Main categories of overlapping clustering methods [2].

Category     | Description                                                                                                         | Examples
Correlation  | Overlapping extensions of correlation clustering algorithms, defined as optimization problems that allow overlaps by assigning a set of one or more labels to each data object. | [10]-[13]
Generative   | Algorithms based on biological processes; they hypothesize that each data object is the result of a mixture of distributions. | [14]-[16]
Graphical    | Algorithms based on graph theory, mostly used for community detection in complex networks.                          | [17]-[20]
Hierarchical | Algorithms that try to reduce the differences between the original dataset and the obtained hierarchical structure. | [21]-[23]
Partitional  | The most popular group of overlapping clustering research; these algorithms either modify the clusters resulting from a standard method into overlapping clusters or propose new objective criteria to model overlaps. | [8], [24]-[30]
Topological  | Overlapping extensions of topological maps (such as SOM). Our proposed algorithm is in this class.                  | [7]

Textual data have inherent overlap, and several researchers have focused on this issue (e.g. [31]-[34]). In this research, we introduce a topological overlapping clustering algorithm that is suitable for textual data. The proposed algorithm is compared with the k-means, OKM (Overlapping K-Means), SOM and OSOM algorithms. The k-means and SOM algorithms are non-overlapping clustering algorithms, and their performance is our baseline. OKM [8] and OSOM [7] were selected among the overlapping clustering algorithms. OKM is one of the most common base algorithms among overlapping clustering algorithms and, according to Cleuziou's claim [8], is suitable for textual data. OSOM is the algorithm structurally most similar to ours; both are topological, and the method is relatively new. In the remainder of this section, OKM and OSOM are described in a little more detail.
In 2008, Cleuziou presented the OKM algorithm, an overlapping version of k-means. OKM proposes a new objective criterion that generalizes the least-squares criterion used in k-means, and the goal of the OKM objective function is minimization under multi-assignment constraints. In a nutshell, OKM seeks a coverage of the space rather than a partition of it as k-means does. Using this criterion, each sample can belong to one or more clusters [8].
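For reference, the OKM criterion can be sketched as follows (our rendering based on [8]; notation ours, not reproduced from the original paper):

```latex
% OKM objective (notation ours): A_i is the set of clusters that sample
% x_i is assigned to, and m_c is the prototype of cluster c; each sample
% is pulled toward the average of its assigned prototypes.
J = \sum_{i=1}^{N} \bigl\| x_i - \phi(x_i) \bigr\|^2 ,
\qquad
\phi(x_i) = \frac{1}{|A_i|} \sum_{c \in A_i} m_c
```

With singleton assignments (|A_i| = 1 for every sample), this reduces to the least-squares criterion of k-means, which is the sense in which OKM generalizes it.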
In 2013, Cleuziou introduced an algorithm called OSOM (Overlapping SOM), in which each sample can be assigned to one or more neurons of the map. He presented SOM as a topological method that is suitable for overlapping clustering because it provides a structure with the following characteristics [7]:
• In this clustering method, the number of clusters is not an input parameter. The input data are structured on a map of neurons, and the number of neurons is usually much greater than the number of potential final clusters.
• With topological methods, the topological correctness of the overlaps is ensured in a simple and straightforward way. Topological connections are clear and explicit on the map, so efficient assignment heuristics on the overlaps can easily be proposed.
• Topological methods can be applied as a first clustering structure on the data and can then be used for a further hierarchical structuring.
None of the datasets investigated in the published OSOM results were textual [7]. In this research, we experimentally show that the OSOM algorithm is not suitable for textual data.
In summary, Table 2 compares the clustering algorithms used in this research. More details on their complexity are provided in Section III-E.

III. THE FOCT ALGORITHM
Because of the inherent overlap in textual data, overlapping clustering algorithms extract more appropriate results from textual datasets. Overlapping clustering algorithms can assign textual documents to more than one cluster, which is exactly the common and essential property observed in many real-world documents. SOM is a classic and suitable algorithm for the problem of unsupervised text clustering, and several research works have developed new methods for text clustering and mining based on SOM. As mentioned in the previous section, Cleuziou [7] introduced an overlapping SOM algorithm named OSOM. In this section, we present a topological overlapping clustering algorithm based on OSOM. In Section IV, it is experimentally demonstrated that OSOM, unlike the presented algorithm, is not suitable for real-world textual data.
Textual datasets usually contain considerable numbers of documents, and the feature vectors extracted from these documents are most of the time sparse and high-dimensional. Furthermore, the complexity of OSOM is relatively high, so it cannot process big volumes of textual data at acceptable speed. These limitations are the most important challenges in applying the OSOM algorithm to textual data.
In this section, we propose a novel algorithm, FOCT (Fast Overlapping Clustering for Textual data), that addresses these challenges. FOCT deploys a number of optimisations to shrink the amount of computation and therefore speed up the process of text clustering. These optimisations are listed below:
• Each document is represented as a list of word indexes instead of a high-dimensional sparse vector (Section III-A).
• The similarity between documents and neurons is computed by a fast method that only touches the words present in each document (Section III-B).
• To achieve a faster algorithm, the winner neuron sets are obtained by a greedy heuristic (Section III-C).
• The neurons' weights are updated using a fast method which increases the corresponding weight values of the document words (by a simple sum operation) (Section III-D).
Applying the above optimisations leads to a fast algorithm that addresses the problem of fast textual data clustering. A simple flow chart of our algorithm is shown in Figure 1.
In the remainder of this section, each of the above optimisations is explained in detail. We then describe our proposed algorithm, FOCT, and analyse its complexity.

A. VECTORS OF DOCUMENTS AND NEURONS
In text clustering, most algorithms suffer from high-dimensional feature vectors and many iterations, which result in a considerable execution cost. Here we explain how the feature vectors are represented in FOCT; the feature extraction process itself is elucidated in Section IV-B.
In most SOM-based approaches, the vectors of the input instances and the neurons are represented with the same number of dimensions. Using such equal-length vectors for textual data, which consists of high-dimensional and sparse vectors, leads to significant memory consumption and longer runtimes.
To tackle this issue, in this research each document is represented as the indexes of its words (instead of a high-dimensional vector). The index representation significantly reduces the length of the vectors, which leads to less memory consumption and shorter execution times. Although each document is represented by indexes, each neuron keeps its high-dimensional vector; in other words, the vector of a neuron determines the weight of each word for that neuron, and the number of neurons is often much smaller than the total number of documents. Figure 2 shows how the document indexes are created using lists of words: for each document, the index of each of its words is extracted (using VectorW and List) and added to the index list of that document (Vector). With minor changes, this procedure can also be used dynamically in online applications.
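A minimal sketch of this index representation (function names such as build_vocabulary and docs_to_indexes are ours, not from Figure 2):

```python
from typing import Dict, List

def build_vocabulary(documents: List[List[str]]) -> Dict[str, int]:
    """Assign a fixed index to each distinct word (the role of List/VectorW)."""
    vocab: Dict[str, int] = {}
    for doc in documents:
        for word in doc:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def docs_to_indexes(documents: List[List[str]],
                    vocab: Dict[str, int]) -> List[List[int]]:
    """Represent each document by the sorted indexes of its distinct words
    instead of a high-dimensional binary vector."""
    return [sorted({vocab[w] for w in doc if w in vocab}) for doc in documents]

# Example with two tokenized documents.
docs = [["oil", "price", "rise"], ["oil", "market"]]
vocab = build_vocabulary(docs)
print(docs_to_indexes(docs, vocab))  # [[0, 1, 2], [0, 3]]
```

The memory saving follows directly: a document with w distinct words costs O(w) integers instead of a length-d binary vector, and d is typically orders of magnitude larger than w.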

B. FAST SIMILARITY COMPUTATION
After representing documents as indexes of words, we need a function to compute the similarity between neurons and documents, which are represented in different ways. Figure 3 shows the proposed algorithm for computing the similarity between documents and neurons. The algorithm considers equal importance for all words extracted from the documents. To obtain the similarity between a given document and a neuron, the algorithm extracts, from the neuron's weight vector, the weight of each word occurring in the document and then sums up these weights.

Lemma III.1. The time complexity of the Similarity function is O(k·d), where k is the number of neurons and d is the maximum length of the feature vectors.

Proof: As shown in Figure 3, the Similarity function consists of two nested loops. The outer for loop runs over the k neurons, and the inner one scans a feature vector of size at most d. Therefore, the time complexity of this function is O(k·d).
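As an illustration, here is a vectorized sketch of this similarity computation (a NumPy rendering of ours; Figure 3 itself uses two nested loops, but both compute the same sums):

```python
import numpy as np

def similarity(doc_indexes, neuron_weights):
    """Similarity of one document to every neuron.

    doc_indexes:    list of word indexes occurring in the document.
    neuron_weights: (k, d) array; row j is neuron j's weight per word.
    Returns a length-k array holding, for each neuron, the sum of the
    weights of the document's words (all words weighted equally).
    """
    return neuron_weights[:, doc_indexes].sum(axis=1)

# k = 3 neurons over a d = 5 word vocabulary (values are illustrative).
weights = np.arange(15, dtype=float).reshape(3, 5)
print(similarity([0, 2], weights))  # [ 2. 12. 22.]
```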

C. CANDIDATE AND WINNER NEURONS SETS
The main difference between our overlapping algorithm (FOCT) and the non-overlapping one (SOM) lies in finding the optimal set of winner neurons. One straightforward approach to finding such an optimal set is to enumerate all possible combinations; since the number of neuron subsets is exponential (2^K subsets for K neurons on the map), this approach has exponential complexity. To address this problem, we introduce a greedy heuristic algorithm (similar in spirit to the one in Cleuziou's OSOM algorithm [7]), called SWN (Set of Winner Neurons), to speed up the process of finding the winner neurons and to obtain a more appropriate topological organization of the map. The steps of the algorithm are as follows:
1) Find the neuron most similar to the current document (the first winner neuron) and assign the document to it.
2) Take the neighbors of the most similar neuron as the candidate neurons and find, among them, the neuron most similar to the current document. Note that the Similarity function (see III-B) is invoked to find the most similar neighbor neuron (Steps 3 and 4 in Figure 4).
3) If the winner neuron from the candidate set satisfies at least one of the required conditions, the current document is also assigned to it; the candidate set is then updated with the neighbors of the new winner neuron, and this process is repeated. Otherwise, the creation of the winner neurons set is finished.
To accept a new winner neuron, one of the following conditions must be satisfied (Step 7 in Figure 4):
1) The similarity of the current document to the average weights of the new winner set is greater than its similarity to the average weights of the previous winner set.
2) The absolute difference between the similarity of the current document to the average weights of the new winner set and its similarity to the average weights of the previous winner set is less than a small threshold (ε). ε is one of the factors that controls the overlap size of the FOCT output. A sketch of this heuristic is given below.
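The following is a sketch of the SWN heuristic under our reading of the steps above (variable names, tie-breaking and candidate bookkeeping are assumptions; Figure 4 may differ in detail):

```python
import numpy as np

def swn(doc_indexes, neuron_weights, neighbors, epsilon):
    """Greedy sketch of SWN (Set of Winner Neurons).

    neighbors: dict mapping a neuron id to the ids of its map neighbors.
    epsilon:   small threshold controlling the overlap size (the paper's ε).
    """
    # Similarity of the document to each individual neuron (Section III-B).
    sims = neuron_weights[:, doc_indexes].sum(axis=1)
    first = int(np.argmax(sims))                 # step 1: first winner
    winners = [first]
    candidates = set(neighbors[first])           # step 2: its neighbors
    prev_sim = sims[first]                       # similarity to avg of {first}
    while candidates:
        best = max(candidates, key=lambda j: sims[j])
        new_winners = winners + [best]
        # Similarity of the document to the average weights of the new set.
        avg = neuron_weights[new_winners].mean(axis=0)
        new_sim = avg[doc_indexes].sum()
        # Step 3: accept if either acceptance condition holds.
        if new_sim > prev_sim or abs(new_sim - prev_sim) < epsilon:
            winners = new_winners
            prev_sim = new_sim
            candidates = (candidates | set(neighbors[best])) - set(winners)
        else:
            break
    return winners
```

Larger values of epsilon make the second condition easier to satisfy, so more neurons join the winner set and the overlap size of the output grows, matching the role of ε described above.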
Lemma III.2. The time complexity of the SWN function is O(k²·d), where k is the number of neurons and d represents the maximum length of the feature vectors.
Proof: As shown in Figure 4, the single while loop in the SWN function iterates k times in the worst-case scenario. The Similarity function, with the time complexity of O(k·d) (Lemma III.1), is invoked in each iteration. Therefore, one can conclude that the execution cost of the SWN function does not exceed O(k²·d).

D. UPDATING WEIGHTS OF NEURONS
After finding the winner neurons set, which contains the neurons most similar to the current document, the weight vector of each winner neuron and of its neighbors is updated. To update the weight vector of a winner neuron, the update rate, denoted ϕ, is added to those dimensions of the neuron's vector whose corresponding words occur in the document. Other documents similar to the current one then have a better chance of being assigned to this neuron. However, documents completely different from the current one would still have a chance of being assigned to this neuron if the other dimensions of the winner's weight vector were left unadjusted. To overcome this problem, ϕ is also subtracted from the dimensions of the winner neuron's weight vector that are not indexed in the current document. The update rate ϕ decreases in each iteration.
To obtain a more appropriate topological organization of the map, the weight vectors of the winner neurons' neighbors are also updated: the value ϕ/2 is added to the dimensions of each neighbor's weight vector that are indexed for the current document. The procedure for updating the weight vector of each winner neuron and its neighbors is presented in Figure 5.
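A sketch of this update step under the description above (names ours; whether a neighbor that is itself a winner receives the extra ϕ/2 is our assumption, since Figure 5 is not reproduced here):

```python
import numpy as np

def update(neuron_weights, winners, neighbors, doc_indexes, phi):
    """Add phi to a winner's document-word dimensions, subtract phi from its
    other dimensions, and add phi/2 to document-word dimensions of its
    non-winner neighbors."""
    d = neuron_weights.shape[1]
    doc_mask = np.zeros(d, dtype=bool)
    doc_mask[doc_indexes] = True
    for w in winners:
        neuron_weights[w, doc_mask] += phi    # reinforce document words
        neuron_weights[w, ~doc_mask] -= phi   # penalize absent words
        for nb in neighbors[w]:
            if nb not in winners:             # assumption: winners updated once
                neuron_weights[nb, doc_mask] += phi / 2.0
    return neuron_weights
```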

E. FOCT AND ITS COMPLEXITY
In the previous subsections, we explained how to compute similarity (the SIMILARITY procedure), find the winner neurons set (the SWN procedure) and update the winner neurons and their neighbors (the UPDATE procedure). Now we propose our clustering algorithm, FOCT, which uses these procedures as building blocks. FOCT, a fast extension of SOM, is an overlapping clustering algorithm mainly suited to textual data. Figure 6 shows the algorithm. In each iteration, the algorithm takes a document as input and performs the following steps: 1) compute the similarity of the document and the neurons, 2) find the winner neurons set, and 3) update the weight vectors of the neurons. In FOCT, ϕ_t is the update rate of the t-th iteration and decreases in each iteration. Equation 1 formulates how ϕ_t is calculated; ϕ_i is the initial value of ϕ, and ϕ_f represents its final value.
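Putting the pieces together, here is a high-level sketch of one FOCT run, reusing the swn() and update() sketches above. Since Equation 1 is not reproduced here, the linear interpolation of ϕ between ϕ_i and ϕ_f is an assumption of ours, as are the default values:

```python
def foct(docs_as_indexes, neuron_weights, neighbors, epsilon,
         phi_i=0.9, phi_f=0.01):
    """High-level FOCT loop (sketch). One document is processed per
    iteration; the update rate phi decays from phi_i to phi_f."""
    t_max = len(docs_as_indexes)
    assignments = []
    for t, doc in enumerate(docs_as_indexes):
        phi_t = phi_i + (phi_f - phi_i) * t / max(t_max - 1, 1)  # assumed schedule
        winners = swn(doc, neuron_weights, neighbors, epsilon)   # steps 1 and 2
        update(neuron_weights, winners, neighbors, doc, phi_t)   # step 3
        assignments.append(winners)
    return assignments
```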
One of the most challenging problems of overlapping algorithms is their time complexity: overlapping clustering algorithms can be drastically more complex than usual ones, because considering overlapping structures leads to a significant increase in the size of the search space [2].
The complexity of the basic SOM is O(k·t·m), where k is the number of neurons, t is the number of iterations, and m represents the training time [9]. OSOM, the overlapping algorithm based on SOM, has an exponential complexity of O(t·2^(2k)) [7].
As proved in Theorem III.4 below, the proposed FOCT performs overlapping clustering with a time complexity of O(k²·t·d).

Theorem III.4. The FOCT algorithm has a time complexity of O(k²·t·d).

Proof: The main for loop of the FOCT algorithm iterates t times, and a set of functions is invoked in each iteration, of which the costliest is SWN, with time complexity O(k²·d) (Lemma III.2). Hence, the time complexity of FOCT is O(k²·t·d).
Comparing with SOM and OSOM, the maximum theoretical value of m (the training time) in the FOCT algorithm is k·d, where k and d are the number of neurons and the maximum length of the feature vectors, respectively.
Since in text mining the binary document vectors are extremely sparse, FOCT represents each document as the indexes of its words (instead of the high-dimensional vectors used in SOM and OSOM). This leads to a significant decrease in the size of the feature vectors compared to the vector sizes in SOM and OSOM. In addition, owing to the fast similarity computation method (Section III-B) and the fast neuron weight update method (Section III-D), the training time of the FOCT algorithm is less than the time taken by SOM and OSOM. In Section IV, we empirically analyse the efficiency of FOCT and show that it is faster than the others.

IV. EXPERIMENTS
In this section, we evaluate our proposed overlapping text clustering. We first describe the evaluation measures and then introduce a dataset, which satisfies the particular requirements for overlapping clustering applications. Our feature extraction method is also explained. Finally, experiment conditions and obtained results are elucidated.

A. EVALUATION MEASURES
Overlapping clustering evaluation is a controversial issue. Clustering evaluation measures are divided into two main categories: internal and external. Internal measures aim to quantify how much of the original information is retrieved by the clusters, whereas external measures compare the obtained clusters with reference classes [2], [35].
Cleuziou proposed measures for evaluating overlapping clustering algorithms that cover both the internal and the external view. These measures apply the Frobenius norm to two N × N matrices A and B, one obtained from the input data and the other from the clusters (see [7] for details).
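One plausible instantiation of this matrix distance is sketched below (the exact normalization used in [7] may differ):

```latex
% Frobenius distance between two N x N matrices A and B; lower Q(A,B)
% means the two compared structures match better.
Q(A, B) = \lVert A - B \rVert_F
        = \sqrt{\sum_{i=1}^{N} \sum_{j=1}^{N} \left( a_{ij} - b_{ij} \right)^{2}}
```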
Cleuziou defined four matrices, each containing one piece of strategic information [7]; among them:
• Matrix U(2) contains the distances between the prototypes of the neurons associated with the data;
• Matrix U(3) contains the initial distances between the data samples.
In our research, Cleuziou's measures [7] are used to compare the different clustering algorithms. Table 3 gives an overview of how the matrices instantiate Q(A, B), together with a short description of each measure:
• Q_label shows the relevance of the labels and the initial distances. It is inherent to the data and independent of the clustering algorithm; it indicates the matching between the data descriptions and their labels, and a lower value means a better match.
• Q^ext_topo is an external measure that evaluates the relevance of the map topology with respect to the class labels; in other words, considering the sample labels, how close similar samples are on the map.
• Q^ext_classif is an external measure that evaluates the relevance of the obtained clusters with respect to the class labels; in other words, considering the sample labels, how similar the samples within each cluster are.
• Q^int_topo is an internal measure that evaluates the relevance of the map topology with respect to the initial distances; in other words, how close similar samples are on the map.
• Q^int_classif is an internal measure that evaluates the relevance of the obtained clusters with respect to the initial distances; in other words, how similar the samples within each cluster are.
It should be mentioned that Q in Table 3 represents the distance between two matrices; a lower value of Q indicates a better match between the two matrices. In our research, in addition to the above measures, the overlap size is also considered. The overlap size is defined as the average number of clusters that each instance belongs to:

overlap = (1/N) · Σ_{i=1}^{N} c_i

where c_i is the number of clusters that the i-th instance belongs to, and N is the number of instances [2].
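In code, this measure is a one-liner (helper name ours):

```python
def overlap_size(assignments):
    """Average number of clusters per instance."""
    return sum(len(clusters) for clusters in assignments) / len(assignments)

# Three instances; the second belongs to two clusters: (1 + 2 + 1) / 3.
print(overlap_size([[0], [0, 1], [2]]))  # 1.333...
```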

B. DATA AND FEATURE EXTRACTION
We use the Reuters-21578 dataset to build our experimental dataset. Reuters-21578 is one of the popular benchmarks in information retrieval and text mining and consists of 21578 articles, each belonging to one or more categories (topics) [36]. To build the experimental dataset, we selected the articles from Reuters-21578 that satisfy two selection conditions. It should be noted that Reuters-21578 is a fairly clean dataset, since all its articles have been revised and proof-read; it therefore does not require much pre-processing, such as noise detection [37].
After building the experimental dataset (comprising 9907 articles), the feature vector of each article is extracted from its 'Title'. We use the 'Title' to keep the feature vectors small. The main steps of feature vector extraction are:
1) Extracting the document words.
2) Removing stop-words: stop-words are words used so prevalently in textual documents that they cannot help distinguish (classify) documents. Articles like a and the as well as pronouns such as it and them are examples of stop-words; such common words can be discarded during feature extraction [38], [39].
3) Reducing words to stems: stemming reduces words to their stems, which decreases the number of distinct words, leaves fewer word forms to deal with and therefore shortens the feature vectors. In this research, we use the Porter stemming algorithm, the most widely used stemmer for English texts [38].
4) Selecting frequent words: in this step, words with a frequency of less than 10 are removed. Kohonen showed in 1995 that removing low-frequency words (even meaningful ones) not only has no negative influence on the clusters but sometimes even leads to better results, while reducing the length of the feature vectors and making them less sparse [9].
5) Creating binary feature vectors: in the last step, binary feature vectors are created whose elements indicate the existence or non-existence of each word in a document. As mentioned earlier, FOCT then uses a different feature vector representation that leads to faster processing and a smaller memory footprint.
In this research, WVT (Word Vector Tool), a flexible Java library for statistical language modelling, is employed for feature extraction [40]. Table 4 summarises the characteristics of our experimental dataset obtained by applying the aforementioned steps.
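A rough Python equivalent of this pipeline is sketched below. The paper used the WVT Java library [40]; NLTK is our substitution here, its stop-word list must be downloaded once with nltk.download('stopwords'), and whitespace tokenization is a simplification:

```python
from collections import Counter
from nltk.corpus import stopwords           # needs nltk.download('stopwords') once
from nltk.stem.porter import PorterStemmer

def extract_features(titles, min_freq=10):
    """Steps 1-4 above: tokenize titles, drop stop-words, stem, and keep
    stems occurring at least min_freq times. Returns the index
    representation that FOCT uses instead of binary vectors (step 5)."""
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    stemmed = [
        [stemmer.stem(w) for w in title.lower().split() if w not in stop]
        for title in titles
    ]
    counts = Counter(w for doc in stemmed for w in doc)
    vocab = {w: i for i, w in enumerate(sorted(w for w, c in counts.items()
                                               if c >= min_freq))}
    return [sorted({vocab[w] for w in doc if w in vocab}) for doc in stemmed], vocab
```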

C. EXPERIMENT CONDITIONS AND RESULTS
In this section, we first explain the conditions under which our experiments take place and then summarize the experimental results. The experiments are conducted under the following conditions:
• In each experiment, the number of iterations, which equals the number of selected documents (T_max), is set to 2000, 4000, 6000, 8000 or 9907 documents. The documents were chosen randomly and without replacement.
Table 5 reports the Q^ext_topo and Q^int_topo measures; it compares the topologies of the maps produced by the SOM, OSOM and FOCT algorithms with respect to the classes (external validation) and to the initial description of the data (internal validation), respectively. Table 6 considers only the obtained clusters and ignores the topological organization, using the Q^ext_classif and Q^int_classif measures.
The main challenge of current overlapping clustering algorithms is their slow speed when clustering big volumes of data, which is why running time is one of the most important criteria in their evaluation. Table 7 shows the CPU times of the algorithms for the above experiments.
The overlap size is another important measure for evaluating overlapping clustering algorithms: algorithms whose overlap size is closer to the overlap size of the original data return better clustering solutions. Table 8 shows the overlap sizes of the OKM, OSOM and FOCT algorithms in our experiments.

D. DISCUSSION
As mentioned before, a lower value of Q_label, an algorithm-independent measure, indicates a better match between descriptions and labels. In our experiments, Q_label equals 0.5175 over all instances (9907 documents) and stays within 0.5175 ± 0.003 when only subsets of the instances are used. These figures show that the descriptions and the classes do not match well, which is one of the most important reasons for the weakness of the external results (the Q^ext measures). Table 5 compares Q^ext_topo and Q^int_topo for the SOM, OSOM and FOCT algorithms. While the Q^ext_topo values of SOM and FOCT are close, with a maximum difference of ±0.02, FOCT outperforms SOM on the Q^int_topo measure. The results show that the maps obtained by the FOCT algorithm have appropriate topologies and can be effective for clustering. In these experiments, OSOM showed the weakest results.
As shown in Table 6, while the overlapping algorithms OSOM and OKM fall behind FOCT, the non-overlapping algorithms SOM and k-means outperformed FOCT in terms of the Q_classif measures: SOM in all experiments, and k-means in some of them, showed better results than FOCT.
In a nutshell, in all experiments FOCT outperforms the two overlapping algorithms, OKM and OSOM, but falls behind the non-overlapping algorithms, k-means and SOM, in some experiments (especially on the Q_classif measures). The main reason for this observation is that non-overlapping algorithms only consider the main category of each document, while overlapping ones analyse the overlap between document categories. Ignoring the other categories makes it easier for non-overlapping algorithms to place more similar documents in each cluster, so these results are not unexpected. This research aims to present an overlapping clustering algorithm for textual data, given the importance of overlap in textual documents. Figure 7 plots the Q measures per epoch and compares FOCT with the other overlapping algorithms. In all these charts the bottom line belongs to FOCT, which implies that FOCT outperforms the other overlapping clustering algorithms on all the measures.
In terms of execution time, one of the most important criteria for clustering algorithms, the FOCT algorithm ran much faster than the other algorithms.
Concerning the overlap size, shown in Table 8, the overlap size of the data used in this research (9907 documents) is 1.2516. Table 9 summarises the maximum and minimum overlap sizes obtained by the overlapping algorithms. As Table 9 shows, the overlap size of the clusters found by FOCT is closest to the overlap size of the original data, which makes FOCT the best algorithm in terms of overlap size.

V. CONCLUSION
Nowadays, text clustering techniques are applied in various areas to extract specific knowledge from text and even to categorise textual documents by topic and sentiment. Because of the inherent overlap in textual documents, overlapping clustering algorithms have become a suitable approach for analysing text.
In this paper, we proposed FOCT, a fast overlapping clustering algorithm for textual data. FOCT is a fast overlapping extension of SOM that employs several heuristics exploiting special characteristics of textual data. The efficiency of FOCT was analysed against two non-overlapping (k-means and SOM) and two overlapping clustering algorithms (OKM and OSOM). We experimentally demonstrated that FOCT outperforms the aforementioned clustering algorithms in terms of the topological measures as well as execution time. In addition, FOCT shows a better overlap size than the other overlapping algorithms, which implies that FOCT clusters are closer to the clusters of the original data.
The overlap sizes of most available text datasets are small (e.g., the overlap size of Reuters is 1.25). As future work, we plan to find and compile text datasets with different and larger overlap sizes; such datasets would allow a better evaluation and comparison of FOCT and other overlapping algorithms.