A Novel Hybrid Clustering Approach Based on Black Hole Algorithm for Document Clustering

In information retrieval and text mining, document clustering is a major challenge because document collections grow larger every day. Clustering is an NP-hard problem, so meta-heuristic algorithms can be an effective way to solve it: when the solution space is large, traditional methods cannot find a solution in a reasonable amount of time. K-means is a heuristic clustering algorithm, and the two main issues with heuristic algorithms are early convergence and trapping in local optima. Moreover, finding the right number of clusters is one of the main drawbacks of the k-means algorithm; the correct value of k is always uncertain, and different researchers have used different methods to solve this problem. To overcome these problems, this study presents a novel hybrid approach for document clustering. One challenge in the existing BH algorithm is the input data type: until now, the algorithm accepted only numeric data. Another flaw in the existing model is that it does not automatically choose the number of clusters k, and its centroids are chosen at random. In this paper, we construct a hybrid cluster identification approach that combines the Elbow method and the Silhouette score to identify k. The paper offers three novel model combinations for representing text documents: i) K-means++ - BH + TF-IDF with fixed k, ii) K-means++ - BH + W2V with fixed k, and iii) Hybrid Black Hole with automated k. The proposed improvements are validated on the document clustering problem. Cluster analysis based on two evaluation measures, an external measure (Purity) and an internal measure (Silhouette score), is used to report the findings. Experiments were carried out on four alphanumeric datasets (Doc50, Reuters, WebKB and News20) as well as two numeric datasets (Iris and Wine).
The complete result analysis is reported in detail with respect to each research contribution, comparing the performance of the proposed algorithm with existing clustering methods. Results show that the proposed Hybrid BH algorithm outperforms the existing clustering methods on all datasets. Clustering with and without stop words is examined, and the two alternative word embeddings used for data exploration with the proposed model are also evaluated. The proposed Hybrid BH algorithm handles the optimal value of k efficiently, which is one of the major contributions of this paper; we conclude that the Hybrid Black Hole is an effective algorithm for cluster analysis.

different document sets can vary significantly. Therefore, before using the clustering method, a proper text pre-processing step is required.
2) Secondly, the choice of initialization technique for centroid selection in k-means is important.
3) Thirdly, the correct identification of the k-value remains a challenging task when performing document clustering.
The focus of this research is on an optimization-based approach to clustering problems. We use qualitative research to find the number of clusters formed from the collected data. To the best of our knowledge, the hybridization of the black hole algorithm [11] with a heuristic algorithm (k-means++) [12] has not previously been used to cluster documents. Their stochastic nature improves clustering by recovering from poor solution initialization and avoiding local optima.

In this paper, we propose a novel hybrid clustering approach based on the black hole algorithm for document clustering. The paper is organized as follows: Section 2 covers the literature review on existing approaches to document clustering. The methodology of the proposed work and a detailed description of each module are explained in Section 3. Section 4 focuses on the results and answers the research questions of this study. Section 5 concludes the research with conclusions, enhancements, and possible future work.

II. RELATED WORKS
An analysis of several pieces of literature on document clustering not only gives good knowledge but also helps to identify emerging challenges in the area of clustering [13]. There are numerous methods to solve the document clustering problem.

Lakshmi and Baskar [14] offered a novel DIC-DOC k-means algorithm (dissimilarity-based initial centroid selection for document clustering using k-means). In this method, the document with the lowest standard deviation of term frequency is selected as the initial centroid; the remaining initial centroids are picked based on how dissimilar they are to the centroids already chosen. WebKB and Reuters 8 are the two datasets used to validate the clusters, and documents are compared using the cosine similarity measure. Using three external measures (entropy, purity, and F-measure), the efficiency of the proposed algorithm is compared to different clustering algorithms over a range of k values. The identification of k-values is not addressed in this work.

Abdolreza [11] offered a novel algorithm based on the black hole phenomenon to solve the clustering problem. The research was conducted on six numeric datasets (Iris, Vowel, Wine, Glass, Cancer, and CMC), using error rate and intra-cluster distance as evaluation measures. The experimental findings on these six benchmark datasets indicate that the proposed black hole (BH) algorithm surpasses the tested algorithms (PSO, K-means, and GSA). The mathematical idea of the BH algorithm can be used in combination with other algorithms, which is much more successful than using it individually.

Mohammad et al. [23] introduced a new hybrid algorithm that combines the black hole (BH) algorithm with the bisecting k-means (BK) algorithm. The presented hybrid algorithm (BH+BK-means) combines the global searching ability of the BH algorithm with the quick convergence of k-means. Experiments on various real datasets (CMC, Glass, Iris, Vowel) have shown that using a composite solution with bisecting k-means and the black hole algorithm to find cluster centers is better than using either algorithm alone. Maintaining the sequence of the hybrid algorithm (BH-BK) is important: the overall search performance and efficiency of the BH algorithm are reduced when BK-means clustering is performed before the BH clustering module. The average intra-cluster distance and error rate are used to compare the performance of the provided algorithm. The pre-determination of the value of k is not handled.

Yogesh and Ashish [24] utilized the particle swarm optimization (PSO) approach with K-harmonic means (KHM) for clustering. To overcome KHM's limitations, such as the local optimum problem, PSO is made adaptive with the use of fuzzy logic. Comparison of the suggested method, named enhanced fuzzy PSO-based clustering with K-harmonic means (EFPSOKHM), shows that it produces better clusters than existing algorithms. Five numeric benchmark datasets (Cancer, Iris, Wine, CMC, and Glass) are used to validate the effectiveness of the approach. The pre-determination of the value of k is not handled.

To address the exploration issue in the original black hole algorithm, Haneen et al. [25] suggested a new clustering algorithm named levy flight black hole (LBH). In this algorithm, the movement of each star depends on a step size produced via the Levy distribution. The approach was tested on six datasets (Iris, CMC, Glass, Cancer, Wine, and Vowel) collected from the UCI machine learning repository [22]. Performance is tested via two evaluation measures: sum of intra-cluster distances and error rate. Experimental results demonstrated that the LBH approach escapes easily from local optima and clusters data objects efficiently. The number of clusters k is not handled.
The literature illustrates that several algorithms have been developed to deal with the (NP-hard) document clustering problem, but optimal solutions are not guaranteed; no algorithm is known that finds the optimal solution to NP-hard problems. Many problems are solved by trial and error, but this does not work for all types of problems. For example, the k-means clustering algorithm is treated as an optimization algorithm, but it cannot guarantee optimal clusters because it depends on the initial centroids, which are selected randomly. To cope with the NP-hardness of the clustering problem, researchers have drawn inspiration from nature [26], [27], [28], [29], [30], [31].

A. DATASET COLLECTION
We present results on four standard alphanumeric text datasets [19]: Doc50, News20, WebKB, and Reuters, and two numeric datasets: Iris [34] and Wine [34].

Doc50 is a subset of the News20 dataset containing 50 documents. It is the most basic dataset, having the minimum possible number of unique tokens. Some documents are lengthy emails, while others are simply e-mail chunks; they contain many stop words and special characters.

News20 contains documents from the newsgroups dataset. The data is organized into 20 newsgroups, each with its own topic. Some newsgroups are closely associated, while others are completely unrelated. It is a famous dataset for machine learning research on text applications.

Reuters is a subset of the original Reuters21578 dataset and consists of 12 classes. Each class includes documents on a particular topic, with between 50 and 100 documents per class. It is a set of documents containing news articles.

The Iris dataset is divided into three classes, each with 50 instances related to a different species of iris plant. There are 150 samples in total, each with four characteristics (sepal and petal length and width). The Iris dataset is commonly used for data mining, classification, and clustering, and for algorithm testing.

Complete statistical information, as well as the difference in dimensionality between these datasets, is given in Table 1.

The proposed approach has three main modules, which are explained in Figure 1. Each module is further subdivided into phases, which are discussed in detail as follows.

We started encoding the collected datasets in Python by applying word embedding techniques. We convert alphanumeric data into readable form and remove unnecessary information using the word embedding techniques Word2vec and TF-IDF. Correct feature selection decreases the high dimensionality of the feature space and improves data comprehension, resulting in better cluster creation. We compare our findings with and without stop words, so stop-word removal is set as an optional parameter to check its impact on the results.

The first exploratory step in the clustering process is to represent the text documents uniformly; the goal is to organize the documents coherently. Machine learning algorithms cannot work directly with raw text, so the unstructured documents must be transformed into vectors of numbers. In the document representation phase, each document is represented by the k features with the highest selection metric score. The word embedding technique Word2vec stores the relationships between words; every word is represented by a 32-dimensional vector. Word2vec consists of two models, CBOW (continuous bag of words) and skip-gram. In this work we use CBOW, because it is much quicker to train than skip-gram and has better accuracy for common words. For numeric data we do not use Word2vec; we only apply min-max scaling to normalize the inputs.
Min-max scaling is given in Eq. (1):

x' = (x - x_min) / (x_max - x_min)    (1)

After the successful vector formation for all types of data inputs, the pre-processing module (module 1) is complete and the data is in a standardized format. Next, we discuss the cluster identification phase, in which we determine how many clusters are to be chosen.
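To make the module-1 document representation concrete, the sketch below builds TF-IDF vectors in plain Python. This is our own illustrative code, not the paper's implementation; the tokenized-document input format and all names are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for a list of tokenized documents.

    tf(t, d) = count of t in d / length of d
    idf(t)   = log(N / df(t)), where df(t) is the number of
               documents containing term t.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency per term
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        vectors.append([counts[t] / length * idf[t] for t in vocab])
    return vocab, vectors
```

A term occurring in every document (e.g. an unremoved stop word) gets idf = log(N/N) = 0 and therefore zero weight, which illustrates why the stop-word analysis in Section 4 matters mainly for W2V and cluster geometry rather than for raw TF-IDF weights.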

In this module, the number of clusters is determined. The proposed hybrid approach handles the value of k efficiently, which is one of the major contributions of the paper: we take the average of the findings of both methods (Elbow analysis and Silhouette score) and then proceed. There are two possibilities in this module. First, if the user knows how many clusters are required, they can manually enter the number of clusters needed. In the second situation, when the number of clusters is unknown, the proposed hybrid approach automatically determines it.

The Silhouette score method is used to select the optimal number of clusters present in the data. Cohesion is measured based on the distance between all the points in the same cluster, and separation is based on the nearest-neighbor distance. It is recommended that the user provide the number of clusters if it is already known; if not, the best number of clusters is selected by the proposed hybrid function, choosing the cluster count with the best silhouette score. For example, if elbow analysis suggests that the best elbow is at cluster 3 and the silhouette coefficient suggests that the best score is at cluster 5, then we simply take the average of the two results, giving k = 4 clusters. After the optimal k value is determined, the cluster identification module (module 2) is complete.

Next, we discuss the hybrid black hole phase, in which we achieve the best solutions for cluster formation. When a star is destroyed, k-means++ is invoked to generate a new star. We repeat this process until one of the stopping criteria is met: run for N iterations, or run until the purity threshold is met. When a stopping condition is met, the algorithm stops and we report the best solutions of k-cluster formation.

We propose a global optimal solution by embedding a k-means++ solution into the Black Hole Algorithm, exploiting the global optimality property of the black hole algorithm. We simulated the idea of the event horizon in our algorithm using inspiration from the real-world black hole phenomenon, using purity and silhouette score as the event horizon.

Here K is the set of possible solutions (in terms of purity/silhouette score) generated by the defined k-means++. It is nearly impossible to maximize all of the objective functions at the same time with a single solution k in K, because the objective functions usually vary. The set B is the set of potential solutions k in K for which no other survivable solution is as good as k in all objective functions and strictly better than k in at least one objective function.
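The non-dominated set B described above can be read as a Pareto filter over candidate solutions. The sketch below is our own illustrative reading of that definition, assuming both objectives (purity and silhouette score) are to be maximized; all names are ours.

```python
def pareto_set(solutions, objectives):
    """Return the set B of non-dominated solutions.

    A solution k is kept unless some other solution is at least as
    good on every objective and strictly better on at least one
    (objectives are maximized).
    """
    def dominates(a, b):
        fa = [f(a) for f in objectives]
        fb = [f(b) for f in objectives]
        return all(x >= y for x, y in zip(fa, fb)) and \
               any(x > y for x, y in zip(fa, fb))

    return [k for k in solutions
            if not any(dominates(other, k)
                       for other in solutions if other is not k)]
```

For example, with candidate solutions scored as (purity, silhouette) pairs, a solution beaten on both measures by another candidate is excluded from B, while solutions that trade one measure off against the other both survive.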

The next star position is updated as

k_i(t+1) = k_i(t) + rand × (k_BH − k_i(t)),

where k_i(t) is the current position of star i at iteration t, k_i(t+1) is its next position at iteration (t+1), rand is a random number in [0, 1], and k_BH is the best solution among all stars at each iteration. For j = 1, ..., m, the set B is explicitly defined as in equation 7, which satisfies the criterion f_j(b) = best global solution achieved.

After the successful completion of module 3, the best solution of k-cluster formation is reported.
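The star movement and star replacement of module 3 can be sketched as below. The update rule k_i(t+1) = k_i(t) + rand × (k_BH − k_i(t)) follows the standard BH formulation; the event-horizon radius R = f_BH / Σ f_i shown here is the classic definition from the BH literature and is an assumption, since the proposed algorithm uses purity and silhouette thresholds as its horizon. All names are ours.

```python
import random

def move_star(star, black_hole):
    """Move a star toward the black hole (current best solution):
    k_i(t+1) = k_i(t) + rand * (k_BH - k_i(t)), applied per dimension."""
    return [s + random.random() * (bh - s)
            for s, bh in zip(star, black_hole)]

def crosses_event_horizon(star_fitness, bh_fitness, total_fitness):
    """Classic event-horizon test: a star within radius
    R = f_BH / sum(f_i) of the black hole is destroyed (and, in the
    proposed hybrid, regenerated by k-means++)."""
    radius = bh_fitness / total_fitness
    return abs(bh_fitness - star_fitness) < radius
```

Each destroyed star being re-seeded by k-means++ (rather than uniformly at random) is what distinguishes the proposed hybrid from the original BH algorithm.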

A general pseudo-code of the proposed algorithm is presented in the Algorithm listing. In Eq. (8), n_j is the number of documents in cluster j, n is the corpus size, and p(j) is the ratio of the majority class in that cluster:

p(j) = (1/n_j) max_i (n_ij),

where n_ij is the number of documents of class i in cluster j.
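The purity measure of Eq. (8) can be computed directly from cluster assignments and gold class labels. A minimal sketch, with our own names:

```python
from collections import Counter

def purity(clusters, labels):
    """Overall purity: (1/n) * sum over clusters of the size of the
    majority class in that cluster.

    `clusters` and `labels` are parallel lists giving, for each
    document, its assigned cluster id and its true class id."""
    n = len(labels)
    per_cluster = {}
    for c, y in zip(clusters, labels):
        per_cluster.setdefault(c, Counter())[y] += 1
    return sum(max(counts.values())
               for counts in per_cluster.values()) / n
```

For instance, two 3-document clusters each containing a 2-document majority class give purity (2 + 2) / 6 ≈ 0.667.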

Peter J. Rousseeuw was the first to propose the Silhouette method [40]. The silhouette score measures how closely an object belongs to its own cluster (cohesion) compared to other clusters (separation), and is used to check the validity of a clustering. Its value ranges from -1 to 1. The silhouette plot shows how close each cluster's points are to the points of neighboring clusters; when class labels are unknown, the silhouette coefficient is a particularly relevant estimator. A silhouette value close to 1 indicates that the object has a close relationship with its cluster; a value of 0 indicates that the object is on or near the decision boundary between two neighboring clusters; and negative values indicate that the object may have been assigned to the incorrect cluster [41].

The silhouette value of the ith vector in cluster S_j is given by Eq. (9):

s(i) = (b(i) − a(i)) / max(a(i), b(i)),

where, as in Eq. (10), a(i) is the mean distance between i and all other data points in its own cluster, and b(i) is the smallest mean distance from i to all data points in any other cluster [42].
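A direct pure-Python reading of the silhouette definition above, assuming Euclidean distance; the function and variable names are ours:

```python
def silhouette_value(i, points, clusters):
    """Silhouette of point i: s(i) = (b(i) - a(i)) / max(a(i), b(i)),
    with a(i) the mean intra-cluster distance of i and b(i) the
    smallest mean distance from i to any other cluster."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    own = clusters[i]
    a_dists = [dist(points[i], points[j])
               for j in range(len(points))
               if j != i and clusters[j] == own]
    a = sum(a_dists) / len(a_dists)
    b = min(
        sum(dist(points[i], points[j]) for j in range(len(points))
            if clusters[j] == c) / clusters.count(c)
        for c in set(clusters) if c != own
    )
    return (b - a) / max(a, b)
```

Averaging s(i) over all points gives the overall silhouette score used as the internal evaluation measure in this work; two well-separated clusters push the value toward 1.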

We explain the experimental results of the conducted research in this section, presenting a step-by-step discussion of the results.

k-means selects the initial centroids randomly. Due to this initialization sensitivity, the clustering algorithm faces the following problems: (i) the final formed clusters may be of low quality, and (ii) the solutions may be local optima, because the initial set of centers is not distributed over the dataset.

To avoid this initialization sensitivity in k-means, the k-means++ algorithm is used; the resulting improvements on the four datasets are shown in Table 4. K-means++ is a smart centroid initialization technique based on a probability distribution instead of picking all the centroids randomly. It yields much better performance than the baseline algorithm.
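The k-means++ seeding described above (first center uniform at random, each later center drawn with probability proportional to its squared distance to the nearest chosen center) can be sketched as follows; this is an illustrative implementation with our own names, not the paper's code:

```python
import random

def kmeanspp_init(points, k, rng=random):
    """k-means++ centroid seeding.

    The first center is picked uniformly at random; each subsequent
    center is sampled with probability proportional to D(x)^2, the
    squared distance from x to its nearest already-chosen center."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2 for every point, given the centers chosen so far
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(d2)
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

Because far-away points are exponentially more likely to be chosen, the initial centers tend to spread over the dataset, which is exactly the property the random initialization of plain k-means lacks.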

Table 5 clearly depicts that data with stop words degrades the algorithm's results compared to data without stop words. Whether to remove stop words is determined by the nature of the data. In the proposed work, the datasets (Doc50, Reuters, WebKB and News20) are based on emails, webpages, university courses and news documents, respectively, and not eliminating stop words degrades the performance of the clustering algorithm. In both contexts, the results in Tables 4 and 5 show that k-means++ performs more efficiently than the base k-means algorithm.

The pictorial comparison of the purity results of the heuristic algorithms (k-means and k-means++), with and without stop words, on the four datasets is shown in Figure 2 and the following figure.

According to the analysis in Tables 4 and 5, it is clearly evident that the heuristic algorithm k-means++ performs better than k-means. For this reason, it is embedded in the proposed hybrid to determine the optimum solution: by addressing the shortcomings of the heuristic algorithm and exploring the search space effectively, the proposed hybridization is a strong choice for cluster analysis.

Table 6 shows the results of the existing black hole and of black hole with k-means++ on the alphanumeric datasets without stop words; black hole with k-means++ improves on the existing black hole on all datasets. In Table 7, the results of the existing black hole and black hole with k-means++ are presented with stop words for all text datasets; black hole with k-means++ again performs better on all datasets.

Tables 6 and 7 illustrate that the proposed hybridization (k-means++-BH) performs better than the existing BH algorithm on all datasets, and that eliminating unwanted information from the data improves model performance. Due to the random selection of initial centroids, the existing BH algorithm [19] takes a large number of iterations on each dataset in comparison to the optimization-based k-means++-BH clustering algorithm. The improvement in the results shows that the hybrid possesses greater convergence capability in objective function values: a 2% improvement is observed on the Doc50 and WebKB datasets, 4% on the Reuters dataset, and 15% on the News20 dataset. Tables 6 and 7 express that this hybridization of methods (k-means++-BH) generates more compact clustering than using either algorithm individually.

This section performs a deeper analysis of the impact of the two different word embeddings on the results. The detailed result analysis in terms of the two word embeddings, TF-IDF and W2V, is given in Tables 8 and 9.

The comparison of the two embedding methods with stop words is presented in Table 9; the proposed method has the highest purity score with TF-IDF as compared to W2V. The combination of k-means++ + BH using W2V achieves 90% purity on the Doc50 dataset, 83% on Reuters, 75% on WebKB, and 65% on News20, without stop words. Figure 5 shows the results of the TF-IDF and W2V word embedding techniques with stop words; TF-IDF has the highest purity results compared to W2V on all datasets.

Tables 4, 5, 6, 7, 8 and 9 report two key findings: i) First, using word embeddings, whether with or without stop words, has a considerable impact on the results.

Working without stop words reduces the number of features, which can yield a slight computational benefit. However, whether to eliminate or keep stop words mainly depends on the datasets used and the problem being addressed; they should be removed when they are overused in the data and reduce the effect of other important terms.

ii) Second, the TF-IDF word embedding gives much better results than the W2V embedding.

Because the data used carries little semantic information, W2V performs well only on data whose terms are included in its pre-trained model, whereas TF-IDF gives results based on keyword occurrence in the data. Deciding which embedding method to use mainly depends on the dataset as well as the problem being tackled. The literature provides evidence that TF-IDF achieves better results than the W2V embedding.

As k increases, the distortion score starts to decrease in a linear manner and the graph begins to move almost parallel to the X-axis; the optimal k-value is the one corresponding to this point. Therefore, for the Doc50 dataset, we conclude that the optimal number of clusters is 6.

From Figure 8, the elbow analysis gives 7 as the optimal number of clusters; from Figure 9, the value 19 is obtained as the optimal number of clusters, as it has the maximum silhouette. Figures 8 and 9 illustrate k-value identification using the hybrid method: in this case, the proposed model uses k = 13 as the optimal number of clusters.

From Figure 10, the value 3 is obtained as the optimal number of clusters by the Elbow method: an elbow is found at point 3, so the optimal number of clusters is 3.

From Figure 11, the value 9 is obtained as the optimal number of clusters, as it has the maximum silhouette score. Figures 10 and 11 show that the Elbow method identifies k = 3 and the Silhouette score identifies k = 9; in this case, the proposed model gives k = 6 as the optimal number of clusters.
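The hybrid k-identification rule illustrated by these figures (average of the elbow k and the best-silhouette k) can be sketched as follows. The elbow detector shown here, picking the k with the largest second difference in the distortion curve, is one common heuristic and is our assumption, since the paper does not specify its elbow computation; all names are ours.

```python
def elbow_k(ks, distortions):
    """Pick the elbow as the k with the largest second difference
    (sharpest bend) in the distortion curve -- a common heuristic."""
    bends = [distortions[i - 1] - 2 * distortions[i] + distortions[i + 1]
             for i in range(1, len(ks) - 1)]
    return ks[1 + bends.index(max(bends))]

def hybrid_k(elbow_best, silhouette_best):
    """Proposed hybrid rule: average of the elbow method's k and the
    k with the best silhouette score."""
    return round((elbow_best + silhouette_best) / 2)
```

This reproduces the cases reported in the text: elbow 3 with silhouette 9 gives k = 6, and elbow 7 with silhouette 19 gives k = 13.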

The same k-value identification procedure is performed on the News20 dataset, as shown in Figures 12 and 13. In Figure 12, the value 13 is obtained as the optimal number of clusters by the Elbow method: an elbow is found at point 13, so the optimal number of clusters is 13.

From Figure 13, the value 29 is obtained as the optimal number of clusters, as it has the maximum silhouette score.

The elbow and silhouette analyses are shown in Figures 14 and 15. In Figure 14, the value 4 is obtained as the optimal number of clusters by the Elbow method: an elbow is found at point 4, so the optimal number of clusters is 4.

From Figure 15, the value 2 is obtained as the optimal number of clusters, as it has the maximum silhouette score. Figures 14 and 15 show that the Elbow method identifies k = 4 and the Silhouette score identifies k = 2; in this case, the proposed model gives k = 3 as the optimal number of clusters. From Figure 16, the value 2 is obtained as the optimal number of clusters, as it has the maximum silhouette score.

In the proposed algorithm, two improvements are incorporated to address the issues of the traditional BH algorithm [19].

These issues are convergence rate and diversification. Every execution of the k-means++-BH algorithm consists of the k-means++ algorithm followed by BH, and the optimal solution is generated after the specified parameter settings. In the proposed algorithm, candidate solutions are generated by the heuristic algorithm, and the exploration process of the Hybrid Black Hole explores the search space efficiently. The existing BH derives its global solution from standard k-means (locally optimal) solutions; however, a locally optimal solution sometimes cannot converge to the globally optimal solution. To improve diversification and obtain the global optimum, the proposed method provides an optimal solution through the interaction of multiple local best solutions: every local solution is interpreted as a star, and the best among all local best solutions is selected as the black hole. The proposed Hybrid BH algorithm is then used to optimize the candidate solutions of the heuristic algorithm and determine the global best solution.

Figure 18 displays a graphical view of the results of each method. The experimental analysis shows that the proposed method performs better than the existing methods. Table 11 shows the overall percentage improvement of the proposed method on all datasets compared to the existing method; performance on each dataset improves significantly, with the largest improvement, 13%, on the News20 dataset.

Figure 19 shows a graphical view of the results achieved by each method. The experimental study depicts that the proposed method performs better than the existing methods. Table 13 presents the Silhouette score of the proposed model on the four alphanumeric datasets; this measure is used to calculate the dissimilarity of clusters.

In Table 14, we report the Silhouette scores on the numeric datasets: a score of 0.6 for the Iris dataset and 0.57 for the Wine dataset. The results clearly show the compactness of the clusters formed by the proposed model.

With the rapid growth of document collections in the field of information retrieval, organizing large numbers of text documents is a core problem in data mining. The process of grouping documents with similar properties/content, known as document clustering, is an important part of document organization and management. In document clustering, documents are organized without human intervention, supporting fast information retrieval, topic extraction and filtering, so it is similar to data clustering. The most well-known clustering algorithm is k-means, but it has certain problems: its efficacy depends on the initial seeds chosen for clustering; it does not guarantee globally optimal clusters and easily gets trapped in locally optimal clusters, after which results cannot improve; the number of clusters k is not determined automatically; and centroid initialization is random, so clustering under this method is less efficient. Document clustering is gaining popularity as an important technique for unsupervised document organization and faster information retrieval. In this work, we aim to automatically group