GAE-Based Document Embedding Method for Clustering

Document embedding methods for clustering using deep neural networks have been proposed recently. However, existing deep neural network-based document embedding methods for clustering either generate document embeddings that depend on a given number of document clusters or generate document embeddings that do not take into account the characteristically high similarity between documents belonging to the same cluster. In this paper, we propose a new document embedding method for clustering that uses a graph autoencoder. To this end, we construct an undirected, weighted, sparse graph from a set of documents, wherein each document is represented by a node and every weighted edge connects two nodes with a high cosine similarity. We then apply the proposed graph autoencoder to the graph to compute node embedding vectors, and each node embedding vector is used as a document embedding vector. This paper presents in-depth experimental analyses of the proposed method. Experimental results on various real document datasets demonstrate that the proposed approach affords significant performance improvements over existing document embedding methods.


I. INTRODUCTION
In recent years, research on natural language processing and text mining has been actively conducted owing to the increase in digital text documents on the Internet. In particular, because data accumulates in real time on online media, it typically lacks label information; therefore, the demand for unsupervised-learning-based clustering to obtain new insights is increasing [1]. Clustering is an unsupervised learning algorithm that divides data into multiple clusters using only the information in the given data, without label information, and it is in the spotlight in many fields, from computer science to social networks [2].
Traditional document clustering methods represent documents as very high-dimensional numeric vectors by using the Bag-of-Words (BOW) concept [3] or Term Frequency-Inverse Document Frequency (TF-IDF) [4]. These very high-dimensional document vectors are then clustered by well-known partitional or hierarchical clustering algorithms [5]. However, the performance of these traditional document clustering methods is poor owing to the high dimensionality of the document vectors. To address this problem, Deep Neural Network (DNN)-based document embedding methods have been proposed to reduce the extremely high-dimensional vectors to low-dimensional vectors [6], [7], [8], [9], [10].
(The associate editor coordinating the review of this manuscript and approving it for publication was Kostas Kolomvatsos.)
Deep Embedded Clustering (DEC) [6] was proposed to simultaneously learn feature representations and cluster centroids. DEC uses a pre-trained multi-layer autoencoder to map the input data into a latent feature space and simultaneously learns optimized clusters by repeatedly redefining the cluster centroids through an auxiliary distribution derived from the soft assignments of the data points. However, local structure preservation cannot be guaranteed by its clustering loss. Consequently, the feature transformation may be misguided, leading to corruption of the embedded space [11]. Furthermore, DEC computes the embedding vectors depending on a given number K of clusters; if K is changed, DEC must recompute all the embedding vectors. Spectral Clustering with Deep Embedding (SCDE) [7] was proposed to improve spectral clustering using a deep autoencoder. SCDE first computes a good embedding of the raw document data using the deep autoencoder of DEC [6] and then applies spectral clustering on the embedding to cluster documents. It also proposes a method to estimate the number of clusters based on a softmax autoencoder. However, SCDE has one major problem: because it uses the deep autoencoder of DEC to obtain the document embedding vectors for spectral clustering, it suffers from the same problems as DEC.
Joint Spherical Embedding (JoSE) [9] is a method that learns spherical text embeddings in an unsupervised manner. To this end, a two-step generative model is employed that jointly learns unsupervised word and paragraph embeddings by exploiting word-word and word-paragraph co-occurrence statistics. JoSE demonstrates much better clustering performance than existing document (or paragraph) embedding methods such as averaged Word2Vec word embeddings [12], SIF [13], BERT [8], and Doc2Vec [14].
Sentence-BERT (SBERT) [10] is a modification of the pre-trained BERT network that uses Siamese and triplet network structures. It can derive semantically meaningful sentence embeddings for large-scale semantic similarity comparison, unsupervised tasks such as clustering, and information retrieval via semantic search. In [10], the performance of SBERT on common semantic textual similarity tasks was evaluated, where it outperformed other state-of-the-art sentence embedding methods. However, the document embeddings of SBERT are not effective for clustering document datasets whose member documents contain a large number of words: SBERT (model all-mpnet-base-v2) affords an effective document embedding only for a document with a maximum of 384 words.
In this paper, we propose a new document embedding method for clustering using a Graph AutoEncoder (GAE). For this purpose, we first propose an effective graph construction algorithm for a set of documents. The constructed graph is sparse; a node represents a document, and a weighted edge represents the cosine similarity between two documents. We then propose our GAE to compute node embedding vectors from the constructed graph, and we use each node embedding vector as a document embedding vector for clustering. These document embedding vectors are independent of the number of document clusters. Consequently, they can be reused by the K-means clustering algorithm for any value of K, without recomputing the document embedding vectors for different K values as SCDE and DEC must.
The main contributions of our proposed method are summarized as follows:
• We propose an efficient algorithm to construct a sparse graph from a set of documents.
• We develop an effective document embedding method optimized for clustering using a GAE.
• The proposed GAE-based document embedding method does not depend on the number K of document clusters.
The remainder of this paper is organized as follows. Section II first describes how to construct an undirected and weighted sparse graph from a set of documents and then describes the GAE model architecture, which computes document (node) embedding vectors. Extensive experimental analyses of our document embedding method are presented in Section III. Finally, Section IV provides our concluding remarks.

II. METHOD: GAE-BASED DOCUMENT EMBEDDING METHOD
In this section, we describe a document embedding method for clustering with a graph autoencoder model. We first describe how to construct a sparse weighted graph from a set of documents for use with the GAE. We then describe our GAE model architecture, from which document embeddings are computed as node embedding vectors.

A. GRAPH MODELING
Given a set D = {d_1, ..., d_N} of documents, where each d_i represents the m-dimensional tf-idf vector of document i, we construct an undirected and weighted graph G(V, E) so that a graph convolutional neural network can be applied. We model each document d_i in D as a node in G(V, E), and an undirected, weighted edge in E is created only when its two end nodes have a large cosine similarity. These cosine similarities between two nodes are used as the edge weights. Thus, the very high-dimensional tf-idf vectors of all the documents are transformed into an undirected and weighted graph structure.
To identify the edges having large cosine similarities between documents, we first construct the cosine similarity matrix C for the document set D = {d_1, ..., d_N} as follows: each entry C_ij of C is the cosine similarity between documents d_i and d_j, and all diagonal entries C_ii are set to zero.
By using the largest cosine similarity in each row of C, we then normalize all the entries of C into the matrix NC: with MC_i defined as MC_i = max_{1≤j≤N} C_ij, each entry is normalized as NC_ij = C_ij / MC_i. For each node d_i, let TopK_i be the set of the K candidate edges (d_i, d_{i_l}, NC_{i,i_l}), 1 ≤ l ≤ K, with the largest normalized cosine similarities in row i. An edge (d_i, d_{i_l}, NC_{i,i_l}) in TopK_i is added to the edge set Z only when the node d_{i_l} has a normalized cosine similarity larger than or equal to α * NC_{i,i_K} with some of the neighbor nodes in the edges already created for node d_i. Here, α is a real constant between 0 and 1. Consequently, the constructed graph is sparse.
The nodes in the graph G(V, E) constructed by this method are directly connected only to nodes with which they have large cosine similarities. The similarities among documents are thus captured by a graph structure in which the weight of an edge represents the degree of similarity between its two end nodes (i.e., documents). Algorithm 1 summarizes the construction of the graph G(V, E) from a set of documents.
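The graph construction above can be sketched as follows. This is a minimal NumPy illustration, not the paper's Algorithm 1: the function and variable names are our own, and the edge-selection rule is simplified to keeping top-K neighbors whose normalized similarity exceeds α times the row maximum (which is 1 after normalization).

```python
import numpy as np

def build_sparse_graph(tfidf, K=50, alpha=0.2):
    """Sketch of the graph construction: nodes are documents, and edges
    keep only large row-normalized cosine similarities. Simplified
    illustration; names and the exact pruning rule are our assumptions."""
    # Cosine similarity matrix C with zeroed diagonal.
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    unit = tfidf / np.clip(norms, 1e-12, None)
    C = unit @ unit.T
    np.fill_diagonal(C, 0.0)

    # Row-wise normalization by the largest similarity in each row (MC_i).
    MC = C.max(axis=1, keepdims=True)
    NC = C / np.clip(MC, 1e-12, None)

    edges = set()
    N = C.shape[0]
    K = min(K, N - 1)
    for i in range(N):
        top = np.argsort(NC[i])[::-1][:K]       # top-K neighbors of node i
        threshold = alpha * NC[i, top[0]]       # alpha * largest NC in row i
        for j in top:
            if NC[i, j] >= threshold:
                edges.add((min(i, int(j)), max(i, int(j))))  # undirected edge
    return edges, NC
```

Because each row of NC is normalized by its own maximum, α acts as a relative similarity threshold, which keeps the resulting edge set sparse.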

B. GAE MODEL ARCHITECTURE
We construct the weighted adjacency matrix A_{N×N} and the degree matrix D_{N×N} from the graph G(V, E) obtained by Algorithm 1. We then normalize the weighted adjacency matrix A for the GAE using A_{N×N}, D_{N×N}, and an identity matrix I_{N×N}. The normalized weighted adjacency matrix Â is defined as follows:
Â = (D + I)^(-1/2) (A + I) (D + I)^(-1/2).
We use a multi-layer GAE to generate the node embedding vectors Z of the nodes (i.e., documents) in the graph G with the following layer-wise propagation rule for the encoder:
H^(l+1) = ReLU(Â H^(l) W^(l) + B^(l)),
where W^(l) and B^(l) are layer-specific trainable weight and bias matrices, H^(l) ∈ R^{N×D} is the matrix of activations in the l-th layer, and H^(0) = X. A one-hot encoding vector of dimension N is used as the feature vector of each node in the graph; we do not use the tf-idf vector of a document as the feature vector of the corresponding node. Let X_{N×N} be the matrix consisting of the N-dimensional feature vectors of the N nodes in the graph. The decoder of our GAE computes the pair-wise similarities between the node embedding vectors of the final layer of the encoder. The decoder then reconstructs the unweighted graph adjacency matrix M̂, which is defined as
M̂_uv = σ(z_u^T z_v),
where z_u is the embedding vector of node u and σ(·) denotes the sigmoid activation function. We train our proposed GAE by minimizing the cross-entropy loss between the unweighted adjacency matrix M and the reconstructed adjacency matrix M̂.
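A minimal NumPy sketch of the encoder propagation rule and the inner-product decoder is given below. It assumes the standard GCN-style symmetric normalization with self-loops and a ReLU activation at every layer; the function names and the choice of activation at the last layer are our own assumptions, not the paper's exact implementation.

```python
import numpy as np

def normalize_adjacency(A):
    """Compute the normalized adjacency (D+I)^(-1/2) (A+I) (D+I)^(-1/2)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)                      # weighted degrees with self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gae_forward(A, X, weights, biases):
    """Encoder H^(l+1) = ReLU(A_hat H^(l) W^(l) + B^(l)),
    followed by the inner-product decoder M_hat = sigmoid(Z Z^T)."""
    A_hat = normalize_adjacency(A)
    H = X
    for W, B in zip(weights, biases):
        H = np.maximum(A_hat @ H @ W + B, 0.0)   # ReLU activation
    Z = H                                        # node embedding vectors
    M_hat = sigmoid(Z @ Z.T)                     # reconstructed adjacency
    return Z, M_hat
```

With a one-hot feature matrix X = I, the first layer effectively learns a free embedding per node, which matches the paper's choice of not using tf-idf vectors as node features.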
We define pairs of connected nodes in the graph as positive samples and pairs of unconnected nodes as negative samples for computing the cross-entropy loss function. Positive samples are all the edges (i.e., connected node pairs) present in the graph G(V, E). For each training epoch, a different set of negative samples, equal in number to the positive samples (i.e., |E|), is generated by random sampling among the unconnected node pairs in the graph. Let E_neg be the set of negative samples, where |E| = |E_neg|. The cross-entropy loss function of our proposed GAE is given as follows:
L = - Σ_{(u,v)∈E} log σ(z_u^T z_v) - Σ_{(u,v)∈E_neg} log(1 - σ(z_u^T z_v)).
This loss function enforces connected nodes to have similar node embedding vectors and unconnected nodes to have dissimilar node embedding vectors. After all the layers are trained in this manner, we use the output of the final layer of the encoder as the final node embedding vectors. We then run the K-means and spherical K-means (SK-means) algorithms on these node embedding vectors to compute document clusters.
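The loss and the per-epoch negative sampling described above can be sketched as follows. This is an illustrative NumPy version under the stated definitions; function names are our own, and numerical clamping inside the logarithms is an implementation detail we add for stability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gae_loss(Z, pos_edges, neg_edges):
    """Cross-entropy reconstruction loss over positive (connected) and
    negative (sampled unconnected) node pairs; a sketch, not the paper's code."""
    loss = 0.0
    for u, v in pos_edges:                       # connected pairs: push sigma(z_u . z_v) toward 1
        loss -= np.log(sigmoid(Z[u] @ Z[v]) + 1e-12)
    for u, v in neg_edges:                       # unconnected pairs: push toward 0
        loss -= np.log(1.0 - sigmoid(Z[u] @ Z[v]) + 1e-12)
    return loss

def sample_negative_edges(num_nodes, pos_edges, rng):
    """Draw |E| unconnected node pairs uniformly at random, redone each epoch."""
    pos = {frozenset(e) for e in pos_edges}
    neg = []
    while len(neg) < len(pos_edges):
        u, v = rng.integers(0, num_nodes, size=2)
        if u != v and frozenset((int(u), int(v))) not in pos:
            neg.append((int(u), int(v)))
    return neg
```

Resampling E_neg every epoch means each unconnected pair contributes only occasionally, which keeps the loss balanced between the |E| positive and |E| negative terms.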

III. EXPERIMENTS
In this section, we present experimental results on several document datasets to evaluate our proposed document embedding method. We compare our proposed GAE-based Document Embedding Method (GDEM) with the following baseline document embedding methods: DEC [6], SCDE [7], JoSE (Joint Spherical Embedding) [9], and SBERT (model all-mpnet-base-v2) [10]. SBERT maps sentences and paragraphs to a 768-dimensional dense vector space that can be used for tasks such as clustering or semantic search; as of September 2022, all-mpnet-base-v2 is the best-performing SBERT model.

A. DATASETS
We performed experiments on the following six text datasets, which are widely used for evaluating the performance of document clustering methods.
• 20Newsgroups: A collection of 18,846 text documents categorized into 20 different newsgroups. All the meaningful words from the dataset are extracted to compute tf-idf vectors for GDEM.
• Opinosis: The Opinosis dataset contains sentences extracted from user reviews on given topics. In total there are 51 such topics, with each topic having approximately 100 sentences. All the meaningful words from the dataset are extracted to compute tf-idf vectors of size 5263 for GDEM.
• WebKB: A dataset of web pages from the computer science departments of various universities; its 4,518 web pages are categorized into 6 imbalanced categories. All the meaningful words from the dataset are extracted to compute tf-idf vectors of size 6887 for GDEM.
Table 1 summarizes the detailed characteristics of the six document datasets.

B. EVALUATION METRIC
We use three widely used external measures as evaluation metrics: Normalized Mutual Information (NMI) [15], Adjusted Rand Index (ARI) [16], and Clustering Accuracy (ACC) [6]. NMI has a range of [0, 1], with one being perfect clustering and zero the worst; ARI has a range of [-1, 1], with one being the best clustering performance and minus one the worst. ACC measures the fraction of documents assigned to the correct category under the best one-to-one mapping between clusters and ground-truth categories. ACC also has a range of [0, 1], with one being perfect clustering and zero the worst.
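The ACC metric can be illustrated with the following sketch, which brute-forces the best one-to-one mapping between predicted clusters and ground-truth categories. The function name is our own; in practice the optimal mapping is usually found with the Hungarian algorithm rather than by enumerating permutations, which is only feasible for small cluster counts.

```python
from itertools import permutations

def clustering_accuracy(labels_true, labels_pred):
    """ACC: fraction of points correctly assigned under the best one-to-one
    mapping between predicted clusters and ground-truth categories.
    Brute-force over permutations; fine only for small numbers of clusters."""
    clusters = sorted(set(labels_pred))
    classes = sorted(set(labels_true))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))      # cluster id -> class id
        hits = sum(mapping[p] == t for p, t in zip(labels_pred, labels_true))
        best = max(best, hits)
    return best / len(labels_true)
```

For example, a clustering that recovers the ground-truth partition but with permuted cluster ids still receives ACC = 1, since the metric is invariant to cluster relabeling.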

C. IMPLEMENTATION
We first construct a graph G(V, E) from a given set D of documents by using Algorithm 1 with α = 0.2. We then implement our proposed GAE with a two-layer encoder to generate the node (i.e., document) embedding vectors Z of the graph G(V, E) as follows.
Two layers of learnable weight and bias matrices, W^(0), B^(0), W^(1), and B^(1), are used. H^(0) is the matrix X consisting of the N-dimensional one-hot encoding feature vectors of the N nodes in the graph. We implemented all algorithms in Python with PyTorch and PyTorch Geometric. We used different encoder architectures depending on the size of the dataset: 128-64 dimensions for 20Newsgroups and 64-32 dimensions for the rest of the document datasets. To compute TopK_i, we set K to 20 for 20Newsgroups and to 50 for the rest of the datasets. We trained the GAE for 200 iterations using Adam with a learning rate of 0.01. We use both K-means and SK-means as clustering methods.
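The final clustering step on the learned embeddings can be sketched as below. This is our own minimal spherical K-means illustration (cosine similarity on unit-normalized embeddings), not the paper's implementation; the function name, initialization scheme, and iteration count are assumptions.

```python
import numpy as np

def spherical_kmeans(Z, K, iters=50, seed=0):
    """Minimal SK-means sketch: cluster unit-normalized embedding vectors
    by cosine similarity, renormalizing centroids to the unit sphere."""
    rng = np.random.default_rng(seed)
    X = Z / np.clip(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12, None)
    centers = X[rng.choice(len(X), K, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        assign = np.argmax(X @ centers.T, axis=1)       # nearest centroid by cosine
        for k in range(K):
            members = X[assign == k]
            if len(members):
                c = members.sum(axis=0)
                centers[k] = c / np.clip(np.linalg.norm(c), 1e-12, None)
    return assign, centers
```

Standard K-means differs only in using Euclidean distance and unnormalized centroid means; both can be run on the same embedding matrix Z.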
We visualize the effectiveness of GDEM by showing t-SNE plots of the document embedding vectors after 10 and 200 training epochs of GDEM for the six datasets in Figure 1. Table 2 presents the evaluation of the K-means and SK-means clustering methods on the six datasets for each document embedding method. We use the ground-truth number of clusters K given in Table 1 for the K-means and SK-means clustering methods. The best performance on each dataset among the five document embedding methods is shown in bold in the table. Our proposed document embedding method (GDEM) outperforms the baseline methods by large margins on the 20Newsgroups, BBC, WebKB, and Reuter10k datasets. For the Opinosis and Reuter-8 datasets, GDEM performs worse than SBERT but much better than the rest of the document embedding methods.
From Tables 1 and 2, we find that SBERT gives good embeddings for document datasets whose member documents are short, i.e., consist of a small number of words. The characteristics of the Opinosis and Reuter-8 datasets in Table 1 show that they contain short documents, with an average of 17 and 102 words per document, respectively. As shown in Tables 1 and 2, our GDEM gives good and effective embeddings for document datasets with medium- and large-sized documents.

2) CLUSTERING PERFORMANCE COMPARISON WITH DIFFERENT NUMBERS OF CLUSTERS
In this experiment, we vary the number of clusters K to evaluate the effectiveness of our embedding method (GDEM) compared with DEC, SCDE, JoSE, and SBERT under the SK-means and K-means clustering algorithms. Note that we do not recompute the document embeddings for the SCDE and DEC methods as K increases; their document embeddings are computed only once, with K fixed to the ground-truth number of clusters given in Table 1 for each dataset.
As K increases, Figures 2, 3, 4, and 5 show that the K-means and SK-means clustering methods using GDEM give much better ARI and NMI than those using DEC, SCDE, JoSE, and SBERT for 20Newsgroups, BBC, Reuter10k, and WebKB. SBERT performs better than GDEM, DEC, SCDE, and JoSE for the Opinosis and Reuter-8 datasets, which contain short documents. Interestingly, the document embedding methods that depend on K, such as SCDE and DEC, perform worse than GDEM, JoSE, and SBERT as K changes. Our GDEM also demonstrates very good clustering performance for document datasets with medium- and large-sized documents as K changes.

IV. CONCLUSION
We proposed an effective document embedding method, GDEM, for clustering using a GAE. We first proposed an effective graph construction algorithm for use with the GAE. Our graph construction algorithm creates a sparse, undirected, weighted graph from a set of documents, which effectively captures the high similarities between documents through its weighted edges. The node embedding vectors computed by GDEM from this graph are used as the document embedding vectors. Unlike most existing DNN-based document embedding methods, the document embedding vectors obtained by GDEM are generated independently of the number of document clusters. We performed extensive experimental analyses of GDEM on various real document datasets in comparison with existing DNN-based document embedding methods, and demonstrated that GDEM affords excellent and effective document embeddings for datasets with medium- and large-sized documents.