Finding the Number of Latent Topics With Semantic Non-Negative Matrix Factorization

Topic modeling, or identifying the set of topics that occur in a collection of articles, is one of the primary objectives of text mining. One of the big challenges in topic modeling is determining the correct number of topics: underestimating the number of topics results in a loss of information (omission of topics, i.e., underfitting), while overestimating leads to noisy, unexplainable topics and overfitting. In this paper, we consider a semantic-assisted non-negative matrix factorization (NMF) topic model, which we call SeNMFk, based on Kullback-Leibler (KL) divergence and integrated with a method for determining the number of latent topics. SeNMFk involves (i) creating a random ensemble of pairs of matrices whose means are equal, respectively, to the initial words-by-documents matrix representing the text corpus and to the Shifted Positive Pointwise Mutual Information (SPPMI) matrix, which encodes the context information, and (ii) jointly factorizing each of these pairs with different numbers of topics to acquire sets of latent topics that are stable to noise. We demonstrate the performance of our method by identifying the number of topics in several benchmark text corpora and comparing against other state-of-the-art techniques. We also show that the number of document classes in the input text corpus may differ from the number of extracted latent topics, but that these classes can be retrieved by clustering the column vectors of one of the factor matrices. Additionally, we introduce a software package, pyDNMFk, to estimate the number of topics. We demonstrate that our unsupervised method, SeNMFk, not only determines the correct number of topics, but also extracts topics with high coherence and accurately classifies the documents of the corpus.

We introduce a method for SeNMFk coupled minimization based on KL divergence and concatenation of the TF-IDF and SPPMI matrices.

II. INTRODUCTION
The paradigm shift from printed books toward the proliferation of digital libraries, from printed newspapers to electronic news dissemination, and the growth of huge repositories of web documents and hypertext from web and web-enabled applications have created a colossal amount of text data. Distilling useful knowledge and keeping up with relevant information from this data is a burgeoning challenge. The extraction of relevant information and useful insights from text is called text mining [2]. Typically, text data is organized in text corpora, large collections of documents, in which each document can be modeled as a combination of recurring themes or hidden (latent) concepts called topics. These topics represent the latent (not directly observable) thematic information across a text corpus, and their extraction is known as topic modeling.
Topic modeling has been used extensively to automate the analysis of large text corpora. For example, topic modeling was utilized in social media analytics to understand conversations between people in online communities, as well as to extract useful patterns in what is shared on social media [3], [4]. It can be used to track the evolution of a software project to shed light on how the project is changing over time [5]. Topic modeling also has an immense potential in scientific research. It has been used for various tasks, from discovering topics in political speeches [6] to analyzing safety information in drug labels to find new uses for currently existing drugs [7]. It can even be used as a tool to accelerate the research process itself, by finding relevant publications in a constantly growing collection of scientific literature [8].
The term frequency-inverse document frequency (TF-IDF) matrix is a heuristic document-term matrix that reflects the importance of each word in a collection of documents [9]. The latent semantic structures or topics underlying this matrix can be extracted via topic modeling, expressed as a low-rank factorization of the matrix. In the decomposition of the TF-IDF matrix, each topic consists of a collection of words, and each document is approximated by a linear combination of these topics. As the elements of the TF-IDF matrix are strictly non-negative, dimensionality reduction through Non-negative Matrix Factorization (NMF), where all the elements of the factor matrices are required to be non-negative, is an obvious choice for the topic modeling task.
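As a concrete illustration, the words-by-documents TF-IDF matrix can be built from a toy corpus in a few lines; this is a minimal sketch assuming scikit-learn is available, not the paper's preprocessing pipeline (which is described in the Evaluation section).

```python
# Minimal sketch of building a TF-IDF words-by-documents matrix (toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "topic models extract latent topics",
    "matrix factorization reveals latent structure",
    "documents are mixtures of topics",
]

vectorizer = TfidfVectorizer()
# scikit-learn returns a documents-by-terms matrix; transpose it to get
# the words-by-documents orientation used in the paper.
X = vectorizer.fit_transform(corpus).T.toarray()
print(X.shape)  # (number of distinct words, number of documents)
```

Every entry of the resulting matrix is non-negative, which is what makes NMF applicable.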
NMF is an unsupervised learning method that approximates a non-negative matrix, X ∈ R+^(F×N), by a product of two factor matrices, W ∈ R+^(F×K) and H ∈ R+^(K×N), such that X_ij ≈ Σ_{s=1}^{K} W_is H_sj [10]. Here F, N, and K correspond to the number of words, the number of documents, and the number of topics, respectively. W and H are both non-negative and share the latent dimension K. NMF is based on a generative statistical model, where the number of superimposed latent components is the latent dimension K [11]. NMF minimization is equivalent to the expectation-maximization (EM) algorithm. In this probabilistic interpretation of NMF, the observed variables are the columns x_1, ..., x_N of the matrix X, which are generated by the latent variables given by the rows h_1, ..., h_K of the matrix H. Thus, each observable x_i is generated from a probability distribution with mean <x_i> = Σ_{s=1}^{K} w_s H_si [12], and the influence of the latent variables on x_i is through the basis patterns represented by the columns w_1, ..., w_K of the matrix W. When NMF is applied to a text corpus for text mining, the basis patterns, represented by the columns of the matrix W, are the topics whose linear combinations span the entire corpus, and the number of topics is equal to the dimension K. In this case, each document is represented as a linear combination of the extracted topics with coefficients given by the corresponding columns of the matrix H.
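The factorization X ≈ WH with a KL objective can be computed with the classical multiplicative updates. The sketch below is an illustrative NumPy implementation of generalized-KL NMF (the objective SeNMFk builds on), not the paper's distributed solver; the iteration count and tolerance are arbitrary choices.

```python
# Illustrative NMF via multiplicative updates for the generalized KL divergence.
import numpy as np

def nmf_kl(X, K, n_iter=200, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    F, N = X.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # H <- H * (W^T (X / WH)) / (W^T 1)
        H *= (W.T @ (X / WH)) / (W.T @ np.ones_like(X) + eps)
        WH = W @ H + eps
        # W <- W * ((X / WH) H^T) / (1 H^T)
        W *= ((X / WH) @ H.T) / (np.ones_like(X) @ H.T + eps)
    return W, H

X = np.abs(np.random.default_rng(1).random((30, 20)))  # toy non-negative data
W, H = nmf_kl(X, K=4)
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))   # relative error
```

The updates keep W and H non-negative by construction, which is why the resulting topic vectors are directly interpretable.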
Identification of the correct number of topics, called the latent dimension of a corpus, is one of the main challenges for the existing topic modeling methods. Estimating the correct number of topics is extremely important: underestimating the number of topics results in a loss of information (omission of topics, i.e., underfitting), while overestimating leads to noisy, unexplainable topics and overfitting.
In order to address this challenge, we propose in this paper Semantic Non-negative Matrix Factorization (SeNMFk), an NMF method with a Kullback-Leibler (KL) divergence [13] foundation that incorporates: (i) the semantic structure through term-context relations, with a coupled minimization of the TF-IDF matrix X, representing the corpus, and the Shifted Positive Point-wise Mutual Information (SPPMI) matrix, M, representing the semantic information; (ii) determination of the number of topics based on their stability, investigated via random sampling of pairs of X and M matrices and a subsequent custom clustering of the sets of topic vectors with different numbers of topics. KL divergence measures the difference between two probability distributions by quantifying the information lost when one of the distributions is used to approximate the other. This method allows us to find the latent dimension of the semantic-enhanced topics based on their stability.
Overall, the main contributions of this paper are as follows: • We present in detail our new method, SeNMFk, for topic modeling.
• We benchmark SeNMFk on nine well-known text corpora and demonstrate that SeNMFk accurately determines the number of topics in these text corpora.
• We characterize and compare the extracted latent topics with state-of-the-art topic models applied to the same datasets.
• We introduce a new statistical measure called L-statistics (the name is motivated by the L-curve) to analyze the improvement of the reconstruction error as the latent dimension increases, which helps to identify the number of latent topics.
• We present in detail H-clustering, a procedure that clusters the coordinates of the documents in the topic space, which allows SeNMFk to accurately determine the classes of same-topic documents in the analyzed text corpora.
• We present high-performance software, pyDNMFk, for estimating the number of latent components, which scales to large datasets on CPUs and GPUs for accelerated performance.

III. RELATED WORK
A. TOPIC MODELING
The use of topic models has increased due to their applicability in the exploratory analysis of huge amounts of text data [14]-[16]. Some of the widely used topic modeling methods are: Latent Semantic Analysis (LSA) [17], Probabilistic LSA (PLSA) [18], Latent Dirichlet Allocation (LDA) [19], Non-negative Matrix Factorization (NMF) [20], and others. LDA follows a generative probabilistic model, which assumes that each document can be represented as a probabilistic distribution over latent topics, and that the topics across the entire corpus share a common Dirichlet prior. Also, each latent topic is a probabilistic distribution over words, and the distributions of words in the topics share a common Dirichlet prior [21]. LDA has been successfully applied in various areas [15], [22], [23].
Another class of topic modeling methods is non-probabilistic and based on dimension-reduction techniques, such as LSA and NMF, which decompose a matrix-based representation of the text corpus using a bag-of-words (BoW) or TF-IDF format. LSA is a factor analysis method that performs an eigenvector-based orthogonal decomposition, known as singular value decomposition (SVD), on the word-document matrix. The orthogonality constraint forces SVD to choose topics that do not overlap with one another, but leads to negative factors or components. In spite of the differences between the two modeling approaches, probabilistic and factor-based, they are closely related. NMF with KL divergence is underpinned by a generative Poisson model [24], which is equivalent to LDA under a uniform Dirichlet prior [25], and NMF and PLSA optimize the same objective function [26].
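The point about negative components in SVD-based factors can be seen directly: for a non-negative matrix, the leading singular vector is non-negative, but every subsequent singular vector is orthogonal to it and therefore must contain negative entries. A small NumPy illustration (the matrix here is a random stand-in, not real corpus data):

```python
# Why LSA factors are hard to read as topics: SVD components contain
# negative word weights, which NMF forbids by construction.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 40))          # stand-in for a words-by-documents matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
topics = U[:, :5]                 # first 5 "LSA topics"
print((topics[:, 1:] < 0).any())  # negative weights beyond the first component
```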
One advantage of non-negative matrix factorization is that it provides interpretable factors, and it is widely used in text mining and document clustering [20]; it also has a semi-supervised version implemented in UTOPIAN [27]. A significant limitation of the NMF approaches has been the lack of integration of word semantics, which has been successfully used in deep learning in the form of word embeddings [28], equivalent to factorizing the co-occurrence matrix [29].
Recently, word semantics was introduced in NMF as a regularization term based on Shifted Positive Point-wise Mutual Information (SPPMI) matrix [29]. Similarly, NMF was utilized for topic modeling of corpora of short texts [30] as well as in the Non-negative Matrix Tri-Factorization (NMTF) [31]. Such methods, which integrate NMF and word semantics, have demonstrated a better coherence of the extracted topics and high quality of the document clustering. However, they all require as a prerequisite the knowledge of the number of the latent topics.

B. MODEL DETERMINATION
The biggest challenge in topic modeling is model determination, or the identification of the accurate number of topics in the raw text corpus. The existing widely-used model determination techniques in topic modeling are as follows.
Perplexity and Coherence: These techniques use the perplexity or coherence metrics to determine the accurate number of hidden topics. Perplexity is a heuristic measure commonly used to characterize topic quality [19], which quantifies how well a probability model predicts a sample. It has been demonstrated that perplexity does not reflect the semantic coherence of topics [32]. Therefore, coherence was introduced as a new metric [33] for estimating the quality of the extracted topics. Hierarchical Latent Tree Analysis (HLTA) [34] is another topic modeling method, which utilizes the coherence metric on a hierarchy of co-occurrence patterns to identify the most coherent topics.
L-curve methods: This approach can be used in different dimensionality reduction methods or in factor analysis. It determines the optimal latent dimension by finding the elbow point of the curve showing the relative error of reconstructed data from its factors at a given latent dimension, i.e., the number of topics. NMF methods commonly try to utilize the L-curve method to identify the latent dimension K [35]. The L-curve method may not always show an accurate elbow point, especially in the presence of noisy data.
Probabilistic methods: For probabilistic methods, one of the popular latent dimension identification techniques is automatic relevance determination (ARD), introduced in reference [36] and utilized for NMF in reference [37]. ARD is also used in Bayesian matrix factorization. The prediction of the latent dimensionality significantly relies on assumptions about the prior distribution, which often are difficult to estimate solely from the initial data. Stability methods: Some techniques identify topics based on the stability of the clusters of the given data. Brunet et al. [38] presented a consensus clustering based framework to determine the stability of the NMF solutions on the given data X obtained from random initial conditions for each explored K. A term-centric stability approach has been proposed in reference [39] that investigates the agreement between the rankings of the terms of several NMF minimizations based on the Frobenius norm.
Unfortunately, the stability approaches have difficulty identifying the correct number of topics needed for effective topic modeling and model determination, although stability metrics can be used to characterize the quality and stability of topics.
In contrast, our method, SeNMFk [1], is semantically assisted and is an improvement over the recently developed method NMFk [40]-[44] for determination of the number of latent topics. NMFk has previously been applied to factorize the world's largest cancer genomics dataset and to introduce mutational signatures [45], [46], as well as in various other fields [47]-[53]. NMFk estimates the number, K, of latent features based on their stability. It creates a set of random matrices whose mean is equal to the initial data for each K and clusters the set of latent features resulting from NMF minimizations. Finally, K is determined based on the stability of the clusters and the accuracy of the reconstruction of the initial data.

IV. SeNMFk
A diagram illustrating our topic modeling method, SeNMFk, is presented in Fig. 1. SeNMFk includes the following components:
1) Custom resampling: We generate an ensemble of M random pairs of matrices, [X_q, M_q], q = 1, ..., M, with means equal to the TF-IDF and SPPMI matrices, X and M, built by perturbing each of the elements of these matrices with random uniform noise.
2) SeNMF minimization: We explore different numbers of latent topics by running the Semantic Assisted NMF (SeNMF) algorithm for each number of topics, k, in an interval [k_min, k_max] and for each of the M random realization pairs, (X_q, M_q). SeNMF is a joint factorization performed through a coupled minimization, with a KL divergence, of (X_q, M_q), to account for the semantic information.
3) Custom clustering: For each k ∈ [k_min, k_max], we cluster the set of the M·k latent topics obtained by SeNMF. To extract the latent dimension, we determine how the stability of the obtained clusters and the improvement of the reconstruction error depend on the latent dimension, k. The final latent topics, W, are the medoids of the obtained stable clusters, and H denotes the corresponding document-membership matrix.
4) H-clustering: We analyze H to extract the document classes/clusters in the text corpus.
In the following subsections we describe these steps in detail. Specifically, SeNMF minimization is explained in subsection IV-A, custom resampling and custom clustering in subsection IV-B, L-statistics in subsection IV-C, and H-clustering in subsection IV-D.

A. SeNMF: SEMANTIC ASSISTED NMF WITH KL-DIVERGENCE
SeNMFk solves the following optimization problem (Fig. 2):

min_{W,H,G >= 0} D_KL(X||WH) + α D_KL(M||WG),

where D_KL(X||X̂) = Σ_{i,j} (X_ij log(X_ij / X̂_ij) − X_ij + X̂_ij) is the generalized KL divergence between matrices X and X̂, and α is a regularization parameter controlling the weight of the semantic matrix decomposition. The first term in the minimization, D_KL(X||WH), is the error of the approximation of the TF-IDF matrix X by a product of a term-topic matrix, W, and a topic-document matrix, H. The term importance in each topic is determined by the k columns of W, where W_ij indicates the weight of word i in topic j. Similarly, the topic importance in each document is determined by the k rows of H, where H_ij encodes the activation of topic i in document j. The second term in the minimization, D_KL(M||WG), injects contextual information into the term-topic matrix, W. This regularization compels the columns of W, the topics, to represent the contextual information stored in the columns of M. In the optimal solution, the rows of G encode the importance of each topic for representing the contextual information of each word, so G_ij measures the importance of topic i in representing the contextual information of word j.
We efficiently solve this problem by concatenating the TF-IDF matrix X with the scaled SPPMI matrix, αM, and applying our open-source software pyDNMFk (see the Software section) to the concatenation, [X|αM].
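The concatenation trick works because, for the generalized KL divergence, D_KL([X|αM] || W[H|αG]) = D_KL(X||WH) + α·D_KL(M||WG), so a single KL-NMF of the concatenated matrix recovers the coupled solution. The sketch below illustrates this with scikit-learn's NMF on random stand-in matrices; the solver settings and toy sizes are illustrative choices, not the paper's configuration (which uses pyDNMFk).

```python
# Coupled SeNMF minimization via a single KL-NMF of the concatenation [X | aM].
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
F, N, K, alpha = 40, 30, 5, 1.0
X = rng.random((F, N))            # stand-in TF-IDF matrix
M = rng.random((F, F))            # stand-in SPPMI matrix

Y = np.hstack([X, alpha * M])     # concatenation [X | aM]
model = NMF(n_components=K, beta_loss="kullback-leibler", solver="mu",
            init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(Y)        # shared term-topic factor
HG = model.components_
H, G = HG[:, :N], HG[:, N:] / alpha   # split back into H and G
```

Note that the scale α is absorbed into the right factor, so G is recovered by dividing the corresponding block by α.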

B. MODEL DETERMINATION
SeNMFk uses the stability of the extracted latent topics across alternative values of the latent dimension k in order to determine the correct k. Figure 1 illustrates the workflow of SeNMFk, while Algorithm 1 presents the SeNMFk pseudocode. As a first step, SeNMFk concatenates the TF-IDF matrix X and the SPPMI context matrix scaled by the regularization parameter α. This concatenation results in a new matrix, Y = [X|αM]. Next, we generate a random ensemble of r matrices, (Y_1; Y_2; ...; Y_r), with mean value equal to Y. Further, we factorize each one of these matrices, Y_q, consecutively exploring each candidate number of latent topics k in the interval [k_min, k_max]. For each explored k, we apply a custom clustering, similar to k-means with cosine similarity, constrained in such a way that each cluster contains one column of each solution W_k. In the random ensemble, we have r matrices and, correspondingly, r solutions; this means that each of the k clusters (for the explored k) contains r elements/columns. For each k, we determine the stability of the clusters employing Silhouette statistics [54], which measure how compact the clusters are and how well separated they are from one another. Silhouette values lie in [−1, 1], where −1 means bad clustering and +1 means perfect clustering. SeNMFk determines the number of latent topics as the maximum k with an average Silhouette close to one. To choose among several values of k whose average Silhouettes are all close to one, we employ L-statistics (for more details, see the next subsection, L-statistics, as well as references [42], [43]).
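The stability check above can be sketched compactly: factorize r perturbed copies of Y for a candidate k, pool the r·k topic columns, cluster them into k groups, and score the clustering with silhouettes. This is a simplified illustration assuming scikit-learn; the paper's custom clustering additionally constrains each cluster to hold exactly one column per solution, while plain k-means is used here for brevity.

```python
# Simplified stability/silhouette check for a candidate latent dimension k.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize

def stability_silhouette(Y, k, r=5, eps=0.03, seed=0):
    rng = np.random.default_rng(seed)
    cols = []
    for q in range(r):
        # Perturbed (resampled) copy of Y with uniform multiplicative noise
        Yq = Y * rng.uniform(1 - eps, 1 + eps, size=Y.shape)
        W = NMF(n_components=k, beta_loss="kullback-leibler", solver="mu",
                init="nndsvda", max_iter=300,
                random_state=q).fit_transform(Yq)
        cols.append(normalize(W, axis=0).T)   # unit topic vectors as rows
    pooled = np.vstack(cols)                  # (r*k, F) pooled topic vectors
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=seed).fit_predict(pooled)
    return silhouette_score(pooled, labels, metric="cosine")

Y = np.random.default_rng(1).random((40, 60))
score = stability_silhouette(Y, k=3)
print(score)   # close to 1 indicates stable topics at this k
```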

C. L-STATISTICS
For certain datasets, the plateau of the reconstruction error and Silhouette statistics may not allow us to unambiguously determine the number of latent topics (see, e.g., Figure 5 for the s-ng-2 and s-ng-5 datasets). For example, we can have high values of the average Silhouette and a plateau in the reconstruction error for several different values of k. To make our method more accurate, we introduce here a new statistical measure, called L-statistics (the name is motivated by the name L-curve), that helps identify the number of topics in this case. In L-statistics, we use the distributions of the column reconstruction errors, ε_i = ||X_:,i − X^rec_:,i|| / ||X_:,i||, where X^rec = WH. Thus, each k yields a distribution of column errors, k → (ε_1; ε_2; ...; ε_N), a vector of column errors for a fixed number of topics, as shown in Fig. 3. The L-statistics compares the distributions of column errors corresponding to different k by employing the two-sided Wilcoxon rank-sum test [55]. The Wilcoxon rank-sum test is a non-parametric test that evaluates whether two samples are drawn from the same population: it checks the hypothesis that the data in the first sample and the data in the second sample come from continuous distributions with equal medians, against the alternative that they do not. Hence, the p-value of the two-sided Wilcoxon rank-sum test allows us to determine whether the distributions of the column errors come from the same population (a smaller p-value means stronger evidence in favor of the alternative hypothesis). This helps to recognize the latent dimension: we look for the distribution of column errors after which the next distributions (corresponding to larger k) fail the null hypothesis of the Wilcoxon rank-sum test, and hence the model is fitting the noise (Fig. 3).
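In code, the L-statistics comparison reduces to computing the per-column relative errors and running `scipy.stats.ranksums` on two candidate models. The error samples below are synthetic stand-ins to illustrate the test; in practice they come from the SeNMF factorizations at consecutive k.

```python
# L-statistics building blocks: per-column relative reconstruction errors
# and the two-sided Wilcoxon rank-sum test between two candidate models.
import numpy as np
from scipy.stats import ranksums

def column_errors(X, W, H):
    X_rec = W @ H
    return np.linalg.norm(X - X_rec, axis=0) / np.linalg.norm(X, axis=0)

# Hypothetical per-column error samples for models with k and k+1 topics
rng = np.random.default_rng(0)
errs_k = rng.normal(0.30, 0.02, size=500)
errs_k1 = rng.normal(0.18, 0.02, size=500)    # clearly improved model

stat, p = ranksums(errs_k, errs_k1)
print(p < 0.05)   # small p: the two error distributions differ
```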
The L-statistics, used in conjunction with the requirement that the minimum Silhouette statistic be greater than 0.55, is the criterion we use to determine the latent dimension, k. This threshold is an empirical value which still indicates a stable cluster [1].

D. H-CLUSTERING: ACCURATELY IDENTIFYING CLASSES OF SIMILAR DOCUMENTS
In topic modeling, it is broadly accepted that the number of latent topics is equal to the number of document classes in the text corpus, that is, that there is a one-to-one correspondence between the word and document latent structures. This requirement is most likely inherited from the initial PLSA formulation, which strictly requires the number of word latent classes to equal the number of document latent classes. However, in practice, this could be a very strong assumption and may not be fulfilled exactly, since the BoW and the text corpus are two different types of objects that can have their own latent structures, which (although related) may not be exactly the same. This was previously pointed out when the bi-mixture PLSA was introduced [56].
To account for the possibility that the latent structure of document classes is not the same as the topic latent structure, we additionally cluster the documents based on their latent coordinates, H, in a step we call H-clustering. Specifically, after identifying the latent topics, we cluster the columns of the matrix H with a standard k-means algorithm. With each column of H specifying a document's coordinates in latent space, H-clustering effectively clusters the documents. With the number of document clusters being unknown a priori, we systematically evaluate different document clusterings using the standard Silhouette statistics, and use this to select the most suitable number of document clusters.
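A minimal sketch of H-clustering, assuming scikit-learn: run k-means over the columns of H for a range of candidate cluster counts and keep the one with the best silhouette. The toy H below is constructed (a hypothetical example) with three well-separated document groups in a 4-topic latent space, so the procedure should recover three document classes even though there are four topics.

```python
# H-clustering: k-means over documents' latent coordinates (columns of H),
# with the number of document classes chosen by silhouette statistics.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def h_clustering(H, c_min=2, c_max=8, seed=0):
    docs = H.T                                   # one row per document
    best_c, best_s = None, -np.inf
    for c in range(c_min, c_max + 1):
        labels = KMeans(n_clusters=c, n_init=10,
                        random_state=seed).fit_predict(docs)
        s = silhouette_score(docs, labels)
        if s > best_s:
            best_c, best_s = c, s
    return best_c, best_s

# Toy H: 3 well-separated document groups in a 4-topic latent space
rng = np.random.default_rng(0)
centers = rng.random((3, 4)) * 10
H = np.vstack([centers[i] + 0.1 * rng.standard_normal((40, 4))
               for i in range(3)]).T
best = h_clustering(H)
print(best)   # expected to recover 3 document clusters
```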

V. EVALUATION
We apply our method, SeNMFk, to nine text corpora with known numbers of latent topics and document classes, curated by domain experts. The next subsections include the description of the data, the preprocessing, and the experimental setup. These are followed by the model selection and importance of H-clustering for identifying the distinct latent topics in a given corpus. Finally, we present two measures that we use, the normalized mutual information (NMI) and the coherence of extracted topics, and demonstrate the consistency and quality of the extracted document classes as compared with the ground truth. Each of these evaluations also contains comparisons with other state-of-the-art topic models.

A. DATA
The nine text benchmarks, as shown in Table 1, include: (a) three datasets chosen for topic modeling in [39], namely, bbc, bbc-sport, and guardian-2013, which belong to renowned news agencies, the BBC and the Guardian; (b) four variations of the twenty newsgroups corpus (ng20) [57], namely s-ng-2, s-ng-3, s-ng-4, and s-ng-5; and (c) two variations of the Reuters corpus [57], Reuters-4 and Reuters-5. The documents of the ng20 variations belong to 2, 3, 4, and 5 different classes of the twenty newsgroups. For the Reuters dataset, each variation is selected by taking the documents of the largest categories; for example, Reuters-4 contains all the documents of the Reuters corpus that belong to the top 4 categories. The respective numbers of ''ground truth'' latent topics are given in Table 1.

B. PREPROCESSING
Preprocessing, converting raw text into a multidimensional, sparse vector-space representation, is a significant step in any text mining project. We proceed in several distinct stages. First, we remove punctuation and the common English stop words from the text corpus. From that preprocessed text, we ignore words that occur fewer than 20 times. The resulting term-document matrix is transformed into a standard log-TF-IDF matrix, X, with document-wise L2 normalization. This term frequency-inverse document frequency (TF-IDF) normalization weights the terms in the term-document matrix by the logarithm of the inverse relative frequency of the term in the corpus, i.e., if the i-th term occurs in d_i documents out of d total documents in the corpus, the TF-IDF normalization is the product of the frequency of the term with log(d/d_i). We prepare the word-context co-occurrence matrix using a document-length context window. The co-occurrence matrix is then transformed into the SPPMI matrix, M.
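The final step, turning a word-word co-occurrence matrix C into the SPPMI matrix, can be sketched as follows: SPPMI_ij = max(PMI_ij − log s, 0), where PMI_ij = log(P(i,j) / (P(i)P(j))) and s is the negative-sampling shift (s = 1 gives plain positive PMI). The small C below is made-up illustrative data.

```python
# Minimal sketch of building the SPPMI matrix from co-occurrence counts C.
import numpy as np

def sppmi(C, shift=1.0, eps=1e-12):
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)   # word marginals
    col = C.sum(axis=0, keepdims=True)   # context marginals
    # PMI_ij = log( C_ij * total / (row_i * col_j) ), with eps guarding log(0)
    pmi = np.log((C * total + eps) / (row @ col + eps))
    return np.maximum(pmi - np.log(shift), 0.0)

C = np.array([[4., 2., 0.],
              [2., 6., 1.],
              [0., 1., 3.]])
M = sppmi(C)
print(M)
```

Pairs that never co-occur get PMI of negative infinity (guarded by eps here) and are clipped to zero, which is what keeps the SPPMI matrix non-negative and suitable for NMF.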

C. EXPERIMENTAL SETUP
As described in the SeNMFk section, for each one of the benchmark text corpora, we create a random ensemble of matrices, (Y_1; Y_2; ...; Y_r), based on the concatenation of the corresponding TF-IDF representation of the benchmark, X, and the scaled SPPMI matrix, αM, with random resampling. Each member of this ensemble is sampled from a narrow uniform distribution with an element-wise mean equal to the concatenation of X and αM: each element, y^q_ij, of a member of the ensemble, Y_q, is drawn uniformly from the interval [(1 − ε)y_ij, (1 + ε)y_ij]. The amplitude of the randomization, ε, was varied among 7%, 5%, 3%, 1%, and 0.3%. We also explored different ensemble sizes, from r = 20 up to r = 640, stopping the growth of the ensemble once the stability of the clusters changed by less than 15% from the previous size. In each case, we run SeNMF until convergence, which usually requires ≈ 10,000 iterations. For each SeNMFk run, we used Non-Negative Double Singular Value Decomposition (NNDSVD) [58] to initialize the minimization. The regularization parameter α was chosen to be one.
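The resampling step can be sketched in a few lines of NumPy: multiplying each element by uniform noise in [1 − ε, 1 + ε] keeps the element-wise mean of the ensemble equal to Y. The matrix sizes below are toy values for illustration.

```python
# Custom resampling: an ensemble of uniformly perturbed copies of Y = [X | aM]
# whose element-wise mean equals Y.
import numpy as np

def resample_ensemble(Y, r=20, eps=0.03, seed=0):
    rng = np.random.default_rng(seed)
    return [Y * rng.uniform(1 - eps, 1 + eps, size=Y.shape) for _ in range(r)]

Y = np.random.default_rng(1).random((10, 12))
ensemble = resample_ensemble(Y, r=200, eps=0.03)
dev = np.abs(np.mean(ensemble, axis=0) - Y).max()
print(dev)   # small: the ensemble mean is close to Y
```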

D. MODEL SELECTION: FINDING THE NUMBER OF TOPICS
Our model selection procedure uses two criteria: a high average silhouette value and a low relative reconstruction error. The silhouette and relative error values for selecting the number of latent topics for the bbc and Reuters-4 text corpora are shown in Figure 4, where the determined numbers of latent topics are marked in rectangles. Figure 5 shows the topic determination for the other benchmarks. The numbers of topics with high minimum and mean silhouette values indicate stable, reliable SeNMF solutions. Lower silhouette values indicate that the found clusters are either not well separated or not tight. The relative error, on the other hand, decreases monotonically with the number of topics. Typically, the error decreases significantly from one topic up to the correct number of topics, followed by only marginal improvements beyond it. This is characteristic of extracting legitimate topics that contribute significant representative power up to the correct number of topics, while the further marginal decline beyond the correct number is emblematic of the over-fitting phenomenon, as the model begins to fit noise. The p-values of the L-statistics test for bbc at k = 5 and Reuters-4 at k = 5 are 1.01e-18 and 1.89e-17, respectively. Because these p-values are significantly less than 0.05, we conclude that the distributions of column errors between k − 1 and k have different medians, which supports the identification of the correct number of latent topics.

E. H-CLUSTERING AND THE NUMBER OF DOCUMENT CLUSTERS
The labels, or the ''ground truth'', for the documents in a corpus are typically based on either manual clustering of the documents or automatic clustering given additional metadata. While for many corpora and many topic modeling methods this results in the number of topics matching the ''ground-truth'' number, occasionally (except in the PLSA case) the number of document clusters may not exactly match the number of latent topics [56]. Discrepancies between the number of document clusters and the number of topics can occur for many reasons. One common scenario is that a cluster of documents has significant variance in its term usage, requiring multiple latent topics to adequately span this variance. This results in ancillary latent topics relative to the number of document clusters.
Therefore, in order to make a ''ground-truth'' comparison for our method, we need to determine the number of document clusters from our selected factorization. To accomplish this, we cluster the columns of H, the coefficients of the documents in the latent space, using k-means. The correct number of document clusters is selected through evaluating the silhouette statistics on the resulting clusters (see Section IV-D). To visualize the clusters in H , we employ generalized barycentric coordinates [59] with vertices corresponding to each topic. Figure 6 demonstrates this phenomenon. Here, we show that the extracted five latent topics in bbc correspond exactly to five document clusters. The values of the silhouette statistics for the k-means clustering of H are presented in Fig. 6(a), and the peak at k = 5 is clear. So, the bbc corpus has the same number of latent topics and document clusters. This fact is confirmed by the structure of the centroids of the five clusters as seen in Fig. 6(b), where each centroid corresponds to one of the extracted topics with very high accuracy. This is further exhibited in Fig. 6(c), where we see the documents form tight clusters around each topic.
On the other hand, in bbc-sport (Fig. 6), our method extracted k = 6 topics (see the results in Table 2). However, the clustering of H shows 5 distinct stable clusters (Fig. 6(d)), and we can see that the centroid of the second cluster is predominantly a combination of the second and third topics (Fig. 6(e)). The barycentric plot in Fig. 6(f) exhibits the same property: the class corresponding to the green color is heavily used by both the second and third topics. This shows that the corresponding documents have a wide variance along one dimension, and our H-clustering finds it beneficial to combine two topics, rather than one, to represent this class.
We have observed similar results with the Reuters-4 (Fig. 8) and Reuters-5 (Fig. 9) corpora, where the clustering of the matrix H reveals the correct number of clusters. Table 2 shows the results on the identification of the correct number of latent topics. We compare our results with those of the state-of-the-art methods of Greene et al. [39] and consensus clustering [38]. The existing methods fail to find k in most of the benchmarks except bbc, while NMFk and our SeNMFk method find the correct k in bbc, guardian-2013, and all ng20 corpora. With the additional clustering of the H matrix, we find the ''ground-truth'' number of clusters in Reuters-4, Reuters-5, and bbc-sport, as seen in the column H-clust. The column L-statistics presents the p-values for the L-statistics test between the selected k and the next, k + 1, solution.

G. NMI AND THE QUALITY OF DOCUMENT CLUSTERING
Normalized mutual information (NMI) [60] is one of the widely used measures for evaluating the quality of clusterings in general. Here, we use it to characterize the quality of document clustering. NMI measures the mutual dependence between the predicted and the ''ground-truth'' labels: the closer the predicted labels match the ground-truth labels, the higher the mutual dependence. Table 3 shows the average NMI scores of seven methods for all nine benchmarks. We observe that our methods show better NMI scores on the majority of the benchmarks. Among all the approaches, LDA performs the worst, while [61] shows better NMI for guardian-2013, and Skmeans performs better on bbc-sport but does not predict the number of classes. For the remaining benchmarks, one of our methods, NMFk or SeNMFk with H-clustering, outperforms the rest. As mentioned in earlier sections, SeNMFk solutions are obtained by taking the medians of the corresponding clusters of the ensemble of factor matrices obtained after resampling; hence they are robust and stable. We have to point out that the NMI values for Reuters-4 and Reuters-5 are much smaller than for the other text corpora, which indicates that the document classes (extracted by all the presented methods) are not ideal.
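Computing NMI is a one-liner with scikit-learn; note that it is invariant to label permutations, so a clustering only needs to group the same documents together to score well. The labels below are a made-up toy example.

```python
# Scoring a predicted document clustering against ground-truth labels via NMI.
from sklearn.metrics import normalized_mutual_info_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted = [1, 1, 1, 0, 0, 2, 2, 2, 2]   # labels permuted, one document misplaced
score = normalized_mutual_info_score(truth, predicted)
print(score)   # high but below 1 due to the single misplaced document
```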

H. C_v COHERENCE AND QUALITY OF THE LATENT TOPICS
To characterize the quality of the extracted latent topics, we employ a coherence metric, specifically the C_v coherence of [62]. C_v has been demonstrated to be the coherence measure closest to human estimation of coherence. Moreover, C_v coherence considers the semantic information of word-word pairs, and hence does not depend on the ''ground-truth'' labels. The C_v measure is based on computing a context vector for each word using normalized pointwise mutual information (NPMI). Table 4 shows the computed C_v coherence of LDA, Skmeans, NMFk [45], and SeNMFk. Note that for guardian the coherence is not shown, since the original corpus is unavailable at this time, and we do not report coherence for our H-clustering method, since clustering H does not change the topic vectors. By this metric, our methods show better C_v coherence for the majority of the analyzed text corpora. The SeNMFk-derived topics for the benchmarks are presented in the Appendix.

VI. SOFTWARE
pyDNMFk 1 is a high-performance software library written at LANL by our team for performing non-negative matrix factorization on datasets in a distributed-memory fashion [63], [64]. Since the implementation of SeNMFk is based on the standard NMF decomposition, custom clustering, and L-statistics for estimating k, pyDNMFk is utilized for finding the number of latent topics. It scales from a laptop to clusters with numerous nodes; so far, we have tested the software on 52k+ cores, on dense datasets (>50TB), and on large sparse datasets (>9EB). An illustration of pyDNMFk on 50TB and 65TB dense datasets is shown in Fig. 7, where the framework correctly identifies the number of latent features, 7 and 10 respectively, within 2 hours. pyDNMFk utilizes MPI4PY for message-passing interface communication between the nodes of the cluster when performing the decomposition of large matrices. It supports both distributed SVD and non-negative SVD initialization for NMF, and provides a distributed custom clustering algorithm for automatic estimation of the number of latent features. pyDNMFk currently supports minimization of both the KL-divergence and the Frobenius norm, with optimization based on multiplicative updates (MU), block coordinate descent (BCD), and the hierarchical alternating least squares (HALS) algorithm. pyDNMFk facilitates the transition from a single machine to large-scale clusters, enabling users to start simple and scale up when necessary. The library also supports both distributed CPU and GPU implementations for accelerated performance on large datasets.
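For reference, the multiplicative-update (MU) rule for KL-divergence mentioned above can be written in a few lines of NumPy. This is a single-node sketch of the standard Lee-Seung updates, not pyDNMFk's actual distributed API (the function name `nmf_kl_mu` and its parameters are illustrative):

```python
import numpy as np

def nmf_kl_mu(V, k, n_iter=200, seed=0, eps=1e-9):
    """Single-node sketch of NMF via multiplicative updates minimizing
    the KL divergence D(V || WH).

    V : (m, n) non-negative matrix (e.g., TF-IDF words-by-documents)
    k : number of latent topics
    Returns non-negative factors W (m, k) and H (k, n).
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # W update: W <- W * ((V / WH) H^T) / (1 H^T)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
        # H update: H <- H * (W^T (V / WH)) / (W^T 1)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + eps)
    return W, H
```

In the distributed setting, pyDNMFk carries out the same matrix products across MPI ranks; the update rules themselves are unchanged.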

VII. CONCLUSION
In this paper, we introduce an enhanced NMF algorithm called SeNMFk. SeNMFk is a KL-divergence-based, semantic-assisted NMF that correctly identifies the number of topics in text corpora. This is accomplished by employing a random ensemble based on the initial TF-IDF and SPPMI matrices, as well as custom clustering, to determine the unknown number of latent topics. The number of topics does not always coincide with the number of classes in the input text corpus. To deal with that, we use k-means clustering of the column-vectors of H in combination with Silhouette statistics, and show that the resulting algorithm correctly identifies the number of clusters in benchmark text corpora. We demonstrate the accuracy of our method on nine text corpora, and compare our results against three state-of-the-art algorithms in predicting the number of topics, NMI, and the coherence of the extracted topics.
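The k-means-plus-Silhouette step used to recover the document classes from H can be sketched as follows. This is a minimal single-node illustration assuming scikit-learn is available; the function name `cluster_documents` and the candidate range of k are assumptions, not part of SeNMFk's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_documents(H, k_max=10, seed=0):
    """Cluster the column-vectors of H (topic weights per document) with
    k-means, choosing the number of clusters by the largest mean
    Silhouette score over candidate values of k."""
    X = H.T  # one row per document: its k topic weights
    best_k, best_s, best_labels = None, -1.0, None
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        s = silhouette_score(X, labels)  # mean silhouette over all documents
        if s > best_s:
            best_k, best_s, best_labels = k, s, labels
    return best_k, best_labels
```

Because the Silhouette score rewards tight, well-separated clusters, the selected number of clusters can differ from the number of topics k used in the factorization, which is exactly the distinction drawn in the conclusion above.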

See Tables 5-13.
RAVITEJA VANGARA received the M.S. and Ph.D. degrees from the Department of Chemical and Biological Engineering, The University of New Mexico. He is currently a Postdoctoral Research Associate at the Theoretical Division, Los Alamos National Laboratory. He develops unsupervised machine learning techniques which involve graphical clustering methods, nonnegative matrix and tensor factorization techniques for pattern recognition, and latent feature extraction.
MANISH BHATTARAI received the M.S. and Ph.D. degrees from the Department of Electrical and Computer Engineering, The University of New Mexico. He is currently a Postdoctoral Research Associate with the Theoretical Division, Los Alamos National Laboratory (LANL), Los Alamos, NM, USA. At LANL, he is part of the Tensor Factorizations Group, which specializes in large-scale data factorization and improving the laboratory's high-performance processing and computing abilities. He has extensively worked on developing HPC-empowered ML algorithms for mining big data, such as distributed matrix and tensor factorization. His current research interests include machine learning, computer vision, deep learning, tensor factorizations, and high-performance computing.
ERIK SKAU received the B.Sc. degree in applied mathematics and physics, and the M.Sc. and Ph.D. degrees in applied mathematics from North Carolina State University, Raleigh, NC, USA. He is currently a Scientist with the Information Sciences Group, Los Alamos National Laboratory. His research expertise includes optimization techniques for matrix and tensor decompositions.
GOPINATH CHENNUPATI received the Ph.D. degree from the University of Limerick, Ireland. He is currently a Machine Learning Scientist at Amazon Alexa, Sunnyvale. His work includes building automatic speech recognition systems and federated learning. He works on high-performance computing (HPC), performance modeling, natural language processing (NLP), deep machine learning, and high-performance linear algebra.
HRISTO DJIDJEV received the M.Sc. degree in applied mathematics and the Ph.D. degree in computer science from Sofia University, Bulgaria. He is currently a Computer Scientist with the Information Sciences (CCS-3) Group, Los Alamos National Laboratory (LANL). Before joining LANL as a Scientist, he worked as an Assistant Professor at Rice University and as a Senior Lecturer at The University of Warwick. He is also currently a Research Adjunct Professor at Carleton University, Ottawa, Canada.
TOM TIERNEY received the Ph.D. degree in physics from UC Irvine, in 2002. He is currently a Senior Scientist at Los Alamos National Laboratory (LANL) and leads an applied sciences research team of over 50 technical staff. He has been with LANL, since 1997. He has over 100 publications in inertial confinement fusion, high energy density physics, radiation hydrodynamics and transport, and advanced diagnostics.
JAMES P. SMITH received the Ph.D. degree in physics. For two decades, he held leadership roles at Los Alamos National Laboratory (LANL), including Group Leader, Division Senior Scientist, and Technical Director for the Principal Associate Directorate of Global Security. He has led many large programs, including over 100 technical staff on the DHS National Infrastructure Simulation Analysis Center, which delivered analytic products, and he represented the USA as a Science Delegate to the Quad Working Groups.

BOIAN S. ALEXANDROV received the M.S. degree in theoretical physics, the Ph.D. degree in nuclear engineering, and a second Ph.D. degree in computational biophysics. He is currently a Senior Scientist at the Theoretical Division, Los Alamos National Laboratory. He specializes in big data analytics, non-negative matrix and tensor factorization, unsupervised learning, and latent feature extraction.