Gaussian Mixture Variational Autoencoder for Semi-Supervised Topic Modeling

Topic models are widely explored for summarizing a corpus of documents. Recent advances in the Variational AutoEncoder (VAE) have enabled the development of black-box inference methods for topic modeling that alleviate the drawbacks of classical statistical inference. Most existing VAE-based approaches assume a unimodal Gaussian distribution for the approximate posterior of latent variables, which limits the flexibility in encoding the latent space. In addition, the unsupervised architecture hinders the incorporation of extra label information, which is ubiquitous in many applications. In this paper, we propose a semi-supervised topic model under the VAE framework. We assume that a document is modeled as a mixture of classes, and a class is modeled as a mixture of latent topics. A multimodal Gaussian mixture model is adopted for the latent space. The parameters of the components and the mixing weights are encoded separately. These weights, together with partially labeled data, also contribute to the training of a classifier. The objective is derived under the Gaussian mixture assumption and the semi-supervised VAE framework. Modules of the proposed framework are designed accordingly. Experiments performed on three benchmark datasets demonstrate the effectiveness of our method compared to several competitive baselines.


I. INTRODUCTION
Topic models [1], [2] provide us with methods to discover abstract word and phrase patterns that best summarize and characterize a corpus of documents. Applications of topic models range from organizing daily documents such as emails [3], news [4] and social media posts [5], to understanding professional documents such as scientific papers [6], medical records [7] and technical reports [8]. Latent Dirichlet Allocation (LDA) [9] and its variants [10]-[12] are among the most popular models for extracting topic structure from documents. These probabilistic generative models adopt Markov Chain Monte Carlo (MCMC) and Variational Inference (VI) to approximate the true Bayesian inference, which is often intractable due to the high dimensionality of the latent variables [13]. The standard inference approaches suffer from high computation cost and tedious mathematical derivations, which hinders practitioners from exploring large datasets [14].
(The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Afzal.)
These limitations motivate neural-network-based black-box inference methods [15]. The Variational AutoEncoder (VAE) [16] provides a framework to alleviate the above-mentioned limitations by training an inference network that directly maps the representations of documents to an approximate posterior distribution. Recently, several studies have endeavoured to re-develop topic models under the VAE framework [17]-[20]. In this paper, we study topic modeling under the VAE framework and aim to overcome two drawbacks of most existing studies. First, although VAE is powerful for posterior inference, the posterior distribution is usually chosen to be a known and simple one, which limits the flexibility of the model [21]. For topic modeling, latent variables can be considered higher, more abstract concepts that generate the words of documents. The usually chosen multivariate diagonal Gaussian is a rigid choice for generating documents due to its unimodal structure.
(VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
Second, the framework of VAE is mainly designed for unsupervised learning. The extension to supervised or semi-supervised learning is non-trivial [22]-[24]. Labeling text data is admittedly expensive, but obtaining a small subset of labeled data is much more practicable, and such labeled data have proven valuable for improving the performance of learning algorithms [25]-[27]. Semi-supervised topic modeling has attracted much attention recently. However, adopting the VAE framework for semi-supervised topic learning is still rare [28].
We propose a Semi-supervised Variational AutoEncoder with Gaussian Mixture posteriors (S-VAE-GM) to address the above challenges in topic modeling. We assume that each document is endowed with both class labels and latent topics. Topics can be seen as a relatively high-dimensional vector of latent variables that follow a multivariate Gaussian distribution. Each class label, which can be observed for a subset of the data, corresponds to a Gaussian with its own parameters. This assumption means that topics are weighted differently for different classes. A document corresponds to a discrete distribution over classes, which naturally results in a mixture of Gaussians for a document. The Gaussian mixture model for the latent space is consistent with the assumed class-topic hierarchy in a document, and it alleviates the unimodal limitation. The assumptions are illustrated in Figure 1. Specifically, the representation vectors of a document are input into two networks to encode the parameters of the latent variables: one network for the parameters of each Gaussian component, and the other for the mixture weights of the Gaussians. Latent variables are sampled from this Gaussian mixture density, and they are then used to generate the original document by independently generating the representation of each word in the document. We carefully derive the objective for semi-supervised topic learning. The objective takes into account the KL divergence between two Gaussian mixture densities and a cross-entropy term for training the classifier with labeled data. Experiments performed on three standard datasets demonstrate the effectiveness of our model.

II. RELATED WORK
In this section, we review previous studies on topic models using the VAE framework, and VAE models with multimodal approximate posterior distributions.
Figure 1. The schematic of the proposed model. A document is associated with a discrete distribution over class labels, and each label corresponds to a multivariate Gaussian distribution over latent topics with its own parameters. The class weights serve as the mixture weights for the Gaussian components on the one hand; on the other hand, they are used to train a classifier with labeled data and to predict the most relevant labels for unlabeled data.

A. VAE-BASED TOPIC MODELS
The computation burden of inferring the posteriors in topic models challenges the application and development of them. Several previous studies adopted variational autoencoder, which trains a black-box inference network for approximating posteriors, to avoid variational updates.
Miao et al. [17] proposed the Neural Variational Document Model (NVDM). The model approximates the posterior distribution of latent variables with an inference network, implemented by a deep neural network conditioned on text. Experiments on two standard news corpora show the lowest perplexity compared to several baseline methods. The NVDM is a fully unsupervised document model, and the approximate posterior of its latent variable is assumed to be Gaussian. To tackle the costly computation of the posterior distribution in traditional topic models, Srivastava and Sutton [18] proposed the AVITM model for latent Dirichlet allocation. The Dirichlet prior over the topic proportions of documents hinders the application of the reparameterization trick; this is solved by constructing a Laplace approximation to the Dirichlet prior. The authors also addressed the component collapsing problem for their model. Experiments on two datasets demonstrated the effectiveness and efficiency of their model. AVITM can be seen as a neural extension of LDA, while NVDM does not consider the Dirichlet prior. Yang et al. [19] investigated a new type of decoder, a dilated CNN, in VAE-based text modeling. By controlling the size of the context, the model can outperform LSTM language models. Xiao et al. [20] introduced a topic distribution variable to explicitly involve a Dirichlet latent variable. Based on LSTM architectures, the authors reported gains over several baselines on the tasks of text reconstruction and classification.
The studies mentioned above are mainly designed for unsupervised learning, and the posteriors of the latent variables are unimodal Gaussian or Dirichlet. In this paper, we aim to develop a semi-supervised deep variational model for text data. In addition, to enhance the interpretability and flexibility of the model, we use a multimodal distribution for the posteriors of the latent variables.

B. VAE WITH MULTIMODAL POSTERIORS
In order to increase the flexibility of variational autoencoder, some studies aim to enrich the form of the posterior distributions of latent variables. A natural choice is using multimodal posteriors.
Dilokthanakul et al. [29] proposed a VAE-based unsupervised clustering method that assumes the observed data is generated from a mixture of Gaussians. The generation process is analogous to Gaussian discriminant analysis. Their method selects a cluster label at first, and then the Gaussian corresponding to this label is activated. Experiments on image datasets demonstrate the effectiveness of clustering and image generation. Liu et al. [30] adopted Gaussian mixture distribution to capture the true posterior of latent variables. During implementation, the weights of the components of Gaussian mixture are set to be uniform. The authors reported state-of-the-art experimental results on several image datasets. In order to attenuate the effect of outliers in VAE-based clustering, Zhao et al. [31] proposed a truncated Gaussian-mixture VAE to model major and minor clusters separately. Experiments on a synthetic image dataset and a medical image dataset demonstrated the effectiveness in disentangling minor and major clusters.
All the studies mentioned above adopted some form of mixture model for the posteriors of latent variables. However, the majority of existing studies aim to design unsupervised models for image data. In this paper, by contrast, we focus on semi-supervised learning for text data.

III. MODEL FRAMEWORK
We now describe our novel S-VAE-GM model in which we use a Gaussian mixture model (GMM) for the approximate posterior distribution of latent variables.

A. PROBLEM DESCRIPTION
A corpus of documents can be denoted as X = {x_1, x_2, ..., x_N}, where x_i ∈ R^{|V|} is the representation vector of the i-th document and V denotes the vocabulary. Each x_i is assumed to be i.i.d. sampled from some distribution p(x). We assume that the generation of x is conditioned on a continuous latent variable z ∈ R^K, where K is the dimension of the latent space and indicates the number of latent topics. The dataset can be divided into two parts, a subset with labels X_l and a subset without labels X_u. The set X_l can be further represented as X_l = {(x_i, y_i)}, where y_i ∈ {y_1, y_2, ..., y_M} denotes the label of document x_i.
We aim to learn a) the distribution of labels for each document i, denoted as p(y_j|x_i), j = 1, ..., M; b) the distribution of topics for each document i, denoted as p(z_k|x_i), k = 1, ..., K; and c) the distribution of words over the vocabulary for each topic k, denoted as p(v_t|z_k), t = 1, ..., |V|, where v_t is the t-th word in vocabulary V.

B. GENERATIVE PROCESS
By leveraging the framework of VAE, we propose a probabilistic model to describe the generation of text data under semi-supervised settings. The objective is to model p(x). Latent variables are often incorporated into the model to encode intuitive features associated with observed ones. Similar to the work of Kingma et al. [23], we introduce a latent class variable y for the unlabeled data and a continuous feature variable z for all the data. According to the laws of probability, p(x|y, z) and p(y, z) will be modeled instead in order to involve these latent variables, and we assume that the joint distribution factorizes as p(y)p(z). The generative model is described as p(x, y, z) = p(y)p(z)p_θ(x|y, z). The observed data can be generated from the latent variables y and z according to the following process, which corresponds to the "decoder" in VAE:

π ∼ Dir(α), y ∼ Cat(π), z ∼ N(0, I), x ∼ p_θ(x|y, z) = f(x; y, z, θ).

The variable π follows a symmetric Dirichlet distribution with hyperparameter α. π is an M-dimensional vector, in which each element π_j represents the weight of the corresponding label y_j. The label y is modeled as a categorical distribution based on π due to the Dirichlet-multinomial conjugacy. The latent variable z follows a standard normal distribution, and we will use a GMM to approximate its posterior distribution. Data x can be either a continuous embedding vector or a discrete bag-of-words vector. f is a likelihood function parameterized by θ; the likelihood is usually approximated by a neural network based on the latent variables y and z.
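The generative process above can be sketched in NumPy. This is a minimal illustration, not the trained model: the decoder likelihood f is replaced by a random softmax projection, and all sizes are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, V = 4, 50, 2000       # classes, topics, vocabulary size (illustrative)
alpha = 1.0                 # symmetric Dirichlet hyperparameter

# pi ~ Dir(alpha): per-document class weights
pi = rng.dirichlet(alpha * np.ones(M))

# y ~ Cat(pi): class label (latent for unlabeled documents)
y = rng.choice(M, p=pi)

# z ~ N(0, I_K): latent topic vector drawn from the standard normal prior
z = rng.standard_normal(K)

# x | y, z ~ f_theta: a stand-in softmax likelihood over the vocabulary,
# with a random projection in place of the learned decoder network
W = rng.standard_normal((K, V)) * 0.01
logits = z @ W
p_x = np.exp(logits - logits.max())
p_x /= p_x.sum()
doc_len = 100
x = rng.multinomial(doc_len, p_x)   # bag-of-words draw of a 100-word document
```

A trained model replaces `W` with the decoder's learned topic-word parameters and draws z from the encoder's mixture posterior rather than the prior.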

C. VARIATIONAL OBJECTIVE
Typically, the log likelihood of the data log p(x) would be maximized to obtain an applicable model. However, the integral of the marginal likelihood p(x) = ∫∫ p_θ(x|y, z)p(y, z) dy dz is computationally intractable, since we would have to sample over a massive number of latent variables, which are often high dimensional.
The framework of VAE aims to deal with the above problem [16]. The conditional likelihood p_θ(x|y, z) is modeled as a decoder, and the intractable posterior p(y, z|x) over the latent variables is approximated by a simpler distribution q_φ(y, z|x), following the principles of variational inference. The approximate posterior is evaluated by an encoder of the VAE parameterized by φ. The objective is therefore to minimize the KL divergence (KLD) between the true and approximate posteriors. A few mathematical derivations result in a lower bound of log p(x) for a single data point, namely the Evidence Lower BOund (ELBO). The optimization objective of the VAE is then transformed into the maximization of this lower bound.
Specifically, the forms of the ELBO for labeled and unlabeled data points of our model are identical to Equations (6)-(9) of Kingma's work. The differences lie in the implementation details of the approximate posteriors and the decoder network. Inspired by Kingma [23] and Keng's work [32], we elaborate the details of the objective in the following.
We assume that the prior of the latent variables can be fully factorized as

p(y, z) = p(y)p(z).    (1)

We also make the assumption that the approximate posterior of the latent variables can be factorized as

q_φ(y, z|x) = q_φ1(y|x) q_φ2(z|x, y).    (2)

1) LABELED DATA
For labeled data, y can be observed, so the only latent variable is z, and the ELBO is analogous to that of the vanilla VAE. The ELBO for a data point is written as

log p_θ(x, y) ≥ E_{q_φ2(z|x,y)}[log p_θ(x|y, z)] + log p(y) − D_KL(q_φ2(z|x, y) || p(z)) = −L(x, y).    (3)

The objective that we want to minimize for all (x, y) ∈ X_l is

J_l = E_{(x,y)∼p̃_l}[L(x, y)],    (4)

where p̃_l is the empirical distribution of the labeled data. To enable our model to train a classifier using the partially labeled data, we add a cross-entropy term between the empirical distribution and the approximate posterior distribution of y to our objective:

J_l^α = J_l + α · E_{(x,y)∼p̃_l}[H(p̃_l(y|x), q_φ1(y|x))],    (5)
where H denotes the cross entropy between two distributions. The classifier q_φ1 produces predicted class labels by learning a mapping from x to y. The added term is used to supervise the training of the classifier.

2) UNLABELED DATA
For unlabeled data points, y is treated as a latent variable. The ELBO is

log p_θ(x) ≥ E_{q_φ(y,z|x)}[log p_θ(x|y, z) + log p(y) + log p(z) − log q_φ(y, z|x)]
           = E_{q_φ(y,z|x)}[log p_θ(x|y, z)] − D_KL(q_φ(y, z|x) || p(y)p(z))
           = −U(x).    (6)

The second line holds because of the definition of the KLD. Here in Equation (6) the approximate posterior q_φ has the form q_φ(y, z|x). Considering Equation (2), Equation (6) can be written as

U(x) = Σ_y q_φ1(y|x) L(x, y) − H(q_φ1(y|x)).    (7)

The objective that we aim to minimize for all unlabeled data is

J_u = E_{x∼p̃_u}[Σ_y q_φ1(y|x) L(x, y)] − E_{x∼p̃_u}[H(q_φ1(y|x))],    (8)

where the second term on the right side is actually the conditional entropy of y given the unlabeled data x.

3) THE FINAL OBJECTIVE
The objective function for all data is

J = J_l^α + J_u = Σ_{(x,y)∈X_l} L(x, y) + α Σ_{(x,y)∈X_l} H(p̃_l(y|x), q_φ1(y|x)) + Σ_{x∈X_u} U(x).    (9)

This is the final objective to be minimized with our S-VAE-GM model.

IV. MODEL IMPLEMENTATION
In this section, we describe the details of implementations of the encoder and decoder in our S-VAE-GM model. The overall framework of the S-VAE-GM model is illustrated in Figure 2.

A. BASIC ASSUMPTIONS
Our latent variables aim to capture high-level feature-space information in order to model the generation of text data based on latent topics. To achieve this goal, we make the following assumptions:
1) Each class y_j, j = 1, ..., M, corresponds to a normal distribution N(z | μ_j, diag(σ²_j)) over the latent topics z. The dimension of the random vector z corresponds to the number of topics.
2) For a document, the weight or probability of the appearance of y_j in the document is specified by π_j.
3) A document is generated by a decoder network based on the latent variables y and z, where y is sampled from a multinomial distribution parameterized by π and z is sampled from a mixture of Gaussian distributions.
The first assumption defines a class label as a probability mass function over K discrete topics. It is analogous to LDA's idea that a document can be generated from distributions over topics. Different classes allocate different weights to topics; hence, they can be semantically differentiated. The second assumption is the key to semi-supervised learning as well as to the Gaussian mixture modeling of z. The dimension of π, denoted as M, is set to the same as the number of possible values of y. On the one hand, π can be produced by training another encoder q_φ1, which can be treated as a classifier to achieve transductive semi-supervised learning. On the other hand, learning the probabilities of different classes for a document provides a natural way of learning the weights of the Gaussian mixture components. The third assumption tells us that z, which will be used to generate the document x, is a random variable sampled from a mixture of Gaussians.

B. IMPLEMENTATION DETAILS 1) ENCODER
The approximate posterior distribution q_φ is implemented as an inference model, which is parameterized by a neural-network-based inference network (aka encoder) under the framework of VAE. According to Equation (2), we implement two different inference networks for q_φ1 and q_φ2, respectively.
For the latent variable y, we use a neural network, shown as the network with blue nodes in Figure 2, to produce the parameter π of the multinomial distribution of y:

q_φ1(y|x) = Cat(y | π_φ1(x)),

where π_φ1(x) is an MLP (parameterized by φ_1) that takes the dense embedding vector of document x as input. The network acts as a classifier to produce predicted labels for unlabeled data. In addition, the produced parameter π is also used for modeling z.
For the latent variable z, we use a Gaussian mixture inference network. Specifically, the parameters of each Gaussian component are specified by another neural network, shown as the network with red nodes in Figure 2, and π_φ1(x) acts as the mixture weights. The approximate distribution of z is specified as

q_φ2(z|x, y) = Σ_{j=1}^{M} π_j(x; φ_1) N(z | μ_j(x; φ_2), diag(σ²_j(x; φ_2))),    (10)

where μ_j and σ²_j are produced by an MLP (parameterized by φ_2) that takes the bag-of-words representation of document x as input. The network outputs M pairs of parameters. Equation (10) specifies how the latent variable z is produced by the encoder network; z is sampled from q_φ2 using the reparameterization trick.
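The two encoders and the mixture sampling step can be sketched as follows. This is an illustrative NumPy stand-in, not the authors' TensorFlow implementation: all weights are random, and the dimensions are the ones used in the experimental settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, W1, b1, W2, b2):
    """One-hidden-layer MLP with tanh activation (as in the paper's settings)."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

M, K, E, V, H = 4, 50, 100, 2000, 500   # classes, topics, embed dim, vocab, hidden

# --- encoder phi_1: document embedding -> class weights pi (softmax output) ---
x_emb = rng.standard_normal(E)                          # stand-in GloVe sum
W1, b1 = rng.standard_normal((E, H)) * 0.01, np.zeros(H)
W2, b2 = rng.standard_normal((H, M)) * 0.01, np.zeros(M)
logits = mlp(x_emb, W1, b1, W2, b2)
pi = np.exp(logits - logits.max()); pi /= pi.sum()

# --- encoder phi_2: bag-of-words -> M pairs (mu_j, log sigma^2_j) ---
x_bow = rng.poisson(0.05, V).astype(float)
U1, c1 = rng.standard_normal((V, H)) * 0.01, np.zeros(H)
rho = np.tanh(x_bow @ U1 + c1)                 # token vector rho
L_mu = rng.standard_normal((H, M * K)) * 0.01  # linear head l1
L_lv = rng.standard_normal((H, M * K)) * 0.01  # linear head l2
mu = (rho @ L_mu).reshape(M, K)
log_var = (rho @ L_lv).reshape(M, K)

# --- sample z from the Gaussian mixture via the reparameterization trick ---
j = rng.choice(M, p=pi)                        # pick a component by its weight
eps = rng.standard_normal(K)
z = mu[j] + np.exp(0.5 * log_var[j]) * eps
```

Sampling the component index and then applying the Gaussian reparameterization keeps the per-component noise differentiable with respect to μ_j and σ²_j.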
The difference between our method and Kingma's M2 model [23] is that in our method both networks q_φ1 and q_φ2 are used to produce the parameters of a Gaussian mixture model from which z is sampled, rather than directly using a Gaussian variable and the label y to generate x as in the M2 model. The difference between our method and VAEGH [30] is that the value of π is learnt from the encoder network, rather than set to 1/M. Unlike VAEGH, our model is developed for semi-supervised learning. In addition, our model is particularly designed for learning from text data. The key to this task involves the above-mentioned GMM modeling of the latent space and the implementation of the decoder; the latter is detailed as follows.

2) DECODER
Since the generation of z from the GMM distribution already uses the parameters of p(y), we simplify the decoder network to p_θ(x|z), using only z to generate x.
We adopt the implementation of the decoder of the NVDM model [17] for our decoder network p_θ(x|z). The input data x ∈ R^{|V|} for encoder q_φ2, which is used to produce the continuous latent variable z, is a sparse bag-of-words representation of a document. Hence, x can be decomposed into D word vectors w_d, where D is the number of words in the document and w_d is the one-hot vector of the d-th word. If w_d corresponds to the t-th word of the vocabulary V, only the element w_d(t) equals 1. The decoding process of a document can be seen as independently decoding each word. The decoder can be represented as

p_θ(x|z) = Π_{d=1}^{D} p_θ(w_d|z),    (11)

which is shown as the network with yellow nodes in Figure 2.
We aim to obtain a topic-word matrix R ∈ R^{K×|V|} to represent the semantic elements of topics. NVDM assumes that the matrix R represents linear correlations between topics and words; it is used to evaluate a score and further the conditional probability of w_d given z as

p_θ(w_d|z) = exp(z^T R w_d + b_{w_d}) / Σ_{t=1}^{|V|} exp(z^T R v_t + b_t),    (12)

where b is a bias term. This softmax regression model for words is shared by all documents. After training, the model learns the probability mass function over the vocabulary for each topic.
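The shared softmax decoder above can be sketched in a few lines of NumPy. The matrix `R` and bias `b` are random stand-ins for the learned parameters, and the document is a toy bag-of-words draw.

```python
import numpy as np

rng = np.random.default_rng(2)
K, V = 50, 2000                          # topics, vocabulary size (illustrative)

R = rng.standard_normal((K, V)) * 0.01   # topic-word matrix (learned in practice)
b = np.zeros(V)                          # per-word bias

def word_probs(z):
    """p(w | z): shared softmax over the vocabulary with logits z^T R + b."""
    logits = z @ R + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

def doc_log_likelihood(z, x_bow):
    """Reconstruction term: each word is decoded independently, so the
    document log-likelihood is the count-weighted sum x . log p(w|z)."""
    p = word_probs(z)
    return float(x_bow @ np.log(p + 1e-12))

z = rng.standard_normal(K)
x_bow = rng.multinomial(80, np.ones(V) / V)   # an 80-word toy document
ll = doc_log_likelihood(z, x_bow)
```

After training, sorting row k of `R` in decreasing order gives the top words of topic k, which is how the topic lists in the experiments are produced.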
To optimize our objective, there is another issue that needs to be addressed. Since the prior and the approximate posterior in the vanilla VAE are both Gaussian, the KLD in the objective can be computed in closed form. However, under our GMM assumption, we need to derive a tight bound for the KLD between two Gaussian mixture densities.

C. KLD BETWEEN TWO GMMs
Now we consider the computation of the KLD in Equation (3). Since p(π) is a symmetric Dirichlet distribution and p(y) is a Dirichlet-multinomial distribution, p(y) can be treated as a constant in (3). Given the GMM implementation of q_φ2(z|x, y), we need to deal with the KLD between a Gaussian mixture distribution and a standard Gaussian distribution. A standard Gaussian can be rewritten as a mixture of standard Gaussians with any normalized set of weights, i.e.,

p(z) = N(z | 0, I) = Σ_{j=1}^{M} π_j N(z | 0, I).    (13)
The KLD in the objective can then be evaluated between two Gaussian mixtures. Several studies [30], [33], [34] have demonstrated that the KLD between two mixtures does not have a closed-form solution, but it has an upper bound. Since we aim to maximize the ELBO, which means the KLD between the two mixtures, D_KL(q_φ2(z|x, y) || p(z)), should be minimized, we can transform the minimization of the KLD into the minimization of its upper bound. Specifically, the KLD between two mixture densities p = Σ_i π_i f_i and p̂ = Σ_i π̂_i f̂_i is upper bounded as follows [30]:

D_KL(p || p̂) ≤ Σ_i π_i log(π_i / π̂_i) + Σ_i π_i D_KL(f_i || f̂_i).    (14)

The details of the proof of (14) can be found in [30].
For two Gaussian mixture densities, we only need the analytical form of the KLD between two multivariate Gaussians to obtain the analytical form of the upper bound of the KLD between two Gaussian mixtures. Fortunately, the KLD between two Gaussians has a closed-form solution. Suppose we have two Gaussians N(μ_1, Σ_1) and N(μ_2, Σ_2). According to the definition of the KLD, we have

D_KL(N(μ_1, Σ_1) || N(μ_2, Σ_2)) = (1/2)[log(|Σ_2|/|Σ_1|) − n + tr(Σ_2^{-1} Σ_1) + (μ_2 − μ_1)^T Σ_2^{-1} (μ_2 − μ_1)],    (15)

where n is the dimension of the variable z. Substituting the j-th component of our GMM implementation (10) into (15), we obtain the KLD between a diagonal multivariate Gaussian distribution and a standard Gaussian distribution:

D_KL(N(μ_j, diag(σ²_j)) || N(0, I)) = (1/2) Σ_{k=1}^{K} (σ²_jk + μ²_jk − 1 − log σ²_jk),    (16)

where μ_jk and σ²_jk are the k-th components of μ_j(x; φ_2) and σ²_j(x; φ_2), respectively. Substituting our implementation of the two Gaussian mixtures (10) and (13) into (14), we obtain the upper bound L̂(x, y) of L(x, y):

L̂(x, y) = −E_{q_φ2(z|x,y)}[log p_θ(x|y, z)] − log p(y) + Σ_{j=1}^{M} π_j D_KL(N(μ_j, diag(σ²_j)) || N(0, I)),    (17)

where the weight term of (14) vanishes because both mixtures share the weights π. The final objective that we aim to minimize then becomes the objective for all data with L(x, y) replaced by L̂(x, y):

Ĵ = Σ_{(x,y)∈X_l} L̂(x, y) + α Σ_{(x,y)∈X_l} H(p̃_l(y|x), q_φ1(y|x)) + Σ_{x∈X_u} Û(x),    (18)

where Û(x) is U(x) evaluated with L̂ in place of L.
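The closed-form diagonal-Gaussian KLD and the resulting mixture upper bound are straightforward to implement. A minimal sketch, assuming log-variance parameterization as in the encoder:

```python
import numpy as np

def kl_diag_gauss_to_std(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
       = 0.5 * sum_k( sigma_k^2 + mu_k^2 - 1 - log sigma_k^2 )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def kl_gmm_upper_bound(pi, mu, log_var):
    """Upper bound on the KLD between the encoder's mixture
    sum_j pi_j N(mu_j, diag sigma_j^2) and the prior written as the matched
    mixture sum_j pi_j N(0, I). Because both mixtures share the weights pi,
    the weight term of the bound vanishes, leaving the pi-weighted sum of
    per-component Gaussian KLDs."""
    return float(sum(w * kl_diag_gauss_to_std(m, lv)
                     for w, m, lv in zip(pi, mu, log_var)))
```

As a sanity check, a component equal to the standard normal (μ = 0, log σ² = 0) contributes zero, so the bound is tight when the posterior collapses onto the prior.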

D. OPTIMIZATION
By using the reparameterization trick, the derivatives of the objective w.r.t. the model parameters θ, φ_1 and φ_2 can be estimated. The objective is then optimized by back-propagating the resulting stochastic gradients.

V. EXPERIMENTAL SETUPS
In this section, we describe the details of the datasets used to perform experiments, the metrics for performance evaluation, the baselines for comparison, and the settings of experiments.

A. DATASETS
We use three standard, publicly available datasets to verify the effectiveness of our proposed model.
• 20NewsGroups. 1 This dataset is a collection of nearly 20,000 documents across 20 different newsgroups. The newsgroups cover topics such as computer hardware, science and politics. These 20 newsgroups are treated as 20 different classes in our experiments. The total number of samples used in our experiments is 18,846, which is further divided into 11,314 for training and 7,532 for testing.
• IMDB. 2 This dataset contains 50,000 movie reviews. The reviews are polarized, containing either positive or negative sentiment. The binary sentiment labels are used to classify these reviews. Hence, the number of classes is 2. The dataset is divided equally for training and testing.
• AGNews. 3 The original dataset is a collection of more than one million news articles. We adopt the reconstruction of the dataset for topic learning in [35]. The 4 largest classes are chosen from the original corpus. Each class contains 30,000 documents for training and 1,900 for testing, giving 120,000 training samples and 7,600 testing samples in total.
Table 1 summarizes the statistics of the three datasets. For all three datasets, only 20% of the training samples retain their labels. This establishes the datasets for semi-supervised topic learning.

B. EVALUATION METRICS
We use perplexity and pointwise mutual information to evaluate the quality of the topics extracted by our proposed model.
• Perplexity. Perplexity [36], borrowed from information theory, is a widely used evaluation metric for topic models. It measures how well the trained model predicts held-out test documents; in other words, how well a probability model predicts a sample. It is an exponential transformation of the cross entropy. Here it is defined as

perplexity = exp( −(1/N) Σ_{i=1}^{N} (1/D_i) log p(x_i) ),

where N is the number of documents in the corpus and D_i is the number of words in document i. Since we cannot directly compute log p(x), we use the variational lower bound, as in [17], [37], to report an upper bound on perplexity.
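The perplexity definition above reduces to a one-liner once per-document log-likelihoods (or their variational lower bounds) are available:

```python
import numpy as np

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp( -(1/N) * sum_i log p(x_i) / D_i ).
    If the variational lower bound is passed in place of log p(x_i),
    the returned value is an upper bound on the true perplexity."""
    per_doc = np.asarray(doc_log_likelihoods, dtype=float) / np.asarray(doc_lengths, dtype=float)
    return float(np.exp(-np.mean(per_doc)))
```

For example, a model that assigns every word of a 10-word document a uniform probability over a 100-word vocabulary has log-likelihood 10·log(1/100) and perplexity exactly 100, matching the intuition that perplexity is the effective branching factor per word.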
• Pointwise Mutual Information (PMI). PMI [38] measures how interpretable the extracted topics are by computing the co-occurrence probabilities of pairs of topic words on the test dataset. A pair of topic words (ω_i, ω_j) is scored as

PMI(ω_i, ω_j) = log [ p(ω_i, ω_j) / (p(ω_i) p(ω_j)) ],

where p(ω) is the probability of seeing the word ω in a random document in the test set, and p(ω_i, ω_j) is the probability of seeing both ω_i and ω_j co-occurring in a random document in the test set. The PMI of a topic is evaluated by summing the PMI of the top ordered pairs of topic words, Σ_{i<j} PMI(ω_i, ω_j). The topic words are sorted decreasingly by the weights produced by the proposed topic model. For the whole dataset, PMI is evaluated by averaging the PMI of each topic.
1 http://qwone.com/~jason/20Newsgroups
2 https://ai.stanford.edu/~amaas/data/sentiment/
3 http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
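The topic PMI score can be computed from document frequencies as sketched below. This is an illustrative implementation of the metric as described, with a small epsilon guarding empty co-occurrence counts (an assumption not specified in the text).

```python
import numpy as np
from itertools import combinations

def pmi_of_topic(top_words, doc_sets):
    """Sum of PMI(w_i, w_j) = log[ p(w_i, w_j) / (p(w_i) p(w_j)) ] over the
    pairs i < j of a topic's top words. Probabilities are document
    frequencies estimated on the test set, where each document is a set
    of its distinct words."""
    n = len(doc_sets)
    eps = 1e-12   # guards log(0) when a pair never co-occurs
    def p(*ws):
        return sum(all(w in d for w in ws) for d in doc_sets) / n
    return sum(np.log((p(wi, wj) + eps) / (p(wi) * p(wj) + eps))
               for wi, wj in combinations(top_words, 2))
```

Two words that always appear together in half of the documents score log 2 per pair, while words that never co-occur contribute a large negative term, so higher topic PMI indicates more coherent top words.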

C. BASELINES
We use both non-neural-network-based and neural-network-based baseline methods to compare against our model.
• LDA. Latent Dirichlet Allocation (LDA) [9] is one of the most popular generative Bayesian probabilistic models for text corpora. A document is modeled as a mixture of latent topics. And a topic is modeled as a distribution over vocabulary. LDA utilizes Dirichlet-Multinomial conjugacy and several inference techniques to cluster documents according to topic distributions. It has been proved that LDA performs well on several tasks [39], [40].
• CTM. Correlated Topic Model (CTM) [10] is an extension of LDA. It incorporates correlations into topic models by using the logistic normal distribution to model the topic mixture of a document. Several studies demonstrate its effectiveness [41], [42].
• NVDM. The Neural Variational Document Model (NVDM) [17] is among the first studies to utilize a deep generative neural network for learning from text data. The model shows significant improvement, under the perplexity measure, over traditional LDA.

D. SETTINGS
To encode the weight parameters of the GMM, we use dense embeddings of documents as the inputs of encoder φ_1. The embeddings of the words are obtained from the pre-trained GloVe model, and the embedding of a document is obtained by summing the embeddings of its words. The dimension of the embedding vector is set to 100. Encoder φ_2 takes the bag-of-words representation of documents as its input. For the datasets 20NewsGroup, IMDB and AGNews, the vocabulary size is set to 2,000, 10,000 and 5,000, respectively. These are the most frequent words in the corresponding datasets after deleting stop words using the NLTK toolkit [43]. Words not contained in the vocabulary are omitted. The encoder φ_1 is designated as a classifier of documents. We performed experiments with both the TextCNN [44] model and an MLP. The TextCNN model tends to overfit the data before the training of φ_2 finishes; hence, we select MLPs for both φ_1 and φ_2. For φ_1, the size of the input layer is the dimension of the embedding vector, which is 100. The size of the hidden layer is set to 500, and the size of the output layer is constrained to be the same as the number of labels of a dataset, since we aim to obtain the weights for each class. The number of topics varies in {25, 50, 100, 150, 200} in our experiments. For encoder φ_2, the size of the input layer is the same as the size of the vocabulary of the dataset. During implementation, the inputs of φ_2 are encoded through the hidden layer into a token vector ρ = g(f_φ2^MLP(x)), where g is the nonlinear activation function and f_φ2^MLP denotes the MLP with parameters φ_2. The size of the hidden layer, which is the dimension of the token vector, is set to 500. The token vector is then used to produce the parameters of the GMM through two different linear networks, i.e., μ = l_1(ρ) and log σ² = l_2(ρ). The sizes of l_1 and l_2 both equal the number of classes, which is the number of components of the GMM. The activation function used in our experiments is tanh(·).
The decoder takes the hidden variable, which is sampled from the GMM and with the dimension of the number of topics, to be its input. It uses a softmax network to transform the hidden variable into the one-hot representation of words. The weights of the decoder network correspond to the K ×|V| matrix R. z T R ∈ R |V| produces the logit scores for each word, and after the softmax layer, the probabilities of each word are obtained. The probabilities, together with the bag-of-words vector x, are used to compute the reconstruction error.
We use the Xavier uniform initializer in TensorFlow [45] to initialize the weights of our networks, and the Adam optimizer with a learning rate of 10^{-4}. During training, the batch size is set to 64 and the maximum number of epochs is 1,000. After each epoch, we compute the metrics described in Section V-B and save the model if the newly recorded metrics are better than the best recorded ones. The training process terminates when the metrics do not improve for 10 consecutive epochs.
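The save-and-early-stop schedule above can be sketched as a training loop. The `model.fit_epoch`, `model.evaluate` and `model.save` methods are hypothetical stand-ins for the actual TensorFlow training code, and the loop tracks a single metric (perplexity, lower is better) for simplicity where the paper tracks several.

```python
def train(model, data, max_epochs=1000, patience=10):
    """Train with the schedule from the settings: batch size 64, Adam with
    lr 1e-4 (assumed to live inside fit_epoch), save on improvement, and
    stop after `patience` consecutive epochs without improvement."""
    best = float("inf")   # best perplexity seen so far; lower is better
    stale = 0             # epochs since the last improvement
    for epoch in range(max_epochs):
        model.fit_epoch(data, batch_size=64, lr=1e-4)
        ppl = model.evaluate(data)
        if ppl < best:
            best, stale = ppl, 0
            model.save()          # keep only the best checkpoint
        else:
            stale += 1
            if stale >= patience:
                break             # no improvement for `patience` epochs
    return best
```

This keeps the reported numbers tied to the best checkpoint rather than the final epoch, which matters when the classifier and the reconstruction term converge at different rates.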

VI. EXPERIMENTAL RESULTS
In this section, we perform experiments with the proposed model on the above-mentioned three datasets and report the results in comparison with the baselines.

A. COMPARATIVE RESULTS
The perplexities produced by our S-VAE-GM and the other methods are shown in Figure 3. From this figure, we make the following observations: (1) Our S-VAE-GM model achieves the lowest perplexity under almost all conditions. On the 20NewsGroup dataset, S-VAE-GM outperforms the other baselines for every number of topics. On IMDB and AGNews, our method and NVDM achieve similar best performance. Non-NN methods have much higher perplexity values, especially as the number of topics increases. (2) The S-VAE-GM model is more robust, under the perplexity measure, to changes in the dimension of the latent space. From these observations, we find that S-VAE-GM improves the performance of topic modeling, especially compared to non-NN methods.
The PMI values produced by S-VAE-GM and the other methods are shown in Figure 4. On the 20NewsGroup dataset, both non-NN methods outperform the NN-based methods; however, the gap between the two families shrinks as the number of latent topics increases. On the other two datasets, S-VAE-GM and NVDM achieve similar best performance. We attribute the mixed PMI results to the following reasons: (1) The true latent-space structures of the three datasets differ considerably, so the interpretability of the topics produced by different methods varies widely. (2) Topics may correlate with one another, so the assumption of independent components when modeling the latent space limits the performance of all methods. We will explore these issues in future work.
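The PMI coherence score referred to here is typically the average pointwise mutual information, log(p(wi, wj) / (p(wi)p(wj))), over pairs of a topic's top words, with probabilities estimated from document co-occurrence in a reference corpus. A small sketch under that standard definition (the reference corpus, smoothing constant, and toy data are our own assumptions):

```python
import math
from itertools import combinations

def topic_pmi(top_words, doc_sets, eps=1e-12):
    """Average PMI over pairs of a topic's top words, with probabilities
    estimated as document-frequency ratios over `doc_sets` (one set of
    distinct words per reference document)."""
    D = len(doc_sets)
    def p(*ws):
        return sum(all(w in d for w in ws) for d in doc_sets) / D
    pairs = list(combinations(top_words, 2))
    return sum(math.log((p(a, b) + eps) / (p(a) * p(b) + eps))
               for a, b in pairs) / len(pairs)

# Toy corpus: "ball" and "game" always co-occur; "tax" appears alone.
docs = [{"ball", "game"}, {"ball", "game"}, {"tax"}, {"tax"}]
coherent = topic_pmi(["ball", "game"], docs)     # positive: strong co-occurrence
incoherent = topic_pmi(["ball", "tax"], docs)    # negative: never co-occur
```

Words that co-occur more often than chance score positive PMI, which is why higher average PMI is read as a more semantically coherent topic.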
The S-VAE-GM and NVDM methods show similar performance except on the 20NewsGroup dataset, which we attribute to their similar decoder implementations. However, the semi-supervised learning framework additionally gives the S-VAE-GM model the ability to predict document labels.

B. ABLATION EXPERIMENTAL RESULTS
In order to verify the effectiveness of the Gaussian mixture model for the latent space, we replace the GMM over z with a single multivariate Gaussian and repeat the experiments with the same settings as S-VAE-GM. The perplexity and PMI values obtained with a single Gaussian for each number of topics are shown in Tables 2 and 3, alongside those of the multimodal Gaussian mixture model.
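The difference between the ablated and full samplers can be illustrated with the reparameterization trick; note this is only a sketch of the sampling side of the ablation, and the hard component draw below stands in for however the model actually handles the discrete choice (e.g. marginalization) during training:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_single_gaussian(mu, log_var):
    """Ablation: z from one multivariate (diagonal) Gaussian via the
    reparameterization trick, z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def sample_gmm(pis, mus, log_vars):
    """Full model: pick a component from the mixing weights pi, then
    reparameterize within that component."""
    c = rng.choice(len(pis), p=pis)
    return sample_single_gaussian(mus[c], log_vars[c]), c

K = 5                                  # latent dimension (number of topics)
C = 3                                  # number of mixture components
pis = np.array([0.2, 0.5, 0.3])        # mixing weights (sum to 1)
mus = rng.normal(size=(C, K))          # per-component means
log_vars = np.zeros((C, K))            # per-component log-variances

z_single = sample_single_gaussian(mus[0], log_vars[0])
z_mix, comp = sample_gmm(pis, mus, log_vars)
```

The ablation collapses the C component means into one, which removes the multimodality that lets different classes occupy different regions of the latent space.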
From the tables, we observe that on the 20NewsGroup dataset, using the GMM significantly improves performance under both the perplexity and PMI criteria: perplexity and PMI improve by an average of 21.30% and 6.49%, respectively. Across all three datasets, perplexity improves by an average of 7.61% and PMI by an average of 2.01%. Introducing the GMM into the latent space improves performance in a majority of conditions (23 out of 30).

C. LATENT TOPICS
Each dimension of the latent space is assumed to be associated with a latent topic. In Table 4, we list some randomly selected topics with their top 10 most relevant words. As the table shows, the corresponding topics can be deduced from the words. We observe that as the dimension of the latent space increases, more precise topics are likely to be discovered. For example, the third column of words under the AGNews dataset can be associated with the topic Sports, but on closer examination these words are all relevant to Baseball, a branch of Sports. We also observe a few less relevant words: astronauts and NASA in the Political topic of IMDB, and chips in the Economy topic of AGNews. However, given that the topic Space is sometimes correlated with the topic Political in some movies, and that chips may represent the topic Industrial production, we believe these words are not entirely unrelated to their corresponding topics. All other words are closely related to their corresponding topics. Hence, the produced topics are locally interpretable and semantically coherent.

D. CLASSIFIER EVALUATION
In order to evaluate the performance of the classifier φ1 in S-VAE-GM, we compute the F1-score of its predictions. Since only 20% of the data are used for training, the remaining 80% are used to test the predictions produced by φ1. During the experiments, we observe that the number of latent topics, which varies in {25, 50, 100, 150, 200}, does not affect the performance of the classifier; hence, we do not distinguish the results by the number of latent topics. For comparison, we train an MLP specifically for classifying the text data, using {20%, 40%, 60%, 80%} of the data as training samples, respectively.
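Since the box-plots aggregate per-label F1-scores, the per-class computation is the relevant one; a minimal sketch from the standard definition (precision/recall per class, harmonic mean, then an equal-weight average across classes; the toy labels are our own example, and whether the paper averages or plots the per-class scores directly does not change the per-class step):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 (harmonic mean of precision and
    recall) averaged with equal weight per class."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy example: one of four predictions is wrong.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
score = macro_f1(y_true, y_pred)   # class 0: F1 = 2/3; class 1: F1 = 4/5
```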
In Figure 5, each box-plot indicates the distribution of F1-scores computed for specific labels. The results of S-VAE-GM, compared to the results of the baseline as the percentage of training samples increases, demonstrate that the performance of classifier φ1 is not markedly reduced, owing to the combination of φ1 and φ2 and the incorporation of the cross-entropy term in the combined objective.

TABLE 4. The latent topics learned by S-VAE-GM for three datasets. The numbers in parentheses indicate the dimension of latent space that produces the corresponding topic words. Each column is an ordered list of the top 10 words with the highest weights in that dimension.

FIGURE 5. Classifier evaluation with F1 score. S-VAE-GM indicates the classifier in our method, which masks the labels of 80% of the data and keeps the remaining 20% labeled. We train another neural network as the baseline for comparison. The number marked on the x-axis indicates the percentage of labeled data used for training the baseline.

VII. CONCLUSION
In this paper, we proposed a novel VAE-based semi-supervised topic model with a Gaussian mixture model for the latent space. We used two separate encoders for the parameters of the Gaussian mixture components and the mixing weights. Latent variables were sampled from the Gaussian mixture to decode the words in documents. We also used the mixing weights, together with partially labeled data, to train the classifier. The objectives for both labeled and unlabeled data were carefully derived, and the modules of the framework were appropriately designed. Experiments were performed on three publicly available benchmark datasets: 20NewsGroup, IMDB and AGNews. The results under the perplexity and PMI criteria demonstrated the effectiveness of the proposed topic model compared to both non-neural-network-based and neural-network-based baseline methods.
In the future, we intend to explore topic correlations using both GMM and normalizing flows [46], [47] under the VAE framework.