Short Text Embedding Autoencoders With Attention-Based Neighborhood Preservation

Shortness and sparsity often plague short text representation for clustering and classification. A popular solution is to extract meaningful low-dimensional embeddings as short text representations via various dimensionality reduction techniques. However, existing methods, such as topic models and neural networks, discover low-dimensional embeddings from the whole training set without considering the geometrical information of the short text manifold, and thus fail to provide discriminative embeddings of short texts. In this paper, we propose a manifold-regularized method, namely Short Text Embedding AutoEncoders (STE-AEs), which incorporates the semantics of the neighborhood into the regularized training of AutoEncoders (AEs) to extract discriminative low-dimensional short text embeddings. STE-AEs first determines the semantic neighborhood via an attention-based weighted matching distance and then preserves the local geometrical structure by incorporating a minimization of the weighted cross-entropy of nearby texts' embeddings into the regularized training of AEs. Finally, the encoder acts as a parametrized mapping function between observations and embeddings. Furthermore, based on the activation values of the encoder over the training set, STE-AEs employs a Random Forest (RF) regression model to determine feature importance and thereby find informative, readable words for embedding interpretation. Extensive experiments on three real-world short text corpora demonstrate that STE-AEs can capture the intrinsic discriminative explanatory factors, improving the performance of short text clustering and classification. Moreover, some understandable words can be efficiently discovered to promote the interpretability of the low-dimensional embeddings.


I. INTRODUCTION
Short texts, including Tweets, search snippets, FAQs, product comments, and scientific abstracts, are now widespread on the internet. Automatically analyzing the semantics of short texts is therefore fundamental for a wide range of downstream NLP tasks, such as readers' emotion classification [1], entity disambiguation [2], topic evolution mining [3], and short text sentiment analysis [4]. However, due to the scarcity of word co-occurrence patterns within a single short text and the wide vocabulary spanned over a corpus, traditional text representation models, such as tf-idf, often suffer from a serious data-sparsity issue, resulting in performance degradation in clustering and classification tasks.
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
Existing short text representation works tend to address the data-sparsity issue from two aspects. A simple solution is to expand a short text into a lengthy pseudo-text based on an external knowledge base (or knowledge graph), such as WordNet, MeSH, Wikipedia, or the Open Directory Project, and handle the extended text via traditional text representation methods. For example, Hu et al. utilize Wikipedia concept and category information to enrich the semantic information of short text representations [5]. Recently, Song et al. developed a Bayesian inference mechanism to incorporate the semantics behind knowledge bases (e.g., WordNet, Freebase, and Wikipedia) for short text conceptualization [6]. Huang et al. introduce a co-ranking framework to extract contextual keywords and combine the keywords with an attention-based strategy for short text embedding [7]. Knowledge-guided Non-negative Matrix Factorization (KGNMF) has been proposed for better short text classification by leveraging external knowledge as a semantic regulator with low-rank formalizations [8]. Li et al. propose a combined method based on knowledge-based conceptualization and a transformer encoder for short text understanding [9]. Although these methods relieve the sparsity of short text representation better than conventional models, the expansion-based approach is not ideal because it depends heavily on the accuracy and completeness of the constructed knowledge base, whose construction is itself a challenging task [10].
Alternatively, some works aim to reduce the variable set and assemble the embedded variables into low-dimensional representations (or embeddings), which benefits from the fact that such variables may reveal the different explanatory factors of variation embedded in the texts [11]. In the past decades, some famous topic-based solutions have been proposed to extract low-dimensional meaningful embeddings, such as probabilistic Latent Semantic Analysis (pLSA) [12] and Latent Dirichlet Allocation (LDA) [13]. These models conceptualize each text as a list of mixing proportions of latent topics and interpret each topic as a distribution over the vocabulary [14]. Following the idea of topic models, the biterm topic model (BTM) considers the dependency between latent topics (embedded variables) and unordered word-pair occurrence patterns in a local context so as to discover informative embeddings of short texts [15]. Besides, Zuo et al. propose a novel probabilistic model called the Pseudo-document-based Topic Model (PTM) for short text topic modeling, which introduces the concept of a pseudo-document to implicitly aggregate short texts against data sparsity [16]. Moreover, various neural-network-based approaches adopt a count vector of characters, words, sentences, or paragraphs as input and compound it into an intermediate low-dimensional embedding, providing a promising ability to uncover the semantics [17], [18]. Since then, Cedric et al. have provided a learning procedure based on a novel median-based loss function to weight and aggregate word embeddings for short text embedding [19]. Xu et al. incorporate context-relevant concepts into a convolutional neural network (CNN), called DE-CNN, for short text classification [20]. Different from topic-based models, these intermediate low-dimensional embeddings are called distributed representations, because the feeding signals are distributed over independently activated neurons, which can provide embedded variables at different levels for text representation.
However, the distributed representation makes it challenging to understand the specific meaning of each dimension.
In general, both topic models and neural networks are embedded with latent factors [21] that preserve the salient statistical structure of intra-text relations. Despite improving short text representation, such methods take a global perspective that treats the observation space as Euclidean when discovering the embedded explanatory factors, making the embeddings strongly dependent on all texts and non-discriminative. This undoubtedly hampers the effective capture of intrinsic discriminative explanatory factors. In fact, numerous studies demonstrate that natural observations, such as texts and images, concentrate in the vicinity of a smooth lower-dimensional manifold 1 (the manifold hypothesis), indicating that the local geometrical structure of the text manifold may be worthy of attention for text embedding [22]-[24]. In other words, nearby texts (or a neighborhood) have a strong semantic dependency. For example, nearby texts tend to have more co-occurrence patterns of related words, which implies that they may express more similar semantics and that their embeddings should be closer. Therefore, preserving the neighborhood semantic dependency may have a positive effect on capturing the intrinsic discriminative explanatory factors. From this perspective, we propose manifold-regularized Short Text Embedding AutoEncoders (STE-AEs), aiming to regularize the training of AutoEncoders (AEs) by imposing an additional minimization of the cross-entropy of nearby texts' embeddings to preserve the local geometrical pattern of the neighborhood. Our main contributions are summarized as follows: • Under the regularized training framework, STE-AEs provides an explicit parametrized mapping function between observations and embeddings, ensuring the embeddings are locally invariant around the neighborhoods and improving the discriminative effect.
Meanwhile, taking the local perspective allows STE-AEs to incorporate the semantics of neighborhoods when embedding a given input text, which can enrich the semantics of a single short text against the data-sparsity issue.
• Based on the manifold learning framework, the k-nearest-neighborhood (KNN) graph is used to depict the manifold structure. To avoid the impact of sparsity on the similarity measurement of nearby short texts, STE-AEs develops an attention-based weighted matching distance (AWMD), providing a robust similarity measurement even for two texts with similar semantics but little co-occurring vocabulary.
• Finally, to better understand the meaning of the low-dimensional embeddings, STE-AEs develops a post-processing solution that adopts an RF regression model to determine informative and understandable words using the activation values of the encoder over the training set, which provides a useful attempt at understanding neural-network-based low-dimensional embeddings.

II. RELATED WORK
In this section, we survey the two lines of literature most relevant to our work, short text embedding and manifold learning, for a better understanding of existing methods and challenges.

A. SHORT TEXT EMBEDDING
Embedding is a general term for a class of technologies that transform objects in a high-dimensional space into a relatively low-dimensional space for representation. Different from the traditional one-hot representation, it provides continuous-valued vectors and, ideally, preserves the semantic structure of the object. Due to the shortness and sparsity of short texts, two modifications are needed to apply embedding technology to short text representation. One is to introduce external knowledge with rich semantics, and the other is to improve the ability of the embedding technology to capture the inner semantics of short texts [25]. A common solution is to introduce external knowledge (e.g., WordNet, Wikipedia, the Open Directory Project) to expand a short text into a lengthy pseudo-text and then adopt traditional text embedding methods to represent the pseudo-texts [5], [26]. Subsequent works mostly adopt semantically rich knowledge to formulate short texts as a series of conceptual word sets [6], [7], [27]. For example, Song et al. develop a Bayesian inference mechanism to conceptualize short texts with a knowledge base (e.g., WordNet, Freebase, and Wikipedia) for clustering [6]. Recently, [7] and [27] first model the short text as a set of relevant concepts using a large taxonomy knowledge base, and then adopt different neural network frameworks for short text embedding. KGNMF has been proposed for better short text classification by leveraging external knowledge as a semantic regulator with low-rank formalizations [8]. Li et al. enrich the short text information from a knowledge base based on co-occurring terms and concepts and embed these concepts into a low-dimensional vector space via a convolutional neural network (CNN) and a subnetwork based on a transformer embedding encoder [9].
Additionally, over the last decades, various embedding technologies have aimed to improve the deconstruction of the semantic structure in text, including topic-based and neural-network-based methods [12], [13], [28], [29]. The main difference between them is that topic-based methods are generative models, while neural-network-based methods are discriminative models. For this reason, topic-based methods mainly model the joint distribution between texts, topics, and words, while neural-network-based methods mainly model the conditional distribution between texts, topics (latent factors), and words. For example, BTM learns topics over short texts by directly modeling the generation of unordered word-pair occurrence patterns from topics [15]. Zuo et al. propose a novel probabilistic generative model, the Pseudo-document-based Topic Model (PTM), which implicitly aggregates short texts into pseudo-texts [16]. As a result, by modeling the topic distributions of latent pseudo-texts rather than short texts, PTM is more robust against data sparsity. Different from topic-based models, neural-network-based approaches adopt the low-dimensional hidden layer outputs for short text embedding by training a parametrized neural network to fit the conditional distribution [18], [30], [31]. For short text embedding, Cedric et al. provide a learning procedure based on a novel median-based loss function to weight and aggregate word embeddings [19]. Xu et al. propose a neural network called DE-CNN for short text classification, which incorporates context-relevant concepts into a CNN [20].

B. MANIFOLD LEARNING
Numerous successful manifold learning methods that preserve high-dimensional observations in low-dimensional embeddings show that the manifold can be discretely approximated by the nearest neighbor graph of scattered observation points, such as Locally Linear Embedding (LLE) [22], Laplacian Eigenmaps (LEs) [23], and Isometric Feature Mapping (Isomap) [24]. Some probabilistic versions motivated by the idea of LEs were subsequently proposed to extract text embeddings, such as LapPLSI [32], LTM [33], and DTM [34]. Specifically, LapPLSI, LTM, and DTM develop different manifold graph regularization terms to guide the model fitting of pLSA. The manifold graph regularization can generally be summarized as

R = Σ_{i,j} W_ij · Dist(y_i, y_j),

where y_i is the low-dimensional embedding of text i, W_ij is the edge weight between texts i and j in the neighborhood graph, and Dist(·) indicates a distance measurement between embeddings [34]. LTM employs the Kullback-Leibler (KL) divergence as Dist(·) to measure the distance between embeddings, whereas DTM and LapPLSI define Dist(·) as the Euclidean distance. Moreover, DTM goes further to consider negative relationships between texts to improve the full discriminating power [34].
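As a minimal numerical sketch (ours, not from the cited papers), the generic regularizer above can be computed from an embedding matrix and a neighborhood weight matrix; here Dist() is instantiated as the squared Euclidean distance, as in LapPLSI and DTM:

```python
import numpy as np

def graph_regularizer(Y, W):
    """Sum of W_ij * ||y_i - y_j||^2 over all text pairs.

    Y : (m, k) matrix of low-dimensional embeddings.
    W : (m, m) symmetric edge-weight matrix of the neighborhood graph.
    """
    m = Y.shape[0]
    total = 0.0
    for i in range(m):
        for j in range(m):
            if W[i, j] != 0.0:
                diff = Y[i] - Y[j]
                total += W[i, j] * float(diff @ diff)
    return total

# Tiny example: two identical connected embeddings contribute zero penalty.
Y = np.array([[0.2, 0.8], [0.2, 0.8], [0.9, 0.1]])
W = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
print(graph_regularizer(Y, W))  # prints 0.0
```

Minimizing this term pulls the embeddings of strongly connected (highly similar) texts toward each other, which is exactly the neighborhood-preservation effect the regularized topic models rely on.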
Other promising manifold-inspired approaches exploring text embedding are AutoEncoder-based variants. The AutoEncoder (AE) is a one-hidden-layer multi-layer perceptron (MLP), also called an AutoAssociator, aiming to reconstruct the original input as accurately as possible [35]. In general, it consists of an encoder f_θ, encoding an input vector x ∈ R^d into low-dimensional embeddings y = f_θ(x) ∈ R^k (k < d), and a decoder g_{θ'}, decoding y back into the input space as the reconstruction x̂ = g_{θ'}(y) ∈ R^d of x, where the mutually transposed parameters θ, θ' are learned by stochastic gradient descent (SGD) to minimize the self-reconstruction error (SRE). Subsequently, CAE [36], GAE [37], and LEAE [38] have been proposed by incorporating various graph regularization terms, yielding better low-dimensional embeddings of observations concentrated in the vicinity of a smooth manifold. CAE employs the Frobenius norm of the encoder's Jacobian as the regularization term of the AE, making the encoder less sensitive to the input but sensitive to variations along the high-density manifold [36]. GAE explored the possibility of defining existing manifold learning operators (Isomap, LLE, and LE) as regularization terms of AEs and proposed Deep-GAE to handle highly complex datasets, such as images [37]. For text representation, LEAE regularizes the training of AEs to reconstruct the k nearest nearby texts instead of the input text, yielding an improvement in clustering and classification [38]. However, the aforementioned works neglect the damage that the data sparsity of short texts does to manifold learning; especially when building the neighborhood graph, data sparsity seriously affects the similarity measurement between short texts.

III. METHODOLOGY
The block diagram of our approach is shown in Figure 1, and the main idea is as follows: motivated by the manifold hypothesis [39], we assume that each short text is embedded in a low-dimensional manifold and that nearby texts (or a neighborhood) have a strong semantic dependency. Following this hypothesis, we propose a manifold-regularized short text embedding approach, namely STE-AEs, aiming to regularize the training of AEs to extract intrinsic discriminative embeddings by preserving the semantic dependency of the neighborhood. Specifically, we first construct the k-nearest-neighborhood (KNN) graph based on AWMD. Then, we define the manifold graph regularization term as the weighted cross-entropy of nearby texts' embeddings using the edge weights of the KNN graph and regularize the training of AEs with a joint minimization of the self-reconstruction error and the manifold graph regularization. Finally, the encoder y = f_θ(x) ∈ R^k can be used as an explicit parametrized embedding mapping function to extract short text embeddings. Additionally, we employ Random Forest (RF) to perform regression analysis on the training set and its activation values under the encoder. As a result, the RF model can provide feature importance to select understandable words for embedding interpretation.

A. NEIGHBORHOOD GRAPH CONSTRUCTION BASED ON AWMD
1) KNN GRAPH CONSTRUCTION
In manifold learning, the neighborhood graph can be treated as a discrete approximation of a smooth manifold [23], and thus constructing the neighborhood graph is usually the basic step of the manifold learning framework. In the literature, the common constructions mainly consist of either connecting data points within a radius ε, called the ε-neighborhood graph, or connecting the k nearest data points, the KNN graph. In practice, the KNN graph is more popular, since the ε-neighborhood graph provides weaker performance [40]. Therefore, in this paper, we employ the KNN graph to depict the manifold structure of the entire corpus. Given a training set X = {x_1, . . . , x_m}, where x_i ∈ R^{n×1} is the n-dimensional text vector of text i, n is the vocabulary size, and the j-th component x_j is the normalized count value of the j-th word in the vocabulary, let G = (X, A) denote a KNN graph, where A = [a_ij] ∈ R^{m×m} is an adjacency matrix composed of similarities between short text pairs (x_i, x_j). Specifically, given a short text x_i, if short text x_j is one of its k nearest neighbors, then an edge a_ij connects x_i and x_j, weighted with their pairwise similarity; otherwise, a_ij = 0. However, this definition leads to a directed graph, since x_i may not be among the k nearest neighbors of x_j. In STE-AEs, we take an undirected definition of KNN: if x_i is among the k nearest neighbors of x_j or x_j is among the k nearest neighbors of x_i, there is a weighted edge connecting x_i and x_j. Assuming there are sufficient short texts to ensure that the short text manifold is well sampled, this definition endows some vertices in the KNN graph with more nearby texts, facilitating efficient propagation of the intrinsic discriminative semantics along the KNN graph [41]. Meanwhile, more nearby texts help enrich the semantics of a single short text against the data-sparsity issue.
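The undirected KNN construction described above can be sketched as follows (an illustrative implementation of ours, assuming the pairwise AWMD similarities have already been collected into a matrix S):

```python
import numpy as np

def build_knn_graph(S, k):
    """Undirected KNN adjacency from a pairwise similarity matrix.

    S : (m, m) symmetric similarity matrix (larger = more similar),
        with the diagonal ignored.
    k : number of nearest neighbors per text.
    An edge (i, j) exists if j is among the k most similar texts to i
    OR i is among the k most similar texts to j; its weight is S[i, j].
    """
    m = S.shape[0]
    A = np.zeros((m, m))
    for i in range(m):
        sims = S[i].copy()
        sims[i] = -np.inf                      # never pick self
        nbrs = np.argsort(sims)[::-1][:k]      # k most similar texts
        for j in nbrs:
            A[i, j] = A[j, i] = S[i, j]        # symmetrize (undirected)
    return A
```

Because an edge is kept whenever either endpoint selects the other, some vertices end up with more than k neighbors, matching the paper's observation that well-connected vertices speed up semantic propagation.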
Let N_{i,t} = {x_i^1, . . . , x_i^k, x_i^{k+1}, . . . , x_i^t} denote the nearby set of x_i, where x_i^k is the k-th nearest text and the subset {x_i^{k+1}, . . . , x_i^t} consists of short texts that take x_i as one of their k nearest neighbors. The construction procedure of the KNN graph is summarized as follows.

2) ATTENTION-BASED WEIGHTED MATCHING DISTANCE
The basic idea of STE-AEs is to preserve the neighborhood semantic dependency for short text embedding.

Algorithm 1 The Construction Procedure of KNN Graph
Input: k, the number of nearest neighbors, and the training set X = {x_1, . . . , x_m}.
Output: the KNN graph G = (X, A).
1. Compute the pairwise similarity AWMD(x_i, x_j) for each short text pair.
2. For each x_i, sort the other texts in descending order according to the value of AWMD and connect x_i to its k most similar texts.
3. Symmetrize the edges: set a_ij = a_ji = AWMD(x_i, x_j) for each connected pair; otherwise a_ij = 0.

In this paper, the neighborhood semantic dependency is depicted as the local geometrical pattern of the k nearest neighbors weighted with a pairwise similarity distance. Previous works indicate that the pairwise similarity distance can be defined as the Euclidean distance or the KL divergence [33], [34]; yet, for short text embedding, due to the data-sparsity issue and the bag-of-words representation's ignorance of the semantic connections between words, the pairwise similarity measurement may degrade when a text pair contains many synonyms but shares no common word. To address this issue, we characterize the inherent semantics of words using word embedding technology and further model the semantic relationships between synonyms by developing an attention-based weighted matching distance (AWMD), providing a robust pairwise similarity measurement for KNN graph construction.
Word embedding is a general term for word vectorization technologies derived from the distributional hypothesis, allowing words with similar meanings to have similar representations. Some short text embedding works use pre-trained word embeddings to alleviate the sparseness of short text data [27], [42], [43]. Well-known pre-trained word embedding models include word2vec, GloVe, ELMo, and fastText. Based on pre-trained word embeddings, the traditional bag-of-words can be transformed into a bag of word embeddings, and each short text can be viewed as an independent embedding set of its occurring words. Meanwhile, the semantic similarity of a word embedding pair can be treated as an edge weight. Therefore, a short text pair is equivalent to a bipartite graph, and the pairwise similarity distance d(x_i, x_j) can be defined as the maximum-weight matching distance over word embeddings. Furthermore, to better model the semantic connections between synonyms, we employ an attention mechanism to assign weight coefficients so as to actively focus on the latent synonym patterns behind a short text pair (x_i, x_j). The attention mechanism can be described as mapping a query and a set of key-value pairs to an output [44]. According to whether the query and the key-value pairs come from the same source, attention can be divided into inter-attention and intra-attention (better known as self-attention) [45].
Specifically, let D = {w_1, . . . , w_n} denote the set of word embeddings, where n is the vocabulary size, d is the dimension of each word embedding, and w_i ∈ R^{d×1} is the embedding of the i-th word in the vocabulary. Given a short text pair (x_i, x_j), let T_i and T_j denote the embeddings of the words occurring in x_i and x_j, respectively. Firstly, we employ self-attention to compute a contextual representation c_s^i of each occurring word in a text (Figure 2(a)). In self-attention, the key and value matrices come from the same source T_i. The essence is a weighted summation of the elements, where each weight is computed by an attention score function, such as softmax(), of the query with the corresponding key. Secondly, based on inter-attention, we employ softmax() as the attention score function to compute the edge weight connecting each contextual representation pair (c_s^i, c_t^j) across the two texts. Finally, the AWMD is defined as the maximum-weight matching over the resulting bipartite graph, normalized by |M|, the number of edges in the matching. We can see that AWMD is the normalization of the matching, which helps to eliminate the unfairness that arises when the lengths of the two short texts differ. Besides, AWMD is a similarity metric: the larger the value of AWMD, the more similar the short text pair.
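A simplified sketch of AWMD under our reading of the construction: dot-product self-attention for contextualization, row-wise softmax inter-attention for edge weights, and a greedy approximation in place of exact maximum-weight bipartite matching (the score functions and the matching routine are illustrative choices, not necessarily the authors' exact ones):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def contextualize(T):
    """Self-attention: each word embedding attends over its own text.
    T : (s, d) matrix of word embeddings for one short text."""
    scores = softmax(T @ T.T)      # (s, s) attention weights
    return scores @ T              # (s, d) contextual representations

def awmd(Ti, Tj):
    """Attention-based weighted matching distance (a similarity) between
    two short texts; greedy matching stands in for exact
    maximum-weight bipartite matching."""
    Ci, Cj = contextualize(Ti), contextualize(Tj)
    E = softmax(Ci @ Cj.T)         # inter-attention edge weights
    used_i, used_j, total, edges = set(), set(), 0.0, 0
    # Greedily pick the heaviest remaining edge until one side is exhausted.
    for s, t in sorted(np.ndindex(E.shape), key=lambda p: -E[p]):
        if s not in used_i and t not in used_j:
            used_i.add(s); used_j.add(t)
            total += E[s, t]; edges += 1
    return total / edges           # normalize by |M|
```

Dividing by |M| is what removes the length bias: a long text matched against a short one is scored by its average matched edge weight, not by the raw sum.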

B. REGULARIZED AUTOENCODERS WITH NEIGHBORHOOD PRESERVATION
Based on the constructed KNN graph, the neighborhood semantic dependency has been depicted as the local geometrical pattern of the k nearest neighbors weighted with AWMD. In this section, we regularize the training of the AEs by preserving this local geometrical pattern from the observation space in the low-dimensional embedding space, providing an explicit parametrized mapping function between observations and embeddings. Specifically, we take the AWMD-weighted cross-entropy of the low-dimensional embeddings of the k nearest neighbors as the manifold graph regularization term and regularize the training of AEs using a joint minimization of the self-reconstruction error and the manifold graph regularization.

1) THE OPTIMIZATION OBJECTIVE FUNCTION
Formally, given a short text x, let y = [y_1, . . . , y_d]^T ∈ R^{d×1} denote its low-dimensional embedding, where d is the dimension of the embedding, and let x̂ = [x̂_1, . . . , x̂_n]^T ∈ R^{n×1} denote the self-reconstruction of x. The AE consists of two modules: the encoder and the decoder. The encoder transforms an input vector x into the low-dimensional embedding y via a nonlinear version of the affine map,

y = σ(Wx + b).

The decoder transforms the embedding y back into a self-reconstruction x̂ whose form is the same as that of the encoder,

x̂ = σ(W^T y + c),
where σ(·) is the sigmoid function σ(a) = (1 + exp(−a))^{−1}, b ∈ R^{d×1} is the bias vector of the encoder, and W ∈ R^{d×n} contains the encoder parameters. c is the bias vector of the decoder, and the decoder parameters W^T ∈ R^{n×d} are ''tied weights'' with W, which reduces the number of estimated parameters and makes it harder for the encoder to stay in the linear regime of its nonlinearity without paying a high price in reconstruction error [45]. Finally, the self-reconstruction error between x and x̂ is measured with the cross-entropy, denoted as H_B(x, x̂).
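A minimal sketch of the encoder/decoder pair with tied weights (class and variable names are our own, not the paper's):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TiedAE:
    """One-hidden-layer autoencoder with tied weights W and W^T."""
    def __init__(self, n_vis, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.01, size=(n_hid, n_vis))  # encoder weights
        self.b = np.zeros(n_hid)   # encoder bias
        self.c = np.zeros(n_vis)   # decoder bias

    def encode(self, x):
        return sigmoid(self.W @ x + self.b)        # y = sigma(W x + b)

    def decode(self, y):
        return sigmoid(self.W.T @ y + self.c)      # x_hat = sigma(W^T y + c)
```

Note that the decoder reuses `self.W.T` rather than owning a second weight matrix, which is exactly the ''tied weight'' constraint described above.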
The manifold graph regularization for a given short text x_i is measured with the weighted cross-entropy of the embeddings of its nearby set, denoted as

R(x_i) = Σ_{x_j ∈ N_{i,t}} a_ij · H_B(y_i, y_j).

Taken together, the objective function for a given short text x_i is defined as

J(x_i) = H_B(x_i, x̂_i) + λ · Σ_{x_j ∈ N_{i,t}} a_ij · H_B(y_i, y_j),

where λ is a trade-off hyperparameter. Furthermore, we impose a sparsity constraint on the low-dimensional embeddings, for two main reasons: one is to facilitate understanding of the low-dimensional embedding, and the other is to prevent the AEs from learning an identity transformation. The sparsity constraint regularizes the AEs to reduce the difference between the average of each embedding dimension, ρ̂ = [ρ̂_1, . . . , ρ̂_d] ∈ R^{1×d}, and a fixed sparsity target 3 ρ = [ρ, . . . , ρ] ∈ R^{1×d} by minimizing the Kullback-Leibler (KL) divergence, where ρ̂_j = (1/m) Σ_{x∈X} y_j denotes the average output of the j-th embedding dimension over the training set X = {x_1, . . . , x_m}. The corresponding sparsity regularization term is defined as

S = Σ_{j=1}^d [ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j))].

Therefore, the final objective function of STE-AEs over the training set X is summarized as

J(W, b, c) = Σ_{i=1}^m J(x_i) + β · S,

where β weights the sparsity term.

3 In practice, the sparsity target is a settable constant hyperparameter. Here, for the convenience of formula (12), we express it as a constant vector.

2) MODEL OPTIMIZATION
Now, STE-AEs can provide the encoder as an explicit parametrized mapping function between observations and embeddings by minimizing J(W, b, c). For this purpose, we employ the gradient descent (GD) algorithm to optimize the parameters W, b, and c. For convenience, some variable symbols of STE-AEs used in the description of the model optimization are shown as follows.
In detail, based on the gradient descent algorithm, the parameters W, b, and c are updated as

W ← W − η · ∇_W J, b ← b − η · ∇_b J, c ← c − η · ∇_c J,

where η is the learning rate and ∇(·) denotes the partial derivatives with respect to the corresponding parameters. Therefore, the key step of the model optimization is the computation of the partial derivatives with respect to the parameters, which follow from expressions (6), (7), and (8) and the chain rule. The procedure of the model optimization algorithm is shown in Algorithm 2.
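Putting the pieces together, the full objective J(W, b, c) — reconstruction cross-entropy, the neighborhood-weighted cross-entropy of embeddings, and the KL sparsity penalty — can be sketched numerically as follows (λ, β, and ρ are assumed trade-off hyperparameters; this is an illustration, not the authors' implementation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bce(p, q, eps=1e-12):
    """Cross-entropy H_B between two vectors with entries in (0, 1)."""
    q = np.clip(q, eps, 1 - eps)
    return float(-(p * np.log(q) + (1 - p) * np.log(1 - q)).sum())

def ste_ae_objective(X, A, W, b, c, lam=0.1, beta=0.01, rho=0.05):
    """Reconstruction CE + neighborhood-weighted CE of embeddings
    + KL sparsity penalty, for a tied-weight autoencoder.

    X : (m, n) normalized count vectors; A : (m, m) AWMD-weighted KNN
    adjacency; W : (d, n) encoder weights; b : (d,); c : (n,).
    """
    m = X.shape[0]
    Y = sigmoid(X @ W.T + b)            # (m, d) embeddings
    Xhat = sigmoid(Y @ W + c)           # (m, n) reconstructions
    recon = sum(bce(X[i], Xhat[i]) for i in range(m))
    manifold = sum(A[i, j] * bce(Y[i], Y[j])
                   for i in range(m) for j in range(m)
                   if A[i, j] != 0.0)
    rho_hat = np.clip(Y.mean(axis=0), 1e-12, 1 - 1e-12)
    sparsity = float((rho * np.log(rho / rho_hat)
                      + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))).sum())
    return recon + lam * manifold + beta * sparsity
```

In practice this scalar would be minimized with (stochastic) gradient descent, either via the hand-derived gradients of expressions (16)-(18) or an automatic-differentiation framework.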

Algorithm 2 Model Optimization for STE-AEs
Input: the training set X = {x_1, . . . , x_m}.
Output: the parameters of the affine mapping, W, b, and c.
1. Construct a KNN graph based on the AWMD over the entire training set, G = (X, A).
2. Randomly initialize W, b, and c of STE-AEs.
3. For epoch = 1 to max_epoch:
   a. Perform a feedforward pass with each instance x_i and its nearby set N_{i,t}, computing the activations of the hidden layer, the output layer, and ρ̂_j;
   b. Based on expressions (16), (17), and (18), compute the average partial derivatives over the training set;
   c. Update W, b, and c by gradient descent.

C. EMBEDDINGS INTERPRETATION BASED ON RANDOM FOREST
Now, the low-dimensional embedding of an out-of-sample short text can be extracted easily via the encoder y = f_{W,b}(x). However, what meaning is implied by each dimension of the low-dimensional embeddings remains unclear. In this section, we employ an RF regression model for feature selection to deal with this issue and try to find understandable keywords to improve the interpretability of the embedded variables. The process is divided into two steps: we first propose a partition strategy for the interpretation subsets based on a ranking of the activations of each dimension of the low-dimensional embeddings, and then execute feature selection via the random forest algorithm on each interpretation subset independently.

1) INTERPRETATION SUBSET PARTITION
Given a low-dimensional embedding matrix Y = [y_1, . . . , y_m], where y_i = [y_{i1}, . . . , y_{id}]^T ∈ R^{d×1} consists of the activations of the hidden units, y_{ij} indicates the activation of the j-th hidden unit. As is well known, each activation is calculated via the logistic sigmoid function, which depends heavily on the dot product of the input signals and the neuron's parameters (synaptic weights and bias). Specifically, hidden neurons are more active when their inputs are more correlated with their synaptic weights, while the bias controls the threshold of this correlation [47]. Therefore, the parameters of a neuron determine what kind of input signal more easily induces it to give a higher activation value, and a higher activation demonstrates that the input signals contain richer information for embedding interpretation. According to the sorting of the activations of each hidden unit, we can choose a small part of the most active, i.e., most informative, short texts from the whole training set as an interpretation subset.
There are two partition strategies for selecting interpretation subsets based on the sorting of activations. One is to set a threshold and select those samples whose activation value is greater than the threshold. However, due to the adoption of the sparsity regularization, some dimensions of the low-dimensional embeddings will be infinitely close to zero, so it is difficult to set appropriate thresholds based on the absolute size of the activation value. The other is to select the top-k samples at the front of the sorted list, which provides a relatively stable set of the most informative short texts. For this reason, we select the top-k short texts based on the sorting of the activations of each hidden neuron independently and obtain the interpretation subsets.
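The top-k partition can be sketched as follows (an illustrative helper of ours, not the authors' code):

```python
import numpy as np

def interpretation_subsets(Y, top_k):
    """For each embedding dimension, return the indices of the top_k
    short texts with the largest activation of that hidden unit.

    Y : (m, d) activation matrix of the encoder over the training set.
    """
    order = np.argsort(-Y, axis=0)        # descending sort per column
    return [order[:top_k, j].tolist() for j in range(Y.shape[1])]
```

Because each column is ranked independently, a short text can appear in several subsets if it strongly activates several hidden units.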

2) FEATURE SELECTION WITH INTERPRETATION SUBSET
In this paper, let IS = {I_i}_{i=1}^d denote the collection of the obtained d interpretation subsets, where d indicates the number of neurons in the hidden layer (or the dimension of the low-dimensional embeddings) and I_i indicates the collection of the top-k short texts picked according to the sorting of the activations of one hidden neuron. Based on the interpretation subsets, we execute feature selection for each dimension via the random forest algorithm respectively. The Random Forest (RF) is an exemplar of ensemble learning methods, combining the random subspace method and bagging. The principle of RF is to build a multitude of decision trees using bootstrap samples from the entire training set, choosing the best split feature from a randomly selected subset of explanatory variables. In addition, it can provide a ranking of Variable Importance (VI) based on the out-of-bag (OOB) samples using wrapper methods of feature selection. The quantification of VI is a crucial measure in the feature selection task since it indicates the contribution of the candidate variables to the response variable for interpretation purposes. Therefore, we build an RF model on each interpretation subset independently and then determine a subset of features according to the ranking of VI.
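A sketch of the per-dimension keyword extraction, using scikit-learn's impurity-based `feature_importances_` as a convenient stand-in for the OOB-based VI described above (function and variable names are ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def keywords_for_dimension(X_sub, y_sub, vocab, n_words=5, seed=0):
    """Fit an RF regressor from word-count vectors to one hidden unit's
    activations and return the words with the highest importance.

    X_sub : (k, n) word-count vectors of one interpretation subset.
    y_sub : (k,) activations of the corresponding hidden unit.
    vocab : list of n words aligned with the columns of X_sub.
    """
    rf = RandomForestRegressor(n_estimators=50, random_state=seed)
    rf.fit(X_sub, y_sub)
    top = np.argsort(-rf.feature_importances_)[:n_words]
    return [vocab[i] for i in top]
```

Running this once per interpretation subset yields d keyword lists, one describing each embedding dimension.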

IV. EXPERIMENT
Here, we investigated the performance of the extracted low-dimensional embeddings from two aspects: discriminability and interpretability. Firstly, we provided different dimensionalities of embeddings (10, 30, 50, 70, 90, 110, 130) over three widely used text corpora (Web-snippets, 20-Newsgroups, and Twitter) and compared the performance of STE-AEs with state-of-the-art approaches, such as TRNMF (2020) [42], in two widespread applications, clustering and classification. Secondly, we provided understandable keywords selected from each interpretation subset by the RF model to interpret what meaning is implied by each dimension of the embeddings.

A. DATASETS
We chose Web-snippets, 20-Newsgroups and Twitter as our experimental datasets. Web-snippets is a collection of text snippets returned as results of search-engine queries, consisting of 12,340 search snippets belonging to 8 domains [49]. 20-Newsgroups is a collection of newsgroup documents spanning 20 different newsgroups. For short text embedding, we selected only the samples with fewer than 21 words, denoted as 20Nshort, as done in [43]. Twitter consists of 5,513 hand-classified tweets divided into 4 different topics: Apple, Google, Microsoft and Twitter [42]. All datasets were preprocessed by lower-casing all text, removing non-alphabetic characters, and removing stopwords in a standard list. In addition, words shorter than 3 characters, or appearing fewer than 10 times in Web-snippets or fewer than 5 times in the other two datasets, were removed. Table 1 shows the statistical information of the three datasets, where D is the number of short texts, V is the size of the vocabulary spanned over each domain, and D̄ and St.Dev are the mean and standard deviation of the number of words occurring in each text.

B. EXPERIMENTAL PROCEDURE
To obtain a fair experimental estimate, we conducted 5-fold cross-validation (5-CV) over the three datasets. Each iteration comprises 3 general procedures: embedding model construction, embedding extraction, and application performance evaluation (clustering and classification). Specifically, we shuffled each dataset and divided it into five equal subsets. Then, one of these subsets was cyclically picked for embedding extraction and application performance evaluation (test set), and the remaining subsets were used for embedding model construction (training set), until all subsets had been picked for evaluation. We therefore conducted five iterations and obtained five application performance estimates; the final results are the average performance over the five iterations. Figure 3 is the flow diagram representing this experimental procedure.
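The fold-cycling procedure above can be sketched in a few lines. The `evaluate` callback below is a hypothetical stand-in for "train the embedding model on the training subsets, extract embeddings of the test subset, and score clustering/classification"; the sketch only shows the shuffle-split-cycle-average skeleton.

```python
# Minimal sketch of the 5-fold procedure: shuffle, split into five equal
# parts, cycle each part as the evaluation set, and average the scores.
import numpy as np

def five_fold_scores(data, evaluate, seed=0):
    idx = np.random.default_rng(seed).permutation(len(data))
    folds = np.array_split(idx, 5)
    scores = []
    for k in range(5):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        scores.append(evaluate(data[train_idx], data[test_idx]))
    return float(np.mean(scores))

# Dummy evaluation: fraction of data held out in each fold (1/5 here).
data = np.arange(100)
print(five_fold_scores(data, lambda tr, te: len(te) / (len(tr) + len(te))))
```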
All comparison methods were run under a uniform hyper-parameter setting over the three datasets. For BTM,7 we set α = 50/(number of topics) and β = 1e−2. For PTM, we used the optimal setting as follows: α = 0.1, λ = 0.1 and β = 0.01 [16]. For STC, the CNN has two convolutional layers, with 12 feature maps at the first convolutional layer and 8 at the second; the value of k-max pooling is 5. For TRNMF, we set α = 0.1, λ = 0.1, β = 0.1 and γ = 0.01, and the Gibbs sampling is run for 1,000 iterations [42]. For ST-AEs, we used two versions of pre-trained word2vec embeddings: one for Web-snippets and 20Nshort,8 the other for Twitter.9 We fixed α = 0.1 for all corpora, set the batch size to 64, and pre-trained the autoencoder for 15 epochs [48]. Please note that we removed the words that were not in the word-embedding lookup table. Although this may cause semantic loss and increase the sparsity of the short texts, it does not affect the fairness of the comparison.
For STE-AEs, we used the same pretrained word2vec embeddings as ST-AEs and the optimal hyper-parameters obtained after 5-CV: learning rate = 0.5, epoch = 200, λ = 50, γ = 10, the fixed sparsity target ρ = 1/d and the number of neighbors K = 13. For LEAEs, the batch size is set to 100, the number of neighbors to 7, η = 1.2, epoch = 30 and λ = 100. For DTM,10 we set the number of neighbors to 20 and λ = 1000. Since the graph regularization of DTM is based on LE algorithms, DTM cannot provide a specific mapping function from the manifold to the output embedding [41], which limits its ability to handle previously unseen short texts. To address this issue, we employed inclusive approaches that rebuild the similarity and dissimilarity matrices with the evaluation subset and retrain the model based on these matrices [34].
7 https://github.com/xiaohuiyan/BTM
8 https://github.com/jacoxu/STC2
9 https://nlp.stanford.edu/projects/glove/
10 http://www.cs.cmu.edu/∼seungilh/dtm_codes/index.html

1) DISCRIMINATIVE PERFORMANCE IN UNSUPERVISED SETTING
To evaluate the discriminability of the various low-dimensional embeddings of the test short texts, we utilized the K-means algorithm to group them with the same cluster number as the ground truth. K-means automatically groups instances according to the similarity of their representation vectors, so the clustering results reflect the quality of the similarity and dissimilarity of a representation. We evaluated the clustering results over the 5 iterations via two common metrics: accuracy (ACC) and the normalized mutual information (NMI). Given a short text x_i, let C_i be the assigned cluster id and S_i be the original label. The ACC is defined as follows [33]:

ACC = (Σ_{i=1}^{N} δ(S_i, map(C_i))) / N,

where N indicates the size of the test set and map(C_i) matches C_i to the equivalent short text label; the optimal mapping can be determined by the Kuhn-Munkres algorithm [50]. δ(x, y) is the delta function defined as:

δ(x, y) = 1 if x = y, and 0 otherwise.

The NMI is defined as:

NMI(C, S) = MI(C, S) / √(H(C) · H(S)),

where H(·) is the entropy and √(H(C) · H(S)) normalizes the mutual information to the range [0, 1]. MI(C, S) denotes the mutual information between C and S, measured as:

MI(C, S) = Σ_{C_i, S_j} p(C_i, S_j) · log( p(C_i, S_j) / (p(C_i) · p(S_j)) ),

where p(C_i, S_j) is the joint probability that a text belongs to C_i and S_j simultaneously, and p(C_i) and p(S_j) denote the probabilities that a text belongs to C_i and S_j, respectively. Figures 4 and 5 show the average clustering performance of the various low-dimensional embeddings after 5-CV. As shown there, the mean ACC and NMI of STE-AEs consistently outperform the state-of-the-art comparative approaches (TRNMF) over the three datasets. In particular, on the Twitter dataset, STE-AEs achieves the best ACC (0.8588 ± 0.0089) at dimension 70.
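The two metrics above can be computed as in the following sketch: ACC uses the Kuhn-Munkres (Hungarian) assignment via scipy to find the optimal cluster-to-label mapping, and NMI comes from scikit-learn with the geometric (√) normalization matching the definition above. The toy labels are illustrative.

```python
# Clustering evaluation: best-match accuracy (ACC) and NMI.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(labels, clusters):
    """ACC: accuracy after optimally re-labeling clusters (Kuhn-Munkres)."""
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    n_cls = max(labels.max(), clusters.max()) + 1
    cost = np.zeros((n_cls, n_cls), dtype=int)
    for c, s in zip(clusters, labels):
        cost[c, s] += 1                      # contingency counts
    row, col = linear_sum_assignment(-cost)  # maximize matched instances
    return cost[row, col].sum() / len(labels)

labels   = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 0, 0, 2, 2]   # same grouping, permuted cluster ids
print(clustering_acc(labels, clusters))  # 1.0 after optimal re-labeling
print(round(normalized_mutual_info_score(labels, clusters,
                                         average_method="geometric"), 4))
```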
In addition, for the three datasets with different sparsity scales (for more statistical information, see Table 1), the short text embeddings extracted by STE-AEs showed smaller standard deviations across dimensions, which indicates that STE-AEs achieves steadier clustering performance. The evidence demonstrates that preserving the semantic dependency of the attention-based neighborhood has a positive effect on capturing the intrinsic discriminative explanatory factors. Besides, compared with other manifold-inspired approaches, like DTM and LEAEs, STE-AEs presented a smoother peak, which means that STE-AEs provides the best clustering performance over a wider range of dimensions. We attribute this mainly to the introduction of sparsity constraints into the extraction of the low-dimensional embeddings: because the fixed sparsity target ρ = 1/d decreases as the dimension d increases, STE-AEs still captures the intrinsic discriminative explanatory factors by guiding more hidden units' activations close to zero, even when the dimension of the embeddings is large.
Furthermore, to analyze the discriminative performance directly, we adopted a popular data visualization technique, t-Distributed Stochastic Neighbor Embedding (t-SNE) [51], to visualize the low-dimensional embeddings in a 2D scatter plot. t-SNE uses probability distributions, instead of pairwise distances, to measure the similarities of objects and minimizes the Kullback-Leibler (KL) divergence between such probabilities in the input and output spaces, which helps reflect the similarities and dissimilarities among objects. To provide the best visualization, we picked the three best-performing approaches, STE-AEs, TRNMF and ST-AEs, at dimension 50 on Twitter. Figure 7 presents scatter diagrams of the 50-dimensional embeddings over the test sets, where each dot indicates a short text and each color-shape pair denotes a class. From Figure 7, we can see that STE-AEs and TRNMF present a clearer cluster structure than ST-AEs, and compared with TRNMF, STE-AEs presents clear-cut margins among the different semantic categories, which demonstrates that our proposed approach not only preserves the inner-class intrinsic structure but also reduces possible overlap and widens inter-class margins. Intuitively, this evidence shows that our approach provides more separable low-dimensional embeddings than the other methods, confirming the discriminability of our low-dimensional embeddings.
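A t-SNE projection of this kind can be produced as below. The 50-dimensional inputs here are synthetic stand-ins for the embeddings (two Gaussian "classes"), not the paper's data; in practice one would scatter-plot `xy` colored by class label, e.g. with matplotlib.

```python
# Project 50-d embeddings to 2-d with t-SNE for visualization.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic classes of 50-d embeddings with well-separated means.
emb = np.vstack([rng.normal(0, 1, (30, 50)), rng.normal(5, 1, (30, 50))])
xy = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)
print(xy.shape)  # one 2-d point per input embedding: (60, 2)
```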
Based on the above evidence, with the three measures of ACC, NMI and t-SNE over the three datasets, we can conclude that the proposed approach is an effective approach to capture the intrinsic discriminative explanatory factors, improving the performance of short text clustering.

2) DISCRIMINATIVE PERFORMANCE IN SUPERVISED SETTING
In this section, we further compared the discriminative power of the various low-dimensional embeddings in a supervised way. After extracting the low-dimensional embeddings of the evaluation subsets, we randomly divided them into 2 equal parts: one for classification testing, and the other for training a 1-nearest-neighbor (1-NN) and a support vector machine (SVM) classifier,12 respectively. Since the category sizes differ, we employed the weighted F-measure F̄ to estimate the accuracy of the classification model, calculated as follows:

F̄ = Σ_i (c_i / C) · F_i,

where c_i is the number of instances of category i in the test set, C is the size of the test set, and F_i is the F-measure of category i, which reflects the trade-off between the precision P_i and recall R_i. The P_i, R_i and F_i are defined as follows:

P_i = TP_i / (TP_i + FP_i), R_i = TP_i / (TP_i + FN_i), F_i = 2 · P_i · R_i / (P_i + R_i).

12 We implemented the classification framework via Weka. We used lazy.IB1 for 1-NN; for SVM, we adopted the LIBSVM Java code from GitHub (https://github.com/cjlin1/libsvm), which can be easily executed by Weka.

Figures 7 and 8 show the average F̄ and standard deviations after the 5 iterations for 1-NN and SVM, respectively. From these figures, we observe a significant improvement and a smoother peak similar to the clustering experiment, which further illustrates the effectiveness of STE-AEs in capturing the intrinsic discriminative explanatory factors. In addition, we see that BTM outperforms DTM in most dimensions, which differs from the ACC and NMI performance in Figures 5 and 6. This is mainly because DTM implicitly utilizes the valuable class label information to build the similarity matrix [34], and this information gives DTM an inherent advantage in improving the discriminability of low-dimensional embeddings in clustering (unsupervised setting), whereas for classification (supervised setting) the classifier can naturally utilize the class label information itself, so DTM's advantage from the implicit use of category information may be reduced or even surpassed by other methods, like BTM.
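The weighted F-measure above can be sketched directly from its definition; scikit-learn's `f1_score` with `average="weighted"` computes the same support-weighted average of per-class F1 scores, which the example uses as a cross-check. The toy labels are illustrative.

```python
# Weighted F-measure: per-class F1 weighted by class share of the test set.
import numpy as np
from sklearn.metrics import f1_score

def weighted_f(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0   # precision P_i
        r = tp / (tp + fn) if tp + fn else 0.0   # recall R_i
        f = 2 * p * r / (p + r) if p + r else 0.0
        total += (np.sum(y_true == c) / len(y_true)) * f  # weight c_i / C
    return total

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f(y_true, y_pred), 4))                       # 0.8333
print(round(f1_score(y_true, y_pred, average="weighted"), 4))     # 0.8333
```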
In summary, we conclude that STE-AEs can capture the intrinsic discriminative explanatory factors, improving the performance of short text clustering and classification. The excellent discriminability benefits from how well the low-dimensional embeddings express the inherent similarity of short texts. This is mainly because we take a local perspective in which the embedding of each short text is strongly associated with the specific word co-occurrence patterns of itself and its neighbors, ensuring the embeddings are locally invariant around the neighborhoods and improving the discriminative effect. Specifically, the minimization of the manifold graph regularization guides the cross-entropy between the low-dimensional embeddings of a short text pair to be small when their AWMD is small, i.e., when they are semantic neighbors. In other words, the encoder function tends to assign similar low-dimensional embeddings to nearby short texts, i.e., to the neighborhood of semantic dependency.
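The neighborhood-preserving term can be illustrated as in the sketch below: a similarity-weighted cross-entropy between the (sigmoid) embeddings of texts joined by a KNN edge. The weight scheme and the exact cross-entropy form here are assumptions for illustration; the paper's precise loss may differ.

```python
# Hedged sketch of a graph regularizer: weighted cross-entropy over KNN edges.
import numpy as np

def pairwise_cross_entropy(h_i, h_j, eps=1e-8):
    """Cross-entropy between two embedding vectors with entries in (0, 1)."""
    h_j = np.clip(h_j, eps, 1 - eps)
    return float(-np.sum(h_i * np.log(h_j) + (1 - h_i) * np.log(1 - h_j)))

def graph_regularizer(H, edges, weights):
    """Sum of weighted cross-entropies over the KNN graph edges."""
    return sum(w * pairwise_cross_entropy(H[i], H[j])
               for (i, j), w in zip(edges, weights))

H = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]])  # embeddings after sigmoid
# Edge (0, 1) joins texts with similar embeddings, (0, 2) dissimilar ones.
close = graph_regularizer(H, [(0, 1)], [1.0])
far = graph_regularizer(H, [(0, 2)], [1.0])
print(close < far)  # similar neighbors incur the smaller penalty
```

Minimizing such a term pulls the embeddings of neighboring texts toward each other, which is the local-invariance behavior the paragraph describes.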

3) ABLATION EXPERIMENT
Additionally, following the above unsupervised and supervised settings, we conducted an ablation experiment to evaluate the effect of the AWMD-based manifold graph regularization and the sparsity regularization on the final performance. Specifically, STE-AEs indicates the complete solution incorporating both the AWMD-based manifold graph regularization and the sparsity regularization. Compared with STE-AEs, STE-AEs_knn&sparse also incorporates the sparsity regularization, but modifies the manifold graph regularization by taking the Euclidean distance as the pairwise similarity distance for KNN construction. Furthermore, unlike STE-AEs_knn&sparse, STE-AEs_knn also removes the sparsity regularization from the objective function. Figures 9 and 10 show the average clustering and classification performance over the three datasets, respectively.
From the comparison between STE-AEs and STE-AEs_knn&sparse in Figures 9 and 10, we can see that STE-AEs consistently outperforms STE-AEs_knn&sparse in clustering and classification, which demonstrates that the AWMD provides a more reasonable connectivity structure to depict the semantic dependency of the neighborhood. This is not only because the AWMD integrates the inherent semantics of words through word embedding technology, but also because it models the semantic connections between synonyms through the attention-based word weighted matching process. Meanwhile, STE-AEs_knn performs worse than STE-AEs_knn&sparse, which indicates that the sparsity regularization has a positive effect on the discriminability of the low-dimensional embeddings. As discussed above, the minimization of the sparsity regularization drives a certain proportion of dimension values infinitely close to zero. Therefore, STE-AEs_knn&sparse tends to distribute the intrinsic discriminative explanatory factors over the dimensions with larger values (called discriminative dimensions) in the low-dimensional embeddings, meaning that similar low-dimensional embeddings from nearby short texts share similar discriminative dimensions while distinct low-dimensional embeddings from non-neighbor short texts have distinct discriminative dimensions. In other words, the sparsity regularization helps to enlarge the combination space of discriminative dimensions, thereby expanding the discriminability between low-dimensional embeddings.
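The sparsity regularization with target ρ = 1/d discussed above is, in standard sparse autoencoders, a KL-divergence penalty on each hidden unit's mean activation; the sketch below shows this form under that assumption (the paper's exact formulation may differ).

```python
# Illustrative KL-divergence sparsity penalty pushing mean activations
# toward the target rho = 1/d (standard sparse-autoencoder form).
import numpy as np

def kl_sparsity_penalty(hidden_activations, rho):
    """KL(rho || rho_hat) summed over hidden units; activations in (0, 1)."""
    rho_hat = np.clip(hidden_activations.mean(axis=0), 1e-8, 1 - 1e-8)
    return float(np.sum(rho * np.log(rho / rho_hat)
                        + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))))

d = 10
rho = 1.0 / d
sparse = np.full((64, d), rho)   # mean activation already at the target
dense = np.full((64, d), 0.9)    # units fire strongly on average
print(kl_sparsity_penalty(sparse, rho))       # zero penalty at the target
print(kl_sparsity_penalty(dense, rho) > 0)    # dense activations penalized
```

Since ρ = 1/d shrinks as d grows, the penalty pushes ever more units toward zero in higher dimensions, consistent with the smoother-peak behavior noted above.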

4) COMPREHENSION OF LOW-DIMENSIONAL EMBEDDINGS
Finally, we provided understandable keywords to interpret what meaning is implied by each dimension of the low-dimensional embeddings. Specifically, based on one of the five estimation iterations of the above unsupervised and supervised experiments, we extracted 10-dimensional embeddings of the training sets of the three datasets with the built embedding model. Then, according to the sorting of each dimension of the embeddings (the activation of each hidden neuron), we selected the top-200 short texts to construct the interpretation subsets. Next, we built an RF model on each interpretation subset independently and selected the top-5 words (variables) based on the VI to interpret the meaning of each dimension of the low-dimensional embeddings.
We compared STE-AEs with other topic-based approaches, such as TRNMF, PTM and BTM. Table 2 shows a sample of the five most important words selected over the three datasets.
From Table 2, we can see that STE-AEs and TRNMF provide more informative and understandable words for embedding interpretation, as expected. The words discovered by PTM and BTM are confusing, like ''claim evid arm game true'' of PTM and ''tennis music ski buy movie'' of BTM. Therefore, the evidence demonstrates that the proposed method can alleviate the issue that neural network-based embedding methods fail to effectively interpret the meaning of the embeddings. Besides, unlike the topic-based methods, our method provides a post-processing solution for low-dimensional embedding interpretation, reducing the complexity of the model and improving its practical applicability.

V. CONCLUSION
In this paper, we propose a manifold-inspired short text embedding approach, STE-AEs, which aims to extract low-dimensional embeddings of short texts against the data-sparse issue. To avoid the impact of sparsity on the similarity measurement of nearby short texts, STE-AEs develops a robust similarity measurement, AWMD, for KNN construction, and then regularizes the training of AEs by imposing an additional minimization of the cross-entropy of nearby texts' embeddings to preserve the local geometrical pattern of the neighborhood, which helps alleviate the data-sparse issue of short texts. As a result, under the regularized training framework, STE-AEs provides an explicit parametrized mapping function between observations and embeddings, ensuring the embeddings are locally invariant around the neighborhoods and improving the discriminative effect. The evidence on three real-world short text corpora demonstrates that STE-AEs can capture the intrinsic discriminative explanatory factors, improving the performance of short text clustering and classification. Additionally, STE-AEs develops a post-processing solution that builds an RF regression model to find informative and understandable words using the activation values of the encoder over the training set. The exploration of low-dimensional embedding interpretation yields inspirational results: informative and understandable words are selected, improving the semantic interpretability of the low-dimensional embeddings.