Deep Recursive Embedding for High-Dimensional Data

Embedding high-dimensional data onto a low-dimensional manifold is of both theoretical and practical value. In this paper, we propose to combine deep neural networks (DNN) with mathematics-guided embedding rules for high-dimensional data embedding. We introduce a generic deep embedding network (DEN) framework, which is able to learn a parametric mapping from high-dimensional space to low-dimensional space, guided by well-established objectives such as Kullback-Leibler (KL) divergence minimization. We further propose a recursive strategy, called deep recursive embedding (DRE), to make use of the latent data representations for boosted embedding performance. We exemplify the flexibility of DRE by different architectures and loss functions, and benchmarked our method against the two most popular embedding methods, namely, t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP). The proposed DRE method can map out-of-sample data and scale to extremely large datasets. Experiments on a range of public datasets demonstrated improved embedding performance in terms of local and global structure preservation, compared with other state-of-the-art embedding methods.


Introduction
Embedding high-dimensional data onto a low-dimensional manifold is of both theoretical and practical value. It can be used for a variety of applications such as data visualization, representation learning, unsupervised clustering, and data exploration , Bengio, 2013, Bengio et al., 2014, Liu et al., 2017. t-distributed stochastic neighbor embedding (t-SNE) [van der Maaten and Hinton, 2008] is among the most well-known and widely-used methods for high-dimensional data visualization, which preserves local similarity of high-dimensional data in a drastically reduced low-dimensional map (typically 2). t-SNE has found widespread applications in life science research among others in the last decade, and established itself as an important visualization tool in the scientific community [Kobak and Berens, 2019]. Nonetheless, it is so far largely recognized as a visualization tool, as some practical issues may have limited its wider use in learning-based tasks. First, t-SNE does not immediately allow out-of-sample projection. Second, the computation of t-SNE is memory-and time-consuming, not well scalable to the extremely large datasets of today. Third, t-SNE (especially its fast implementations) focuses on local neighborhood, often losing global data structure of data.
A number of follow-up work have been proposed after the initial paper of t-SNE. For instance, Barnes-Hut-SNE (BH-SNE) [van der Maaten, 2014], A-tSNE [Pezzotti et al., 2017] and FIt-SNE  were proposed to accelerate the computation and reduce memory usage of t-SNE. Recently, another DR method called uniform manifold approximation and projection (UMAP) [McInnes et al., 2018] was introduced and became another popular visualization tool. UMAP has demonstrated competitive visualization performance as t-SNE, with notably reduced runtime and arguably better preservation of global structure [McInnes et al., 2018].
In the meanwhile, DR has also been studied in the deep learning field Salakhutdinov, 2006, Makhzani et al., 2016]. For example, the auto-encoder (AE) is a classical DR method [Hinton and Salakhutdinov, 2006], which embeds high-dimensional data through a DNN with an encoder-decoder architecture. AE can be trained with mean square error (MSE) loss and standard optimization, and the intermediate tensors at the information bottleneck are seen as low-dimensional representations of the high-dimensional data. Although its embedding performance is generally inferior to that of t-SNE or UMAP, AE suggests great potential of DNN for DR through unsupervised learning.
We posit that the combination of DNN and dedicated, mathematics-guided embedding rules may further enrich the possibilities of DR and improve its performance. Embedding rules such as cross entropy and Kullback-Leibler (KL) divergence are more specific to the purpose of DR, compared to MSE which is intended for reconstructing data. Generally speaking, there exist two methodologies to combine DNN and DR. Intuitively, we may teach the DNN to learn from a referenced DR method. The recent deep learning multidimensional projection method [Espadoto et al., 2020] follows this methodology, which first computes a t-SNE or UMAP map in conventional ways, then let the DNN learn the t-SNE or UMAP results by MSE. This method can achieve scalability and out-of-sample support, but its embedding performance has an upper limitation equal to that of the referenced method. The second methodology is to train a DNN with dedicated loss functions to directly capture the mathematical principles of embedding, therefore mathematics-guided. The parametric t-SNE (ptSNE) proposed in 2009[van der Maaten, 2009] is an early work in this direction, which employed the t-SNE loss function to train a restricted Boltzmann machine (RBM). ptSNE addresses the scalability issue, but still has some practical limitations. ptSNE involves a pre-training step and a fine-tuning step, which are complicated while not guaranteed to converge. Even with ideal convergence, ptSNE cannot exceed the performance of the original t-SNE.
We seek to develop a parametric (i.e. able to embed new data points) and scalable (i.e. able to embed infinite data points) DR method that can find an intrinsic low-dimensional representation of data. In this paper, we propose a generic deep embedding network (DEN) framework that is based on well-established embedding rules and optimized by modern DNN. In addition, to break the upper limit of the performance in classical DR methods, we propose a recursive training strategy called deep recursive embedding (DRE), to make use of latent representations to boost the DR performance further.
Specifically, as we will show in later sections, the advantages of the proposed DRE method are four-fold: (1) It can readily map out-of-sample data points. (2) It is scalable and memory-efficient. (3) It is flexible in the design of its architecture, loss, and training strategy. (4) It breaks the upper limit of t-SNE or UMAP.

Classical Dimensionality Reduction Methods
Dimensionality reduction is a classical problem in machine learning. The most well-known linear DR method is principal component analysis (PCA) [Jolliffe and Cadima, 2016], which computes low-dimensional representation by linearly projecting high-dimensional data points onto the first few principal components that preserves the largest variance.
Other classical DR methods include multidimensional scaling (MDS) [Buja et al., 2008], Isomap [Balasubramanian et al., 2002], locally linear embedding (LLE) [Roweis, 2000], and stochastic neighbor embedding (SNE) [Hinton and Roweis, 2002]. They usually produce better visualization results on nonlinear data compared with PCA, but with limitations in terms of scalability, speed, and stochasticity. In 2008, the t-SNE method [van der Maaten and Hinton, 2008], a variant of SNE, was proposed, which later became one of the most popular methods in the research community for high-dimensional data visualization. t-SNE emphasizes local neighborhood in data with a modified assumption (t-distribution in low-dimensional space) over SNE, and can produce higher quality embedding results compared with other nonlinear manifold learning methods.
Many developments followed up the original t-SNE [Chan et al., 2018, Pezzotti et al., 2020. An important work is Barnes-Hut t-SNE (BH t-SNE) [van der Maaten, 2014], which approximates the high-dimensional space with sparse distributions and uses the Barnes-Hut algorithm to speed up computation. The approximation largely reduces the algorithm complexity and memory consumption, enabling t-SNE to embed very large datasets that were previously impossible. A hierarchical stochastic neighbor embedding (HSNE) method was proposed to interactively render the visualization at different scales [Pezzotti et al., 2016]. The hierarchical way of visualization allows small memory footprint by focusing on landmarks instead of all data. The authors also proposed an A-tSNE method [Pezzotti et al., 2017], which trades off speed and accuracy, to enable interactive data exploration. More recently, Linderman et al. proposed a fast Fourier transform-accelerated interpolation-based t-SNE (FIt-SNE)  to accelerate the implementation of t-SNE. In FIt-SNE, an effective late exaggeration strategy was also proposed for better clustering. Lately, another important embedding method UMAP was introduced [McInnes et al., 2018], which is rooted in mathematical foundations of Riemannian geometry and algebraic topology. UMAP is considered competitive with t-SNE in visualization, while faster and better scalable in implementation. It is also able to map out-of-sample data by graph embedding. However, it was reported that UMAP can be sensitive to the choice of hyper-parameters [Espadoto et al., 2020]. Although some work argued that the global data structure can be better preserved by UMAP, other studies suggested that UMAP preserves the global structure in a way similar to t-SNE with the same initialization [Kobak and Linderman, 2019].

Deep Learning Methods
AE is a classical DNN-based DR method, which maps the data to a low-dimensional representation with an encoder. Through unsupervised learning, the encoder can generate a nonlinear embedding in the low-dimensional space. While AE uses the MSE loss, the ptSNE method [van der Maaten, 2009], proposed by van der Maaten in 2009, adopted the same KL loss function as t-SNE. ptSNE consists of three separate steps: training RBMs, stacking the RBMs to construct a pre-trained NN, and finetuning the pretrained NN. Such a training process is cumbersome, and its performance is frequently poorer than that of the standard t-SNE due to the difficulty of optimization. Another DNN method was later proposed to directly cluster data, using a loss function similar to that of t-SNE, but based on estimated centroids [Xie et al., 2016]. Very recently, a DL Projection method [Espadoto et al., 2020] was introduced to learn from any referenced DR methods: it first selects a reference method (e.g. t-SNE, UMAP) to generate a projection of a subset; then trains a DNN to learn the embedding. The method is effective and intuitive, but its performance is restricted by its teachers. Although the network itself does not demand complex parameter settings, the front-end process (i.e. the DR method to generate the ground truth for supervised learning) still requires elaborate tuning.
We believe there is still much potential in DNN for DR, given its strong capability of representation learning. In this work, we propose a dedicated DNN-based DR framework, named deep embedding network (DEN), to embed high-dimensional data in a unsupervised manner by well-established mathematical principles. Within this framework, we further propose a novel deep recursive embedding (DRE) method, which can make use of the latent representations of the original data for further boosted DR performance. To demonstrate the flexibility of our framework, we introduce a two-stage t-SNE-UMAP loss, which is able to combine the favorable properties of t-SNE and UMAP. We provide a further detailed comparison between our methods and other deep-learning-based methods in Discussion.

Overview of DEN
We propose a generic framework, called Deep Embedding Network, to accomplish nonlinear DR through DNNs. In short, DEN is an unsupervised learning method which uses DNN to optimize a dedicated loss function such that the output low-dimensional representation optimally preserves the desirable geometrical properties of the input highdimensional data. Once trained, it forms a parametric mapping from input to output, able to embed any out-of-sample data. DEN is flexible in its design: it can make use of any modern modules of DNN, take in data of different forms (i.e. not only vectors but also tensors), and allow different loss functions.

Flexible Architecture: FCNN and CNN
DEN allows very flexible design of the network architecture. Generally speaking, DEN can be divided into two parts: a feature extraction part and a dimensionality reduction part. In Fig. 1 we present two DEN architectures, named as Model A and Model B. Model A is a classical fully connected neural network (FCNN), while Model B includes convolutional neural network (CNN) modules for better extraction of features from images.
Each network consists of two parts: (1) Feature extraction: In this part, we aim to capture the high-level features from input data. In our experiments, we determined empirically the parameters of NN, and we noticed that the final performance is very robust to the setting of NN parameters. In Model A, the vector feature extractor contains 5 hidden layers, each of which followed by a ReLU and a batch normalization layer. In Model B, a standard design is followed: the first 2 convolutional layers (filter number=16) with ReLU and maxpooling capture shallow features, then 2 convolutional layers with 32 filters followed by ReLU and maxpooling operation are used to extract deeper features related to the task.
(2) Dimension reduction: In this part, the dimensionality is gradually reduced, from 2000 to 500, then 100, and finally to the target dimension 2. We also followed a standard design in DNN: the first dense layer with 2000 neurons is used to lift the representations so that the capability to express data is increased. Then 2 dense layers with 500 and 100 neurons are used to reduce the dimension, as is often in NN design, i.e. a decreasing number of neurons at deeper layers. We note that the final DR results are not sensitive to the network hyperparameters.

Flexible Loss: Deep t-SNE and Deep UMAP
With the generic DEN architectures, we formulate dedicated loss functions to guide the training. DEN can take flexible loss functions, including that of t-SNE [van der Maaten and Hinton, 2008] and UMAP [McInnes et al., 2018], both in an information-theoretic sense. Hinton, 2008] minimizes the Kullback-Leibler (KL) divergence between two estimated probability distributions, P and Q, where P is for x in the high-dimensional space, and Q for their representation y in the low-dimensional space:

t-SNE [van der Maaten and
where p ij and q ij are normalized pairwise similarities: p ij is the probability that a data point would choose the data point x j as its neighbor under the Gaussian distribution in the high-dimensional space, and q ij is the probability that a data point y i would choose the data point y j as its neighbor under the t-distribution in the low-dimensional space. p ij and q ij are calculated as follows: where N is the number of data points, σ i is the Gaussian kernel at x i , computed through a user defined perplexity [van der Maaten and Hinton, 2008], and α is the degree of freedom of t-distribution.
UMAP minimizes the fuzzy set cross entropy (CE) between two fuzzy membership functions [McInnes et al., 2018], V and W , where V is for x in the high-dimensional space, and W for their representation y in the low-dimensional space. The loss is defined as: is the distance between data point x i and x j . ρ i is the minimum distance between x i and its neighbors to ensure connectivity of the map. σ i is derived as follows: where k is the number of nearest neighbors when constructing the UMAP graph.
In the low-dimensional space, w ij is calculated as: where a and b are user-defined values [McInnes et al., 2018].
We name the DEN trained with the t-SNE and UMAP loss deep t-SNE and deep UMAP, respectively. In the training process, mini-batches are used to reduce memory consumption for the expensive computation of P and V . We first randomly shuffle the entire dataset, then divide the dataset into mini-batches of a fixed size. After that, the batches are trained one-by-one. When all batches of the entire dataset have been trained, one epoch is accomplished. The loss calculation is as define in (1) and (6), and the two probability distributions P and Q for t-SNE (or V and W for UMAP) are calculated from data within mini-batches.

Deep Recursive Embedding
With introduction of the modern DNN components such as ReLU activation and batch normalization, we can largely improve the ease and efficiency of training, compared with the RBM-based ptSNE. However, we note that the upper limit of the embedding performance remains the same as that of the original t-SNE or UMAP, given that the loss function is identical to its original definition. To push the limit, we further present a recursive training strategy, called deep recursive embedding (DRE).
DRE is proposed to make use of the latent representations extracted by DEN. In principle, the intermediate feature maps from different layers of DNN reflect different levels of abstraction of the original input. As discussed in [Yosinski et al., 2015, Selvaraju et al., 2019, the deeper layers of DNN provide more abstract representations specific to the learning task. In [Elthakeb et al., 2020], the authors demonstrated that the deep layers of DNNs can extract a rich set of features, and leveraging these intermediate representations for knowledge distillation led to significant improvement of network efficiency. In [Johnson et al., 2016], the high-level intermediate feature maps extracted from a pretrained DNN are used to compute the perceptual loss to improve the image reconstruction performance. For our purpose, the high-level representations obtained by Deep t-SNE (or Deep UMAP) also contain relevant information related to the task of DR. Although unsupervised, these representations are optimized with the information-theoretic loss that strives to match high-dimensional and low-dimensional statistics. Such representations therefore serves as proper representation of the original data to progressively improve the performance of DR.
We propose to make use of the intermediate output of DEN for computation of P or V . This process is therefore recursive, as illustrated in Fig. 2. In each recursion step, we invoke intermediate outputs of feature extraction. Then, based on these extracted feature maps, we calculate a new probability distributionsp j|i andṽ j|i forP andṼ : where f (·) is high-level feature extraction function realized by the feature extraction part of DEN. Based on this new probability distributions, we can define the recursive t-SNE loss and recursive UMAP loss: At the beginning of recursive training, we calculate P and V once to estimate the loss in (1) and (6). In principle, we can also continuously update P and V during the training. However, this turned to be time-consuming and our experiments showed that the improvement was minor.

Multi-Stage Losses
Our design also allows flexible combination of loss functions at different stage of recursion. In theory, UMAP pushes the dispersed points between different clusters to the nearest ones, driven by the second term in its loss function (6). This results in a desirable visual effect promoting clean margins between classes (albeit not necessarily better classification performance or representation, as show in Results). We exemplify the flexible design of DEN by introducing a two-stage loss: first optimize DEN by the t-SNE loss, then refine the final visualization by the UMAP loss. The workflow is as follows: In stage 1, the network is trained with the original t-SNE loss L t−SNE between y and x; then, the network is recursively updated by minimizing the recursive t-SNE loss L t−SNE between y and the latent representations, generated by the output of Dense 2000, Dense 500 and Dense 100 layers, where the feature extractors are defined as f 2000 (·), f 500 (·), and f 100 (·), respectively. This is shown as Recursion 1, Recursion 2 and Recursion 3 in Fig. 2 (STAGE 1).
In stage 2, the DEN is further fine-tuned with the UMAP loss. This step is optional and for visualization purpose. This is shown in Fig. 2 (STAGE 2).

Datasets
We experimented with 5 public datasets, covering different data types, sizes, and characteristics: (1) MNIST Lecun et al. [1998], the handwritten digits dataset widely used to evaluate machine learning algorithms, which contains 60,000 training images and 10,000 testing images with size 28×28, in 10 classes from digit 0 to dig-it 9.
(2) Fashion- MNIST Xiao et al. [2017], the fashion product database by Zalando, developed in the same format as MNIST, containing 60,000 training images and 10,000 testing images of size 28×28, including 10 classes: T-shirt, trousers, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. Fashion-MNIST is another benchmark database for machine learning algorithms, more challenging than MNIST.
(3) RNA-Seq dataset includes 23,822 single-cell transcriptomes. The cells were isolated from the primary visual cortex (VISp) and anterior lateral motor cortex (ALM) of adult mouse. The cluster label is defined in Tasic et al. [2018], which is used for visualizing different cell types. (4) IMDB dataset Maas et al. [2011] includes 25,000 movie rating data used for sentiment analysis. All data are classified into positive and negative comments. Each comment was preprocessed to transform the textual sequences into a 500-dimensional word-vector. (5) InfiMNIST Simard et al. [1991]: an open source method to generate an infinitely large database of handwritten digits based on MNIST. This dataset is used to test the scalability of DR algorithms.

Implementation
Our environment is a Linux workstation with a NVIDIA Tesla V100 GPU, with 50GB system memory and 16GB GPU memory. The proposed DEN has a number of loss-related and network-related hyperparameters. In our experiments, we used Model A in Fig. 1 for vector input including RNA-Seq and IMDB, and Model B in Fig. 1 for image input including MNIST, Fashion-MNIST, and InfiMNIST. For the loss-related hyperparameters, we set the perplexity to 30 and the degree of freedom of t-distribution to 1. For the network-related hyperparameters, a mini-batch size of 2500 is chosen. Given one mini-batch of 2500 randomly selected data points from the input, P ij is computed to derive the gradient for updating the network parameters. Then the next mini-batch follows to update the network parameters further. The Adam optimizer was used for DNN training, with an initial learning rate of 10 −3 , β 1 (the exponential decay rate for the 1st moment estimate) of 0.9, β 2 (the exponential decay rate for the 2nd moment estimate) of 0.999 and an epsilon of 10 −7 . The training epochs of Deep t-SNE is set as 100. The DRE method has the first 100 epochs trained with the original t-SNE loss, and 50 epochs for each subsequent recursion in Stage 1. In Stage 2, an additional 100 epochs were trained with the UMAP loss.

Evaluation Metrics
We used 6 metrics to evaluate the DR performance Espadoto et al. (1) The 1-nearest neighbor (1NN) classification accuracy is a standard measurement as reported in van der Maaten and Hinton [2008], van der Maaten [2009], which can be used to estimate the quality of clustering. With its absolute simplicity, the 1NN classification accuracy can be an indication of the goodness of clustering in the low-dimensional space.
(2) The neighborhood hit measures how well separable the data is in the low-dimensional space, which helps gauge if a technique is good for data exploration. The neighborhood hit is defined as: which indicates the proportion of K neighbors N (K) i of a point i in the low-dimensional space who have the same label l as point i itself, averaged over all points in the low-dimensional space (K = 7 as commonly used in literature).
(3) The trustworthiness measures the proportion of points in D that are also close in its mapping, suggesting how much one can trust the local patterns in a projection. The trustworthiness is defined as: where r(i, j) represents the rank of the low-dimensional data point j according to the pairwise distances between the low-dimensional data points, and U (K) i represents the set of points that are among the K nearest neighbors in the low-dimensional space but not in the high-dimensional space.
(4) The continuity measures the proportion of points in its mapping that are also close together in its original space, and is closely related to the missing neighbors of a projected point. The continuity is defined as: wherer(i, j) represents the rank of the high dimensional data point j according to the pairwise distances between the low-dimensional data points, and V (K) i represents the set of points that are among the K nearest neighbors in the high-dimensional space but not in the low-dimensional space.
(5) The normalized stress measures the relative preservation of point-pairwise distances, which can be expressed as: where ∆ L and ∆ H are distance metrics for data points in low-and high-dimensional space, respectively.
(6) The Shepard goodness measures the overall distance preservation by computing the Spearman rank correlation of Shepard diagram. The formula of generating a Shepard diagram is as follows: 5 Results

Effect of Recursive Training
During recursive training, we observed that the clusters were gradually refined to better reflect the inter-cluster relationship, resulting in better separable clusters. This is reminiscent of the late exaggeration in FIt- SNE Linderman et al. [2019], which also forces clusters to be tighter and more apart. For an empirical comparison between FIt-SNE and DRE, we implemented FIt-SNE with different late exaggeration settings, as well as DRE with different number of recursions. Fig. 3 shows a comparison between the two different strategies on the MNIST and RNA-Seq datasets. We observed that both late exaggeration and recursive training resulted in similar effect of map optimization by shortening the distance between data points within the same class while prolonging the distance between data points in different classes, both in an unsupervised manner. While FIt-SNE realizes this effect by increasing the repulsive forces explicitly during optimization, our method recalculates the distribution P using an updated representation of the original high-dimensional data, obtained through DNN. The two strategies are not equivalent per se, but both lead to a better separation of the inherent clusters in data. Closer observation shows that the DRE preserved a better visual balance between global and local structures. Within clusters we can better appreciate sub-structures in data, especially in the RNA-Seq data (lower panel b) with a natural hierarchical structure.

Comparison with other Embedding Methods
We compared the proposed DEN methods with other reference methods on the MNIST, Fashion MNIST, RNA-Seq and IMDB datasets. For illustrations We included in total 7 methods for visual evaluation: (1) PCA, (2) AE, (3) t-SNE (Barnes-Hut implementation in Python), (4) UMAP, (5) DL Projection Espadoto et al. [2020] (trained by t-SNE), (6) DRE (with t-SNE loss), (7) DRE (with t-SNE and UMAP loss). Quantitative evaluation metrics are reported in Table 1, which compared a more extensive set of embedding methods.  projection method (column e) method reproduced the results of t-SNE (column c) both in training and testing. However, we noticed a change of cluster distribution (colunm c and e), due to the different optimization of t-SNE in separate runs. The last two columns show the embedding results from the proposed DRE method, with t-SNE loss (column f) and t-SNE + UMAP loss (column g), respectively. Both exhibited visually improved 2D embeddings. Even without color coding, we can still observe 10 distinct clusters, better separated compared with t-SNE or UMAP. Quantitatively, our proposed DRE method also showed superior 1NN and neighborhood hit value, comparable trustworthiness, continuity and normalized stress compared with those from the t-SNE and UMAP methods. Nonetheless, we noticed that the Shepard goodness was reduced, as the recursive training strategy exaggerated the inter-cluster distances, while Shepard goodness is a metric based on the original data distribution.
The Fashion-MNIST dataset is generally more challenging to cluster or classify than the MNIST dataset. We observed more distinct clusters (e.g. tops, bottoms and shoes) by our method than UMAP or t-SNE, however, the confusion of certain classes (e.g. shirt, pullover and coat) remained similar.
The results of RNA-Seq are shown in the first and second rows of Fig. 5, in two color schemes reflecting the hierarchy of global and local data structure. The RNA-Seq can be divided into three major types: glutamatergic excitatory neurons (GABA), GABAergic inhibitory neurons (Gluta) and non-neuronal cells (Non-neu), color-coded in red, blue and black in the first row. It can be observed that the PCA method could differentiate the three major types of cells very well, presenting the global structure in data. In each major cell type, there are also a large number of subtypes, in total 133 (details can be find in Tasic et al. [2018]). The second row of Fig. 5 color-codes the cell subtypes in a refined color scale. As the local data structures are nonlinear, PCA can no longer differentiate these subtypes, in contrast to nonlinear methods such as t-SNE and UMAP, which exhibited clustering according to cell subtype. However, we observed a loss of global structure in the results of t-SNE and UMAP, with the global distribution of three major types no longer consistent, i.e. scattered and mingled. The proposal DRE method (column f and g) showed a better balance between global and local data structures. The three major types are close as highlighted by the dotted ellipses in Fig. 5 column g, whereas the subtypes can also be appreciated.
The results of IMDB dataset is shown in the third row of Fig. 5. The original data are only classified into positive and negative comments, however, the actual sentiments are much more complicated and subtle. We observed that all embedding methods mixed up the two categories of data, but nonlinear methods (column c-g) showed sub-clusters, which might reflect subtle semantics in comments. In Table 1, we note that the two DRE methods resulted in better quantitative metrics compared to other methods, with the highest neighborhood hit, trustworthiness and Shepard goodness, and lowest normalized stress. However, further evaluation demands more detailed labels of sub-clusters.

Scalability and Generalization
To evaluate the scalability of the embedding methods, we generated huge training and testing datasets from InfiMNIST Simard et al. [1991], and evaluated if the proposed DRE could scale to such extremely large dataset. We trained DRE with 0.05 million, 0.1 million and 0.3 million InfiMNIST data, and embedded an independently generated 1 million InfiMNIST testing dataset. Then we measured the 1NN classification accuracy on the 2D embedding, as reported in Table 2. The proposed DRE method out-performed all other embedding methods, with a 93.52%, 95.58% and 97.17% testing accuracy when trained on 0.05 million, 0.1 million and 0.3 million data, respectively. Fig. 6 shows the embedding results of the 1 million testing data (trained on 0.1 million data), for 6 methods in comparison: (a) PCA, (b) AE, (c) UMAP, (d) DL projection trained with t-SNE, (e) DRE with t-SNE loss, and (f) DRE with t-SNE + UMAP loss, respectively. The embedding of both the training and testing datasets are shown. On the testing results, we zoomed in the local neighborhood at the inside and border of class 9 to inspect embedding details. As the learned embedding was extrapolated from 0.1 million training data onto 1.0 million testing data, we observed scattering of classes at the border.  Overall, the proposed DRE methods resulted in the cleanest neighborhood compared to other embedding methods, which explains the high classification accuracy in Table 2.
In a second experiment, we used a training set of 10K data points to train the embedding methods, and tested them on an increasing testing set of 10K, 30K, and 60K. The 1NN classification accuracy is reported in Table 3. It can be seen that all embedding methods showed stable generalization, suggesting the manifold of MNIST can be adequately   learned and represented in a dimensionality as low as 2. In both Table 2 and Table 3, we observed that the two DRE methods had the best quantitative performance, followed by deep t-SNE and the DL projection method. The results can be intuitively explained as follows: both deep t-SNE and DL projection methods are approaching the limit of t-SNE, while the DRE methods are pushing the limit further by recursive training. Comparing the two DRE methods using t-SNE and composite t-SNE+UMAP losses, however, we noticed little difference in 1NN classification performance or other quantitative measures, although the latter produces visually better separation and clearer margin.

Runtime
We tested the runtime of a large range of existing embedding methods in literature, categorized into parametric and non-parametric DR methods. The runtime are reported in Table 4. The PCA, Eigenmaps and BH t-SNE methods were implemented single threaded on CPU. UMAP used numba's parallel implementation to do multithreaded processing with multiple cores on CPU. The DL projection methods adopted the MulticoreTSNE with 8 NVIDIA Tesla V100 GPU to parallelly calculate the network references, and fitted the model using 1 GPU. The t-SNE and FIt-SNE method utilized the multithreading and C/C++ compiler on CPU. The proposed methods single threaded all computations, and used 1 GPU to train the network. As shown in Table 4, the runtime of the proposed Deep t-SNE, DR t-SNE and DRE increased gradually with more recursions. Compared with non-parametric methods, the DRE method needed less runtime than the publicly available Open t-SNE Poličar et al. [2019], but more runtime than FIt-SNE Linderman et al.
[2019]. Among the parametric methods, the proposed methods were faster than the RBM-based ptSNE. Although our methods are slower than AE and UMAP during training, they are very fast when applied on new data points (second row "test").

Discussion
In this work, we propose to integrate modern DNNs and mathematics principles to embed high-dimensional data, and we name this framework "deep embedding network" -DEN. DEN is a generic framework, which can be designed in a flexible manner in terms of NN architectures, training strategies, and loss functions. Based on DEN, we showed how to design Deep t-SNE and Deep UMAP, and further introduced a recursive training strategy to boost the embedding performance over the original t-SNE or UMAP. We exemplified the flexibility of DEN by the recursive training with two-stage loss functions combining the popular t-SNE and UMAP. We tested the proposed DRE on a variety of public datasets, and evaluated its performance both visually and quantitatively, against a comprehensive set of classical and state-of-the-art embedding methods.
While inspired by t-SNE, the proposed DRE has made the following improvement: (1) the DRE method is parametric and can easily embed new, out-of-sample data; (2) the proposed recursive training strategy can boost the embedding performance over the original t-SNE; (3) empirical experiments show that the DRE method can better preserve the global data structure simultaneously; (4) it can integrate any mathematics rules beyond t-SNE.
Both DL projection method and the proposed DRE method can map large, out-of-sample datasets. This is an inherent advantage of learning-based methods. However, the training mechanisms are very different: DL Projection solves the DR problem by learning from the results of t-SNE or UMAP, i.e. the network is taught to generate an output similar to that of the reference methods. In contrast, the proposed DRE method provides the network a principled loss function, and then let the network explore the embedding itself.
There are a number of hyperparameters in the proposed method. The mini-batch used in DNN is critical for scalability and has an influence on the embedding results: a small mini-batch cannot fully sample the data distribution, while a big mini-batch demands too much memory as the square matrix of P is dependent on mini-batch size. We empirically selected 2500, but it can also be set larger if memory allows. Other network settings only had minor influence on the final performance according to our experiments, given the capability of DNNs to optimize highly complex loss functions. Another hyperparameter is the number of the recursions, for which we can use a simple rule of thumb: if we are keen in visualizing the global clustering of data, we can increase the number of recursions (e.g. 3) and obtain better separated clusters in a global view; otherwise 1 or 2 recursions are sufficient as our experiments indicated. We argue that the improved global structure preservation by DRE may arise from the use of latent representations, which better capture the global structure in data (for example, PCA is a linear latent representation). Meanwhile, the local structures are captured by DRE as it is initiated from the t-SNE embedding rule, which emphasizes local neighborhood.
There are a few practical weaknesses of the proposed methods. First, our method obtains the parametric mapping by learning from a dataset, so it is more suitable for big data datasets than for small datasets. Second, the proposed methods still need more runtime than PCA, AE and UMAP. Nevertheless, we expect that the runtime can be reduced by multi-threaded implementation.

Conclusion
In this paper, we introduced a generic DEN frame-work, which is flexible in its NN architectures, training strategies, and loss functions. Based on DEN, we proposed a novel DRE strategy to further boost the embedding performance. The proposed framework can combine modern DNNs and any effective embedding rules such as those from t-SNE and UMAP. The proposed DRE method can map out-of-sample data and scale to extremely large datasets. Experiments on a range of datasets demonstrated its improved embedding performance in terms of local and global structure preservation, compared with other state-of-the-art embedding methods.