Generalized t-SNE Through the Lens of Information Geometry

t-SNE (t-distributed Stochastic Neighbor Embedding) is one of the most powerful tools for dimensionality reduction and data visualization. By adopting the Student's t-distribution in the original SNE (Stochastic Neighbor Embedding), t-SNE achieves faster and more stable learning. However, t-SNE still suffers from computational instability due to its dependence on the KL-divergence. Our goal is to extend t-SNE in a natural way within the framework of information geometry. With a well-chosen set of parameters, our generalized t-SNE can outperform the original t-SNE; experimental results on MNIST, Fashion-MNIST and COIL-20 confirm this.


I. INTRODUCTION
In recent years, with the increasing complexity of data, statistical analysis and machine learning approaches have become increasingly important. In particular, manifold learning [1], [2], dimension reduction [3], [4], and visualization [5], [6] are useful paradigms for understanding complex data.
Stochastic Neighbor Embedding (SNE) [7] is one of the powerful frameworks for achieving these objectives. SNE compresses the dimension so that the proximity of points in the original space is preserved as closely as possible after compression. However, it is known that SNE suffers from the crowding problem; to solve this problem, t-distributed Stochastic Neighbor Embedding (t-SNE) [8], which adopts the Student's t-distribution with 1 degree of freedom as the probability distribution after compression, is widely used. Although t-SNE stabilizes SNE with simple modifications, it still suffers from computational instability due to the KL-divergence. Many implementations avoid the zero division in the KL-divergence by shifting with a small ε, but this operation is unnatural in probability space and leads to a gap between theory and implementation.
In this paper, we consider extending t-SNE in a natural way to solve this computational difficulty. To achieve this goal, we utilize the framework of information geometry [9], [10].
This theoretical framework allows us to interpret observations using the tools of differential geometry on a Riemannian manifold constructed from a set of probability distributions.

(The associate editor coordinating the review of this manuscript and approving it for publication was Wentao Fan.)

II. NOTATIONS AND PRELIMINARIES
Let X ⊂ R^d be the d-dimensional input space. The goal of manifold learning or dimensionality reduction is to obtain a map f : X → Y, where Y ⊂ R^s is the output space and s ≪ d. For input data points X = {x_i}_{i=1}^n of sample size n ∈ N, t-SNE [8] starts by computing a joint probability distribution over all pairs of data points {(x_i, x_j)}_{1≤i≠j≤n}, represented by a symmetric matrix P = (p_ij)_{1≤i,j≤n} with p_ii = 0:

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\tau_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\tau_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}.

Here the τ_i are tuning parameters, which are usually determined based on a certain perplexity measure and a simple binary search [7], [11]. Similarly, in an s-dimensional map {y_i}_{1≤i≤n} ⊂ R^s, define the joint probability distribution over all pairs {(y_i, y_j)}_{1≤i≠j≤n} through a symmetric matrix Q = (q_ij)_{1≤i,j≤n} with q_ii = 0:

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}.

P and Q are similarity matrices summarizing the pairwise distances of the high-dimensional data points {x_i}_{1≤i≤n} and the low-dimensional map {y_i}_{1≤i≤n}, respectively. Then t-SNE aims to find {y_i}_{1≤i≤n} in R^s that minimizes the KL-divergence [12] between P and Q:

(y_1, \dots, y_n) = \arg\min_{y_1, \dots, y_n} D_{KL}[P \| Q].

Here, the KL-divergence is defined as follows.

Definition 1 (Kullback–Leibler Divergence [12]): The Kullback–Leibler divergence or KL-divergence D_{KL} : P × P → [0, ∞] is defined between two Radon–Nikodym densities p and q of µ-absolutely continuous probability measures by

D_{KL}[p \| q] = \int_X p(x) \ln \frac{p(x)}{q(x)} \, d\mu(x).

The KL-divergence is a measure of the difference between two probability distributions in statistics and information theory [13], [14]. For continuous distributions, the KL-divergence is known to pose computational difficulties.
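As a concrete illustration, the matrices P and Q and the KL objective above can be sketched in a few lines of NumPy. This is a minimal sketch under the simplifying assumption of a shared bandwidth τ for all points, rather than the per-point perplexity calibration:

```python
import numpy as np

def input_affinities(X, tau=1.0):
    """Symmetric input affinities p_ij from Gaussian kernels (fixed bandwidth tau)."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -d2 / (2 * tau ** 2)
    np.fill_diagonal(logits, -np.inf)          # enforces p_ii = 0
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)    # row-wise conditionals p_{j|i}
    return (cond + cond.T) / (2 * n)           # symmetrized joint p_ij

def output_affinities(Y):
    """Student-t (1 degree of freedom) affinities q_ij of the low-dimensional map."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    w = 1.0 / (1.0 + d2)
    np.fill_diagonal(w, 0.0)                   # enforces q_ii = 0
    return w / w.sum()

def kl_objective(P, Q):
    """KL(P || Q) over all ordered pairs; both matrices sum to one."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

Both P and Q sum to one over all ordered pairs, so `kl_objective` is a genuine KL-divergence between two discrete distributions on pairs of points.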
To be more specific, if q takes a small value relative to p, the value of D_{KL}[p \| q] may diverge to infinity. The simplest idea to avoid this is to take a very small ε > 0 and modify D_{KL}[p \| q] as follows:

D_{KL}^{\varepsilon}[p \| q] = \int_X p(x) \ln \frac{p(x)}{q(x) + \varepsilon} \, d\mu(x).

However, since q + ε no longer satisfies the condition for a probability measure, \int_X (q(x) + \varepsilon) \, d\mu(x) = 1, such an extension is unnatural. This computational difficulty is a critical issue in the implementation of t-SNE: a gap arises between the theoretical analysis and the implementation, making it difficult to describe numerical observations analytically.
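A tiny numeric check (in the discrete setting, for concreteness) makes the issue tangible: once q is shifted by ε, the second argument no longer sums to one, so the "divergence" is no longer computed between two probability distributions:

```python
import numpy as np

p = np.array([0.5, 0.5, 0.0])
q = np.array([1.0, 0.0, 0.0])   # q vanishes where p does not

def kl(p, q):
    """Discrete KL-divergence, summed over the support of p."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl(p, q))            # inf: the exact KL diverges

eps = 1e-12
q_shift = q + eps
print(kl(p, q_shift))      # finite, but...
print(q_shift.sum())       # ...greater than 1: no longer a probability vector
```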
As a more natural way to stabilize the KL-divergence, the following skew divergence has been proposed.

Definition 2 (Skew Divergence [15], [16]): The skew divergence D_{SKL}^{(\lambda)} is defined between two Radon–Nikodym densities p and q of µ-absolutely continuous probability measures by

D_{SKL}^{(\lambda)}[p \| q] = D_{KL}[p \| (1 - \lambda) p + \lambda q],

where λ ∈ (0, 1]. Since the mixture (1 − λ)p + λq is itself a probability density and is positive wherever p is (for λ < 1), the skew divergence remains finite without leaving the space of probability measures; λ = 1 recovers the ordinary KL-divergence.
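Continuing the discrete example above, a short sketch shows that the skew divergence of Definition 2 stays finite on exactly the pair (p, q) for which the plain KL-divergence blows up:

```python
import numpy as np

def kl(p, q):
    """Discrete KL-divergence, summed over the support of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def skew_divergence(p, q, lam):
    """D_SKL^(lam)[p || q] = KL(p || (1 - lam) * p + lam * q), lam in (0, 1]."""
    return kl(p, (1.0 - lam) * p + lam * q)

p = np.array([0.5, 0.5, 0.0])
q = np.array([1.0, 0.0, 0.0])   # the exact KL(p || q) is infinite here

# The mixture (1 - lam) p + lam q is a probability vector that is positive
# wherever p is, so the divergence stays finite.
print(skew_divergence(p, q, 0.5))
```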

A. EINSTEIN SUMMATION CONVENTION
In the rest of this paper, the Einstein summation convention [17] will be assumed, so that summation is automatically taken over indices repeated twice in a term, e.g.,

a_i b^i := \sum_i a_i b^i, \qquad \Gamma^k_{ij} x^i x^j := \sum_i \sum_j \Gamma^k_{ij} x^i x^j.
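Incidentally, NumPy's `einsum` mirrors this convention directly: repeated indices in the subscript string are summed over. A minimal sketch contracting a (2,0)-tensor g^{kl} with a covector x_l:

```python
import numpy as np

g_inv = np.array([[2.0, 0.0],
                  [0.0, 0.5]])   # an inverse-metric-like tensor g^{kl}
x = np.array([1.0, 4.0])         # a covector x_l

# Einstein convention: y^k = g^{kl} x_l (the repeated index l is summed over)
y = np.einsum("kl,l->k", g_inv, x)
print(y)   # [2. 2.]
```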

B. CONTRACTION NOTATION FOR PARTIAL DERIVATIVES
The variables in a partial derivative are represented by subscripts when they are clear from the context, e.g.,

\partial_i := \frac{\partial}{\partial \theta^i}.

III. INFORMATION GEOMETRY OF t-SNE
We now consider the interpretation and extension of t-SNE through the lens of information geometry [9], [10]. We assume that P and Q are parameterized by unknown θ and ξ, that is, we assume two statistical models

S = \{ P_\theta \mid \theta \in \Theta \}, \qquad M = \{ Q_\xi \mid \xi \in \Xi \}.

We denote by (S, g^S_{ij}) a statistical model equipped with the Riemannian metric g^S_{ij}. In particular, let g^S_{ij} be the Fisher–Rao metric, the Riemannian metric induced from the Fisher information matrix [18]. Then t-SNE can be regarded as learning a mapping f : S → M from an input manifold (S, g^S_{ij}) to an output manifold (M, g^M_{ij}). We begin with an overview of the geometric concepts of statistical manifolds.
Definition 3 (Christoffel Symbols): Let g be a Riemannian metric, in particular the Fisher information matrix. Then the Christoffel symbols are given by

\Gamma^k_{ij} = \frac{1}{2} g^{kl} \left( \partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij} \right).

Definition 5 (∇^{(1)}-Connection): Let g be the Fisher–Riemannian metric, which is a 2-covariant tensor. Then the ∇^{(1)}-connection is defined by its Christoffel coefficients

\Gamma^{(1)}_{ij,k} = E_\theta\left[ \partial_i \partial_j \ell_\theta \, \partial_k \ell_\theta \right], \qquad \ell_\theta := \ln p_\theta.

Definition 6 (∇^{(−1)}-Connection): Let g be the Fisher–Riemannian metric, which is a 2-covariant tensor. Then the ∇^{(−1)}-connection is defined by

\Gamma^{(-1)}_{ij,k} = E_\theta\left[ \left( \partial_i \partial_j \ell_\theta + \partial_i \ell_\theta \, \partial_j \ell_\theta \right) \partial_k \ell_\theta \right].

More generally, the following ∇^{(α)}-connection can be defined:

\Gamma^{(\alpha)}_{ij,k} = E_\theta\left[ \left( \partial_i \partial_j \ell_\theta + \frac{1-\alpha}{2}\, \partial_i \ell_\theta \, \partial_j \ell_\theta \right) \partial_k \ell_\theta \right],

which recovers the ∇^{(1)}- and ∇^{(−1)}-connections at α = 1 and α = −1, respectively.
Definition 8 (α-Divergence [19]): Let α ∈ R be a real parameter. The α-divergence between two probability distributions p and q is defined, for α ≠ ±1, as

D^{(\alpha)}[p \| q] = \frac{4}{1 - \alpha^2} \left( 1 - \int_X p(x)^{\frac{1-\alpha}{2}} q(x)^{\frac{1+\alpha}{2}} \, d\mu(x) \right),

with the limits D^{(-1)}[p \| q] = D_{KL}[p \| q] and D^{(1)}[p \| q] = D_{KL}[q \| p]. Let N = S ∪ M. The mapping from P_θ to Q_ξ can be regarded as choosing the shortest curve connecting P_θ ∈ N and Q_ξ ∈ N. Such a curve can be formulated as a geodesic, defined as follows.
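For discrete distributions the α-divergence is straightforward to compute; a small sketch verifies the limiting behavior stated above (approximating the limits with α close to ±1):

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence for discrete distributions (alpha != +-1)."""
    coeff = 4.0 / (1.0 - alpha ** 2)
    return coeff * (1.0 - np.sum(p ** ((1 - alpha) / 2) * q ** ((1 + alpha) / 2)))

def kl(p, q):
    """Discrete KL-divergence (both vectors strictly positive here)."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

# alpha -> -1 recovers KL(p || q); alpha -> +1 recovers KL(q || p)
print(alpha_divergence(p, q, -0.999), kl(p, q))
print(alpha_divergence(p, q, 0.999), kl(q, p))
```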

IV. GEODESICAL SKEW t-SNE
Using the notion of geodesics, the skew divergence is generalized in terms of information geometry to the α-geodesical skew divergence D^{(α,λ)}_{GS}[p \| q], which interpolates between p and q along the α-geodesic rather than along the linear mixture (1 − λ)p + λq.

By replacing the KL-divergence in t-SNE with the α-geodesical skew divergence D^{(α,λ)}_{GS}[p \| q], we can derive a geodesical skew t-SNE:

(y_1, \dots, y_n) = \arg\min_{y_1, \dots, y_n} D^{(\alpha,\lambda)}_{GS}[P \| Q].

The update rule of y_i by gradient descent is given as

y_i^{(t)} = y_i^{(t-1)} + \eta \frac{\partial D^{(\alpha,\lambda)}_{GS}[P \| Q]}{\partial y_i} + \alpha(t) \left( y_i^{(t-1)} - y_i^{(t-2)} \right),

where t, η and α(t) are respectively the iteration, the learning rate and the momentum at iteration t (this momentum coefficient α(t) is distinct from the geodesic parameter α). Gradient descent strategies do not always converge to the globally optimal solution, but they are guaranteed to converge to a stationary point under suitable continuity assumptions [22].
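The update above is the standard momentum scheme used by t-SNE implementations. A generic sketch (the `grad` callback supplying the gradient of the objective is hypothetical, and the step subtracts the gradient since we descend the objective; the momentum schedule is an arbitrary illustrative choice):

```python
import numpy as np

def momentum_descent(grad, Y0, eta=0.05, n_iter=100,
                     momentum=lambda t: 0.5 if t < 20 else 0.8):
    """y^(t) = y^(t-1) - eta * grad(y^(t-1)) + alpha(t) * (y^(t-1) - y^(t-2))."""
    Y_prev2, Y_prev = Y0.copy(), Y0.copy()
    for t in range(n_iter):
        update = -eta * grad(Y_prev) + momentum(t) * (Y_prev - Y_prev2)
        Y_prev2, Y_prev = Y_prev, Y_prev + update
    return Y_prev
```

For instance, minimizing the toy objective C(Y) = ||Y||^2 (gradient 2Y) drives the iterates toward zero, illustrating convergence to a stationary point rather than a guarantee of global optimality.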

A. GEOMETRY OF THE EMBEDDED MANIFOLDS
We illustrate the geometric advantages of our generalization of t-SNE. From the definition of the α-geodesical skew divergence, we can see that the generalized t-SNE obtains embeddings along the geodesics associated with the parameters. This suggests that generalized t-SNE may be able to select the optimal parameters from a bundle of curves, whereas the existing KL-divergence-based t-SNE is restricted to a single curve. For a more formal discussion, we introduce the notion of flatness of a manifold.

Definition 12: The curvature tensor of a linear connection ∇ is given by

R(X, Y)Z = \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X, Y]} Z.

Similarly, let R^{(α)} be the curvature tensor with respect to the ∇^{(α)}-connection. A manifold is said to be α-flat if R^{(α)} = 0. Since curvature is a local property that depends on the connection, which curves connect two points with the shortest distance also depends on the properties of the manifold. Figure 1 shows a conceptual diagram to aid intuition.
The original t-SNE assumes the Student's t-distribution with 1 degree of freedom, but for the sake of a general discussion, we consider the following version with r degrees of freedom:

p(x; u, v, r) = \frac{\Gamma\left(\frac{r+1}{2}\right)}{\sqrt{r\pi}\, v\, \Gamma\left(\frac{r}{2}\right)} \left( 1 + \frac{(x-u)^2}{r v^2} \right)^{-\frac{r+1}{2}},

where u is the location parameter, v is the scale parameter and r is the degrees of freedom. It is known that the Fisher information matrix of M is given by

g = \mathrm{diag}\left( \frac{r+1}{(r+3) v^2},\; \frac{2r}{(r+3) v^2} \right).

Proposition 2: The α-curvature tensor of M has R^{(α)}_{1212} as its only independent nonvanishing component, expressed in closed form in r, v and α.

Proof: Let ℓ = ln p. The α-Christoffel symbols follow by direct computation from the Fisher metric above; combining them in the curvature tensor of Definition 12 yields the claim. □

Proposition 3: Embedding manifolds of the original t-SNE have the constant curvature R^{(α)}_{1212} = -\frac{1}{2 v^4}.

Proof: Substituting r = 1 into Proposition 2 proves the claim immediately. □
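As a sanity check on the location entry of the Fisher matrix above, one can estimate E[(∂_u ln p)²] by Monte Carlo and compare it with the closed form. This is a sketch; the sample size and seed are arbitrary:

```python
import numpy as np

def student_t_score_u(x, u, v, r):
    """d/du ln p(x; u, v, r) for the location-scale Student-t density."""
    z = (x - u) / v
    return (r + 1) * z / (v * r * (1 + z ** 2 / r))

rng = np.random.default_rng(0)
u, v, r = 0.0, 2.0, 5.0
x = u + v * rng.standard_t(df=r, size=500_000)

# Fisher information for u: expectation of the squared score
fisher_uu_mc = np.mean(student_t_score_u(x, u, v, r) ** 2)
fisher_uu_exact = (r + 1) / ((r + 3) * v ** 2)
print(fisher_uu_mc, fisher_uu_exact)
```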

V. EXPERIMENTAL RESULTS
In this section, we present several numerical experiments, applying generalized t-SNE to the MNIST [23], Fashion-MNIST [24] and COIL-20 [25] datasets. Since generalized t-SNE includes the original t-SNE as a special case of its parameters, we study the behavior of the method by conducting experiments with multiple parameter settings. Throughout all experiments, we use perplexity = 30.

A. CLUSTERING PERFORMANCE
First, we investigate the clustering performance of generalized t-SNE. We apply k-Nearest Neighbors [26] to the dimensionally reduced representation of the raw data and observe the clustering performance. Table 1 shows the evaluation of the embedding space by the scale-independent quality criterion Q_local [16], a rank-based measure computed from the co-ranking matrix Q = (q_kl)_{1≤k,l≤N−1} [27], which compares the first K neighbourhood ranks F_K = {1, …, K} against the remaining ranks S = {K+1, …, N−1}; an embedding with high Q_local is evaluated as having good properties. From Table 1, we can see that generalized t-SNE outperforms the original t-SNE with a suitable choice of parameters. In fact, a t-test between the parameter set achieving the best performance for each dataset and the performance of the original t-SNE yields t_MNIST = −5.059, t_FashionMNIST = −5.331, t_COIL−20 = −4.883, indicating a significant difference in all experiments.
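The co-ranking machinery behind this criterion can be sketched as follows. This minimal version computes the co-ranking matrix and the standard neighbourhood-preservation curve Q_NX(K), from which rank-based criteria such as Q_local are aggregated (the aggregation details follow [16]):

```python
import numpy as np

def rank_matrix(D):
    """ranks[i, j] = rank of point j among the neighbours of i (by distance)."""
    order = np.argsort(D, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(D.shape[0])[:, None]
    ranks[rows, order] = np.arange(D.shape[1])
    return ranks

def coranking(D_high, D_low):
    """Co-ranking matrix q_kl: how input-space ranks map to embedding ranks."""
    n = D_high.shape[0]
    R_h, R_l = rank_matrix(D_high), rank_matrix(D_low)
    Q = np.zeros((n - 1, n - 1), dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j:
                # self has rank 0, so neighbour ranks run 1..n-1
                Q[R_h[i, j] - 1, R_l[i, j] - 1] += 1
    return Q

def q_nx(Q, K):
    """Fraction of K-nearest neighbours preserved by the embedding."""
    n = Q.shape[0] + 1
    return Q[:K, :K].sum() / (K * n)
```

A perfect embedding (identical distance matrices) yields a diagonal co-ranking matrix and Q_NX(K) = 1 for every K.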
In addition, the visualization results for each dataset are shown in Figures 2, 3 and 4. From these figures, we can observe that the behavior of the generalized t-SNE changes depending on the choice of parameters.

FIGURE 6. Parameter surfaces for skew t-SNE. The change in accuracy of skew t-SNE with respect to the combination of parameters is fairly smooth, indicating that optimization is easy.

B. CONVERGENCE ANALYSIS
Figure 5 shows the numerical convergence analysis of t-SNE and skew t-SNE. In this experiment, we performed optimization by SGD [29] with lr = 0.05 on the MNIST dataset [23] and set the parameters of skew t-SNE to α ∈ {0, 1, 2} and λ = 0.5. From this experiment, we can see that skew t-SNE converges dramatically faster than the original t-SNE.

VI. RELATED WORKS
Many variants of t-SNE have been proposed owing to its excellent experimental results [30]–[32]. Kernel t-SNE [33], [34] is an efficient extension of t-SNE to a parametric framework, which preserves the flexibility of the original t-SNE but enables explicit out-of-sample extensions. In addition, there are variants that leverage the power of GPUs to address the computational complexity of t-SNE [35], [36].
Other research areas around t-SNE include its parameter tuning [11], [37]. As mentioned earlier, the original t-SNE has a large computational time complexity, so the automation of parameter tuning is very important.

VII. CONCLUSION AND DISCUSSION
In this paper, we generalized t-SNE from the perspective of information geometry. Our generalization is derived very naturally, and suggests the possibility of the existence of a better parameter set.

A. FUTURE WORKS
We have the following future studies:
• Deriving the convergence rate of skew t-SNE. In Section V-B, we showed numerically that our skew t-SNE converges significantly faster than the original t-SNE. It is very important for practical applications to derive the convergence rate of the proposed method theoretically, for example O(N) or O(N²).
• Information criterion for parameter selection. As shown in Figure 6, generalized t-SNE produces different clustering results depending on the values of the two parameters. Therefore, criteria for determining these parameters would be very useful [38], [39].
• Additional experiments. For example, it would be useful to investigate the combination of our generalized t-SNE with clustering methods other than k-means [40], [41].