Hyperbolic Generative Adversarial Network

Recently, hyperbolic spaces have gained popularity in non-Euclidean deep learning because of their ability to represent hierarchical data. We propose that the hierarchical characteristics present in images can be exploited by using hyperbolic neural networks in a GAN architecture. In this study, we test different configurations using fully connected hyperbolic layers in the GAN, CGAN, and WGAN, in what we call the HGAN, HCGAN, and HWGAN, respectively. The results are measured using the Inception Score (IS) and the Fréchet Inception Distance (FID) on the MNIST dataset. Depending on the configuration and space curvature, each proposed hyperbolic version achieves better results than its Euclidean counterpart.

Figure 1: Triangle on a hyperbolic paraboloid.
Hierarchy or cyclicity in the data can be better exploited through embeddings in a Riemannian manifold. A Riemannian manifold is a "curved" space. Intuitively, the concept of curvature can be understood by analyzing a surface: the curvature is given by how much the surface deviates from its tangent plane at each point. Spaces of constant curvature fall into three families [5]: those with positive curvature, or elliptic spaces; those with negative curvature, also known as hyperbolic spaces; and the Euclidean space, with zero curvature [6]. In a space with nonzero curvature, the fifth axiom of Euclid (the parallel postulate) does not hold [7]. This postulate is equivalent to asserting that the internal angles of a triangle add up to 180° [5]. Instead, depending on the intrinsic curvature of the space, the sum of the angles of a triangle is greater than 180° for elliptic spaces and less than 180° for hyperbolic spaces, as shown in figure 1 for a triangle on a hyperbolic paraboloid.
Non-Euclidean spaces with constant curvature have properties that provide better representations for certain types of structured data. Elliptical or spherical spaces are well suited for representing data with a cyclical structure [8] [9]. Hyperbolic spaces, and particularly the Poincaré ball, share the same metric structure as trees [10], which makes them well suited for hierarchically structured data. Hierarchical structural properties are strongly present in text and graphs [11] [12], motivating different works that use the Poincaré Ball space. Tifrea et al. proposed an adaptation of the GloVe [13] embedding algorithm to a Poincaré Ball [14], and Dhingra et al. developed a text embedding in a hyperbolic space using neural networks [15]. In [16], a method is proposed for link prediction over a graph of words embedded in a low-dimensional Poincaré ball, achieving better results in word similarity and lexical entailment. However, besides text, other data types have underlying or latent tree structures, and for these Poincaré spaces have been used less. Images do not present an apparent hierarchical behavior; however, they can be grouped into classes because of their similarities. These classes can in turn be grouped into other classes, increasing the abstraction level or, equivalently, going up in a tree structure. An example of hierarchy in images is the WordNet-ImageNet [17] [18] dataset. The WordNet dataset is composed of words (nouns and adjectives) organized in a tree, and the images in the ImageNet dataset are organized following the WordNet tree. The hyperbolic space can leverage the hierarchy of ImageNet, as shown in figure 2, where the WordNet-ImageNet mammals tree is embedded in a two-dimensional Poincaré Ball using the method proposed in [16]. This argument was first made in [19], which claims that the hierarchical semantic structure of language concepts can also be present in the images of those concepts, as in our example with mammals in figure 2.
In that work, the authors modify the last section of three network architectures, adding a hyperbolic embedding into a Poincaré space followed by hyperbolic layers, for the few-shot learning task. The hyperbolic layers were proposed in [20], which adapts the MLP, RNN, GRU, and MLR to the hyperbolic space and adds two layers for moving between spaces: the exponential map, from Euclidean to hyperbolic space, and the logarithmic map, in the opposite direction. Hyperbolic Neural Networks (HNN) have been useful for data exhibiting a hierarchical latent structure because they increase representation fidelity, with less distortion and lower dimensional requirements [8].
As with embeddings, hyperbolic spaces have been employed to improve generation tasks in deep learning, for both text and images. Shuyang Dai et al. [21] developed text generation in hyperbolic space using a hyperbolic version of a Variational Auto Encoder (VAE). Hyperbolic VAEs have also been used to hierarchically represent images on the Poincaré ball [22] [23]. The VAE can be used both to learn efficient representations and for generation. However, the most popular architecture for image generation tasks is the Generative Adversarial Network (GAN) [24].
A GAN consists of two networks that participate in an adversarial game during training: a generator network that tries to generate data similar to the training dataset, and a discriminator network that tries to differentiate between real and generated data. The generator network is trained in an unsupervised way by evaluating its output against the discriminator. The discriminator is trained in a supervised way to classify real and fake images. After the networks have been trained, the generator can produce data similar to the dataset, but with original variations, from a noise input alone. There are multiple variations and modifications of the original GAN but, to the best of our knowledge, there is no application of hyperbolic space to GAN architectures that aims to exploit the hierarchical characteristics of the data, as has been done for the VAE architecture.
We propose that the GAN architecture can take advantage of the hierarchical characteristics present in images by using hyperbolic neural networks, and thereby achieve better quality and diversity in the generated images. In this work, we use the GAN [24], WGAN [25], and CGAN [26] architectures to build their hyperbolic versions HGAN, HWGAN, and HCGAN. In each case, we conducted experiments with different arrangements and combinations of Euclidean and hyperbolic layers and with different curvature values. The performance on the MNIST dataset was measured by the Inception Score (IS) [27] and the Fréchet Inception Distance (FID) [28], with better results for some layer arrangements of each architecture.

A. The Poincaré Ball
The hyperbolic spaces $\mathbb{H}^n$ are $n$-dimensional, homogeneous, simply connected Riemannian manifolds with constant negative curvature. There are multiple models of hyperbolic space, but this research focuses on the Poincaré Ball. The Poincaré Ball manifold $(\mathbb{D}^n_c, g^{\mathbb{D}^n_c}_u)$ is an open $n$-ball in $\mathbb{R}^n$ of radius $1/\sqrt{c}$. In this space, the ball center is the hyperbolic counterpart of the Euclidean zero, and the ball boundary lies at infinity.
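The Poincaré Ball $(\mathbb{D}^n_c, g^{\mathbb{D}^n_c}_u)$ is defined as:

$$\mathbb{D}^n_c = \{u \in \mathbb{R}^n : c\|u\|^2 < 1\} \tag{1}$$

With the Riemannian metric tensor:

$$g^{\mathbb{D}^n_c}_u = (\lambda^c_u)^2 g^E \tag{2}$$

$$\lambda^c_u = \frac{2}{1 - c\|u\|^2} \tag{3}$$

where $\lambda^c_u$ is called the conformal factor of this space, and $g^E = I_n$ is the Euclidean metric tensor in the canonical basis. With $T_u$ denoting the tangent space at $u$, the exponential map takes a vector $x \in T_u\mathbb{D}^n_c \cong \mathbb{R}^n$ and assigns it to a vector $v \in \mathbb{D}^n_c$. Its equation is given by:

$$\exp^c_u(x) = u \oplus_c \left( \tanh\left( \sqrt{c}\, \frac{\lambda^c_u \|x\|}{2} \right) \frac{x}{\sqrt{c}\,\|x\|} \right) \tag{4}$$

where the Möbius addition $\oplus_c$, for $u, v \in \mathbb{D}^n_c$, is defined as:

$$u \oplus_c v = \frac{(1 + 2c\langle u, v\rangle + c\|v\|^2)\,u + (1 - c\|u\|^2)\,v}{1 + 2c\langle u, v\rangle + c^2\|u\|^2\|v\|^2} \tag{5}$$

The inverse of the exponential map, the logarithmic map, takes a vector $v \in \mathbb{D}^n_c$ and assigns it to a vector $x \in T_u\mathbb{D}^n_c$:

$$\log^c_u(v) = \frac{2}{\sqrt{c}\,\lambda^c_u} \operatorname{artanh}\left( \sqrt{c}\,\|{-u} \oplus_c v\| \right) \frac{-u \oplus_c v}{\|{-u} \oplus_c v\|} \tag{6}$$

For simplicity we choose $u$ as the origin of $\mathbb{R}^n$, and equations (4) and (6) reduce to:

$$\exp^c_0(x) = \tanh(\sqrt{c}\,\|x\|) \frac{x}{\sqrt{c}\,\|x\|} \tag{7}$$

$$\log^c_0(v) = \operatorname{artanh}(\sqrt{c}\,\|v\|) \frac{v}{\sqrt{c}\,\|v\|} \tag{8}$$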
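The maps above can be illustrated with a minimal sketch in plain Python. It assumes the standard closed forms from [20] for the Möbius addition and for the exponential and logarithmic maps centered at the origin; the function names and the list-of-floats representation are illustrative choices, not the implementation used in this work.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(u, v, c):
    """Möbius addition of two points of the Poincaré ball with curvature parameter c."""
    uv, uu, vv = dot(u, v), dot(u, u), dot(v, v)
    coef_u = 1 + 2 * c * uv + c * vv
    coef_v = 1 - c * uu
    den = 1 + 2 * c * uv + c * c * uu * vv
    return [(coef_u * a + coef_v * b) / den for a, b in zip(u, v)]

def expmap0(x, c):
    """Exponential map at the origin: tangent (Euclidean) vector -> ball."""
    n = norm(x)
    if n == 0.0:
        return list(x)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * a for a in x]

def logmap0(v, c):
    """Logarithmic map at the origin: ball -> tangent (Euclidean) vector."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * a for a in v]

# Illustrative values (assumptions): c = 0.01 gives a ball of radius 10.
c = 0.01
x = [3.0, -4.0]
v = expmap0(x, c)        # lies strictly inside the ball
x_back = logmap0(v, c)   # recovers x up to floating-point error
```

The round trip `logmap0(expmap0(x, c), c)` returns the original vector, which is the property the hyperbolic layers rely on when moving between the two spaces.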

B. Hyperbolic Deep Learning
Hyperbolic spaces have a high capacity to represent data with tree-like structures. Figure 3 shows how a square grid takes on a tree-like structure in a 2D Poincaré ball. When applied to neural network layers, this property allows a better representation of hierarchical data, which is exploited in Hyperbolic Neural Networks (HNNs) [20]. The HNN uses gyrovectors to implement the basic neural network operations in the Poincaré Ball. Gyrovector spaces provide an analogue of a vector space in a non-Euclidean space such as the hyperbolic one. The gyrovector space used to operate in the Poincaré Ball is the Möbius gyrovector space, with two binary operations: the Möbius addition (5), and the Möbius scalar multiplication of $k \in \mathbb{R}$ by a vector $v \in \mathbb{D}^n_c$, defined as:

$$k \otimes_c v = \frac{1}{\sqrt{c}} \tanh\left( k \operatorname{artanh}(\sqrt{c}\,\|v\|) \right) \frac{v}{\|v\|} \tag{9}$$

Equation (9) together with equation (5) allows us to build feedforward networks. A feedforward network consists of an affine transformation followed by a non-linear activation function.
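As a hedged illustration of the Möbius scalar multiplication, the following sketch assumes its standard closed form from [20], $k \otimes_c v = \frac{1}{\sqrt{c}}\tanh(k \operatorname{artanh}(\sqrt{c}\|v\|))\frac{v}{\|v\|}$. The function name and the sample values are illustrative only.

```python
import math

def mobius_scalar_mul(k, v, c):
    """Möbius scalar multiplication of the ball point v by the real number k."""
    n = math.sqrt(sum(a * a for a in v))
    if n == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    # New norm is tanh(k * artanh(sqrt(c) * ||v||)) / sqrt(c); direction is kept.
    scale = math.tanh(k * math.atanh(sqrt_c * n)) / (sqrt_c * n)
    return [scale * a for a in v]

# Illustrative values (assumptions): unit-radius ball, point of norm 0.5.
c = 1.0
v = [0.3, 0.4]
w = mobius_scalar_mul(6.0, v, c)
```

Because the resulting norm is a hyperbolic tangent divided by $\sqrt{c}$, the product always stays inside the ball, no matter how large $k$ is.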
The hyperbolic feedforward network [20] has a gyrovector-based affine transformation, composed of a Möbius matrix-vector multiplication and a Möbius addition of the bias vector. The Möbius matrix-vector multiplication was defined in [20]; its implementation is similar to the Möbius scalar multiplication (9) in that it uses the exponential and logarithmic maps. For $v \in \mathbb{D}^n_c$ to be multiplied by a matrix $M \in \mathbb{R}^{m \times n}$, $v$ is taken to the Euclidean space by a logarithmic map and multiplied by $M$. The exponential map is then applied to the resulting vector. This operation ($\mathbb{D}^n_c \to \mathbb{D}^m_c$) is given by:

$$M \otimes_c v = \exp^c_0\left( M \log^c_0(v) \right) \tag{10}$$

The bias Möbius addition is a translation of the gyrovector $v \in \mathbb{D}^n_c$ by the bias $b \in \mathbb{D}^n_c$, which is given by:

$$v \oplus_c b \tag{11}$$

with $\oplus_c$ the Möbius addition (5). Additionally, in order to apply any function to a gyrovector, the Möbius version of the function is required. Similarly to the Möbius matrix-vector multiplication, let $f : \mathbb{R}^n \to \mathbb{R}^m$; its Möbius version is:

$$f^{\otimes_c}(v) = \exp^c_0\left( f(\log^c_0(v)) \right) \tag{12}$$

C. GAN
Generative Adversarial Networks [24] are an architecture of two networks that participate in an adversarial game during training. One, the generator, is in charge of producing artificial images from a noise input. The other, the discriminator, is in charge of distinguishing real images from artificially generated ones. Therefore, during training, the generated images are passed to the discriminator, whose input alternates between artificial and real images. The roles in the adversarial game are as follows: the generator $G$ tries to produce images similar to the real ones in order to fool the discriminator $D$; the discriminator, in turn, tries to detect whether a particular image is real or artificially generated by $G$. In this way, the generator $G(z; \theta_g)$ learns the distribution $p_{data}$ of the images $x$ from the noise input $z$, and the discriminator $D(x; \theta_d)$ estimates the probability of $x$ being real rather than artificial. Equation (13) shows the loss function for the GAN, a minimax game, as proposed by [24]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{13}$$

where $p_z$ denotes the distribution of the noise input $z$.
The GAN's training finishes when the discriminator is unable to distinguish real images from generated images. Once trained, the generator network can be used independently for generating images. However, its output cannot be controlled in any way. Furthermore, the GAN can be challenging to train because of vanishing gradients and mode collapse; more details about these issues can be found in [29]. In this work, in addition to the GAN, we use two modifications of the original GAN architecture: the Conditional GAN (CGAN) [26] and the Wasserstein GAN (WGAN) [25].
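As a small numeric illustration of the minimax objective of [24], the sketch below estimates the value function $V(D, G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$ from discriminator outputs on a batch. The function name, the `eps` guard, and the sample probabilities are assumptions made for the example.

```python
import math

def gan_value(d_real, d_fake, eps=1e-12):
    """Batch estimate of V(D, G) from discriminator probabilities.

    d_real: D(x) for real images, d_fake: D(G(z)) for generated images.
    eps is a hypothetical guard against log(0).
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
v_confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator scores real high and fake low.
v_confident = gan_value([0.9, 0.9], [0.1, 0.1])
```

When the discriminator outputs 0.5 for every input, the value equals $-\log 4$, which corresponds to the stopping condition described above; a discriminator that separates the two sets attains a larger value.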
The CGAN has additional class label information concatenated to the input layer of the generator and discriminator networks. This additional information, such as the digit label when working with the MNIST dataset, conditions both networks. Equation (15) shows the loss function for the CGAN, which includes the label information $y$ for conditioning the discriminator and the generator:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] \tag{15}$$
The WGAN addresses the issues of vanishing gradients and mode collapse of the original GAN architecture. To achieve this goal, it uses the Wasserstein distance, which provides a smoother gradient everywhere, but the distance in its original form is intractable. The Kantorovich-Rubinstein duality simplifies the distance to (16):

$$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)] \tag{16}$$

where $\sup$ indicates the supremum, $P_r$ and $P_\theta$ are two distributions, and $f$ is a 1-Lipschitz function that meets the following constraint:

$$|f(x_1) - f(x_2)| \le \|x_1 - x_2\| \quad \text{for all } x_1, x_2 \tag{17}$$

To calculate the Wasserstein distance it is necessary to find a 1-Lipschitz function. In practice, this function can be learned by a neural network, and the discriminator network $D$, without a sigmoid function in the output layer, is an ideal candidate for this task. Under this configuration, the output of the discriminator can be any real number, and the higher this score, the closer the input image is to a real image. To force $D$ to comply with the 1-Lipschitz restriction, the WGAN can apply a gradient penalty [30]. The WGAN with Gradient Penalty (WGAN-GP) works because a differentiable function $f$ is 1-Lipschitz if and only if its gradient has norm less than or equal to one everywhere (see the proof in [30]). The loss function of the WGAN-GP is given by:

$$L = \mathbb{E}_{z \sim p_z}[D(G(z))] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x}}\left[ \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \right] \tag{18}$$

where $\hat{x} = \epsilon x + (1 - \epsilon) G(z)$ is sampled between fake images $G(z)$ and real images $x$, with $\epsilon$ uniformly sampled between 0 and 1, and $\lambda$ is the penalty coefficient.
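The gradient penalty can be sketched on a toy case. The example below assumes a linear critic $D(x) = w \cdot x + b$, chosen only because its input gradient is exactly $w$ and needs no automatic differentiation; a real critic is a neural network whose gradient must be computed by backpropagation. The coefficient `lam=10.0` is the default suggested in [30], not a value reported in this work, and all names are illustrative.

```python
import math
import random

def critic(x, w, b):
    """Toy linear critic D(x) = w . x + b (an assumption for this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def critic_input_grad(x, w):
    """Gradient of the linear critic with respect to its input: always w."""
    return list(w)

def wgan_gp_critic_loss(real, fake, w, b, lam=10.0, rng=None):
    """Mean of D(fake) - D(real) + lam * (||grad D(x_hat)|| - 1)^2 over the batch."""
    rng = rng or random.Random(0)
    total = 0.0
    for x, x_fake in zip(real, fake):
        eps = rng.random()  # uniform in [0, 1)
        # Interpolate between a real and a fake sample.
        x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(x, x_fake)]
        grad = critic_input_grad(x_hat, w)
        grad_norm = math.sqrt(sum(g * g for g in grad))
        penalty = lam * (grad_norm - 1.0) ** 2
        total += critic(x_fake, w, b) - critic(x, w, b) + penalty
    return total / len(real)

real = [[1.0, 0.0]]
fake = [[0.0, 0.0]]
loss_unit = wgan_gp_critic_loss(real, fake, w=[0.6, 0.8], b=0.0)   # ||w|| = 1
loss_steep = wgan_gp_critic_loss(real, fake, w=[1.2, 1.6], b=0.0)  # ||w|| = 2
```

With a gradient norm of one the penalty vanishes and only the Wasserstein term remains; doubling the weights doubles the score gap but adds a penalty of `lam`, which is how the term discourages critics that violate the 1-Lipschitz constraint.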
III. HYPERBOLIC GAN (HGAN)
The central assumption in this work is that hyperbolic neural networks can improve the GAN's performance, in either the generation or the discrimination process, because the hyperbolic layers can leverage the hierarchical characteristics of the images [19]. This hyperbolic version of the GAN, the HGAN, mixes hyperbolic and Euclidean spaces in the generator, the discriminator, or both, replacing some of the Euclidean linear layers with the hyperbolic linear layers implemented in [19]; therefore, exponential and logarithmic maps are necessary to move between the Euclidean and hyperbolic spaces. The use of different spaces in the neural networks allows the creation of abstract features representing different implicit data structures, such as the hierarchical structure already mentioned. This approach adds more degrees of freedom to the HGAN design: the number of hyperbolic layers, their location, and the curvature of the hyperbolic space (represented by $c$). The HGAN is a family of architectures derived from modifying the original GAN architecture proposed by Goodfellow et al. [24]; figure 4 shows the general architecture that we implemented. Similarly to the HGAN, the HWGAN and HCGAN correspond to modifications of the original WGAN-GP [30] and CGAN [26] architectures in which some Euclidean layers are replaced by hyperbolic layers.
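A hyperbolic linear layer of the kind described in [20] can be sketched as a Möbius matrix-vector multiplication followed by a Möbius bias addition. The plain-Python code below is only an illustration under that assumption: it is not the implementation from [19] used in this work, and the weights, bias, and function names are hypothetical.

```python
import math

def _norm(v):
    return math.sqrt(sum(a * a for a in v))

def expmap0(x, c):
    n = _norm(x)
    if n == 0.0:
        return list(x)
    s = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [s * a for a in x]

def logmap0(v, c):
    n = _norm(v)
    if n == 0.0:
        return list(v)
    s = math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [s * a for a in v]

def mobius_add(u, v, c):
    uv = sum(a * b for a, b in zip(u, v))
    uu, vv = sum(a * a for a in u), sum(b * b for b in v)
    den = 1 + 2 * c * uv + c * c * uu * vv
    return [((1 + 2 * c * uv + c * vv) * a + (1 - c * uu) * b) / den
            for a, b in zip(u, v)]

def matvec(M, x):
    return [sum(m * a for m, a in zip(row, x)) for row in M]

def hyperbolic_linear(v, M, b, c):
    """Map v to the tangent space, apply M, map back, then add the bias with Möbius addition."""
    mv = expmap0(matvec(M, logmap0(v, c)), c)
    return mobius_add(mv, b, c)

# Hypothetical 2 -> 3 layer on a ball with c = 0.01 (radius 10).
c = 0.01
v = expmap0([1.0, 2.0], c)
M = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
b = expmap0([0.5, -0.5, 0.1], c)  # the bias is itself a point of the ball
out = hyperbolic_linear(v, M, b, c)
```

The output remains inside the ball of radius $1/\sqrt{c}$, so several such layers can be stacked before a logarithmic map returns to Euclidean space.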
The architectures of the HGAN family can have both Euclidean and hyperbolic layers in different arrangements, with the corresponding exponential and logarithmic maps, centered at zero, needed to pass from Euclidean to hyperbolic space and vice versa. The different configurations of this architecture are denoted as, for example, $D_{eehh}$ for the discriminator and $G_{hhee}$ for the generator, where the subscripts denote which layers each network is composed of. For example, $G_{hhee}$ means that the generator network comprises an exponential map, two hyperbolic layers, a logarithmic map, and two Euclidean layers; the reading order from left to right corresponds to the order from input to output in the network. For simplicity, both exponential and logarithmic maps are omitted from the notation. With this notation, the GAN architecture is represented by $D_{eeee}G_{eeee}$.
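The notation can be made concrete with a short sketch that expands a configuration string into the sequence of operations it stands for, inserting the exponential and logarithmic maps wherever the space changes. The function name `expand_config` and the `final_log` flag are hypothetical helpers introduced for this illustration.

```python
def expand_config(config, final_log=True):
    """Expand a layer string such as 'hhee' into an ordered list of operations.

    'e' is a Euclidean layer, 'h' a hyperbolic layer, 'exp' and 'log' are the
    exponential and logarithmic maps. final_log controls whether a network that
    ends in hyperbolic layers is mapped back to Euclidean space at its output.
    """
    ops = []
    space = "e"  # the network input lives in Euclidean space
    for layer in config:
        if layer not in ("e", "h"):
            raise ValueError(f"unknown layer type: {layer!r}")
        if layer == "h" and space == "e":
            ops.append("exp")   # entering the Poincaré ball
        elif layer == "e" and space == "h":
            ops.append("log")   # leaving the Poincaré ball
        ops.append(layer)
        space = layer
    if space == "h" and final_log:
        ops.append("log")
    return ops

generator_ops = expand_config("hhee")
discriminator_ops = expand_config("eehh", final_log=False)
baseline_ops = expand_config("eeee")
```

For `"hhee"` the expansion matches the description of $G_{hhee}$ given above: an exponential map, two hyperbolic layers, a logarithmic map, and two Euclidean layers.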
The placement of the hyperbolic layers in the network has a significant influence on the HGAN performance. Three general configurations were tested. The first is the Euclidean-hyperbolic (EH) configuration, consisting of Euclidean layers, an exponential map, and hyperbolic layers at the end. The EH configuration first generates a feature vector in Euclidean space, and the abstract representations are then processed in the Poincaré Ball. We believe that the EH configuration could improve the discriminator network's performance because the hierarchical structure of the images should be relevant in the deeper layers, where the representation is more abstract. The second configuration is the hyperbolic-Euclidean (HE), which consists of an exponential map at the network input, followed by hyperbolic layers, a logarithmic map, and Euclidean layers at the network output. This sequence could improve performance by first creating hierarchical representations in the hyperbolic space that can help the task of the Euclidean layers. Finally, the Euclidean-hyperbolic-Euclidean (EHE) configuration starts with Euclidean layers, followed by an exponential map, hyperbolic layers, a logarithmic map, and Euclidean layers again at the network end. The EHE configuration first creates a representation in the Euclidean space enriched with abstract features that can then be exploited by the hyperbolic layers, and the final layers map it back to Euclidean space. The generator can better exploit this characteristic since it has to create an entire image from a completely non-hierarchical noise input.
Another degree of freedom is the value of $c$, which has a significant influence on the HGAN performance. The Poincaré Ball radius $r$ is related to $c$ by $r = 1/\sqrt{c}$. Therefore, $c$ determines how big the Poincaré ball is, and its effect when mapping the MNIST dataset can be seen in figure 5. The distribution of the magnitude of the images in the hyperbolic space changes as $c$ varies: for $c = 10^{-5}$ ($r \approx 316$) the distribution is near 0, whereas for $c = 10^{-2}$ ($r = 10$) the distribution is squashed toward the value of $r$. A wrong selection of $c$ causes poor behavior or failure to converge, because the data can collapse onto the ball's boundary or onto the origin, and it can also produce numerical instability. Studying the appropriate values of $c$ for a specific dataset is therefore a crucial task.
Finally, the HGAN architectures used the Leaky ReLU as an activation function directly in hyperbolic space, without a logarithmic and exponential map, which differs from the standard procedure for implementing functions in the hyperbolic space. Furthermore, the HGAN has standard dropout layers for regularization, and the optimization method was Adam [31].
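The relation between $c$, the radius, and where mapped data ends up can be checked with a short calculation. The sketch assumes the origin-centered exponential map of [20], for which the norm of the mapped point is $\tanh(\sqrt{c}\,\|x\|)/\sqrt{c}$; the input norm of 10 is an arbitrary illustrative value, not a statistic of MNIST.

```python
import math

def ball_radius(c):
    """Radius of the Poincaré ball for curvature parameter c."""
    return 1.0 / math.sqrt(c)

def mapped_norm(euclidean_norm, c):
    """Norm of exp_0^c(x) for a vector x with the given Euclidean norm."""
    return math.tanh(math.sqrt(c) * euclidean_norm) / math.sqrt(c)

def fraction_of_radius(euclidean_norm, c):
    """How close to the boundary the mapped point lies (0 = origin, 1 = boundary)."""
    return mapped_norm(euclidean_norm, c) / ball_radius(c)

for c in (1e-5, 1e-3, 1e-2, 1e-1):
    r = ball_radius(c)
    frac = fraction_of_radius(10.0, c)  # hypothetical input norm
    print(f"c={c:g}  radius={r:.2f}  fraction of radius={frac:.4f}")
```

For small $c$ the mapped point sits at a tiny fraction of the radius, close to the origin, while for larger $c$ the same input is pushed toward the boundary, consistent with the behavior described for figure 5.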
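One reason applying the Leaky ReLU coordinate-wise is safe in the Poincaré Ball can be checked numerically: with a leak factor between 0 and 1, no coordinate grows in absolute value, so the norm cannot increase and the point stays inside the ball. The sketch below illustrates this observation; the leak factor of 0.2 and the sample point are assumptions for the example.

```python
import math

def leaky_relu(v, slope=0.2):
    """Coordinate-wise Leaky ReLU applied directly to a point of the ball."""
    return [a if a >= 0.0 else slope * a for a in v]

def in_ball(v, c):
    """True if v lies strictly inside the Poincaré ball of radius 1/sqrt(c)."""
    return c * sum(a * a for a in v) < 1.0

c = 0.01                      # radius 10
point = [3.0, -9.0]           # inside the ball: c * ||point||^2 = 0.9
activated = leaky_relu(point)

norm_before = math.sqrt(sum(a * a for a in point))
norm_after = math.sqrt(sum(a * a for a in activated))
```

Here `norm_after` is never larger than `norm_before`, so no exponential or logarithmic map is required to keep the activation inside the ball.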

IV. EXPERIMENTS
The experiments consist of training the architectures on the MNIST dataset. The performance of each architecture is measured with the Inception Score (IS) [32] and the Fréchet Inception Distance (FID) [28].
The GAN implementation has four linear layers in each network. The discriminator input is a vectorized image with 784 dimensions, followed by layers of 1024, 512, 256, and 1 units, respectively. For the generator, the input layer has 128 units, which correspond to Gaussian noise, and the following layers are of size 256, 512, 1024, and 784, respectively. The discriminator contains dropout layers with a rate of 0.1. Both networks use a leaky ReLU with a leak factor of 0.2 as activation, except at the network end, where the discriminator has no activation function and the generator has a hyperbolic tangent. The loss function was the binary cross-entropy with logits, and the training pipeline used the Adam [31] optimizer with a learning rate of 0.001, $\beta_1 = 0.5$, and $\beta_2 = 0.999$. The WGAN-GP used the same structure as the GAN. However, it uses the Wasserstein loss with gradient penalty [25] [30] and the Adam optimizer with a learning rate of 0.0001 and the same beta values. The CGAN follows almost the same structure, but with a discriminator input of 794 and a generator input of 138, that is, ten more nodes in both networks to accommodate the one-hot encoding of the class labels. The optimizer uses the same set of parameters as the WGAN-GP, with the binary cross-entropy with logits as the loss function. The tested networks were composed of blocks of hyperbolic layers at the beginning (HE), in the middle (EHE), and at the end (EH) of both the generator and the discriminator, with different fixed values of $c$. For configurations where both the discriminator and the generator have hyperbolic linear layers, architectures from the previous step were combined into one, with the sole exception of the HWGAN, where the discriminator never showed better performance than the Euclidean version.
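The CGAN input sizes quoted above follow from concatenating a ten-dimensional one-hot label to each network input. A minimal sketch, with hypothetical helper names and zero vectors standing in for the noise and the image:

```python
def one_hot(label, num_classes=10):
    """One-hot encoding of a digit label."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def condition(vector, label, num_classes=10):
    """Concatenate the one-hot label to a network input vector."""
    return list(vector) + one_hot(label, num_classes)

noise = [0.0] * 128   # placeholder for the 128-dimensional Gaussian noise
image = [0.0] * 784   # placeholder for a vectorized 28 x 28 image

generator_input = condition(noise, 3)
discriminator_input = condition(image, 3)
```

The resulting lengths are 138 for the generator and 794 for the discriminator.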
Given the different nature of the discriminator and generator tasks, the curvature for each was not always the same: $c_d$ was used for the discriminator and $c_g$ for the generator. This distinction produced a remarkable improvement in the FID score, from 67.291 for the Euclidean GAN to 18.697 for the HGAN with $c_d = 10^{-5}$ and $c_g = 10^{-3}$.

V. CONCLUSION
This work shows that the HGAN, HCGAN, and HWGAN architectures can perform as well as or, in some cases, much better than the original Euclidean architectures. The performance depends heavily on two main factors: the curvature, through the $c$ parameter, and the architecture configuration. The best performance was achieved for the HGAN with $c = 10^{-3}$, the HCGAN with $c = 10^{-2}$, and the HWGAN with $c = 10^{-1}$. For smaller values of $c$ the radius grows and the hyperbolic layers behave as Euclidean layers; consequently, there is no improvement over the Euclidean version. For $c < 10^{-6}$ the networks did not converge because of numerical instability. One last consideration for the $c$ value was to allow the $c$ used in the discriminator to differ from the $c$ used in the generator. This was motivated by the different nature of the tasks performed by the discriminator and the generator, and it resulted in the best performance improvement for the HGAN, as shown in table I. Regarding the configurations, the experiments show that the EHE and HE configurations performed better for the generator, while the HE and EH configurations worked better for the discriminator. However, the HWGAN never showed a performance improvement with hyperbolic layers in the discriminator. When the discriminator was in the EH configuration, the logarithmic map at the network output was not applied, because it became unstable for one dimension.
The visualization shows each configuration and $c$ value. The rows show the configurations, for example $D_{hhhh}G_{eehh}$, and the columns indicate the $c$ value.
The IS and FID measurements are on the x-axis. Larger values of IS represent better performance (shown by the → arrow), whereas smaller values of the FID score represent better performance (shown by the ← arrow). Therefore, when a network performs well, its IS and FID markers are placed toward the center of the column, near each other; in contrast, markers far from each other and from the center indicate poor performance. Additionally, the performance of the fully Euclidean version is depicted by a vertical segment in each graph line. Figure 8 shows the results for the HGAN; the best performing architectures can be seen for $c = 10^{-3}$. The results for the HCGAN are displayed in figure 8, where many configurations perform better than the Euclidean CGAN for all the tested values of $c$. Finally, figures 9 and 10 show the results obtained for the HWGAN. The best performing architecture was found for $c = 0.1$.