Learning GANs in Simultaneous Game Using Sinkhorn with Positive Features

Entropy regularized optimal transport (EOT) distance and its symmetric normalization, known as the Sinkhorn divergence, offer smooth and continuous distance metrics that metrize weak convergence. They have excellent geometric properties and are useful for comparing probability distributions in some generative adversarial network (GAN) models. Computing them with the original Sinkhorn matrix scaling algorithm, however, is still expensive: the running time is quadratic, O(n²), in the size n of the training dataset. This work investigates the problem of accelerating GAN training when the Sinkhorn divergence is used as a minimax objective. Let G be a Gaussian map from the ground space onto the positive orthant R^r_+ with r ≪ n. To speed up the divergence computation, we propose the use of c(x, y) = −ε log ⟨G(x), G(y)⟩ as the ground cost. This approximation, known as Sinkhorn with positive features, brings the running time of the Sinkhorn matrix scaling algorithm down to O(r n), which is linear in n. To solve the minimax optimization in GAN, we put forward a more efficient simultaneous stochastic gradient descent-ascent (SimSGDA) algorithm in place of the standard sequential gradient techniques. Empirical evidence shows that our model, trained with SimSGDA on the DCGAN neural architecture on the tiny-coloured Cats and CelebA datasets, converges to stationary points, which are local Nash equilibrium points. We carried out numerical experiments to confirm that our model is computationally stable. It generates samples of comparable quality to those produced by prior Sinkhorn and Wasserstein GANs. Further simulations, assessed on the structural similarity index measure (SSIM), show that our model's empirical convergence rate is comparable to that of WGAN-GP.


I. INTRODUCTION
The introduction of the Wasserstein GAN (WGAN) in [1] and its subsequent exposition in [2] propelled optimal transport (OT) to popularity. It has since served as a generic tool to formulate distance metrics and minimax objectives in the Generative Adversarial Network (GAN) framework [3]. The key to using OT metrics in the WGAN formulation lies in the heuristic regularization that enforces a 1-Lipschitz constraint on the discriminator, or critic. The regularization takes several forms, e.g., weight clipping in [1], gradient penalty in [2], and spectral normalization in [4]. Adding a small convex regularization, e.g., an entropic penalty in [5], to the classical linear OT cost is a theoretically sound technique. The entropy regularized optimal transport (EOT) distance and its symmetric normalization, known as the Sinkhorn divergence, are powerful algorithmic tools. They help convexify the minimax objective and come with efficient solvers through the Sinkhorn matrix scaling algorithm; examples include those proposed in [6]–[9]. They also provide better sample complexity and stability with respect to the input data and ground cost robustness, as explained, e.g., in [10]–[12].
The GAN framework considers the minimization of a loss functional L over a parametric family of densities µ_θ. It searches for a solution by computing

min_{θ∈Θ} L(µ_θ, ν),    (1)

where ν is an unknown training data distribution and Θ ⊆ R^d is the feasible domain of the parameter θ. The model distribution is µ_θ := G_θ(ξ) over a ground set X ⊆ R^q. Here, ξ is a known low-dimensional manifold in the latent space Z ⊂ R^q and G_θ : Z → X is a push-forward mapping that sends ξ to µ_θ. To solve this minimization problem, GAN employs two neural networks, namely, the generator G_θ and the discriminator D_ϕ, with θ and ϕ as their respective weight parameters. The adversarial training of a GAN forms a two-player minimax game, in which G_θ captures the data distribution whereas D_ϕ estimates the probability that a sample came from the training data rather than from G_θ. In the space of arbitrary functions G_θ and D_ϕ, a unique solution exists, with G_θ recovering the training data distribution and D_ϕ = 0.5 everywhere [3]. Let S_cϕ denote the Sinkhorn divergence between two distributions µ_θ and ν. Let c_ϕ(x, y) be a convex and symmetric transport ground cost in X² ⊆ R^q. Typical Sinkhorn GANs are formulated with an additional maximization subproblem L(µ_θ, ν) := max_{ϕ∈Φ} S_cϕ(µ_θ, ν). The role of D_ϕ in a Sinkhorn GAN is no longer that of a probability estimator. It becomes a feature map D_ϕ : X → Z that sends µ_θ to a learned latent space [13]. The transport ground cost, instead of being fixed to pure Euclidean distances, such as the ℓ1, ℓ2, and cosine norms, is encoded by D_ϕ. Doing so enables S_cϕ(µ_θ, ν) to adversarially distinguish µ_θ and ν. One can then restate the minimax optimization problem in a Sinkhorn GAN as

min_{θ∈Θ} max_{ϕ∈Φ} S_cϕ(µ_θ, ν),    (2)

where G_θ and D_ϕ represent, respectively, the push-forward and the discriminative feature maps from [13] and [6]. In the WGAN literature, with [1], [2], and [4] as prominent examples, these maps are the generator and the critic networks, respectively.
In discrete cases, the Sinkhorn divergence between the model µ_θ := (1/m) Σ_{i=1}^m δ_{x_i} and the target ν := (1/n) Σ_{j=1}^n δ_{y_j} distributions is the symmetric normalization of the EOT distance W^ε_cϕ. Its entropy regularization parameter is ε → 0. More detail can be found, e.g., in [5] and [6]. The Sinkhorn matrix scaling algorithm can determine the distance W^ε_cϕ, which measures the total cost of moving all masses from µ_θ to ν, in quadratic time O(n m). The typical choice for the transport ground cost is c_ϕ(x, y) := ‖D_ϕ(x) − D_ϕ(y)‖_p.
The Sinkhorn GAN can be interpreted as a differentiable minimax game with two players, G_θ and D_ϕ. The objective S^ε_cϕ(µ_θ, ν) can be either convex or non-convex in θ, but must be concave in ϕ (see [6] for an explanation). It is beneficial that the convexity and computational complexity of the Sinkhorn divergence can be controlled by astute choices of the ground cost c_ϕ(x, y), as shown in [13] and [6]. Good choices include the cosine norm in [7] and [8], the 2-Wasserstein in [6], [14], and the recently-proposed linear-time kernel costs in [12]. We know from [15] and [16] that the stochastic gradient descent-ascent (SGDA) method is commonly deployed to solve the minimax optimization problem in (2). It satisfies the Von Neumann minimax theorem from [17], which reads

min_{θ∈Θ} max_{ϕ∈Φ} S^ε_cϕ(µ_θ, ν) = max_{ϕ∈Φ} min_{θ∈Θ} S^ε_cϕ(µ_θ, ν).    (3)

For a convex-concave (CC) objective, a vast literature on the simultaneous SGDA (SimSGDA) as a solution to the problem is available; the treatments in [18]–[21] are but a few examples. Under the assumption that the objective is locally strongly-convex in θ and strongly-concave in ϕ, SimSGDA has been shown to converge linearly to a local Nash equilibrium in [22]–[24]. At each iteration, it performs a gradient descent along θ and a gradient ascent along ϕ with an equal step size, or learning rate. Irrespective of which player updates its parameters first, SimSGDA satisfies the Von Neumann minimax theorem in (3).
Despite theoretical progress on SimSGDA, prior Sinkhorn GAN implementations in [6], [7], [9], and [12] carried out the more expensive sequential game. It is solved by sequential SGDA (SeqSGDA), in which D_ϕ updates its parameters several times, either sequentially or in an alternating manner, before G_θ does. It is known theoretically from [15], [18], and demonstrated empirically in [13], that SeqSGDA is more suitable for a nonconvex-nonconcave (NCNC) objective.
In this work, we accelerate Sinkhorn GAN training based on two strategies. First, we use SimSGDA, instead of SeqSGDA, to solve the minimax optimization efficiently. Under some mild conditions on the adversarial ground cost from [11], we hypothesize a solution that converges. Second, we deploy the linear-time Sinkhorn Algorithm with a Gaussian kernel from [12] to compute W^ε_cϕ. Our strategies reduce the running time from O(n m), which is quadratic in n, to O(r n), which is linear in n since r is a constant. In addition to the gain in training speed, our approach learns with better generalization than the one proposed in [25]. Through extensive simulations, we demonstrate empirically that our model, with the CC objective, achieves a local Nash equilibrium efficiently. The main tools are mini-batch approximation, the 1st-order SimSGDA algorithm, and spectral normalization on D_ϕ. Our work aligns well with recent game-theoretical studies in [18] and [15] that interpret the minimax optimization in GAN, with the CC objective, as a simultaneous game solvable by SimSGDA.
Notation. The transpose of a matrix A is A^⊤. The canonical inner product (or Frobenius dot-product) of vectors (or matrices) u and v, denoted by ⟨u, v⟩, is Σ_i u_i · v_i. The direct sum and the Kronecker product of dual matrices a and b are a ⊕ b and a ⊗ b, respectively. Given the space R^d, we use X and Y to denote bounded subsets. A set of probability distributions over X is denoted by M¹₊(X) for a given X ⊆ R^d. The notation T : M¹₊(Z) → M¹₊(X) represents a linear map from Z to X. For any x ∈ X, we use δ_x to denote the Dirac (unit mass) distribution at x. The uniform histogram (1/m, ..., 1/m) in R^m_+ is represented by 1_m. For an n × m real-valued matrix P, that is, P ∈ R^{n×m}, the row average and the column average are written as P 1_m ∈ R^n and P^⊤ 1_n ∈ R^m, respectively. Let f : R^d → R be a given real-valued function. Then ∇f is its gradient and ∇²f is its Hessian. When f takes in two variables x ∈ R^{d1} and y ∈ R^{d2}, that is, f : R^{d1} × R^{d2} → R, then ∇_x f and ∇_y f are the partial gradients with respect to x and y, respectively. Its partial Hessian matrices are ∇²_xx f, ∇²_yy f, ∇²_xy f, and ∇²_yx f.

A. CONTRIBUTIONS
Our study yields the following contributions:
• A proposal of an efficient and stable SimSGDA algorithm to train Sinkhorn GANs to stationary points. They converge exactly to local Nash equilibria when using the convex-concave objective.
• An empirical demonstration that simultaneously minimizing the asymmetric distance W^ε_cϕ(µ_θ, ν) and maximizing the symmetric divergence S^ε_cϕ(µ_θ, ν) still lead to stable Sinkhorn GAN training.
• A practical numerical implementation to illustrate the versatility of Sinkhorn with positive features. By adversarially learning a Gaussian kernel induced from a positive feature map, the running time of the Sinkhorn Algorithm goes from quadratic down to linear.
• A structural similarity index measurement of the non-asymptotic convergence rate. We show that our model has a convergence rate comparable to that of WGAN-GP. Both models output samples that are on par with one another in quality.
Table 1 lists prominent GAN models alongside ours for comparative purposes. The models are organized based on their respective minimax objective formulations and optimization techniques.

B. RELATED WORKS
Formulation. The first model that works in large-scale GANs is SGD-AutoDiff [6]. It uses the Sinkhorn divergence as the minimax objective and an MMD-like 2-Wasserstein ground cost. Another model is OT-GAN [7], which uses a mini-batch energy (MBE) distance and the cosine ground cost. EOT-GAN [9] uses the same ground cost as SGD-AutoDiff, but a different minimax objective that combines the Sinkhorn divergence and the Hinge loss. Yet another model, SWGAN [8], uses the Sinkhorn divergence as the minimax objective, validated on both the 1-Wasserstein and encoded cosine ground costs.
To solve the minimax optimization problem in Sinkhorn GANs, our model uses the asymmetric distance W^ε_cϕ(µ_θ, ν) and the symmetric divergence S^ε_cϕ(µ_θ, ν) as the minimization and maximization objectives, respectively. A recent model, SiNG [26], uses the Sinkhorn divergence as a minimax objective with a natural gradient. SiNG adopts the squared ℓ2 ground cost defined by D_ϕ. All of these prior models use the so-called vanilla Sinkhorn matrix scaling algorithm, whose running time is quadratic [5]. In contrast, we use the recently-proposed Sinkhorn with positive features (SiPF). It uses the Gaussian kernel as the map to achieve linear-time performance [12].
Optimization. A variety of 1st-order gradient-based techniques are utilized by prior models. SGD-AutoDiff implements a sequential SGDA (min-max optimization) with weight clipping regularization. OT-GAN utilizes an alternating SGDA (min-max optimization) without any additional regularization. Instead of using SGDA, EOT-GAN proposes a simultaneous SGD in a min-min optimization context. Unfortunately, EOT-GAN performs poorly on higher-resolution datasets and shows instability during training. SWGAN performs an oracle-based non-convex SGD with a proven theoretical convergence guarantee to a stationary distribution, provided that the discriminator can be shown to be approximately optimal.
Our model performs SimSGDA, which minimizes W^ε_cϕ(µ_θ, ν) and maximizes S^ε_cϕ(µ_θ, ν) simultaneously. It uses an equal gradient step size for both the push-forward map G_θ and the discriminative feature map D_ϕ. Instead of implementing the SiPF model in a sequential or alternating SGDA, which had been done in [12], we opt for the simultaneous one. Implementing either of the earlier approaches requires several expensive critic updates for D_ϕ, which are also commonly found in variants of OT-GAN [7]. Our novel optimization approach, via Algorithm 2 below, is stable and more efficient.

II. PRELIMINARIES
We recall necessary notions and known results on EOT, paying special attention to the Sinkhorn divergence's desirable properties. These include smoothness, positivity, convexity, and differential properties.

A. EOT PRIMAL FORMULATION
Let the model and target distributions be µ_θ := (1/m) Σ_{i=1}^m δ_{x_i} and ν := (1/n) Σ_{j=1}^n δ_{y_j}. Given a constant entropy regularization parameter ε → 0, the EOT distance between µ_θ and ν is

W^ε_cϕ(µ_θ, ν) := min_{P ≥ 0, A = 1_n, B = 1_m} ⟨P, C⟩ − ε H(P),    (6)

where H(P) := −Σ_{ij} P_{ij}(log P_{ij} − 1) is the entropy of the plan P, and the Sinkhorn divergence is its symmetric normalization, S^ε_cϕ(µ_θ, ν) := W^ε_cϕ(µ_θ, ν) − (1/2) W^ε_cϕ(µ_θ, µ_θ) − (1/2) W^ε_cϕ(ν, ν). The projection operators are A := P 1_m and B := P^⊤ 1_n. The transport ground cost matrix C collects the pairwise costs c_ϕ between the support points of the two distributions. The EOT adds an entropy scheme to produce a smooth and ε-strongly convex distance with a unique solution P*. It decomposes into

P* = diag(u*) K diag(v*)

for some optimal scalings u* ∈ R^n_+ and v* ∈ R^m_+, with K := exp(−C/ε) being the associated Gibbs kernel. In this formulation, we can use a fixed-point iteration on a modern GPU, known as the Sinkhorn matrix scaling algorithm [5], to determine u* and v*. Once the transport ground cost is fixed, the task entails initializing u and v to any arbitrary positive vectors and then applying the Sinkhorn Algorithm.
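As a concrete illustration of this fixed-point iteration, the following NumPy sketch (our own minimal example, not the paper's implementation; sizes and the random cost are illustrative) runs the vanilla Sinkhorn matrix scaling on a random cost matrix and checks that the recovered plan P* = diag(u*) K diag(v*) matches the prescribed marginals:

```python
import numpy as np

def sinkhorn(a, b, C, eps, L=500):
    """Vanilla Sinkhorn matrix scaling: returns scalings u, v and the plan P."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(L):
        u = a / (K @ v)                  # enforce the row marginal
        v = b / (K.T @ u)                # enforce the column marginal
    P = u[:, None] * K * v[None, :]      # P* = diag(u*) K diag(v*)
    return u, v, P

rng = np.random.default_rng(0)
n, m = 5, 4
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform histograms
C = rng.random((n, m))                             # random ground cost
u, v, P = sinkhorn(a, b, C, eps=0.5)
print(np.allclose(P @ np.ones(m), a), np.allclose(P.T @ np.ones(n), b))
```

A moderate ε is used here only so the fixed-point iteration converges quickly; as the text notes, smaller ε values need a larger iteration budget.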

B. EOT DUAL FORMULATION
The dual formulation comes directly from the application of the Lagrange multipliers α and β and the Legendre transformation. As an unconstrained maximization problem, it is summarized from [27] and [28] in the following proposition. A proof has been supplied in [13, Section II.B].
Proposition II-B.1. The EOT primal formulation in (6) is equivalent to a dual formulation

W^ε_cϕ(µ_θ, ν) = max_{α, β} ⟨α, 1_n⟩ + ⟨β, 1_m⟩ − ε ⟨exp(α/ε), K exp(β/ε)⟩.    (8)

The dual formulation is also equivalent to the maximization problem max_{α, β} F_ε(α, β), where F_ε(α, β) denotes the concave dual objective on the right-hand side of (8). There exists a primal-dual relationship

P* = diag(exp(α*/ε)) K diag(exp(β*/ε)).

It links an OT plan to an optimal (α, β) that solves (8).
When ε > 0, it is clearly beneficial to use the EOT dual formulation since it can be rewritten as the maximization of an expectation with respect to the product measure µ θ ⊗ ν. The dual problem is unconstrained and jointly concave in both variables. This allows us to fix one and optimize over the other.

C. EOT SEMI-DUAL FORMULATION
We focus on α as a function of β in α = T(ν, β) and vice versa in β = T(µ_θ, α). The dual formulation can then be interpreted as a smoothed version of the c-transform that links the dual potentials. A commonly used c-transform of β is β^c(x) := min_{y∈Y} c_ϕ(x, y) − β(y). In EOT, however, we make use of the (c, ε)-transform

β^{c,ε}(x) := −ε log E_{y∼ν}[exp((β(y) − c_ϕ(x, y))/ε)].    (13)

One sees β^{c,ε} as a smoothed version of β^c. It depends on the measure over which the expectation is taken; we omit it from the notation for brevity. The transform leads to a semi-dual EOT formulation [30].
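A small numerical sketch (our own illustration, with ν uniform over n atoms) shows how the smoothed (c, ε)-transform collapses onto the hard c-transform β^c(x) = min_j c(x, y_j) − β_j as ε → 0:

```python
import numpy as np

def c_transform(beta, cost_x):
    # hard c-transform: beta^c(x) = min_j c(x, y_j) - beta_j
    return np.min(cost_x - beta)

def c_eps_transform(beta, cost_x, b, eps):
    # smoothed (c, eps)-transform, evaluated with a stable log-sum-exp
    z = (beta - cost_x) / eps
    zmax = z.max()
    return -eps * (zmax + np.log(np.sum(b * np.exp(z - zmax))))

rng = np.random.default_rng(1)
n = 6
beta, cost_x = rng.random(n), rng.random(n)  # cost_x[j] plays the role of c(x, y_j)
b = np.full(n, 1.0 / n)                      # uniform weights of nu
for eps in (1.0, 0.1, 0.001):
    print(eps, c_eps_transform(beta, cost_x, b, eps))
print("hard:", c_transform(beta, cost_x))
```

The smoothed value always sits above the hard minimum and approaches it as ε shrinks, which is the sense in which β^{c,ε} smooths β^c.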
Proposition II-C.1. The EOT formulation between µ_θ and ν in (8) is equivalent to a semi-dual formulation

W^ε_cϕ(µ_θ, ν) = max_β E_{x∼µ_θ}[β^{c,ε}(x)] + E_{y∼ν}[β(y)].

This semi-dual formulation is equivalent to a maximization of an expectation with respect to only one of the marginals,

W^ε_cϕ(µ_θ, ν) = max_β E_{x∼µ_θ}[B^x_ε(β)].

In discrete cases, B^x_ε(β) is given by

B^x_ε(β) := ⟨β, 1_n⟩ − ε log Σ_{j=1}^n (1/n) exp((β_j − c_ϕ(x, y_j))/ε).

Proof. One replaces α by β^{c,ε} in the dual formulation in (8) and proceeds accordingly.
Remark II-C.1. The potential β of a discrete measure ν with n Diracs is an n-dimensional vector. If ε > 0, then the gradient is ∇_β B^x_ε(β) = ν − X_ε and the Hessian is

∇²_β B^x_ε(β) = −(1/ε)(diag(X_ε) − X_ε X_ε^⊤),

with X_ε the probability vector (X_ε)_j ∝ (1/n) exp((β_j − c_ϕ(x, y_j))/ε). Since it can further be shown that 0 ⪯ (1/ε)(diag(X_ε) − X_ε X_ε^⊤) ⪯ (1/ε) I, we know that the (negated) semi-dual objective is convex with a Lipschitz gradient. A strong convexity, however, requires a strictly positive lower bound on the eigenvalues of the Hessian. The semi-dual formulation is convex but not strongly convex; the lower bound on the eigenvalues of the Hessian is 0 [30]. Convexity is an important success criterion in solving the EOT via gradient descent optimization techniques [15], [18], [31].
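The closed-form gradient ν − X_ε can be checked against finite differences; the sketch below (our own notation, with ν uniform and a fixed x represented by its cost row c(x, y_j)) does so for a discrete B^x_ε:

```python
import numpy as np

def B_eps(beta, cost_x, b, eps):
    # discrete B^x_eps(beta) = <beta, b> - eps log sum_j b_j exp((beta_j - c(x, y_j))/eps)
    z = (beta - cost_x) / eps
    zmax = z.max()
    return beta @ b - eps * (zmax + np.log(np.sum(b * np.exp(z - zmax))))

def grad_B_eps(beta, cost_x, b, eps):
    # closed form: nu - X_eps, with X_eps a softmax reweighting of b
    w = b * np.exp((beta - cost_x) / eps)
    return b - w / w.sum()

rng = np.random.default_rng(2)
n = 5
beta, cost_x = rng.random(n), rng.random(n)
b = np.full(n, 1.0 / n)
eps = 0.5
g = grad_B_eps(beta, cost_x, b, eps)
E, h = np.eye(n), 1e-6
fd = np.array([(B_eps(beta + h * E[j], cost_x, b, eps)
                - B_eps(beta - h * E[j], cost_x, b, eps)) / (2 * h) for j in range(n)])
print(np.allclose(g, fd, atol=1e-5), abs(g.sum()))
```

Note that the gradient sums to zero (both ν and X_ε are probability vectors), which is why the semi-dual objective is invariant to constant shifts of β.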

III. MAIN CONTRIBUTIONS
Instead of the quadratic-cost Sinkhorn matrix scaling algorithm, we equip our GAN implementation with a linear-time iteration by using Sinkhorn with positive features. This section explains the key approximations in developing the SimSGDA Algorithm before completing the model.

A. SINKHORN WITH POSITIVE FEATURES
Computing W^ε_c(µ_θ, ν) is typically done by the Sinkhorn matrix scaling algorithm. It requires the convex ground cost c_ϕ(D_ϕ(x), D_ϕ(y)) to embed a sample (x, y) ∈ X² into a latent space [13]. It is common to define the convex ground cost first, before the kernel K := exp(−C/ε). Doing so leads to a costly quadratic running time O(n²) in the size n of the support of the µ_θ and ν distributions. Despite the cost, all prior Sinkhorn GANs in Table 1 use this vanilla Sinkhorn Algorithm. A kernel-first approach, recently proposed by Scetbon et al. in [12], achieves a linear O(r n) time, where r is a constant given by the dimension of the feature space (R*_+)^r. By utilizing the optimal u* and v* from the primal formulation, we rewrite the dual formulation in (8) as

W^ε_c(µ_θ, ν) = ⟨α*, 1_n⟩ + ⟨β*, 1_m⟩ − ε (exp(α*/ε))^⊤ K exp(β*/ε),    (19)

using the optimal α* := ε log u* and β* := ε log v*. In [12], the negative term (exp(α/ε))^⊤ K exp(β/ε) = u^⊤ K v in (19) is approximated by 1. Hence, given the outputs u* and v* from the Sinkhorn iteration in Algorithm 1, we can estimate W^ε_c(µ_θ, ν) directly by

W^ε_c(µ_θ, ν) ≈ ⟨α*, 1_n⟩ + ⟨β*, 1_m⟩ − ε    (20)

without instantiating any ground cost. In contrast to the work of Scetbon et al., we use a fixed-point Sinkhorn iteration algorithm with the budget L as an additional hyperparameter, as shown in Algorithm 1.
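The following sketch (our own minimal illustration, not the authors' code) shows where the O(r n) cost comes from: when the Gibbs kernel factorizes as K = Φ_x^⊤ Φ_y with r-dimensional positive features, each Sinkhorn update needs only matrix-vector products with the factors, so the n × m kernel is never instantiated:

```python
import numpy as np

def sinkhorn_positive_features(a, b, Phi_x, Phi_y, L=2000):
    """Sinkhorn scaling with K = Phi_x^T Phi_y, never instantiating K."""
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(L):
        u = a / (Phi_x.T @ (Phi_y @ v))   # K v    in O(r(n + m)) flops
        v = b / (Phi_y.T @ (Phi_x @ u))   # K^T u  in O(r(n + m)) flops
    return u, v

rng = np.random.default_rng(3)
r, n, m = 3, 50, 40
Phi_x = rng.random((r, n)) + 0.1          # positive features -> strictly positive kernel
Phi_y = rng.random((r, m)) + 0.1
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
u, v = sinkhorn_positive_features(a, b, Phi_x, Phi_y)
K = Phi_x.T @ Phi_y                        # instantiated here only to verify the result
P = u[:, None] * K * v[None, :]
print(np.allclose(P @ np.ones(m), a), np.allclose(P.T @ np.ones(n), b))
```

The factors here are random positive matrices purely for illustration; in the model, they come from the Gaussian positive feature map composed with D_ϕ.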
The aim of the kernel-first approach is to learn an embedding from the input space into the feature space via two feature map operations. The first entails taking a sample and embedding it into a latent space by using the discriminator network D_ϕ : X → R^d. The second embeds the latent space into a feature space by the mapping kernel G_φ : R^d → (R*_+)^r. It associates each point with a vector in the positive orthant. Thus, by defining h_ϕ(x, y) := (D_ϕ(x), D_ϕ(y)), given a fixed ground cost function c on R^d, we can redefine a parametric ground cost on X as

c_ϕ(x, y) := c(D_ϕ(x), D_ϕ(y)).

These maps enable us to build the corresponding positive-definite kernel

k_ϕ(x, y) := ⟨G_φ(D_ϕ(x)), G_φ(D_ϕ(y))⟩.

As a by-product, and since the kernel is positive, we can define, for all (x, y) ∈ X², the ground cost function

c_ϕ(x, y) := −ε log k_ϕ(x, y).

Although the mapping kernel G_φ can be chosen arbitrarily, this work implements the Gaussian kernel, whose associated ground cost is the ℓ2-norm Euclidean metric [12], already used by previous Sinkhorn GANs in [6], [9]. Our G_φ feature map implementation is based on [12, Lemma 1], with φ initialized from a normal distribution.
Proposition III-A.1. Let I be the identity matrix of the required dimension. Let d ≥ 1, ε > 0, and k(x, y) be the Gaussian kernel on R^d such that, for all x, y ∈ R^d, k(x, y) = e^{−‖x−y‖²₂/ε}. Let W₀ be the Lambert function. Let R > 0, q = R²/(2 ε d W₀(R²/(ε d))), and σ² = q ε/4. Let the random variable u be normally distributed, with u ∼ ρ = N(0, σ² I). For all x, u ∈ R^d, we define the map

φ(x, u) := exp(−2‖x‖²₂/ε + ⟨u, x⟩/s),  with s := σ √(ε/2).

Then, for any x, y ∈ R^d, we have

k(x, y) = E_{u∼ρ}[φ(x, u) φ(y, u)].    (24)

Proof. The proposition can be proven by rewriting the right-hand side of (24) before following the route explained in [12, Appendix A.4].
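A quick Monte Carlo check of the positive-feature identity. The map below is one elementary construction satisfying k(x, y) = E[φ(x, u) φ(y, u)] for the Gaussian kernel; the variance σ² is our own illustrative choice (the Lambert-function tuning in [12] controls the estimator's variance, not the validity of the identity):

```python
import numpy as np

rng = np.random.default_rng(4)
d, eps, sigma2 = 2, 1.0, 0.25
s = np.sqrt(sigma2 * eps / 2.0)

def phi(x, U):
    # positive feature phi(x, u) = exp(-2||x||^2/eps + <u, x>/s), for rows u of U
    return np.exp(-2.0 * np.dot(x, x) / eps + (U @ x) / s)

x = np.array([0.1, 0.2])
y = np.array([0.3, -0.1])
U = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, d))   # u ~ rho = N(0, sigma^2 I)
mc = np.mean(phi(x, U) * phi(y, U))
exact = np.exp(-np.sum((x - y) ** 2) / eps)
print(mc, exact)   # Monte Carlo estimate vs the exact Gaussian kernel value
```

Since φ is strictly positive, averaging over finitely many samples of u yields exactly the kind of positive feature vector that the kernel-first Sinkhorn iteration factorizes over.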
Choosing the Gaussian kernel feature map improves our Sinkhorn GAN in two directions. First, it becomes memory efficient: the computation of the gradient of G_φ is independent of the computation required by the Sinkhorn Algorithm. Second, the Sinkhorn divergence S^ε_c(µ_θ, ν) can be computed to high accuracy even when ε → 0. The main difference between our Sinkhorn GAN implementation and the SiPF model in [12] is that we adopt the key approximations from our prior work [13]. That work improved on a version of SGD-AutoDiff, but not of OT-GAN [7]. Furthermore, Scetbon et al. train their model by using SeqSGDA, in contrast to our SimSGDA proposal.

B. LEARNING GANS WITH SIMSGDA ALGORITHM
Our focus is on the minimax optimization with a CC objective derived from the convex kernel ground cost c_ϕ in (8). A variant of the SimSGDA Algorithm can efficiently find an approximate local Nash equilibrium [15], [16]. To find a local Nash equilibrium with SimSGDA, it is typical to have G_θ and D_ϕ randomly initialize their variables and follow their respective gradients, computed by using reverse-mode AutoDiff [32]. That is, at each step n = 1, 2, . . ., G_θ descends a gradient on S^ε_cϕ(θ, ϕ) while D_ϕ ascends it, with the same gradient step size or learning rate η:

θ_{n+1} = θ_n − η ∇_θ S^ε_cϕ(θ_n, ϕ_n),   ϕ_{n+1} = ϕ_n + η ∇_ϕ S^ε_cϕ(θ_n, ϕ_n).    (25)

Algorithm 2 is our SimSGDA implementation. It comes with an additional heuristic approximation that, as ε → 0, sets the self-term W^ε_cϕ(θ_n, θ_n) ≈ 0 in the generator update at each step. This turns (25) into

θ_{n+1} = θ_n − η ∇_θ W^ε_cϕ(θ_n, ϕ_n),   ϕ_{n+1} = ϕ_n + η ∇_ϕ S^ε_cϕ(θ_n, ϕ_n).    (26)

The approximation intuitively implies that small changes in θ result in negligible changes in the corresponding OT plan between points in the µ_θ and ν distributions [13]. The formal expression is in [8, Theorem 3.1]. Game-theoretically, G_θ can take a different course of action from D_ϕ, namely, minimizing the asymmetric W^ε_cϕ(θ, ϕ), while the latter maximizes the symmetric S^ε_cϕ(θ, ϕ) to compute the discriminative feature map. We leave a theoretical investigation into this approximation for a future work.

Algorithm 2 SimSGDA Algorithm
Input: the real dataset (y_j)_{j=1}^n and the hyperparameters (mini-batch size m, learning rate η, number of Sinkhorn iterations L, and entropy regularization weight ε)
Output: θ, ϕ
Initialization: θ ← θ_0, ϕ ← ϕ_0
1: while θ has not converged do
2:   Sample (y_j)_{j=1}^m from the real dataset
...
We compute the gradient at each step and approximate ∇_θ S^ε_cϕ(µ_θ, ν) by ∇_θ S^{ε(L)}_cϕ(µ̂^m_θ, ν̂^m) in mini-batch sampling. Thus, we can optimize min_θ S^ε_c(µ_θ, ν). Another approach is to use spectral normalization (SN) to enforce the K-Lipschitz constraint on each discriminator layer [4]. SN remains independent of the objective function and the Sinkhorn algorithm since it is a matrix operation.
The SimSGDA hyperparameters (m, η, ε, L) are the mini-batch size m, the learning rate η, the regularization parameter ε, and the number L of Sinkhorn iterations. Note that the procedure AutoDϕ corresponds to the classical reverse-mode automatic differentiation of the L steps of the Sinkhorn iteration. Its running time is, therefore, O(L r n). The training procedure consists of the same optimization steps for the generator G_θ and discriminator D_ϕ feature maps.
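The simultaneous update rule can be illustrated on a toy strongly-convex-strongly-concave objective, a hypothetical stand-in for the CC Sinkhorn objective; both players read the current gradients first and then step with the same learning rate:

```python
# f(theta, phi) = theta^2 - phi^2 + theta * phi: strongly convex in theta,
# strongly concave in phi, with a unique local Nash equilibrium at (0, 0).
grad_theta = lambda th, ph: 2.0 * th + ph
grad_phi = lambda th, ph: th - 2.0 * ph

eta = 0.1                       # one shared step size for both players
th, ph = 1.0, -1.0              # arbitrary initialization
for _ in range(200):
    g_th, g_ph = grad_theta(th, ph), grad_phi(th, ph)   # read gradients simultaneously
    th, ph = th - eta * g_th, ph + eta * g_ph           # descent on theta, ascent on phi
print(th, ph)                   # both iterates approach the Nash point (0, 0)
```

Because the gradients are evaluated before either variable moves, neither player sees the other's current-step update, which is what distinguishes SimSGDA from the sequential and alternating schemes.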

C. OPTIMALITY OF SIMULTANEOUS GAME
The minimax optimization in (2) refers to a two-player minimax game. The generator G θ tries to minimize S ε cϕ (µ θ , ν) with respect to the decision variable θ. The discriminator D ϕ tries to maximize it with respect to ϕ. Neither player knows anything about the critical point of S ε cϕ (µ θ , ν). Both players follow the rules of the game and can act simultaneously. The minimax theorem in (3) guarantees that the order of which player goes first does not matter.
To be concise, we write S^ε_cϕ(µ_θ, ν) as S^ε_cϕ(θ, ϕ). A well-known notion of optimality in this game is that of a local Nash equilibrium. It is the point (θ*, ϕ*) where θ* is a local minimum of S^ε_cϕ(·, ϕ*) and ϕ* is a local maximum of S^ε_cϕ(θ*, ·). By assuming that the objective is differentiable with a well-defined gradient, we can use the following definition of optimality for a simultaneous game.
The local Nash equilibria can also be characterized in terms of the 1st- and 2nd-order conditions.
Theoretically, SGDA can find an ϵ-approximate stationary point within O(κ² log(1/ϵ)) iterations for strongly-convex-strongly-concave objectives, and within O(ϵ⁻²) iterations with a decaying step size for a convex-concave simultaneous game [20], [21]. It is hard to analyse the convergence rate for the Sinkhorn divergence. The first theoretical approximation, based on the EOT dual formulation, provides a convergence guarantee to an ϵ-stationary point in O(ϵ/λ) for SWGAN [8]. This guarantee, however, requires an optimal discriminator.
To empirically determine the SimSGDA Algorithm's convergence rate, we need a proper metric that measures similarity among the tensors produced by the optimizing variables θ and ϕ. We assume that the algorithm can reach a local Nash equilibrium and adopt the structural similarity index measure (SSIM) [33]. It quantifies structural information changes, e.g., the local mean, variance, and correlation, over the optimizing generator weight parameter θ, relative to the preceding SimSGDA iteration. By Definition III-C.1, the algorithm achieves a local minimax if there is a small δ > 0 such that ‖θ − θ*‖ ≤ δ and ‖ϕ − ϕ*‖ ≤ δ, for any (θ, ϕ). A more precise treatment is already given in [13].
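A global (single-window) SSIM between consecutive weight tensors can serve as such a similarity metric; the sketch below is a simplified variant of the windowed SSIM of [33], with hypothetical stabilizing constants and synthetic weight vectors standing in for θ:

```python
import numpy as np

def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """Single-window SSIM over two flattened tensors (mean, variance, covariance)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = np.mean((x - mx) * (y - my))
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(5)
theta_prev = rng.normal(size=1000)                        # weights at iteration n-1
theta_next = theta_prev + 1e-3 * rng.normal(size=1000)    # weights after one small step
print(global_ssim(theta_prev, theta_prev))   # identical tensors give 1.0
print(global_ssim(theta_prev, theta_next))   # near 1 as the iterates stabilize
```

As the SGDA iterates settle near (θ*, ϕ*), consecutive weight tensors become nearly identical and the SSIM approaches 1, which is the behaviour tracked in the convergence-rate experiments.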

IV. EXPERIMENTS
We carried out experiments to evaluate our model's convergence and stability. We use the standard DCGAN architecture, like the one used by all prior models, with additional spectral normalization (SN) in the discriminator and batch normalization (BN) in the generator. The details are in Table 2. The modifications are designed to significantly smooth the SimSGDA optimization landscape. It is known that BN smooths the objective function [34] while SN enforces Lipschitz continuity [4]. NVIDIA Tesla V-100 GPUs, with 32 GB VRAM each, constitute the platform. The linear-time Sinkhorn Algorithm with a Gaussian kernel, AutoDiff, and the ADAM optimizer were deployed in all of the experiments. For reproducibility, our PyTorch codes are available online at https://github.com/muchlisinadi/compare-ipmgan.
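Spectral normalization is indeed a pure matrix operation: it divides each layer's weight matrix by an estimate of its largest singular value, typically obtained by power iteration. A NumPy sketch of that operation (our own illustration, not the PyTorch implementation used in the experiments):

```python
import numpy as np

def spectral_normalize(W, n_iter=200):
    """Divide W by its largest singular value, estimated with power iteration."""
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                 # estimate of the spectral norm of W
    return W / sigma

W = np.random.default_rng(6).normal(size=(8, 5))
W_sn = spectral_normalize(W)
print(np.linalg.norm(W_sn, 2))        # largest singular value is now ~1.0
```

Because the estimate only touches the weight matrix, the normalization is independent of the Sinkhorn iteration and of the choice of minimax objective, as noted above.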

A. DATASETS SELECTION
Image datasets such as MNIST, CIFAR10, ImageNet, and CelebA have been used to test prior Sinkhorn GAN models. CIFAR10 and CelebA are well-studied datasets of 32 × 32 colour images. For the experiments, we selected Cats and CelebA, instead of CIFAR10 and ImageNet, for three main reasons. First, the preferred two datasets have around 20,000 images each. This number is much larger than the 600 and 6,000 images per class in ImageNet and CIFAR10, respectively. Second, the GAN-generated samples for Cats and CelebA are easier to inspect visually than those of multi-class datasets. Finally, using datasets of the same size allows us to focus on the impact of the hyperparameters in our Sinkhorn GAN and on the visual quality of the generated samples compared with those produced by other state-of-the-art models.

B. HYPERPARAMETERS SELECTION
The hyperparameters include the mini-batch size m, the SimSGDA learning rate η, the number of Sinkhorn iterations L, and the entropy regularization weight ε. To determine the best values for the set {m, η, L, ε}, we performed simulations with varied {η, ε} on fixed {m, L}. We decided on L = 10 and m = 100 based on the insights gained from the reported experiments in [13].
The outcomes for ε ∈ {0.001, 0.01, 0.1} and η ∈ {5·10⁻⁵, 10⁻⁴} are displayed in Figures 1 and 2. The choice of ε is crucial. Too much regularization, that is, too large an ε, leads to a loose fit on the data. Insufficient regularization severely slows the convergence of the Sinkhorn iteration and demands more GPU time whenever the total Sinkhorn iteration budget L increases. Choosing η = 5·10⁻⁵ produces better generator and discriminator losses than the higher η = 10⁻⁴; the latter leads to vanishing gradients. Since m and L are fixed, the best hyperparameters for our model are

m = 100, L = 10, ε = 10⁻³, η = 5·10⁻⁵.    (28)

Choosing lower values for ε, η, and L produces better-quality samples, as illustrated in Figure 4 for Cats. The higher inception score (IS) and lower Fréchet inception distance (FID) values shown, respectively, in Figures 7 and 8 strengthen our claim of improved sample quality. We trained our model on CelebA with the same hyperparameters. It yielded samples of comparable quality when the positive features are in use, as shown in Figure 5. For higher ε and η, SimSGDA tends to diverge above 275·10³ iterations, after having previously converged. This tendency is highlighted in Figure 1 for ε = 0.01 and in Figure 2 for η = 10⁻⁴. This phenomenon, known as rotational behaviour, is typical when a constant learning rate η is chosen [35], [36].

C. IMPACT OF POSITIVE FEATURES
We investigated the impact of using Sinkhorn with positive features on the selected hyperparameters. We ran the SimSGDA Algorithm on two types of Sinkhorn Algorithms, namely, the kernel-first positive features in Algorithm 1 and the original Sinkhorn matrix scaling from [13]. For the same L, the former runs in O(L r n) time, underlining the impact of the positive features, whereas the latter takes O(L n²) time. Figure 3 shows a faster convergence behaviour for Sinkhorn with positive features below 275·10³ iterations. Soon after, it diverges, exhibiting a rotational behaviour. The generated Cats samples, however, are comparable in quality to those produced by WGAN, as can be seen in Figure 4. A direct comparison of the generated samples from the two types of Sinkhorn Algorithm for CelebA is given in Figure 5. The tendency of Sinkhorn with positive features to generate blurry images is confirmed by the slightly lower IS and higher FID values in Figure 6. Similarly blurry CelebA samples are also produced by other Sinkhorn GANs, notably SiNG and SiPF. These last two models, however, use different objective functions and optimization techniques.
To empirically establish that SimSGDA converges locally to a Nash equilibrium, we reversed the order of minimization and maximization. SimSGDA, trained with the reversed order, obtains the same results. This forms strong empirical evidence that SimSGDA has a CC objective, since it satisfies the von Neumann condition in (3). It has already been proven theoretically that SimSGDA with a CC objective can converge to a local Nash equilibrium [22]–[24]. Here we supply empirical evidence for the Sinkhorn GAN trained using SimSGDA. All previous Sinkhorn GANs also have CC objectives, as mentioned in Table 1. They, however, are trained with SeqSGDA, with additional critic updates on the discriminator, not with SimSGDA.
Many methods have been proposed to handle the rotational problem that may appear, as in our case here, in the generator and discriminator losses after convergence. Prominent examples include the 1st-order consensus optimization from [35] and the 2nd-order follow-the-ridge (FR) correction proposed in [36]. Since the Sinkhorn divergence has a well-defined gradient and Hessian, through its dual and semi-dual definitions, both corrections can be implemented. A comprehensive analysis of such implementations deserves a treatment beyond what we can include in the present work.

D. INCEPTION SCORE AND FRÉCHET INCEPTION DISTANCE EVALUATION METRICS
We use two common GAN evaluation metrics, namely the inception score (IS) and the Fréchet inception distance (FID), to evaluate the quality of the generated samples. A high IS indicates that the generated samples contain clear objects and that the GAN outputs highly diverse samples. The FID measures the distance between two probability distributions under the assumption that they are Gaussian. It captures the similarity of the generated samples to the real samples better than IS does. A smaller FID value indicates that the generated samples are more similar to the real ones. High IS and low FID values are desirable.
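For two Gaussians, the FID has a closed form, ‖µ₁ − µ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁^{1/2} Σ₂ Σ₁^{1/2})^{1/2}). The sketch below (our own illustration, operating on raw Gaussian parameters rather than Inception activations) computes it:

```python
import numpy as np

def sqrtm_psd(A):
    # matrix square root of a symmetric PSD matrix via eigendecomposition
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def fid_gaussian(mu1, S1, mu2, S2):
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})
    s1h = sqrtm_psd(S1)
    covmean = sqrtm_psd(s1h @ S2 @ s1h)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2.0 * covmean))

mu, S = np.zeros(3), np.eye(3)
print(fid_gaussian(mu, S, mu, S))               # identical Gaussians -> 0.0
print(fid_gaussian(mu, S, mu + 1.0, 2.0 * S))   # grows with the mismatch
```

In practice, µ and Σ are the mean and covariance of Inception-network activations over the real and generated samples; the symmetrized product Σ₁^{1/2} Σ₂ Σ₁^{1/2} keeps the matrix square root on a symmetric PSD argument.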
We used Cats, a DCGAN neural architecture, and the ADAM optimizer, with fixed (m, L) = (100, 10). Figures 7 and 8 show, respectively, the IS and FID values over variations in the regularization ε and learning rate η. Our model produces better quality samples as ε decreases, as indicated by the progressively higher IS and lower FID values. In theory, the Sinkhorn algorithm requires ε → 0 and L → +∞ [5]. A clear benefit of using the Sinkhorn iteration in Algorithm 1 is a reduction in the running time, since we do not have to set a high value for L.

E. IMPACT OF SINKHORN ITERATION
We have also investigated the effects of the number of iterations on the training stability and the quality of the generated samples. From a numerical point of view, our proposed Sinkhorn algorithm is a fixed-point iteration. Hence, L can be chosen to be large, say L = 100, or small, say L = 10.
Larger L produces images with better quality. One must keep in mind, however, that L also depends on the chosen value of the regularization parameter ε. The latter must be kept small, as was already explained.
We used Cats, a DCGAN neural architecture, and the ADAM optimizer, with fixed (m, ε) = (100, 0.001). Figure 9 presents the IS and FID charts as L changes. The higher IS and lower FID values indicate that our model produces better quality samples as L decreases. In addition to the theoretical requirement of L → +∞, the output of the Sinkhorn algorithm also depends on ε and m. Our simulations confirm that L can be kept relatively small in practice.
Intriguingly, some values of L, for example L = 50 in Figure 10, lead to a vanishing gradient. We therefore hypothesize a trade-off among m, L, and ε as the main cause. We note that the Sinkhorn GANs in [7] and [12] use higher values of m and L when using a low ε.

F. COMPARISON TO WGAN
For further evaluation, we benchmark our model against the state-of-the-art WGAN-GP [2] and SN-GAN [4], using Cats and the same DCGAN architecture. We select WGAN-GP and SN-GAN because both are formulated using the earth mover's distance, which is a variant of OT-based distance. We want to test whether our model can generate samples of comparable quality.
A WGAN uses a sequential, instead of simultaneous, SGDA with several critic updates N_c on the discriminator. Our model uses the hyperparameters in (28). For WGAN-GP and SN-GAN, we choose (N_c, m, η) = (5, 100, 10⁻⁴). The coefficient for the WGAN-GP's gradient penalty is fixed to λ = 10. Figure 4 shows that all three models generate samples of comparable quality. Furthermore, the outcomes, plotted for only the generator iterations in Figure 11, confirm the IS and FID convergence in all three models. They follow a similar trend. Compared to ours, WGAN-GP and SN-GAN achieve slightly better sample quality, indicated by their higher IS and lower FID values. Figure 11 displays how WGAN-GP and SN-GAN achieve convergence faster than ours. Game theoretically, a model may converge faster but not to a local Nash equilibrium or a minimax point [15].
Let ∆ be the total number of gradient updates for both the critics and the discriminators. To approximate the gradient complexity, we performed a linear regression comparison on the captured data with respect to log(1/SSIM) and log(∆). We made use of the back-and-forward tracks, each with 1,000 iterations, from a stationary point when the generator loss had already converged. The generator parameters are taken at specific gradient updates. Thus, the generator is structurally steady, producing the same tensor. Empirically, the convergence rate of Sinkhorn with positive features is comparable to that of WGAN-GP, as indicated by the positive slopes in Figure 12. A previous study showed that the iteration complexity to return to an ε-stationary point can be approximated as O(κ log(ε⁻¹)), where κ is a value that depends on the Sinkhorn divergence's convexity and the learning rate in the SeqSGDA algorithm [13].
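The slope comparison above amounts to an ordinary least-squares fit on log-transformed data: regress log(1/SSIM) on log(∆) and read the empirical rate off the slope. The helper below is our own sketch of that fit (the function name is an assumption; the paper's exact data pipeline is not reproduced), shown here because the log-log slope is the quantity being compared in Figure 12.

```python
import numpy as np

def empirical_rate(ssim, delta):
    """Least-squares slope of log(1/SSIM) against log(delta).

    ssim  : SSIM values captured along the gradient-update track
    delta : corresponding total numbers of gradient updates
    If 1/SSIM ~ C * delta^p, the fitted slope recovers the exponent p."""
    x = np.log(np.asarray(delta, dtype=float))
    y = np.log(1.0 / np.asarray(ssim, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 polynomial fit
    return slope, intercept
```

On synthetic data generated with a known power law, e.g. SSIM = ∆^(−1/2), the fit recovers the exponent 0.5 exactly, which is a quick way to validate the procedure before applying it to captured training curves.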

V. CONCLUDING REMARKS
We have proposed a new variant of Sinkhorn GAN. The generator is trained to minimize the asymmetric distance W_c^ε(θ, ϕ) while the discriminator maximizes the symmetric divergence S_c^ε(θ, ϕ). To accelerate the training, we incorporate a linear-time Sinkhorn with positive features and the more efficient SimSGDA algorithm. The model is shown to be uniquely stable when trained with the right choice for the set of hyperparameters {m, L, η, ε} on datasets of images, each of size 32 × 32. A detailed comparison between our model and prior Sinkhorn GANs is given in Table 1.
Our theoretical review assumes the existence of a local Nash equilibrium when using the SimSGDA Algorithm with CC objective. An investigation to explain this fact is an open direction to pursue. We have highlighted that our method works well in practice. One downside of SimSGDA in deployment is that it runs into the notorious rotational behavior at higher iterations when a low and a fixed η are chosen. We also discover that the method becomes less stable in deeper architectures such as ResNet. This may be because the gradients in such architectures can vary greatly in scale. We are currently looking into the observed rotational and architectural issues. Another task is to scale the approach up to yield higher resolution images.
Our model holds a two-fold advantage over prior Sinkhorn GANs. First, the time complexity O(nr) in approximating the Sinkhorn divergence S_c^ε(θ, ϕ) is linear in n. This allows for significantly larger batch sizes. Second, the approach is fully differentiable. It directly computes the gradients of W_c^ε(θ, ϕ) and S_c^ε(θ, ϕ) with respect to θ and ϕ. In comparison, OTGAN does not differentiate directly through S_c^ε(θ, ϕ). It uses, instead, a mini-batch energy distance. SWGAN requires an optimal discriminator oracle, whereas SGD-AutoDiff differentiates through the iterations of the Sinkhorn algorithm. It must, therefore, keep track of the ground cost computation and is applicable only when the entropy regularization ε is large, since the number of Sinkhorn iterations L cannot be too large.
We have demonstrated that the more efficient SimSGDA can be used to train a Sinkhorn GAN. We show that a minimax problem with a convex-concave Sinkhorn objective can be solved to a local Nash equilibrium, under some mild approximations. The overall insight is that the approximations are simpler and computationally more efficient than prior Sinkhorn GANs, while still generating images of comparable quality to the state-of-the-art WGAN-GP and SN-GAN. As we continue to strengthen the theoretical foundation, we hope that our model can inspire a wider deployment of EOT in GANs. Our results open up possibilities for future research. For wider practical implementation, the rotational issue and adaptation to deeper neural architectures are open for exploration. On the theoretical side, the proof of local Nash equilibrium, the generalization properties, as well as the analytical convergence rate of the SimSGDA algorithm in our proposed Sinkhorn GAN are worth investigating.