Scalable Optimal Transport Methods in Machine Learning: A Contemporary Survey

Optimal Transport (OT) is a mathematical framework that first emerged in the eighteenth century and has led to a plethora of methods for answering many theoretical and applied questions. The last decade has been a witness to the remarkable contributions of this classical optimization problem to machine learning. This paper is about where and how optimal transport is used in machine learning with a focus on the question of scalable optimal transport. We provide a comprehensive survey of optimal transport while ensuring an accessible presentation as permitted by the nature of the topic and the context. First, we explain the optimal transport background and introduce different flavors (i.e., mathematical formulations), properties, and notable applications. We then address the fundamental question of how to scale optimal transport to cope with the current demands of big and high dimensional data. We conduct a systematic analysis of the methods used in the literature for scaling OT and present the findings in a unified taxonomy. We conclude with presenting some open challenges and discussing potential future research directions. A live repository of related OT research papers is maintained in https://github.com/abdelwahed/OT_for_big_data.git


INTRODUCTION
Selecting the right notion of discrepancy or distance between objects is at the heart of many machine learning problems.This work is about Optimal Transport (OT) [1], a field that defines many such notions between a variety of mathematical objects such as probability histograms, shapes, and point clouds.OT views objects as heaps of sand (mass) and quantifies the distance between them as the least costly way to transform the first heap into the second.This seemingly abstract concept has found applications in problems including domain adaptation [2], object detection [3], reinforcement learning [4], graph representation and matching [5], [6], [7] , feature matching [8], analyzing deep learning generalization [9], [10], [11] , generative modeling [12] , knowledge distillation [13], fairness [14], [15] and many others.
Interestingly, the simple idea of moving mass in an optimal way has a rich history.It evolved over the years, branched into many fields, and produced a wealth of theoretical and practical knowledge.The recent treatment of the topic was shaped by the work of Leonid Kantorovich when a practical wartime optimal allocation problem confronted him.He realized that the problem was actually of broader interest and devised a "simple general method of solving this group of problems" [16]  1 .By doing so, he unknowingly [18] revisited a similar transport problem studied 150 years earlier by the renowned mathematician Gaspard Monge; the problem of soil transport for construction purposes.Ultimately, Kantorovich's solutions constituted the main tools of linear programming (i.e., linearly constrained linear optimization) and won him a Nobel Prize in economic sciences in 1975.This historical connection between the theory of OT and its applications persists until today, fostering successful developments on both fronts.On the theoretical side, the success can be associated with OT's connection with several branches of mathematics.The parallel practical success is reflected in the growing adoption of OT in diverse applications.
1. His method is regarded as a variant of the simplex method proposed in the late 1940s by Dantzig; who advanced the field of linear programming [17].Fig. 1.Optimal Transport geometry awareness.Individual word-toword distances may not capture the semantic similarity between two documents due to the lack of common words.OT leverages the underlying geometry (captured by the ground cost) and lifts it to the Wasserstein space where the distance represents the cost of the optimal transportation of a whole document (distribution) to another.

"lifting"
So, what does OT bring to the Machine Learning toolbox?Many machine learning problems boil down to comparing probability distributions.Consider the problem of assessing the similarity of two documents containing "Obama speaks to the media in Illinois" and "The president greets the press in Chicago" [19].A similarity notion that relies on direct pairwise comparisons of the individual elements (i.e.words) would render the documents dissimilar, due to a lack of common words.However, our common sense tells us that the documents are actually similar.Looking at the two sets of words as a whole, the first sentence is semantically very close to the second.The latter insight motivates the OT perspective [20], [21].OT accounts for the shapes of the distributions by computing a distance that measures how much effort is needed to align a distribution to another (Fig. 1) while allowing the distributions to have different supports.Among all possible ways to do such "global" alignment of

Machine Learning
Optimal Control Fig. 2. Survey Position.Among OT review papers [27], [28], [29], [30], [31], this work presents a very comprehensive and updated coverage of OT in machine learning.The circles on the left column represent the existing reviews with their size indicating the number of the reviewed papers.The connections between the left and right columns depict the topics covered in each considered review article.The underlined "Scaling OT" is a key focus of this survey.
distributions, OT picks the one with the least effort (optimal plan) guided by a notion of "local" distance/cost between the basic features of the distributions' supports (e.g.words).This local cost captures the geometry of the data and is usually called the ground cost 2 .In summary, OT lifts the basic wordto-word ground cost to a global distance in the distributional space.The appealing properties of OT can be summarized in the following two main points.
Underlying geometry awareness.Allowing geometric information of the support to be taken into account (unlike fairly common distances/divergences) can result in meaningful and smooth divergences, even when the two distributions don't overlap.This powerful feature is leveraged in many applications including OT-based embedding [22] and generative modeling [12].Geometry awareness also means that OT borrows properties of the underlying space [1].For example, concepts such as interpolation and barycenter from Euclidean ground space carry on to the Wasserstein space.Thus, one can perform smooth interpolation between probability distributions [23] or even perform regression on histograms [24].
Explicit Correspondence: in addition to distances between distributions, OT typically provides correspondences through the optimal plan.Such correspondences can be used for applications such as object matching and registration [25], [26], and interpreting machine learning results ( §B. 1).
What is the price tag?The above-mentioned desirable characteristics come with longstanding computational and statistical challenges.First, OT-based distances are hard to compute, and explicit closed-form expressions for OT-based distances are rare beyond Gaussian and one-dimensional cases.In its original formulation, OT-based distance computation requires solving a linear program with a super-cubic complexity in the number of data points.Reducing this problematic complexity using various computational techniques and workarounds is an ongoing and interesting trend in the OT literature.We review many of these techniques ( §5) and categorize them in a taxonomy that reveals the connections among various techniques.Second, in high dimensional settings and without applying any scaling techniques (Sec.§5), OT suffers from a severe "curse of dimensionality".The sam-ple complexity 3 of estimating the Wasserstein distance using a finite sample from the distributions grows exponentially in the dimension [32], [33].To understand the gravity of the issue, contrast OT with another divergence, the Maximum Mean Discrepancy [34], whose sample complexity is dimension independent!Finally, there are limitations inherent to the vanilla OT formulation itself such as sensitivity to outliers [35], inability to incorporate context/structure in the optimal plan [36], strong reliance on the chosen ground cost [37], assuming a common ground cost [38] and the inability to handle measures (generalized, un-normalized distributions) of non-equivalent masses (normalization constants) [39].Unsatisfied with these limitations, researchers have produced a diverse spectrum of OT formulations over the years with the potential to meet a broader range of applications and requirements.Indeed, new formulations are still evolving as of this writing.For each formulation, we briefly discuss the characteristics, computational methods for solving it, and ML applications in which the formulation was used.
In the last decade, many significant developments in OT have taken place.New mathematical formulations, highly scalable algorithms (coping with datasets with millions of samples), interactions with deep learning, the rise of learned neural solvers, and empirical advancements in domain adaptation and generative modeling are a few developments to name.These developments call for a refreshed review of the field that sheds light on recent developments and identifies emerging trends and paradigms.On one hand, there are excellent textbooks [1], [40] that discuss the computational aspects of OT in great detail.However, their discussions of OT applications in ML and its scalability are dated.On the other hand, the existing review papers (Fig. 2) either have a narrow coverage of OT in machine learning or a specialized coverage irrelevant to machine learning.Our goal is to present a timely and comprehensive survey of the important topic of OT and OT applications in modern machine learning.
We summarize the contributions of this work as follows: 1) We present a comprehensive survey on the topic of OT with a focus on the applied side of the literature in a way that best targets the machine learning research audience.2) We compose a high-level view that organizes the techniques in literature for addressing the scalability issues of optimal transport (i.e.computational challenges) in a single framework ( §5). 3) We highlight literature gaps and discuss promising future research directions in OT and its applications.
The remainder of this paper is organized as outlined in Fig. 3.We took some efforts to make the paper self-contained.It is difficult to fit the OT-related optimization background into a few pages; we suggest that §C can help the uninitiated readers before seeking other resources.

A MOTIVATIONAL EXAMPLE
In this section, we give a simple motivational example to introduce OT.Consider the problem of comparing point cloud objects.The objects are composed of 3D points that can be acquired by contemporary sensors such as Lidar, Radar, or depth sensors.Figure 4(a) shows examples of point cloud objects.To compare these objects, we seek a matching that redistributes the first object into the second in the least costly way (i.e.optimal transportation).Let's assume that 3. Sample complexity, informally, is the number of samples needed to estimate a function within a certain level of accuracy.Fig. 3. Outline of the Survey.We start by ( §2 Motivational Example) introducing the reader to OT through a motivational example that paves for ( §3, §4 Background) the discussion of OT formulations.Next, ( §5 Scaling OT) we present a taxonomy for scaling OT methods to big data regimes.Then, open issues and future research directions are discussed in ( §6 Discussion).In the supplementary material, we extend the background discussion ( §A Extended Background) and include a summary of OT applications in machine learning( §B Applications).All the figures are best viewed in color.
there are n and m points in the objects µ and ν; respectively.Furthermore, there is some mass (depicted as the volume of the point) that can be exchanged between the two objects during transportation.This mass can be, for example, the intensity of the point 4 .The term "mass" may be also used to refer to abstract or conceptual quantities.When unknown, we can assume that the masses are uniformly distributed.The mass of one point in the source can be exchanged with one or multiple points in the target.
Regardless of the notion of mass used, we assume that the points {x i } n i=1 of the object µ have non-negative mass each that sum up to 1 (i.e.normalized).We denote the vector of point masses in µ by a ∈ R n .The same assumption is made for the object ν whose points {y i } m i=1 are assumed to have a normalized mass vector b ∈ R m .
Suppose we wanted to transport the mass from the particles in µ to the particles in ν in a way that minimizes some total measure of cost, e.g. the sum of "mass moved" × "the distance it moved".More rigorously, we would like to decide on how much mass to move from point x i to point y j ; P ij , considering the distance between x i and y j ; C ij .The first P ij is sometimes known as a transfer plan which is to be identified while the latter C ij is known as ground distance (e.g. the squared Euclidean distance ∥x i −y j ∥ 2 ) which is given.Let P ∈ R n×m and C ∈ R n×m denote these quantities.The goal of OT is to find an optimal plan P * that dictates the mass distribution plan from the source points to the destination points in a way that minimizes the total transport cost.The set of all valid plans (formally defined in Couplings) is described as follows: The constraints ensure that P preserves the mass so no mass is created or destroyed during transportation.In light of this, the optimal plan P * would be one that minimizes Posed this way, the OT problem (sometimes known as the Kantorovich formulation) possesses a useful set of characteristics as well as limitations, which motivate other formulations of the problem.We discuss many of them in §4.In some cases, one may wish to relax the mass preservation constraint.For example, as shown in Fig. 4(c), the pink hand object can only be partially matched to the incomplete (grey) hand copy as the thumb is missing.The vanilla OT forces the transportation of all points, resulting in inaccurate matching and (unrealistically) increased cost.Note the incorrect matching of the thumb's points.A partial OT formulation ( §4.2), however, is a better alternative as it has the flexibility of leaving some points out (the thumb) from the transportation.Another limitation is that transportation that relies only on point correspondences conveyed by the ground distance (without considering the inter-points relations) can be misleading.In Fig. 4(d), we see a simple rotation transformation of the target object causes a degradation in the alignment by vanilla OT.This can be clearly seen in the incorrect correspondences of the index and the thumb fingers.This is not the case with a relational formulation (Gromov Wasserstein §4.5) that handles this issue well at the expense of increased computation.This formulation can also compare points defined on distinct (incomparable) spaces which can be useful for comparing graphs or measurements from different modalities.

BACKGROUND
We begin with notation and background of the key OT formulations.We state the Monge and Kantorovich formulations below, and refer to an extended discussion of Kantorovich formulation in §A.

Notation
Probability measures.Let X and Y be two non-empty sets.Denote by P(X) the set of probability measures supported on X.Let µ ∈ P(X) and ν ∈ P(Y).A function ϕ : X → R is said to be integrable with respect to µ if it has finite expectation under µ, i.e.E x∼µ ϕ(x) = X |ϕ(x)| dµ(x) < ∞.The space of functions integrable with respect to µ is denoted The mathematical objects that we discuss can take on a discrete, continuous or mixed nature.Averages can be computed using one or more of sum, integral or expectation notation.While in principle these objects can be dealt with using one unified notation as an integral with respect to a measure, for ease of reading we sometimes prefer to explicitly write out the same equation using multiple notations.To simplify notation, we will assume that every probability measure has an associated probability density function or probability mass function.
Nonnegative unnormalized measures.Most of the time, we will be interested in distances defined over the space of probability measures.Unless otherwise stated, all measures are probability measures.In some cases (for example, § 4.2), these distances generalize to spaces containing nonnegative and finite measures that are not necessarily probability measures.Where we deal with such nonnegative unnormalized measures, we will mention this close to where they are introduced.In such cases, the expectation notation E x∼µ • is inappropriate and we resort to integrals X •dµ(x) or sums.We denote by M + (X) the space of finite nonnegative unnormalized measures supported over X.

Coupling.
A useful notion appearing in OT (for example, in the Kantorovich formulation Eq. ( 2)) is that of coupling.Given two random vectors x and y, we may construct a random vector (x, y) having marginal distributions that are the same as the distributions of x and y, and a non-unique joint distribution consistent with those marginals.The space Π of all such joint distributions constitutes a useful space over which to search.Define the set Π(µ, ν) of joint probability measures with marginals µ and ν.In other words, Π(µ, ν) is the set of plausible plans

Matrix representations for discrete and finite support.
It is useful to instantiate measures and couplings in terms of vectors and matrices in the case where the random variables in question are discrete and have finite support.
Let a and b be the probability mass functions corresponding to µ and ν.Let the support of a and b be finite, that is The set Π(µ, ν) corresponds with a set of finite-dimensional matrices U (a, b) representing joint probability mass functions consistent with marginals µ and ν.Define and note that for any P ∈ U (a, b), we have that the ijth entry of P satisfies P ij = Pr(x = x i , y = y j ) under some joint probability mass function for random variables (x, y) with marginal probability mass functions a and b.
Empirical measures When dealing with real-world data and problems, we usually don't have access to explicit forms of distributions µ and ν.Instead, we have samples {x i } m i=1 ∼ µ and {y j } n j=1 ∼ ν.In these settings, it is useful to define and work with the empirical measures µ n = 1/n i δ xi and ν n = 1/n j δ yj .Here δ z is a unit mass centered at z.

Monge Formulation (Optimal Map)
Given probability measures µ and ν, the OT map t * : X → Yis defined to be one that maps random variables x following µ to random variables y following ν with the minimum expected cost between x and t * (x) = y.The Monge formulation [40,Chapter 1] is : Here T µν = {t : X → Y | t # (µ) = ν} constitutes the set of all possible mappings.t # (µ) is the pushforward measure of µ under t.In the context of probability measures this means that t # (µ) is the probability measure of the random variable t(x).c is the ground cost responsible for transporting one unit of mass from x to y.The choice of c is typically application-dependent and influenced by domain knowledge and awareness of data.
The Monge formulation Eq. ( 1) may be ill-posed; there may not exist any mapping t such that t # (µ) = ν.For example, if µ places probability mass over 2 states, it can never be mapped to a measure that places probability mass over 3 states using a deterministic function.The Kantorovich formulation, introduced next, does not suffer from this issue.

Kantorovich Formulation (Optimal Plan)
Given µ and ν and some cost function c, the Kantorovich OT problem [40,Chapter 1] (the general formulation of the discrete example in §2) is defined as: Any π * that attains this infimum is called an OT plan.In the case that x and y are discrete random variables with probability mass vectors a and b, we write where C = [c(x i , y j )] ij is the matrix containing the pairwise costs between x and y.Interestingly, Eq. ( 2), known as the primal formulation, can be written in a number of equivalent formulations (such as the dual formulation).Note that interesting theoretical and practical connections were made between Monage and Kanorovich formulations.Brenier's theorem [1, Theorem 2.1] establishes the equivalence of Kantorovich and Monge's problem in the Euclidean space where the optimal map is the gradient of convex functions.Also, some works [43] augment Kanorovich's formulation with a Monge map variable and jointly learn the optimal coupling and an approximation of the transport map.Check §A for intuitive and comprehensive exposition on alternative Kantorovich formulations and a discussion of Hierarchical Optimal Transport formulations.Wasserstein Distance: It may be shown that if c(x, y) = d(x, y) p for some distance d on S = X ∪ Y, i.e. d : S × S → R and some p ≥ 1, then the p-Wasserstein distance [44, Definition 6.1] : is a distance metric over the space of probability measures with finite moments up to order p (Theorem 7.3 (i) [45]).In the case where 0 ≤ p < 1, W p p (µ, ν) is a distance metric over the space of probability measures with finite moments up to order p (Theorem 7.3 (ii) [45]).

COMMON OT FORMULATIONS
Here we discuss OT formulations with a focus on the balance between proper coverage and concise treatment.

Regularized OT
Formulation and Characteristics: Some of the computational and statistical limitations of the original OT formulation ( Eq. ( 2)) can be mitigated by augmenting the original objective with a regularization term.Regularized OT was brought to the forefront of applied machine learning after Cuturi published his seminal work on the topic ten years ago [46].The work demonstrated the computational superiority of the entropic regularized formulation and its compatibility with modern ML pipelines (i.e., parallelism and differentiability).Yet, the formulation can be traced back to the works of Erwin Schr ödinger in 1930s [47]  5 .The regularized OT problem is defined as: 5. Recently, OT and Schr ödinger bridge have provided insights into diffusion-based generative models [48], [49].
where Ω is a regularization operator and λ is a positive regularization coefficient.The regularized discrete OT problem is similarly given by: One choice for Ω is the KL divergence between the joint distribution P (or measure π) and the product measure m 1 × m 2 for a set of reference measures m 1 and m 2 .That is, Ω(P ) = KL(P || m 1 × m 2 ).When additionally the reference measures m 1 and m 2 are uniform, the regularization term (up to a constant) is called entropic regularization: where H is the Boltzmann-Shannon entropy function.Adding the negative of the Boltzmann-Shannon entropy function as a regularization gives us the famous entropicregularized OT: which is a version of the original problem (Eq.( 3)) with an additional strictly convex regularization term (negative of the Boltzmann-Shannon entropy).Intuitively, this regularization biases the optimal plan towards the uniform distribution.Interestingly, entropic-regularized OT (whose optimal objective value is also known as the Sinkhorn "distance") can be rewritten as minimization of a Kullback-Leibler (KL) divergence, where the matrix K (also known as Gibbs kernel) is the element-wise negative exponential of the scaled ground cost The KL divergence is known to be a strictly convex function of P for a fixed K, which means that there exists a unique P that minimizes the above expression and that is differentiable with respect to K. Note that the Sinkhorn distance, despite not being a proper "distance", is symmetric in µ and ν, i.e., OT λ (µ, ν) = OT λ (ν, µ), if the cost function c(x, y) is symmetric in x and y.Computational Motivation: While the regularization term can be motivated by application requirements through the inclusion of an additional prior [50], one can safely argue that the computational advantage of entropic regularization is a key appealing benefit.Specifically, the formulation in Eq. (7), which was popularized in the GPU era by Cuturi [51], can be solved efficiently using an iterative matrix scaling algorithm known as the Sinkhorn-Knopp [52] algorithm, often shortened to the Sinkhorn algorithm (Alg.1).This algorithm boils down the process of computing the optimal plan to a series of matrix-vector multiplications.The kernel matrix capturing the measures µ and ν dissimilarity is denoted as where C is the ground cost matrix and λ is the regularization coefficient in (Eq.( 6)).The algorithm seeks the unique optimal solution that must satisfy the following: (Sinkhorn theorem) Finding the vectors u and v that satisfy the requirements above is achieved simply by performing repeated alternate scaling (line 3 in Alg. 1) until convergence.The number of iterations needed to converge to a specific tolerance δ is controlled by the scale of elements in C relative to λ [53].
The iterations are merely applying kernel matrix K (or its transpose) to vectors u and v.The Sinkhorn algorithm has a worst-case quadratic complexity in the number of points per iteration in its base form (with many enhancements possible, e.g.[54]) and linear convergence rate.This is much more efficient and scalable than the linear programming methods employed by solvers for the un-regularized formulation.
Algorithm 1: Sinkhorn(K, a, b, δ) Sinkhorn Divergence: Despite the computational appeal of the the Sinkhorn "distance", two limitations arise in Eq. (7).First, the regularized OT optimal value is not a distance.Second, the formulation induces a bias in the minimizer known as the entropic bias [55].These issues are handled by the Sinkhorn divergence [55], [56]; another variant of regularized OT that builds on Sinkhorn "distance" and is defined as follows: Sinkhorn divergence is smooth and differentiable in the inputs and enjoys a better sample complexity [57] compared to regular OT.Yet, it can be seen that it requires three evaluations of OT λ rather than one, although the symmetric Sinkhorn algorithm can be used to compute OT λ (µ, µ) and OT λ (ν, ν) more quickly than the regular Sinkhorn algorithm.Altschuler et al. [58] provides analysis of the computational complexity of Sinkhorn and the faster variant called Greenkhorn.Note that we discuss scalable variants of Sinkhorn later in §5.3 and §5.5.
Applications: Entropic-regularized OT and the Sinkhorn "distance" have been extensively used in many machine learning applications.Some examples include self-labelling [59], adversarial examples [60], permutations learning [61] [62] in deep models, initializing graph correspondence in graph matching learning [63], differentiable sorting and ranking [64] [65] and enforcing doubly stochastic attention in transformers [66].Sinkhorn was also employed in graph comparison pipelines (check §B.3).Since the kind of regularization used in OT can be adapted depending on the application requirements, some works investigate alternatives to the entropy term in Eq. (7).Graph-inspired regularization was used in domain adaptation [2] and color transfer application [50].This was extended by Laplacian OT [67] and [2].Other regularization terms that reflect various biases/prior, such as squared Euclidean distance regularization [68], Lasso and group Lasso [2], also exist in the literature.Temporal regularization [69] was used in human pose alignment [69] and [70] weakly supervised action segmentation.
Limitations: A known limitation of entropic regularized OT is the difficulty of setting the regularization strength λ.Big computational gains are tied to a high level of regularization (i.e., bigger λ), which can cause overspreading (i.e., blurry optimal plan) [68].For example, in the point matching example ( §2), high regularization translates to dense matching where most points are connected to each other.This can hurt the correspondence interpretation.On the other hand, concentrated sparse plans can theoretically be obtained for very small λ , but in practice this causes numerical issues [2], [71].

Unbalanced and Partial OT
Motivation: The mass preservation constraint in the Kantorovich formulation Eq. ( 2) requires that the total mass between the two probability distributions be the same.Unbalanced OT (UOT) refers to formulations that relax this constraint to allow the transporting of arbitrary (unnormalized) measures or partial masses.This is useful in applications like multi-object tracking [72], and crowd counting [73], [74].
Formulation and Characteristics: One way of extending OT to arbitrary positive measures is by adopting marginals relaxation, in which the hard marginal constraints Eq. ( 2) are removed, and the objective is augmented with soft regularization that penalizes the mass variation.Typically, this is done using Csiszar divergence D ϕ [75]: where M + (X × Y) is the space of finite non-negative measures over X × Y and D ϕ is a divergence induced by ϕ that quantifies mass variation between the marginals of the plan's marginals (π 1 , π 2 ) and the measures (µ, ν).The penalization strength is controlled by (λ 1 , λ 2 ).An obvious consequence of replacing constraints with penalties is that π's marginals no longer need to be equal to (µ, ν).Particular instances of D ϕ include KL divergence [76] and Total Variation [77].
Recently, [78] proposed a regularization based on Maximum Mean Discrepancy (MMD) and proved that the modified formulation has desirable properties such as dimension-free sample complexity.
Another way of extending OT to unbalanced measures, usually leveraged in discrete cases, is partial assignment.The formulation is called Partial OT (POT).When transporting from n to m points, one can adapt the formulation Eq. (2) by augmenting the plan P ∈ R n×m to be P ∈ R (n+1)×(m+1) and similarly for the ground distance matrix C to be C ∈ R (n+1)×(m+1) .The additional row and column are called dustbin [79], [80] or dummy points [81] and should absorb the mass from unmatched points.This, in effect, turns the formulation into a balanced one, and thus, the traditional computation of Eq. ( 2) can be recycled.
Computation: Entropic regularization may be combined with UOT [82] or POT [80].This allows one to invoke the Sinkhorn algorithm and speed up the computation.
Applications: The balanced OT formulation in Eq. ( 2) can be highly non-robust when there are outliers.Since it endeavors to transport all the mass from µ to ν, a single contaminated data point can increase the OT cost arbitrarily.UOT, on the other hand, can be less sensitive to this issue by assigning small masses to outliers.This makes UOT more suitable for machine learning applications that deal with corrupted and noisy data.OT variants that detect and don't transport outliers build on the unbalanced formulation.For example, ROBust OT (ROBOT) [35] performs outlier detection using a formulation that builds on UOT.
Limitations: Unbalanced OT performance can be sensitive to the choice of the mass variation parameters (λ 1 and λ 2 ).Additionally, when regularization is integrated, tuning the regularization coefficient is challenging and applicationdependent [83].

Sliced OT
Motivation: One approach to speed up the OT computation relies on the idea of low-dimensional projections by aggregating OT distances computed on 1D projections of the data.
What motivates this approach is the fact that solving OT problems in the univariate/1D case is cheap [84].Specifically, we obtain a representation for the p-Wasserstein distance between one-dimensional µ and ν with quantile functions of F −1 µ and F −1 ν , respectively, as follows: where c p (., .) is the ground cost raised to the p th power.When µ and ν are empirical distributions with n and m samples, Eq. ( 11) can be computed very efficiently through simple sorting with a complexity of O(n log(n) + m log(m)).
Formulation and Characteristics: Sliced Wasserstein (SW) [85] is motivated by explicit utilization of the above computational efficiency.SW is computed as the average of infinitely many Wasserstein distances between one-dimensional projections of high-dimensional distributions.Formally, for any µ and ν the Sliced Wasserstein is: where g θ (x) is the linear projection map parameterized by θ.Here g θ# µ denotes the pushforward measure of µ, i.e. the probability measure associated with the random variable g θ (x), where x is a random variable with probability measure µ.Unif(S d−1 ) means uniformly distributed over the surface of a d-dimensional unit sphere S d−1 .Since the projections of measures on the picked direction (g θ# µ and g θ# ν ) are one-dimensional, the W p p in Eq. ( 12) can be efficiently calculated using Eq.(11).In practice, however, acquiring an infinite number of projections is not feasible, and thus Eq. ( 12) is usually approximated using a Monte Carlo scheme.The integral is replaced with the average calculated over a finite number of L random projection directions: where ).It might be surprising that OT on the real line (i.e., sliced OT) can convey geometric information of high dimensional distributions.However, theoretical studies have shown that SW satisfies the metric axioms [86] and has convergence properties similar to that of the Wasserstein distance [87], [88].Further theoretical results [89] show that SW has better sample complexity that does not depend on the problem dimension.Applications: The Sliced Wasserstein is easier to compute and enjoys theoretical properties similar to that of the Wasserstein distance.Thus, it is used in many ML applications as a better alternative.A few examples include defining kernels [90], generative models [91], [92], pooling [93], neural texture synthesis [92], model selection [94], learning representations of 3D point clouds [95] and sets [96], and domain adaptation [97].Additionally, it has been employed for alleviating the computational burden of expensive OT formulations such as Wasserstein barycenter [85] and Gromov Wasserstein [98] ( §4.5).Limitations: Research on extensions of SW is mostly concerned with two questions linked to limitations of Eq. ( 13): 1) how do we better determine directions for projections θ l ?and 2) what could be a better mapping than the linear g θ ?
What motivates the first question is the increased projection complexity of randomly chosen θ l .In other words, to achieve a good approximation of Eq. ( 13) we need a larger number of projections L [91].One can consider a few informative projections to overcome this limitation instead of using all random ones.Following this idea, Max Sliced Wasserstein (Max-SW) [91] advocates for the single "best direction".For order p = 2, it is defined as follows: where the best direction is one that yields the largest Wasserstein distance between the projected measures.Finding this best direction isn't trivial.In practice, it is replaced with direction, resulting in the largest difference between the means of the projected measures.Distributional Sliced Wasserstein (DSW) [99] claims that focusing only on the most important direction (e.g., Max-SW) ignores other potentially relevant directions.Thus, it takes a middle ground between SW and Max-SW by searching for a "distribution" of the important directions on the unit sphere.
The second question regarding SW limitations concerns the linear g θ .What motivates this is the hypothesis that nonlinear mappings could be better in high-dimensional settings.Generalized Sliced Wasserstein (GSW) [84] formalizes this idea.Also, it is possible to combine nonlinear projection with better projection complexity.Max Generalized Sliced Wasserstein (max-GSW) [84] is an example of this trend.

Wasserstein Barycenter
Motivation: The barycenter of a collection of measures is an intuitive notion of an "average" of the measures.For a set of vectors, the usual average of the vectors is the single vector v that minimizes the sum of squared Euclidean distances between v and every vector in the set.The barycenter of a set of measures provides a useful first-order summary statistic of a set of measures, just as the average of a set of vectors provides a useful summary statistic of those vectors.
Formulation and Characteristics: For a set of measures {µ i } N i=1 , we write the weighted Wasserstein barycenter µ * with nonnegative weights {w i } N i=1 such that N i=1 w i = 1.Note that each Wasserstein distance itself involves an infimum.The existence and uniqueness of solutions to Eq. ( 15) appears to have first been studied by [100].They also provide a dual formulation and provide solutions to the problem when each µ i is zero-mean Gaussian.
Computation: Unfortunately, the Wasserstein Barycenter inherits the usual computational cost associated with optimal transport distances.One way to circumvent this issue is to consider entropy regularized divergences (see § 4.1) instead of W 2  2 , so that the Sinkhorn algorithm can be exploited.Regularization in this manner is considered by [101].They show that in the Gaussian case, entropic regularization with uniform reference measures Eq. ( 7) leads to a blurry (or biased) Barycenter, but entropic regularization of the form Eq. ( 9) does not suffer from this problem.[102], [103] proposed efficient algorithms for Wasserstein barycenter.Claici et al. [104] proposed a stochastic algorithm that can extend to continuous distributions.Leveraging deep Euclidean embeddings and auto-encoders, [105] proposed an efficient estimation of exact Wasserstein barycenters.Iterative algorithms such as [106], [107] were also used to compute Wasserstein Barycenter.Follow-up works such as [108] addressed their lack of theoretical guarantees.Anderes et al. [109] develops theoretical results for discrete Wasserstein barycenters case.
The semi-dual formulation of OT (see Eq. ( 3) in §A ) can be used to cast Eq. ( 15) as a minimum over a minimax problem [110].Using an input convex neural network (which are universal approximators in the space of convex functions) allows one to search over the set of convex functions.This approach suffers from the usual theoretical intractability of optimization problems associated with neural networks, but in practice, the method appears competitive on toy problems.The computational complexity per training iteration is O(N M p), where N is the number of marginal distributions, M is the batch size, and p is the neural network size.
Applications: [110] represent the Barycenter as a generative model that resembles a WGAN.Due to this, they may generate infinitely many samples from the Barycenter.[110] generate samples from toy Barycenters of MNIST images and recolored high-resolution images.However, other applications should be possible.MOT was used in generative modeling [112], clustering [113] and domain adaptation [114].
Limitations: Computational hardness is the most obvious limitation of Wasserstein Barycenter [115].Efforts to innovate scalable algorithms will definitely increase the application range of Wasserstein Barycenter.MOT is severely limited by the lack of efficient algorithms.In some cases [116], MOT can be solved in polynomial time.Yet, in general, MOT requires exponential time in the number of marginals and their support sizes.Altschuler et al. [117] introduced a toolkit that was used to prove the intractability of a number of MOT problems.

Gromov-Wasserstein (GW)
Motivation: OT (2) assumes a common meaningful distance c exists for the points in the source and target measures [98].Thus implicitly assuming that the two supports lie in the same metric space.This can be limiting in many applications such as cross-domain alignment [118] and unsupervised bilingual lexical induction [119] where we seek an alignment between sets of word embeddings from two different languages without access to parallel data.This calls for relational OT formulation that measures how distances between pairs of words are mapped across languages.
In the lack of a common distance function, d one might seek a solution that relies on the relevant distance functions d X , d Y (i.e., metric spaces) corresponding to the measures considered.Then, define the transportation cost based on the relationships between distances.This turns the problem into a notion of distances between metric spaces [120] [121].Notably, this offers a way for comparing originally incomparable spaces.Additionally, this notion makes the resulting alignment invariant with respect to arbitrary distance-preserving transformations (e.g., rotations).
Formulation and Characteristics: Gromov-Wasserstein [120] generalizes OT distance to a notion of distance between distances [120] The GW is defined as where r(x, x ′ , y, y 2 is a proper distance [120].In addition to the formula above, there exist variations of Gromov Wasserstein for handling unbalanced and partial transport [81], [122].Note that GW is very relevant to the so called Quadratic Assignment Problem (QAP) [123] which is also linked with the extensive literature of graph matching problems.Apart from a few special cases solvable in polynomial time, the QAP is known to be NP-Hard.
Computation: Naively solving the GW formulation (Eq.( 16)) involves prohibitive computation (O(N 2 M 2 )) involving fourth-order tensor product [120] .Recent techniques for scaling GW to big graph and point cloud datasets are discussed in §5.
Applications: On the one hand, GW, in general, bypasses a classical OT limitation and allows alignment across "incomparable" spaces [124], [125] (e.g.spaces of different dimensionality or data type).Given this, it has found increased adoption in cross-domain (heterogeneous domains) alignment.In single Cell Omics [126], GW was used to align heterogeneous measurements from gene expression and Chromatin accessibility.In [81] partial GW was shown to be efficient for matching point clouds when they exist in different domains and have different features.Very recently, GW was leveraged in Cross-Domain Imitation Learning [127] to align and compare states between the different spaces of RL agents.A Gromov extension of Dynamic Time Warping [128] was used to align incomparable time series by leveraging the intra-relational geometries.An example is multi-modal data, such as the configuration space of a robotic arm and its representation as pixels of a video frame.
On the other hand, leveraging within-domain similarity (d X and d Y ) allows capturing the topological relations.This can be beneficial for learning relations.Such a relational perspective is also why GW is used in graph applications.GW has thrived in applications such as graph classification [129], graph partitioning [6], graph node embedding [130] and supervised graph prediction [131] and others.
Limitations: Relying only on the topological or relational aspects encoded by pairwise similarities during alignment and ignoring the features is one of the limitations of GW (check Fused Gromov-Wasserstein [129] below).Computationally, GW turns the OT linear program into a non-convex quadratic program with a prohibitive computational cost.From a computational perspective, as noted earlier the naive implementation of GW requires operating on a fourth-order tensor.This can be of less concern only in a few cases.For example, when working on small-to-moderate data size [119].Also, for certain classes, [5], the cost can be brought down to O(n 3 ) where n is the number of points.Going beyond these, GW can be expensive, and searching for ways to speed up the calculation is very active.More discussion about this is provided in Scaling OT ( §5).
In section 4, we covered common OT formulations.In this fast-growing field, it is impossible to turn over every stone.Formulations that were left out include those that emerged from multiple primitive formulations ( e.g., Fused-Gromov-Wasserstein-Barycenter [131] ), the dynamic formulations [132].Additionally, there is a continual effort to extend OT to new domains and develop specialized formulations.Examples include functional OT [133], OT on splines [134], and many others.

SCALING OT COMPUTATION
In this section, we discuss various approaches for scaling OT to handle large datasets and high dimensional data.We focus mainly on OT's computational complexity.OT's statistical complexity is another challenge that we don't cover here.
While computational complexity focuses on the efficiency and feasibility of computing the optimal transport solution, statistical complexity deals with the reliability and accuracy of these solutions in the context of data analysis and inference.Readers interested in statistical complexity research can consult sources like [135] , [89], and [136].
Looking at the big picture of the literature for scaling OT computation, one can argue that all the proposed approaches are either approximating the OT solution or tailoring it to a specific case that is easier to handle.The concrete techniques to achieve this can be categorized into five main themes.Fig. 5 (top) illustrates a roadmap of OT computation scaling techniques.It is possible to guess where the optimization can happen.One can work on the individual components corresponding to the measures, the plan, and the ground cost or accelerate the formulation optimization by, for example, partitioning.Combining some of these approaches can help and some papers fall into multiple groups.

Measures Simplification
A logical first stop for scaling OT is to consider simplifying the input to the OT computation pipeline; the measures.Such a simplification can take the following forms.
Cherry Picking (measures with closed form solution): The simplest form of measure simplification is picking from a special set of measures that have closed form solutions.As discussed earlier, the one-dimensional case has a closed form (Eq. ( 11)).The whole line of Sliced OT ( §4.3) builds on this fact.The closed form solution for the 2-Wasserstein metric for multivariate Gaussian distributions with means m 1 , m 2 and covariance matrices Σ 1 , Σ 2 is given by: where Σ 1 is the unique positive definite square root matrix such that Σ Recent fundamental investigations have revealed the existence of closed-form solutions for regularized OT on Gaussians [137], , the entropic Gromov-Wasserstein on Gaussians [138] and others.Another case is the tree Wasserstein for measures that can be supported on a tree, used mostly in document classification tasks [139] as it permits fast comparison of a large number of documents.Tree-Wasserstein can be computed in linear time with respect to the number of nodes in the tree [39], [140].Note that this ignores the computational and storage requirements for transforming measures into trees.Noteworthy variants include a sliced version [141] calculated by averaging the Wasserstein distance between the measures using random tree metrics, a fast supervised version that can be trained end-to-end [142], fixed-support Wasserstein Barycenter on a tree [140] and an efficient solver [39] for unbalanced OT with the capability of processing onemillion point measures on CPU.
Since the OT solution for these favorable cases is analytically known, they are employed in many applications.Additionally, they are used to alleviate the computational burden of other OT formulations [143].Some other usage examples of simple OT problems with analytic solutions include being used as test cases for establishing the correctness of OT solvers [144], and for providing pre-training data for supervised neural OT solvers [145].Despite the advantages, cherry-picked measures can be unrepresentative of real-world data.Also, solving certain OT problems in ground cost Scaling Optimal Transport Fig. 5. OT Scaling.Exaggerating a bit, we can scale all OT methods by turning one of the knobs (i.e., optimizing components) of the OT computation machinery.This can be done through ( §5.1 measures simplification) simplifying the input measures, ( §5.3 cost) structuring the ground cost or approximating the associated kernel, ( §5.2 plan) enforcing a specific prior on the optimal plan or ( §5.4 computation) accelerating the optimization.Of course, it is also possible to turn multiple knobs simultaneously and have ( §5.6)a combined approach.
higher dimensions can still be expensive even for closedform situations.For example, calculating the barycenter of Gaussians [146] incurs a cubic complexity in the dimension [147].
Measures Projection: Low-dimensional projection is another conceptually simple candidate for addressing the curse of dimensionality.The recent findings of spiked transport model [148, Theorem 1] provide a theoretical grounding for this approach.Projection robust methods family resides in this research direction.Subspace Robust Wasserstein (SRW) [149] builds on the principles of Sliced Wasserstein (SW §4.3) and extends the projection to a subspace of dimension k ≥ 2. SRW seeks the subspace that maximizes the OT cost between two measures to capture the major discrepancy (similar in spirit to max-SW §4.3).SRW shares some properties with the 2-Wasserstein distance while being more robust to random perturbations in data.In Wasserstein Barycenter, [150] showed that projecting the distributions into low dimensional spaces can provably preserve the quality of the barycenter.Projection robust Wasserstein barycenter [151] solves the fixed-support discrete Wasserstein barycenter problem.
One issue when working with subspaces is that the subspace plan is optimal between the projected and not the original high dimensional measures.This can be addressed by considering only the plan on the original space that is constrained to be optimal after the projection.This is the key idea behind the concept of subspace detours that was introduced by [152] and later extended to other formulations such as Gromov-Wasserstein in [153].
In cases where the input measures can be grouped and the grouping is known apriori, one can further speed up SRW and gain additional advantages.Feature Robust OT (FROT) [154] speeds up OT computation and makes it more robust to noise.While similar in spirit to SRW, FROT is convex and more scalable than SRW.
Another projection-based family of methods uses iterative random projection, where informative projections are determined sequentially.In this category, there is the Projection Pursuit Monge Map (PPMM) method [155] that picks the best projection in the kth iteration guided by information (e.g., residuals) from k − 1 projections.Check [156] for a focused tutorial on iterative projection approaches.
Measures Quantization: Another simplification approach is to quantize the measures before computing OT.Examples include [157], [158] over-sampling the input measures and applying the solver only on their summary (i.e., a k representative samples acquired by k-means like quantization).Usually, it is easy to obtain more samples when the data come from large datasets or generative models.The approach is solver agnostic in the sense it can be applied to different formulations (OT vs regOT) with no change [157].Along the same line, a simplification that considers transporting the measures centroids (instead of the actual measures) can be more efficient.For example, Word Centroid Distance [19] speeds up the OT computation by representing each document as the weighted average of its word embedding.
In addition to the approaches discussed for measure simplification, smoothing is being explored.Smoothing simplifies measures by adding Gaussian noise.Gaussian Smoothed OT (GOT) [159] simply convolves the input measures with isotropic Gaussian kernel before performing the computation.Research investigation into the theoretical properties of the formulation [160] and potential applications [161], [162] is ongoing.A notable trait is alleviating the curse of dimensionality by enjoying a convergence rate n −1/2 in all dimensions [163].

Structuring the Optimal Plan
Going beyond input pre-processing and looking into the OT optimization itself, Eq. ( 4), OT computation involves an optimization over the infinite-dimensional set of couplings π ∈ Π(µ, ν) between the measures µ and ν.Enforcing a structure or regularization on the optimal plan can often lead to structural changes in the optimization problem, making it simpler to solve.For example, searching for the transport plan only in a low-rank sub-space [164], [165] can be used to reduce the dimension of the problem, speeding up the optimization process.
Plan Regularization: The regularized OT formulation ( §4.1) is the most famous group of plan structuring methods.The entropy regularizer (Eq.( 7)) biases the optimal coupling towards low entropy couplings.The algebraic properties of the added KL term enable arranging the unique solution P λ in a simple form.This form can be solved using successive matrix scaling iterations.Matrix scaling iterations are significantly faster than using linear programming for solving the original OT problem.In other words, regularization changes the structure of the optimization problem, enabling the use of a faster algorithm.However, this speedup comes with a limitation.In regularized OT, there is a trade-off between convergence speed and precision, controlled by λ.As reported by many studies [166], larger λ results in a faster convergence but poor approximation.On the other hand, very small λ leads to numerical instability.To obtain a better approximation, recent works investigate the gradual reduction in λ [167], [168].This simulates gradually approaching the exact (unregularized) OT which is recovered with λ → 0.
Low-Rank Structuring: Another family of algorithms uses the idea of low-rank approximate factorization of the plan.In this family, the plan is decomposed into a product of low-rank matrices, improving the statistical stability of OT and reducing the sample complexity.Factored Coupling (FC) [174] was the first method used to impose a low-rank structure on the transport plan.This is a popular strategy in high-dimensional statistics [32].Their low-rank transport plan approximates the optimal coupling that builds on the assumption that µ and ν may only vary across a few dimensions.FC transports factored partitions or "soft clusters" (in which the clusters can include fractional points) of the measures, and the problem is solved as regularized 2-Wasserstein barycenters.Building on this, the authors of the Latent OT (LOT) method [175] observed that the low-rank transport plan is also more resilient to outliers and noise.A low-rank plan transports measures through common "anchors" and is observed to be better at preserving clusters and tolerating outliers.In FC, the anchors are shared between the measures, and their number is dictated by the rank r.To improve the interpretability of the plan, LOT allows each measure to have a different number of anchors.However, the anchors are additional hyperparameters that need to be chosen carefully.A generalization of FC is the Low-Rank Sinkhorn Factorization method [164]; a formulation compatible with all ground costs and not only the squared Euclidean distance.In the Low-Rank Sinkhorn Factorization formulation, the coupling P is explicitly factorized into a product of two sub-couplings Q and R linked by a common marginal g, The solution is obtained by jointly optimizing Q, R, and g.The demonstrated computational improvement of the factorization of the transport plan in various applications motivated a further investigation [176] into its theoretical properties.A debiased version of the Low-Rank Sinkhorn Factorization method [164] was introduced in [176] and was shown to interpolate between Maximum Mean Discrepancy [34] and OT.
An obvious issue with these approaches is the poor approximation when the low-rank assumption is violated (for example, when µ can be obtained by permuting the support of ν).To attempt to circumvent this limitation, Liu et al. [165] proposed OT approximation in which the transport plan can be decomposed into the sum of a low-rank and a sparse matrix as a possible solution.

Ground Cost Approximation
Application-wise, the fact that the OT distance admits a customizable ground cost has been leveraged in many applications.For example, in domain adaptation, we can bias similar (e.g., same-class) source domain points to be transported together to the target domain withoutsplitting, e.g., using a submodular cost [177].Computationally, having a structured cost ( perhaps one that can be decomposed) can be leveraged to improve the execution time of OT.
Structuring Ground Cost: Enforcing a structure on the ground cost can be seen as an implicit way of reflecting that structure on the optimal plan.For example, a structural infinity in the cost matrix maps to a structural zero in the optimal plan, which can be used to sparsify the optimal plan.This relationship between the cost and the plan can be exploited to speed up the OT computation, reducing the memory and time requirement of the various algorithms.Relatedly, in applications that deal with a large number of samples (e.g., document and image retrieval) such that storing the pairwise distances C i,j for all i and j samples can be prohibitive, a useful structure that can be used is the ground cost saturation.Specifically, we threshold the ground cost C i,j = C max if c / ∈ r where r denotesthe relative (nearest) samples.From a bi-partite graph matching (between the measures) perspective, this is equivalent to constraining edge connectivity and can result in significant savings.The approach was used by [20] and [178] for speeding up document and image retrieval; respectively.In Multi-marginal OT (MOT), the structure of the cost tensors is exploited to speed up the computation.In sensor fusion and tracking, [179] shows that the cost functions enjoy a structure that allows sequential decoupling, central decoupling, or a combination of both.Computing the Sinkhorn projections in these cases can be done efficiently [180].
Kernel Matrix Approximation: another popular strategy is structuring the ground cost C such that the dependent kernel matrix K(C) can be approximated efficiently.The Sinkhorn algorithm uses the kernel matrix K(C) by multiplying it (or its transpose) by two vectors in each iteration.In practice, K can be very large.This drives the computational cost of the whole procedure (Figure .6), and a naive matrix-vector multiplication will cause each iteration to have a quadratic cost.It is natural to think of reducing the computational burden of the matrix-vector product.
Reducing the computational cost of kernel matrixvector product is a common problem with several known workarounds, often used in kernel-based machine learning algorithms such as Gaussian processes.In OT, the same computational tricks were adopted by some works.Nys-Sink [54] combines Nystr öm approximation [181] with Sinkhorn scaling.LCN-Sinkhorn [169] follows in the same direction and notes that the exponential used in K makes its entries negligibly small everywhere except at each (support) point's closest neighbors.This motivates sparse approximation of K where only nearby neighbors are accounted for.This is equivalent to approximating the cost matrix C using a "generalized sparse" cost matrix C sparse where C sparse ij = ∞ for "far" points, i.e. the structural zeros of the sparse matrix have a default value of ∞.Filtering "near" points is done via Locality Sensitive Hashing (LSH) [182]; points within a certain distance r 1 are much more likely to be assigned to the same hash bucket than points with a distance r 2 > r 1 .Ultimately, this yields a version of Sinkhorn that can scale loglinearly with the number of points (i.e., O(n log n)).Scetbon et al. [170] considers the ground cost function of the form kernel-vector product (KVP)
c(x, y) = − log⟨ϕ(x), ϕ(y)⟩ where ϕ is a map from the ground space onto the positive orthant R r + , with r ≪ n.This results in a decomposable kernel (K in algorithm 1), making Sinkhorn computation more efficient.

Computation Partitioning
Computation Partitioning refers to breaking OT computation into smaller OT problems where the outcomes of the subproblems are aggregated into a solution of the original OT.Check the top panel of Fig. 7 for visual representation.Computation Partitioning is done in the following ways: Multi-scale, recursive, or hierarchical: several similar methods were proposed in the literature, relying on arranging the points in a tree structure of some sort."Multi-scale" OT was proposed in [183], [184], [185].In [184], the transport problem is decomposed into a sequence of problems with varying scales and solved in a top-down fashion (i.e., coarseto-fine scale).These methods proved useful when the ambient dimension d is small.Yet they fail to scale to high dimensional settings as they utilize space discretization.
In "recursive" OT, a similar divide-and-conquer strategy performs recursive partitioning of the measures/data.Examples in this category are the recent techniques for scaling GW to big datasets [6].The general theme is to partition the measures into smaller blocks, match the blocks based on their representatives (e.g., centroids), then match the paired blocks by proceeding recursively as needed (Fig. 7 top).MREC [186] adopts this approach using black box clustering for blocking and cluster centers as block representatives.Quantized Gromov-Wasserstein (qGW) [187] supports this simple approach with theoretical guarantees based on [188] and proposes an algorithm that scales to datasets containing up to 1 million points.It should be noted that blocking the representatives' choices (e.g., graph data community detection algorithms [189] and maximal PageRank [190]) is crucial, and careful selection of the methods might be needed depending on the data.A similar tree structure is leveraged in "hierarchical" OT (check §A.2), but the structure is more intrinsic to the data itself.
Mini-batch: Sommerfeld et al. [191] shows that the average OT distances (computed by any black-box solver) on random sub-sampling of the input measures is a fast approximation of the exact OT.Theoretical non-asymptotic guarantees reveal that the approximation error can be independent of the size of the problem in some cases.This surprisingly simple approach can scale well in practice using one percent of the exact computation in images [191].Recent OT acceleration strategies along the same line take inspiration from deep learning.Minibatch OT mimics the typical training of deep neural networks by performing the computation on mini-batches.The empirical success of mini-batch OT motivated recent theoretical investigation [192], [193], [194].One concern here is the minibatch strategy can lead to an undesirable smoothing effect for which Fatras et al. [195] suggests relying on unbalanced OT formulation to counter the issue.

Solver Acceleration
Solver acceleration targets optimizing the computation process either by learning to predict the solution quickly (usually for specialized problems) or admitting a relaxed version of the original problem, which is easier to compute.In this category, we identify convergence acceleration, learned solvers, and relaxed solvers.A depiction is shown in Fig. 7 bottom panel.
Convergence Acceleration.Convergence Acceleration aims at bringing the solver to convergence in fewer iterations.In regularized OT, for example, we can use a number of strategies to accelerate the Sinkhorn algorithm (see Fig. 6 bottom row).The Sinkhorn algorithm (Alg. 1) is fundamentally a fixed point iteration (FPI) algorithm for finding fixed points (u * , v * ).Therefore, all FPI algorithm acceleration techniques can be investigated.In [171], the classical Anderson acceleration for FPI algorithms was used.Anderson acceleration can speed up FPI algorithms by making them more robust to residual oscillation.Besides Andreson acceleration, the over-relaxed Sinkhorn method [172] adopts a relaxed FPI algorithm where the auxiliary vector update u l+1 ← a Kv l is replaced with the relaxation u l+1 ← u 1−ω l ⊙ a Kv l ω , Fig. 7. Accelerating OT Computation.Without optimizing any of the previous components(i.e., the input measures, the plan, and the cost), the OT computation can be accelerated by either breaking it into smaller OT problems (top) or accelerating the OT solver by predicting the solution quickly (bottom).Notable schemes for OT computation partitioning (top, §5.4) include the decomposing the problem according to a hierarchy ( [184]), applying computation in min-batch manner ( [194]) and recursive partitioning ( [186]).Solver Acceleration (bottom, §5.5 leverages the recent neural solvers ( [196], [197]), accelerates the optimization (e.g., Anderson Acceleration [171]), or seeks solution to a relaxed version of OT problem ( [19], [198]).
where l denotes the iteration and ω > 0 is the chosen relaxation parameter.[172], [199] show that for ω > 1 the algorithm can converge faster than traditional Sinkhorn.Recently, [200] identified an apriori range for ω for which global convergence is guaranteed.Other fixed point acceleration techniques were investigated in [201].In the Sinkhorn literature, better initialization has been mostly neglected as the potential speedup seems insignificant given the convex nature of regularized OT.However, a recent work [173] showed that careful initialization of the Sinkhorn subroutine (in a neural network) initialized by pre-training on problems with closed-form solutions (Gaussians and 1D) can lead to faster convergence.The technique can also be complementary to previous techniques.The previous approaches have little in common apart from seeking faster convergence.Yet, they all are holistic approaches that consider the inner workings of the solver and a potential revisit of the original formulation [202] before formulating an acceleration strategy.
Learned Solvers: A promising direction that has been accumulating interest recently is based on the idea of "neural solvers".Neural solvers are neural networks that attempt to predict the OT output from the input measures.Data-driven neural solvers reduce the expensive traditional computation by leveraging knowledge from past OT solutions.By constructing a model and training it on ground truth OT solutions and parameters, we can quickly predict OT solutions during test time (single forward pass) without working it out from the measures from scratch (iterative) every time.Courty et al. [105] learns a supervised autoencoder that embeds MNIST images such that the Euclidean distance in the latent space approximates the Wasserstein distance.The learned approach becomes more attractive in deployment scenarios that require solving OT repeatedly (e.g., coupling stream of image pairs).In addition to the computational improvements, a notable advantage of neural [203], [204] and stochastic [205] solvers is the ability to cope with parametric continuous µ and ν (densities).Empirically, neural solvers show great success in generative modeling ( §B.2) and domain adaptation ( §B.1).
Grey-box Solvers: These approaches replace a component of the OT formulation by a neural network, re-parameterizing the optimization to be in terms of the weights and biases of the neural network.Some of these semi-transparent models balance the regular OT computation, which can be expensive and hard to scale, and the data-rich (yet opaque) black-box learned solvers.So far in the literature, the Kantorovich-Rubinstein duality Eq. ( 25) and the semi-dual formulation Eq. ( 19) [12], [206] have received the most attention.This is because the (semi-)dual solution is comprised of 2 pointwise solutions: a dual value for each x in the support of the measure µ, and another dual value for each y in the support of the measure ν.This pointwise structure motivates using a neural network to replace the mapping from x and/or y to their respective dual solutions.However, formulation tricks need to be employed here to eliminate any constraints or non-smoothness in the (semi-)dual formulation.Current works alleviate this issue either by relying on soft constraints [204] or structuring the neural solver to ensure feasibility [12], [196].
Grey-Box Solvers with Soft Constraints: To understand the strategy of these solvers, we revisit the Kantorovich dual: The formulation shows the OT computation as maximizing an expectation term, thus suggesting that the maximization can be achieved using stochastic gradient methods on sampled batches from the coupling µ × ν.However, it is not clear how the constraints on ϕ and ψ can be satisfied in this scheme.It turns out that the dual of the entropy regularized OT problem is unconstrained and doesn't exhibit this problem, Eq. ( 6), which can be obtained by the Fenchel-Rockafellar theorem: where R λ = −λe 1 λ (ϕ(x)+ψ(y)−c(x,y)) is a penalty term.Intuitively, R λ represents a smooth version of the constraints in Eq. ( 18).Thus, we bypass the need to enforce the hard constraints.Given that the problem is now unconstrained, one can simply parameterize the dual variables ϕ and ψ using neural networks and use stochastic gradient methods to maximize the objective in (19) [206].
Alternatively, the semi-dual formulation Eq. ( 19) of the unregularized OT problem is also unconstrained, but its objective is non-smooth, so a smooth approximation of the semi-dual formulation of the unregularized OT problem can also be used.The semi-dual's objective is not smooth because it uses the max operator.One way [68] to alleviate this is to construct a smooth approximation of the max operator using the conjugate function of some strictly convex regularization/penalty function Ω over the unit Simplex ∆.The entropy or L2 norm are possible choices for Ω.The following function is known as the conjugate function of Ω(x): The above conjugate function is smooth and differentiable in x and is equal to max(x) when Ω(y) = 0 for all y (i.e.no regularization).Replacing max with max Ω for some choice of Ω in the semi-dual results in a smooth unconstrained optimization problem.The gradient of the above function can also be analytically derived for common choices of Ω. Grey-Box Structured Solvers: These solvers rely on structural constraints in the neural model itself (e.g., specific layers, training strategies, etc).Thus, we bypass the hard constraints by simply relying on the constrained structure of the neural network.A prominent example here is W-GAN solver [12].The potential f w in Wasserstein-1 formulation (Eq.( 7) in §B.2) has to satisfy a Lipschitz constraint.This is done by forcing compactness on the parameter space.In practice, this is achieved by weight clipping after each gradient update.Another way to import structure that ensures constraints feasibility is to use special layers.What gives rise to this approach is the formulation in Eq. ( 18) which allows using Input Convex Neural Networks (ICNN) [207].Specifically, one can rephrase Eq. ( 18) as an optimization problem over the space of convex functions while eliminating the distance constraints.This trick appears in [196], [197], although elements of the convexification trick first appear in Villani's work [45].Reparameterizing the problem in terms of convex functions allows one to leverage a deep learning approach to solving OT problems.The formulation is given by where and CVX(µ) is the space of all convex functions integrable with respect to µ.The space CVX(µ) can be parameterized by input convex neural networks (ICNNs).Generally, one can compose ICNN from two principal blocks: linear blocks consisting of linear layers, and convexity preserving blocks consisting of linear layers with non-negative weights.The key idea here is to use special parametric models based on deep neural networks to approximate the set of convex functions in Eq. (21).Note also that Eq. ( 21) involves only a single convex function ϕ.ICNNs were used for computing OT maps [196] and Wasserstein Barycenter [110].A number of variant architectures conforming to the same guidelines were also proposed, e.g., ConvICNN [208] and DenseICNN [203], [208] were used in generative networks ( §B.2).
A question pertinent to grey-box solvers is how to differentiate through the OT computation?The OT plan π * : X × Y that attains the infimum (Eq.( 2)) (or any discrete and/or regularized variant) is a function defined implicitly as a solution to an optimization problem.As such, given sufficient smoothness, functional operators such as differentiators for such an implicitly defined mapping can be examined.Given an iterative algorithm that estimates the optimal plan, one may estimate such derivatives using automatic differentiation by unrolling the steps of the algorithm [209] and applying the chain rule to compute derivatives through the algorithm.A popular example of this approach is unrolling the step of the Sinkhorn algorithm [65].Clearly, this approach becomes more expensive as more iterations are required for the algorithm.
Another approach is to use the implicit function theorem [210] to compute derivatives without any knowledge of the steps involved in the algorithm.This idea has been explored more generally for solutions to optimization problems [211] and fixed point equations [212].When used inside neural networks as layers, these are called implicit layers.Layers that solve optimization problems such as OT problems, are sometimes called Deep Declarative Networks [211].
A unified framework for implicit differentiation of Sinkhorn layers is presented in [213].They provide expressions for the implicit derivatives of loss functions that accept the result of the pre or post-composition of a neural network with a Sinkhorn layer.This allows for gradients to be backpropagated through the neural network pipeline so that variants of stochastic gradient descent can be used to update the neural network parameters.Such implicitly differentiated Sinkhorn layers have been used to solve the blind Perspective-n-Point problem [214] and in few-shot image classification [215] Exploiting the problem structure with implicit differentiation can further increase computational gains significantly, as shown by [216].However, it requires caseby-case handling, lengthy mathematical derivations, and custom implementation.Modular implicit differentiation [217] presents a potential technology for handling such difficulty.
But, how good are the neural solvers ?While a neural solver can result in computational savings for a pre-specified class of OT problems on which it was trained, neural solvers can fail to approximate OT solution [218] or exhibit poor generalization beyond the training dataset (see future directions §6).The accuracy of neural solvers is a topic of recent investigation.Recent studies [208] show that neural solvers might not faithfully recover the optimal map yet continue to perform well in downstream tasks.Architectural constraints (e.g., ICNN) also can be a concern.Despite architectural restrictions, [219] demonstrated the richness of this class by showing its capability to accurately capture the temporal dynamics of complex physical systems.Yet, [220] noted that ICNNs are difficult to train, which led [145] to investigate a better initialization.Beyond the approximation capability, the neural solvers can have characteristic variations.For some solvers, obtaining the OT plan/map is possible.For example, from the optimal dual potential ϕ and ψ in Eq. ( 19), one can recover the optimal plan using the primal-dual relationships [205].For other solvers, obtaining the correspondence information (through a plan or a map) isn't feasible.
Relaxed Solver: Sometimes, the context in which OT computation is used provides clues for optimizing the computation.If the OT distance is not the ultimate target, then one can consider relaxing OT computations according to subsequent application requirements.Relaxed Word Movers Distance (RWMD) [19] computes a lower bound of WMD by relaxing one of the marginal constraints on the coupling.This reduces the running time complexity from cubic to quadratic in the number of data points.Linear-Complexity RWMD [198] further pushes it down to a linear complexity.

Combined Approaches
While each of the previous strategies can bring benefits on the computational/statistical side, it can be noticed that there is always an associated side effect.Consider, for example, the entropy-based regularization ( §5.2).It produces an easyto-parallelize algorithm and a well-behaved optimization problem (e.g., unique optimal coupling and a differentiable objective).Yet, it can produce a blurry optimal plan ill-suited for some applications [68] or numerical instability issues depending on the magnitude of the regularization λ.Given this, a valid line of thought is combining multiple of the previously discussed (compatible) strategies above in a single framework and aggregating the benefits.Schmitzer et al. [168] combines ideas of λ-scaling heuristic ( §5.3), an adaptive truncation of the kernel, and the multi-scale coarse-to-fine scheme ( §5.4) in a single algorithm to obtain the combined benefits.[164] proposes structuring both the coupling ( §5.2) and the kernel matrix ( §5.3) to be low rank.Similarly, [221] used the same strategy to obtain a linear time complexity of Gromov Wasserstein.Korotin et al. [147] combines neural solver with the fixed point approach from [106] to Wasserstein-2 barycenters of continuous measures.

FUTURE DIRECTIONS
Circumventing OT computational challenges almost a decade ago [46] through optimization advancements has led to its emergence in machine learning.Subsequent advances [205] in the same direction enabled embracing OT in many applications.Recently, with more adoption, computational limitations have become more pronounced in the big data era.OT is being used increasingly as a component of various contemporary learning pipelines [222], [223].Algorithms with high complexity (e.g.super cubic) can not match scalability ambitions and would be rendered as computational bottlenecks.Notably, theoretical advances are no less important in this aspect.They pave the way for principled scaling rather than 'hacking the way' to a better complexity.Below, we highlight, in broad strokes, some challenges we see as worthy of future research investment.
Computational complexity: The emphasis of this survey is on the longest-standing research challenge in OT, the computational budget.Despite the algorithmic advancements, more research is needed to develop computational approaches better suited for the massive amounts of data available.A recurring pattern observed in computational tricks in §5 is the presence of some sort of compromise.In general, the computational tricks involve a transformation of the original transportation problem into an easier approximate version that potentially works better in a specific niche of problems.Unfortunately, this process typically incurs losing some connections and properties of the original formulation.Research in this direction will continue to flourish.
Research at the intersection of optimization and deep learning [224] presents one promising direction in this regard.The recent trend of learned optimization [225] and amortized optimization [226] builds on the intuition that we can leverage the structure of multiple related optimization problems and learn a model to predict the solutions faster (without running the naive optimization from scratch).Evidently, this was shown to produce reliable results with orders of magnitude faster than classical optimization.Recently, some works have started exploring this direction [227] [228] in the OT literature.Interesting research questions in this direction are generalizing predicted solutions to out-of-(training data) distribution and, more importantly, addressing the lack of theoretical convergence guarantees in this setup.Formulations that can benefit the most from these efforts are the complex (higher order) formulations such as multi-marginal OT, Gromov Wasserstein, and hierarchical formulations.To date, leveraging these formulations in learning tasks is subject to careful judgment that weighs the potential benefits against the computational price.
Benchmarking.Given the wealth of OT formulations and solvers, standardized evaluation of solvers would greatly benefit future research.Yet, very few works [203], [229], [230] have addressed OT solver benchmarking in the literature.These works are mostly [203], [229] focused on discrete OT solvers with low-dimensional inputs that don't reflect the current nature of big datasets.Benchmarking OT solvers is challenged by the diversity of the OT landscape in terms of formulations and machine learning applications considered.One potential direction is to initiate specialized benchmarks that focus on specific tasks in which OT is used extensively (e.g.domain adaptation §B.1 ) or specific group of formulations.Another benchmarking challenge is designing metrics that assess the OT computational accuracy.This is difficult given that in some cases (such as OT between continuous measures [231]) it isn't straightforward to obtain the ground truth OT solution including the cost and plan.Recent efforts [231] address this issue for special cases.It would be interesting to see more research in this direction.Another issue with the metric design is that the performance on the downstream task isn't an indicator of a good approximation of the Wasserstein distance [218].
New data types.Extending the scalability of OT beyond data quantity to new data types is a promising future direction.There are recent efforts in this direction such as applying OT on functional data [133] Complex data such as structured objects (e.g.graphs [7], [129] and trees) and sequences and temporal streams [128] have received little attention in the literature.OT on structured objects requires generalizing the notion of transport to accommodate both nodes and relations while ensuring other properties.Current approaches imitate OT on these objects either by having multiple levels of computation [131] or embedding the data in a new space that preserves the structure (e.g.hyperbolic embedding [232]).Both approaches come with increased computational challenges.A native formulation that embraces the structure is yet to be realized.Potential advantages of native formulations include having the structure reflected in marginal constraints and the capacity to perform interpolation and Barycentric mapping on structured data.
Novel applications and better tools.By treating the weights of deep models as probability measures, one can cast the framework of OT on problems involving transportation among (trained) model objects.This can be useful, for example, in model fusion and interpolation without requiring re-training.For example, OT can be used [233] for fusing pre-trained neural networks as an easy 'one-shot' transfer mechanism.Another use of this can be in federated learning [234], in which multiple clients collaborate to train a global model in a coordinated server without exchanging their local data.Traditional clients model aggregation (i.e.averaging) on the coordinated server can be replaced with OT-based advanced averaging (e.g.Wasserstein barycenter) that can address clients heterogeneity [235].Along the same line, OT can be used as a geometry-aware analytics tool for studying the weights of deep models and addressing issues of interpretability.
A number of powerful OT tools such as POT [144] and OTT [236] exist.Tools like OTT and KeOps [237] are leverag-ing GPUs and TPUs.KeOps can handle a very large number of samples by storing computation as formulas and streaming them on the fly.Yet, it is difficult to extend the package and use it and deep learning [195].Looking forward, overcoming the limitations of current tools will drive wider applicability of optimal transport.

CONCLUSION
Optimal Transport provides a coherent mathematical toolbox for formulating and solving problems whose core is about similarity quantification and correspondence estimation between objects.Problems like these, which arise naturally in machine learning, are addressed by transporting collections of samples or features in a geometric sensible way.We discussed various transportation mechanisms (i.e.formulations), their characteristics, their application value, and their limitations.Then, a formulation-agnostic taxonomy for scaling OT computation to large datasets was presented.We reviewed a broad spectrum of techniques ranging from those native to the optimization framework (e.g.additional constraints or structures) all the way to those inspired by recent advances in machine learning (e.g.neural learned solvers).

APPENDIX A EXTENDED BACKGROUND A.1 Kantorovich formulations
Recall from the main paper the Kantorovich formulation of the OT problem, The benefit of the Kantorovich formulation in Eq. ( 22) is that multiple computational approaches and interpretations of the same problem can be obtained.The remainder of this section is devoted to discussing some of these reformulations and interpretations.
We can invoke the basic concept of "duality" 6 from convex optimization to obtain a sup (dual) formulation equivalent to the inf (primal) formulation in Eq. ( 22).The Kantorovich dual of Eq. ( 22) [44, Theorem 5.10] is : where The dual is obtained by noting that Eq. ( 22) is a linear programming problem.For more on linear programs, dual problems, and the Kantorovich problem, refer to the appendix section on optimization ( §C).Given that Eq. ( 23) bears little resemblance to Eq. ( 22), one might wonder how intuitively the original minimum-effort transportation can be interpreted as a maximization problem.This can be understood by imagining that the transportation is carried out by a third-party beneficiary.Their goal then would be to maximize the total sending X φ(x)dµ(x) and receiving Y ψ(y)dν(y) costs as long their rate stays below a 6.Any convex minimization problem admits a dual maximization problem whose optimal value lower-bounds that of the minimization problem.The optimal values coincide if the problem is a linear program.baseline cost c(x i , y j ) for all the points x i and y j .Otherwise, the interested parties would opt out (assuming a competitive market climate).
Note also that in Eq. ( 23), we no longer have the optimal plan π but rather the variables φ(x) and ψ(y), known as the dual potentials.They still encode the optimal plan and the optimal plan can be recovered from them.However, the dual problem Eq. ( 23) can be more computationally tractable than the primal problem Eq. ( 22).In the discrete problem in §2, the n × m memory footprint of the optimal plan is reduced to a linear (w.r.t the input) footprint required by the dual potential.This can be very useful when scaling the computation to thousands or even millions of points.On the other hand, we still have n × m constraints to check against when solving Eq. (23).
A simplification that makes the objective dependent only on one potential φ (or ψ) instead of two potentials φ and ψ is called the semi-dual: where ψ(y) was substituted with φ c (y) = inf x [c(x, y) − φ(x)] which is the c-transform of φ.The term semi-dual was named as such by [238] since this formulation underlies semi-discrete OT methods.Notably, the flexibility by which this formulation allows us to pick one potential over the other can be useful in applications.In some problems (e.g.crowd counting [74]) that require transporting a sparse measure (e.g. a few annotation points) to a dense one (e.g.many image pixels), one can eliminate the dense potential and optimize over the other.
If additionally the cost function c(x, y) is a distance function d(x, y) satisfying the triangle inequality, e.g.c(x, y) = d(x, y) = ∥x − y∥ 2 , the c-transform becomes φ c (y) = −φ(y) where φ(y) is a 1-Lipschitz function w.r.t the distance d: We denote the set of 1-Lipschitz functions w.r.t the distance d by L d (1).So the semi-dual formulation [44,Theorem 5.10] becomes: This is known as the Kantorovich-Rubinstein formulation or the Kantorovich-Rubinstein norm [239].Note that this only applies to c(x, y) = d(x, y), not to the more general (and popular) case c(x, y) = d(x, y) p for some power p.
The exact formulations discussed here are typically solved using classic linear programming solvers.Unfortunately, these methods can't exploit the parallel computing power of modern hardware.As we will see later, approximate formulations that embrace parallel computation and leverage GPUs (e.g.Regularized Optimal Transport §4.1 ) had gained more popularity in many recent applications.

A.2 Hierarchical OT Formulation
Motivation: Hierarchical Optimal Transport (HOT) extends the concept of OT by introducing a hierarchical structure to the problem.In HOT, the transportation problem is considered at multiple scales or levels of granularity.By grouping data points with similar features or predefined structures, the transport task can be handled at the group level before dealing with individual data points.This approach is particularly useful when dealing with complex data structures with inherent hierarchical relationships, such as in certain types of data analysis and machine learning tasks.
Formulation and Characteristics: There are several techniques known as HOT.It is an OT problem where the ground metric itself is defined through an OT problem.Generally, these [143], [240] involve a two-tiered approach: solving smaller, local assignments within each group and then addressing a larger, overarching assignment across groups.This layered approach simplifies the overall task and allows for flexibility in methods used at each level.
Applications Problems involving hierarchical data, like matching documents, aligning datasets, and adapting to different domains, can benefit from using hierarchical optimal transport (OT) methods.One example is HOTT [21], which uses the natural structure of documents (documents are modeled as distributions over topics that are modeled as distributions over words) to make comparing documents using OT faster.Similarly, HiWA [240] approaches domain adaptation in the multi-modal setup as a hierarchical transportation process.A notable development here is the introduction of a scalable ADMM algorithm, which can be divided across cluster pairs and allows for parallel computations.This two-level transportation approach has been applied to other areas such as unsupervised domain adaptation [241], semi-supervised domain adaptation [242] and measuring dataset distances [143].

APPENDIX B APPLICATIONS OF OT IN MACHINE LEARNING
In this section, we review the applications of OT.We focus primarily on domain adaptation and generative models.

B.1 Domain Adaptation (DA)
Domain adaptation is a very vast and vibrant sub-field of machine learning.Its motivation is the frequent violation of a classical learning assumption in the real-world application, namely, the assumption that the data used for model training and testing data come from the same probability distributions.A common example of this assumption failure is an image classification task in the acquisition conditions causing a non-negligible (distributional) shift between training and test datasets [243].In this vast topic, one can find these excellent recent surveys focusing on specific themes of domain adaptation whether it is generalization [244], open set adaptation [245] or theoretical advances [246].Here, we rather collect the many domain adaptation paradigms to which optimal transport contributes.It should be noted that the subtle but crucial differences between these domain adaptation paradigms can be confusing.Figure 8 provides a quick visual connection between various domain adaptation paradigms and highlights the differences.
Formally, domain adaptation considers two domains source domain and target domain denoted as s and t; respectively.The goal is to learn a robust model given that the source and target domains have different data distributions (i.e.D s ̸ = D t ).In the most common setup, the labels for the two domains are assumed to be available in abundance.In this case, it is called supervised adaptation.Yet, the problem also comes in many other variations.For example, the target data can be unlabeled (unsupervised adaptation) or partly labeled (semi-supervised adaptation).The classes in the source and target domain can be disjoint (open set adaptation).The learning goal, in this case, may go beyond merely rejecting new classes as 'unknown' to include learning the new classes from one (one-shot adaptation) or multiple (few-shot adaptation) samples.When the target domain is very dynamic, domain adaptation can be done continuously (continuous domain adaptation).All the previous configurations assume that the training samples from the source domain are available during adaptation.This is unrealistic in cases where it's impractical to take training data to deployment due to privacy or storage considerations.This motivates source-free domain adaptation, which makes different assumptions as shown in Figure 8.
Since the first use of OT in domain adaptation problems by [2], OT has been considered a strong baseline for domain adaptation.The mathematical flexibility of optimal transport formulations allows for catering to the various domain adaptation configurations.Before proceeding with the review of these works, it is instructive to see why one might consider OT in domain adaptation problems.The following are potential motivations: • Distributions matching is at the core of domain adaptation techniques.Given that, the problem naturally lends itself to optimal transport.Moreover, the previously discussed OT properties (e.g., geometry awareness and diversity of formulations) can offer a new perspective on the problem.

•
As a metric on the space of probability measures, the Wasserstein distance was used to derive theoretical bounds 7 on the domain adaptation error [247] and deep learning generalization [9].Thus, many techniques building on OT for adaptation [248] provide adaptation theoretical guarantees.
• It is possible to encode a prior structure useful for domain adaptation (e.g., preserving labels or graphical relations between samples [2]) in the OT mapping between the source and target distributions.
• A degree of interpretability can be gained by having explicit correspondences, represented in a transport plan, between domains/datasets.This might be even more true in works that consider the matching in the raw input space [143], [249] rather than (a less interpretable) feature space.
The above factors are possible reasons for OT to be a versatile building block in domain adaptation areas.Below is a selective review of these areas.
Transferability Assessment using OT.Transfer learning across domains or tasks is one the most promising and widely adopted machine learning methods and it remains to be a topic of extensive investigation [250].Yet, judging the effectiveness of a specific transfer learning setting prior to the actual transfer in a quantifiable way has received much less attention.In multi-task learning, one may want to save the joint training burden of weakly related tasks as the final performance will be actually worse compared to single-task training.Also, this can be used in source model selection [251] (i.e., select from a pool of source pre-trained models the best one for a target task).Rather than relying on human experience for this kind of relatedness assessment, estimating a transferability metric [252], [253] could be much more ubiquitous and practical.General metrics for transferability can be categorized into the analytical transferability metrics  [253], [254], [255] and empirical transferability [256].The first has stricter assumptions but is computationally efficient and enjoys theoretical generalization bounds.The latter is scalable but lacks theoretical guarantees.
Optimal Transport-based Conditional Entropy OTCE [252] was the first analytical transferability metric in a simultaneous cross-domain cross-task.It adopts the popular "retrain head" transfer model in which a source model with a frozen feature extractor is transferred to target data by fine-tuning only the classifier head following the extractor.The proposed metric predicts the model's performance after the tuning without actual re-training.
Unsupervised DA (UDA) using OT.Several works applied OT in UDA in which labels are available only for the source while samples are available from both the source and target domain.Courty et al. [2] established a framework that was extended by others later for using optimal transport as a principled method for UDA.The key idea is to unify the training and test samples in one domain in which the classifier can be learned.First, OT aligns the empirical measures of source and target domains.A process that can be done without using any labels.Then, the training samples are transported to the target distribution using the estimated transport plan.Finally, a classifier can be learned only on the target domain.A graph-inspired regularization is used to preserve the aftertransportation proximity of the source samples sharing the same label.
Instead of sample alignment, follow-up works seek to align the model's internal features corresponding toD s and D t .Sliced Wasserstein Discrepancy (SWD) [97] builds on Maximum Classifier Discrepancy (MCD) [257] that maximizes the discrepancy between task-specific classifiers as a part of an adversarial adaptation strategy.As the discrepancy is a key component of MCD, SWD proposes upgrading it with the Sliced Wasserstein distance.Interestingly, by merely replacing the L 1 discrepancy in MCD with SWD, the system consistently delivers better performance in image classifications, segmentation, and object detection tasks.
Enhanced Transport Distance (ETD) [252] uses a neural solver ( §5.5) for OT-based features alignment and notes the mini-batch instability issue.Specifically, optimizing OT on mini-batch can lead to inconsistent transport plans across iterations.To mitigate the bias caused by mini-batch, the system weighs (calibrates) the ground cost using a networkintegrated attention.
Wasserstein Guided Representation Learning (WDGRL) [258] proposes augmenting Adversarial Domain Adaptation (ADA) architectures, such as DANN [259], with additional Wasserstein loss to guide the domain adaptation process.ADA extends the concept of Generative Adversarial Networks (GAN) to domain adaptation.While GANs leverage the adversary to generate samples indistinguishable from the real data, ADA seeks a mapping that makes source features indistinguishable from target domain features.WDGRL adds a WGAN-like domain critic (WGAN discussed in §B.2) for calculating the 1-Wasserstein distance between the source and target features.The motivation behind this is the possibility of having vanishing gradients when the supports of the mapped features are disjoint, a setting in which OT is assumed to work better compared to other divergences.In fact, WDGRL significantly outperforms DAN in this setting, albeit the results are based on synthetic data.
Unsupervised DA using OT (multi-source).JCPOT [260] considers the mutli-source DA problem with a focus on cases where the label proportions between source and target domains are different.Target shift (or class imbalance) has practical values in applications like anomaly detection.Theoretically [261], the error on the target domain can be minimized by re-weighting the distribution of the source classes to match the target one.Motivated by the impracticality of acquiring the target class distribution, an optimal transport matching that automatically acquires the weights is proposed.Entropy regularized OT ( §4.1) is used for mapping between every single source and the target, and the problem is formulated as a Bregman projection that has a simple solution (proposition 1 [260]).The approach shows improvement over vanilla OT on the satellite imagery dataset.
Open  noticed that aligning the whole source and target domains followed by previous OT methods is actually inferior to the adaptation in an open set setting.Causing a problem known as negative transfer [266].Thus, Joint Partial Optimal Transport (JPOT) [265] employs partial optimal transport to align only "well-matched"/most-correlated samples and avoid far-fetched pairings that can cause the negative transfer.To obtain the partial matching, the mean cost of the complete optimal transport matrix [267] is used as a threshold.Coupled pairs whose distance is greater than this threshold are marked as non-transferable.
Open Set DA using OT (zero-shot).For the more challenging cases of totally unseen classes (i.e., zero shot), GZSL [268] proposes synthesizing features for the unseen classes from the corresponding auxiliary textual descriptions.The feature generator is encouraged to match real feature distributions by minimizing regularized OT distance.OT-based matching was motivated by the concern of sampling outliers (brought by domain shift) that can make point-wise matching approaches less robust.IPOT [167] is used instead of Sinkhorn as it maintains near-linear time complexity while being less sensitive to regularization weight.
Multi-modal and heterogeneous DA using OT.Applications that require models to operate across different modalities motivate the need for heterogenous domain adaptation.Clearly, aligning features living in heterogeneous (multimodal) spaces is more complicated than the alignment of homogeneous (unimodal) features.In a neural population decoding application, HiWA [240] used data from populations of (human brain) neurons to predict the arm movement direction in reach-out tasks.This involves aligning two domains of different modalities; the neural data distribution and the 3D movement data distribution.Given the difficulty of the adaptation, they suggest leveraging the additional clustering structure of the data to regularize OT and constrain the solution space.Thus, a nested formulation of OT is proposed for aligning clustered and multi sub-spaces data and the framework hierarchical optimal transport ( §4.6) was invoked.
Knowledge Distillation (KD) is a common option for cross-modal domain adaptation.That is; aligning latent features of two models, each of them trained on a different modality yet solving the same or related tasks.Wasserstein Contrastive Knowledge Distillation (WCKD) [13] encourages a student (compact) model features h s to be distributionally similar to those of an advanced teacher h t .Specifically, a GAN-like discriminator ensures that source samples are not distinguishable from target samples.The alignment is done using a neural version of OT (WGAN [12]).

USPS
MNIST spatial transofrmation optimal plan attention Fig. 10.Interpretability of optimal transport plan.Optimal plan is used to reveal in (a) features (represented by raw pixels) correspondences between USPS and MNIST datasets using an unsupervised approach [249] and (b) images object-to-word correspondences [270].The latter was shown to be favorable from an interpretability perspective compared to Transformer's attention due to sparsity.The figures were adapted from original references Continual DA using OT.Very few works developed techniques that address gradual distributional shifts in the target domain.Continual Domain Adaptation [248] is a problem that resembles Continual Learning (CL) [269] albeit without assuming the availability of labels in the target domain.LDAuCID [248] learns a source model capable of adapting to several sequential target domains without labels.A strong overlap between the source and target classes is assumed.LDAuCID aligns the consolidated internal learned distribution with the target domain using Sliced Wasserstein.To mitigate the catastrophic forgetting issue, representative samples from all tasks are stored and replayed in the adaptation stage.This technique is known as experience reply in CL literature.
Source-free DA using OT.Traditional UDA performs the adaptation between domains using the source and target data.Requiring the presence of source data during adaptation is impractical in cases when the data is inaccessible due to privacy (e.g., Federated Learning clients [271]), storage limitations (e.g., on edge device), or other reasons.Augmented Self-Labelling (ASL) [272] uses the pre-trained source model (instead of the source data) and the unlabeled target data for adaptation.ASL generates pseudo-labels for target data using the pre-trained model to be used during the model adaptation.It builds on a technique [59] that frames self-labeling as an optimal assignment between the model's predictions (softlabels) and the set of possible labels under an equipartition constraint (that avoids degenerate solution where a single label is assigned to all samples).Domain generalization using OT.UDA assumes that data can be collected in advance (prior to training ) from the target domain since this is required for training.This assumption doesn't hold in practice.Domain Generalization (DG) thus is envisioned as a more practical alternative that learns a model capable of operating in unseen target domains without requiring model update (adaptation).Learning to Augment by Optimal Transport (L2A-OT) [263] proposes learning to synthesize novel domains from multiple source domains.By augmenting source domains with synthesized ones, DG can be realized through increased training data diversity.The problem is framed as image-to-image translation despite being similar in spirit to gradient-based perturbation [262](i.e., perturbing input using adversarial gradients from a domain classifier).Interestingly, the conditional generator in their pipeline is trained to maximize (not minimize) the optimal transport distance between the source domain and the synthesized novel domains.Such divergence loss is balanced with cycle consistency loss [273] and task loss to ensure preserving semantic content.L2A-OT advocates using Wasserstein distance compared to f-divergences for divergence maximization as the near-zero denominators for f-divergences tend to generate large but numerically unstable divergence values.
DA interpretability using OT.One of the virtues of OTbased domain adaptation is the possibility of inspecting the optimal plan (when available) to gain insights into the domains' correspondence.This offers an opportunity beyond merely detecting and handling the distributional shift between the source and target domains.Specifically, an explanation of the domain shift between the source and target datasets through the samples or features correspondences can be attained.
We note that many OT works [68], [175], [249], [270], [274], [275] refer to interpretability of the optimal plan without formal specification of what are the interpretability requirements.Yet, there is consensus among them that some characteristics are desirable and favorable from an interpretability point of view.For example, sparse optimal plans [68], [274] are favorable.Sparsity is motivated by the principle of parsimony, in which simple solutions should be preferred.For example, a sparse optimal plan in a weakly-supervised adaptation setup [270] between image-sentence pairs can be easier to interpret compared to the dense attention from Transformers.As shown in Fig. 10(b), the two constructs (the optimal plan and the attention) visualize the strength of association between words (e.g., 'man') on the vertical axis and visual objects, denoted by words on the horizontal axis.Clearly, having the word strongly associated with a few visual objects (e.g., the 'man' in the optimal plan) is easier to interpret.Note, however, that solving OT can yield an arbitrarily complex plan that's not necessarily interpretable per the abovementioned characteristics.Thus, a number of works explicitly place additional constraints on the optimal plan reflecting the interpretability requirements.For example, [274], [275] suggest enforcing user-defined interpretable mappings (e.g., k-sparsity) on the optimal plan as additional constraints.
Also, it is easier to interpret the correspondences when they can be traced back to the raw samples [143], [249] .For example, CO-Optimal Transport [249] simultaneously estimates the correspondence between samples (images) and features (independent pixels values across the dataset images) from two handwritten digit datasets.The features of the transport plan (Fig. 10(a) shows the spatial transformation of USPS into MNIST dataset.Although it is oblivious to the geometric structure of the images, it reveals how pixels would spatially rearrange in transportation.
DA theoretical guarantees using OT.Another virtue of OT is the possibility of leveraging theoretical error bounds that build on the Wasserstein distance.For example, Courtey et al. [276] proved that minimizing the joint distribution optimal transport (JDOT) quantity is equivalent to minimizing the learning bound on the domain adaptation problem.Wasserstein Guided Representation Learning (WDGRL) [258] builds on [247] and proves that the target error is bounded by the Wasserstein distance for empirical measures under the assumption that hypothesis class is K-Lipschtiz continuous for some K.

B.2 Generative Modeling
Learning models that generate images, audio, video, text, or other data is a major subfield of machine learning.Generative models have a vast range of possible applications and enormous potential can result from training on endlessly available unlabeled data.Let X ∼ P r , where P r be a fixed distribution from which we have access to samples.The fundamental problem of generative modeling is to make distributions P z and P r as close as possible, where P z is a distribution we may sample from.In high-dimensional and large data regimes, this is often done by minimizing a divergence between the two distributions.
Generative Adversarial Networks (GANs).The GAN consists of two networks -a generator network g θ ∈ G and a (0, 1)-valued discriminator network d ∈ D -that are jointly trained according to a procedure that may be understood as a minimax game [277].Given some easy-to-sample distribution P 0 , the generator implicitly defines a sample from a distribution g θ (z 0 ), where z 0 ∼ P 0 .The discriminator network d is then used to evaluate whether the sample g θ (z 0 ) comes from P z or P r .Whether or not the sample comes from P z or P r is measured through a quantity The GAN's minimiax objective between the generator g θ and the discriminator d is then The quantity L GAN provides a lower bound on the Jensen-Shannon divergence (JSD), up to a multiplicative and additive constant [277].Furthermore, if D were replaced by the set of all (0, 1)-valued functions, the lower bound would be an equality, and Eq. ( 27) amounts to minimizing the JSD [278].WGAN and followups.The notorious training instability of the vanilla GAN has been argued to be an issue associated with the JSD metric [12].Specifically, the JSD remains constant when the two distributions are disjoint (not overlapping).This can happen when P g and P r are low dimensional manifold in high dimensional space [278].Additionally, even if they overlap, the limited sampling (during training) might render them disjoint.The WGAN [12] minimization objective with respect to a generative network g θ is where F is a space of 1-Lispchitz neural networks.f w here replaces GAN's (0, 1)-valued discriminator and is called the critic.Recall that the equivalent form of Kantorovich-Rubinstein duality [44] is where F is the space of all 1-Lispchitz functions.Since F ⊂ F, it follows that the GAN objective Eq. ( 28) is a lower bound of the 1-Wasserstein distance Eq. ( 29).The 1-Lipschitz constraint (Lip(f )) in Eq. ( 28) also ensures a smooth WGAN WGAN-GP WGAN-LP SWGAN Fig. 11.Connection between WGAN [12] and the followup works WGAN-GP [280], WGAN-LP [279] and SWGAN [281].Lip and Sob denote 1-Lipschitz constraint and Sobolev integral constraint; respectively.
critic and, thus, a more meaningful gradient for updating the generator.In practice, WGAN enforced this constraint using the weight clipping heuristic to enforce the parameters to be in a compact space ([−0.01,0.01]).WGAN-LP [279] augments the loss with a regularization term that penalizes the deviation of the critic's gradient norm from one.Followup works argued that explicit enforcement of Lip(f ) might be unnecessary.WGAN-GP [280] observed that the optimal critic has a gradient norm equal to 1 almost everywhere.They replace the weight clipping with a gradient penalty term that encourages this characteristic.Sobolev WGAN [281] argues that the strong Lipschitz constraint might be unnecessary for optimization.Another direction for tackling the instability that can be caused by imperfect optimization of the critic is to use a fully tractable divergence.Genavy et al. [282] used regularized OT evaluated on mini-batches.Diffusion and score based methods.Score based [283], [284] and diffusion generative models [285] both rely on a similar mechanism.Input data (such as images or text) is perturbed by incrementally adding small amounts of noise by a procedure called the forward process.The forward process is a so-called diffusion process.The input data is recovered by running another diffusion process called the reverse process.The reverse process involves the score functionthe derivative of the log probability density function of the perturbed data distribution -which is unknown.Instead of using the exact score function, score-based methods use a neural network approximation.The parameters of the neural network are learned using the so-called (conditional) scorematching loss, which is coarsely an expected weighted meansquared error between the neural network and the true score function [286].Samples from the approximate data distribution can be generated by running the reverse process.
The forward and reverse mechanisms do not explicitly minimize the divergence between the input data and the output of the reverse process.Surprisingly, however, scorebased methods are equivalent to minimizing an upper bound on the KL divergence between the reverse process and the input data distribution [286].More recently, it has been shown that the 2-Wasserstein distance is upper bounded by the (conditional) score-matching loss, up to multiplicative and additive constants [287].
Several works have revealed the interesting connections between OT and diffusion models.Providing alternative interpretations and solutions to Diffusion-based generative modeling.The seminal work by Jordan et al. [288] interpreted Langevin dynamics (a basic principals underlying diffusion models) as a gradient flow in the Wasserstein space of probability measures.Diffusion Schrodinger Bridges [48], [289], [290] were shown to be strongly related to the entropyregularized OT problem.The formulation can be solved using a modified version of Iterative Proportional Fitting (a continuous analog of the Sinkhorn algorithm).Khrulkov et al. [291] shows that, in specific cases, the optimal transport (Monge) map coincides with the Denoising Diffusion Probabilistic Models (DDPM) [292] encoder map.Rectified Flow generative model [293] finds a transport map between two distributions by learning a flow (ordinary differential equation) that favours the transportation in straight paths.
In addition to the above, OT was utilized to optimize the diffusion model's pipeline.For example, to speed up the data sampling.Typically, diffusion models require many evaluation iterations to generate data samples.OT methods were used to enable efficient sampling in Diffusion Models.DPM-OT [294] views the inverse diffusion as an OT problem, significantly improving sample quality and speed.The model leverages semi-discrete optimal transport maps to better capture the discontinuity of the target data manifold.The model is currently limited to unconditional data generation.

B.3 Other applications
Graphs Matching: Graph comparison and matching is concerned with estimating the structural correspondences of nodes between graphs.Due to its combinatorial nature, the problem is NP-complete, and approximate solutions are commonly sought.In addition to the quadratic formulations for graph comparison discussed in Gromov Wasserstein ( §4.6), some approaches relax the problem into a linear assignment.Wang et al. [295] learns a supervised model for graph matching using a permutation loss that can work for an arbitrary number of nodes for graph matching.As the Sinkhorn layer was shown [296] to be the approximate and differentiable version of the Hungarian algorithm, it was used to "soft" assign graph correspondences.An issue here is that the "soft" output isn't necessarily a discrete permutation matrix.This is fixed by applying the Hungarian algorithm [297] on the output matrix.Other works [298] used this post-processing trick for a similar purpose.In graph correspondence learning application, [63] uses Sinkhorn to initiate soft correspondence between graph nodes before refining it using message passing.
Reinforcement Learning (RL): Reinforcement Learning (RL) is a framework that aims at effectively solving sequential decision-making tasks.In this framework, the learning agent interacts with the environment to improve its performance through trial and error [299].So far, there is limited adoption of optimal transport techniques in reinforcement learning.Of the few works in this area, [300], [301] applied OT to Curriculum Reinforcement Learning (CRL).CRL approaches complex and challenging RL scenarios by gradually increasing the difficulty of tasks presented to the learning agent.This way, it evades extensive engineering of reward functions.One interpretation of this paradigm considers the curriculum as a sequence of task distributions interpolating between auxiliary (intermediate) task distributions and target task distributions.Algorithmically, the process involves using KL divergence, which OT replaced in [300], [301] to improve the learning performance.

APPENDIX C OPTIMIZATION BACKGROUND
This section provides a concise description of some of the most important concepts from optimization theory relevant to new OT researchers and machine learning researchers learning and using OT.The section is not meant to be comprehensive although effort was made to ensure its accessibility to a wide audience not familiar with mathematical optimization theory.While this section can serve as a starting point for the uninformed readers, we also refer the readers to the following more comprehensive textbooks on optimization theory [302], [303], [304].

C.1 Minimum cost flow problems
An exact optimal transport problem on discrete measures µ and ν, with probability masses a and b over their d X and d Y points respectively, can be formulated as the classic minimum cost flow (MCF) network problem which is a graph-theoretic problem that can be formulated using the following linear program: where c is a cost function, usually a distance measure between x i and y j .General linear programs can be solved using a number of algorithms with the Simplex method [305], [306], [307] and the interior point method [308] being 2 of the most popular choices.The computational complexity of the Simplex method is exponential in the number of decision variables in the worst case while the interior point method has polynomial time complexity in the number of decision variables.However, in practice, the Simplex method tends to be significantly faster for most practical problems.Besides computational complexities and running time, the main difference between the Simplex method and the interior point method from a user's perspective is that the Simplex method always finds a so-called "extreme" or corner point of the feasible domain as the optimal solution, while the interior point method finds a point in the relative interior of the feasible domain as the optimal solution.An extreme point is a point at the intersection of N hyper-planes where N is the number of decision variables (d X × d Y above).Since the optimal transport problem has to be satisfied at equality will be bound constraints of the form P ij ≥ 0. The Simplex method will therefore naturally tend to find sparse optimal plans.The relative interior of a feasible domain is the set of points that satisfy all the linear equality constraints (e.g.dY j=1 P ij = a i ) at equality, but with all the inequality constraints (P ij ≥ 0) satisfied at a strict inequality.
That said, using the standard Simplex method or interior point method for generic linear programs is not the most efficient way to solve minimum cost flow (MCF) problems.MCF problems have a wide variety of more efficient algorithms with polynomial time complexities [309], one of the most famous of which is the so-called network Simplex method [310].The network Simplex method is a variation of the general Simplex method specialized for MCF problems to have a lower cost per iteration and a polynomial bound on the number of iterations needed to converge.The general Simplex algorithm is both more expensive per iteration and has an exponential worst case complexity for the number of iterations needed to converge.More recently, faster algorithms for the MCF problem have also been proposed [311].

C.2.1 Set convexity
A set of points S = {x} is said to be convex if every weighted average of some points inside the set is also inside the set.Equivalently, for a set S to be convex, the following must be satisfied for all (x 2 , x 2 ) ∈ S and for all 0 < α < 1.
The above weighted average is also known as the convex combination of x 1 and x 2 .
The intersection of any number of convex sets is either the empty set or another convex set.Therefore, in an optimization problem with multiple constraints, if the set of feasible points to each constraint individually is a convex set, the set of feasible points to all the constraints is either the empty set (problem is infeasible) or another convex set.

C.2.2 Function convexity
A function f (x) with domain X is called convex if the set of all points in X satisfying the constraint f (x) ≤ α is either an empty or a convex set for all α ∈ R. Note that the set of such points doesn't have to be a compact set, that is it can extend to along one or more directions.Equivalently, one can say that a function is convex if the set of points in its epigraph is a convex set.The eipgraph of the function f (x) is the set of points {(x, α) ∈ X × R : f (x) < α}. Figure 12 shows the epigraph of f (x) = x 2 .Showing that a function is convex can be done by proving any of the above properties.There are however other ways to check if a function is convex.Another common check for twice differentiable functions is to check if the function's Hessian is positive semi-definite everywhere in X .Identifying and building convex functions in a disciplined way is a well-studied field, often termed "disciplined convex programming" (DCP).For more on DCP, the readers are advised to check the excellent online resources available in https://dcp.stanford.edu.

C.2.3 Constraint convexity
In optimization, a constraint is said to be a convex constraint if the set of points satisfying this constraint is a convex set.One common way to construct a convex constraint is using a convex function f (x).If f is a convex function, the following constraint is convex: for any constant α, assuming there exists at least 1 point satisfying this constraint.Other common convex constraints are affine (aka linear) constraints of the form: where f is an affine function of x, e.g.f (x) = b T x + a for some constants b and a.Note that it's customary to use the term "linear" to refer to affine functions and/or constraints in optimization literature.
A constrained optimization problem with a convex minimization objective function and convex constraints is called a "convex program".Formulating decision problems as convex programs and solving them with convex optimization algorithms is often called "convex programming".Convex programs are theoretically appealing because under mild assumptions, they have efficient algorithms that can find a so-called global optimal solution.A global optimal solution is a (not necessarily unique) feasible point x that absolutely minimizes the objective function, i.e. no other solution can do strictly better.
Note that it has to be a minimization problem.Maximizing a general convex function is NP-hard.If your objective f (x) is a quantity you want to maximize, that's equivalent to minimizing −f (x) instead.If −f (x) is a convex function, f is known as a concave function.Therefore for a maximization optimization problem to be convex, the objective function must be concave and the constraints must be convex.

C.2.4 Structured constraints
Among convex constraints, there are some special classes of constraints that are generally considered nicer than "generic" convex constraints.These constraints have a specific structure which can be exploited in more efficient algorithms.These are broadly known as "structured constraints".A simple class of structured constraints are the affine/linear constraints, e.g: There are some algorithms that can handle linear inequality constraints more efficiently than generic convex constraints.Additionally, linear equality constraints are the only equality constraints that are convex.Any other equality constraint where the function in the constraint is not affine/linear can never be convex.Linear constraints are therefore among those special classes of so-called structured constraints.
Another broad class of structured constraints is conic constraints.A conic constraint is a convex constraint whose feasible set is a special convex set known as a cone.A convex set S is considered a cone if for every x inside the set S, αx is also inside the set S for all α ≥ 0. Note how a cone must be a non-compact set since α is allowed to go to ∞.However, a conic program can have multiple conic constraints whose intersection is a compact set.
A structured conic constraint can be written as: where C is a cone.There are various algorithms that can handle conic constraints more efficiently than regular convex constraints.It is therefore often appealing to convert regular mathematical constraints that are known to be convex to their conic equivalents to make use of these efficient algorithms [312].

C.3 Differentiability
A function is any one-to-one or many-to-one (but not oneto-many) mapping from a set called the domain to another set called the range of the function.The uniqueness of the value in the range for each value in the domain is a characteristic property of what a function is.For some parametric optimization problems with parameters θ, there exists a unique optimal solution x * (θ).In such cases, x * can be considered a proper function of θ.Functions can be continuous, discontinuous or semi-continuous.Continuity of a function means that for every infinitesimal change in the inputs, an infinitesimal change takes place in the output.In the context of optimization problems, this means that for every small change in θ, x * only changes by a small amount.Such continuity assumption is generally violated if the solution x * is discrete since discrete solutions can only move in non-infinitesimal steps, except in the trivial case where the solution x * does not change at all with θ.Semicontinuous functions are continuous almost everywhere but have a zero-measure subset of the domain where the function has a discontinuity.Optimal solutions to non-convex optimization problems with multiple modes can exhibit such semi-continuity.

C.4.1 What is duality?
Duality is one of the most important concepts in optimization.Sometimes, instead of solving an optimization problem in its "natural formulation", also known as the primal formulation, one can solve a related optimization problem, known as the dual problem, or one can solve the primal and dual problems together.Let the primal optimization problem be: where x is the vector of primal decision variables, f is the primal objective function, and P is the feasible domain of the primal problem, i.e. the set of all points that satisfy any constraints on x.Additionally, let x * be the unknown optimal solution of the primal optimization problem.The dual problem would be another optimization problem of some so-called dual variables λ: where h is the dual problem's objective and D is the feasible domain of the dual problem such that for any feasible λ ∈ D, h(λ) ≤ f (x * ).The fact that every dual feasible solution λ gives a lower bound h(λ) on the optimal primal objective value f (x * ) is what makes the primal and dual problems entangled with each other.The optimal dual solution λ * ∈ D that maximizes h gives the best/tightest lower bound on the optimal primal objective.In problems where the socalled strong duality exists, the best bound h(λ * ) is exactly equal to the best primal objective value f (x * ).For linear programs and convex programs (subject to a mild constraint qualification condition), strong duality holds.

C.4.2 Importance of the dual problem
There are various reasons to formulate and solve the dual problem.One reason is to identify if a primal solution is near optimal.Assuming the primal problem is a minimization problem, a dual solution that's feasible to the dual problem provides a lower bound on the optimal objective value of the primal problem.Having a good (tight) lower bound on the optimal value of our primal problem helps us to know how far we are from optimality in the worst case.For instance, let x be the best feasible primal solution found and let λ be a feasible dual solution.In this case, we know that h(λ) ≤ f (x * ) ≤ f (x).A feasible dual solution with a higher h(λ) or a feasible primal solution with a lower f (x) provide tighter bounds around the optimal value f (x * ).Therefore knowing λ, we know that f (x) can never be more than f (x) − h(λ) away from f (x * ).When strong duality applies, the best bound h(λ * ) is exactly equal to the best primal objective value f (x * ).In this case, the dual solution can be used to provide a proof/certificate of convergence to the optimal solution if we evaluate f (x) and h(λ) for the best x ∈ P and λ ∈ D found and they turn out to be equal.However, strong duality doesn't always exist for all classes of optimization problems.Another use of the dual problem is that sometimes it's beneficial to solve the primal and dual problems simultaneously like in the primal-dual hybrid gradient algorithm for large scale linear programs or the primal-dual interior point method for unstructured convex and nonlinear programs.In these cases, the dual problem can guide the optimization algorithm improving the convergence rate of the primal solution.
In some other cases, e.g., some linear and convex/conic programs, the dual can also have a special structure enabling more efficient algorithms or faster convergence compared to solving the primal formulation.In such cases, it's also common that there exists an easy way to get the optimal primal solution x * corresponding to the optimal dual solution λ * .In such cases, it might make sense to solve the dual problem alone instead of the primal one and then recover x * from λ * at the end.

C.4.3 Deriving the dual problem (general)
To derive the dual of an optimization problem, the following steps are followed: The Lagragnian function is a function of the objective and some (or all) of the constraints of the primal optimization problem such that the minimum of the Lagrangian function, subject to the non-relaxed constraints, is a lower bound on the optimal value of the primal problem.The standard way to construct the Lagrangian function is by relaxing some or all of the constraints in the primal problem and adding a function of them to the objective that ensures the above property.For instance, let the primal problem be an inequality constrained optimization problem: subject to g(x) ≤ 0 (element-wise), where f is the objective function, g is the constraint function with d g outputs, and P encodes all the "other constraints" if any exist or just the Euclidean space if no other constraints exist.The Lagrangian function resulting from relaxing the inequality constraint would be: where λ ∈ R dg + is a fixed vector of non-negative coefficients known as the Lagrangian multipliers or the dual decision variables.The so-called Lagrangian relaxation of the primal problem would then be a parametric optimization problem parameterized by λ as follows: minimize x L(x, λ) = f (x) + ⟨λ, g(x)⟩ subject to x ∈ P Since any solution x ∈ P that is feasible to the primal problem will satisfy g(x) ≤ 0 and since λ is non-negative, L(x, λ) ≤ f (x) is true for all primal feasible solutions x.This is true for the optimal solution to the primal problem x * as well, i.e.L(x * , λ) ≤ f (x * ).
Let the minimum objective value of the Lagrangian relaxation problem for some choice of λ be: L * (λ) is known as the Lagrangian dual function.It is important to note that the x ∈ P that minimizes L(x, λ) in the Lagrangian relaxation problem may or may not satisfy the relaxed inequality constraint g(x) ≤ 0.
The key condition that a Lagrangian dual function must satisfy by definition is that L * (λ) ≤ f (x * ) for any λ ∈ R dg + .A sufficient condition to ensure this property is satisfied is that L(x, λ) ≤ f (x) for any x that is also feasible to the primal problem and any λ ∈ R dg + , since L * (λ) ≤ L(x * , λ) ≤ f (x * ) would then be true by definition for all λ ∈ R dg + .To generalize the above Lagrangian dual function to different constraint types, other than g(x) ≤ 0, the following quantities and sets can be defined differently for each constraint type to ensure a generalization of the aforementioned property is satisfied: 1) Lagrangian multipliers λ, 2) Multiplier domain D L (instead of R dg + ), and 3) Lagrangian function L(x, λ) Note that the above terms can all be trivially generalized to combinations of constraint types by breaking down λ into multiple components, one per relaxed constraint type.The Lagrangian dual feasible domain D L can therefore be more generally defined as the Cartesian product of the domains of the dual variables associated with all the relaxed constraint sets (grouped by constraint type).Each relaxed set of constraints has a dual variable domain associated with it to ensure the lower bound property of the Lagrangian.The following are some of the most common dual variable domains for common constraint types: Using the more general notation, the Lagrangian dual function L * (λ) is a lower bound on the optimal primal objective value f (x * ) for every λ ∈ D L .Since for every λ ∈ D L , L * (λ) ≤ f (x * ), the choice of λ ∈ D L that gives the tightest bound is the one that maximizes L * (λ).This optimization problem in λ is known as the dual problem.This is a max-min optimization problem which from a first look may seem more complicated than the primal problem.However, for classes of optimization problems where the inner minimization of the Lagrangian function with respect to x has a closed form solution, the dual problem can be significantly simplified.We will see examples in the next section.
When no closed form solution exists for the Lagrangian relaxation, instead of solving the dual problem as a maxmin problem, primal-dual algorithms solve for x and λ simultaneously.For instance in the primal-dual interior point method, the optimality conditions of the outer and inner optimizations are written out as a system of (potentially nonlinear) equations in terms of x and λ which are then solved using a variation of Newton's method.

C.5 Dual of a general linear program
In this section, we will apply the general steps from the previous section to a linear program in its canonical form.Consider the following primal linear program: for some constants c, A and b of the appropriate dimensions where ⟨..⟩ is the inner product operator.There is no loss in generality in assuming the above form.Any ≤ constraints can be trivially transformed to ≥ constraints by multiplying the constraints by -1.Additionally, equality constraints of the form Ãx = d can be transformed to the following 2 sets of constraints: 1) Ãx where m is the number of constraints in Ax ≥ b, and λ is the vector of Lagrangian multipliers.This is just minimizing an affine function of x.The optimal value L * (λ) of the above relaxation has the following closed form solution: The general form of the Lagrangian dual problem can be written as: maximize λ min x L(x, λ) but it can then simplified given the closed form solution as follows: maximize λ ⟨b, λ⟩ subject to Note that the −∞ branch when c − A T λ ̸ = 0 can be ignored in the simplified form since −∞ is clearly less than ⟨b, λ⟩ and we are only interested in the λ that maximizes the dual function L * (λ).Also note that the simplification led to some additional constraints on λ compared to the general Lagrangian dual formulation.The above problem is known as the dual linear program.Similar derivation steps can be followed for conic and other special classes of optimization problems to derive their dual formulation.
In some cases, it can be more efficient to solve the dual linear program instead of the primal one.For linear programs, strong duality exists, and there exists an efficient way to recover the optimal primal solution x * from the optimal dual solution λ * so one has the choice to solve either the primal or dual problems.The dual can also be exploited in algorithms directly without changing the problem's formulation.For instance, the dual Simplex method is an efficient way to perform the regular Simplex method on the dual formulation but without having to explicitly derive the dual formulation.

C.6 Dual of the exact optimal transport problem
In this section, we shall derive the dual problem of the exact discrete optimal transport problem: minimize P ⟨P , C⟩ subject to P 1 = a, where C ij = c(x i , y j ).
The Lagrangian relaxation of the above linear program is: The main advantage of the dual optimal transport problem is that the decision variables now scale linearly with the number of points in the supports of the measures O(d X + d Y ).Additionally, the local constraints in the dual formulation over each (x i , y j ) pair enable certain stochastic optimization methods, e.g., the stochastic Frank-Wolfe algorithm.

C.7 Dual of entropy regularized optimal transport problem
The final example of duality and its application in optimal transport which we shall present is that of the entropy regularized optimal transport problem: This is just minimizing an unconstrained strongly convex function of P .The unique optimal solution P * (λ a , λ b ) minimizing L can be obtained using the stationarity conditions: The corresponding optimal value L * (λ a , λ b ) is therefore: L * (λ a , λ b ) = ⟨λ a , a⟩ + ⟨λ b , b⟩ Since λ log P * ij = −C ij + λ ai + λ bj (from the stationarity conditions above), the above expression simplifies to: This is an unconstrained maximization problem amenable to classical stochastic gradient descent and its family of stochastic optimization algorithms.

Fig. 4 .
Fig. 4. Conceptual Depiction of Optimal Transport: (a) The optimal transport problem is stated as follows.Given two points cloud objects µ and ν and the knowledge of point-wise distances (i.e. the ground distance), find a legitimate way (i.e.valid plan) to redistribute the mass of µ into ν in the least costly way P * .The optimal transport cost is then given by d OT (µ, ν) = ⟨C, P * ⟩.(b) Matching of two point cloud objects using optimal transport.The point correspondences estimated by the optimal plan are shown as straight pink lines.OT literature is filled with formulations that extend this key concept to new situations and applications.For example, formulations that allow (c) partial transportation or (d) transportation between incomparable spaces.where ⟨•, •⟩ is the Frobenius inner product.Applying the formulation above on point cloud objects (Fig 4(b) hand gestures from the CAPOD dataset [42]), we can learn about the global proximity of the two shapes through the distance d OT .Moreover, we learn the point correspondences conveyed by the optimal plan P * (visualized as straight lines).Posed this way, the OT problem (sometimes known as the Kantorovich formulation) possesses a useful set of characteristics as well as limitations, which motivate other formulations of the problem.We discuss many of them in §4.In some cases, one may wish to relax the mass preservation constraint.For example, as shown in Fig.4(c), the pink hand object can only be partially matched to the incomplete (grey) hand copy as the thumb is missing.The vanilla OT forces the transportation of all points, resulting in inaccurate matching and (unrealistically) increased cost.Note the incorrect matching of the thumb's points.A partial OT formulation ( §4.2), however, is a better alternative as it has the flexibility of leaving some points out (the thumb) from the transportation.Another limitation is that transportation that relies only on point correspondences conveyed by the ground distance (without considering the inter-points relations) can be misleading.In Fig.4(d), we see a simple rotation transformation of the target object causes a degradation in the alignment by vanilla OT.This can be clearly seen in the incorrect correspondences of the index and the thumb fingers.This is not the case with a relational formulation (Gromov Wasserstein §4.5) that handles this issue well at the expense of increased computation.This formulation can also compare points defined on distinct (incomparable) spaces which can be useful for comparing graphs or measurements from different modalities.

Fig. 8 .
Fig. 8. OT Domain Adaptation.Various domain adaptation configurations addressed by OT works in §B.1 Set DA using OT.In open set domain adaptation, the target domain can contain classes that weren't present in the source domain.While more practical, it is a challenge where domain adaptation methods fail due to interference with extra unknown classes.The key goal in open set domain adaptation [264] (OSDA) is to correctly classify known classes while rejecting the extra classes as "unknown".Xu et al. [265]

Fig. 9 .
Fig. 9. (Left) L2A-OT augments source domains (thick arrows) with synthesized novel domains (thin arrows) by learning to maximize the Wasserstein distance between the two sets.(Right) synthesized novel samples are qualitatively better compared to adversarial perturbation approach CrossGrad [262] (figures adapted from [263])

1 )
Derive the Lagrangian function 2) Derive the Lagrangian relaxation problem 3) Derive the Lagrangian dual function 4) Derive the Lagrangian dual problem 5) Optional: simplify the dual problem Constraint typeDual variable domain D L g(x) ≤ 0 element-wise R dg + : non-negative numbers g(x) ≥ 0 element-wise R dg − : non-positive numbers g(x) = 0 element-wise R dg : all real numbers x ∈ C where C is a cone The polar cone C o of CThe polar cone of a cone C is the set of points {λ} in the Euclidean space such that the inner product ⟨λ, x⟩ ≤ 0 ∀x ∈ C.This definition ensures that the Lagrangian is a lower bound on the objective function for any feasible x and λ ∈ D L .