Landmarks Augmentation with Manifold-Barycentric Oversampling

The training of Generative Adversarial Networks (GANs) requires a large amount of data, stimulating the development of new augmentation methods to alleviate the challenge. Oftentimes, these methods either fail to produce enough new data or expand the dataset beyond the original manifold. In this paper, we propose a new augmentation method that is guaranteed to keep the new data within the original data manifold, thanks to optimal transport theory. The proposed algorithm finds cliques in the nearest-neighbors graph and, at each sampling iteration, randomly draws one clique to compute the Wasserstein barycenter with random uniform weights. These barycenters then become the new natural-looking elements that one could add to the dataset. We apply this approach to the problem of landmark detection and augment the available annotation in both unpaired and semi-supervised scenarios. Additionally, the idea is validated on cardiac data for the task of medical segmentation. Our approach reduces overfitting and improves the quality metrics beyond the original data outcome and beyond the result obtained with popular modern augmentation methods.


Introduction
Extensive labeling is required to obtain stable neural network models of high image-to-image mapping quality. The ability to prepare a training dataset of sufficient size strongly depends on the research area, with domains like medicine, space imagery, or geology demanding immense expertise, time, and financial resources to systematize and label the datasets. Yet, despite the unquestionable demand, the unlabelled data usually prevail over the labelled ones regardless of the domain, creating a welcoming setting for the development of unsupervised and semi-supervised approaches.
Most of the existing unsupervised methods do not achieve the quality of models optimized in a supervised training mode. However, some Cyclic Generative Adversarial Networks (Cyclic GANs), e.g., the already classic CycleGAN [66], DualGAN [62], or others [5,30,52], oftentimes yield results comparable to supervised learning, without requiring aligned (paired) data for training. This class of architectures is efficient, e.g., in a semi-supervised setup in the presence of a large number of unlabelled images. Data augmentation is a way to fictitiously inflate the available annotation, conventionally entailing either basic geometric or deep learning-based approaches [48]. While the latter is deservedly the subject of active research today [14], the former is prone to expanding the manifold of the annotations beyond the original distribution. Namely, augmentations that include non-trivial transformations (beyond basic image rotations and flips, e.g., a linear stretching) will inevitably lead to an expansion of the original manifold, disturbing the physical nature of the data. Moreover, disciplines like medicine are known for their resilience to the classical augmentation methods such as affine transformations [48,50], demanding the development of more advanced methods.
(* Contributed equally. † Corresponding author: d.dylov@skoltech.ru)
Existing data augmentation approaches dismiss the advances of modern optimal transport (OT) theory [1,12], which, allegedly, could be attributed to the lack of user-friendly solutions to handle the complex topology of large datasets. In OT, the manifold hypothesis [6,31] states that data do not occupy the whole ambient space; rather, due to redundancy and correlation of features, the data concentrate near a submanifold of intrinsically lower dimension than that of the entire ambient space. Yet, even if the ambient space is flat and lacks curvature, the data submanifold can be curved and topologically nontrivial, having clusters and topological holes of higher dimensions.
Motivated by these challenging attributes of the submanifold, in this work we propose to construct a local nonparametric approximation of it by means of the Wasserstein distance, devising an efficient augmentation arrangement. Each local region in such a representation is a convex set of neighboring points, corresponding to the desired landmarks or segmentation contours. Hence, to sample new points inside a region, we propose to compute random weighted Wasserstein barycenters [6], which inherit the geometric properties of the basis of landmark measures; such local regions automatically prevent us from leaving the submanifold. We introduce the notion of a barycentric manifold, which we define as the set of weighted barycenters generated in the local regions of the manifold. Fig. 1 summarizes the proposed concept for the case of a Cyclic GAN that performs some image-to-image translation task.

Figure 1. Landmarks Augmentation with Manifold-Barycentric Oversampling. The barycentric manifold is obtained by calculating weighted Wasserstein barycenters between the simplices of the nearest neighbours of the landmarks graph. First, we sample new landmarks from random uniform barycentric coordinates. The landmarks sampled this way are close by distribution to the empirical ones, preserving the physical nature of the data and acquiring a natural look. In the second step, the sampled landmarks are used in the Cyclic GAN training. The augmented dataset increases the accuracy of the target image-to-image translation task.
The contributions of this paper are the following:

1. A new type of sample-efficient, non-parametric data augmentation based on weighted Wasserstein barycenters. This is the first adaptation of optimal transport theory to the data augmentation problem, guaranteeing that the sampled data asymptotically converge to the original data submanifold.
2. An efficient (comparable to the supervised outcome) Cyclic GAN model for unpaired and semi-supervised detection of landmarks and segmentation contours.
3. An extensive line of experiments comparing our method to modern data augmentation approaches. Our augmentation outperforms both classical (geometric, GMM) and deep learning-based models (GAN, VAE, and NF).

Related Work
Rapid progress of deep learning instigated a series of supervised landmark detection approaches: cascade of CNNs [51], multi-task learning (pose, smile, the presence of glasses, sex) [65], and recurrent attentive-refinements via Long Short-Term Memory networks (LSTMs) [61]. Special loss functions (e.g., wing loss [17]) were shown to further improve the accuracy of CNN-based facial landmark localisation. Ultimately, Hourglass [39] has proven to be a universal model for detecting landmarks in the supervised learning mode, having a stack of U-Net [45] type architectures at its core.
Unsupervised pretraining has attracted major interest in the community with the advent of data-hungry deep networks [54]. A classical approach for such a task is to use geometric transformations for feature-descriptor encoding, comprising different variations of embeddings [54-56,64], where the pretrained descriptors had to reflect the image changes synchronously. A recent work [38] proposes an extension to this approach by matching descriptors from different images via deep clustering. However, none of these works report extraction of the landmarks in a truly unsupervised way: their unsupervised nature is only as good as a generic pretraining could be, being prone to encoding some redundant information within the bottleneck. Only the authors of [5] managed to distill just the landmarks data in their encoder, establishing the first fully unsupervised extraction method.
Another class of unsupervised pretraining methods [26,35,60] uses conditional image generation. If images I_1 and I_2 have different landmarks but the same style (e.g., sequential frames from a video), the reconstruction network G can generate I_2 from I_1 and the landmarks L(I_2) of the second image. The corresponding loss function then minimizes the difference between I_2 and G(I_1, L(I_2)), together with an additional sparsity condition on the landmarks heatmap corresponding to L(I_2). The methods [27,28] are similar, but they have an additional discriminator network to compare the predicted landmarks to the landmarks from some unaligned dataset. Thus, the use of an unaligned dataset is an intermediate step between the unsupervised and the supervised training. The method proposed herein also belongs to this class, significantly improving the precision of the GAN model and the quality of training with limited annotation.
An important part of the Cyclic GAN implementation proposed herein is the content-style decomposition of an image [23]. We use the architecture of stylegan2 [30], which allows mixing style and landmarks through weight modulation in the convolutional layers. It makes landmark extraction invariant to the style changes and vice versa: the style is invariant to local geometric perturbations of the image [35]. Lastly, the paper [42] explores style translation between images, augmenting the dataset for supervised training.
We, conversely, focus on datasets with a large number of unlabelled images that require landmark augmentation. The idea of augmentation via barycentric sampling was motivated by the papers [6] and [29]. The authors of [6] introduce barycentric coordinates (defined in Section 3) for low-dimensional embedding. To augment the available data, we propose to sample such coordinates w.r.t. arbitrary collections of neighboring data points and then to map them into the landmarks domain.
Data augmentation techniques vary from simple geometric transformations [53] to more sophisticated ones, requiring interpolation of inputs in the feature space [10] or a targeted use of deep learning [11,25,41]. Where oversampling is entailed, these methods oftentimes rely on the SMOTE [10] technique, an established geometric approach to random oversampling and data augmentation in Euclidean space, followed by many extensions [7,22]. For faces, the survey [59] covers a large set of augmentation methods, ranging from attribute transformations [36] to generative methods (e.g., GAN-based [47,66]). Other modern augmentation methods include VAEs [32,57] and NFs [16].
Inexplicably, there is still no data augmentation method based on optimal transport theory, motivating this effort.

Method
We begin with an unpaired dataset of 2D images and landmarks (a fixed number of keypoints in R², which may be either ordered or randomly permuted). The images and the landmarks are assumed to be mapped by a class of generative image-to-image translation models, generically referred to as Cyclic GANs.

Cyclic GAN Training
The Cyclic GAN architecture considered herein includes two generators, two discriminators, and a style encoder (refer to [30]). The first generator maps images to landmarks and uses the hourglass model [39]. The second generator does the inverse mapping, taking landmarks and style as input and producing an image. The style may be either encoded from the image (to reconstruct the output image) or generated from Gaussian noise (to approximate the image distribution). We thus train the bi-directional mappings image → (landmarks, style) → image and (noise, landmarks) → image → landmarks. The corresponding loss functions (L^I_rec, L^L_rec) compare the reconstructed images/landmarks with the original ones (Section 4). The training also includes mono-directional mappings, where we compare the generated image/landmarks with the real ones by distribution. Algorithm 1 concisely covers the universal optimization routine of Cyclic GAN training, regardless of the image-to-image translation task.
For given landmarks coordinates X_l, we engage an intermediate mapping to Gaussian heatmaps L = L(X_l), computed as follows. For the k-th point of the coordinates, X_l[k], and integer coordinates (i, j) in the heatmap, denote the distance from (i, j) to X_l[k] by d(k, i, j). Then, up to a normalisation term, if the landmarks are ordered,

L_k(i, j) ∝ exp(−d(k, i, j)² / (2σ²)).

In the un-ordered case (e.g., segmentation contours), one should use the landmarks summed by channels, such that

L(i, j) ∝ Σ_k exp(−d(k, i, j)² / (2σ²)).
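As a concrete illustration, the heatmap construction can be sketched as follows; this is a minimal NumPy sketch, where the function name, the square image size, and the per-channel normalisation are our assumptions rather than the paper's code:

```python
import numpy as np

def gaussian_heatmaps(keypoints, size, sigma=4.0, ordered=True):
    """Map landmark coordinates to Gaussian heatmaps.

    keypoints: (K, 2) array of (x, y) coordinates in pixel units.
    Returns (K, size, size) if ordered; otherwise the per-point
    heatmaps are summed into a single (1, size, size) channel.
    """
    ii, jj = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    maps = []
    for x, y in keypoints:
        d2 = (ii - y) ** 2 + (jj - x) ** 2          # squared distance d(k, i, j)^2
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        maps.append(h / h.sum())                    # normalisation term
    maps = np.stack(maps)
    return maps if ordered else maps.sum(axis=0, keepdims=True)
```

With σ = 4 (the value reported in the hyperparameters section), each channel is a unit-mass Gaussian bump centered at the corresponding key-point.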

Manifold-Barycentric Oversampling
We consider point clouds as empirical probability distributions supported over finite sets of points in the Euclidean plane, P(R²), endowed with the 2-Wasserstein distance W_2, a.k.a. the Wasserstein space [40] W_2(R²) = (P(R²), W_2). So, the distance between two landmarks can be measured with W_2 as the distance between the two corresponding empirical measures.
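For two uniform point clouds of equal size, the setting matching a fixed number of key-points per landmark, W_2 reduces to an optimal assignment problem and can be computed exactly with a Hungarian-type solver. A minimal sketch (the function name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_point_clouds(x, y):
    """Exact 2-Wasserstein distance between two uniform point clouds
    of equal size, solved as a linear assignment problem."""
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    r, c = linear_sum_assignment(cost)                     # optimal matching
    return np.sqrt(cost[r, c].mean())
```

The matching makes the distance invariant to permutations of the key-points, which is exactly what the unordered-landmark case requires.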
Under the manifold hypothesis, the data comprise a submanifold in the ambient space W_2(R²). Due to the curvature of the data submanifold, we can only trust the ambient metric locally. Thus, our idea is to non-linearly interpolate the densely populated local regions of the space of point clouds to generate probable landmark point clouds without leaving the data manifold. Otherwise, exceeding the data manifold could generate improbable, unnatural-looking landmarks.
We formalize the Wasserstein-barycentric manifold sampling for point-cloud data augmentation in Algorithm 2. Given a hyperparameter k, we construct a symmetrized k-NN neighborhood graph G_k over the landmark point clouds, endowed with the Wasserstein distance. Then, we construct the clique complex K(G_k) by finding the maximal cliques (i.e., cliques not contained in any other clique) of the neighborhood graph, associating a (k − 1)-simplex with each maximal k-clique. Then, N_aug simplices σ_i are randomly selected with replacement according to the multinomial distribution (to provide uniform selection of the graph's vertices). From each k_i-simplex σ_i, we sample a new point µ̂_i according to the uniform distribution over the simplex: we draw random barycentric coordinates and compute the weighted Wasserstein barycenter of the vertices of the k_i-simplex corresponding to those coordinates.

Algorithm 1: Cyclic GAN Training Iteration
Input: image I, unpaired landmarks coordinates X_l, hyperparameters of reconstruction losses c_I, c_L, generators G_L, G_I, discriminators D_L, D_I, heatmap variance hyperparameter σ; compute the Gaussian heatmap L = L(X_l).
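Under the simplifying assumptions that the point clouds are equal-size and uniform, the sampling pipeline of Algorithm 2 can be sketched end to end as below. All names are ours; the barycenter here uses a one-reference matching approximation (every clique member is aligned to a single reference cloud before averaging), whereas the paper computes proper free-support Wasserstein barycenters:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_sq(x, y):
    """Squared W2 between equal-size uniform point clouds (assignment)."""
    cost = ((x[:, None] - y[None, :]) ** 2).sum(-1)
    r, c = linear_sum_assignment(cost)
    return cost[r, c].mean(), c          # distance and the matched y-indices

def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of the maximal cliques of a small graph."""
    cliques = []
    def bk(R, P, X):
        if not P and not X:
            cliques.append(sorted(R)); return
        for v in list(P):
            bk(R | {v}, P & adj[v], X & adj[v])
            P.remove(v); X.add(v)
    bk(set(), set(adj), set())
    return cliques

def lambo_sample(clouds, k=3, n_aug=5, rng=np.random.default_rng(0)):
    """Sketch of W2-barycentric oversampling: symmetrized kNN graph under
    W2, maximal cliques, then random-weight barycenters inside random
    cliques (one-reference matching approximation of the barycenter)."""
    n = len(clouds)
    d = np.array([[w2_sq(a, b)[0] for b in clouds] for a in clouds])
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in np.argsort(d[i])[1:k + 1]:         # k nearest neighbours of i
            adj[i].add(int(j)); adj[int(j)].add(i)  # symmetrize the graph
    cliques = [c for c in maximal_cliques(adj) if len(c) >= 2]
    out = []
    for _ in range(n_aug):
        c = cliques[rng.integers(len(cliques))]     # draw one clique
        lam = rng.dirichlet(np.ones(len(c)))        # uniform barycentric coords
        ref = clouds[c[0]]
        bary = np.zeros_like(ref)
        for w, idx in zip(lam, c):
            _, match = w2_sq(ref, clouds[idx])      # align to the reference cloud
            bary += w * clouds[idx][match]
        out.append(bary)
    return out
```

Because each synthetic cloud is a convex combination of matched data points, its coordinates never leave the bounding region of the original clouds.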
Notably, Algorithm 2 entails a time-consuming pairwise computation of edges with OT weights for the kNN graph. Its complexity is O(N²s²/ε²), where s is the domain size (the number of key-points in a landmark) and ε is the accuracy of the Sinkhorn method [21]. For large datasets (N ≫ 1000), one may further improve Algorithm 2 by using a faster kNN graph computation, taking O(N log s) steps instead of O(N²) [15]. The algorithm's complexity also includes the maximal-cliques search, O(min{k⁴, (Nk)^{3/2}}) [37], and the computation of barycenters, O(N_aug s²k/ε²) [33].
Wasserstein Barycenter. A barycentric coordinate system specifies the location of each point on a simplex with reference to the points spanning it. For Euclidean space, the transition between barycentric and ambient coordinates is, by definition, a convex combination, i.e., a weighted sum. For generic metric spaces, however, it is the solution to an optimization problem. In particular, the Wasserstein space is not Euclidean, and the weighted mean is generalized as the minimizer of the sum of squared distances, known as the weighted Fréchet mean [19] or, in W_2(X), the weighted Wasserstein barycenter [12]:

µ̂(λ) = argmin_{ν ∈ P(X)} Σ_{i=0}^{k} λ_i W_2²(µ_i, ν).

For measures with free support, the solution of the Wasserstein barycenter problem is a locally optimal measure with a discrete support [18].

Algorithm 2: W_2-Barycentric Sampling
Input: landmark coordinates X = (X_l^1, ..., X_l^N), kNN parameter k, number of synthetic landmarks to sample N_aug;
Output: augmented landmarks X_aug;
# Construct neighborhood graph G; N = landmarks count.
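For uniform point clouds of equal size, a weighted barycenter with free support can be approximated by a classical fixed-point scheme: repeatedly match every input cloud to the current support by optimal assignment, then move each support point to the λ-weighted average of its matches. A minimal sketch under these assumptions (the function name is ours; exact free-support solvers [18,33] are more elaborate):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_barycenter(clouds, weights, n_iter=50):
    """Fixed-point iteration for the weighted W2 barycenter of
    equal-size uniform point clouds (free support)."""
    bary = clouds[0].copy()                      # initialize the support
    for _ in range(n_iter):
        new = np.zeros_like(bary)
        for lam, pts in zip(weights, clouds):
            cost = ((bary[:, None] - pts[None, :]) ** 2).sum(-1)
            r, c = linear_sum_assignment(cost)   # match cloud to the support
            new[r] += lam * pts[c]               # weighted average of matches
        if np.allclose(new, bary):
            break                                # fixed point reached
        bary = new
    return bary
```

For two clouds that are translates of each other and equal weights, the scheme recovers the midpoint cloud, the intuitive W_2 interpolation.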

Loss Functions
Reconstruction. Let us denote the generator (decoder) model by G_I. It maps a pair (landmarks, style) into an image, where the style may be either encoded from the image, S(I), or generated from the noise, S_o(z). We apply a random geometric transformation g before extracting the style from the image, for a better separation of style and content. Let G_L be the landmarks encoder. Then, the following loss term compares the original image I to the reconstructed one, I_R = G_I(L, S(g(I))), for a given heatmap of landmarks L and the encoded style [44]:

L^I_rec = c_1 ‖I − I_R‖² + c_2 ‖Alex(I) − Alex(I_R)‖² + c_3 ‖Id(I) − Id(I_R)‖².

This loss has three terms: the Euclidean image similarity, the features from the Alex network, Alex(I) [63], and the features from the ResNet ArcFace, Id(I) [13].
Similarly, for the reconstruction of landmarks obtained from a generated (fake) image, L_R = G_L(I_F):

L^L_rec = c_4 H(L, L_R) + c_5 W_1(L, L_R),

where H is the cross-entropy function and W_1 is the Wasserstein distance with the Euclidean ground metric. The component H enables a more accurate comparison of 'close' pairs of landmarks, while W_1 speeds up the convergence of distant elements.
Adversarial. We further increase the overlap of the real and the generated image distributions by employing the GAN losses for the discriminator and the generator:

L_d = E log(1 + e^{−D(I)}) + E log(1 + e^{D(I_F)}),   L_g(I_F) = E log(1 + e^{−D(I_F)}).

These are classical losses from the binary LogitBoost classifier [20] that have been shown to provide stable training of GANs with non-vanishing gradients [30]. The discriminator optimizes the separation of real and fake images, and the generator performs the classification with opposite label signs. Similarly, for the landmarks distribution, we use the same discriminator loss and L^L_g(L_F) = E log(1 + e^{−D_L(L_F)}).
Implementation Details. The implementation details, featuring the architecture of the modified stylegan2 [30] and the code, are provided in the Supplementary material.
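The LogitBoost-style adversarial losses can be written compactly with the softplus function, log(1 + e^t). A minimal NumPy sketch (the function names are ours):

```python
import numpy as np

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return np.logaddexp(0.0, t)

def d_loss(d_real, d_fake):
    """Discriminator loss: push D(real) up and D(fake) down."""
    return softplus(-d_real).mean() + softplus(d_fake).mean()

def g_loss(d_fake):
    """Non-saturating generator loss: the same classifier with the
    opposite label sign, E log(1 + exp(-D(fake)))."""
    return softplus(-d_fake).mean()
```

When the discriminator is confident and correct, d_loss vanishes; an undecided discriminator output of 0 gives the generator a loss of log 2.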

Experiments
We consider 300-W [46], Human3.6M [24], and Cardio MRI (private) datasets. Dataset 300-W contains 68 annotated key-points per face, treated as unordered, and 3k training pairs. Human3.6M dataset has about 13k training pairs of 32 key-points (ordered), depicting the human pose in the images. Cardio MRI dataset has 701 training pairs of MRI scans and segmentation contours with 200 unordered key-points each.
Typical for Cyclic GANs, the unaligned training needs a labelled dataset but does not need a direct image-mask correspondence. To evaluate the efficiency of augmentation via the barycentric manifold, we sampled 7k barycenters for each dataset. The barycenters themselves are various kinds of key-point distortions (Fig. 2), yet with the physicality of these data (proportions, angles, etc.) preserved.
In agreement with the theory, Fig. 2 also demonstrates asymptotic convergence to empirical data when the dataset size increases. Given such realistic augmentations, the Cyclic GAN performance improves in all three datasets considered (Fig. 3), with the new landmarks always remaining within the manifold of the original data. This could be confirmed both visually and by analyzing the post-augmentation distributions (Fig. 4).
In Table 1, we compare our augmented model with the state-of-the-art (SOTA) approaches, trained both in the supervised and in the unsupervised (pretraining with the supervised regression) scenarios, using the metrics typical for these datasets: %-MSE normalised by the image size or by the inter-ocular distance (IOD [9]).
Remarkably, our approach performs almost at the level of the supervised models trained on fully-annotated datasets, beating the result of the unpaired SOTA. When the portion of available annotation drops (< 600 images), our method yields a greater gain and outperforms even those supervised models that have the exact same encoder within. We speculate that our unpaired method performs better because 1) supervised models depend on the amount and the quality of the annotation, and 2) even the raw images without the landmarks can contribute to the performance of our model. The proposed augmentation technique improves precision significantly on small subsets (e.g., 100 and 300) and this gain reduces as the dataset size grows (see plots in Fig. 2).
Hyperparameters. In the augmentation Algorithm 2, we set k = 15 and N_aug = 7000, obtained by grid search. As shown in Fig. 2 (top right), the method is robust to the choice of the parameter k, and its optimal value does not significantly depend on the dataset size. We confirmed this by splitting the landmarks data of 300-W in half: the first half was augmented via the manifold-barycentric oversampling and then compared to proper sampling of the other half. The optimal parameters are drawn from the minimum of the OT distance (W_2) between the dataset parts.
The parameters in the adversarial model were: image loss coefficient c_I = 150, measure loss coefficient c_L = 1.5, style reconstruction coefficient c_6 = 10 (used in the loss L^I_g); the coefficients of L^I_rec are c_1 = 0.8, c_2 = 0.1, c_3 = 1; the coefficients of L^L_rec are c_4 = 10⁴, c_5 = 100. The variance of the heatmaps is σ = 4. The learning rates are: 2·10⁻⁴ (image generator), 4·10⁻⁴ (image and landmarks discriminators), 2·10⁻⁵ (landmarks predictor), and 10⁻⁵ (style encoder). The batch size is 8, and accumulation of weights is used to make the training more stable (a moving average updates the weights every 100 iterations). The complete training of the model requires 100-150k iterations, taking ∼80 hours on one NVidia Tesla V100 GPU. The augmentation itself typically takes several minutes, depending on the dataset.

LAMBO vs. other augmentation methods
For our comparison, we considered the popular methods from several major categories of the augmentation approaches, systematized in the review article [49]. The outcomes of these augmentations are summarized in Fig. 4.
To approximate the landmarks distribution, we use a simple FC network with three Linear + ReLU layers and internal dimension of 256. In all cases, we train the generation of landmarks coordinates X l based on the empirical distribution and compare the performance of all augmentation methods to the baseline (unpaired Cyclic GAN).
GANs [30] are natural synthesizing tools, entailing non-saturating logistic losses (ref. Section 4: L_d and L_g). GAN-based augmentation does improve the baseline score (Table 2); however, as with any parametric model, there is always uncertainty in selecting the model's architecture and in its tuning. Moreover, training a GAN model for a particular task could be unstable and time-consuming: small models are inaccurate, and large ones overfit the data.
A variational autoencoder (VAE [32]) uses two FC networks as the encoder and the decoder. The latent vectors of size 100 are sampled from the Normal distribution. In the augmentation task, VAE shows the worst performance and does not overcome the baseline. We considered several new types of VAEs (e.g., [57]), but they could not improve the target score and were often prone to mode collapse (see Fig. 4).
Augmentation with Normalizing Flows (NF [16]) approximates a parametric distribution by an invertible FC network. We observed that it requires relatively large landmark sets to fit the data precisely, and it was not efficient in our weakly-supervised approbation (e.g., subsets of 100 landmarks).
A popular geometric transform (geometric [8]) augments data by random scaling up to 10% and by rotation up to π/12, applied with probability 0.3. Evidently, this procedure does not cover the manifold structure of any of the datasets and generates only some local perturbations of the landmarks.
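A minimal sketch of this geometric baseline; the centroid-anchored application and the independent per-transform probability are our assumptions, not the paper's exact protocol:

```python
import numpy as np

def geometric_augment(keypoints, rng=np.random.default_rng(0),
                      max_scale=0.10, max_angle=np.pi / 12, p=0.3):
    """Baseline geometric augmentation of a (K, 2) landmark set:
    random scaling (up to ±10%) and rotation (up to ±pi/12), each
    applied with probability p about the landmarks' centroid."""
    pts = keypoints.copy()
    center = pts.mean(axis=0)
    if rng.random() < p:                          # random scaling
        s = 1.0 + rng.uniform(-max_scale, max_scale)
        pts = center + s * (pts - center)
    if rng.random() < p:                          # random rotation
        a = rng.uniform(-max_angle, max_angle)
        R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
        pts = center + (pts - center) @ R.T
    return pts
```

Both transforms fix the centroid, so the augmented landmarks stay in place while their local geometry is perturbed.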
A Gaussian mixture model (GMM [43]) is a simple but efficient classical method of distribution approximation. We set the number of components (the count of Normal distributions) to 20. It improves the baseline, but the performance decreases when the landmarks are high-dimensional and unordered (like the cardiac contours).

Table 2 gauges Cyclic GAN's performance on all datasets for different augmentation types, but it also studies the ablation of the augmentation effect by providing extra annotations to the model. We made subsets of the original data that contain all the raw images but only a portion of the available annotations (e.g., for 300-W: 100, 300, 600, and 3148 landmarks). When we add the barycenter augmentation, we readily generate 7000 new landmarks that, according to the visual inspection of Fig. 2, clearly belong to the same manifold as the original data. Classical augmentation methods (such as rotations, linear stretches, and reflections) are often "damaging" to the dataset even if they help improve the performance of a target task. We tuned their parameters on the validation set so that the deformation is small but sufficient to reduce the overfitting.

Discussion
On the contrary, the barycentric manifold oversampling preserves the consistency of the generated augmentation data and, as we observed, also outperforms these classical and modern deep learning-based augmentation methods in the target tasks. Additional experiments with a combination of the classical augmentation and LAMBO yielded similar outcomes, from which one may conclude that LAMBO inherently includes the small geometric augmentations. This is a valid hypothesis, given that OT inherently encodes the ambient geometry of the manifold. Another notable advantage of LAMBO is its nonparametric nature. In practice, it is preferable to the parametric models considered in Table 2, because it is asymptotically accurate and sample-efficient in estimating the data density. In all of the experiments reported above, we needed just a single parameter to initiate the graph representation in the barycenter domain and to, consequently, enlarge the annotation set (namely, the desired number of adjacent neighbors for computing the local barycenters).
We compare the unpaired training to the supervised case but do not show the results of the semi-supervised scenario explicitly. The amount of augmented data is much larger than that of the paired data. Given that most of the generated landmarks are not paired, it is somewhat irrational to process the paired items separately. In fact, while trying to find the correspondences in the unpaired data, the Cyclic GAN model learns the task comparably to the analogous supervised variants (ref. Table 2), hence also covering the semi-supervised setting [5] by the unpaired setup.
Also instrumental is LAMBO's computation time (compared to that of the Cyclic GAN training). Computing the matrix of Wasserstein pairwise distances between 3148 landmarks in the 300-W dataset took about 10 minutes, and given that our method typically needs 100−1000 landmarks, the augmentation can occur relatively fast as is. Nevertheless, it could be further accelerated for the large datasets by fast kNN search [15]. Overall, the classical augmentation methods (geometric and GMM) are somewhat faster and the deep learning based models (like GAN, NF, VAE) usually take more time than LAMBO. Recent advances in OT, such as approximate nearest neighbor search under the Wasserstein distance in linear time [3] and the computation of Wasserstein barycenters in polynomial time, either exact or approximate [2], will be the subjects of future work.

Conclusion
The problem of learning from unlabeled data is of utmost importance in computer vision, because unlabeled data largely prevail over the annotated ones. Even if annotations are available, they are oftentimes unpaired with the image data, coming from different sources and being rarely sufficient. Our data augmentation solution to this problem relies on a new knowledge representation approach, where we consider the available annotations in the space of the barycentric manifold, computed with the help of the Wasserstein distance and represented as a graph. Just like the classical augmentation methods, this new augmentation method inflates the size of the set of available labels; however, it does so entirely within the original data annotation manifold by its conceptual design. The new data are obtained as a non-linear interpolation in the manifold domain, effectively preserving the natural look of the new images and eliminating the need to control the distortions.

Image Generator (Decoder). The heatmaps are pre-processed, and then we concatenate them with the outputs from the progression of the modulated convolution blocks (ModulatedConvBlock) in network G_I. The term "modulated convolution" means that its weights are obtained from the style. The noise z passes through a series of linear layers to place the style S_o(z) on the style manifold. These styles are further used in ModulatedConvBlock to obtain the corresponding modulated weights. So, each ModulatedConvBlock receives a pre-processed heatmap, multiplies it by the modulated weights, computes a non-linear transform, and passes the result to the next block. In other words, it is the core block for combining the style and the content (landmarks) information. Finally, the very last block returns the fake image.
Landmark Generator (Encoder). Our landmarks encoder consists of two principal parts. Due to the recent success of the stacked hourglass model [39], we have integrated it in our landmarks encoder, which produces an output with separate channels corresponding to the key-points. Afterwards, the application of the downsampling convolutions yields the coordinates of the landmarks (X_l). Figure S5 summarizes the Generator's architecture.
Discriminators. In the discriminators, we convolve the landmarks heatmaps and the images separately and then pass them through a sequence of standard ResNet layers.
Style Encoder. The style encoder is a PSP-type network [44] that maps the image to the style matrices of size 14 × 512. The rows of such matrices store the styles for the ModulatedConvBlock layers with the different corresponding resolutions (from 4 × 4 to 256 × 256).

Wasserstein space
Given a pair (X, d) of a metric space X and a ground (distance) metric d, the p-th Wasserstein distance is defined as

W_p(µ, ν) = ( inf_{π ∈ Π(µ, ν)} ∫_{X × X} d(x, y)^p dπ(x, y) )^{1/p},

where Π(µ, ν) is the set of all joint probability measures on X × X whose marginals are µ and ν, i.e., for all measurable sets A, B ⊂ X, π(A × X) = µ(A), π(X × B) = ν(B) [58, Def. 1.1]. If restricted to the subspace P_p(X) of the probability measures with a finite p-th moment, P_p(X) = {µ : ∫_X d(x, x_0)^p dµ(x) < ∞}, where x_0 ∈ X is arbitrary, the p-Wasserstein distance is indeed a valid metric [58, Def. 6.1]. Then, the pair (P_p(X), W_p) formally constitutes the Wasserstein space W_p(X), suitable for performing the data augmentation that we are after.
Any given point cloud could then be seen as a probability distribution, i.e., an element of the space of probability measures P(R²). A landmark data point is a point cloud, i.e., not a single point but a set of multiple points on the Euclidean plane. A point cloud in R² can be modeled as a discrete probability measure µ with the weights a ∈ R^n_≥0 and the locations x ∈ R^{n×2}: µ = Σ_{i=1}^{n} a_i δ_{x_i}, where δ_{x_i} is the Dirac measure at the position x_i ∈ R² and Σ_{i=1}^{n} a_i = 1. The distinctive feature of the Wasserstein distance is that it takes the geometry of the probability measures into account by using the ground distance of the base space. Because the existence of barycenters is proven for the Wasserstein spaces [1], and given that the computational tools for computing them exist [12], we propose to characterize the space of landmarks with the Wasserstein distance. Such a knowledge representation allows us to compute weighted means of the existing annotations, ultimately leading to an efficient arrangement for the landmarks augmentation.
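For general discrete measures of the form µ = Σ a_i δ_{x_i}, an entropy-regularized approximation of W_2 can be computed with Sinkhorn iterations [21]. A minimal NumPy sketch (the naming is ours; it omits the log-domain stabilization that production OT solvers use for small ε):

```python
import numpy as np

def sinkhorn_w2(a, b, x, y, eps=0.05, n_iter=500):
    """Entropy-regularized approximation of W2 between two discrete
    measures mu = sum_i a_i delta_{x_i} and nu = sum_j b_j delta_{y_j}."""
    C = ((x[:, None] - y[None, :]) ** 2).sum(-1)     # squared ground cost
    K = np.exp(-C / eps)                             # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):                          # alternate scalings
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                  # transport plan
    return np.sqrt((P * C).sum())
```

The regularization ε trades accuracy for speed; as ε → 0, the plan P approaches the optimal transport plan and the value approaches W_2.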
Barycentric sampling

SMOTE [10] is an established geometric approach to random oversampling and data augmentation, followed by many extensions [7,22], with the original motivation to balance classes in the imbalanced learning problem [34]. Its main idea is to introduce synthetic data points, with each new point being a convex combination of an existing data point and one of its k nearest neighbors.
We generalize the SMOTE algorithm to introduce synthetic points as convex combinations of an arbitrary number of points, instead of just pairs, thus sampling from the simplices of the clique complex of a neighborhood graph. That is, the position of a new point is defined by its barycentric coordinates with respect to a simplex spanned by an arbitrary number of sufficiently close data points. Each simplex is seen as a dense local region; thus, interpolating within a simplex is assumed to result in probable data points close to the data submanifold.
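In the Euclidean setting, this generalization can be sketched in a few lines: drawing the barycentric coordinates from a flat Dirichlet distribution yields the uniform distribution over the simplex. The function name and the neighborhood selection (a point together with its k nearest neighbors, rather than a maximal clique) are our simplifying assumptions:

```python
import numpy as np

def simplex_smote(X, k=3, n_new=10, rng=np.random.default_rng(0)):
    """Generalized SMOTE in Euclidean space: each synthetic point is a
    convex combination of a point and its k nearest neighbours, with
    barycentric coordinates drawn uniformly from the simplex
    (a Dirichlet with all concentration parameters equal to 1)."""
    d = ((X[:, None] - X[None, :]) ** 2).sum(-1)   # pairwise squared distances
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        nbrs = np.argsort(d[i])[:k + 1]            # the point plus its k neighbours
        lam = rng.dirichlet(np.ones(len(nbrs)))    # uniform barycentric coordinates
        out.append(lam @ X[nbrs])                  # convex combination
    return np.stack(out)
```

Replacing the Euclidean average with a Wasserstein barycenter is exactly the step that lifts this construction to the space of point clouds.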
Let $\Lambda_k$ be the set of all vectors $\lambda$ of $k+1$ elements such that $\lambda_i \ge 0$ and $\sum_{i=0}^{k} \lambda_i = 1$. Given a set of point clouds $\{\mu_i\}_{i=0}^{k} \subset \mathcal{W}_2(\mathbb{R}^2)$, a Wasserstein $k$-simplex $\sigma(\{\mu_0, \dots, \mu_k\})$ is the set of all convex combinations of its vertices $\{\mu_i\}_{i=0}^{k}$:

$$\sigma(\{\mu_0, \dots, \mu_k\}) = \Big\{ \hat{\mu}(\lambda) = \arg\min_{\nu} \sum_{i=0}^{k} \lambda_i\, W_2^2(\nu, \mu_i) \;:\; \lambda \in \Lambda_k \Big\},$$

where a given vector $\lambda$ of barycentric coordinates uniquely determines a point cloud $\hat{\mu}(\lambda) \in \sigma \subset \mathcal{W}_2(\mathbb{R}^2)$ w.r.t. the simplex $\sigma$ spanned by $\{\mu_i\}_{i=0}^{k}$. Simplices can be "glued" together to comprise a simplicial complex, formally a collection of simplices. In particular, given a set of points $X$ in a metric space $\mathcal{X}$ and a threshold parameter $\varepsilon \ge 0$, the Vietoris-Rips (VR) complex [67] on $X$ is defined as

$$\mathrm{VR}_\varepsilon(X) = \{\, \sigma \subseteq X : d(x, x') \le \varepsilon \ \text{for all}\ x, x' \in \sigma \,\}.$$

The expression above is a collection of $k$-simplices $\sigma$ such that $d(x, x') \le \varepsilon$ for all pairs of points $(x, x') \in \sigma$, for some distance function $d(\cdot, \cdot)$ on $\mathcal{X}$. The VR complex is equivalent to the clique complex of the $\varepsilon$-neighborhood graph; that is, a $k$-simplex $\sigma$ corresponds to a clique in the $\varepsilon$-neighborhood graph. In practice, we consider the clique complex of the k-NN neighborhood graph.
In our work, we chose the density-adaptive continuous kNN (ckNN) neighborhood graph [4] $G_{\delta,k} = (X, E)$, where an edge $(x, x')$ is present in the graph if and only if

$$d(x, x') < \delta \sqrt{d(x, x_k)\, d(x', x'_k)},$$

where $x_k$ and $x'_k$ are the $k$-th nearest neighbors of the points $x$ and $x'$, respectively. Intuitively, the parameter $k \in \mathbb{N} \setminus \{0\}$ controls the radius of a sphere around each specific point, while $\delta \in \mathbb{R}_{\ge 0}$ scales all the radii at once. By that, the sphere around a point adapts to the local density of the data, and the ckNN graph has an edge wherever a pair of spheres intersects.
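A small numpy helper (an illustrative sketch, assuming the ckNN condition as stated above; `cknn_edges` is not from any library) makes the density-adaptive criterion explicit:

```python
import numpy as np
from itertools import combinations

def cknn_edges(X, k=3, delta=1.0):
    """Edge set of the continuous kNN graph: (x, x') is connected iff
    d(x, x') < delta * sqrt(d(x, x_k) * d(x', x'_k)), where x_k is the
    k-th nearest neighbor of x, so the radius adapts to the local density."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    r = np.sort(D, axis=1)[:, k]   # distance to each point's k-th nearest neighbor
    return [(i, j) for i, j in combinations(range(len(X)), 2)
            if D[i, j] < delta * np.sqrt(r[i] * r[j])]

# Three nearby points and one outlier: only the dense pair gets connected.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
print(cknn_edges(X, k=1, delta=1.2))  # -> [(0, 1)]
```

The cliques of this graph are then the simplices from which the barycentric sampling above draws.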
Examples of the manifold-barycentric data points on such a graph are illustrated in Figure S6.

Distance between simplex and its vertices
We sample new landmarks from the convex hulls of adjacent basis elements. Two natural questions arise: what is the distance between the new data and the basis elements, and how does the performance depend on the structure of the convex hull?
The following theorem addresses these questions and provides an upper-bound approximation.

Theorem 1. Consider a convex polyhedron with $k$ vertices $\{x_1, \dots, x_k\}$ in some metric space with a linear distance function $\rho(x, y)$. Let $x'$ be an internal point of the polyhedron; for our purposes, this point can be taken as a barycenter. Then the set of spheres $\{B_1(x_1, r_1), \dots, B_k(x_k, r_k)\}$, with $r_i = \rho(x_i, x')$ for all $i$, covers the entire polyhedron. Moreover, let $\mu_1$ be a uniform measure in the polyhedron and $\mu_2$ be a discrete uniform measure on $\{x_1, \dots, x_k\}$; then

$$W(\mu_1, \mu_2) \le \frac{1}{k} \sum_{i=1}^{k} \frac{r_i^2}{2}.$$

Proof. The first statement is proved by induction on the number of vertices $k$ of the polyhedron. The base cases $k = 1, 2$ are evident. Assuming the statement is true for $k$, we prove it for $k + 1$. Delete one vertex from the polyhedron and consider two cases. In the first case, the internal point lies inside the polyhedron of $k$ points; then, by the induction assumption, it is covered by the spheres. In the second case, the internal point lies outside the polyhedron of $k$ points; then one projects the internal point onto the nearest point of the $k$-polyhedron. Under this projection, all distances to the projected point decrease: for any two points and a plane that separates them, projecting one of the two points onto the plane reduces the distance between them. This holds for all points of the $k$-polyhedron and all projections of the internal point. We thus obtain that all $k$-polyhedrons are covered. Now consider an arbitrary point of the $(k+1)$-polyhedron and draw a line from the internal point to it. This line intersects some $k$-polyhedron and belongs to some sphere. We have found two points (the internal one and the point of intersection) that belong to one sphere, with our arbitrary point lying on the interval between them. Hence, the first statement is proven.
We have covered the polyhedron by the spheres. Consider the intersection of one sphere and the polyhedron, and map all points of this sector to the center of the sphere (the polyhedron's vertex). The transport cost of such a mapping is $r^2/2$, because the angle is independent of the radius. The uniform measure on such a sector dominates the uniform measure of the corresponding part of the polyhedron: for two uniform distributions with one "inside" the other, the Wasserstein distance from the outer measure to any common point of the two measures is greater than the distance from the inner measure. Finally, the factor $1/k$ comes from the uniformity of the measure on the vertices, which yields the final formula.

Figure S6. In each row, the simplex landmarks are shown in red and their computed barycenters are plotted in blue. Each simplex includes 10 faces. All vertices of a given simplex are ordered by the Wasserstein distance ($d$) with respect to the first one. Such locally sampled barycenters comprise the proposed LAMBO augmentation set, which is guaranteed to keep the new data within the original data manifold.
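The covering claim of Theorem 1 admits a quick numerical sanity check. The numpy sketch below (a random polytope in the Euclidean plane; all names are illustrative) verifies by Monte Carlo that every point of the polytope falls into at least one of the vertex-centered balls of radius $r_i = \rho(x_i, x')$:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
verts = rng.normal(size=(k, 2))                  # vertices of a convex polytope
x_star = rng.dirichlet(np.ones(k)) @ verts       # an internal point (a barycenter)
radii = np.linalg.norm(verts - x_star, axis=1)   # r_i = rho(x_i, x')

# Monte Carlo: random convex combinations of the vertices lie in the polytope.
pts = rng.dirichlet(np.ones(k), size=10_000) @ verts
dists = np.linalg.norm(pts[:, None, :] - verts[None, :, :], axis=-1)
covered = (dists <= radii[None, :] + 1e-9).any(axis=1)
print(covered.all())  # -> True: every sampled point lies in some ball B(x_i, r_i)
```

This only probes the first statement of the theorem empirically; the transport bound itself follows from the sector argument above.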