A Survey of Vectorization Methods in Topological Data Analysis

Attempts to incorporate topological information in supervised learning tasks have resulted in the creation of several techniques for vectorizing persistent homology barcodes. In this paper, we study thirteen such methods. Besides describing an organizational framework for these methods, we comprehensively benchmark them against three well-known classification tasks. Surprisingly, we discover that the best-performing method is a simple vectorization, which consists only of a few elementary summary statistics. Finally, we provide a convenient web application which has been designed to facilitate exploration and experimentation with various vectorization methods.


Introduction
Propelled by deep theoretical foundations and a host of computational breakthroughs, topological data analysis emerged roughly three decades ago as a promising method for extracting insights from unstructured data [32,14,42,44]. The principal instrument of the enterprise is persistent homology; this consists of three basic steps, each relying on a different branch of mathematics.
(1) Metric geometry: construct an increasing family {X_t} of cell complexes around the input dataset X, where the index t is a scale parameter in R_{≥0}.
(2) Algebraic topology: compute the d-th homology vector spaces H_d(X_t) for scales t in R_{≥0} and dimensions d in Z_{≥0}.
(3) Representation theory: decompose each family of vector spaces {H_d(X_t) | t ≥ 0} into irreducible summands, thus producing a barcode.
The resulting barcodes are finite multisets of real intervals [p, q] ⊂ R, which admit concrete geometric interpretations in low dimensions - see Figure 1. The ultimate goal is to infer the coarse geometry of X across various scales by examining the longer intervals in its barcodes. Crucially, once the method for constructing {X_t} from X has been fixed, the entire persistent homology pipeline is unsupervised: one requires neither labelled data nor hyperparameter tuning to produce barcodes from X.
At the other end of the data analysis spectrum lies supervised machine learning using contemporary neural networks, which are replete with billions of tunable parameters and gargantuan training datasets [3]. The practical aspects of deep neural networks appear to be light years ahead of the underlying theory. It nevertheless remains the case that machine learning has driven astonishing progress in the systematic automation of several important classification tasks. One direct consequence of these success stories is the irresistible urge to combine topological methods with machine learning. The most common avenue for doing so is to turn barcodes into vectors (lying in a convenient Euclidean space) which then become input for suitably-trained neural networks.
The good news, at least from an engineering perspective, is that barcodes are inherently combinatorial objects, and as such they are remarkably easy to vectorize. Several dozen vectorization methods have been proposed over the last decade, and new ones continue to appear with alarming frequency and increasing complexity - the reader will encounter thirteen of them here. The bad news, on the other hand, comes in the form of three serious challenges which must be confronted by those who build or use such vectorizations: (1) Given the large number of options, even established practitioners are not aware of all the vectorization techniques; similarly, knowledge of which vectorizations are suitable for which types of data is difficult - if not impossible - to glean from the published literature. (2) There is a natural metric between barcodes called the bottleneck distance; when endowed with this metric, the space of barcodes becomes infinite-dimensional and highly nonlinear. As such, it does not admit any faithful embeddings into finite-dimensional vector spaces. (3) Even the stable vectorizations, which preserve distances by mapping barcodes into infinite-dimensional vector spaces, may suffer from a lack of discriminative power in practice: by design, they are poor at distinguishing between datasets whose coarse structures are similar and whose differences reside at finer scales.
In This Paper. Here we seek to comprehensively describe, catalogue and benchmark vectorization methods for persistent homology barcodes. The first contribution of this paper is the following taxonomy of the known methods, which we hope will serve as a convenient organizational framework for beginners and experts alike: (1) Statistical vectorizations: these summaries consist of basic statistical quantities; (2) Algebraic vectorizations: these are generated from polynomials; (3) Curve vectorizations: these come from maps R → H, where H is a vector space; (4) Functional vectorizations: these are maps of the form X → H for X ≠ R; (5) Ensemble vectorizations: these are generated from collections of training barcodes.
There are unavoidable overlaps between these five categories. When such an overlap occurs, we have placed the given vectorization technique in the earliest relevant category among those in the list above; thus, an algebraic vectorization given by polynomial functions of basic statistical quantities will be placed in category (1) rather than category (2). The reader might claim, quite reasonably, that category (3) should be subsumed into category (4). However, the sheer number of curve-based vectorizations compelled us to set them apart.
The second contribution of this paper is a comprehensive benchmarking of thirteen vectorization techniques across these five categories on three well-known image classification datasets. These datasets were selected to simultaneously (a) provide an increasing level of difficulty for topological methods, and (b) be instantly recognizable to the broader machine learning community. They are: the Outex texture database [43], the SHREC14 shape retrieval dataset [47], and the Fashion-MNIST database [59]. Surprisingly, the best-performing vectorization in all three cases is a rather naïve one obtained by collecting basic statistical quantities associated to (the multiset of) intervals in a given barcode.
Our third contribution is a companion web application which computes and visualizes all thirteen vectorization techniques investigated in this paper. In addition to running online, this web app can also be downloaded and run locally on more challenging datasets.
Not In This Paper. Vectorization methods form but a small part of the ever-expanding interface between topological data analysis and machine learning. As such, there are several related techniques which are not benchmarked here. The precise inclusion criteria for our study are as follows.
(1) We restrict our attention to those methods which produce genuine vectors from barcodes. In particular, kernel methods [50,17] are beyond the scope of this paper. (2) We only consider those vectorizations that are either straightforward for us to implement, or have an easily accessible and trusted implementation. For instance, path signature based vectorizations [21,33] are excluded. (3) We do not compare machine learning architectures designed for the explicit purpose of inferring (persistent) homology [16,37,40].
(4) We do not touch upon various attempts to design or study neural networks using tools from topological data analysis [41,15]. (5) Finally, even among methods which satisfy the first four criteria, we have discarded techniques which regularly obtained a classification accuracy below fifty percent.
Similar Efforts. The authors of [49] have summarised - but not compared - several vectorization and kernel methods for barcodes. Another summary (sans comparison) may be found in [53], with emphasis on metric aspects of the chosen vectorizations. The work of [23] describes a common overarching framework for what we have called curve vectorizations here. More recently, [7] and [24] have described and compared five and four vectorization methods respectively.
Outline. Notation and preliminaries involving barcodes are established in Section 1. In Sections 2 and 3 we introduce the thirteen vectorizations (suitably organised into our taxonomy) and the three datasets. Section 4 contains the results of our experiments, whose finer details have been relegated to Appendices A and B. We provide a description of the web app in Section 5 and some brief concluding remarks in Section 6.

Persistence Barcodes from Data
At its core, persistent homology studies sequences of finite-dimensional vector spaces and linear maps

V_0 → V_1 → · · · → V_n, with structure maps a_i : V_{i−1} → V_i.

Such sequences (V, a) are called persistence modules. Among the simplest examples are interval modules: for each pair of integers p ≤ q with [p, q] ⊂ [0, n], the corresponding interval module (I^{[p,q]}, c^{[p,q]}) has dim I^{[p,q]}_i equal to 1 whenever p ≤ i ≤ q and 0 otherwise; similarly, the map c^{[p,q]}_i is the identity whenever p + 1 ≤ i ≤ q and zero otherwise.
1.1. Structure and Stability. Every persistence module decomposes into a direct sum of interval modules. In particular, we have the following structure theorem [61,18].

Theorem 1.1. For every persistence module (V, a), there exists a unique set Bar(V, a) of subintervals of [0, n] along with a unique function Bar(V, a) → Z_{>0}, denoted [p, q] ↦ µ_{p,q}, for which we have an isomorphism

(V, a) ≅ ⊕_{[p,q] ∈ Bar(V,a)} (I^{[p,q]}, c^{[p,q]})^{⊕ µ_{p,q}}.
Thus, the algebraic object (V, a) may be fully recovered (up to isomorphism) from purely combinatorial data consisting of the set of intervals Bar(V, a) and the multiplicity function µ. Alternatively, one may view Bar(V, a) itself as a multiset containing µ_{p,q} copies of each interval [p, q]. This multiset is called the barcode of (V, a). It is often useful in applications to let the vector spaces V_i be indexed by real numbers rather than integers. With this modification in place, Bar(V) becomes a collection of real intervals [p, q] ⊂ R.
The most important property of persistence modules, beyond the structure theorem, is their stability [18]. There is a natural metric on the set of persistence modules called the interleaving distance, and a metric on the set of barcodes called the bottleneck distance.

Theorem 1.2. The assignment (V, a) ↦ Bar(V, a) is an isometry from the space of persistence modules (with the interleaving distance) to the space of barcodes (with the bottleneck distance).
The advantage of this theorem is that barcodes remain robust to (certain types of) perturbations of the original dataset, thus conferring upon the topological data analysis pipeline a degree of noise-tolerance. The significant difficulty from a statistical perspective, however, is that the metric space of persistence barcodes with the bottleneck distance is nonlinear - even averages cannot be defined for arbitrary collections of barcodes [56,26,11].

1.2. Barcodes from Data. Persistence modules arise naturally from a wide class of datasets. The first step in topological data analysis involves imposing the structure of a filtered cell complex - either simplicial [4, Chapter 8] or cubical [38] - on the data [32,14,42]. The two most prominent examples of filtered cell complex structures arising from data are as follows.
(1) Given a finite point cloud X ⊂ R^n, one constructs a family of increasing simplicial complexes {S_ε | ε ≥ 0} defined as follows. A collection {x_0, . . . , x_k} forms a k-simplex in S_ε if and only if the (Euclidean) distance between x_i and x_j is no larger than ε for all i, j in {0, . . . , k}. Since there are only finitely many values of ε at which new simplices are introduced, the filtration may be re-indexed by a subset of the natural numbers. The collection {S_ε} is called the Vietoris-Rips filtration of X. These filtrations can be defined for any metric space in a similar fashion. (2) Consider a grayscale image I, given in terms of m × n pixels with intensity values in the set {0, 1, . . . , 255}. This naturally forms a two-dimensional cubical complex, which can be endowed with the upper-star filtration by intensity values. In particular, each elementary cube of dimension < 2 appears at the smallest intensity encountered among the 2-dimensional cubes in its immediate neighbourhood.
Higher-dimensional cubical filtrations may be similarly generated from higher-dimensional pixel grids.
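For intuition, the Vietoris-Rips rule in (1) above can be realized in a few lines. The sketch below is an illustration only (it enumerates every vertex subset and is therefore exponential in the number of points); production libraries such as GUDHI or Ripser instead prune by a maximal scale.

```python
import numpy as np
from itertools import combinations

def vietoris_rips(points, max_dim=2):
    """List every simplex on the point cloud up to dimension max_dim,
    paired with its appearance scale: a simplex enters the filtration
    at the largest pairwise distance among its vertices."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = [((i,), 0.0) for i in range(n)]  # vertices appear at scale 0
    for k in range(2, max_dim + 2):              # k vertices span a (k-1)-simplex
        for s in combinations(range(n), k):
            simplices.append((s, max(dist[i, j] for i, j in combinations(s, 2))))
    return simplices
```

On the three corners of a right triangle, for instance, the hypotenuse edge and the filled triangle both appear at the length of the hypotenuse.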
Once the given dataset has been suitably modeled by a filtered cell complex, persistence modules are obtained by computing homology groups with coefficients in a field. The reader who is interested in the definition and computation of homology is urged to either consult standard algebraic topology references such as [35, Ch 2] or see the more recent [44,30,42].
A substantial difficulty in topological data analysis is that although persistent homology barcodes can be readily associated with a large class of datasets, the space of all such barcodes is notoriously unpleasant from a statistical perspective. Fortunately, barcodes are combinatorial objects which can be mapped to Hilbert spaces in a plethora of reasonable ways. Indeed, over the last decade, such vectorization methods have been proposed by various authors, and our main purpose in this work is to benchmark many of these methods against standard classification tasks.

Vectorization Methods for Barcodes
Throughout this section, we assume knowledge of the barcode B := Bar(V, a) of an R-indexed persistence module along with its multiplicity function µ : B → Z_{>0}. We note that for each interval [p, q] in B the numbers p and q are called its birth and death respectively, and the length q − p is called its lifespan.

Statistical Vectorizations.

Definition 2.1. The persistence statistics of µ : B → Z_{>0} consist of: (1) the mean, the standard deviation, the median, the interquartile range, the full range, and the 10th, 25th, 75th and 90th percentiles of the births p, the deaths q, the midpoints (p + q)/2 and the lifespans q − p over all intervals [p, q] in B counted with multiplicity; (2) the total number of bars (again counted with multiplicity); and (3) the entropy of µ, defined as the real number

E(µ) := − Σ_{[p,q] ∈ B} µ_{p,q} · ((q − p)/L_µ) · log((q − p)/L_µ),

where L_µ is the weighted sum

L_µ := Σ_{[p,q] ∈ B} µ_{p,q} · (q − p).    (1)

The entropy from Definition 2.1(3) was introduced in [22,52]. Our second statistical vectorization is from [6], where entropy has been upgraded from a single number to a real-valued piecewise constant function.

Definition 2.2. The entropy summary function of µ : B → Z_{>0} is the map S_µ : R → R given by

S_µ(t) := − Σ_{[p,q] ∈ B} µ_{p,q} · ((q − p)/L_µ) · log((q − p)/L_µ) · 1_{p ≤ t ≤ q}.

Here 1_• is the indicator function - it equals 1 when the condition • is true and it equals 0 otherwise. The number L_µ appearing in the expression above is defined in (1).
The entropy summary function has also been called the life entropy curve, e.g., in [23].
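For concreteness, the statistics of Definition 2.1 and the entropy summary function of Definition 2.2 can be computed directly from a list of (birth, death) pairs. The sketch below is our own minimal version (multiplicities are handled by simply repeating bars in the list), not the benchmarked implementation.

```python
import numpy as np

def persistence_statistics(bars):
    """Definition 2.1 on a list of (birth, death) pairs; repeat each bar
    according to its multiplicity before calling."""
    b = np.array([p for p, q in bars], dtype=float)
    d = np.array([q for p, q in bars], dtype=float)
    feats = []
    for x in (b, d, (b + d) / 2, d - b):      # births, deaths, midpoints, lifespans
        feats += [x.mean(), x.std(), np.median(x),
                  np.percentile(x, 75) - np.percentile(x, 25),   # interquartile range
                  x.max() - x.min(),                              # full range
                  *np.percentile(x, [10, 25, 75, 90])]
    life = d - b
    L = life.sum()
    entropy = -np.sum((life / L) * np.log(life / L))
    return np.array(feats + [len(bars), entropy])

def entropy_summary(bars, t):
    """Definition 2.2: the entropy summands restricted to bars containing t."""
    life = np.array([q - p for p, q in bars], dtype=float)
    L = life.sum()
    inside = np.array([p <= t <= q for p, q in bars])
    return -np.sum(inside * (life / L) * np.log(life / L))
```

The resulting statistics vector has 4 × 9 + 2 = 38 entries.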

Algebraic Vectorizations.
The vectorizations in this category are generated using polynomial maps constructed from the barcode µ : B → Z_{>0}.
The first example considered here is from [2]. It becomes convenient, for the purpose of defining it, to arbitrarily order the intervals in B as {[p_i, q_i] | 1 ≤ i ≤ n}, with the understanding that each [p, q] occurs µ_{p,q} times in this ordered list.

Definition 2.3. The ring of algebraic functions on µ : B → Z_{>0} consists of all those R-polynomials f in the variables {x_1, y_1, . . . , x_n, y_n} which satisfy a compatibility condition expressed in terms of the partial derivatives ∂f/∂x_i and ∂f/∂y_i; we refer the reader to [2] for the precise formulation. (Here ∂f/∂x_i indicates the partial derivative of f with respect to x_i, and so forth.)
The desired vectorization is obtained by selecting finitely many algebraic functions from this ring and evaluating them at x_i = p_i and y_i = q_i for all i. The feature maps generated by making such choices are sometimes called Adcock-Carlsson coordinates - see for instance [46]. Letting q_max be the maximum death value encountered among the intervals in B, four of the most widely used algebraic functions are:

f_1 = Σ_i p_i (q_i − p_i),    f_2 = Σ_i (q_max − q_i)(q_i − p_i),
f_3 = Σ_i p_i^2 (q_i − p_i)^4,    f_4 = Σ_i (q_max − q_i)^2 (q_i − p_i)^4.

Small changes in the barcode (in terms of the bottleneck distance) are liable to create large fluctuations in the associated algebraic functions. The methods of tropical geometry were used in [39] to address this bottleneck instability. In the tropical setting, the standard polynomial operations (+, ×) are systematically replaced by (max, +). To define the resulting vectorization, we once again use an ordering {[p_i, q_i] | 1 ≤ i ≤ n} of the intervals in B.

Definition 2.4. A tropical coordinate function for µ : B → Z_{>0} is a function F of the variables {x_1, y_1, . . . , x_n, y_n} which is both tropical and symmetric, as described below.
(1) Tropical: there is an expression for F which uses only the operations max, min, + and − on the variables {x_i} and {y_i}.
(2) Symmetric: F is invariant under every permutation of the pairs (x_i, y_i).

Let λ_i be the lifespan q_i − p_i of the i-th interval in B. To generate feature maps from the tropical coordinate functions described above, one simply evaluates them at x_i = λ_i and y_i equal to either max(rλ_i, p_i) or min(rλ_i, p_i) for a positive integer parameter r. Examples of such tropical coordinate features include the maximum lifespan max_i λ_i and the total lifespan Σ_i λ_i, along with somewhat more complicated expressions built from the quantities max(rλ_i, p_i) and min(rλ_i, p_i); see [39] for the full list. These seven tropical coordinates were used in [39] for performing classification on the MNIST database, with r = 28.
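Returning to the Adcock-Carlsson coordinates above, the corresponding feature map is a one-liner per coordinate. The four polynomials below are our reconstruction of the choices that appear frequently in implementations surrounding [2,46]; treat the exact exponents as an assumption rather than a quotation.

```python
import numpy as np

def adcock_carlsson_coordinates(bars):
    """Four commonly used algebraic-function features of a barcode
    (reconstructed from the literature around [2]; the exact polynomials
    are an assumption, not a quotation)."""
    p = np.array([b for b, d in bars], dtype=float)
    q = np.array([d for b, d in bars], dtype=float)
    qmax = q.max()
    life = q - p
    return np.array([
        np.sum(p * life),                  # births weighted by lifespans
        np.sum((qmax - q) * life),         # distance-to-latest-death weighting
        np.sum(p ** 2 * life ** 4),
        np.sum((qmax - q) ** 2 * life ** 4),
    ])
```

Evaluating on a two-bar barcode shows how each coordinate aggregates over the whole barcode at once.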
The third and final algebraic vectorization considered here is generated by extracting complex polynomials from barcodes [31,27]. In what follows, the symbol i should be interpreted as √−1 (and not as an index for the intervals in B). The construction employs three continuous maps R, S, T : R^2 → C, whose precise formulas may be found in [31,27]; in them, α denotes the norm of the point (x, y) ∈ R^2.

Definition 2.5. Given a barcode µ : B → Z_{>0}, let X : R^2 → C be any one of the three functions R, S, T mentioned above. The complex polynomial vectorization of µ of type X is the sequence of coefficients of the complex polynomial in one variable z given by

C_X(z) := Π_{[p,q] ∈ B} (z − X(p, q))^{µ_{p,q}}.

In practice, it is customary to either take only the first few highest-degree coefficients of C_X(z) or to multiply it by a suitable power of z. This is done to guarantee that the feature vectors assigned to a collection of barcodes all have the same dimension.
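Taking X to be the map sending (p, q) to p + iq - an illustrative stand-in of ours for one of the transforms R, S, T of [31,27] - the coefficient vector is easy to obtain with `numpy.poly`:

```python
import numpy as np

def complex_poly_vectorization(bars, keep=4):
    """Coefficients of the polynomial whose roots are the transformed bars.
    The transform (p, q) -> p + 1j*q used here is an illustrative stand-in
    for the maps R, S, T of the text."""
    roots = np.array([p + 1j * q for p, q in bars])
    coeffs = np.poly(roots)        # highest-degree coefficient first
    return coeffs[1:keep + 1]      # drop the leading 1, keep `keep` coefficients
```

As noted above, in practice one pads or truncates so that all barcodes in a collection receive vectors of the same length.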
Other Algebraic Vectorizations: In the subsequent section, we describe how to extract vectorizations by using barcode data to build curves which take values in a vector space. Once such a curve has been extracted, one can compute its path signature via iterated integrals [20]. The path signature resides in the tensor algebra of the target vector space; elements of the tensor algebra are equivalent to coefficients of non-commuting polynomials, and hence constitute algebraic vectorizations of barcodes - see [21,33] for examples of this approach.

Curve Vectorizations.
There are several interesting ways of turning barcodes into one or more curves, which for our purposes here mean (piecewise) continuous maps from R to a convenient vector space. Feature vectors can then be constructed by sampling the given curve at finite subsets of R. Perhaps the simplest and most widely used curve-based vectorization is the following.

Definition 2.6. The Betti curve of µ : B → Z_{>0} is the curve β_µ : R → R given by

β_µ(t) := Σ_{[p,q] ∈ B} µ_{p,q} · 1_{p ≤ t ≤ q}.

Here 1_• is the indicator function as described in Definition 2.2, so this function counts the number of intervals (with multiplicity) in B which contain t. Very similar in spirit (and formula) to the Betti curve is the following vectorization from [23].
Definition 2.7. The lifespan curve of µ : B → Z_{>0} is the map L_µ : R → R given by

L_µ(t) := Σ_{[p,q] ∈ B} µ_{p,q} · (q − p) · 1_{p ≤ t ≤ q}.

It is not difficult to create very different-looking Betti and lifespan curves from two barcodes which have arbitrarily small bottleneck distance - we can always add many very small intervals to a given barcode without changing its bottleneck distance by a significant amount. One way to rectify the bottleneck instability of Betti and lifespan curves is to test the containment not merely of t in each interval [p, q] ∈ B, but rather of the largest subinterval of the form [t − s, t + s]. This modification leads to one of the oldest and best-known stable curve vectorizations [10,12], defined below.
Definition 2.8. The persistence landscape of the barcode µ : B → Z_{>0} is the sequence of curves Λ^µ_i : R → R indexed by i ∈ Z_{>0}, where

Λ^µ_i(t) := sup { s ≥ 0 | the interval [t − s, t + s] is contained in at least i intervals of B, counted with multiplicity }.

By convention, the supremum over the empty set is zero. Moreover, since our barcode B is assumed to be finite, the landscape functions Λ^µ_i become identically zero for sufficiently large i. An alternative approach to defining persistence landscapes comes from the function

∆([p, q], t) := max(0, min(t − p, q − t)).    (2)

For each i ∈ Z_{>0}, the curve Λ^µ_i from Definition 2.8 equals the i-th largest number in the multiset that contains µ_{p,q} copies of ∆([p, q], t) for each interval [p, q] in B. The fourth and final curve vectorization that we consider here was introduced in [19], and it is also defined in terms of the functions ∆ from (2).

Definition 2.9. Let w : B → R_{>0} be any function, which we will denote [p, q] ↦ w_{p,q}. The w-weighted persistence silhouette of µ : B → Z_{>0} is the map φ^w_µ : R → R defined as the weighted average φ^w_µ(t) := ( Σ w_{p,q} · µ_{p,q} · ∆([p, q], t) ) / ( Σ w_{p,q} · µ_{p,q} ).
Here both sums on the right are indexed over all [p, q] ∈ B, and ∆ is defined in (2).
Reasonable choices of weight functions are provided by setting w p,q = (q − p) α for a real-valued scale parameter α ≥ 0. For small α, the shorter intervals dominate the value of the silhouette curve, whereas for large α it is the longer intervals which play a more substantial role -see [19,Sec 4] for details.
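Sampling these curves on a uniform grid yields feature vectors directly. The sketch below follows Definitions 2.6, 2.8 and 2.9 (our own minimal version; multiplicities are handled by repeating bars in the list).

```python
import numpy as np

def _triangle(p, q, t):
    """The function from (2): max(0, min(t - p, q - t))."""
    return max(0.0, min(t - p, q - t))

def betti_curve(bars, grid):
    """Definition 2.6 sampled on `grid`: count the bars containing each t."""
    return np.array([sum(1 for p, q in bars if p <= t <= q) for t in grid])

def persistence_landscape(bars, grid, depth=3):
    """First `depth` landscape curves: at each t, the i-th largest triangle
    value, padding with zeros once the bars run out."""
    rows = []
    for t in grid:
        vals = sorted((_triangle(p, q, t) for p, q in bars), reverse=True)
        rows.append([vals[i] if i < len(vals) else 0.0 for i in range(depth)])
    return np.array(rows).T            # shape (depth, len(grid))

def silhouette(bars, grid, alpha=1.0):
    """Power-weighted silhouette of Definition 2.9, with w_{p,q} = (q-p)**alpha."""
    w = np.array([(q - p) ** alpha for p, q in bars])
    tri = np.array([[_triangle(p, q, t) for p, q in bars] for t in grid])
    return tri @ w / w.sum()
```

Note how, on a two-bar barcode, the second landscape curve becomes nonzero exactly where the two triangles overlap.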
Other Curve Vectorizations: See the envelope embedding from [21], the accumulated persistence function in [9], and the persistent Betti function of [57]. In [29], the persistent Betti function is decomposed along the Haar basis to produce a vectorization. More recently, [23] provides a general framework for constructing several different curve vectorizations.

Functional Vectorizations.
Here we catalogue those barcode vectorizations which are given by maps from spaces other than R. The first, and perhaps most prominent, member of this category is the following vectorization from [1]. Its definition below makes use of two auxiliary components besides the given barcode µ : B → Z_{>0}. The first is a continuous, piecewise-differentiable function f : R^2 → R_{≥0} satisfying f(x, 0) = 0 for all x ∈ R. The second is a collection of smooth probability distributions Ψ := {ψ_{p,q} | [p, q] ∈ B}, where ψ_{p,q} has mean (p, q − p).

Definition 2.10. The persistence surface of µ : B → Z_{>0} with respect to f and Ψ (as described above) is the function ρ_µ : R^2 → R given by

ρ_µ(z) := Σ_{[p,q] ∈ B} µ_{p,q} · f(p, q − p) · ψ_{p,q}(z).

The persistence image I^{f,Ψ}_µ of µ with respect to (f, Ψ) assigns a real number to every subset Z ⊂ R^2; this number is given by integrating the persistence surface over Z:

I^{f,Ψ}_µ(Z) := ∫_Z ρ_µ(z) dz.

In order to obtain a vector from the persistence image, one lets Z range over grid pixels in a rectangular subset of R^2 and renormalizes the resulting array of numbers, thus producing a grayscale image. Standard choices are a weighting function f which grows linearly with the lifespan relative to λ_max, and Gaussian distributions ψ_{p,q} centred at (p, q − p). Here λ_max is the largest lifespan max_{[p,q] ∈ B}(q − p) encountered among the intervals in B, and σ is a user-defined parameter which forms the common standard deviation of every ψ_{p,q} in sight.
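A bare-bones persistence image with Gaussian bumps and linear lifespan weighting (one standard choice; the grid bounds and σ below are our own illustrative parameters) might look like this, evaluating the surface at pixel centres rather than integrating:

```python
import numpy as np

def persistence_image(bars, res=8, sigma=0.1, bounds=(0.0, 1.0)):
    """Evaluate the persistence surface on a res x res grid in the
    birth-lifespan plane, using Gaussian bumps weighted by lifespan
    relative to the largest lifespan, then renormalize to [0, 1]."""
    lo, hi = bounds
    xs = np.linspace(lo, hi, res)
    X, Y = np.meshgrid(xs, xs)            # X: birth axis, Y: lifespan axis
    lam_max = max(q - p for p, q in bars)
    img = np.zeros((res, res))
    for p, q in bars:
        life = q - p
        weight = life / lam_max           # linear lifespan weighting
        img += weight * np.exp(-((X - p) ** 2 + (Y - life) ** 2) / (2 * sigma ** 2))
    return img / img.max()                # grayscale image
```

A single bar produces a single bump whose peak pixel sits nearest the point (birth, lifespan).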
The second and final functional vectorization which we will examine was introduced in [46]. Set W := {(x, y) ∈ R^2 | 0 ≤ x < y}, and note that points (x, y) ∈ W parameterize intervals [x, y] ⊂ R with strictly positive length that could possibly lie in a given barcode. Let C_c(W) be the set of all continuous functions f : W → R with compact support. The given barcode µ : B → Z_{>0} assigns to every such f the real number

ν_µ(f) := Σ_{[p,q] ∈ B} µ_{p,q} · f(p, q).

A subset T of C_c(W) is called a template system if for any distinct pair µ_1 : B_1 → Z_{>0} and µ_2 : B_2 → Z_{>0} of barcodes, there exists at least one f ∈ T so that ν_{µ_1}(f) ≠ ν_{µ_2}(f).

Definition 2.11. Fix an integer n > 0 and let Sub_n(T) be the collection of all size-n subsets of a template system T as described above. The template function vectorization of µ : B → Z_{>0} with respect to T is the map τ : Sub_n(T) → R^n defined as follows: given F = {f_1, . . . , f_n} in Sub_n(T), we set τ(F) := (ν_µ(f_1), . . . , ν_µ(f_n)).

Two convenient choices of T, called tent functions and interpolating polynomials, have been highlighted in [46]. Tent functions are indexed by points (u, v) ∈ W and require an additional parameter δ > 0. By construction, each such function is supported on the square of side length 2δ around the point (u, v) in the birth-lifespan plane. The normal pipeline for selecting finitely many template functions requires covering a sufficiently large bounded subset of W with a square grid and then selecting the appropriate tent functions supported on grid cells. We direct interested readers to [46, Sections 6 and 7] for details on interpolating polynomials and for suggestions on how one might select a suitable n and F ∈ Sub_n(T) for a given classification task.
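To make this concrete, here is one plausible tent-function template together with the induced feature map. The tent formula is our reconstruction of the shape described above (a bump of height 1 on a square of side 2δ around (u, v) in the birth-lifespan plane); see [46] for the authoritative formula.

```python
import numpy as np

def tent(u, v, delta):
    """A tent supported on the square of side 2*delta around (u, v) in the
    birth-lifespan plane (a reconstruction; see [46] for the original)."""
    def g(p, q):
        return max(0.0, 1.0 - max(abs(p - u), abs((q - p) - v)) / delta)
    return g

def template_vectorization(bars, templates):
    """Definition 2.11: evaluate each template on every bar and sum."""
    return np.array([sum(f(p, q) for p, q in bars) for f in templates])
```

Covering a bounded region of the birth-lifespan plane with a grid of such tents and stacking the sums reproduces the usual pipeline.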
Other Functional Vectorizations: See the generalised persistence landscape in [8] and the crocker stacks of [58].

Ensemble Vectorizations.
Our last category contains two methods which require access to a sufficiently large collection of training barcodes µ_i : B_i → Z_{>0} in order to generate a vectorization. The first of these methods, introduced in [48], is a modification of the template system vectorization from Definition 2.11. We recall that W ⊂ R^2 is defined as {(x, y) | 0 ≤ x < y} and that every barcode B is identified with a subset P(B) ⊂ W via the map that sends intervals [p, q] of positive length to points (p, q).

Definition 2.12. The adaptive template system induced by a collection of barcodes {µ_i : B_i → Z_{>0}} is obtained via the following two steps. Letting P ⊂ W be the union ∪_i P(B_i), one
(1) identifies finitely many ellipses E_j ⊂ W which tightly contain P, and then
(2) constructs suitable functions g_j supported on E_j, as described in (5) below.
The desired vectorization of a new barcode µ : B → Z_{>0} is now obtained by using these g_j, rather than tent functions, as the template functions in Definition 2.11. Three different methods for finding the E_j can be found in [48, Sec 3]. Let v* denote the transpose of a given vector v in R^2. Each ellipse E with centre x = (x_1, x_2)* then corresponds to a symmetric 2 × 2 matrix A satisfying

E = { y ∈ R^2 | (y − x)* A (y − x) ≤ 1 }.

The second instance of an ensemble vectorization framework which we benchmark in this paper is from [51]. Let µ_i : B_i → Z_{>0} be a collection of training barcodes as before, and fix a dimension parameter b ∈ Z_{>0}. Much like the adaptive template systems of Definition 2.12, the automatic topology-oriented learning (ATOL) vectorization is a two-step process for mapping each B_i to a vector space, which in this instance is always R^b.

Definition 2.13. The ATOL contrast functions corresponding to the collection of barcodes {µ_i : B_i → Z_{>0}} and parameter b ∈ Z_{>0} are obtained as follows: (1) Treating the point clouds P_i := {(p, q) ∈ R^2 | [p, q] ∈ B_i and q > p} as discrete measures on R^2, one estimates their average measure E.
(2) Let z := (z_1, z_2, . . . , z_b) be a point sample in R^2 drawn (in an independent, identically distributed fashion) along E. Define the real numbers σ_i(z) for 1 ≤ i ≤ b in terms of the pairwise distances ‖z_i − z_j‖_2 between sample points, where ‖•‖_2 denotes the usual Euclidean norm on R^2; see [51] for the precise formula.
The contrast functions Φ_i : R^2 → R are then built around the sample points z_i, with σ_i(z) controlling their spread; the reader is directed to [51, Algorithm 1] for further details. Once the contrast functions have been produced in the manner described above, the i-th coordinate of the corresponding ATOL vectorization of a given barcode µ : B → Z_{>0} is obtained by summing the values of Φ_i over the points of P(B), counted with multiplicity.

Other Ensemble Vectorizations: The persistence codebooks approach from [60] proposes three different types of barcode vectorizations; these are based on bag-of-words embeddings, VLAD (vector of locally aggregated descriptors), and Fisher vectors respectively.
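Returning to ATOL, its two steps can be mimicked as follows. This sketch substitutes a naive center sample and Laplacian-style contrast functions for the precise choices of [51, Algorithm 1], so every formula in it should be treated as an assumption.

```python
import numpy as np

def atol_like_vectorization(train_clouds, new_cloud, b=4, seed=0):
    """Sketch of an ATOL-style ensemble vectorization (a hypothetical
    simplification of [51]): sample b centers from the pooled training
    points, set each spread to the distance to the nearest other center,
    and sum Laplacian-style contrast functions over the new point cloud."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack(train_clouds)
    centers = pooled[rng.choice(len(pooled), size=b, replace=False)]
    # spread: distance from each center to its nearest other center
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    sigma = d.min(axis=1)
    # i-th coordinate: sum of the i-th contrast function over the new cloud
    dists = np.linalg.norm(new_cloud[:, None] - centers[None, :], axis=-1)
    return np.exp(-dists / sigma).sum(axis=0)
```

The output always lives in R^b, regardless of how many bars the new barcode has.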

Datasets
The vectorization methods described in the preceding section have been benchmarked against three standard datasets; these are described below, arranged in increasing order of difficulty for topological methods. All three have been used in the past for comparing vectorizations (or kernels) for persistence barcodes [46,48,50,17,34,21].

Outex.
Outex is a database of images developed for the assessment of texture classification algorithms [43] - see Fig. 2, bottom-right, for some samples of textures from the 68 categories. Each texture class contains 20 images of size 128 × 128 pixels, which results in 1,360 images in total. We designed a reduced version of the experiment by randomly selecting 10 of the 68 classes in the dataset, which we refer to as Outex10 below. The full classification task is referred to as Outex68. In both cases, a train/test split of 70/30 has been applied.
We treat each image as a cubical complex; the filtration is induced by considering the pixel intensity on the 2-dimensional cells, which is inherited by the other cells via the lower-star and upper-star filtrations. Persistent homology barcodes are computed in dimensions 0 and 1 using the GUDHI library [28]. No pre-processing has been applied to the images.
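Our barcodes come from GUDHI; purely for intuition, the 0-dimensional sublevel-set persistence of a small intensity array can be computed by a union-find sweep like the following (our own sketch, not GUDHI's algorithm). The surviving component is reported with death equal to the maximal intensity.

```python
import numpy as np

def sublevel_h0(img):
    """0-dimensional persistence of the sublevel-set filtration of a 2-D
    intensity array: sweep pixels by increasing value, merge components
    with union-find, and record (birth, death) pairs by the elder rule."""
    parent, birth, bars = {}, {}, []

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    h, w = img.shape
    for val, r, c in sorted((img[r, c], r, c) for r in range(h) for c in range(w)):
        parent[(r, c)], birth[(r, c)] = (r, c), val
        for q in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if q in parent:                  # neighbour already entered
                rp, rq = find((r, c)), find(q)
                if rp != rq:
                    if birth[rp] > birth[rq]:
                        rp, rq = rq, rp      # elder rule: younger component dies
                    if birth[rq] < val:      # skip zero-length bars
                        bars.append((birth[rq], val))
                    parent[rq] = rp
    for root in {find(p) for p in parent}:   # essential class(es)
        bars.append((birth[root], float(img.max())))
    return sorted(bars)
```

On an image with three local minima separated by a high ridge, each minimum spawns a component that dies when the ridge fills in.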

SHREC14.
The Shape Retrieval of Non-Rigid 3D Human Models dataset, usually abbreviated SHREC14 [47], is designed to test shape classification and retrieval algorithms. It contains real and synthetic human shapes and poses stored as 3D meshes (which are already simplicial complexes). We use the synthetic part of the dataset; this constitutes a classification task with 15 classes (5 men, 5 women and 5 children), each with 20 different poses - see the upper-right corner of Fig. 2.
We apply the Heat Kernel Signature (HKS) to obtain filtrations [54,50]. For a fixed real parameter t > 0, this filtration assigns to each mesh point x the value

HKS_t(x) := Σ_i exp(−λ_i t) · φ_i(x)^2.    (6)

Here λ_i and φ_i are the eigenvalues and corresponding eigenfunctions of (a discrete approximation to) the Laplace-Beltrami operator of the given mesh. Every simplex of dimension > 0 is assigned the largest value of HKS_t encountered among its vertices. We used the pre-computed barcodes (for such filtrations across a range of t-values) provided in the repository accompanying [7]. Of the 300 samples, 70% were used for training and the other 30% for testing.
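Given eigenpairs of a mesh Laplacian, the HKS values are a one-liner; the eigensolver itself is outside the scope of this sketch.

```python
import numpy as np

def heat_kernel_signature(eigvals, eigvecs, t):
    """HKS_t at every vertex: sum_i exp(-lambda_i * t) * phi_i(x)**2.
    `eigvecs` holds one eigenfunction per column."""
    return (np.exp(-eigvals * t) * eigvecs ** 2).sum(axis=1)
```

The returned per-vertex values define the filtration; each higher-dimensional simplex then takes the maximum over its vertices.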

FMNIST.
The Fashion-MNIST database contains 28 × 28 grayscale images (7,000 images per class, with 10 classes) - see the left side of Fig. 2 for some sample images. We split this dataset into 60,000 training and 10,000 testing images.
The filtration used for generating barcodes is as follows: we performed padding, median filtering, and shallow thresholding before computing Canny edges [13]. Each pixel is then given a filtration value equal to its distance from the edge pixels. Finally, all other cells inherit filtration values from the top-dimensional cells via the lower-star filtration rule.

Results
Here we report the classification accuracy of the thirteen vectorization methods from Section 2 on each of the three datasets from Section 3. Implementation details and parameter choices are provided in Appendix A. The source code is available at the following GitHub repository: https://github.com/Cimagroup/vectorization-maps.

Outex.

Table 1 displays the classification accuracy for the smaller (and easier) experiment on 10 classes. As one might expect, all techniques perform rather well, with Persistence Statistics and Algebraic Functions sharing the best performance at 99.2% accuracy each, followed closely by Persistent Silhouettes at 98.3%.
Results from the full experiment with 68 classes are contained in Table 2; as one might expect, the performance of every single vectorization degrades in the passage from Outex10 to Outex68. Here Persistence Statistics is the clear winner by a significant margin, earning 93.4% accuracy. Tropical Coordinates ranks second with 88.7%. Setting aside the outstanding performance of Persistence Statistics, it appears clear from these results that the algebraic vectorizations perform far better on Outex68 than the vectorizations from the other categories. We note that the authors of [23] have also used Outex to compare the performance of various curve vectorizations, with Persistence Statistics serving as a baseline. They too obtained their best results with Persistence Statistics.

SHREC14.
We used 10 different t-values t_1 < t_2 < · · · < t_10, as in [50,46,48], for generating filtrations via the heat kernel from (6). At t_10 we found several sparse or empty barcodes, which led us to discard that classification problem. Table 3 collects the best performance for each method across the first 9 values of t; it also contains the values of the optimal parameters (see Appendix A) and the optimal values of t. Persistence Statistics yielded the best classification accuracy of 94.7%, followed closely by Template Functions at 94.4%. One remarkable feature of these results is that the dataset does not appear to favour any one category of vectorizations over the others - it is possible to achieve over 88% accuracy by using a suitable statistical, algebraic, curve, functional or ensemble vectorization. In fact, only the curve-based vectorizations failed to achieve over 90% accuracy on this dataset. The variation of classification accuracy with the heat kernel parameter t is discussed in Appendix B.

FMNIST.
The results of our experiments on FMNIST are recorded in Table 4. We note that these experiments only used information contained in the 0-dimensional barcodes and that the SVM classifier was not used. The classification accuracy of all the methods is much lower than the corresponding figures for the two preceding datasets. Once more, the Persistence Statistics vectorization takes the top spot with 74.9%, with Template Functions slightly behind at 74.7%. One rather surprising aspect of these results is that Adaptive Template Systems performed far worse than ordinary Template Functions despite having recourse to 60,000 training barcodes. We do not have a clear explanation for this phenomenon, particularly in light of the fairly competitive performance of ATOL (which was exposed to the same training data).

Web Application
In order to illustrate and visualize the vectorization methods described here, we have built an interactive web application that runs on any modern browser; it is available at https://persistent-homology.streamlit.app/. The app has been built in Python using the Streamlit library and makes use of several existing Python libraries. The sidebar contains options for selecting different types of input data and displays several options for data visualization. One sample image/pointcloud from each of the three datasets used in this paper has been pre-loaded, but the user is free to upload their own data. Specifications, formatting guidelines, and downloading instructions are available in our GitHub repository: https://github.com/dashtiali/vectorisation-app. The Persistence Statistics vectorization is purely numerical, so we show its values in a table, as in Figure 5. Algebraic vectorizations are illustrated as bar graphs. In Figure 6, for instance, one finds bars whose heights correspond to the values attained by the 7 chosen tropical coordinate polynomials on the input barcode. Curve vectorizations, such as persistence landscapes, are depicted via piecewise-linear graphs (see Figure 7). Sliders have been provided to set the resolution parameter. It is our hope that users will benefit from the ability to generate these visualizations without having to write any code of their own. In order to facilitate downstream analysis, the web app also provides the ability to download the vectors generated by each vectorization method.
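The piecewise-linear graphs of Figure 7 are samples of persistence landscapes: the k-th landscape λ_k(t) is the k-th largest value at t of the tent functions f_{(b,d)}(t) = max(0, min(t - b, d - t)) over the intervals (b, d) of the barcode. A minimal sketch (the grid of t-values plays the role of the app's resolution slider):

```python
def landscape(barcode, k, grid):
    """Sample the k-th persistence landscape on a grid of t-values.

    lambda_k(t) is the k-th largest value of the tent functions
    f_{(b,d)}(t) = max(0, min(t - b, d - t)) over intervals (b, d).
    """
    out = []
    for t in grid:
        tents = sorted((max(0.0, min(t - b, d - t)) for b, d in barcode),
                       reverse=True)
        out.append(tents[k - 1] if k <= len(tents) else 0.0)
    return out
```

For the barcode {(0, 2), (1, 3)}, the first landscape peaks at height 1 over t = 1 and t = 2 with a dip to 0.5 at t = 1.5, which is exactly the piecewise-linear profile one sees plotted in the app.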

Concluding Remarks
At the time of writing, it remains difficult to pinpoint the attributes which make a given vectorization method a good choice for a particular classification problem. There are no powerful theorems or immutable doctrines available to guide scientists who wish to incorporate topological information into machine learning pipelines. In the absence of such theoretical foundations, the best that one can expect are principled heuristics supported by reproducible empirical evidence. This paper is an outcome of our efforts to provide such evidence. En route, we have organized thirteen available vectorization methods into five categories in Section 2 and provided a web application which will allow others to conduct their own experiments with these methods.
One possible conclusion that may be drawn from the results of Section 4 is that we can dispense with sophisticated vectorization techniques and use only (some variant of) Persistence Statistics. We do not necessarily suggest such a course of action. While it is certainly true that Persistence Statistics earned top honors in all of our experiments and is much faster to compute than the alternatives, there are other factors to consider. In particular, no comparative study such as ours can be truly exhaustive. There is always the chance that making different choices - for instance, using another dataset for classification, or adding some new polynomials to one of the algebraic vectorizations - could dramatically update our priors about which methods perform best.

Figure 1. An increasing family of cell complexes built around a point cloud dataset; the associated barcode in dimensions 0 (blue) and 1 (red) catalogues the connected components and cycles respectively.

2.1. Statistical Vectorizations. The first and simplest category of vectorizations considered in this paper are generated from basic statistical quantities associated to the given barcode. Variants of the following vectorization have been defined and used on several occasions - see for instance [5, Sec 2.3], [23, Sec 6.2.1] and [49, Sec 4.1.1].

Definition 2.1. The persistence statistics vector of µ : B → Z >0 consists of:
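The full list of statistics in Definition 2.1 is not reproduced in this excerpt. As a representative sketch, the following computes standard summary statistics of the births, deaths, midpoints and lifespans of a barcode; the particular selection below is an illustrative approximation of ours, not the exact vector of Definition 2.1.

```python
from statistics import mean, median, pstdev

def persistence_statistics(barcode):
    """Summary statistics of a barcode, as a flat feature vector.

    `barcode` is a list of (birth, death) pairs. The selection below
    (mean/st.dev./median of births, deaths, midpoints and lifespans,
    plus the bar count) illustrates the flavour of Definition 2.1.
    """
    births = [b for b, d in barcode]
    deaths = [d for b, d in barcode]
    mids = [(b + d) / 2 for b, d in barcode]
    lifes = [d - b for b, d in barcode]
    vec = []
    for xs in (births, deaths, mids, lifes):
        vec += [mean(xs), pstdev(xs), median(xs)]
    vec.append(len(barcode))  # number of bars
    return vec
```

Note that the resulting vector has a fixed length regardless of how many bars the barcode contains, which is precisely what makes it directly usable as input to standard classifiers.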

Figure 2. Samples from the datasets used in our experiments.

Figure 3. A screenshot of the web app.

Figure 4. Intervals in barcodes of dimensions 0 and 1 as displayed by the web app.

Figure 5. The Persistence Statistics vectorization as shown in the web app.

Figure 6. A visualization of the Tropical Coordinates vectorization from the web app.

Figure 7. Persistence landscapes in the web app.

Figure 8. Persistence images as shown in the web app.

Figure 9. The web app visualization of template functions.

Table 4. FMNIST results. All scores were obtained with a Random Forest classifier with 100 trees.