Straightforward Working Principles Behind Modern Data Visualization Approaches

From state-of-the-art visualization algorithms, we distill six working principles which are, by hypothesis, sufficient to produce visual projections qualitatively similar to those obtained with these state-of-the-art algorithms. These working principles are presented through the geometrical reasoning of the classical Multidimensional Scaling algorithm, and their effectiveness is illustrated through a novel straightforward algorithm for data visualization. We show, using several datasets originating from various applications, that our algorithm can produce visual projections qualitatively similar to those obtained with state-of-the-art algorithms. Besides, under the same motivation of simplification, the problem of visualizing large datasets is tackled through a companion algorithm which is able to embed new input patterns.


I. INTRODUCTION
As simply stated by Kruskal [1], in 1964, Multidimensional Scaling (MDS) ''is the problem of representing n objects geometrically by n points, so that the inter-point distances correspond in some sense to experimental dissimilarities between objects.'' It is frequently assumed that the current MDS formulation was first proposed in 1952 by Torgerson [2], although previous works such as the one by M. W. Richardson, published in the Psychological Bulletin, in 1938, suggest that MDS principles predate Torgerson's paper.
Originally used to determine the dimensionality of the stimulus space from similarity analysis between stimuli, MDS quickly became an important tool for data visualization as well, for it allows 2D and 3D projection and computational visualization of high-dimensional data. More recently, the replacement of dissimilarities in MDS with geodesic distances imposed by weighted graphs, as in the Isometric Feature Mapping (ISOMAP) [3] and similar approaches, renewed the public interest in visualization tools, at a turning point when the flow of high-dimensional data was growing through the Internet, in the form of images, sounds and a myriad of behavioral signals easily acquired with hand-held devices such as mobile phones. In that scenario, ISOMAP adapted MDS to compare geodesic distance matrices, which allowed 2D or 3D projection of points lying in possibly curved nonlinear manifolds. In ISOMAP, as in the typical MDS formulation, once all pairwise distances (e.g. Euclidean, geodesic, L-metrics, Minkowski, rank-image) are computed, the choice of a convex stress cost function [1] allows the application of classical MDS efficiently, through eigenvalue decomposition of a double-centered distance matrix [4].
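The classical route mentioned above (double centering of a squared-distance matrix followed by eigenvalue decomposition) can be sketched as follows. This is a minimal NumPy illustration with variable names of our own choosing, not a full account of Torgerson's procedure:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS: project points from a matrix D of
    pairwise Euclidean distances, via eigendecomposition of the
    double-centered squared-distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double centering
    w, V = np.linalg.eigh(B)                   # eigenvalues, ascending
    idx = np.argsort(w)[::-1][:dim]            # keep the top `dim`
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# usage: distances of truly 2D points are recovered up to a rigid motion
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, dim=2)
```

When the input distances are genuinely Euclidean in `dim` dimensions, the recovered configuration reproduces them up to rotation and reflection.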
Alternatively, whenever the cost function associated with the low-dimensional projection problem is not convex, the steepest descent (or gradient) method can be used instead, through computational iterations [1], [5]. That is the same approach used in the visualization algorithm Stochastic Neighbor Embedding (SNE) [6], proposed in 2003, which can be loosely regarded as another steepest descent version of MDS. This method was further improved a few years later, becoming the state-of-the-art t-distributed Stochastic Neighbor Embedding (t-SNE, where the ''t'' comes from the Student's t-distribution) [7], which quickly gained broad notoriety among data analysts, in part because of its visually attractive results, with many examples of labeled datasets forming self-organized clusters corresponding to known labels. Moreover, t-SNE was made available in many programming languages, such as Python and Matlab, which possibly further boosted its popularization.
More recently, new visualization algorithms such as the LargeVis, an acronym used in the paper entitled Visualizing Large-scale and High-dimensional Data [8], and the Uniform Manifold Approximation and Projection (UMAP) [9] were proposed with evident inspiration in t-SNE, as they approximately follow its recipe. Indeed, besides the graph based reasoning already used in ISOMAP, they also make use of probabilities instead of distances. UMAP further includes some elements of fuzzy models in its theoretical background.
It is noteworthy that much older works based on MDS also included a sound probabilistic background, such as [1], but possibly the principal aspect shared by SNE, t-SNE, LargeVis and UMAP is the joint effect of: (a) a probabilistic perspective where distances between points are replaced either with conditional probabilities (in SNE, t-SNE and LargeVis) or with probabilistic norms (in UMAP), and (b) an imposed constraint of (almost) uniform density of projected points.
From this perspective, in this work we claim that these effects can also be obtained with straightforward MDS, under a few changes based on six geometrically explainable working principles shared by t-SNE, LargeVis and UMAP. This claim is corroborated by experimental results obtained with an intentionally simple algorithmic implementation of the six working principles. To expose these principles and test their effectiveness, this paper is laid out as follows. In Section II the MDS is reformulated in a broad perspective that allows it to connect to state-of-the-art algorithms. In Section III a new algorithm is proposed as a straightforward implementation of six highlighted working principles found in modern algorithms. We provide in Section IV an algorithm for estimating the underlying projection function which allows a straightforward coding of new points, paving the way to a method able to deal with a large amount of data. Both algorithms are tested on various datasets and their results are illustrated in Section V, which are discussed in Section VI.

II. FROM MDS TO STATE-OF-THE-ART VISUALIZATION APPROACHES
As explained in [10], given N objects, A_1, A_2, ..., A_N, such as ''variables, categories, people, social groups, ideas, physical objects, or any other'', the MDS analysis of relationships between these objects starts with the computation of pairwise similarities/dissimilarities (e.g. Euclidean distances, correlation coefficients, conditional probabilities or even psychological confusion measures). These pairwise measures, either metric or non-metric [2], are organized in an N × N matrix P, where P_{i,j} is the numeric comparison between objects A_i and A_j. Then a set Y of N real-valued D_Y-dimensional vectors y_i (i = 1, 2, ..., N) is adjusted in order to numerically reduce the discrepancy between Q and P, where Q is another matrix with elements Q_{i,j} representing the similarity/dissimilarity between y_i and y_j (not necessarily the same similarity/dissimilarity used to obtain P_{i,j}).
MDS has been used for many years in a myriad of theoretical developments and practical works on data analysis, some of them in data visualization, where the elements of Y are chosen to be 2D or 3D. In such cases, MDS analysis allows a visual inspection of the relationship among all N objects, as a consequence of the correspondence between the geometrical position of points representing elements of Y and the measures in matrix P.
Reducing the discrepancy between matrices Q and P is an optimization problem where Q is adapted through changes in y_i. Under certain constraints, this optimization problem becomes convex and can be efficiently solved as in classical MDS, through eigenvalue decomposition of double-centered versions of squared distance matrices. But for the purpose of this work, we prefer to tackle the optimization problem through iterative adaptation of the vectors y_i, where, in general, a cost function J(P, Q) guides the optimization process through its negative gradient, according to (1):

y_i ← y_i − α ∇_{y_i} J(P, Q),   (1)

where α is an arbitrary adaptation step (or learning/adaptation rate), and ∇_{y_i} J stands for the gradient vector of J with respect to y_i. In this work, the iterative formulation of MDS is referred to as gradient-optimized MDS, as opposed to classical MDS, where optimization is carried out by eigenvalue analysis.
In the specific case where the objects {A_i} are real-valued vectors x_i in R^{D_X}, D_X ∈ N+, P and Q are filled with pairwise Euclidean distances between vectors x and vectors y, respectively, and then ∇_{y_i} J(P, Q) is given as in (3):

∇_{y_i} J(P, Q) = −2 Σ_{j≠i} (P_{i,j} − Q_{i,j}) (y_i − y_j) / Q_{i,j},   (3)

for Q_{i,j} ≠ 0. In words, in each iteration vector y_i is either pushed away from or attracted by the j-th vector with strength proportional to (P_{i,j} − Q_{i,j})/Q_{i,j}, and the modulus of y_i − y_j has a multiplicative effect on this strength. Alternatively, one may note that (y_i − y_j)/||y_i − y_j||_2 is a unit vector, therefore vector y_i should move according to a resultant vector, as in (5):

y_i ← y_i + α Σ_{j≠i} W_{i,j} u_{i,j},   (5)

where W_{i,j} = P_{i,j} − Q_{i,j} stands for the weight associated to the unit vector u_{i,j}. This vectorial perspective clearly shows that all neighbors of y_i have their influence on the composition of the resultant vector determined by the difference between P_{i,j} and Q_{i,j}. For large values of D_X, this causes a well-known problem for visualization (where D_Y = 2 or 3), the crowding problem, where many points tend to be projected onto the same spot because most weights tend to cluster around similar values. As an illustration, in Fig. 1 we consider Euclidean distances between image patterns from the emblematic MNIST dataset [11]. Each image is coded as a 784D vector (28 × 28 monochromatic pixels), and we randomly selected 3000 images for this illustration. The first image is taken as x_i, and 2999 distances are computed, corresponding to the first row of P, whose values are represented in the upper plot of Fig. 1.
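The resultant-vector rule of (5) translates directly into code. Below is a minimal NumPy sketch of one adaptation step, with the learning rate and the toy data chosen arbitrarily for illustration:

```python
import numpy as np

def gradient_mds_step(Y, P, alpha=0.002, eps=1e-12):
    """One iteration of the resultant-vector rule: each y_i moves along
    the sum of unit vectors u_ij weighted by W_ij = P_ij - Q_ij, where
    Q holds the pairwise Euclidean distances of the current layout."""
    diff = Y[:, None, :] - Y[None, :, :]              # y_i - y_j
    Q = np.linalg.norm(diff, axis=-1)                 # current distances
    U = diff / (Q[:, :, None] + eps)                  # unit vectors u_ij
    W = P - Q                                         # weights W_ij
    np.fill_diagonal(W, 0.0)
    return Y + alpha * np.einsum('ij,ijk->ik', W, U)

# usage: iterate from a near-origin start until the layout settles
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
P = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = 1e-3 * rng.normal(size=(50, 2))
for _ in range(400):
    Y = gradient_mds_step(Y, P)
```

Since the targets here are raw 5D distances forced into 2D, the residual stress illustrates precisely the trade-off discussed next.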
Most of the 2999 Euclidean distances are clustered in the interval from 2000 to 3000, and even their minimum is greater than 1200. If the corresponding entries in Q are expected to represent distances in 2D or 3D, one should expect the negative gradient to adapt the set Y towards a configuration where its elements are apart from each other by similar distances, around 2500. However, in 2D or 3D, this goal cannot be satisfied, which eventually induces the crowding of points as a geometric trade-off between tensions. Similar distributions of distances are expected whichever x_i is considered instead of x_1; therefore, if (5) is applied to adapt the projections y_i in 2D or 3D, the weights W_{i,j} are not sufficiently discriminating to yield good visual projections of neighboring influences.
If instead of Euclidean distances, entries in P are replaced with carefully crafted similarity measures, such as the exponential of properly scaled and squared distances, as illustrated in the lower plot in Fig. 1, the crowding of weights can be avoided. For instance, for the MNIST dataset, and again for the first row of P, the division of all N distances by 200 yields about 36 non-negligible entries. Unfortunately, because the density of points is rarely the same everywhere, the same scaling factor may not be suitable for all rows of P.
The use of Gaussian functions to replace Euclidean distances suggests a probabilistic reasoning where P_{i,j} can be regarded as the conditional probability of picking x_j as the next sample, given that the current sample is x_i. This was indeed the probabilistic framework used in [6] by Hinton and Roweis to propose the SNE. Besides, although SNE is not presented as a case of MDS, the application of the following three changes to MDS helps to palliate the crowding problem, while it also makes MDS more similar to SNE:
C1 The adaptation of a specific scaling factor for each row of P, thus yielding a constant effective number of relevant values per row. In [6] this number is referred to as Perplexity.
C2 The normalization of both P and Q. More precisely, both constraints ||P||_1 = 1 and ||Q||_1 = 1 are imposed, where ||·||_1 stands for the entry-wise L1 norm.
C3 Change (C2) allows the use of the Kullback-Leibler divergence instead of the Euclidean distance as an improved criterion. Indeed, this divergence takes into account the restricted matrix manifold where P and Q are to be found.
A further fourth change (C4) was added to SNE in 2008 [7], when the Student's t-distribution replaced the Gaussian distribution in the construction of Q, whereas the Gaussian remained unchanged for the construction of P. This last change yielded the t-SNE, which was shown to reduce the crowding effect even more.
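Change (C1) amounts to a per-row, one-dimensional search for the scaling factor. The sketch below calibrates the scale so that a target number of entries survives a 1% threshold; this count-based target is our simplification of the entropy-based Perplexity search used in [6]:

```python
import numpy as np

def row_scale_for_count(d_row, target=36, lo=1e-3, hi=1e6, iters=60):
    """Binary search for a per-row scale factor sigma such that the
    similarities exp(-(d/sigma)^2) keep roughly `target` entries above
    1% of the row maximum (a count-based stand-in for Perplexity)."""
    for _ in range(iters):
        sigma = 0.5 * (lo + hi)
        sims = np.exp(-(d_row / sigma) ** 2)
        kept = int(np.sum(sims > 0.01 * sims.max()))
        if kept < target:
            lo = sigma   # too few survivors: widen the kernel
        else:
            hi = sigma   # enough survivors: try a narrower kernel
    return 0.5 * (lo + hi)

# usage: calibrate one row of simulated MNIST-like distances
rng = np.random.default_rng(0)
d_row = rng.uniform(1200.0, 3000.0, size=2999)
sigma_1 = row_scale_for_count(d_row, target=36)
```

The number of survivors is non-decreasing in the scale, so the bisection converges; running the search independently per row implements the row-specific scaling of (C1).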
The remarkable success of t-SNE was followed by the proposal of similar approaches, such as LargeVis [8] and UMAP [9]. In LargeVis, the prohibitive practical cost of dealing with N × N matrices, for large N, was tackled through the use of efficient methods for finding near neighbors, along with the random sampling of far neighbors. As for changes (C1) to (C3), they were also used in LargeVis, although the Kullback-Leibler divergence was replaced with a likelihood function with a similar effect. Only C4 was slightly disregarded in LargeVis, as some other probability density functions (PDF) besides the Student's t were included.
In UMAP, P and Q are not computed with Gaussian and Student's t distributions, but with a parametrized negative exponential of distances, and a double-parametrized generalization of the Student's t-distribution, respectively, in a fuzzy set-theoretic framework. But in spite of their specificities, most essential elements of t-SNE and LargeVis find equivalences in UMAP, as succinctly presented in Appendix C of [9]. For instance, UMAP makes use of approximate nearest neighbor search, and of stochastic gradient descent with negative sampling for optimization, as in LargeVis, and the cardinality of the fuzzy set of 1-simplices, in fuzzy jargon, plays the same role as the Perplexity parameter in t-SNE.
In summary, t-SNE, LargeVis and UMAP share the following working principles (WP):
WP1 A limited number of near neighbors is found for each x_i. This corresponds to an underlying presupposition that the neighboring points lie in a locally continuous manifold.
WP2 Pairwise distances between x_i and x_j, i, j ∈ {1, 2, ..., N}, are computed. UMAP does not require this distance to be Euclidean but, in most experimental results from all techniques, the Euclidean distance is used. This suggests an underlying presupposition that the neighboring points lie in a locally (almost) linear manifold.
WP3 Pairwise distances are either shrunk or expanded by a local scale factor, σ_i, so that a similarity measure is obtained (e.g. exp(−(d_{i,j}/σ_i)²)). As illustrated in Fig. 1, for the MNIST dataset, σ_1 = 200 is a local scale factor for x_1 that retains only 36 near neighbors with similarities above 1% of the maximum, and 14 near neighbors with similarities above 5%. As expected, according to (C1), for this same scale factor the corresponding Perplexity [6] is about 25, thus in the same range. This local distance scaling yields non-symmetrical measures, as illustrated in Fig. 2, where points within two regions with discrepant densities highlight the need for a symmetrization strategy. The detail in Fig. 2 shows that the Euclidean distance d_{i,j} is differently scaled around points x_i and x_j, with scale factors σ_i and σ_j, respectively. This induces a density equalization effect, also illustrated, where the resulting average density is controlled by the Perplexity parameter, in t-SNE. On the flip side, the symmetry requirement for a metric to be a distance is violated, as in general p_{j|i} ≠ p_{i|j}. To enforce symmetry, in t-SNE pairwise similarities are set to p_{i,j} = (p_{i|j} + p_{j|i})/(2N). The local scaling of distances yields density equalization of uniformly distributed regions of the space, whereas irregularly distributed regions, such as between-cluster gaps and between-density boundaries, cannot be properly handled to yield an equalized density, as illustrated in Fig. 2. Therefore, beyond the intended dimension reduction, when D_X > D_Y, two remarkable effects are observed in t-SNE, namely density equalization and density-sensitive clustering, as highlighted in Fig. 2. Note that in this illustration there is no dimension reduction, as D_X = D_Y = 2, but although the original dataset has no remarkable gap, the projected data points have a distinguishable one, d_proj, resulting from the projection of cross-density distances, whereas most points are packed in almost uniform-density clusters. These visually attractive effects can be roughly induced by the flagging of K near neighbors (KNN) for each data point (as in WP1), followed by the normalization of the volume occupied by these KNN, thus inducing local space shrinking or expansion. This raw simplification of WP3 is used in Section III.
FIGURE 2. Illustration of density-sensitive clustering and density equalization yielded by the t-SNE application to a set of 2D points with two densities of points. The between-density boundary is projected in a between-cluster gap, whereas cluster densities are equalized. Remark: no dimension reduction in this illustration, for D_X = D_Y = 2.
WP4 Matrix P = {p_{i,j}} is filled with symmetrized similarities between locally scaled pairwise distances in the input dataset, X, where similarities are obtained through a given Radial Basis Function (RBF) f_X : R^{D_X} → R+, and elements of matrix Q are obtained as similarities between instances of the projected low-dimensional dataset, Y, through another RBF f_Y : R^{D_Y} → R+.
WP5 Symmetric matrices P and Q are compared, and projected points are adjusted according to rules similar to (5). The specificity of each iteration rule depends on the choice of functions f_X and f_Y, and the criterion J.
WP6 For better visual results, the influence of points too far from each other in the projected space can be damped. This damping effect is presented as the advantage of t-SNE over SNE, as a result of the mismatch between an RBF f_X given by a Gaussian PDF, and another RBF f_Y corresponding to the Student's t PDF.
More specifically, Equation 5 in [7] can be rewritten with the notation used in (5) as:

y_i ← y_i + α Σ_{j≠i} W_{i,j} D_{i,j} u_{i,j},

where D_{i,j} = ||y_i − y_j||_2 / (1 + ||y_i − y_j||_2²) plays the role of a damping factor for distances ||y_i − y_j||_2 either near zero or much greater than 1. In Section III, we test the effectiveness of these WP by implementing them as simply as possible, so that, if they are indeed the main engines behind state-of-the-art approaches, similar experimental results are expected from our simplified alternative.
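In our reading of that equation, the pairwise influence is scaled by a factor of the form r/(1 + r²), with r = ||y_i − y_j||_2; the short check below confirms that this factor vanishes both for r near zero and for r much greater than 1, peaking at r = 1:

```python
import numpy as np

# Damping profile D(r) = r / (1 + r^2): it suppresses pairs that are
# either almost coincident (r ~ 0) or far apart (r >> 1) in the
# projected space, with a maximum of 1/2 at r = 1.
r = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
damp = r / (1.0 + r ** 2)
```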

III. PROPOSED STRAIGHTFORWARD VISUALIZATION ALGORITHM
The Straightforward Visualization Algorithm (SVA), as presented in Algorithm 1, iteratively adjusts the projection of all N vectors in X into corresponding vectors in Y; thus, it looks for a projection Y = sva(X; K, f_Y), where the parameter K represents the arbitrary number of near neighbors (e.g. K = 40) and f_Y(·) is an arbitrary RBF.
In step 1, N(N − 1)/2 squared Euclidean distances are computed in R^{D_X}, thus requiring D_X scalar multiplications per distance. Likewise, in step 6.1, N(N − 1)/2 squared Euclidean distances are computed in R^{D_Y}, where D_Y is set to 2 or 3. Therefore, for a fixed number of iterations, the SVA is O(N²D_Y). This is also the case for most state-of-the-art algorithms, and complexity reduction has been addressed since t-SNE was first proposed [7]. Besides, computational burden reduction was the main motivation behind LargeVis [8], and although it is beyond the scope of this work, most of the techniques mentioned there and in the references therein are also applicable to SVA.
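The steps of Algorithm 1 can be condensed into a short NumPy sketch. This is a simplified rendition for illustration, not the authors' reference code; the inverse quadratic is chosen here as f_Y, and the parameter values follow the text:

```python
import numpy as np

def sva(X, K=40, f_Y=lambda r: 1.0 / (1.0 + r ** 2),
        iters=2000, R_eta=2.0, eta=0.1, seed=0):
    """Condensed rendition of the SVA steps described in the text."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # steps 1-2: pairwise distances and symmetric KNN flag matrix P
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dX, np.inf)               # x_i is not its own neighbor
    P = np.zeros((N, N))
    knn = np.argsort(dX, axis=1)[:, :K]
    P[np.repeat(np.arange(N), K), knn.ravel()] = 1.0
    P = np.maximum(P, P.T)                     # simultaneous flags P_ij = P_ji
    P /= P.sum()                               # step 3: entry-wise l1 normalization
    Y = 1e-4 * rng.normal(size=(N, 2))         # step 4: tiny random 2D start
    alpha = float(N)                           # step 5: learning rate around N
    for _ in range(iters):                     # step 6
        diff = Y[:, None, :] - Y[None, :, :]
        r = np.linalg.norm(diff, axis=-1)
        Q = f_Y(r)                             # step 6.1
        np.fill_diagonal(Q, 0.0)
        Q /= Q.sum()                           # step 6.2
        U = diff / (r[:, :, None] + 1e-12)     # unit vectors u_ij
        beta = np.where(r <= R_eta, 1.0, eta)  # damping beyond radius R_eta
        Y -= alpha * np.einsum('ij,ijk->ik', (P - Q) * beta, U)  # step 6.3
    return Y

# usage: a quick run on two well-separated 5D clusters
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(5, 1, (30, 5))])
Y_demo = sva(X_demo, K=10, iters=200)
```

Note the minus sign before alpha in step 6.3, which reflects the replacement of distances with similarities, as discussed in the text.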

Algorithm 1 (SVA):
1. Compute all N(N − 1)/2 pairwise Euclidean distances between x_i and x_j, j ≠ i.
2. Find the subset of K near neighbors (KNN) of each x_i, and set P_{i,j} = P_{j,i} = 1 if j is in this subset, and P_{i,j} = 0 otherwise. Therefore, each row of P plays the role of a vector of flags, indicating where the KNN are. Note that P remains symmetric, thanks to the simultaneous setting of flags at P_{i,j} and P_{j,i}. As a consequence, each row of P may have a few more hotspots (ones) than K. The diagonal of P is kept null, since x_i is not regarded as a neighbor of itself.
3. Normalize matrix P as: P ← P/||P||_1. Unlike t-SNE and other approaches based on probabilistic reasoning, this normalization is not mandatory, but it has a suitable consequence in terms of algorithmic convergence, as the matrix space is restricted to a unit-norm matrix manifold.
4. Randomly initialize a 2D or 3D set (i.e. D_Y is either 2 or 3) of N real-valued vectors, Y = {y_1, y_2, ..., y_N}, typically with very small values.
5. Set a learning rate, α (around N, to compensate for the matrix normalization), a damping radius, R_η, and a damping factor, η. We successfully experimented with values of R_η from 1.5 to 3, and a fixed η = 0.1.
6. Iterate the following steps until some stopping criterion is reached (in our experiments, we used a maximum number of 2000 iterations as the stopping criterion, as indicated in Sec. V).
6.1. Find Q_{i,j} = f_Y(||y_i − y_j||_2).
6.2. Set the diagonal of Q to zero and project it onto the same matrix space where P is, through the following attribution: Q ← Q/||Q||_1.
6.3. For every pair of vectors in Y, adapt y_i according to: y_i ← y_i − α Σ_{j≠i} β_j W_{i,j} u_{i,j}, where, as in (5), W_{i,j} = P_{i,j} − Q_{i,j} is the weight associated to the unit vector u_{i,j} = (y_i − y_j)/||y_i − y_j||_2, and either β_j = 1, for ||y_i − y_j||_2 ≤ R_η, or β_j = η, otherwise. These iteration steps are similar to (5), apart from the minus sign before α, which reflects the replacement of distances with similarity measures.
7. Return Y.

The SVA has a computational burden dominated by two steps, namely: step 1, outside the iteration loop, and step 6.1, inside the loop, which is expected to be the most relevant in terms of execution time.

IV. VISUALIZATION OF NEW DATA
The projection of N given high-dimensional data points into 2D or 3D for visualization purposes, through the approaches considered in this work, is a dimension reduction obtained through complicated space contraction/expansion around each of the N observations. This projection has interesting properties in terms of density equalization and clustering, and it can be useful to represent this mapping as a function g_0 : R^{D_X} → R^{D_Y}, whose approximation y = g(x) can be learned from X and Y, after SVA reaches its stopping criterion.
To obtain an approximation g as straightforward as SVA, a data-driven piecewise-linear approximation is proposed in Algorithm 2. It is important to highlight that this piecewise-linear projector assumes that, with the Euclidean distance, finding M near neighbors in Y is more trustworthy than in X, for elements of Y are typically represented in a much lower dimension than their counterparts in X. Besides, points in Y tend to be density-equalized, as illustrated in Fig. 2. Therefore, in Algorithm 2, except for step 1, near neighbors of elements in X are always found through Euclidean distances between corresponding (projected) elements of Y, which is an original aspect of this piecewise-linear interpolator. All distances in Algorithm 2 are Euclidean. Its final steps are:
4. Compute the weights w_i from the M distances, where d_min is the minimum among the M distances, and a small positive real number prevents division by zero.
5. Normalize weights: w_i ← w_i / Σ_{j=1}^{M} w_j.
6. Return the projected vector y_new = Σ_{j=1}^{M} w_j y_j, where y_j stands for the j-th near neighbor found in step 2.
The motivation for having an approximation of g_0 is twofold: first, it allows the visualization of new incoming data without any projection re-adaptation; thus, g plays the role of a data compressor, or an encoder. Besides, because the current version of the SVA is not adapted to large datasets (naive manipulation of the N × N matrices P and Q may become prohibitive), Algorithm 2 can also be used to tackle large datasets, by applying SVA to a small subsample and then encoding all remaining data with g.
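Only the final steps of Algorithm 2 are reproduced in the text above, so the following sketch is our reconstruction: the X-space query in step 1, the step-3 distances, and the exact inverse-distance weight form are assumptions (marked as such in the comments), while steps 5 and 6 follow the text:

```python
import numpy as np

def encode(x_new, X, Y, M=5, eps=1e-9):
    """Piecewise-linear projection of a new point (a reconstruction of
    Algorithm 2; steps marked 'assumed' are our own choices)."""
    # step 1 (assumed): anchor on the nearest stored point, in X-space
    i0 = int(np.argmin(np.linalg.norm(X - x_new, axis=1)))
    # step 2: M near neighbors, found through distances in Y-space
    nbrs = np.argsort(np.linalg.norm(Y - Y[i0], axis=1))[:M]
    # step 3 (assumed): X-space distances from x_new to those neighbors
    d = np.linalg.norm(X[nbrs] - x_new, axis=1)
    # step 4 (assumed form): inverse-distance weights; d.min() is d_min
    # and eps prevents division by zero
    w = 1.0 / (d - d.min() + eps)
    w /= w.sum()                    # step 5: normalize weights
    return w @ Y[nbrs]              # step 6: weighted average of neighbors
```

With this weight form, encoding a point that already belongs to X returns (up to numerical precision) its stored projection, since its own distance dominates the inverse-distance weights.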

V. EXPERIMENTAL RESULTS
Most experimental results presented here are visual evidence that, with an adequate choice of parameters, the SVA, which is a simple implementation of the WP listed in Section II, yields results visually similar to those obtained with t-SNE (which is itself a baseline for LargeVis and UMAP, as presented in [8] and [9], respectively). We start by using two publicly available datasets: MNIST [11], comprising 28 × 28 grayscale 10-class handwritten digits, and Fashion-MNIST [12], a more challenging dataset than MNIST in terms of classification, although it is also composed of 28 × 28 grayscale images split into 10 classes. In Fashion-MNIST each class corresponds to a fashion product category. Both databases have two non-overlapping subsets, one labeled ''training'', with 60,000 images, and another labeled ''test'', with 10,000 images. All experiments with MNIST and Fashion-MNIST used a constant step α = N, over 2000 iteration cycles. Of course, more elaborate step adaptation strategies would noticeably improve results and avoid numerical instabilities, and should be considered in practical applications of SVA. However, these additional adaptation strategies would mask the similarities between results that we want to highlight in this work.
We experimented with the negative quadratic exponential, f_{E2}(r) = exp(−r²), successfully used in the SNE; the inverse quadratic, f_{T2}(r) = 1/(1 + r²), used in t-SNE and LargeVis; and the parametrized RBF f_{UMAP}(r) = 1/(1 + a r^{2b}), with a = 1.929 and b = 0.7915, which is used in UMAP and can be regarded as a modified version of f_{T2}(r). It is noteworthy that in SVA there is no probabilistic reasoning behind the choice of f_Y, therefore the choice is not limited to a valid PDF.
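The three candidates read, in code (the UMAP constants a and b are the values quoted above):

```python
import numpy as np

# The three RBF candidates for f_Y, written out explicitly.
f_E2 = lambda r: np.exp(-r ** 2)           # negative quadratic exponential (SNE)
f_T2 = lambda r: 1.0 / (1.0 + r ** 2)      # inverse quadratic (t-SNE, LargeVis)
a, b = 1.929, 0.7915                       # UMAP constants quoted in the text
f_UMAP = lambda r: 1.0 / (1.0 + a * r ** (2 * b))
```

All three equal 1 at r = 0 and decay monotonically, which is the only property SVA actually requires of f_Y.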
As with the standard version of t-SNE (see Section 5 of [7]), the current version of SVA is not adapted to large datasets. Therefore, N = 3000 images were randomly drawn from each test dataset, and the same set, under the same initialization of Y, was used in all experiments in this section, to yield a better visual comparison of results. Fig. 3 is to be compared to the 2D t-SNE projection shown in Fig. 4, with the perplexity parameter set to 40, whereas Fig. 5 and Fig. 6 are to be compared to the 2D t-SNE projection shown in Fig. 7, also with the perplexity parameter set to 40.
Regarding Algorithm 2, Fig. 8 illustrates its use, where just 3000 images sampled from the ''test'' MNIST dataset, along with their projections, were used as the parameters X and Y of g. Then, all 60,000 new images from the ''training'' dataset were projected (without further adaptation of the visualization projection). The experimental results presented in this paper were chosen as visually representative of the experiments done so far. Some more results, along with suggested implementations of the SVA and the encoder g in some usual computer languages, can be found as supplemental material posted on IEEE Xplore: https://ieee-dataport.org/documents/supplementary-material-paperstraightforward-working-principles-behind-modern-data
Visual evaluation is obviously the most usual approach for comparisons between visualization algorithms, insofar as visualization experiments are typically concerned with subjective (visual) aspects of class and cluster dispersion, hardly replaced with any objective index. One may even conjecture that visualization algorithms are popular because there is not yet an objective index capable of replacing human cognition.
Nevertheless, by considering that the goal of all visualization algorithms considered in this work (including the SVA) is to preserve as much as possible the local neighboring structure of points before and after projection, and knowing that this can be a very difficult goal for points lying in manifolds with local dimensions much higher than 2 or 3, we crafted a simple index, namely the Near-Neighbors Coincidence Rate (NNCR), which is computed as in (9):

C = (1/(N V)) Σ_{n=1}^{N} |X_n ∩ Y_n|,   (9)

where V is an index parameter representing the number of near neighbors to be considered, X_n stands for the subset of X whose elements are the V near neighbors of x_n, Y_n stands for the subset of Y whose elements are the V near neighbors of y_n, and |X_n ∩ Y_n| is the cardinality of the intersection set. Thus, if most of the V near neighbors of each point are preserved after the projection of X into Y, C is expected to yield values close to one. By contrast, near-zero values of C indicate disruption of local neighboring structures.
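The NNCR can be computed directly from the two neighbor lists; below is a direct NumPy transcription of (9), averaging |X_n ∩ Y_n|/V over all points (the sets are populated with indices rather than with the points themselves):

```python
import numpy as np

def nncr(X, Y, V=10):
    """Near-Neighbors Coincidence Rate: average fraction of the V
    nearest neighbors of each point that survive the projection."""
    def knn_sets(Z):
        d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
        return [set(np.argsort(row)[:V]) for row in d]
    nx, ny = knn_sets(X), knn_sets(Y)
    return np.mean([len(a & b) / V for a, b in zip(nx, ny)])
```

By construction, an identity projection scores 1.0, and an unrelated random projection scores close to V/(N − 1).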
To test this new measure, we consider four public datasets whose points represent very diverse signaling phenomena. To allow a better comparison of results, we used a constant parametrization of the methods across all of the following experiments: points were always projected into R², and the number of iterations for t-SNE, SVA(a) and SVA(b) was set to 1000. Fig. 9 illustrates one projection with each algorithm of the emblematic Iris dataset used in [13], with 150 4D vectors numerically representing sepal and petal measurements (length and width) for flowers from 3 species, namely Iris setosa, Iris versicolor and Iris virginica. Thus, data points were labeled here with Setosa, Versicolor or Virginica.
For the second set of comparative experiments, we took the recently published dataset explained in [14], here referred to as Meat volatiles, where an array of 10 sensors (8 metal-oxide gas sensors plus temperature and humidity sensors), i.e. an e-nose, was used to acquire multivariate signals through time from 7 controlled mixtures of beef and pork (always 100 g of fresh ground meat per acquisition session of 120 s). In this dataset, only 60 instances of each mixture are available, but each instance corresponds to a sequence of 60 10D measurement vectors taken every 2 seconds. Thus, from this dataset we randomly drew 3000 measurement vectors and projected them from 10D into 2D, as illustrated in Fig. 10.
A fine analysis of each dataset is beyond the scope of this paper, where the comparison between 2D projections across methods is the main concern. However, it is worth noticing that all projections in Fig. 10 seem to present the same inconsistency, namely that similar proportions of beef and pork are not projected into near clusters. As for this matter, one should be aware that, for e-noses based on metal-oxide sensors, robust feature extraction from raw signals is still a relevant research subject. In any case, in spite of these apparent raw-signal inconsistencies, all four projections in Fig. 10 are in agreement with each other. Fig. 11 illustrates one projection per algorithm for the dataset here referred to as Newsgroups (also labeled ''20 Newsgroups'', and publicly available at https://cs.nyu.edu/roweis/data.html), where each of 16,242 postings (texts in natural language) is encoded as a 100D binary vector, in which the occurrence of 100 relevant words (e.g. cancer, baseball, car, children) is flagged with 1. Each binary vector is labeled with comp.*, rec.*, sci.* or talk.*, corresponding to the group name in which it was posted. For the projections presented in Fig. 11, 3000 vectors were randomly drawn along with their corresponding labels.
Finally, a public dataset of male and female clean-speech utterances in Brazilian Portuguese [15] was used to yield 19 Mel-Frequency Cepstral Coefficients (MFCC) per short speech frame of 25 ms, taken from all utterances from all speakers. The first element of each MFCC vector was systematically discarded (for it does not carry relevant acoustic information), thus yielding around 200,000 18D MFCC vectors, from which only 3000 were randomly drawn and labeled according to the corresponding speaker gender. In this paper, this dataset is referred to as Speaker gender, and its 2D projection is presented in Fig. 12.
The NNCR values for all 16 projections (4 per dataset) are gathered in Table 1. Compared to the corresponding visual aspects, the NNCR seems to yield meaningful comparative values. For instance, for the Iris and Meat volatiles datasets, most clusters concentrate unmixed classes, and NNCR values above 0.7 confirm that, on average, more than 70% of the local structures in the corresponding original datasets were preserved. This further suggests that the underlying manifolds in the Iris and Meat volatiles datasets are more easily projected into 2D than the ones in the Newsgroups and Speaker gender datasets, where less than 45% of the local neighboring structures were preserved.
From the standpoint proposed in this paper, what is perhaps more important than measuring the difficulty of keeping local neighboring structures is to notice that, for each dataset, the NNCR also yields an objective index for comparing algorithm projections. Indeed, as much as the visual qualitative comparisons, this quantitative measure seems to confirm that t-SNE, UMAP and the two versions of SVA are almost equivalent in projecting high-dimensional data into 2D.

VI. CONCLUSION
State-of-the-art visualization approaches are based on elaborated probabilistic and fuzzy models. By contrast, in this work we assume that a few simple working principles would be sufficient to yield similar results, which was corroborated by experiments done with an algorithm where these principles were straightforwardly implemented. This algorithm was applied to several public datasets corresponding to various application domains.
The proposed reduction to simple principles has a first useful aspect in terms of potentially boosting new developments in visualization tools, because the simplicity of the six listed working principles allows for new contributions and improvements from researchers with a broad range of backgrounds. For instance, we observed that the choice of an RBF, f_Y, is not restricted to probability distributions, and that knowledge of the exact cost function (and its gradient) is not imperative for a visualization result qualitatively similar to that yielded by t-SNE. These simplifications allow experimentation with a virtually unlimited set of RBFs, which can be heuristically selected and easily tested for suitable (subjective) visualization effects, without the need for (potentially laborious) algebraic manipulations of the cost function gradient.
Besides, the replacement of the Perplexity, in t-SNE, with a simpler parameter K representing a fixed number of near neighbors also yielded a simple piecewise-linear encoder, which was used for the projection of new incoming observations, after a visualization projection was adjusted. This is a useful companion algorithm for SVA, for it allows the visualization of an unlimited amount of data, whereas SVA itself is kept simple in terms of implementation. Indeed, the usual representation of the N × N matrices P and Q is a limiting aspect that can be tackled, for instance, with efficient KNN graph construction (see [8] and references therein). On the other hand, the approach to the same problem implemented in this work was to split the task into two parts: first, a small subsample of N points is projected with SVA (e.g. N = 3000); then the encoder g takes the N projected points as parameters and is ready to project any amount of new incoming data. We believe that this choice is algorithmically simpler than modifying the SVA to cope with large datasets. Moreover, it allows for applications beyond data visualization, in projected dimensions higher than 3D, where SVA and g can be jointly used to yield auto-encoding structures, as a matter for the follow-up of this work.