Toolbox for Distance Estimation and Cluster Validation on Data With Missing Values

Missing data are unavoidable in the real-world application of unsupervised machine learning, and their nonoptimal processing may decrease the quality of data-driven models. Imputation is a common remedy for missing values, but directly estimating expected distances have also emerged. Because treatment of missing values is rarely considered in clustering related tasks and distance metrics have a central role both in clustering and cluster validation, we developed a new toolbox that provides a wide range of algorithms for data preprocessing, distance estimation, clustering, and cluster validation in the presence of missing values. All these are core elements in any comprehensive cluster analysis methodology. We describe the methodological background of the implemented algorithms and present multiple illustrations of their use. The experiments include validating distance estimation methods against selected reference methods and demonstrating the performance of internal cluster validation indices. The experimental results demonstrate the general usability of the toolbox for the straightforward realization of alternate data processing pipelines. Source code, data sets, results, and example macros are available on GitHub. https://github.com/markoniem/nanclustering_toolbox


I. INTRODUCTION
In many machine learning tasks, the volume of data is limited, necessitating that all the available data values be utilized as extensively as possible.The assumption that the data is complete is often invalid in real-world applications [1].A simple strategy for avoiding the problem of missing data is to omit incomplete observations.However, this is not an efficient use of data because the important information may be lost.A more sophisticated strategy is to impute missing values as part of a data preprocessing step.Different imputation mechanisms have been developed for various data types, e.g., binary, ordinal, categorical, and string attributes [2].The nearest neighbors method is a common imputation approach for numerical values, which uses an average (with or without weights) of the k-nearest neighbors [2].
Estimating distances is an alternative way to address problems with missing values.A well-known distance estimation method is the partial distance strategy (PDS) [3], which is The associate editor coordinating the review of this manuscript and approving it for publication was Xi Peng .also known as a general similarity measure [4].This approach involves similar limitations as the nearest neighbors method so that its accuracy is highly correlated to the number of missing values in data.In [5] and in [6], the expected distance estimations were reported to be more accurate than the data imputation or the PDS for selected real-world data sets.However, the performance of these methods has not been tested in unsupervised machine learning tasks such as data clustering.Clustering can benefit from accurate distance estimation with missing values because both currently popular initialization methods like K-means++ [7] and the computation of cluster centroids are based on distances and not on observations themselves.In [5] and in [6], data values were assumed to be missing at random (MAR), where missingness may depend on the available data.MAR is a less restrictive mechanism than missing completely at random (MCAR) in which the values are missing independently of any other data values.
Many unsupervised and supervised techniques, and their combinations, have been used for data imputation.Imputation of missing sensor spatially and/or temporally dependent data using autoencoder and alternation projection onto convex sets based training was proposed in [8].Shallow neural networks (both multi-layered perceptron and radial basis function) with genetic algorithms were put forward in [9].Fuzzy clustering and support vector regression, also with a genetic algorithmbased parameter estimation, were hybridized in [10].Decision trees and their ensembles were applied in [11].More recently, deep learning methods, especially deep autoencoders, have been proposed and tested for mainly spatiotemporal data, e.g., in [12]- [18].We do not address these more complex techniques here because of laborious tuning of an extensive number of metalevel parameters (e.g., what network architecture, how many layers, what kind of layers, which loss function, what training method, how much and what kind of data needed, etc., see [13], [19]).
Cluster analysis is often considered as one of the core techniques in descriptive data mining and knowledge discovery [20], statistics [21], and pattern recognition [22].It is a stepwise process with at least nine elements to be chosen/carried out before achieving the results [23]- [25].The elements are related to data selection, data preprocessing, selection of distance measure, choice of clustering criterion, selection missing data strategy, validation of the created algorithms, selection of the number of clusters, and finally, interpretation of results.
Clustering divides data into disjoint groups (clusters) where an ideal cluster is compact and isolated [24].Partitional clustering methods use prototype points to represent clusters and, therefore, are also referred to as prototype-based clustering methods [26].The methods are aimed to minimize the variance around the prototype points based on an error (score) function, and they are also called variance minimization techniques [27].The iterative relocation procedure decreases the values of the error function until final convergence is reached [28], [29].
Cluster validation is a crucial part of cluster analysis, in which a clustering solution's quality (ideality) is being assessed.Cluster validation indices (CVIs) provide quality measures that indicate the number of clusters.The three main types of indices are relative, external, and internal [30].The relative index compares multiple clustering results obtained with different initial settings of the clustering algorithm, whereas an external index utilizes additional information or metadata that can explain the number or form of the cluster structures.The external indices can be used, e.g., for comparing different clustering methods using the metadata of the actual cluster labels.However, internal cluster validation indices are probably the most commonly used estimates because they utilize only the information obtainable from data and clustering results.Numerous different clustering methods, including internal cluster validation indices, have been developed because of the high diversity of data [24]; for example, challenging data sets may include noise, overlapped clusters, multiple dimensions, and/or different densities [31].
This paper introduces a toolbox that encapsulates many methods and algorithms to perform cluster analysis in the presence of missing data.The versatile functionality allows a toolbox user to generate many forms of experimental settings and to realize various forms of new experiments to better understand and improve unsupervised learning with missing values.
The methodological bases in Sections II-V explain background theory related to distance computation with missing values, data preprocessing, clustering, and cluster validation.Section VI gives an overview of the toolbox, including descriptions of the sample data sets and essential toolbox functions.Section VII describes experiments that are divided into three parts.In the first part, the performance of distance estimation algorithms is measured in the direct estimation of pairwise distances in data sets with missing values.The second part compares clustering methods and cluster validation indices on two-dimensional (2D) data sets with missing values.In addition, the validation results, which are based on a key point selection function [32], are given.The results are validated against the reference results given in previously published research papers.In the third part, experiments are conducted on multidimensional data sets that were created by a recently published data generator [33].Finally, the content and the toolbox performance are discussed and summarized in Sections VIII-IX.

II. COMPUTATION OF DISTANCES WITH MISSING VALUES
Let X = {x i } N i=1 , where x i ∈ R n for all i, denote the observational data set with N observations of size n.

A. AVAILABLE DATA STRATEGY
The available data strategy (ADS) (see [34]) restricts distance computations to available values via binary projection vectors, {p i } N i=1 , p i ∈ {0, 1} n , which represent the sparsity pattern of each observation: The ADS is used in K-spatialmedians clustering (see, e.g.[35]), and it generalizes easily for various distance measures.For instance, the Euclidean distance between two incomplete n-dimensional column vectors x 1 and x 2 is defined as

B. PARTIAL DISTANCE STRATEGY
The PDS computes the sum of pairwise available vector values and scales the sum by the ratio of the original dimension of the vectors and the number of available pairwise values [4].The Euclidean distance reads then as where n * is the number of pairwise known values.Similarly to the ADS, the PDS can be generalized to other distance measures, such as the City block distance (see [36]).

C. EXPECTED SQUARED EUCLIDEAN DISTANCE
The framework for estimating the expected distance between two data vectors is presented in [5].The proposed framework was designed for estimating squared Euclidean distances in the presence of missing data values and under the assumption of multivariate normal distribution.The assumption of multivariate normally distributed data is used for estimating expected values to replace the missing values in the data.
The central limit theorem states that normal distribution can be used to approximate nearly any continuous distribution with a sufficiently large sample (see, e.g., [37]).The basic elements of the framework are given in Appendix A, and a more detailed description is given in [5].
Let us define the index sets of missing M i and available A i values of observation x i as specified by p i , i.e., M i = {1 ≤ j ≤ n|(p i ) j = 0} and A i = {1 ≤ j ≤ n|(p i ) j = 1}.Following the assumption that missing values are generated from conditional multivariate normal distribution, in which data values are independent, and missing values depend on the available values under the MAR assumption on the sparsity pattern, the expectation of the squared distance between two data vectors reads as: With the complete derivation given in Appendix B, the ith observation concerning the missing values is normally distributed with the mean vector and covariance matrix Estimating µ and for incomplete data is not a simple task, especially if the number of missing values is large compared to the number of available ones.A method based on available data is a fast alternative for estimating the covariance matrix [38].However, the iterative expectation maximization (EM) algorithm with the maximum negative log-likelihood convergence criterion is more commonly used, e.g., in [5], [39], and [6].

1) EXPECTATION MAXIMIZATION
The EM is an iterative method to find the best estimates for the parameters in a statistical model [40], [41].It consists of two alternating steps: expectation and maximization.
The expectation step estimates the missing values in the data set.The maximization step optimizes the model parameters to fit the data best.The steps are repeated until the final convergence is reached.
The EM algorithm for estimating the mean vector µ and the covariance matrix of a data set with missing values under the assumption of the conditional multivariate normal distribution is given in Algorithm 1.The algorithm includes a bias matrix B with the same size as the covariance matrix .The termination criterion for Algorithm 1 is based on the negative log-likelihood function that for the multivariate normal distribution N (µ, ) reads as: ln(L(µ, ))

Algorithm 1 Expectation Maximization
where L is obtained from the Cholesky decomposition of the covariance matrix , i.e., = LL T .Convergence is reached when there is no significant change in the values of the loglikelihood function between successive iterations.

2) FINAL ALGORITHM
The steps for computing ESDs for incomplete data are given in Algorithm 2.

D. EXPECTED EUCLIDEAN DISTANCE
In [6], the work in [5] and [39] was continued by extending the ESD distance for the expected Euclidean distance (EED).
It was shown that the EED distance could be modeled with a

Algorithm 2 Expected Squared Euclidean Distances
Input: An incomplete data set {x i } N i=1 , x i ∈ R n .1. Compute the mean vector µ and covariance matrix of incomplete data set using Algorithm 1. for each x i for which M i is nonempty do 2. Compute the conditional mean (µ i ) M i using formula (5) and the conditional covariance matrix M i M i using formula (6), respectively.3. Impute missing values of x i by values from (µ i ) M i to obtain x i .4. Impute conditional variance terms of σ 2 i from the diagonal of . for each pair of x i and x j in {x i } N i=1 and in {x j } N j=i+1 do 5. Compute the expected distance by utilizing the formula (4).Output: Pairwise squared Euclidean distances d ij of data vectors.
Nakagami distribution if the distances are assumed to follow the Gamma distribution.The expected Nakagami distributed values can then be obtained as follows: where m and are the shape and spread parameters of the Nakagami distribution, respectively, and is the Gamma function.
Under the independence assumption (as in [5], [39]), the variance can be expressed as where the expected values are obtainable using non-central moments.Table 1 presents moments of the normal distribution that can be used directly in the case of multivariate Gaussian distribution.However, weighted moments are needed if the data are assumed to follow Gaussian mixture distribution (see [6] for more details).The computation of the EED distances is based on the same framework as in Algorithm 2. However, additional steps are required which are given in Algorithm 3.

Algorithm 3 Expected Euclidean Distances
Input: An incomplete data set {x i } N i=1 , x i ∈ R n .1. Utilize Algorithm 2 to obtain the spread parameter for each pair of data vectors.for each pair of x i and x j in {x i } N i=1 and in {x j } N j=i+1 do 2. Compute Var(z) by using formula ( 9) and non-central moments

III. DATA PREPROCESSING A. FEATURE SCALING
Feature scaling is a typical preprocessing step in data analysis.Various data types are often measured in different units, which may lead to data types with large scales dominating the other data types in data-driven models.Various feature scaling approaches have been proposed, but the most commonly used approaches are the z-score and min-max normalization.
The z-score method equalizes the data type weights by transforming each one to a zero mean and unit variance.It is obtained by a linear transformation, subtracting the mean and by dividing the standard deviation: where µ and σ are the sample mean and standard deviation of the available values in the data set, respectively, and x is the scaled value.
Min-max normalization scales data to the selected range.The range may depend on the performed task, but [−1, 1] and [0, 1] are probably the most common choices.The min-max formula for an arbitrary range [a, b] can be written as follows: where min and max are computed for the available values.

B. K-NEAREST NEIGHBORS IMPUTATION
The k-nearest neighbors (kNN) method is a well-known and popular approach for imputing numerical values [2].This method can be implemented by finding the closest complete observation for an incomplete observation and imputing the missing values or taking an average of k closest observations, of which some can be partially incomplete.If the missing values of the data vector are not available in the k closest observations, the k should be increased until imputation succeeds.
In the literature, there exist many variants of kNN imputation, e.g., complete-case kNNI (CCkNNI), where a data vector with missing values is imputed by using the average value of a set of k nearest complete observations, or incompletecase kNNI (ICkNNI), where data vectors are selected from the case library in which the eligible nearest neighbors share the same complete values as x i and a missing value is available.In [42], it was suggested that up to k = 5 neighbors should be considered.If there are not enough neighbors, the missing value is imputed by the sample mean of all the available values for that data type.Even though nearest neighbors imputation is a straightforward approach for dealing with missing values, it can be inefficient when the number of missing values is relatively high [5].

C. LOW-RANK MATRIX COMPLETION
A low-rank solution for matrix completion is a common technique for data imputation.The low-rank matrix has a decreased number of degrees of freedom and, therefore, it makes the estimation problem of missing values practical to solve [43].The rank minimization problem can be addressed by using convex relaxation techniques utilizing the nuclearnorm [43], [44], which yields to the minimization of the following optimization problem: min where X is the completed data matrix which will be estimated, the data vectors x and x are flattened versions of the data matrices X and X, respectively, and • denotes the dot product.Moreover, the vector p is the flattened version of the projection matrix defined in (1), λ is the regularization parameter, and ||•|| * denotes the nuclear norm.The optimization problem in ( 12) can be solved iteratively by using a softthresholding technique to obtain the updated data vector x.The initial guess of x is given by the zero vector.Then, in the kth iteration, x is updated as follows: After that, xk is reshaped to a matrix form X k , the singular value decomposition is applied to the reshaped matrix, the singular values are soft-thresholded to obtain the updated ˆ k .The Xk = U ˆ k V T is flattened to obtain the final xk in the kth iteration.The λ is reduced by a cooling algorithm such that λ 1 > λ 2 > . . .> λ ∞ .The final result is obtained when there is a sufficiently small relative change in the target function ||p • x − p • x|| 2 or when λ reaches the predefined tolerance.

D. TRANSFORMATION INTO SPHERICAL FORM
The prototype-based K-means and K-spatialmedians clustering methods are not intended to discover any shape clusters because the used location estimates (mean and spatial median) assume spherical symmetry.That is the difference from kernel-based methods; see, e.g., [45].Such assumption is also inherent in the computation of Inter for cluster validation indices (see Table 2).However, the assumption that a data set contains clusters with spherical shapes can be unrealistic, making the clustering and cluster validation problems more challenging.In [32], a new approach for transforming and normalizing an arbitrarily shaped subset of data to an approximately spherical shape with a specified radius was introduced.The method is based on the notation of chains around high-density key points.The original method assumes a 2D data space.Thus, multidimensional scaling (MDS) [46] can be applied to project high-dimensional data sets into the 2D.

1) DEFINITION OF KEY POINT
The M points from X with relatively higher density and larger density-based distances are associated with the key points which can be determined by selecting M largest values based on the following equation: where ρ denotes the density of x i and x i,1 . . .x i,k are k = 4 nearest neighbors of x i , the minimum distance from x i to other points with a higher density is denoted as r i .The method connects points in the data set using density-based distance as the connection rule.Density-based connections are created until the key points are visited.In [32], the number of key points M was suggested to be selected as | √ N |.

2) DEFINITION OF CHAIN
Points that are connected to a key point form a chain.Multiple chains to one key point are allowed.Let us assume c chains.
Then the chain lengths can be defined as: where n c is the total number of points in chain c.In data set normalization, distances are transformed into a new one as follows: After normalization, the lengths of the longer chains are shortened, whereas shorter chains are lengthened, i.e., longer chains move closer to the key points, and shorter chains move away from the key points.The normalized chains can be optionally scaled to a fixed size.

IV. CLUSTERING A. BASIC ALGORITHMS
Prototype-based clustering methods consist of two main phases: selection of initial prototypes and iterative refinement until final convergence is reached, i.e., the cluster partition does not change (see Algorithm 4).In the classical K-means [47], MacQueen's initialization phase is combined with Lloyd's search phase [48].In general, the initialization phase is based on the random selection of initial prototypes, which most often causes the points to be selected from the same dense region yielding a poor performance [48].Moreover, due to the initial point selections, it is known that the basic algorithm does not guarantee a unique solution to the global minimum of the error function [24].Finding the global minimum is an NP-hard problem because there are Stirling number of the second kind different partitions for N observations into K groups [49].In practice, the common way to perform clustering is to repeat the algorithm with multiple restarts and to use the smallest local clustering error as a selection criterion for the final prototypes [50].
The mean of the cluster points is the statistical estimate of the cluster prototype in K-means.The method assumes that data are spherical Gaussian distributed with normally distributed noise and equal variance in each cluster.The median and the spatial median, the latter also referred to as the Fermat-Weber or Weber point, are robust estimates of location [51], whose spherical symmetric distributions are uniform and Laplace distributions, respectively.The spatial median is a multivariate generalization of the univariate median.The median and the spatial median are robust prototypes of a data distribution since they can tolerate up to 50% of incorrect data values without being disturbed.The spatial median is rotation invariant so that robustness improves as the dimension of the continuous problem space grows [49].

Algorithm 4 The Main Phases of Prototype-Based Clustering
Input: Data set and the number of clusters K .1. Select K observations as the initial prototypes.until the partition does not change do 2. Assign each observation to the closest prototype.
3. Recompute the prototypes with the assigned observations.Output: Partitions and prototypes corresponding K disjoint data subsets.
In the general case, the clustering error function can be written as follows: where d(•) is the distance computation strategy in the l q p space, and {c k } K k=1 is the set of cluster prototypes that minimizes locally the error function (17) and partitions the data into K disjoint subsets.J k is the within-cluster error in cluster C k , and l p -norm to the q-th power is the distance measure corresponding to the different location estimates of the error function (see [51], [52]).Specifically, the sample mean, median, and spatial median are obtained by choosing (p = q = 2), (p = q = 1), and (p = 2, q = 1), respectively.The sample mean and the median are straightforward to compute, whereas the spatial median requires minimization of a non-smooth (i.e., nondifferentiable) optimization problem (see [52]) that requires more complex iterative methods to be computed [53].For instance, the solution can be obtained efficiently by using the successive over-relaxation (SOR) method.
In the K-means++ initialization approach, the first prototype is selected as the centroid of the data set.Then, the following prototypes are selected iteratively from X based on the probability function obtained from previous prototype(s) min d(x i , {c} K −1 k=1 )/ ).Thus, the initial prototypes are very probably selected separately.The selection procedure can also be performed incrementally [54].It means that previously obtained K − 1 cluster prototypes are used as a fixed set of initial points where only one point is sampled according to the K-means++ principle.In highdimensional problems, K-means++ may show deteriorating behavior which can be compensated by using dimension reduction techniques [33].

B. CLUSTERING BASED ON EXPECTED DISTANCES
Computing the expected distances rely on the assumption of normally distributed data.The central limit theorem suggests that the assumption is valid with many continuous data sets with appropriate sample sizes [37], [55].However, the statistical parameters of data distribution are usually unknown, and missing values make estimating parameters more challenging.Usually, the EM algorithm can produce sufficiently accurate estimators of the unknown parameters, making the clustering task more approachable because the data characteristics are better known.
A clustering algorithm based on estimated distances was presented in [56].The core steps of the method are shown in Algorithm 5.The algorithm skeleton is identical to the traditional clustering (see Algorithm 4) but consists of two additional steps (steps 2 and 3) that utilize distance estimation.Steps 4 and 5 are repeated with estimated distances until final convergence is reached.We noticed in the previous study that, on average (over 100 repetitions), clustering based on the distance estimation produced better initial prototypes than the clustering based on the ADS.However, giving the distance-estimated prototypes as the initial points to the K-spatialmedians based on ADS produced even more accurate solutions to the clustering tasks.Thus, step 6 was included in the developed method in Algorithm 5.

V. CLUSTER VALIDATION INDICES
Many clustering algorithms require the number of clusters as an input parameter.However, often this information is not available, and deciding the number can be challenging, especially in the case of multidimensional data, which humans cannot directly conceive.Even though there are many methods for illustrating multidimensional data, i.e., using different multidimensional visualization techniques [57] or dimension reduction techniques [58], [59], the data structure may not be obvious.Cluster validity provides a way to validate the quality of the clustering results by discovering the partition that best fits the nature of the data.Thus, because of the high diversity of data, cluster validation measures, e.g., CVIs, are

Algorithm 5 Clustering Based on EED-ADS Distance Computation
Input: Data set X with missing values and the number of clusters K .
1. Select the spatialmedian as the first prototype of the data set.
2. Iteratively select the initial prototypes by using previously selected K − 1 prototypes and K-means++ initialization.2. Compute the mean vector and covariance matrices of the data using the EM method.recommended, even essential, methods for determining the final number of clusters [31].

A. INTERNAL CLUSTER VALIDATION INDICES
Internal cluster validation indices are commonly based on two measures: 1) Compactness, also referred to as Intra, indicates how close the observations are to each other within the same cluster.A commonly used Intra is a clustering error itself, e.g., in the Ray-Turi index.2) Separability, also known as Inter, indicates how distant a cluster is from the other clusters.Typically, Inter is computed as the minimum or maximum distance between all prototypes.Variability between prototypes around the centroid of the data is also used by many indices, e.g., in the Calinski-Harabasz index.In general, the purpose of CVIs is to minimize Intra and to maximize Inter, so that the argument minimum or maximum of division indicates the number of clusters.
Table 2 specifies the Inters and Intras of the best internal cluster validation indices according to [56].Indices are presented in a general fashion for l q p -norm settings.Explanation of abbreviations are given in Table 3.The whole data prototype is denoted by m, whereas n k indicates the number of observations in the kth cluster.The special distance computation strategies given in Section II, denoted by d(•), are required if at least one data vector includes missing values.Note that the WB-index, Calinski-Harabasz, and kCE-index include penalization terms for a high number of clusters that were originally derived in the context of the squared formulas.Therefore, l 2 p -norms were used for these indices regardless of the clustering error criterion used.In the Silhouette index, Intra is the average dissimilarity of x i to all other points in the same cluster, and Inter is the minimum average dissimilarity of x i to all points in a different cluster: where x i belongs to cluster C k .

B. EXTERNAL CLUSTER VALIDATION INDICES
External cluster validation indices can validate the quality of the clustering result if the actual clustering labels are known.The simple external index is Accuracy-index (ACC) which computes the quotient of the correctly predicted data labels and the total number of the labels [60].The normalized mutual information index (NMI) origins from information theory.The mutual information explains the reduction in the entropy between the real and the predicted cluster labels [61].The normalization is used to scale the result to the range of [0, 1].Many variants exist to normalize the mutual information, e.g., min, max, and square-root normalizations [61].However, the arithmetic method is often used, which divides the mutual information by the average value of entropy terms as follows: where I (•, •) denotes the mutual information between the real and predicted clusters and H (•) denotes the entropy function.
The adjusted Rand index (ARI) measures similarity between two clusterings of the same data using the permutation model [61].The equation can be written as: where n ij is an intersection table between the real and predicted cluster labels, the row and column sums of the intersection table are denoted by a i and b j , respectively.

VI. OVERVIEW OF THE TOOLBOX
The toolbox was implemented by using MATLAB (R2018b, 64-bit), and it is freely available with the MIT License on GitHub online 1 .The toolbox contains benchmark_data, toolbox, test_macro, and results folders.
Eight real-world classification data sets were selected from the University of California at Irvine (UCI) 2 Machine Learning Repository [62].Seven were used in the first part of the experiments, and three were used in the second part.Further, eight synthetic data sets, including the four S sets3 (15 centers and 5000 observations in each set), the Sim5D2 set 4 (5 centers and 2970 observations), the Sim2D2 set 4  (2 centers and 2000 observations), the O200 set 4 (5 centers and 200 observations), and the O2000 set 4 (5 centers and 2000 observations) were selected from a previous study [56] for the second part of the experiments.These synthetic data sets are two-dimensional.In addition, in total, 12 synthetic multidimensional data sets (10D, 50D, 100D) with 15 centers and 6000 observations in each set were created with the data set generator5 [33] for the third part.
The toolbox includes routines for handling missing values, data preprocessing, clustering, and validating clusters.All the developed methods generalize to missing values in the data.The descriptions of the most vital MATLAB functions are given in Table 4. Notice that more detailed descriptions of functions are available using the help command in MAT-LAB (see the next section for the use case examples).Further, the help command shows the function calls, input and output parameters, and default values of the input parameters for each toolbox function.The toolbox supports computation strategies based on available data (ADS), partial distance (PDS), and expected distances (ESD, EED) that are used by clustering methods, cluster validation indices, and data preprocessing methods.In total, ten well-performing internal cluster validation indices depicted in Section V are supported.Further, as depicted in Sections II-III, the preprocessing functionality includes routines for data imputation, distance computation with a selected distance strategy, selecting key points, and transforming data sets into spherical forms.

A. GENERAL USE OF THE TOOLBOX
General use of the toolbox is demonstrated in the toolboxdemo macro (see the next section).The correct functionalities of the toolbox functions can be evaluated with test macros divided into three test case folders.The first test case folder includes the Main macro that performs comparisons of techniques for handling missing values.The macro selects the parameters used from the params file.The cluster validation process can be divided into three tasks in the second test case folder: data preparation, clustering, and cluster validation.The Main macro pipelines these tasks to one process and outputs an Excel file of the cluster validation results.Further, the toolbox offers missing values generation, clustering, and cluster validation as separate processes implemented in the generatemissdata, clusterdata, and validateclustdata macros, respectively.An optionally visualizeresults macro can be used to visualize the final results of the clustering and cluster validation.
The third test case folder includes the same Main macro functionalities as given in the second test case folder.However, the mechanism for generating missing values was modified to restore 0.5% of the original observations.It was required because the initialization of clustering uses the complete observations, and removing data values completely at random from high-dimensional data causes all observations to contain missing values.

B. EXAMPLES OF BASIC USE
basic use of the is given in the toolboxdemo file.It includes function calls for data preprocessing, clustering, and cluster validation.In the first example, 10% of missing values are generated for the input data.The result is min-max scaled to a range of [-1, 1], and the k-nearest neighbors imputation with five neighbors is performed.Then, the dimensionality of the imputed data set is reduced to 2D and transformed into a spherical form (Section III-D).Finally, the transformed data are visualized on a scatter plot.load fisheriris; X = meas; addpath('../../../toolbox/preprocess'); Xm = genmissdata(X, 0.1); Xnorm = normalizedata(Xm, 'min-max', [−1, 1]); Ximp = knnimpute(Xnorm, 5); Xmapped = datasetmap(Ximp); scatter_data(Xmapped); In the second example, clustering is performed based on available data in distance computation, i.e., using a In the first case, the experimental settings and the reference results were obtained from [5].The real-world data sets were selected from the UCI repository.The experiments consisted of the z-score scaling of the data to the zero mean and unit variance.Then, the fixed probabilities (5, 15, 30, and 60%) of the data values were removed completely at random from each data set.The estimated distances were compared to the real distances, which were computed beforehand.The root mean square error (RMSE) between the real distances and the estimated distances was used.The RMSE included only the cases where estimations were needed, i.e., distances over complete observations were omitted.Further, in the cases of empty data vectors, the average distances over the data samples were used in error computing.The mean values and standard deviations of the results were recorded utilizing measurements over 250 repetitions.
We validated the functionalities of the implemented distance estimation algorithms against the reference methods given in [5].An extension of the reference paper was the self-made implementation of the EM algorithm so that the ecmnmle function was not required (available only in MAT-LAB's commercial Financial Toolbox).Further, in addition to the ESD, PDS, and ICkNNI (k = 5) methods, the EED, ADS, kNNI (k = 5), and iterative soft-thresholding methods were added to the comparisons.Table 5 shows the results, which are in line with the reference results in the six cases over seven data sets.The exception is the wine data set, in which all distance computation mechanisms produced different results.In [5], a Monte Carlo simulation was used to remove data values in each repetition, whereas in our experiments, data values were removed completely at random.That may explain the differences in the results.In general, the results indicate that the EED is the best-performing algorithm.However, the ESD results are only slightly worse, and the method is computationally less expensive.Thus, the ESD method is highly recommended for computing pairwise distances.

B. PERFORMANCE EVALUATION OF CLUSTERING AND CLUSTER VALIDATION
In the second part, the data clustering and cluster validation indices methods were evaluated.The initial settings were selected from [56].These settings included removing data values completely at random from data sets (see the toolbox overview section for detailed descriptions of the data sets), min-max scaling that results in a range of [−1, 1], repeating the K-spatialmedians clustering with 100 replicates, and selecting the lowest local minima as the best clustering partition.The prototypes were initialized incrementally, benefiting the previous prototypes (see the last paragraph in Section IV-A).In [56], the clustering method based on the expected distances and giving the obtained prototypes as inputs in the K-spatialmedians with ADS algorithm was suggested.The clustering and cluster validation indices were revealed to be slightly more accurate based on the two-stage clustering approach.Thus, the same procedure was repeated in this study among the K-spatialmedians clustering.
The results given in [56] were reproduced to validate the functionality of the cluster validation.Note that the results/params folder includes the parameter files used in different experiments related to cluster validation.The experiments showed that the best cluster validation results are obtained using K-spatialmedians clustering based on EED-ADS distance estimation.The new approach improved the performance, especially when compared to the results, which were available using the real centers of the synthetic data sets as the initial points to K-spatialmedians based on the ADS (see results folder).The best-performing index was Calinski-Harabasz (CH) that always recommended TABLE 5.The average RMSEs and standard deviations (over 250 repetitions) of distance computation algorithms in the direct estimation of pairwise distances with data sets consisting (5, 15, 30, and 60%) of missing values.The best results for each test set are underlined, and the results that are not statistically significantly different (two-tailed paired t-test, α = 0.01) are in bold.
the correct numbers of clusters even if the data sets and degrees of missing values varied.The other well-performing indices were kCE-index (KCE), Silhouette, and Ray-Turi.
Three external cluster validation indices were selected to measure the quality of the K-spatialmedians clustering results based on ADS and EED-ADS distance estimations strategies.The selected indices were: Accuracy (ACC), adjusted Rand index (ARI), and normalized mutual information (NMI).Table 6 shows the comparison results.Clearly, the EED-ADS-based estimation produces better solutions for the synthetic data sets.Especially, the better results were obtained with the S2 data set and with the challenging S4 and Sim5D2 data sets.These results are in line with the results obtained by the internal indices, which especially recommended the better solutions with the EED-ADS estimation for the Sim5D2 data set.
We applied the expected distance estimation to the actual cluster validation indices.It appeared that only Silhouette, Wemmert-Gançarski (WG), and Davies-Bouldin (DB) benefited from the distance estimation, and the other indices decreased the performance for finding the correct number of clusters in the data sets.Compared to the other indices, which compute the pairwise distances between observations and complete centroids, Silhouette computes the pairwise distances between observations, which may be incomplete (see eq. ( 18)).Thus, it was expected that Silhouette performed better using the expected distances.
Key point selection (presented in Section III-D1) was used in the cluster validation.The number of key points can be fixed to || √ N ||, as recommended in [32].However, we provided two modified versions of the original algorithm based on key point pruning, i.e., the algorithms started from the given maximum for the key points and then removed irrelevant points one by one.This was performed iteratively until the value of K of the chosen number of clusters was reached.The selected points were then used in the initialization of the selected clustering algorithm in each iteration.The experiments were performed for 2D data sets.For this purpose, Ecoli, Iris, and Seeds real-world data were transformed to 2D using multidimensional scaling.The selection assumed that the data sets were complete, therefore, the ICkNN (k = 2) imputation strategy was applied to the data sets with missing values.The figures for the key point selection result are given in results/key_point_selection/img folder in toolbox.The results for the cluster validation indices are given in Table 7.The reference results for the synthetic data sets were obtained from [56].On average, the validation results for the key point selection were almost the same as the reference results, which were based on available data strategy and replicated clustering.The most challenging synthetic data set was Sim5D2.None of the indices was able to get all correct recommendations with the different degrees of TABLE 6.The quality of clustering results determined with external cluster validation indices.The K-spatialmedians clusterings with available data strategy (ADS) and using both expected distance and available data computations (EED-ADS) were compared.The highest scores are bolded only if they differ between the two clustering methods.
missing values.The high-density clusters in Sim5D2 caused many incorrect validation results.The sparse clusters were connected to higher-density ones after clustering, and therefore, many indices supported three as the correct number.Especially, the sparse clusters almost disappeared based on the ICkNN imputed data with 20% of missing values (see images from results folder).The validation results with real-world data were improved using the key point selection with all data sets.It appeared that CH, KCE, and WB indices recommended a very high number of for Ecoli and Iris data without the key point selection.

C. CLUSTER WITH MULTIDIMENSIONAL DATA
In the third part of the experiments, the cluster validation indices were applied to multidimensional data sets that were created by the data set generator presented in [33].The generator draws a random point on the M-dimensional sphere centered on c with radius d.The distance between centers is defined as d c = ||c i − c j ||, c i , c j ∈ C, where i = j, and C is a set of centers.The radius d is uniformly selected from the range of (0, 1] for each data point.It means the clusters do not overlap in the multidimensional space when the distance of the centers is d c ≥ 2, and the cluster overlap is approximately 50% if the distance is d c = 1. Table 8 shows the results of the cluster validation indices with the predefined number of missing values (0, 5, 10, and 20%) and different degrees of cluster overlap (d c = [0.9,0.8, 0.7, 0.6]).The best performing index was WG which recommended the correct number of clusters in almost all test cases (45/48 correct recommendations).Interestingly, the CH, KCE, and WB-index, which included the squared penalization term, always recommended the incorrect number of clusters.We also tested the non-squared penalizations but were not able to improve the results.The KCE uses only Intra which explains that the better separation in the multidimensional space depends on the quality of Inter.It supports the finding given in [33] that the difference between the clustering errors of good and bad clustering results in high-dimensional spaces is small.The curse of dimensionality can explain the findings, which causes relative differences between the distances to vanish in high dimensional spaces [63].The other well-performing indices were GD, RT, and DB * , which recommended 37, 33, and 30 correct solutions, respectively.The highest overlapping clusters (d c = 0.6) were challenging for the indices because only WG (11/12 times), RT (3/12 times), and GD (3/12 times) were able to find the correct numbers.
The experiments were also conducted with 2D-scaled M-Sphere data sets.However, the performance of all indices was poor in 2D data space (only a few correct recommendations), and therefore, these results were not reported.The generated clusters were compact and isolated in the highdimensional space, which explains the far better validation results with these data sets in their original dimensions [63].Further, the dimension reduction leads to a loss of information which also supports the findings.Nevertheless, the developed key point selection algorithms with ICkNN (k = 2) imputed data possess multidimensional functionality.The results of cluster validation indices with the key point selection and multidimensional M-Sphere data sets are given in the results folder in the toolbox.The indices can be concluded to perform better when the key point selection procedure was used to initialize the K-spatialmedians clustering with 0%, 5%, and 10% of missing values in the data sets.However, a decreased performance was observed with 20% of missing values in data.

VIII. DISCUSSION
The results indicate that the ESD distance estimation could be a better choice than EED in the general case due to the lower computational complexity.The overall best clustering models with synthetic 2D data sets seem to be obtained using expected distances in clustering and giving the prototypes as inputs to the K-spatialmedians originally based on the TABLE 7. The number of clusters determined with internal cluster validation indices based on key point selection (every second row).The data sets consisted of predefined numbers of missing values, and experiments were performed using the K-spatialmedians clustering algorithm.Reference results were obtained using K-spatialmedians clustering with available data strategy.
available data distance strategy.In the case of multidimensional data, we noticed that the quality of the clustering models highly depends on the form of the Inter term.In addition, significantly better validation results were achieved when the data sets resided in their original dimensions than in 2D presentation.The WG index clearly overperformed the other indices on the multidimensional sets.
The experiments to demonstrate the performance of the cluster validation indices were performed both on synthetic and real-world data sets.One challenge for testing indices with real-world data sets is that the correct number of clusters is not obvious.For instance, a clustering model may produce a useful presentation about the inherent structure of a data set while it does not necessarily agree with the given class distribution for the same data.In data mining, the goal of cluster analysis is, however, to discover new knowledge instead of training a prediction model in a supervised manner.In this scenario, one approach for validating a cluster model and estimating the number of clusters is to apply multiple indices that have previously performed well on several data sets.
We provided two modified versions of the original key point selection algorithm based on key point pruning.The developed algorithms included a mechanism for removing irrelevant key points.The algorithms resulted in good solutions for most of the data sets with a varying portion of missing values.However, there is still room for improvement in the heuristics to identify appropriate locations of the key points for diverse data sets.The development of clustering heuristics is not a trivial task because the notion of a cluster itself can be weakly defined [64].It is also good to remember that clustering is often in the eye of the beholder [65].Before a clustering algorithm is applied to the data, one may also want to determine whether the data even has a clustering tendency [66].The most central properties of clusters are density, variance, dimension, shape, and separation [67].Further, what type of clustering model is the most useful always depends on the target application.

IX. CONCLUSION
Even though the basic idea behind cluster analysis is simple, the process presumes many decisions and choices with multiple options in different parts of the analysis.This study proposed a toolbox that enables researchers and practitioners to achieve reliable and consistent clustering results regardless of missing values in their data.The priorities of the present work were on data preprocessing, clustering, and cluster validation.
The toolbox supports missing values and enables its user to build automated data clustering pipelines from preprocessing to cluster analysis and model validation.The validity and performance of the algorithms were demonstrated using multiple test cases and several data sets.One should note that the aim of the presented experiments was not to perform a systematic method comparison since most of the underlying development work has already been accomplished in the previous studies cited in this paper.
We remind that some of the implemented functions can also be useful in other machine learning tasks.For instance, the distance computation methods for missing data cases provided in the preprocessing folder are readily applicable in supervised learning with the distance-based methods [68], [69].
The functionality of the toolbox was verified against the reference results from the previous publications.In the study, the two expected distances measuring metrics' performance were thoroughly demonstrated in handling missing values.Further, a recently published key point selection mechanism, which associates the data points with relatively higher density and larger density-based distances to the so-called key points, was applied to improve the cluster validation process.The cluster validation was experimented with challenging multidimensional data sets with various cluster overlap and numbers of missing values.
Even though the key point selection strategy seems to improve the performance of many cluster validation indices, further investigations are recommended, especially related to the key point selection procedure and the initialization of clustering algorithms.The initialization is an important part of the clustering process, and several studies are already available on the topic [33], [70]- [72].The purpose of this toolbox is to facilitate and promote this research further.The UCI Repository provides a multitude of data sets, of which some are particularly proposed for clustering experiments.This toolbox enhances the testing of its methods with a wider range of sets.

APPENDIX A EXPECTED SQUARED EUCLIDEAN DISTANCE
Let us assume the data are missing at random (MAR), i.e., missingness may depend on the value of available data: The expected squared Euclidean distance between two data vectors can be partitioned into four parts depending on the missing and available values of each data vector: + where A i and A j denote the available values of data vectors x i and x j , respectively, and M i and M j denote the missing values of the vectors.The first term (l ∈ A i ∩ A j ) represents pairwise known values of both vectors, and they can be computed directly.The rest of the sum contains terms where at least one part contains only missing values.The missing value can be replaced with a random value, i.e., (x i ) l is denoted by (X i ) l for every l ∈ M i .Thus, the equation can be expanded as follows: In more detail, the third summation (l ∈ M i ∩ M j ) can be written as: 2  + Var((X i ) l ) + Var((X j ) l ).
Thus, it is sufficient to compute the expected value and variance of each random value separately to obtain the final distance.= 0.
Thus, x and x 2 are uncorrelated.In addition, they are jointly normally distributed, and therefore, independent.Following the initial assumptions, the conditional mean of x 1 given x 2 is obtained as follows: Further, we find out the following equality: Note that the basic rules of matrix algebra are given in [73].

3 .repeat 4 .
Compute the expected distances between the observations and prototypes.Assign individual observations to their closest prototypes. 5. Recompute the prototypes with the assigned observations.until The final convergence 6. Repeat steps 4 and 5 without distance estimation.Output: Partitions and prototypes corresponding K disjoint data subsets.

until final convergence do for each x i for which M i is nonempty do
Compute mean vector µ of available values of the data set.2. Impute missing values by µ to obtain the imputed matrix X imp .3. Recompute µ and compute covariance matrix by using imputed data.4. Create a zero matrix B which size is equal to .

TABLE 1 .
Non-central moments of normal distribution.
in Table 1. 3. Use formula (8) to obtain the shape parameter m and the final distance d ij .Output: Pairwise Euclidean distances d ij of data vectors.

TABLE 2 .
Internal cluster validation indices in general fashion.

TABLE 8 .
The number of clusters determined with internal cluster validation indices using K-spatialmedians clustering.The data sets consisted of predefined number of missing values, and different degrees of cluster overlap.