Ultra Data-Oriented Parallel Fractional Hot-Deck Imputation With Efficient Linearized Variance Estimation

Parallel fractional hot-deck imputation (P-FHDI (Yang et al. 2020)) is a general-purpose, assumption-free tool for handling item nonresponse in big incomplete data by combining the theory of FHDI and parallel computing. FHDI cures multivariate missing data by filling each missing unit with multiple observed values (thus, hot-deck) without resorting to distributional assumptions. P-FHDI can tackle big incomplete data with millions of instances (big-n) or 10,000 variables (big-p). However, handling ultra incomplete data (i.e., concurrently big-n and big-p) with tremendous instances and high dimensionality has posed problems to P-FHDI due to excessive memory requirements and execution time. To tackle these challenges, we propose the ultra data-oriented P-FHDI (named UP-FHDI) capable of curing ultra incomplete data. In addition to the parallel Jackknife method, this paper enables computationally efficient, ultra data-oriented variance estimation using parallel linearization techniques. Results confirm that UP-FHDI can tackle an ultra dataset with one million instances and 10,000 variables. This paper illustrates the special parallel algorithms of UP-FHDI and confirms its positive impact on subsequent deep learning performance.


I. INTRODUCTION
Incomplete data commonly occur in nearly all scientific and engineering domains, which may result in biased parameter estimation and exacerbate subsequent statistical analyses [2]. Inadequate handling of missing data may lead to incorrect statistical inference and degrade subsequent machine learning (ML) [3]. A popular approach, known as listwise deletion [4], omits instances with missing values from the analysis, but it may seriously bias sample statistics if the data do not follow the assumption of missing completely at random (MCAR). Pairwise deletion, proposed by [5], is found deficient for severe missingness. Naive imputation (i.e., replacing missing values with the mean of observed values) is popular [6], but it may lead to inconsistent bias when missingness is considerably unequal across variables.
There are two branches of imputation theory that replace a missing value with statistically plausible values [7]: single imputation and repeated imputation. Single imputation uses expectation-maximization (EM) to replace each missing item with a predicted value estimated by maximum likelihood methods [8]. However, it may require considerable convergence time when data missingness is severe, and single imputation methods tend to underestimate the standard errors.
Repeated imputation includes two popular methods: multiple imputation and fractional imputation. Multiple imputation, proposed by Rubin [9], overcomes the downside of single imputation by replacing each missing value with several plausible values representing the distribution of all possible values. The R package mice implements multiple imputation using chained equations to handle multivariate missing data such that each variable has its own imputation model [10]. However, the so-called "congeniality" and "self-efficiency" conditions are required for the validity of multiple imputation, which may pose challenges in practice [11], [12]. Fractional imputation, proposed by [13] and extensively discussed in [14], [15], [16], creates a complete dataset with fractional weights after imputation. Departing from fractional imputation, [17] proposed fractional hot-deck imputation, a non-parametric imputation method using two-phase sampling for stratification. Some of the authors of this paper developed the R package FHDI [18] to perform fractional hot-deck imputation or fully efficient fractional imputation (FEFI) to cure general multivariate incomplete datasets without prior distributional assumptions. Recently, ML-based imputation methods have been gradually emerging, e.g., generative adversarial networks (GAN) for missing data [19], [20], K-Harmonic mean imputation [21], sequential regression multivariate imputation [22], and Fuzzy C-Means imputation [23], among others.

II. RELATED WORK
This section reviews state-of-the-art parallel computing-based imputation methods. [25] developed a parallel imputation algorithm with cloud-based tensor decomposition to impute missing epigenomics experiments. The package MaCH implements genotype imputation to infer missing genotypes in genetic studies. However, it considers all observed genotypes when imputing each missing genotype, leading to a quadratic increase in execution time, for which [26] proposed a parallel MaCH-based imputation using GPU implementations. Yet, the above parallel imputation methods are restricted to bioinformatics, and their favorable scaling performance is not broadly confirmed. Since incompleteness often occurs in big enterprise data, [27] proposed a parallel imputation framework to cure enterprise registration data, but it is tailored for big geo-referenced text data processing and exhibits limited scaling (3.5 times at maximum). Other related works include the R package missForest [28] for mixed-type data and parallel Bayesian Markov chain Monte Carlo (MCMC) imputation [29] for missing data in epidemiology. missForest adopts the random-forest imputation method and can run in parallel when the computation of random forests is time-consuming. [29] first adopted parallel MCMC for Bayesian imputation on disk-based shared memory, parallelizing MCMC iterations over available nodes by a simple divide-and-conquer strategy and achieving promising scalability. However, parallel MCMC requires highly customized, setting-specific, parallelized software. All of these parallel imputation methods are restricted to specific disciplines and are not suitable for general ultra incomplete data across broad science and engineering domains, which is tackled by the present UP-FHDI.

III. KEY EQUATIONS FOR UP-FHDI
For a concise description of UP-FHDI, we introduce the following basic setup. Assume a finite population U with p-dimensional continuous variables y = {y_1, y_2, ..., y_p}, where y_il represents the ith realization of the lth variable, i ∈ {1, 2, ..., N} and l ∈ {1, 2, ..., p}. A response indicator δ_il takes the value of 1 if y_il is observed and δ_il = 0 otherwise. Given the number of categories K, one can discretize the continuous variables y to discrete variables z, the so-called "imputation cells," such that z takes values within the categories {1, 2, ..., K} for each variable. The y-values within the same value of z are relatively homogeneous. One can decompose an instance y_i = {y_i,obs, y_i,mis} into the observed and missing parts, respectively; the corresponding z-vector is decomposed similarly as z_i = {z_i,obs, z_i,mis}. Let A be the index set of the sample of size n selected by a probability sampling mechanism. Consider A_M as the index set of missing units such that A_M = {i ∈ A; Π_{l=1}^p δ_il = 0}; the complementary index set of fully observed units is A_R = A \ A_M. The cross-classification of all variables forms the imputation cells z, and we assume a cell mean model on the cells determined by z. Considering a finite mixture model under missing at random (MAR), the conditional distribution f(y_i,mis | y_i,obs) can be approximated (see the detailed theory in [1]). Fig. 1 gives readers a comprehensive guide to the UP-FHDI theory. We will follow this diagram to present the key equations of the UP-FHDI stages as follows.
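For readers unfamiliar with [1], the finite mixture approximation can be restated compactly in our notation as follows; this is a sketch of the idea, with the exact form and regularity conditions given in [1]:

$$f(y_{i,mis} \mid y_{i,obs}) \approx \sum_{z_{mis}} P(z_{mis} \mid z_{i,obs})\, f(y_{i,mis} \mid z_{i,obs}, z_{mis}),$$

i.e., the conditional density of the missing part is approximated by a mixture over candidate missing-part cells, with mixing proportions given by the estimated cell probabilities.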

A. Cell Construction and Variable Reduction
The mathematical symbol '•' proposed in [1] denotes a loop that repeats a sequence of the same operation S(x) with discrete input arguments; consequently, such repeated operations can be organized in a matrix format. This paper continues to use these notations to facilitate the description of UP-FHDI. [17] proposed the discretization method to pre-determine imputation cells z using the estimated sample quantiles. Considering z_M and z_R, we can obtain the index set of unique missing patterns Ã_M = {i, j ∈ A_M | ∀ i ≠ j, z_i ≠ z_j} of size ñ_M and the index set of unique observed patterns Ã_R = {i, j ∈ A_R | ∀ i ≠ j, z_i ≠ z_j} of size ñ_R, respectively. Sequentially, we can further extract the unique missing patterns z̃_M = {z_i | i ∈ Ã_M} and the unique observed patterns z̃_R = {z_i | i ∈ Ã_R}.

UP-FHDI must overcome the curse of dimensionality. A huge literature has accumulated on the problem of variable reduction (e.g., [30], [31], [32], [33]). Notably, [34] proposed sure independence screening (SIS) for ultrahigh-dimensional variable selection. Inspired by the screening step of SIS, [1] developed a correlation-based SIS to filter out noise variables with weak correlations to the responses (i.e., missing variables). Suppose we select v variables for a recipient such that v ≪ p. Let X = {X_1, ..., X_q} be always observed and Y = {Y_1, ..., Y_w} be subject to missingness such that p = q + w. Consider r_k = (r_1k, r_2k, ..., r_qk) as a vector of sample correlations of X given Y_k. This paper proposes a complement to the existing correlation-based SIS in [1], where M_k is the sub-covariate set with the top correlations for imputing Y_k.

Imputation cells z after discretization cannot always guarantee at least two donors for every recipient to capture the variability from imputation. Denote the minimum entry of M by m. [1] adopted cell collapsing to produce more donors, stopping only if m ≥ 2. The non-parallelizable cell collapsing method is computationally intractable in practice. As a remedy, this paper applies the KNN method [35] to efficiently supplement deficient donors. Let e ∈ R^{ñ_R} be the Euclidean distances (ED) between a recipient z_i and z̃_R. Let w be the observed covariate set for z_i if SIS is not used; otherwise, w is set to be the sub-covariate set M derived by SIS for z_i. Suppose z_i has fewer than two donors such that M_i < 2. The KNN algorithm then consists of the following steps (see the sketch after this list):
1) Compute e by
$$e_j = \left[ \sum_{l \in w} \left( \frac{z_{il} - \tilde{z}_{R(j,l)}}{k_l} \right)^2 \right]^{1/2}, \quad (2)$$
where z̃_R(j,l) ∈ z̃_R and k_l is the number of categories of z_l.
2) Suppose e_t1 is the minimum entry of e. Add t_1 to D_i and update M_i. If M_i ≥ 2, stop.
3) Suppose e_t2 is the second minimum entry of e. Add t_2 to D_i and update M_i.
We iteratively apply the KNN method to every unqualified recipient until m ≥ 2.
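To make the donor-augmentation loop concrete, below is a minimal serial sketch of the KNN steps above. It is an illustration under stated assumptions, not the UP-FHDI implementation: cellDistance, knnAugment, and the container layouts are hypothetical names, the distance follows the reconstructed form of (2), and at least two unique observed patterns are assumed to exist.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Normalized Euclidean distance between a recipient's cell vector and one
// unique observed pattern, restricted to the covariate set obsSet (cf. (2)).
double cellDistance(const std::vector<int>& recipient,
                    const std::vector<int>& donorPattern,
                    const std::vector<int>& obsSet,
                    const std::vector<int>& kcat) {
  double ss = 0.0;
  for (int l : obsSet) {
    double d = static_cast<double>(recipient[l] - donorPattern[l]) / kcat[l];
    ss += d * d;
  }
  return std::sqrt(ss);
}

// Append the nearest (and, if needed, second-nearest) unique observed
// pattern indices to the recipient's donor list D_i until two donors exist.
void knnAugment(const std::vector<int>& recipient,
                const std::vector<std::vector<int>>& uniqueObserved, // z~_R
                const std::vector<int>& obsSet,
                const std::vector<int>& kcat,
                std::vector<std::size_t>& donorList) {   // D_i
  while (donorList.size() < 2) {
    double best = std::numeric_limits<double>::infinity();
    std::size_t bestIdx = 0;
    for (std::size_t t = 0; t < uniqueObserved.size(); ++t) {
      // Skip patterns already chosen as donors.
      bool taken = false;
      for (std::size_t d : donorList) if (d == t) { taken = true; break; }
      if (taken) continue;
      double e = cellDistance(recipient, uniqueObserved[t], obsSet, kcat);
      if (e < best) { best = e; bestIdx = t; }
    }
    donorList.push_back(bestIdx);  // add t1, then t2, as in the steps above
  }
}
```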

B. Estimation of Cell Probability With Selected Variables
A modified EM algorithm by weighting was proposed by [36] and extensively used in [1], [17] to estimate the conditional probability. This section extends this method with selected variables. Let X* be the selected covariates in the sub-covariate set M derived by SIS such that X* ⊂ X = {X_1, ..., X_q}. Consider Y = {Y_1, ..., Y_w} as a w-dimensional categorical random vector with support C. We develop an EM algorithm for estimating the joint probability for P(Y | X*). Let π_c(x*) = P(Y = c | x*) be the conditional probability of Y = c given X* = x*. Note that Σ_{c∈C} π_c(x*) = 1. The E-step computes the conditional probability of each compatible cell,
$$P(Y = c \mid y_{i,obs}, x_i^*) = \frac{\pi_c(x_i^*)\, I(c \in C_i)}{\sum_{c' \in C_i} \pi_{c'}(x_i^*)},$$
where C_i = {c ∈ C; y_i,obs = c_obs(i)} is the subset of C whose c_obs(i) components are equal to y_i,obs. The M-step then updates the conditional probability as the weighted average of the E-step probabilities over the units sharing x*.
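The following minimal sketch illustrates the weighted EM iteration above for the joint cell probabilities. It is a simplified serial illustration in which the conditioning on X* is abstracted away; emCellProbability, compat, and w are hypothetical names, and compat[i] encodes the compatible set C_i of unit i.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<double> emCellProbability(
    const std::vector<std::vector<std::size_t>>& compat, // C_i per unit
    const std::vector<double>& w,                        // sampling weights
    std::size_t nCells, int maxIter = 200, double tol = 1e-8) {
  std::vector<double> pi(nCells, 1.0 / nCells);          // uniform start
  double wSum = 0.0;
  for (double wi : w) wSum += wi;
  for (int iter = 0; iter < maxIter; ++iter) {
    std::vector<double> next(nCells, 0.0);
    for (std::size_t i = 0; i < compat.size(); ++i) {
      // E-step: P(c | y_i,obs) proportional to pi_c restricted to C_i.
      double denom = 0.0;
      for (std::size_t c : compat[i]) denom += pi[c];
      if (denom <= 0.0) continue;
      for (std::size_t c : compat[i]) next[c] += w[i] * pi[c] / denom;
    }
    // M-step: renormalize the weighted expected counts.
    double delta = 0.0;
    for (std::size_t c = 0; c < nCells; ++c) {
      next[c] /= wSum;
      delta += std::abs(next[c] - pi[c]);
    }
    pi.swap(next);
    if (delta < tol) break;                              // converged
  }
  return pi;
}
```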

C. Imputation
For simplicity of description, consider two discrete random variables (z_1, z_2) with the same support as (z_obs, z_mis). The key equation for developing fractional hot-deck imputation approximates the conditional distribution f(y_mis | y_obs) by a product of three components, where the last equality is derived under the conditional independence between y_mis and z_obs given z_mis. The first component is equal to I(z_1 = z_obs) if z_obs is a sufficient summary of y_obs. The second component is computed by the above EM algorithm. The third component is computed from the hot-deck imputation within z.

For concise explanation, let A be partitioned into G groups such that A = A_1 ∪ ... ∪ A_G. The group A_g can be sub-partitioned into A_Rg = {j ∈ A_g; δ_j = 1} and A_Mg = {j ∈ A_g; δ_j = 0}. Let n_Rg and n_Mg be the sizes of A_Rg and A_Mg, respectively. We can decompose a donor for the ith recipient (z_i,obs, z_i,mis) as (z_j,obs, z*_j,mis), where z_j,obs = z_i,obs and z*_j,mis is a possible imputed value for z_i,mis. Let w*_ij,FEFI be the fractional weight of the jth FEFI donor for the ith recipient. The FEFI donors refer to all possible donors, and w*_ij,FEFI is expressed by
$$w^*_{ij,FEFI} = \frac{\hat{\pi}_{z^*_{j,mis} \mid z_{j,obs}}}{\sum_{l=1}^{n_{Rg}} a_l}, \quad (8)$$
where a_l = 1 if (z_l,obs, z*_l,mis) = (z_j,obs, z*_j,mis) and a_l = 0 otherwise. Note that Σ_{j=1}^{n_Rg} w*_ij,FEFI = 1. The term π̂_{z*_j,mis | z_j,obs} is the conditional probability of z*_j,mis given z_j,obs, computed as
$$\hat{\pi}_{z^*_{j,mis} \mid z_{j,obs}} = \frac{\hat{P}(z^*_{j,mis}, z_{obs})}{\sum_{z_{mis}} \hat{P}(z_{mis}, z_{obs})},$$
where P̂(z*_j,mis, z_obs) is the joint probability derived after the convergence of the EM algorithm in the preceding subsection. Note that Σ_{j=1}^{n_Rg} π̂_{z*_j,mis | z_j,obs} = 1.

The tailored systematic sampling [18] was designed to efficiently select M FHDI donors among the FEFI donors, and UP-FHDI inherits this method with modifications. Originally, the first step of the tailored systematic sampling scheme sorts all FEFI donors by Mahalanobis distance in a half-ascending, half-descending order. However, the computation of Mahalanobis distance is intractable when the dimension space is ultra-high. As a remedy, we substitute this step with a random shuffle and confirm that this modification affects the mean estimates in a negligible manner. Let w*_ij be the fractional weight of the jth FHDI donor for the ith recipient, given by (10). The FHDI estimator of y_l can be expressed by
$$\hat{\mu}_{l,FHDI} = \frac{1}{n} \sum_{i \in A} \left[ \delta_{il}\, y_{il} + (1 - \delta_{il}) \sum_{j=1}^{M} w^*_{ij}\, y^{*(j)}_{il} \right],$$
where y*(j)_il is the jth imputed value of y_il.
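A minimal sketch of the modified donor-selection step is shown below: FEFI fractional weights in, M FHDI donors out, with the random shuffle replacing the Mahalanobis sort as described above. The function name selectFhdiDonors is hypothetical, and assigning each selected donor the equal weight 1/M is a simplification of the exact FHDI weights (10) in [18].

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Select M FHDI donors among the FEFI donors of one recipient by a random
// shuffle followed by systematic (probability-proportional) sampling.
std::vector<std::size_t> selectFhdiDonors(const std::vector<double>& fefiW,
                                          std::size_t M, std::mt19937& rng) {
  const std::size_t n = fefiW.size();       // fefiW assumed to sum to 1
  std::vector<std::size_t> order(n);
  for (std::size_t j = 0; j < n; ++j) order[j] = j;
  // Random shuffle replaces the half-ascending/half-descending Mahalanobis
  // sort, which is intractable in ultra-high dimensions.
  std::shuffle(order.begin(), order.end(), rng);
  // Systematic sampling: walk M equally spaced points through the
  // cumulative fractional weights of the shuffled donors.
  std::uniform_real_distribution<double> unif(0.0, 1.0 / M);
  const double start = unif(rng);
  std::vector<std::size_t> picked;
  double cum = 0.0;
  std::size_t next = 0;
  for (std::size_t j = 0; j < n && next < M; ++j) {
    cum += fefiW[order[j]];
    while (next < M && start + next * (1.0 / M) < cum) {
      picked.push_back(order[j]);  // a donor may be picked more than once
      ++next;
    }
  }
  return picked;  // each selected donor carries fractional weight 1/M here
}
```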

D. Fast and Efficient Variance Estimation for Ultra Data
Popular variance estimation methods include balanced repeated replication [37], bootstrap [38], and Jackknife methods [39]. The Jackknife variance estimation has been successfully implemented in P-FHDI and proved to be effective [1], and we modified the parallel Jackknife method in this study for better efficiency. Yet, it is not applicable to ultra data owing to its computational bottleneck. To overcome this challenge, we develop a linearized variance estimation technique based on one of the authors' dedicated prior works [40].

Consider variance estimation of a deterministic imputation. Assume the original sample is decomposed into G disjoint groups and the sample observations follow the cell mean model y_i = μ_g + e_i for i ∈ A_g, where A_g is the index set of group g. Let n_g and r_g be the number of elements and the number of observed elements in group g, respectively. Assume the response mechanism is MAR and the parameter of interest is θ = E(Y). We will use a deterministic imputation with η̂ = (μ̂_1, ..., μ̂_G), where μ̂_g = r_g^{-1} Σ_{i∈A_g} δ_i y_i is the gth cell mean of y among respondents. The imputed estimator of θ will be
$$\hat{\theta}_I = \frac{1}{n} \sum_{g=1}^{G} \sum_{i \in A_g} \left\{ \delta_i y_i + (1 - \delta_i)\, \hat{\mu}_g \right\}.$$
Writing e_ig = y_i − μ_g for i ∈ A_g and using E(e_ig) = 0, the uncertainty associated with n_g/r_g is asymptotically negligible. Thus, the imputed estimator θ̂_I can be expressed as
$$\hat{\theta}_I \approx \frac{1}{n} \sum_{g=1}^{G} \sum_{i \in A_g} \left( \mu_g + \frac{\delta_i}{p_g}\, e_{ig} \right),$$
where p_g is the probability limit of r_g/n_g. The resulting variance estimator is
$$\hat{V} = \frac{1}{n(n-1)} \sum_{i \in A} (\hat{d}_i - \bar{d})^2,$$
where d̂_i = μ̂_g + δ_i (y_i − μ̂_g)/p̂_g for i ∈ A_g, p̂_g = r_g/n_g, and d̄ is the sample mean of the d̂_i.

We can extend the linearization formula to the variance of the mean estimator of FHDI with multivariate missing data, where M imputed values are randomly selected from the respondents in the same imputation cell. The FHDI estimator of the mean of y_l is μ̂_{l,FHDI} given in the preceding subsection. Assume each unit i in the sample belongs to one and only one imputation cell. Let ȳ_gl be the sample mean of y_l among the respondents in cell g such that
$$\bar{y}_{gl} = \frac{1}{r_{gl}} \sum_{i \in S_{gl}} \delta_i\, y_{il},$$
where S_gl ∈ N^{n_gl} is the index set of the units in imputation cell g and r_gl is the number of observed units (i.e., δ_i = 1) in cell g of y_l. The variance estimator of μ̂_{l,FHDI} then takes the same linearized form with the cell means ȳ_gl in place of μ̂_g.

IV. PARALLEL ALGORITHMS OF UP-FHDI

Fig. 2 visualizes the parallel file system tailored for UP-FHDI. This system targets an HPC environment that allows simultaneous access from multiple compute servers. Suppose we have in total Q processors indexed by 0, ..., Q−1. The slave processors indexed by 1, ..., Q−1 perform the distributed computations, and the master processor (indexed by 0) is responsible for aggregating results from the slave processors. The MPI library [41] provides the basic routines MPI_Send and MPI_Recv (denoted as MPI_SR for brevity) and MPI_Bcast to support inter-processor communication. The master processor integrates pieces of results together by the operation MPI_Gather (denoted as Ω); the operation Ω_i^j (i ≠ 0) integrates the results generated by the slave processors indexed by i to j and stacks them on the master. Notably, the parallel file system supports IO communication between the slave processors and the hard drive by cooperative parallel IO in MPI, thereby overcoming out-of-memory problems. We adopt the IO routines MPI_File_Read (denoted as MPI_RD) and MPI_File_Write (denoted as MPI_WR) to simultaneously read or write from multiple processes to a single binary file on the hard drive. MPI parallel IO is superior to non-parallel IO since its operations are collective, making IO scalable. Still, a single user's intensive and simultaneous IO running on a small number of nodes can overload the metadata server and degrade the global file system's performance.
As a remedy, [42] developed the Optimal Overloaded IO Protection System (OOOPS) to automatically detect and throttle the IO workload. We apply this tool with UP-FHDI to dynamically adjust IO during the job without interruption. Hereafter, any vector or matrix is marked with a star symbol '*' at the upper-left corner when it is temporarily placed on the hard drive. The development of UP-FHDI was conducted and tested on the local HPC of ISU (Condo [43]) and the NSF cyberinfrastructure HPC (Stampede2 [44]). The benchmark tests across different HPC environments guarantee the compatibility of UP-FHDI.
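As a concrete illustration of the MPI_RD pattern and the UniformDistr task split, the sketch below shows one slave reading its contiguous column block from a single binary file. It is a simplified illustration with independent IO for brevity (the production file system uses collective, cooperative parallel IO); the file name "y.bin", the sizes, and the column-major layout are assumptions, not the UP-FHDI code itself.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int q, Q;
  MPI_Comm_rank(MPI_COMM_WORLD, &q);
  MPI_Comm_size(MPI_COMM_WORLD, &Q);
  if (Q < 2) { MPI_Finalize(); return 0; }  // master plus at least one slave

  const long long n = 1000, p = 100;        // illustrative sizes only
  // UniformDistr: split the p columns uniformly over the Q-1 slaves.
  const long long cols = p / (Q - 1);
  const long long s = (q - 1) * cols;
  const long long e = (q == Q - 1) ? p : q * cols;

  if (q != 0) {                             // slaves only; master is rank 0
    std::vector<double> block(static_cast<std::size_t>(n * (e - s)));
    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, "y.bin", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    const MPI_Offset off = static_cast<MPI_Offset>(n) * s * sizeof(double);
    // Read this slave's contiguous column block (count assumed to fit int).
    MPI_File_read_at(fh, off, block.data(), static_cast<int>(block.size()),
                     MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    // ... distributed computation on block, then MPI_WR back similarly ...
  }
  MPI_Finalize();
  return 0;
}
```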

A. Parallel Cell Construction
Let v be the number of reduced variables after SIS per each missing pattern. Algorithm 1 recaps the key sub-steps of parallel cell construction with SIS. The function CAT pre-determines the imputation cells z, and the function ZMAT decomposes z into the unique missing patterns z̃_M and the unique observed patterns z̃_R. Sequentially, we pair donors for each recipient by the function nDAU. Note that we search donors for each recipient based on the v selected variables when SIS is activated by setting v ≠ 0; otherwise, donors are determined based on all observed variables of a recipient. If there exists z_i ∈ z̃_M such that M_i < 2 (i.e., a deficient-donor situation), we adopt the KNN function until every recipient has at least two donors. Unless otherwise stated, UniformDistr(.) means a simple uniform task/data distribution scheme, and the indices s and e represent the beginning and the ending of the distributed work on processor q, respectively.

In lines 1 and 2 of Algorithm 2 (parallel function CAT), we read the distributed raw data *y^(q) from the hard drive to each slave processor. Considering the estimated distribution function F̂_k of y_k, for every y_k with k ∈ {s, ..., e} we construct proportions {q̂_a1, ..., q̂_aG} satisfying 0 < a_1 < ... < a_G in line 4, where the estimated sampling quantile of y_k for a_g is defined as q̂_ag = min{y : F̂_k(y) ≥ a_g}. From lines 5 to 9, we discretize y_k to z_k with respect to {q̂_a1, ..., q̂_aG}; that is, z_ik = g if q̂_a(g−1) < y_ik ≤ q̂_ag. In lines 11 and 12, we write the distributed imputation cells z^(q) from the slave processors to the hard drive and integrate them as *z = Ω_1^{Q−1} *z^(q).

In lines 1 and 2 of Algorithm 3 (parallel function ZMAT, with input *z and outputs *z̃_R and *z̃_M), we read the distributed imputation cells z^(q) to each slave processor. Let F ∈ N such that z = z_1 ∪ ... ∪ z_F. We iteratively compare z^(q) with z_f, f ∈ {1, ..., F}, in lines 3 to 6 to remove duplicates in z^(q) and thus obtain the distributed unique patterns z̃^(q); hence, z̃^(i) ∩ z̃^(j) = ∅ whenever i ≠ j. The distributed unique observed and missing patterns are finally written to the hard drive and integrated as *z̃_R and *z̃_M.

In Algorithm 4 (parallel function RANK, with input *y and output ranking list V), we read the distributed raw data y^(q) to each slave processor. Consider F ∈ N such that y_1 ∪ ... ∪ y_F = y. In line 6, we derive an index list v_k based on the computed correlations between y_k^(q) and y_f. If f equals one, V^(q) is initialized by including v_k, where k ∈ {1, ..., p/(Q − 1)}; otherwise, we update the existing v_k in V^(q) in contrast with the v_k generated in the current iteration.

In Algorithm 5 (parallel function nDAU), each slave compares its distributed unique missing patterns with the chunks of unique observed patterns by (26), where z̃_Rf,t ∈ z̃_Rf. Note that t ∈ Ã_f in (26) is mapped to t ∈ Ã_R globally before being added to D_k. When no variable reduction is applied by setting v = 0, the comparison uses all observed components of a recipient; otherwise, only the v selected variables are compared.

In Algorithm 6 (parallel function KNN), consider a recipient z̃_M,k^(q) that has M_k < 2. We initialize the global minimum ED (denoted as e_t1) and the global second-minimum ED (denoted as e_t2) as an infinitely large constant ε in line 4. Let t_1 ∈ Ã_R be the instance index that attains e_t1; likewise, let t_2 ∈ Ã_R attain e_t2 with t_1 ≠ t_2. Let F ∈ N such that z̃_R = z̃_R1 ∪ ... ∪ z̃_RF. From lines 5 to 10, we compute the ED e between z̃_M,k^(q) and z̃_R1, ..., z̃_RF iteratively by (2) and update t_1 and t_2 within the iterations. In line 11, we add t_1 to the donor list L and update M_k; if M_k is still less than two, we continue to add t_2 to L such that two donors are guaranteed. In line 16, we integrate L^(q) on the master processor and broadcast L to the slaves in line 17.

B. Parallel Imputation
The EM algorithm in the estimation of cell probability is an implicit and iterative process that does not support a simple divide-and-conquer parallelism. Hence, we only apply the necessary parallelisms to multiple linear search operations (see [1] for details of parallel cell probability). In line 1 of Algorithm 7, each slave is aware of the boundary indices of its distributed tasks. From lines 2 to 13, it selects FHDI donors among the FEFI donors for each recipient and computes the corresponding fractional weights. In line 3, it computes the fractional weights w*_kj,FEFI for FEFI donors by (8). If the kth recipient has no more than M donors such that M_k ≤ M, we adopt all possible donors in l_k, where l_k ∈ L is the donor index list for the kth recipient: in line 5, the imputed values are imported to memory regarding l_k, and the imputed values along with w*_kj,FEFI are added to Ŷ^(q) in line 6. If the kth recipient has more than M donors in l_k such that M_k > M, we apply the tailored systematic sampling scheme [18] to efficiently select M donors as l_M in line 8. In line 9, it computes the fractional weights w*_kj for FHDI donors by (10); likewise, the imputed values are imported to memory regarding l_M and added along with w*_kj to Ŷ^(q) in line 11. In lines 14 and 15, the distributed imputed values Ŷ^(q) are simultaneously written to the hard drive and integrated as *Ŷ.

C. Parallel Variance Estimation
Full details about the parallel Jackknife variance estimation are available in [1]. This section will focus on the parallel linearization techniques in Algorithm 8.

Algorithm 8: Parallel Linearized Variance Estimation.
Input: raw data *y, response indicator *r, imputation cells *z, imputed values *Ŷ. Output: variance estimator V of μ̂_FHDI. In lines 1 to 5 of Algorithm 8, we uniformly distribute the p variables (UniformDistr(p) → s, e) and read the distributed raw data y^(q), response indicator r^(q), imputation cells z^(q), and imputed values Ŷ^(q) from the hard drive, respectively. In line 6, one can determine the sample means ȳ^(q) ∈ R^{G×p/(Q−1)} among the respondents in different imputation groups by
$$\bar{y}_{gl} = \frac{1}{r_{gl}} \sum_{i \in S_{gl}} \delta_i\, y_{il},$$
where S_gl ∈ N^{n_gl} is the index set of the units in imputation cell g and r_gl is the number of observed units in cell g of y_l. However, a missing unit (i.e., δ = 0) with multiple donors may not belong to a single determinable imputation group. UP-FHDI therefore uses the conditional probability to determine the imputation cells for missing units; for instance, a recipient z_i = (z_i,mis, z_i,obs) with two donors is assigned according to the donors' conditional probabilities. Each slave then computes the linearized terms from ȳ_gl ∈ ȳ^(q) and ȳ*_il as defined in (19).
In line 9, the variance estimator V(μ̂_FHDI)^(q) ∈ R^{p/(Q−1)} on slave processor q is computed by the linearized formula, and we integrate V(μ̂_FHDI) ∈ R^p on the master processor in lines 10 and 11 such that the standard error is computed by SE = √V.
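A minimal serial sketch of the per-variable linearized variance computation in Algorithm 8 follows. It uses the pseudo-value expansion reconstructed in Section III-D, assigns each unit to a single cell for simplicity, and uses hypothetical names (linearizedVariance, resp, cell); it is an illustration, not the parallel UP-FHDI code.

```cpp
#include <cstddef>
#include <vector>

// Linearized variance estimator for the mean of one variable y_l:
// pseudo-values d_i = ybar_g + delta_i (y_i - ybar_g) / phat_g, then the
// usual variance of the mean over the pseudo-values.
double linearizedVariance(const std::vector<double>& y,
                          const std::vector<int>& resp,   // delta_i in {0,1}
                          const std::vector<int>& cell,   // g(i) in 0..G-1
                          std::size_t G) {
  const std::size_t n = y.size();
  std::vector<double> sum(G, 0.0);
  std::vector<std::size_t> r(G, 0), ng(G, 0);
  for (std::size_t i = 0; i < n; ++i) {
    ++ng[cell[i]];
    if (resp[i]) { sum[cell[i]] += y[i]; ++r[cell[i]]; }
  }
  std::vector<double> ybar(G, 0.0), phat(G, 1.0);
  for (std::size_t g = 0; g < G; ++g) {
    if (r[g] > 0) ybar[g] = sum[g] / r[g];               // cell respondent mean
    if (ng[g] > 0) phat[g] = static_cast<double>(r[g]) / ng[g];
  }
  std::vector<double> d(n);
  double dbar = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    const int g = cell[i];
    d[i] = ybar[g] + (resp[i] ? (y[i] - ybar[g]) / phat[g] : 0.0);
    dbar += d[i];
  }
  dbar /= n;
  double ss = 0.0;
  for (std::size_t i = 0; i < n; ++i) ss += (d[i] - dbar) * (d[i] - dbar);
  return ss / (static_cast<double>(n) * (n - 1));        // Vhat of the mean
}
```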

V. VALIDATION
This section conducts Monte Carlo (MC) simulations, a comparison between the Jackknife and linearized variance estimation, and a performance comparison of UP-FHDI against baseline imputations (the naive method and an advanced machine learning-based method) with large synthetic and real-world datasets. We provide a step-by-step illustration of how to use UP-FHDI in APPENDIX I, available in the online supplemental material.

A. Monte Carlo Simulations versus UP-FHDI
Let U(n, p, η) denote a finite population with n instances and p variables subject to a missing rate of η in proportion. Unless otherwise stated, missingness under MCAR is created by the Bernoulli distribution as δ_il ~ Ber(τ), where τ is the probability of response. Let B be the MC simulation size. The MC simulations follow these steps: (MC1) Generate synthetic data by the generating model, where e_1, e_2, and e_4 are randomly generated by the standard normal distribution and the term e_3 is randomly generated by the gamma distribution; thus, μ_1 = 1, μ_2 = 2, μ_3 = 2, and μ_4 = 0.
Each MC sample is then cured with UP-FHDI, and the relative bias (RB) of the variance estimator is evaluated, where SE-bar is the square root of the variance estimator V(Ŷ_FHDI) and SE(Ŷ_FHDI) is the square root of the sampling variance of the mean estimator Ŷ_FHDI,
$$SE(\hat{Y}_{FHDI}) = \left[ \frac{1}{B} \sum_{b=1}^{B} \left( \hat{Y}^{(b)}_{FHDI} - \bar{Y}_{FHDI} \right)^2 \right]^{1/2},$$
where Ŷ^(b)_FHDI is the imputed point estimator from the bth MC sample and Ȳ_FHDI is given by Ȳ_FHDI = B^{-1} Σ_{b=1}^{B} Ŷ^(b)_FHDI.

We present the MC simulation results of UP-FHDI without SIS in Table I; likewise, the MC results using SIS are shown in Table II. The mean estimates Ȳ_FHDI from the MC simulations show negligible differences from the true column means μ, and thus the good accuracy of the UP-FHDI imputation is affirmed. The average RBs in Tables I and II are 1.96% and 2.87% (less than 5 percent), validating the Jackknife variance estimator.

B. Linearized Variance Estimation versus Jackknife
In Fig. 3, we investigate whether the linearized variance estimator is an accurate substitute for the Jackknife variance estimator. The absolute difference of standard errors (ADSE) is defined as
$$ADSE = \frac{1}{p} \sum_{l=1}^{p} AD_l,$$
where AD_l is the absolute difference between SE_Linear,l and SE_Jack,l. Note that SE_Linear,l and SE_Jack,l are the standard errors of the mean estimator of y_l using the linearization techniques and the Jackknife method, respectively. Remarkably, the ADSE between the Jackknife and the linearization gradually converges to an acceptable datum as datasets become larger. Meanwhile, Table III confirms the alignment between the linearized and Jackknife variance estimators with a minor ADSE on an ultra dataset. Overall, the linearized variance estimator is a good alternative to the Jackknife variance estimator for ultra data (e.g., n ≥ 20,000). It should be noted that the linearized variance estimation may not be recommended for curing small-sized data due to its substantial difference from the Jackknife (see the large ADSE in the left range of Fig. 3). Although the Jackknife is recommended for small- to medium-sized datasets, APPENDIX II, available in the online supplemental material, affirms a significant advancement in the computational efficiency of the linearization techniques.

C. Baseline Imputations Versus UP-FHDI With Large Synthetic and Real-World Datasets
Next, we validate the performance and accuracy of UP-FHDI with large real-world and ultra synthetic datasets. Table IV presents the adopted incomplete datasets, which are publicly accessible in IEEE DataPort [45]. Some variables in the real-world datasets have significantly skewed distributions, as shown in APPENDIX III, available in the online supplemental material. For generating synthetic data, (38) is used until we obtain in total p variables, where ρ = 0.5; e_i, e_{i+1}, and e_{i+3} are randomly generated by the normal distribution with a user-defined standard deviation (SD), and the term e_{i+2} is randomly generated by the gamma distribution. We create 30% missingness in all datasets by different missing mechanisms: MCAR and missing not at random (MNAR). Although missingness is generated by MCAR for simplicity, UP-FHDI should hold for response models based on MAR. Table IV summarizes the resultant synthetic data and the adopted real-world datasets used to compare the performance of UP-FHDI and the baseline imputation methods, including naive imputation and the recently proposed Generative Adversarial Imputation Network (GAIN). The naive imputation adopts a simple mean estimator computed using observed values. GAIN is a GAN-based framework that employs an imputer network to handle missing data [19]. Experiments show that GAIN outperforms many state-of-the-art imputation techniques, and a summary of the key theories of GAIN is presented in APPENDIX IV, available in the online supplemental material. This study adopts the default settings to build the GAIN model. Considering the stochastic nature of GAIN, we conduct ten experiments for each dataset and average the performance measures. Let the average standard error of the mean estimator be SE-bar = p^{-1} Σ_{l=1}^p SE_l, and let RMSE be given by
$$RMSE = \left[ \frac{1}{n_{mis}} \sum_{\delta_{il} = 0} (y_{il} - \hat{y}_{il})^2 \right]^{1/2},$$
where y_il and ŷ_il are units of the original data and the imputed values normalized by Min-Max normalization, respectively, and n_mis is the total number of missing units. Using the large real-world datasets (Earthquake, Bridge Strain, Travel Time, and CT Slices), UP-FHDI performs comparably to GAIN regarding the RMSE results. Note that the default-setting GAIN aborted when applied to Swarm, p53, and Radar (the last three rows in Table V); it appears that GAIN may require specific extensions for large/ultra data curing. Tables V and VI show that UP-FHDI consistently outperforms naive imputation with smaller SE-bar and RMSE on the large real-world and ultra synthetic datasets, respectively. Similar performance of UP-FHDI under the MNAR assumption can be found in APPENDIX V, available in the online supplemental material.
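For clarity, a minimal sketch of the RMSE measure is given below, assuming (as reconstructed above) that the error is accumulated only over originally missing entries after column-wise Min-Max normalization; imputationRmse and the container layouts are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// RMSE between true and imputed values over originally missing cells,
// with each column Min-Max normalized by its observed range in truth.
double imputationRmse(const std::vector<std::vector<double>>& truth,
                      const std::vector<std::vector<double>>& imputed,
                      const std::vector<std::vector<int>>& resp) {
  const std::size_t n = truth.size(), p = truth.front().size();
  double ss = 0.0;
  std::size_t nMis = 0;
  for (std::size_t l = 0; l < p; ++l) {
    double lo = truth[0][l], hi = truth[0][l];
    for (std::size_t i = 1; i < n; ++i) {
      lo = std::min(lo, truth[i][l]);
      hi = std::max(hi, truth[i][l]);
    }
    if (hi <= lo) continue;                 // skip constant columns (0/0)
    for (std::size_t i = 0; i < n; ++i) {
      if (resp[i][l]) continue;             // only originally missing cells
      const double a = (truth[i][l] - lo) / (hi - lo);
      const double b = (imputed[i][l] - lo) / (hi - lo);
      ss += (a - b) * (a - b);
      ++nMis;
    }
  }
  return nMis ? std::sqrt(ss / nMis) : 0.0;
}
```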

VI. COST ANALYSIS AND SCALABILITY
The cost model is important since it gives readers insight into the computational performance of UP-FHDI. Given a constant time for a unit operation, one can build a cost model based on the total number of unit operations in computation and communication. Let α be the basic computational cost per unit. Although there exist both intra-node and inter-node communication transfer costs, for simplicity let β be the overall communication transfer cost per unit and L be the communication startup cost. Likewise, IO communication can be approximated by a startup cost and a unit transfer cost in the same way as inter-node communication. The IO communication transfer cost can be classified into contiguous (denoted as γ_1) and non-contiguous types (denoted as γ_2). Contiguous IO makes a single IO request and allows access to a contiguous chunk of data at a time, whereas non-contiguous IO must make separate function calls to access each individual contiguous piece iteratively, and thus γ_2 ≫ γ_1 [49]. Let L_1 and L_2 be the associated startup costs for contiguous and non-contiguous IO, respectively; typically, L_2 ≫ L_1. Let s = max(n, p); the total running time T(Q) with Q available processors can then be expressed by (40) and (41), where c_p ∈ N+. See the detailed derivation of (40) and (41) in the online supplemental material.

VII. IMPACT OF UP-FHDI ON DEEP LEARNING
The impact of FHDI on subsequent ML has been extensively discussed in [3] with datasets of moderate size. Following a similar methodology, we investigate the impact of UP-FHDI on deep learning with ultra incomplete datasets. We segment the investigation into the following procedures:
i) Generate an ultra synthetic dataset U(n, p) whose p variables are indexed by {0, 1, ..., p − 1}. Split U(n, p) into the disjoint U_1(0.7n, p) and U_2(0.3n, p) such that U_1 ∪ U_2 = U.
ii) Perform the two-staged feature selection on U(n, p) with respect to the target variable indexed by p − 1, obtaining a list of important features v ∈ N^v. Note that the target variable is guaranteed to be included in v.
iii) Create 30% missingness in U_1(0.7n, p) and cure it with either of the imputation methods to derive the imputed dataset Ū_1(0.7n, p).
iv) Extract the important features from U_1(0.7n, p), Ū_1(0.7n, p), and U_2(0.3n, p) with respect to v, yielding the training datasets U_1(0.7n, v) and Ū_1(0.7n, v) and a test dataset U_2(0.3n, v).
v) Build deep learning models with the training datasets U_1(0.7n, v) and Ū_1(0.7n, v), respectively. Test the different predictive models on the same U_2(0.3n, v) and compute
$$nRMSE = \left[ \sum_i (y_i - \hat{y}_{i,cured})^2 \right]^{1/2} \Big/ \left[ \sum_i (y_i - \hat{y}_{i,ori})^2 \right]^{1/2}, \quad (42)$$
where y_i is the ith unit of the target variable in U_2(0.3n, v), and ŷ_i,cured and ŷ_i,ori are the predictions of y_i based on the models trained with Ū_1(0.7n, v) and U_1(0.7n, v), respectively.
For the deep learning predictions, this study adopts the R package h2o [50]. We use the ReLU activation function, four hidden layers, and twenty neurons in each layer for the implementation.
As a complement to the above investigation, the two-staged feature selection is presented as follows. Ultrahigh-dimensional data (e.g., p > 10,000) may pose challenges to the direct use of existing ML methods. Feature selection has proven effective in removing redundant features to improve subsequent learning performance [51]. Generally, feature selection techniques can be classified into three groups: (i) wrapper methods [52], [53], (ii) filter methods [54], [55], and (iii) embedded methods [30], [31]. To investigate the impact of UP-FHDI on deep learning, we need a subset of v features such that v ≪ p. This paper proposes a two-staged feature selection method that leverages mutual information (shrinking the large-sized feature set to a medium-sized one) and then the graphical LASSO (shrinking the medium-sized set to a final subset of v features).

First, we develop a parallel filter method based on mutual information (MI) to reduce to v_I (e.g., up to 100) features. The MI provides a measure of how much useful information about the target variable is conveyed by including a predictor; hence, including features with high MI helps boost prediction accuracy. Consider y_I = {y_0, ..., y_{p−2}} as the predictors and y_{p−1} as the target variable. After distributing the predictors over the slave processors and discretizing them, each slave q computes the MI between its distributed predictors and the target by
$$MI^{(q)}_l = \sum_{g} \sum_{k} p(g, k) \log \frac{p(g, k)}{p(g)\, p(k)},$$
where p(g) and p(k) are the marginal probabilities and p(g, k) is the joint probability. (MI3) Aggregate MI^(q) on the master processor. (MI4) Select the v_I features with the top MI with respect to MI.

Second, we build graphical models based on the inverse covariance matrix (denoted as Σ^{-1}) to further reduce the v_I features to a final small-sized subset of v features. The inverse covariance matrix, commonly referred to as the precision matrix, displays information about the partial correlations of variables: a zero element of Σ^{-1} implies the conditional independence of two variables given the rest. The sparsity pattern of the precision matrix Σ^{-1} reflects the graphical structure as a consequence of the Hammersley-Clifford theorem [56]. We use the R package glasso [57] to estimate a sparse precision matrix Σ^{-1} using a lasso (L1) penalty (see [58] for details about the graphical lasso). Assume N multivariate normal observations of dimension p such that X_i ~ N_p(μ, Σ), where i = 1, ..., N and μ and Σ are the unknown mean and covariance matrix. Let Θ = Σ^{-1} be the precision matrix. The estimation of Θ follows these steps:
1) Let W = S + ρI be the estimate of Σ, where S is the empirical covariance matrix and ρ is a regularization parameter. Partition W and S as
$$W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}, \quad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^T & s_{22} \end{pmatrix}.$$
The diagonal entries of W remain unchanged throughout the following steps.
2) For j = 1, ..., p, solve the lasso problem
$$\hat{\beta} = \arg\min_{\beta} \left\{ \frac{1}{2} \left\| W_{11}^{1/2} \beta - W_{11}^{-1/2} s_{12} \right\|^2 + \rho \|\beta\|_1 \right\}. \quad (46)$$
Since we compute β̂ by solving (46), θ̂_12 has many zeros, and thus the estimated Θ becomes sparse. One can gradually increase ρ to make Θ sparser until we obtain v selected features connected to the target variable, as shown in Fig. 6. Regarding memory usage and speedup, the parallel MI is linearly scalable with ultra data and considers the relationships between the predictors and the target. By contrast, the glasso package is not directly applicable to ultra data but takes the relationships among the medium-sized set of variables into consideration. We illustrate the use of the parallel MI program and the glasso package in APPENDIX VII and VIII, respectively, available in the online supplemental material. After the two-staged feature selection, this study follows the proposed investigation procedures with the synthetic datasets U(0.1 M, 10000) generated by (38).
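A minimal serial sketch of the per-feature MI computation in the first stage is shown below. The discretization into G categories and the function name mutualInformation are assumptions; the parallel version would simply distribute the predictors over the slave processors as described above and rank features by the returned MI values.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Empirical mutual information between a discretized predictor x and the
// discretized target t, both with categories labeled 0..G-1:
// MI = sum_{g,k} p(g,k) log( p(g,k) / (p(g) p(k)) ).
double mutualInformation(const std::vector<int>& x,
                         const std::vector<int>& t,
                         int G) {
  const std::size_t n = x.size();
  std::vector<double> px(G, 0.0), pt(G, 0.0);
  std::vector<std::vector<double>> pxt(G, std::vector<double>(G, 0.0));
  for (std::size_t i = 0; i < n; ++i) {
    px[x[i]] += 1.0 / n;                     // marginal p(g)
    pt[t[i]] += 1.0 / n;                     // marginal p(k)
    pxt[x[i]][t[i]] += 1.0 / n;              // joint p(g, k)
  }
  double mi = 0.0;
  for (int g = 0; g < G; ++g)
    for (int k = 0; k < G; ++k)
      if (pxt[g][k] > 0.0)
        mi += pxt[g][k] * std::log(pxt[g][k] / (px[g] * pt[k]));
  return mi;  // rank the predictors by MI and keep the top v_I of them
}
```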
The generating formula indicates that seven variables are closely connected to the target variable, and this fact is well captured by the two-staged feature selection. In particular, the final graphical model of the seven important features (nodes indexed by 9992-9998) connected to the target variable (node indexed by 9999) is visualized as a network in Fig. 6, confirming the validity of the two-staged feature selection. Lastly, we investigate the impact of different imputation methods on deep learning predictions in terms of the normalized RMSE (see Table VIII). In essence, an nRMSE closer to 1 means that an imputation method does not disturb (or negatively impact) the prediction performance. Given (38), we prepared three ultra datasets by varying the underlying SD used in the synthetic data generation. Across all three ultra datasets, Table VIII shows that UP-FHDI positively impacts deep learning performance compared to the naive method. The difference between UP-FHDI and naive imputation may not be significant due to the simple structure of the adopted synthetic data.

VIII. FUTURE RESEARCH
There is room to improve UP-FHDI's scalability due to the non-contiguous IO communication between the nodes and the hard disk drive. An optimal workflow shall upgrade the parallel file system for more efficient IO communication. With new theories, the adopted Euclidean distance-based KNN should be improved to embrace fully categorical datasets.

IX. CONCLUSION
In a new era of Big Data and powerful computing, there is a strong need for a Big Data-oriented imputation paradigm for data-driven scientific discovery. By inheriting all the existing functionalities of the general-purpose, assumption-free data-curing tool P-FHDI, this paper proposed the ultra data-oriented P-FHDI (UP-FHDI). We documented the full details of UP-FHDI with respect to the adopted parallel file system, the ultra data-oriented parallelisms, the parallel linearization techniques, and the impacts on subsequent deep learning. The Monte Carlo simulations and comparative studies against baseline imputation methods affirm the validity of UP-FHDI, and the analytical cost models exhibit its promising scaling performance. This program enables big incomplete data imputation and efficient parallel variance estimation, and ultimately improves subsequent deep learning with the cured Big Data. We share all the developed codes, examples, and documents in [59].

Jae Kwang Kim received the PhD degree in statistics from ISU in 2000. He is a fellow of the American Statistical Association and the Institute of Mathematical Statistics and is currently a LAS dean's professor with the Department of Statistics, ISU. His research interests include survey sampling, statistical analysis with missing data, measurement error models, multi-level models, causal inference, data integration, and ML.
In Ho Cho received the PhD degree in civil engineering and minor in Computational Science and Engineering from California Institute of Technology, in 2012. He is currently an associate professor of CCEE Department, ISU. His research interests include data-driven engineering and science, computational statistics, computational science and engineering, and parallel computing.