Systematic Review of Advanced AI Methods for Improving Healthcare Data Quality in Post COVID-19 Era

At the beginning of the COVID-19 pandemic, there was significant hype about the potential impact of artificial intelligence (AI) tools in combating COVID-19 through diagnosis, prognosis, or surveillance. However, AI tools have not yet been widely successful. One key reason is that the COVID-19 pandemic demanded faster, real-time development of AI-driven clinical and health support tools, including rapid data collection, algorithm development, validation, and deployment, which left too little time for proper data quality control. Learning from the hard lessons of COVID-19, we summarize the important health data quality challenges during the COVID-19 pandemic, such as lack of data standardization, missing data, tabulation errors, and noise and artifacts. We then conduct a systematic investigation of computational methods that address these issues, including emerging advanced AI data quality control methods that achieve better data quality outcomes and, in some cases, simplify or automate the data cleaning process. We hope this article can assist the healthcare community in improving health data quality going forward with novel AI development.


I. INTRODUCTION
The Coronavirus Disease 2019 (COVID-19) pandemic has caused significant morbidity and mortality around the world, which prompted artificial intelligence (AI) research to develop software tools to help combat the disease. However, most of these tools have been largely unsuccessful due to the limited availability of high-quality, large-scale, and timely data to investigators [1], [2], [3]. Although large quantities of COVID-19 health data have been collected in real-time since the beginning of the pandemic, data collection was generally siloed and many public repositories contained unreliable or unharmonized datasets [2]. AI tools developed using dirty data can be biased and fail upon validation or deployment (Fig. 1, Suppl. Fig. 1), leading to an opportunity cost in terms of the possibility for enhanced technology-driven COVID-19 surveillance, triaging, diagnosis, and prognosis. In this review, we expand on our previous data quality control reviews [5], [6], [7], [8], [9], [10], our past research on data quality [5], [6], [11], [12], [13], [14], [15] and our previous work discussing state-of-the-art AI methods contextualized in the COVID-19 pandemic [4]. We discuss four common data quality issues: a) lack of standardization of data from different sources, b) missing data, c) tabulation errors, and d) noise and artifacts. For each data quality issue, we describe how this issue is relevant to COVID-19 health data and provide a systematic review of advanced quality control tools that address the issue with superior performance compared to classical methods. These four categories span several large, national datasets collected for the purpose of COVID-19 public health surveillance that present unique challenges in approaching quality control depending on the data modality and its potential source of error.
Accordingly, in each section we discuss the classical and advanced approaches particular to the most common data modalities relevant to the main quality control topic. Other critical aspects of health data quality control for machine learning include topics such as data safety, class imbalance, data duplication, and many more general best practices for handling data. To stay within the scope of our literature search, however, our discussion will center on the four main categories and data modalities introduced above. A select, non-exhaustive set of available COVID-19 health datasets is presented in Table I (short version), with a more detailed version available in Suppl. Table I.

II. PRISMA FRAMEWORK
We used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework (Fig. 2) to systematically identify and review original research articles presenting novel deep learning-based approaches towards data quality control, specifically focusing on the following data quality issues: missing data, noise and artifacts, tabulation errors, and lack of data standardization. The systematic database search included Scopus, PubMed, and the Institute of Electrical and Electronics Engineers (IEEE). The search was restricted to articles published in or after 2018 that contained the terms listed in Suppl. Table II. Search terms were matched against a) the abstract, title, or article keywords for Scopus, b) the abstract or title only for PubMed, and c) the abstract only for IEEE. All 3,345 articles found across all three databases for the different search terms were saved into a table. An additional set of 239 articles was found using approaches that did not involve a database search, such as ancestry and descendancy approaches. The total number of articles after combining these was 3,584. After removing duplicated entries, the total number of articles was 2,517. The first round of screening was by title. Exclusion criteria included review papers, articles that were not written in English, and articles not related to deep learning or data quality control. After the first round of screening, a total of 2,199 articles were excluded and 318 were included. Further detailed inclusion and exclusion criteria are described in Suppl. Table III.
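The article counts reported in this section can be cross-checked with simple arithmetic. The short sketch below uses only the figures stated above; the number of duplicates removed is derived from those figures rather than reported, and the variable names are our own.

```python
# Counts as reported in the text of this section.
database_hits = 3345          # Scopus + PubMed + IEEE
other_sources = 239           # ancestry/descendancy searches
combined = database_hits + other_sources

after_dedup = 2517
# Derived quantity (not stated in the text): duplicates removed.
duplicates_removed = combined - after_dedup

excluded_title_screen = 2199
included_title_screen = after_dedup - excluded_title_screen

print(combined, duplicates_removed, included_title_screen)
```

The totals are internally consistent: the combined count matches the reported 3,584, and the title screen leaves the reported 318 articles.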

III. QUALITY CONTROL FOR LACK OF DATA STANDARDIZATION
Data standardization refers to harmonizing data from different sources into a cohesive dataset for use in downstream analyses. Lack of standardization is a major challenge for EHR data because different healthcare systems and clinics can use different EHR software with different data formats and data fields. A key example of the lack of consistent COVID-19 EHR data across institutions is illustrated by the work of the Consortium for Clinical Characterization of COVID-19 by EHR (4CE) [28], which is one of hundreds of efforts to aggregate COVID-19 data. The researchers collected aggregated data from 100 hospitals in several different countries and found substantial differences between the datasets, such as variations in units for certain data fields, different code systems used for laboratory tests and diagnoses, and the lack of metadata on hospital-specific reference ranges. Lack of standardization is also a significant challenge with medical image data, and this has been especially problematic for COVID-19 related research. Publicly available databases were set up that allowed anyone to upload COVID-19 chest X-ray or CT images, and many of the uploaded datasets contained images from heterogeneous sources [2]. Images produced in different healthcare institutions might use different scanning machines, settings, or protocols that influence the image, making it possible for a machine learning algorithm to learn a biased or incorrect classification rule from the data. A summary of how the issue of lack of data standardization is relevant for various types of COVID-19 health data is presented in Table I (short version) and Suppl. Table IV (detailed). The following subsections provide a detailed analysis of classical and advanced approaches to manage standardization issues, with a focus on EHR and medical image data.
Specific examples of comparisons between classical and advanced methods, along with an illustration of the issue of lack of standardization, are detailed in Suppl. Fig. 2. Advanced methods are further summarized in Suppl. Table V and Fig. 1 (Right), with a visual timeline presented in Fig. 3. The issue of lack of standardization is further visualized in Fig. 4.

A. Classical Approaches for Handling Unstandardized Data
EHR data standardization challenges are in part being overcome using standards such as HL7's Fast Healthcare Interoperability Resources (FHIR) [29] and SMART-on-FHIR [30] that harmonize EHR data from different sources to facilitate data sharing. For researchers aiming to harmonize distinct EHR datasets that are not already harmonized through a unified data sharing framework such as FHIR, significant pre-processing is needed to transform the data into standardized formats. This is generally a time-consuming, manual task that requires both syntactic and semantic data transformations. Syntactic data transformations consist of transforming the data from one format to another, such as changing a data table from a long and narrow format to a short and wide format [31], as exemplified in Suppl. Fig. 3. These transformations typically require the user to specify the total scope of the operations allowed in a table, such as splitting or merging columns, or string manipulations. Semantic data transformations, on the other hand, require external information, such as a mapping from ICD-10 codes to disease names for diagnoses, as shown in Suppl. Fig. 4.
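To make the distinction concrete, the following minimal sketch contrasts a syntactic transformation (pivoting a long table into a wide one) with a semantic transformation (decoding ICD-10 codes through an external lookup). The records, field names, and helper functions are hypothetical and chosen only for illustration; the two-entry code table is a tiny excerpt, not a complete mapping.

```python
from collections import defaultdict

# Hypothetical long-format lab records: one row per (patient, test).
long_rows = [
    {"patient_id": "p1", "test": "crp", "value": 12.0},
    {"patient_id": "p1", "test": "ldh", "value": 250.0},
    {"patient_id": "p2", "test": "crp", "value": 3.5},
]

def long_to_wide(rows, key="patient_id", column="test", value="value"):
    """Syntactic transformation: pivot long rows into one record per key."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[key]][row[column]] = row[value]
    columns = sorted({row[column] for row in rows})
    # Absent measurements become None so every record has the same columns.
    return {k: {c: v.get(c) for c in columns} for k, v in wide.items()}

# Semantic transformation: requires an external lookup table.
# Illustrative excerpt only; a real mapping would come from a code system release.
ICD10_TO_NAME = {
    "U07.1": "COVID-19, virus identified",
    "J18.9": "Pneumonia, unspecified organism",
}

def decode_diagnoses(codes, mapping=ICD10_TO_NAME):
    """Replace each code with its name, flagging codes missing from the table."""
    return [mapping.get(code, "UNMAPPED") for code in codes]

wide = long_to_wide(long_rows)
names = decode_diagnoses(["U07.1", "X99"])
```

The pivot needs no outside knowledge, whereas the decoding step fails visibly ("UNMAPPED") whenever the external table does not cover a code, which is why semantic harmonization tends to be the more labor-intensive of the two.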
For medical image standardization, the Digital Imaging and Communications in Medicine (DICOM) Standard [32] is a framework developed to allow for easy image storage and exchange for medical images from diverse vendors. The DICOM Grayscale Standard Display Function (GSDF) has been shown to increase visual consistency across medical images, but only improves the luminance response, which is just one of many factors that influence the quality of a medical image, including reflection, spatial resolution, noise, geometrical distortions, display chromaticity, veiling glare, and temporal response [33]. Thus, DICOM GSDF is useful for medical image standardization but not the most extensive solution. Another classical approach is using histogram matching [34], [35], which involves transferring a source image into a target domain by setting up an intensity histogram for each image domain and then matching the histograms between two images. This has been used to standardize the luminance value and saturation distributions for different images, but has sub-optimal performance, does not consider all factors that influence image quality, and can sometimes produce artifacts [34], [35].
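As a rough illustration of histogram matching, the sketch below maps each source intensity to the reference intensity at the same cumulative-distribution level. It operates on flat lists of integer intensities and is a simplified, assumed formulation; the cited works [34], [35] may use different variants, and the function names are our own.

```python
from bisect import bisect_left

def cdf(pixels, levels=256):
    """Cumulative distribution of integer intensities in [0, levels)."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    total, running, out = len(pixels), 0, []
    for c in counts:
        running += c
        out.append(running / total)
    return out

def match_histogram(source, reference, levels=256):
    """Map each source intensity v to the smallest reference intensity
    whose CDF value is at least the source CDF value at v."""
    src_cdf, ref_cdf = cdf(source, levels), cdf(reference, levels)
    lookup = [
        min(bisect_left(ref_cdf, src_cdf[v]), levels - 1)
        for v in range(levels)
    ]
    return [lookup[p] for p in source]

# Toy example: a dark source image takes on the reference's intensity range.
matched = match_histogram([0, 0, 1, 1], [10, 10, 20, 20])
```

Because the mapping depends only on global intensity statistics, it cannot correct spatially varying differences between scanners, which is consistent with the limitations noted above.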

B. Advanced Approaches for Handling Unstandardized Data
Recent advances have facilitated use of datasets from diverse sources for jointly training machine learning models, without the need to share data, hence having the potential to save researchers significant time and effort that would otherwise be spent on manual data harmonization. In 2017, Google published a blog post introducing an approach called Federated Learning (FL) [36], which allows for a centralized machine learning model to be jointly trained across several distributed clients without the need for data sharing. Because data sharing is avoided, each healthcare institution (i.e., client) utilizes its own siloed dataset for training the joint model, and thus an added advantage is that there is no need to manually combine and harmonize the healthcare data from each independent institution into a standardized dataset. While FL precludes the need to manually combine and harmonize data from different clients for joint model training, an important consideration is that the distribution of data between clients is generally statistically heterogeneous and thus not independent and identically distributed (non-IID), posing challenges for model convergence [37]. However, methods have recently been developed to address these issues in FL whilst continuing to avoid the need for data sharing [37], [38], [39].
Recently, FL has been successfully applied for training a COVID-19 prognosis clinical decision support algorithm using chest X-ray and electronic health records from a large number of healthcare institutions. An FL multi-modal neural network model called EXAM (Electronic medical record chest X-ray Artificial intelligence Model) [40] was trained to predict the future oxygen requirements and 24 or 72-hour prognosis of COVID-19 patients using data from 20 different healthcare institutions without the need for data sharing or harmonization. Model training consisted of several "rounds", each of which required training locally using each institution's private data and servers for one epoch, and then sending the updated model parameters for all the local models back to a centralized server to be aggregated. The FL approach enabled use of large quantities of data, allowing EXAM to generalize and achieve a 16% average improvement in AUC over the models trained individually at each site. It serves as an example of how FL can be used to jointly train machine learning models without the need to share data, thus precluding the time- and effort-intensive process of manually combining and harmonizing the data. While FL is not itself a tool for data harmonization, it enables machine learning models to be trained across large numbers of institutions with private datasets, while precluding the need to harmonize the data to begin with.
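The round structure described above can be sketched with a toy one-parameter model. The sketch assumes that the server aggregates by sample-size-weighted averaging of client parameters (federated averaging); the aggregation rule actually used by EXAM is not stated here, and the data, learning rate, and function names are hypothetical.

```python
def local_epoch(w, data, lr=0.01):
    """One pass of SGD for a 1-parameter model y = w * x on a client's private data."""
    for x, y in data:
        grad = 2 * (w * x - y) * x   # gradient of the squared error
        w -= lr * grad
    return w

def federated_round(global_w, clients, lr=0.01):
    """Each client trains locally for one epoch starting from the global weight;
    the server then averages the returned weights, weighted by sample count."""
    updates = [(local_epoch(global_w, data, lr), len(data)) for data in clients]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Hypothetical clients whose private data follow y = 3x; raw data are never pooled.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5), (3.0, 9.0)],
]

w = 0.0
for _ in range(200):
    w = federated_round(w, clients)
```

Only model parameters cross institutional boundaries in this loop, which is the property that removes the need to merge and harmonize the underlying records.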
Issues with medical image standardization can be addressed using novel image-to-image translation approaches. This field was launched with a seminal work by Isola et al. published in 2016 (Pix2Pix [41]), which uses a conditional generative adversarial network (cGAN) neural architecture to learn a loss function for a mapping from an input image in one domain to an output image in another domain. The model allows input black and white images to be converted into output color images, or input daylight photos to be transferred into the equivalent nighttime photo, for example. A standard GAN comprises a generator subnetwork that generates samples from random input data and a discriminator that takes as input both real and synthesized samples and classifies whether a given sample is real. Both the generator and discriminator have contrasting objectives, where the goal of the generator is to produce samples that the discriminator incorrectly classifies as being real. To better adapt standard GANs for image-to-image translation tasks, cGANs use labeled (i.e., paired) training examples, whereby the label image is the version of the input image in the desired domain and the generator learns to produce a new image conditioned on the label image. While cGAN was one of the first models designed for image translation tasks, it is limited by the need for paired training examples, which are not generally readily available in medical image datasets. After the introduction of Pix2Pix, many works in medical imaging followed suit with GAN-based algorithms for image-to-image translation. For example, Wolterink et al. [42] and Emami et al. [43] used GANs to generate synthetic computed tomography (CT) images from input magnetic resonance (MR) images. Recent works have also used GANs to generate MR [44] or CT [45], [46] images from positron emission tomography (PET) images.
A novel GAN architecture called CycleGAN was introduced in 2017 by Zhu et al. [47] and allows for image-to-image translation without the need for paired training examples. Previous approaches might require, for example, each training sample to comprise images of the same patient taken using both MR and CT imaging (i.e., paired samples). In contrast, CycleGAN uses information about the distribution of images from the two different domains, thus precluding the need for paired samples. Specifically, to train this model, unpaired training example images from both the source domain and the desired domain are needed. There are two generators, one of which generates synthetic samples of one domain using input samples from the other domain, and the other of which reconstructs the input from the synthetic samples. In addition to the standard GAN adversarial loss, CycleGAN also uses two cycle consistency loss terms, each of which minimizes the reconstruction loss of translating each synthetic image back to its original source domain. These loss terms enable image-to-image translation without the need for paired samples. Recent works in medical image-to-image translation are based on this architecture, such as CyTran [48] and the work by Wolterink et al. [42].
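The two cycle consistency terms can be illustrated without any neural network. In the sketch below, the generators are hypothetical stand-in functions on flat pixel lists, and the reconstruction error is measured with a mean absolute (L1) difference; it shows only how the loss is assembled, not how the generators are trained.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized images (flat lists)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(G, F, batch_x, batch_y):
    """Sum of the two reconstruction terms: X -> Y -> X and Y -> X -> Y.
    G maps domain X to Y and F maps domain Y to X; the batches are unpaired."""
    forward = sum(l1(F(G(x)), x) for x in batch_x) / len(batch_x)
    backward = sum(l1(G(F(y)), y) for y in batch_y) / len(batch_y)
    return forward + backward

# Hypothetical stand-ins for trained generators (not real networks).
G = lambda img: [2.0 * p for p in img]        # X -> Y
F_good = lambda img: [0.5 * p for p in img]   # exact inverse of G
F_bad = lambda img: [0.4 * p for p in img]    # imperfect inverse

batch_x = [[0.25, 0.5], [1.0, 0.75]]   # unpaired samples from domain X
batch_y = [[0.5, 2.0]]                 # unpaired samples from domain Y

loss_good = cycle_consistency_loss(G, F_good, batch_x, batch_y)
loss_bad = cycle_consistency_loss(G, F_bad, batch_x, batch_y)
```

The loss vanishes only when each generator undoes the other, and it never compares an image in one domain with a matched image in the other, which is why paired samples are unnecessary.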
While many of these algorithms have shown success in translating medical images from one image modality to another, GANs (especially CycleGANs) have the potential to also be used to standardize images from different hospital institutions or scanning machines. Even for two datasets with the same type of image (e.g., CT scans), there might be variations in the parameters used by the technician or the manufacturer of the machine. CyTran was developed to generate contrast CT images from non-contrast images, which can be useful for the standardization of images from different sources taken with these different settings. Translating between contrast and non-contrast CT images is a particularly challenging task because it requires a model to effectively recognize tissue types, organ structures, and/or tumors, which can have different radiodensity measurements between the types of images. To address this challenge, CyTran combines a CycleGAN framework with a convolutional transformer block to generate images, enabling the model to simultaneously recognize large-scale global structural aspects of the images while translating them to the desired CT contrast style. When tested for style transfer with varying pairs of contrast phases (i.e., native to venous, native to arterial), CyTran achieved a consistently higher structural similarity index measure (SSIM) compared to Pix2Pix and CycleGAN, indicating an improved ability to retain structural information between contrast phases. It also achieved a lower RMSE and the best overall subjective evaluation by physicians in terms of translated image quality. While this method was specifically focused on CT contrast style transfer, it may be a promising approach for broader CT image standardization tasks. Recently, STAN-CT [49] was developed as a novel end-to-end framework for CT image standardization. STAN-CT uses both a GAN framework and a DICOM synthesis framework.
DICOM CT images are used as input and are then translated into a standardized distribution of CT image patches using the GAN, after which the standardized image patches undergo basic quality control, are integrated back into the full image, and are saved in the DICOM format. This end-to-end solution can significantly assist with CT image standardization for the purposes of COVID-19 research. It builds on a previous method (GANai) developed by Liang et al. in 2018 [35]. Compared to histogram matching, both methods have improved standardization for image texture features such as contrast, correlation, homogeneity, and energy, and once pretrained, can be useful tools for researchers. Another work by Zunair et al. [50] directly addressed the issue of limited publicly available COVID-19 chest X-ray data by developing a synthetic chest X-ray image dataset using CycleGAN, which translates non-COVID-19 patient images to COVID-19 patient images. This approach can potentially be useful when the available COVID-19 image data are heterogeneous and from diverse or unknown original sources. Overall, deep GAN-based image-to-image translation methods are allowing for improved and more versatile CT image standardization than was previously possible. Whereas classical approaches such as the DICOM GSDF and histogram matching can standardize certain aspects of images such as luminance values, GAN-based methods have dramatically broadened the scope of what is possible for image standardization, allowing for image translation between different medical image styles, CT contrast styles, and more. This can help address some of the most prevalent data quality issues in COVID-19 AI research. Similarly, FL is a breakthrough approach that enables broad, large-scale AI tool development and validation without the need for manual data harmonization, which has been a major roadblock during the pandemic.

IV. QUALITY CONTROL FOR MISSING DATA
Missing data is a significant data quality issue with COVID-19 health data from various sources. For example, a recent study on the quality of a dataset containing EHRs and COVID-19 test results for thousands of patients in Portugal ("SINAVE-Med") found that over 90% of the data were missing for important features such as the date of patients' first positive laboratory test results or the indication of whether or not a positive case required intensive care unit (ICU) admission [51]. Another study on the use of wearable devices for COVID-19 research found that many patients stop wearing their devices or let the charge expire during the time when they are symptomatic [17]. In addition, audio clipping, which is prevalent in multiple large crowdsourced COVID-19 cough datasets, can impact the reliability of time-frequency representations in discriminative neural networks and lead to poor diagnostic performance [52]. A summary of how the issue of missing data is relevant for various types of COVID-19 health data is presented in Table I (short version) and Suppl. Table IV (detailed).
The following subsections describe classical and advanced approaches towards handling missing data, with a focus on static (i.e., tabular) and time series data, including a few methods specific to audio waveform data. Examples of classical and advanced methods and their comparisons in performance are illustrated in Suppl. Fig. 5. Advanced methods are further summarized in Suppl. Table V and Fig. 1 (Right), with a visual timeline presented in Fig. 3. The issue of missing data is further visualized in Fig. 4.

A. Classical Approaches for Handling Missing Data
Missing data can be addressed using complete case analysis or imputation. Complete case analysis, also known as case deletion, involves deleting samples that contain incomplete data across all or most features of interest, or deleting features with incomplete data across all or most samples. Although it is a simple and common approach, it is generally not recommended, especially if there is a high quantity of missing data, because removing samples can lead to a loss of statistical power. Imputation involves entering an estimate for an unknown missing value. The simplest imputation method involves imputing missing values for a given feature as the mean, median, or mode of the non-missing data for that feature or as a constant value. However, simple imputation can change the distribution of the data and bias downstream analyses. There are several common imputation algorithms developed for time series datasets, the simplest of which involves imputing the missing values as the value of the known observation that occurred either immediately before or immediately after the missing values (described as "last observation carried forward" (LOCF) or "next observation carried backward" (NOCB), respectively [53]). More sophisticated classical multivariate static and time series data imputation frameworks are summarized in Suppl. Tables VI and VII, respectively.
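The simplest of these strategies can be written in a few lines. The sketch below is a minimal illustration on a hypothetical univariate series in which `None` marks a missing value; the function names are our own.

```python
from statistics import mean

def mean_impute(values):
    """Replace each missing entry with the mean of the observed entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def locf(values):
    """Last observation carried forward; leading gaps stay missing."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def nocb(values):
    """Next observation carried backward; trailing gaps stay missing."""
    return locf(values[::-1])[::-1]

# Hypothetical temperature readings with gaps.
series = [None, 36.0, None, None, 38.0, None]

filled_mean = mean_impute(series)
filled_locf = locf(series)
filled_nocb = nocb(series)
```

Note that mean imputation places the same value in every gap and thereby shrinks the variance of the series, while LOCF and NOCB cannot fill gaps at the start or end of a recording, respectively.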

B. Advanced Approaches for Handling Missing Data (Static Data)
Deep learning-based methods have recently been developed that achieve state-of-the-art imputation performance for multivariate static and time series data. For static tabular data imputation, successful models have recently been developed using generative adversarial networks (GANs) [54] and autoencoders. In 2018, Generative Adversarial Imputation Nets (GAIN) was published by Yoon et al. as the first GAN-based algorithm for missing static data imputation [55]. The model uses a generator that imputes the missing values for each sample, and a discriminator that is trained to identify which values for each sample are known observed values and which are imputed by the generator. The model is trained until the discriminator can no longer differentiate between known and imputed values. Specifically, the user inputs the original data matrix (samples by features), where missing values are set to zero. Using this matrix, two other matrices are created: a random matrix, which sets all non-missing values to zero and sets each missing value to a random number, and a mask matrix, which sets all non-missing values to one and missing values to zero. The generator uses all three to produce an imputed samples-by-features matrix. The model uses another generator subnetwork ("hint generator") that produces an encoded "hint" matrix based on just the mask matrix, enabling the discriminator to utilize information about the pattern of missingness in the original dataset. Finally, in contrast to a standard GAN discriminator, which classifies an entire sample as real or synthetic, the GAIN discriminator uses cross-entropy loss to distinguish between real and synthesized values within each sample. The outcome is an adversarial model that imputes missing values until they can no longer be distinguished from known values in a dataset.
When tested on several datasets, GAIN achieved significantly improved imputation performance compared to classical methods such as multivariate imputation by chained equations (MICE) and expectation maximization. When datasets were imputed using various methods and then used as input into a logistic regression algorithm, the classification performance was highest when GAIN was used as the imputation method. In 2019, MisGAN [56], which also uses GANs for imputation, was introduced. Using the Fréchet Inception Distance (FID) as an evaluation metric, MisGAN was found to consistently outperform GAIN across various datasets, especially at higher missing rates over 70%. However, the authors tested MisGAN with only image datasets, and its performance relative to GAIN for tabular data may differ. Both algorithms are applicable to data missing completely at random (MCAR), which means that there is no relationship between the features or variables in the data and the missing values. In contrast, Multiple Imputation using Denoising Autoencoders (MIDA) was introduced as one of the first autoencoder-based methods for imputation and does not rely on the MCAR assumption, allowing for broader applicability [57]. Denoising autoencoders (DAEs) are similar to standard autoencoders in that input data is reconstructed through a series of encoding and decoding hidden layers; however, they differ in that the input is corrupted (e.g., by setting values randomly to zero or adding noise). MIDA is structured such that the encoding and decoding layers are of sequentially higher dimensions than the input data, enabling better imputation performance. The original dataset is input into the model with missing values initially imputed as the univariate mean (for numerical data) or mode (for categorical data) of each feature column, and the output is the fully imputed representation.
When various datasets with data not MCAR were imputed using both MIDA and MICE and then used for downstream classification or regression tasks, model performance was generally better when using MIDA for imputation. However, MIDA was not compared to other state-of-the-art deep learning imputation methods, so its performance relative to GAIN or MisGAN is unclear. A disadvantage of GAIN is that it was designed for continuous and binary data types but not for other data such as mixed numerical (i.e., continuous real-valued and discrete count data) or nominal (i.e., categorical and ordinal) data. Nazabal et al. published their work on HI-VAE [58] in 2018 as a variational autoencoder-based imputation framework that can be applied to a broader set of data types under the MCAR assumption and is particularly suitable for datasets with nominal variables. Since then, novel autoencoder-based static imputation methods have continued to be introduced [59], [60], [61], [62]. A few of these static deep learning-based imputation methods are further summarized in Suppl. Table VIII.
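The three input matrices described for GAIN can be constructed as follows. This sketch covers only the data preparation and the final merging of observed and generated values; the generator, hint mechanism, and discriminator are omitted. The noise range, the function names, and the "generated" matrix are illustrative assumptions rather than details taken from the text.

```python
import random

def gain_inputs(data, seed=0, noise_high=0.01):
    """Build the data, mask, and random matrices from a samples-by-features
    table in which None marks a missing value.
    noise_high is an assumed upper bound for the random fill values."""
    rng = random.Random(seed)
    # Mask: 1 where observed, 0 where missing.
    mask = [[0 if v is None else 1 for v in row] for row in data]
    # Data: missing entries set to zero.
    x = [[0.0 if v is None else v for v in row] for row in data]
    # Random matrix: zero where observed, random number where missing.
    noise = [[0.0 if m else rng.uniform(0.0, noise_high) for m in row]
             for row in mask]
    return x, mask, noise

def combine(x, mask, generated):
    """Keep observed entries; take the generator output only where data were missing."""
    return [[xv if m else g for xv, m, g in zip(xr, mr, gr)]
            for xr, mr, gr in zip(x, mask, generated)]

data = [[1.0, None], [None, 4.0]]
x, mask, noise = gain_inputs(data)

# Placeholder standing in for a trained generator's output.
generated = [[9.0, 2.0], [3.0, 9.0]]
imputed = combine(x, mask, generated)
```

The final merge leaves every observed value untouched, so the adversarial game is played only over the entries that were originally missing.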

C. Advanced Approaches for Handling Missing Data (Time Series Data)
The development of multivariate time series imputation algorithms using deep learning is a larger, fast-growing field of research centered on models developed using concepts from recurrent neural networks (RNNs), Gaussian processes (GPs), variational autoencoders (VAEs [63]), and GANs (Suppl. Table IX). While RNNs have been developed for handling missing time series data since the late 1990s and early 2000s, more recently, in 2016, a model based on gated recurrent units (GRUs) called GRU-D [64] was developed as an end-to-end RNN framework for classification tasks using incomplete input time series data. GRU-D uses information about which values are observed or missing and patterns of missingness over time to improve classification prediction. Specifically, the model comprises a modified GRU structure that contains two trainable decay mechanisms (input and hidden decay) and that uses a mask vector for each sample for each time point that zeroes out missing values. The input decay uses information about the time interval during which an input feature is missing to decay the last observed value towards the mean, and uses these decayed values to impute missing data points. The hidden decay term is applied to the hidden state of the previous time point. Together, these modifications result in more fine-tuned imputation than traditional approaches and enable the RNN to be trained end-to-end with missing data. GRU-D was shown to have state-of-the-art performance compared to classical baseline methods when applied to the MIMIC-III intensive care unit and PhysioNet datasets. For example, when GRU-D was used for classification with MIMIC-III data, model area under the curve (AUC) increased by nearly 3% compared to when the dataset was imputed using LOCF and then trained using a standard GRU-based neural network.
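The input decay mechanism can be illustrated for a single feature. The sketch below assumes the exponential decay form gamma = exp(-max(0, w * delta + b)), where delta is the time elapsed since the feature was last observed; in GRU-D the parameters w and b are learned jointly with the network, whereas here they are fixed, hypothetical values, and the hidden state decay is omitted.

```python
import math

def input_decay_impute(values, times, feature_mean, w=0.5, b=0.0):
    """Impute a univariate series (None = missing) by decaying the last
    observed value toward the feature mean as the gap grows.
    w and b are trainable in GRU-D; they are fixed here for illustration."""
    out, last_value, last_time = [], None, None
    for v, t in zip(values, times):
        if v is not None:
            last_value, last_time = v, t
            out.append(v)
        elif last_value is None:
            # Nothing observed yet: fall back to the feature mean.
            out.append(feature_mean)
        else:
            delta = t - last_time                        # time since last observation
            gamma = math.exp(-max(0.0, w * delta + b))   # decay factor in (0, 1]
            out.append(gamma * last_value + (1 - gamma) * feature_mean)
    return out

# Hypothetical series: one observation followed by a short and a long gap.
imputed = input_decay_impute([10.0, None, None], times=[0, 1, 10],
                             feature_mean=0.0, w=1.0)
```

For short gaps the imputed value stays close to the last observation (as in LOCF), and for long gaps it approaches the mean, so the two classical strategies appear as limiting cases of a single learned rule.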
Another work by Yoon et al., M-RNN [65], uses a neural network architecture comprising bidirectional RNN layers that interpolate missing values within each data stream (i.e., each univariate time series sequence in a multivariate time series dataset) using known values in the same sequence, followed by a fully connected layer with dropout that uses information across different data streams to impute the missing values in each data stream. The dropout allows the model to capture the uncertainty in the imputed data, allowing for more robust imputation results, and the interpolation and imputation blocks allow for more robust performance for frequently and infrequently sampled data, respectively. When compared to classical methods such as spline or cubic interpolation and to advanced methods including GRU-D, M-RNN achieves consistently better imputation performance across various datasets, as measured by root mean squared error (RMSE) loss. Interestingly, despite having superior imputation performance, datasets imputed with M-RNN did not achieve significantly higher prediction accuracy when the imputed dataset was used for downstream classification tasks. When the model was modified to include binary cross-entropy loss as an end-to-end imputation and classification method like GRU-D, it achieved significantly improved classification accuracy compared to all baselines including GRU-D, especially at higher missing rates up to 90%. This suggests that end-to-end deep learning models that jointly perform imputation and classification or regression tasks appear to have better overall performance. In 2018, Cao et al. introduced another end-to-end RNN-based model called Bidirectional Recurrent Imputation for Time Series (BRITS) [66], which uses a bidirectional RNN graph to improve the imputation accuracy. 
BRITS achieves significantly improved classification performance compared to GRU-D and M-RNN, with an average increase in AUC of 0.025 and 0.030, respectively, across two separate datasets. This model was later applied to wearables data imputation with a modified regularization term, where the authors successfully used BRITS to impute missing values for features such as steps and heart rate, achieving better performance than baselines such as k-NN or SOFTIMPUTE, even with features that had 50% missing values [67]. In 2020, Variational-Recurrent Imputation Network (V-RIN) [68] was introduced as an end-to-end model comprising a VAE-based imputation subnetwork connected to an RNN-based imputation subnetwork that can be trained for a classification or regression task and that leverages information about imputation uncertainty. In particular, the VAE subnetwork takes as input multi-dimensional vectors for each patient sample for each time stamp, with missing values zeroed out. The decoding layer outputs an imputed representation of each input vector and is used to generate an uncertainty matrix comprising standard deviation values for each imputed feature, where a higher standard deviation implies a greater imputation uncertainty and all non-missing features have zero uncertainty. For each time point, this matrix, along with the representation, a mask matrix, and a time stamp matrix, is fed into the RNN subnetwork, which comprises a novel version of the GRU-D structure slightly modified to incorporate an uncertainty decay term that leverages information from the uncertainty matrix. V-RIN achieves better performance compared to M-RNN, BRITS, and GRU-D for a mortality prediction task using both PhysioNet and MIMIC-III datasets.
The growth in GAN-based multivariate time series imputation models was spearheaded by the introduction of GRUI-GAN by Luo et al. in 2018 as the first model to use a GAN for time series imputation [69]. They introduced a novel RNN cell (GRUI) which considers time lags incurred by irregularly sampled data and is included in the discriminator and generator of the GAN. This allows the model to learn the relationships between observed and unobserved data, and the temporal information. One of the disadvantages of GRUI-GAN is that it is based on a two-stage framework which can be time-consuming to run [70]. First it trains a GAN to generate samples and then it searches for generated samples that are most similar to each original input sample that contains missing values. In 2019, E2GAN (End-to-End Generative Adversarial Network) [70] was proposed as a one-stage method for multivariate time series imputation using GANs, which improves training efficiency. E2GAN generally achieved improved imputation performance across various missing rates compared to GRUI-GAN and classical methods and achieved significantly improved AUCs for mortality prediction when imputed datasets were used for classification with various classifiers including support vector machine, logistic regression, and RNN. E2GAN is not an end-to-end joint imputation and classification model but still achieved improved performance over BRITS (AUC increased by 0.02) for mortality prediction using a healthcare dataset. In 2021, Miao et al. expanded on this work by introducing SSGAN [71], which comprises a generator, a discriminator, and a semi-supervised classifier that iteratively classifies unlabeled time series data and is based on a bi-directional RNN model like BRITS [71]. SSGAN achieved significantly improved imputation performance compared to BRITS and E2GAN as measured by RMSE loss. 
Although SSGAN is not an end-to-end imputation and classification or regression model, datasets imputed using this method and then used for downstream RNN-based learning tasks still achieved better prediction performance than BRITS and all other baselines, with a 17.2% improvement over BRITS for one of the datasets tested.
Wasserstein GANs (WGANs) were also introduced in 2017 to address implementation issues occurring in regular GANs [72]. Originally, GANs often relied on a loss function using Jensen-Shannon Divergence (JSD) that often produced training issues such as a vanishing gradient, which can occur when JSD is locally saturated and the loss function can no longer accurately update the generator, or mode collapse where a generator's settings collapse to one mode and produce the same outputs [72]. WGANs address this issue by replacing JSD with Wasserstein distance (WD) in the loss function thereby avoiding mode collapse and vanishing gradients due to the fact WD is continuous and converges on a linear function that prevents discriminator saturation [72].
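The saturation behavior described above can be illustrated numerically. The following is a minimal sketch, assuming two discrete distributions on a shared one-dimensional grid; the helper names (`js_divergence`, `wasserstein_1d`, `point_mass`) are illustrative and not drawn from any WGAN implementation. When the "real" and "generated" distributions do not overlap, JSD stays fixed at log 2 regardless of how far apart they are, whereas the Wasserstein distance keeps growing with the separation and therefore still carries a usable training signal.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        nz = a > 0                      # 0 * log(0) is treated as 0
        return float(np.sum(a[nz] * np.log(a[nz] / b[nz])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q, spacing=1.0):
    """Wasserstein-1 distance on an evenly spaced 1-D grid (area between CDFs)."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * spacing)

def point_mass(index, size=20):
    d = np.zeros(size)
    d[index] = 1.0
    return d

real = point_mass(0)
for shift in (5, 10, 15):
    fake = point_mass(shift)
    print(shift, js_divergence(real, fake), wasserstein_1d(real, fake))
```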
Variants of WGANs have since been used to restore missing audio data through 'audio inpainting'. For example, Ebner and Eltelt (2020) approached the issue of long audio gaps (200 ms) through a dual-discriminator WGAN (D2WGAN) that showed improvement over the original WGAN model [73]. This improvement is due to the inclusion of two discriminators, each tasked to discriminate overlapping audio samples with either short (1 s) or long (2.5 s) missing data bordering real audio. The advantage of this approach is that it incorporates more correlated information about the missing data; as a result, audio from the D2WGAN was subjectively scored as having higher restoration quality than audio from the original method. Likewise, GACELA [74], a long audio gap inpainter, was recently developed based on a conditional GAN (cGAN) with a Wasserstein loss to fill even longer gaps, up to 1500 ms, with context-dependent audio. Older GAN-based [75], [76] and neural network [77], [78], [79], [80] solutions have also been developed in an attempt to fill progressively longer audio gaps over time.
A few multivariate time series imputation algorithms have been developed in recent years that use GPs and VAEs, including GP-VAE [81], SGP-VAE [82], and L-VAE [83]. For example, GP-VAE (2020), the first algorithm that uses both concepts for time series imputation, models the low-dimensional representation of time series data based on a GP and showed improved imputation performance over baselines including HI-VAE and classical methods such as LOCF [81]. When datasets imputed by GP-VAE were used for classification with logistic regression, it performed better than baselines. L-VAE achieved better predictive performance than BRITS, GRUI-GAN, GP-VAE, and other baselines, making it among the best-performing time series imputation methods published to date.
Other multivariate time series imputation methods published in 2020 and 2021 have been developed using novel approaches, including random drop [84], transformers [85], self-attention [86], and conditional score-based diffusion [87] models. For example, Random Drop Imputation with Self-Training (RDIS) [84] is an ensemble model that takes an incomplete dataset as input and then generates synthetic datasets by randomly selecting and labeling known values as "missing". The objective of the model is explicitly to learn to impute these "missing" values for each dataset, meanwhile generating values for the actual missing data points in each ensemble component. The entropy of all imputed values is then computed, and imputed values for all data points with entropy below a given threshold are used as pseudo-labels for a subsequent self-training task to impute missing values from the original dataset. RDIS achieved improved imputation performance compared to M-RNN and BRITS. Another model, Global and Local Time Series Imputation with Multi-directional Attention Learning [86] (GLIMA) comprises local and global RNN layers followed by a multidirectional self-attention layer. GLIMA achieves improved performance over GRU-D, M-RNN, BRITS, GRUI-GAN, and E2GAN when the dataset was imputed and then used for various classifiers including RNN, support vector machine (SVM), and logistic regression (LR).
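The random-drop idea, in which known values are hidden so that imputation error can be measured against ground truth, can be sketched in a few lines. This is a simplified illustration only: it uses last observation carried forward (LOCF) as a stand-in imputer rather than the RDIS ensemble, and the function names and toy heart-rate series are assumptions made for the example.

```python
import numpy as np

def locf(series):
    """Last observation carried forward for a 1-D series with NaNs."""
    out = np.array(series, dtype=float)
    for t in range(1, len(out)):
        if np.isnan(out[t]):
            out[t] = out[t - 1]
    return out

def random_drop(series, drop_rate, rng):
    """Hide a random subset of observed values (never index 0, so LOCF has a start)."""
    series = np.asarray(series, dtype=float)
    candidates = np.flatnonzero(~np.isnan(series))
    candidates = candidates[candidates > 0]
    n_drop = max(1, int(round(drop_rate * len(candidates))))
    dropped = rng.choice(candidates, size=n_drop, replace=False)
    corrupted = series.copy()
    corrupted[dropped] = np.nan
    return corrupted, dropped

def rmse(truth, estimate):
    diff = np.asarray(truth, dtype=float) - np.asarray(estimate, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)
heart_rate = np.array([70, 71, np.nan, 73, 74, 76, np.nan, 78, 79, 80], dtype=float)
corrupted, dropped = random_drop(heart_rate, drop_rate=0.3, rng=rng)
imputed = locf(corrupted)
# Error is only measurable at positions whose true value was known and then hidden.
score = rmse(heart_rate[dropped], imputed[dropped])
print(dropped, score)
```

The same masking protocol is what allows imputation RMSE to be reported at different artificial missing rates.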
Overall, deep learning-based static and time series imputation models have been widely successful, with significantly better imputation performance and improved robustness to higher missing rates compared to classical methods. Furthermore, many of these are end-to-end methods that combine imputation and downstream learning, allowing users to train their models on incomplete data in one step, which would not otherwise be possible.

V. QUALITY CONTROL FOR TABULATION ERRORS
For structured tabular data, such as electronic health records or public health surveillance datasets, an important data quality consideration is the presence of errors, which can impact the accuracy of downstream data analyses, even when using robust machine learning algorithms [88]. Tabulation errors can be heterogeneous and may comprise incorrect values, uninterpretable values (e.g., typos), inconsistent use of diagnostic or medication coding systems (e.g., ICD-9 vs. ICD-10), or data entered with the wrong units. A summary of how the issue of tabulation errors is relevant to various types of COVID-19 health data is presented in Table I (short version) and Suppl. Table IV (detailed).
The following subsections describe classical and advanced methods for handling tabulation errors, with a focus on both quantitative and qualitative error detection methods applicable for static (i.e., tabular) data. Suppl. Fig. 6 provides examples of tabulation errors and classical and advanced quality control methods, as well as performance comparisons between these methods. Advanced methods are further summarized in Suppl. Table V and Fig. 1 (Right), with a visual timeline presented in Fig. 3. The issue of tabulation errors is further visualized in Fig. 4.

A. Classical Approaches for Handling Tabulation Errors
Handling tabulation errors can be a manual and time-consuming task. Errors can be detected using quantitative or qualitative approaches. Quantitative error detection involves identifying outliers in a dataset, as described in Suppl. Table X. Qualitative error detection involves specifying logical patterns or relational constraints for observations or features in the dataset and using those to identify violations in the dataset [89]. These relational constraint rules are typically defined by domain experts and are summarized in Suppl. Table XI.
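As a minimal illustration of the two classical approaches, the sketch below pairs a quantitative check (Tukey's interquartile-range rule for outliers) with a qualitative check (expert-style constraint rules applied to a record). The rule set, field names, and example values are hypothetical and chosen only to mirror the kinds of errors described above, such as a temperature entered in the wrong units.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical constraint rules of the kind a domain expert might specify.
RULES = {
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
    "discharge_not_before_admission": lambda r: r["discharge_day"] >= r["admit_day"],
    "temperature_in_celsius": lambda r: 30.0 <= r["temp_c"] <= 45.0,
}

def rule_violations(record):
    """Return the names of all rules that the record violates."""
    return [name for name, rule in RULES.items() if not rule(record)]

# Quantitative: 98.6 was entered in Fahrenheit in a Celsius column.
temps = [36.6, 37.1, 36.9, 38.2, 98.6, 37.4, 36.8]
flags = iqr_outliers(temps)

# Qualitative: the same record also has a discharge date before admission.
record = {"age": 54, "admit_day": 10, "discharge_day": 7, "temp_c": 98.6}
print(flags.tolist(), rule_violations(record))
```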

B. Advanced Approaches for Handling Tabulation Errors (Quantitative Error Detection)
Neural networks have been used for outlier detection from the early 2000s [90]. More advanced deep learning models have been developed in recent years based on progress in the neural networks research field. In 2016, the first ensemble autoencoder model for outlier detection (RandNet) was introduced for tabular data [91]. They set up the ensemble with several autoencoders, each with random connections between neurons dropped to make the models different enough from each other. Each autoencoder of the ensemble is trained independently, and the reconstruction error ("outlier score") for each sample point for each autoencoder is calculated. The final outlier score for each sample is the median score obtained from all ensemble components and is used to train a supervised classifier to identify labeled outlier points. This model achieved a significantly higher outlier classification accuracy (92.87%) on various datasets compared to classical baselines such as local outlier factor (LOF, 50.63%). Deep Autoencoding Gaussian Mixture Model [92] (DAGMM) was later introduced as an unsupervised end-to-end anomaly detection method combining an autoencoder with a Gaussian Mixture Model (GMM). The latent representation and reconstruction error from the autoencoder are then modeled using the GMM to evaluate the energy or likelihood of each input sample. DAGMM achieved an improved F1 score for outlier classification compared to various baselines across several datasets. In 2017, Schlegl et al. [93] introduced the first deep generative adversarial network (GAN) for anomaly detection which was originally applied to image data but was then shown by Zenati et al. to be applicable to tabular data [94] as well. A standard convolutional GAN is initially trained using normal data without anomalies, with the objective of minimizing residual and discriminator loss terms that are designed to enable the model to better learn the statistical distribution of "normal" data. 
These loss terms are included as part of an "anomaly score," which, upon running the trained model with both normal and anomalous samples, will return a high score for samples that do not resemble the normal data. In subsequent years, additional GAN-based outlier detection models were developed and applied to tabular data, including the works by Zenati et al. [94] in 2018 and Liu et al. [95] in 2019 ("Multiple Objectives Generative Adversarial Active Learning", MO-GAAL). The model by Zenati et al. outperformed that of Schlegl et al. in terms of precision, recall and F1 score for tabular outlier classification, and performed comparably to DAGMM. MO-GAAL uses multiple generators with different objectives to model the distribution of the dataset and generate potential outlier data points. As the model is trained using both normal and anomalous samples through multiple iterations, the discriminator starts to learn the distribution of the input data and learns the boundary between normal and anomalous data samples. MO-GAAL was compared to many classical outlier detection methods (i.e., distance-based, clustering-based, density-based, angle-based, and classification-based methods) across various datasets using the Friedman test and had the highest average rank in terms of overall performance. In 2019, adVAE (self-adversarial variational autoencoder) [96] was developed by Wang et al. and used a Gaussian transformer network to generate outlier latent variables, as well as encoder and generator subnetworks, both of which can discriminate between normal and outlier latent variables. This model outperformed MO-GAAL for outlier classification performance across almost all datasets tested. Many other deep learning-based works have been published for outlier detection [97], [98], [99] and this continues to be a rapidly growing field of research [100]. 
Meanwhile, there continues to be growth in development of novel statistical algorithms for outlier detection that do not involve deep learning, including density-based [101], graph-based [102], and ensemble-based [103] methods.
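The core idea shared by the autoencoder-based detectors above, scoring each sample by its reconstruction error and aggregating scores across ensemble members with the median, can be sketched without a deep learning framework. The example below is a simplified stand-in and not the RandNet architecture: each ensemble member is a rank-k PCA (equivalent to a linear autoencoder) fitted on a bootstrap resample rather than a neural network with randomly dropped connections, and all names and data are illustrative assumptions.

```python
import numpy as np

def pca_reconstruction_error(train, data, n_components):
    """Squared reconstruction error of `data` under a rank-k PCA fitted on `train`."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    basis = vt[:n_components]                    # (k, D) linear "encoder" weights
    centered = data - mean
    reconstructed = centered @ basis.T @ basis   # encode, then decode
    return np.sum((centered - reconstructed) ** 2, axis=1)

def ensemble_outlier_scores(data, n_members=15, n_components=1, seed=0):
    """Median reconstruction error across members fitted on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_members):
        idx = rng.choice(len(data), size=len(data), replace=True)
        scores.append(pca_reconstruction_error(data[idx], data, n_components))
    return np.median(scores, axis=0)

# Synthetic table in which the second column is roughly twice the first.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])
data = np.vstack([data, [3.0, -6.0]])            # one row that breaks the pattern

scores = ensemble_outlier_scores(data)
print(int(np.argmax(scores)))                    # index of the most anomalous row
```

Rows that follow the dominant correlation structure are reconstructed almost exactly, so the planted row stands out by a wide margin; nonlinear autoencoders extend the same principle to more complex structure.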

C. Advanced Approaches for Handling Tabulation Errors (Qualitative Error Detection)
In recent years, qualitative error detection tools have emerged that can automatically define integrity constraints, discover integrity constraint violations in a given dataset [104], [105], [106], and either provide suggestions on potential errors to repair or fix the errors automatically [89]. Given how time-consuming, complex, and error-prone the process of qualitative error detection can be, these algorithms have the potential to have a significant impact on data quality for many different research applications. Examples of recently developed cutting-edge machine learning algorithms for error repair are described in Suppl. Table XII. In 2013, FASTDC [105] was developed to discover denial constraints from the dataset, which removes the added time and effort needed for a user to do this manually. Also in 2013, Chu et al. developed Holistic Data Cleaning (Holistic) [107], the first approach to data cleaning that integrates data from heterogeneous integrity constraint rules to identify and repair errors based on constraint violations. In this work, the user provides pre-specified data quality or integrity constraint rules, and the algorithm automatically finds and repairs violations. Another model, KATARA [108], uses an external knowledge base to identify qualitative errors in a dataset and then suggests to the user a set of possible repairs. In 2016, ActiveClean [109] was introduced as an iterative data cleaning framework for convex-loss machine learning applications, such as support vector machines, linear regression, or logistic regression. The user specifies the model of interest, and the machine learning model is iteratively trained. In each iteration, ActiveClean suggests to the user what data may need to be repaired and the user then manually repairs the data using value transformations or filtering operations. Thus, ActiveClean uses data from the model performance to improve repair suggestions, leading to more successful data cleaning.
In 2017, HoloClean [110] was developed as a probabilistic framework which takes as input a dirty dataset, an external reference knowledge base, and user-defined integrity constraints, and finds and automatically repairs errors in the dataset. It does this using an error detection module, which utilizes integrity constraints and the external data to detect outliers and identify repairs, a compilation module, which compiles data as features in a probabilistic graphical model, and a repair module, which repairs data using probabilistic inference on the graphical model. To measure data cleaning performance, each value in each original dataset used for benchmarking was manually labeled as correct or incorrect and then precision, recall, and F1 score for the repaired dataset were calculated after data cleaning. HoloClean had better performance compared to previous baselines including Holistic and KATARA. For example, for a healthcare dataset, HoloClean achieved an F1 score of 0.832, compared to 0.435 and 0.379 for Holistic and KATARA, respectively, and was successful for all datasets whereas other methods had extremely long runtimes or failed to identify certain types of errors. Other frameworks, such as PIClean [111] and MLNClean [112], followed suit with similar approaches. PIClean (2019) is a probabilistic interactive data cleaning system that uses relations between data columns to identify potential dataset errors and repairs. Suggestions are provided to users, who can confirm or reject them, and this feedback is used to improve the performance of the model. It is more interactive with the user compared to HoloClean. In 2019, another system (MLNClean) was developed using Markov logic networks (MLNs) to clean the dataset by identifying both schema-level and instance-level errors. Like HoloClean, it requires domain experts to specify integrity constraints. 
However, it achieves consistently improved error detection performance compared to HoloClean even at greater error percentages and is also consistently faster. More recently, Rotom [113] was developed, which is a platform that leverages data augmentation and uses transformer [114], meta-learning, and self-supervised learning concepts to detect qualitative errors. It requires the user to provide a small number of labeled training examples that are known to be incorrect but achieves good performance for error detection. Rotom was compared to other data augmentation frameworks, but its performance compared to previous state-of-the-art data cleaning methods was not discussed.
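To make the notion of a denial constraint concrete, the sketch below checks one such constraint ("no two rows may share a ZIP code but differ in city") and proposes a repair for each conflicting cell. It is a deliberately naive illustration: the majority vote used here is a stand-in assumption for the probabilistic inference performed by systems such as HoloClean, and the function names and toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def fd_violations(rows, key, dependent):
    """Return {key value: Counter of dependent values} for keys with conflicts.

    Encodes the denial constraint that no two rows may agree on `key`
    while disagreeing on `dependent`.
    """
    seen = defaultdict(Counter)
    for row in rows:
        seen[row[key]][row[dependent]] += 1
    return {k: counts for k, counts in seen.items() if len(counts) > 1}

def suggest_repairs(rows, key, dependent):
    """Suggest the majority value for every conflicting cell (naive heuristic)."""
    suggestions = []
    for k, counts in fd_violations(rows, key, dependent).items():
        majority, _ = counts.most_common(1)[0]
        for i, row in enumerate(rows):
            if row[key] == k and row[dependent] != majority:
                # (row index, column, current value, suggested value)
                suggestions.append((i, dependent, row[dependent], majority))
    return suggestions

rows = [
    {"zip": "30332", "city": "Atlanta"},
    {"zip": "30332", "city": "Atlanta"},
    {"zip": "30332", "city": "Atlana"},    # typo violates the constraint
    {"zip": "10001", "city": "New York"},
]
print(suggest_repairs(rows, "zip", "city"))
```

Interactive systems would present such suggestions to a user for confirmation instead of applying them directly.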
Overall, these methods are promising in that they not only save researchers significant time and effort compared to manual qualitative data cleaning but also achieve good consistent error detection or repair performance.

VI. QUALITY CONTROL FOR NOISE AND ARTIFACTS
Noise and artifacts can occur in waveform and medical image data (Table I, Suppl. Table IV). They can interfere with relevant physiological signals and thus pose a significant quality issue for researchers working with COVID-19 healthcare data. We focus the discussion on waveform data, including electrocardiogram (ECG) signals and audio data, and on medical image data, including X-ray and computed tomography (CT) images, which are most relevant for COVID-19 related research. ECG waveforms have been shown to be associated with COVID-19 cardiovascular complications [115], and several deep learning algorithms have recently been developed that can detect [115] or predict [116] these using ECG waveform data. Similarly, respiratory-related audio measurements related to speech or coughing sounds have been used to diagnose positive cases, although these rely on typically noisy and imbalanced datasets. Several publications have used the synthetic minority over-sampling technique (SMOTE) to address the overrepresentation of non-COVID-19 subjects in otherwise large datasets (i.e., Coswara and Sarcos datasets) [117], [118], [119]. This common issue and its solutions, however, are outside the scope of our discussion. X-ray and CT images have also emerged as important data sources for AI tool development for COVID-19 screening and diagnosis. Low-dose chest CT (LDCT) scans have also been discussed as an alternative to normal-dose CT scans for COVID-19 routine practice to reduce the added risk to the patient [120], [121].
The following subsections detail key classical and advanced methods to handle noise and artifacts, with a focus on waveform data (i.e., ECG, audio) and CT image data. Examples of these methods and comparisons between classical and advanced approaches are visualized in Suppl. Fig. 7. Advanced methods are further summarized in Suppl. Table IV and Fig. 1 (Right), with a visual timeline presented in Fig. 3. The issue of noise and artifacts is further visualized in Fig. 4.

A. Waveform Noise and Artifacts (Classical Approaches)
Waveforms are continuous data points collected over time and are generally obtained from sensors such as photoplethysmograms (PPGs), accelerometers, thermometers, touch sensors, and gyroscopes. They can be highly susceptible to noise and various types of artifacts including motion artifacts and electromagnetic interference (EMI). Classical waveform de-noising methods include empirical mode decomposition, adaptive filtering, and wavelet transforms (WT). Classical signal processing methods are not always optimal for improving ECG signal quality. Methods such as empirical mode decomposition (EMD) are suboptimal because they might remove true signals from ECGs, and adaptive filters such as normalized least mean squares (LMS) require a reference noise signal, which cannot always be obtained. Although WT methods are widely used for ECG de-noising, they work in the frequency domain and cannot always distinguish true signals from artifacts when the artifacts morphologically resemble the true signal, leading to residual noise in some cases.
One of the most common data quality issues addressed in audio analysis is the removal of background noise from recordings with a poor signal-to-noise ratio (SNR) to enhance speech recognition or audio quality. A variety of sources contribute to a low SNR, such as acoustic, environmental, or distorted sounds. Classical approaches to audio denoising include Wiener filtering [122], [123], spectral subtraction [124], [125], minimum mean squared error (MMSE) estimation [126], and optimally-modified log-spectral amplitude (OM-LSA) estimation [127]. However, these methods can sometimes introduce additional artifacts, such as the generation of 'musical noise' through spectral subtraction due to the flat, short-time noise spectrum estimate that is subtracted from the whole spectrum [124]. Additionally, common among these approaches is the use of mel frequency cepstral coefficients (MFCCs) as representational features and an outcome in voice recognition that is a more uniform, but less recognizable speech spectrum [128], [129].
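Spectral subtraction, one of the classical methods listed above, can be sketched for a single frame as follows. This is a minimal illustration under simplifying assumptions: a synthetic bin-aligned tone stands in for speech, the noise is white Gaussian, the noise magnitude spectrum is estimated from separate noise-only frames, and no windowing, overlap-add, or over-subtraction factor is used. The function names are illustrative.

```python
import numpy as np

def spectral_subtraction(noisy, noise_magnitude):
    """Subtract an estimated noise magnitude spectrum; reuse the noisy phase."""
    spectrum = np.fft.rfft(noisy)
    magnitude = np.maximum(np.abs(spectrum) - noise_magnitude, 0.0)  # floor at zero
    cleaned = magnitude * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(cleaned, n=len(noisy))

def snr_db(clean, estimate):
    """SNR in dB of `estimate` relative to the known clean signal."""
    residual = estimate - clean
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))

rng = np.random.default_rng(0)
n = 1024
t = np.arange(n)
clean = np.sin(2 * np.pi * 32 * t / n)            # tone aligned to an FFT bin
noisy = clean + rng.normal(scale=0.5, size=n)

# Noise magnitude spectrum averaged over separate noise-only frames.
noise_frames = rng.normal(scale=0.5, size=(50, n))
noise_magnitude = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)

denoised = spectral_subtraction(noisy, noise_magnitude)
print(snr_db(clean, noisy), snr_db(clean, denoised))
```

Bins whose noise magnitude happens to exceed the average estimate survive the subtraction as isolated spectral peaks, which is the origin of the 'musical noise' artifact.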

B. Waveform Noise and Artifacts (Advanced Approaches)
Since the mid-2010s, a small but growing research field has developed to investigate how deep learning models can be used for ECG de-noising to achieve better performance than traditional approaches. For example, in 2016, Xiong et al. [130] developed the first de-noising autoencoder (DAE) for ECG signal de-noising using wavelet transform. They artificially added baseline wander (BW), muscle artifact (MA), and electrode motion (EM) artifacts to clean signals and then applied a wavelet transform as a first de-noising step. The de-noised signal was then input into the encoding and decoding layers of the DAE to remove the residual noise and to output the clean signal. The resulting signal quality was significantly improved compared to WT, Stockwell transform, and other baselines. For example, for muscle artifact-corrupted ECG signals with signal-to-noise ratios (SNRs) of 5 decibels (dB), the signals de-noised using the proposed method achieved an SNR 1.93 dB higher than those de-noised with WT (18.16 dB and 16.23 dB, respectively). In 2019, Wang et al. [131] further developed a first generative adversarial network (GAN) approach towards ECG de-noising, where the generator subnetwork is trained to de-noise ECG waveform signals and the discriminator is trained to differentiate between de-noised signals and the original clean signal. The objective of the model is for the discriminator to be unable to adequately distinguish the de-noised signals from the original clean signals. This model achieved superior performance compared to WT for all common ECG artifacts tested at varying input noise levels. For example, WT improved the SNR of MA-corrupted signals from 5 dB to 18.16 dB, whereas the de-noised signals from the proposed method had a much higher SNR of 37.23 dB. This constitutes a 105% higher SNR compared to WT, which is significantly greater than the 12% improvement in SNR over WT from using the Xiong et al. method.
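The percentages quoted above are relative differences between SNR values expressed in dB, and the arithmetic can be reproduced directly from the reported figures. The short sketch below does so; the function names are illustrative, and the final line is an added reminder that, because dB is a logarithmic scale, a difference in dB corresponds to a multiplicative ratio in noise power.

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels."""
    return 10 * math.log10(signal_power / noise_power)

def percent_higher(new_db, baseline_db):
    """Relative difference between two dB values, as reported in the text."""
    return 100 * (new_db - baseline_db) / baseline_db

def power_ratio(delta_db):
    """Convert a difference in dB into a ratio of powers."""
    return 10 ** (delta_db / 10)

# Output SNRs for MA-corrupted signals with 5 dB input SNR, as reported above.
xiong_dae, xiong_wt = 18.16, 16.23    # Xiong et al. DAE vs. their WT baseline
wang_gan, wang_wt = 37.23, 18.16      # Wang et al. GAN vs. their WT baseline

print(round(xiong_dae - xiong_wt, 2))                # 1.93 dB gain
print(round(percent_higher(xiong_dae, xiong_wt)))    # about 12 percent
print(round(percent_higher(wang_gan, wang_wt)))      # about 105 percent
print(power_ratio(wang_gan - wang_wt))               # residual-noise power ratio
```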
In 2018, Antczak developed the first recurrent neural network (RNN) for ECG signal de-noising using long short-term memory (LSTM) units and incorporating a de-noising autoencoder [132]. The input into the model was a pre-processed ECG signal normalized to zero mean, separated into samples where each sample comprised 600 time-stamped data points in the signal. The model was pre-trained on synthetic data and then achieved good performance on actual ECG signals. While simple bandpass and WT filters performed best for signals with smaller noise levels (SNR greater than −7 dB), the RNN consistently performed best with noisier signals. Arsene et al. more recently developed ECG de-noising models using convolutional neural networks (CNNs) [133], [134], and found that CNN-based models achieve better performance than an LSTM model they trained. Chiang et al. [135] developed a convolutional denoising autoencoder and emphasized the superior performance of convolutional layers for ECG de-noising. In 2020, Casas et al. [136] developed an autoencoder-based adversarial network applicable to ECG and other waveforms that takes a noisy signal as input and, through encoding and decoding convolutional layers, outputs a clean signal. Meanwhile, the discriminator of the model is trained to differentiate between clean and noisy signals using the inner hidden layer of the autoencoder. While this was not compared to other deep learning-based ECG de-noising methods, it was able to improve the SNR of noisy signals from −6.72 dB to 5.30 dB, compared to WT, which only achieved an SNR of −5.90 dB. Another recent model developed by Zubair et al. [137] was designed specifically to remove motion artifacts from ECGs. The model comprises an RNN which de-noises the input ECG signals, followed by a standard deep neural network that outputs a clean signal without artifacts. In general, these deep learning-based approaches perform significantly better than classical methods.
Although there is a significant level of interest in automated COVID-19 detection using cough, sound, and speech data, multiple authors report the presence of background noise which, despite having been reduced using classical audio denoising methods such as spectral noise gating and data augmentation via time stretching, can still persist [119], [138], [139]. Many different types and combinations of deep learning methods have been employed to resolve the same issue, including CNNs [140], RNNs [141], [142], [143], and GANs [144], [145], [146]. Deep learning methods for denoising audio commonly rely on log-mel spectrograms for feature representation; however, some authors contend that while this method of compression and normalization can reduce training set size and training time, there may be an advantage to using raw waveform data for synthesizing higher quality sound and for future applications that avoid information loss but rely on more computational resources.

C. Medical Image Noise and Artifacts (Classical Approaches)
LDCT scans reduce patients' exposure to high levels of radiation compared to normal-dose images (NDCT), but simultaneously compromise the quality of the image. The added noise and artifacts in these images can make it challenging to use them for machine learning tasks if not properly managed. Common types of artifacts found in LDCT, NDCT, and X-Ray images are summarized in Suppl. Table XIII. Classical approaches for handling LDCT noise and artifacts in the image domain include non-local means [147], [148], dictionary learning [149], and block matching 3D methods [150], which became popular in the early 2010s (Suppl. Fig. 8).

D. Medical Image Noise and Artifacts (Advanced Approaches)
In recent years, there has been rapid growth in development of state-of-the-art deep learning-based tools to remove noise or artifacts from CT images in the image domain, after the image is reconstructed from the waveform domain. In 2017, Chen et al. [151] developed the first deep learning model for LDCT image de-noising in the image domain. They used a convolutional neural network (CNN) to map LDCT images to normal-dose versions of those images and achieved a peak signal-to-noise ratio (PSNR) and subjective quality rating comparable to that of block matching 3D. Also in 2017, Yang et al. (WGAN-VGG) [152] and Wolterink et al. [153] developed some of the first generative adversarial network-based models for LDCT de-noising. Although WGAN-VGG performed worse than dictionary reconstruction in terms of PSNR and structural similarity index measure (SSIM), it is advantageous because it does not produce unintentional image blurring or waxy or blocky artifacts and achieved significantly better subjective quality ratings by radiologists. In the same year, Chen et al. introduced Residual Encoder-Decoder Convolutional Neural Network (RED-CNN) [154], which used both autoencoder and convolutional neural network-based architecture concepts. The model comprises a series of convolutional encoding layers followed by deconvolutional decoding layers to remove noise from LDCT images. The encoding layers filter noise from the original image, whereas the decoding layers and shortcut connections between encoding and decoding layers are used to recover structural details from the image. Based on qualitative analysis of image quality by experienced radiologists, images de-noised using RED-CNN had improved artifact reduction, noise suppression, contrast retention, lesion discrimination, and overall quality compared to baseline methods such as dictionary learning, non-local means, and the previously published algorithm by Chen et al. 
Unlike previous deep learning image de-noising models, RED-CNN also had improved PSNR compared to all baselines. In 2018, Structurally-Similar Multi-Scale Generative Adversarial Network (SMGAN-3D) [155] was also introduced as a novel GAN-based model for LDCT image de-noising, but it did not achieve significant improvements over RED-CNN. Dilated Residual Learning (DRL) [156] is a CNN-based model introduced in 2019 which uses an edge detection convolution layer to identify object boundaries and uses dilated convolution layers to capture more contextual information from the input image in fewer layers to make computation less expensive. These modifications enabled it to achieve superior performance compared to the original Chen et al. model [151].
Most recently, in 2021, Eformer [157] was introduced for medical image de-noising and achieved state-of-the-art performance over baselines, including RED-CNN, on the American Association of Physicists in Medicine (AAPM) Low-Dose CT Grand Challenge Dataset. Eformer is the first transformer-based model for medical image de-noising and is applicable not only to LDCT but to other types of medical images as well. It uses an encoder-decoder network with transformer blocks and integrates Sobel-Feldman filters for edge enhancement. Eformer surpasses all baseline state-of-the-art models tested in terms of PSNR and RMSE, with a nearly 3% higher PSNR and 12% lower RMSE than RED-CNN. This continues to be an active area of research with new models building on previous ones and achieving improved performance [158], [159], [160]. A few models have been specifically designed to handle metal [161] or motion [162] artifacts in CT scans or general image modalities. For example, in 2017, DeblurGAN [162] was introduced to correct motion artifacts, and in 2019, Liao et al. introduced the Artifact Disentanglement Network (ADN) [163] for metal artifact reduction from CT images.
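Because PSNR and RMSE are the quantitative metrics used to compare the de-noising models above, a minimal sketch of how they are typically computed may be helpful. The function names, the `data_range` argument, and the toy image are assumptions made for illustration; published results depend on the intensity range and windowing chosen by each study.

```python
import numpy as np

def rmse(reference, estimate):
    """Root mean squared error between a reference image and an estimate."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(reference, estimate, data_range):
    """Peak signal-to-noise ratio in dB; `data_range` is the maximum possible
    intensity span of the image (e.g., 1.0 for images scaled to [0, 1])."""
    err = rmse(reference, estimate)
    if err == 0:
        return float("inf")
    return float(20 * np.log10(data_range / err))

# Toy example: a constant error of 0.1 on an image scaled to [0, 1].
reference = np.zeros((4, 4))
estimate = reference + 0.1
print(rmse(reference, estimate), psnr(reference, estimate, data_range=1.0))
```

Since PSNR is a logarithmic function of RMSE, a modest percentage change in PSNR can accompany a considerably larger percentage change in RMSE.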
Many general image de-noising algorithms can also be applied to medical images. For example, a highly cited general image de-noising algorithm, Noise2Noise [164], was developed in 2018 and was able to de-noise images without access to clean training example images and has been successfully applied to remove noise from magnetic resonance (MR) images. MedGAN [45] is another such GAN-based algorithm that appears to be broadly applicable but that was tested with positron emission tomography (PET) images. This model introduces a novel generator architecture that links together several convolutional encoder-decoder subnetworks with skip connections to enhance the resolution of the generated images. Additionally, a feature extraction subnetwork is introduced that is pre-trained using synthetic samples and input samples of the target domain, with loss terms that minimize the differences between the generated and target domain styles. These novel contributions enable MedGAN to translate noisy input images into de-noised versions of the same images. Suppl. Table XIV provides a summary of some of the deep learning algorithms for medical image noise and artifact removal published in recent years, with a focus on those applicable to CT scans, which are commonly being used for COVID-19 research applications. Suppl. Table XV illustrates the role of deep learning in the reconstruction process for producing high quality images.
Apart from image noise and artifacts, low resolution is also a common image quality issue, in which the image can appear blurry [165] and thus be difficult to interpret. State-of-the-art deep learning algorithms have been developed to improve the resolution of input images using deep convolutional neural networks [166], [167], unfolding networks [168], transformer networks [169], and more.

VII. FUTURE DIRECTIONS
For each of the four data quality issues discussed in this review, we identified key trends in the quality control of COVID-19 data that are likely to persist moving forward (Suppl. Fig. 9). To address the lack of data standardization, both generative neural network models and federated learning architectures have enabled levels of standardization that were not previously possible. Generative models such as GANs have expanded medical image standardization beyond what is possible with classical approaches such as DICOM GSDF and histogram matching, enabling translation between different image styles, CT contrast styles, and more. Federated learning is a powerful methodology that obviates the need to directly harmonize data from different sources before combining them to train a machine learning model, enabling data science researchers to train models with large-scale multi-institutional data without the added time and effort needed to manually standardize the data. We anticipate that the use of these tools will continue and expand. We also anticipate continued growth in the use of advanced deep neural network architectures for static and time series data imputation, for managing noise and artifacts in waveform and medical image data, and for detecting outliers. Finally, we expect that advanced probabilistic graphical models and automated approaches will increasingly be used to detect and correct qualitative errors.
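As a rough illustration of the federated learning workflow described above, the following sketch runs a federated-averaging-style loop for a linear model over three simulated sites. It is a simplified sketch under stated assumptions, not a production framework or any specific published system: the sites, their sample sizes and noise levels, the learning rate, and the function names are all hypothetical. The point it shows is that only model weights are exchanged with the coordinating server, while each site's records stay local.

```python
import numpy as np


def local_update(global_w, X, y, lr=0.1, epochs=5):
    """A few gradient-descent steps on one site's mean-squared-error loss."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


def federated_average(sites, dim, rounds=50):
    """Federated-averaging-style loop: only weight vectors leave a site, never X or y."""
    global_w = np.zeros(dim)
    total = sum(len(y) for _, y in sites)
    for _ in range(rounds):
        # Each site refines the current global model on its own data.
        local_ws = [local_update(global_w, X, y) for X, y in sites]
        # The server combines the local models, weighted by site sample count.
        global_w = sum((len(y) / total) * w for (_, y), w in zip(sites, local_ws))
    return global_w


# Three simulated institutions with different sizes and noise levels (assumed data).
rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5])
sites = []
for n_samples, noise_sd in [(200, 0.05), (500, 0.10), (300, 0.20)]:
    X = rng.normal(size=(n_samples, 3))
    y = X @ true_w + rng.normal(0.0, noise_sd, n_samples)
    sites.append((X, y))

w_fed = federated_average(sites, dim=3)
print(np.round(w_fed, 2))
```

In this toy setting the sites already share a feature space; in practice, the degree to which federated training reduces harmonization effort depends on how differently each institution encodes its data.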

VIII. CONCLUSION
This review summarizes a selection of novel state-of-the-art data quality control methods being developed to address the issues of lack of standardization, missing data, tabulation errors, and noise and artifacts. Broader use of these tools across the COVID-19 research community can lead to better-performing algorithms powered by better-quality data.

ACKNOWLEDGMENT
The authors thank Dr. Wang's lab for providing valuable insights, and particularly thank Isha Shah, of the Georgia Institute of Technology Department of Computer Science, for her contributions. The authors appreciate the vibrant discussions at the Third NSF Workshop on Predicting Pandemic Emergence of the Predictive Intelligence for Pandemic Prevention (PIPP) initiative in February 2021.