Recent Advances in Variational Autoencoders With Representation Learning for Biomedical Informatics: A Survey

Variational autoencoders (VAEs) are deep latent space generative models that have been immensely successful in multiple exciting applications in biomedical informatics such as molecular design, protein design, medical image classification and segmentation, integrated multi-omics data analyses, and large-scale biological sequence analyses, among others. The fundamental idea in VAEs is to learn the distribution of data in such a way that new meaningful data with more intra-class variations can be generated from the encoded distribution. The ability of VAEs to synthesize new data with more representation variance at state-of-art levels provides hope that the chronic scarcity of labeled data in the biomedical field can be resolved. Furthermore, VAEs have made nonlinear latent variable models tractable for modeling complex distributions. This has allowed for efficient extraction of relevant biomedical information from learned features for biological data sets, referred to as unsupervised feature representation learning. In this article, we review the various recent advancements in the development and application of VAEs for biomedical informatics. We discuss challenges and future opportunities for biomedical research with respect to VAEs.


I. INTRODUCTION
Over the past decade, there has been a remarkable increase in the amount of available large-scale biomedical data such as molecular compound structures [1], DNA/protein sequences [2], [3], computed tomography (CT)/magnetic resonance imaging (MRI) scans [4], [5], and electronic health records (EHRs) [6], [7], among others. Gaining insights and knowledge from heterogeneous, high-dimensional, and complex biomedical data remains a key challenge in transforming bioinformatics research.
Recent advances in several areas have led to an increased interest in the use of artificial intelligence (AI) approaches [8], [9], which have greatly improved the performance of biomedical data analyses. However, because labeled training data are scarce, data generation is a fundamental problem in several areas of deep learning. It is especially relevant to imbalanced-dataset problems and few-shot learning, where a few classes may have low representation in the dataset [10]-[12]. On the other hand, due to the powerful development of deep learning techniques in recent years, such as Convolutional Neural Networks (CNNs) [13]-[16], the ability to learn meaningful nonlinear feature embeddings with little or no supervision has become a key improvement toward applying AI to the enormous amount of unlabeled data acquired in the world, where a system, fed with raw data, provides its own representations. However, the CNN is mainly designed to automatically and adaptively learn spatial hierarchies of features for object-classification tasks [8]. Recently, Bengio et al. [17] proposed the concept of meta-priors for learning a mapping from a high-dimensional space to a meaningful low-dimensional embedding. In this concept, the high-dimensional inputs can be reconstructed from the low-dimensional manifold representations [18].
The associate editor coordinating the review of this manuscript and approving it for publication was Kumaradevan Punithakumar.
Recently, deep generative models have gained a lot of attention due to numerous applications in data generation [17]. Among them, the variational autoencoder (VAE) [19]-[21] is regarded as one of the most popular approaches to generative modeling as well as to low-dimensional manifold representation learning. The VAE can also be regarded as a combination of an encoder and a decoder Bayesian network. The encoder maps an input (e.g., an image) x to a latent vector z, and the decoder then maps the latent vector z back to image or data space [20]. VAEs are able to learn smooth latent representations of the input data [17] through the encoder and can thus generate new meaningful samples via the decoder, balancing the dataset with more intra-class variants in an unsupervised manner. In addition, a key benefit of VAEs is the ability to control the distribution of the latent representation vector z, which allows VAEs to be combined with representation learning to further improve downstream tasks [18], [22]. Moreover, the quality and diversity of generated images have been improved by VAE variants such as β-VAE [23] and InfoVAE [24], which combine VAEs with disentanglement learning; GMVAE [90] and VaDE [91], which add unsupervised clustering so that the VAE can perform classification; f-VAEGAN-D2 [92] and Zero-VAE-GAN [93], which combine VAEs with GANs and few-shot learning; S-VAE [94], which uses a spherical latent representation; VQ-VAE [95], which uses a discrete latent representation; VAE-GAN [96], which combines VAEs and GANs to generate high-quality images; and S3VAE [97], which learns disentangled representations of sequential data. Therefore, beyond its use as a powerful generative model, the VAE's nonlinear latent feature representation learning is particularly important, as it has opened a series of new research directions and applications in biomedical informatics.
Due to the above characteristics of VAEs, current VAE research in biomedical informatics focuses primarily on two directions: 1) the data generation approach and 2) the representation learning approach. In this article, we aim to provide a concise and insightful discussion of the latest advances in applying VAEs to biomedical informatics. We particularly highlight the most important techniques in successfully applying VAEs in this field. Table 1 summarizes VAE research in biomedical informatics.
The paper is organized as follows: Section II gives background on VAEs. Section III reviews the applications of VAEs in biomedical informatics. Conclusions and future work are given in Section IV, and references are listed at the end.

II. OVERVIEW OF VAEs
In this section, we first present the theory behind VAEs. We then introduce the data augmentation approach and the representation learning approach of VAEs. It should be emphasized that data augmentation is itself one of the results of representation learning. In many studies, the two approaches are used together to achieve a similar goal.

A. PRELIMINARIES OF VAEs
The VAE is an unsupervised generative model that provides a principled way of performing variational inference using an autoencoder (AE) architecture [98], [99]. As shown in Figure 1, a VAE enhances a normal AE by adding a Bayesian component that learns the parameters representing the probability distribution of the data. The main difference is that an AE learns a compressed representation of the input and a decompression that matches the given input. In contrast, the VAE is a Bayesian model that learns the compressed representation as in the AE and additionally constructs the parameters representing the probability distribution of the data. It can sample from this distribution and generate new data samples. Therefore, the VAE is a generative model, whereas an AE, which only performs reconstruction, does not have an obvious generative interpretation.
VAEs use distribution estimation and sampling to generate new data [100]. To explain this further, suppose we are given a dataset $X = \{x^{(i)}\}_{i=1}^{N}$ in a continuous or discrete high-dimensional space, and suppose the encoding process produces a latent variable $z$ in a relatively low-dimensional space. The generative model can then be divided into two processes:
1) The inference process, in which an inference network $q_\phi(z \mid x)$ approximates the posterior distribution of the latent variable $z$:
$$ q_\phi(z \mid x) \approx p_\theta(z \mid x) \tag{1} $$
2) The generation process of the variable $x$, for which the data likelihood can be defined as:
$$ p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz \tag{2} $$
The posterior distribution of the latent variable $z$ cannot be estimated directly, because the integral in the marginal likelihood $p_\theta(x)$ is intractable. Therefore, the EM algorithm cannot be used to perform the inference. To overcome this difficulty, the VAE introduces the inference model $q_\phi(z \mid x)$ in place of the true posterior distribution.
Specifically, a VAE consists of the following parts: an encoder network that parameterizes a posterior distribution $q(z \mid x)$ over latent random variables $z$ given the input data $x$, a prior distribution $p(z)$, and a decoder with a distribution $p(x \mid z)$ over the input data. Suppose we want to approximate the true posterior $p(z \mid x)$ with some distribution $q(z \mid x)$, measured by the Kullback-Leibler (KL) divergence [101]. Then, by the definition of the KL divergence and Bayes' rule,
$$ D_{KL}[q(z \mid x) \,\|\, p(z \mid x)] = \mathbb{E}_{z \sim q}[\log q(z \mid x) - \log p(x \mid z) - \log p(z)] + \log p(x), $$
which rearranges to
$$ \log p(x) - D_{KL}[q(z \mid x) \,\|\, p(z \mid x)] = \mathbb{E}_{z \sim q}[\log p(x \mid z)] - D_{KL}[q(z \mid x) \,\|\, p(z)]. $$
Since $D_{KL}$ is always non-negative, we can conclude that:
$$ \log p(x) \geq \mathbb{E}_{z \sim q}[\log p(x \mid z)] - D_{KL}[q(z \mid x) \,\|\, p(z)] \tag{3} $$
Equation (3) is an important result, and its right-hand side is known as the Evidence Lower Bound (ELBO) [102]. In a deep neural network implementation of a VAE, the negative of the right-hand side of equation (3) is used as the loss function during training of the network. The term $\mathbb{E}[\log p(x \mid z)]$ denotes the reconstruction, i.e., the generation of the output from the latent representation $z$. The term $D_{KL}[q(z \mid x) \,\|\, p(z)]$ measures how far the distribution of the latent space is from the target distribution $p(z)$. Thus, the two components of equation (3) try to make the output similar to the input while keeping the distribution of the latent space as close to the target distribution $p(z)$ as possible.
The ELBO is tight if $q(z) = p(z \mid x)$, i.e., when $q(z)$ exactly matches the true posterior, so maximizing the ELBO drives $q(z)$ toward the true posterior. For scalability to larger datasets, we do not optimize a separate $q(z)$ for every data point $x$. Instead, an inference network $q_\phi(z \mid x)$, parameterized by a neural network, is introduced that outputs a probability distribution for each data point $x$. Therefore, the final objective is to maximize:
$$ \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{KL}[q_\phi(z \mid x) \,\|\, p(z)] \tag{4} $$
Optimizing the objective in equation (4) requires sampling $z$ from $q_\phi(z \mid x)$, which was introduced to approximate $p_\theta(z \mid x)$. An easy choice is to assume that $q_\phi(z \mid x)$ is Gaussian, so that $z$ can be sampled through the following reparameterization [103]:
$$ z = \mu + \sigma \odot \epsilon \tag{5} $$
where $\epsilon$ is an auxiliary noise variable whose components are drawn from $\mathcal{N}(0, 1)$, i.e., $q(z \mid x)$ is a Gaussian with mean $\mu$ and standard deviation $\sigma$ produced by the encoder.
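The reparameterized sampling step described above can be illustrated with a minimal sketch. It assumes, as is common but not stated in the text, that the encoder outputs a mean and a log-variance per latent dimension; the function name `reparameterize` and the numeric values are hypothetical. In an actual VAE this step would run inside an automatic-differentiation framework so that gradients reach the encoder parameters; the pure-Python version below only shows the sampling arithmetic.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma * eps, with eps ~ N(0, 1) drawn per latent dimension.

    log_var is the log-variance log(sigma^2), so sigma = exp(0.5 * log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


# Hypothetical encoder outputs for one data point (illustrative values only).
mu = [1.0, -2.0]
log_var = [math.log(0.25), math.log(4.0)]  # sigma = 0.5 and 2.0

rng = random.Random(0)
samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]

# The empirical mean and standard deviation should approach mu and sigma.
n = len(samples)
emp_mean = [sum(s[j] for s in samples) / n for j in range(2)]
emp_std = [math.sqrt(sum((s[j] - emp_mean[j]) ** 2 for s in samples) / n)
           for j in range(2)]
print(emp_mean, emp_std)
```

Because the randomness is isolated in the noise variable, the sample is a deterministic function of the mean and standard deviation, which is what makes gradient-based training of the encoder possible.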
Then, taking the prior $p(z)$ to be the standard normal distribution $\mathcal{N}(0, I)$, as is usual for VAEs, the KL divergence between $q(z \mid x)$ and $p(z)$ can be computed in closed form as follows:
$$ D_{KL}[q(z \mid x) \,\|\, p(z)] = -\frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right) \tag{6} $$
where $J$ is the dimension of the latent space and $\mu_j$ and $\sigma_j$ are the mean and standard deviation of the $j$-th component of $q(z \mid x)$.

B. DATA AUGMENTATION APPROACH
The explosion of big data and the development of GPUs provide sufficient training samples and advanced hardware facilities, which help to enhance the performance of deep learning. However, many applications of deep learning can only be realized given massive amounts of high-quality labeled data [100], and many domains still lack sufficient training samples. Data augmentation is a strategy that can significantly increase the data available for training models, so that researchers do not need to collect new data and incur additional labor costs. In addition, insufficient or unbalanced data can lead to over-fitting and over-parameterization problems, resulting in a significant drop in the effectiveness of learning. To this end, previous research augmented data by modifying images via simple transformations such as basic image processing [104]. However, this naïve approach is limited: it lacks intra-class variation and therefore cannot represent the variance of the data well. To solve these problems, attempts have been made to convert the original data variance into feature variance.
In recent years, deep generative models have gained a lot of attention due to numerous applications in machine learning. A generative model aims to learn the features of the input and recover the original data or generate similar data from a latent space distribution, thereby increasing the variance of the dataset [105]. VAEs [19] and Generative Adversarial Networks (GANs) [106] are regarded as the two most popular approaches to generative modeling. However, VAEs do not suffer from the main problems encountered in GANs, namely non-convergence, which causes mode collapse, and difficulty of evaluation [105], [107]. Moreover, VAEs have sound theoretical grounding. First, by introducing the variational lower bound, the intractable calculation of the marginal likelihood is avoided. Second, through the reparameterization trick, the complicated Markov chain sampling of the latent variable is avoided.
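The role of the variational lower bound as a training loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any cited work: it assumes a Bernoulli decoder over binary inputs, a diagonal Gaussian posterior described by a mean and a log-variance, and a standard normal prior. All function names and numeric values are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence between N(mu, diag(sigma^2)) and N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, x_prob, eps=1e-7):
    """log p(x | z) for binary x, given decoder probabilities x_prob."""
    total = 0.0
    for xi, pi in zip(x, x_prob):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += xi * math.log(pi) + (1 - xi) * math.log(1.0 - pi)
    return total


def negative_elbo(x, x_prob, mu, log_var):
    """Single-sample loss: reconstruction term plus KL regularizer."""
    return -bernoulli_log_likelihood(x, x_prob) + kl_to_standard_normal(mu, log_var)


# Hypothetical values: a binary input, the decoder's reconstruction
# probabilities, and the encoder's posterior parameters.
x = [1, 0, 1, 1]
x_prob = [0.9, 0.2, 0.8, 0.6]
mu, log_var = [0.5, -0.3], [-0.1, 0.2]
print(negative_elbo(x, x_prob, mu, log_var))
```

Minimizing this quantity over a dataset is equivalent to maximizing the lower bound; the KL term vanishes only when the posterior equals the prior, which is the case checked in the first assertion of a simple unit test.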
Despite the above-mentioned advantages, VAEs do have some limitations; for example, compared to GANs, the samples they generate tend to be blurry and of lower quality [108]. To address these problems, researchers have proposed many variations of the VAE for different task requirements, such as representation learning, disentanglement, and deep clustering, with the goal of greatly improving the intra-class variation and quality of the generated data [109]. The ability of VAEs to synthesize images at state-of-the-art levels gives hope that the chronic scarcity of labeled data in the biomedical field can be resolved.

C. REPRESENTATION LEARNING APPROACH
The performance of models can be improved by selecting different representations to adjust the difficulty of machine learning [110]. Feature engineering [111] is one of the methods that can refine the representations from raw data.
Feature engineering refers to transforming raw data into advanced training data representations. However, in machine learning, manually selected features rely on human and professional knowledge, which makes feature engineering one of the most time-consuming and labor-intensive parts of the work, and its weakness is the inability to extract and organize discriminative information from the data. Moreover, although our world is inundated with data, a large part of the data is still unlabeled and unorganized. Therefore, the ability to learn meaningful nonlinear feature embeddings with little or no supervision has become a key improvement toward applying AI to the enormous amount of unlabeled data acquired in the world. Recently, many representation learning models have been proposed based on VAEs, where the goal is to learn a mapping from a high-dimensional space to a meaningful low-dimensional embedding. Furthermore, they can learn useful disentangled representations automatically. Representation learning in VAEs is guided by the meta-priors proposed by Bengio et al. [17], and its goal is to produce representations that are useful for downstream tasks. The most important meta-prior is called "disentanglement", an unsupervised learning technique that breaks down, or disentangles, each feature into narrowly defined variables and encodes them as separate dimensions [17]. Assuming that the data are generated from independent factors of variation, if the VAE is trained to reconstruct the samples well, then the latent space between the encoder and decoder retains the important information of the original data. Intuitively, a factorial code disentangles the individual elements that were originally mixed in the sample, just as humans recognize complex things by disentangling independent elements. If the dimensions of the latent vector are independent of each other, the code is factorial and disentangled, i.e., a good representation.
VAEs have made such nonlinear latent variable models tractable for modeling complex distributions and have enabled efficient extraction of relevant biological information from learned features for biological data sets, which is referred to as unsupervised representation learning.
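The notion of a factorial code discussed above can be probed with a simple diagnostic. The sketch below computes the largest absolute pairwise Pearson correlation between latent dimensions over a batch of latent codes. This is only a necessary condition for independence (uncorrelated dimensions can still be dependent), and it is offered as an illustrative check rather than as a disentanglement metric from the cited literature. The function names and the toy codes are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def max_abs_latent_correlation(codes):
    """Largest |correlation| between any two latent dimensions.

    codes is a list of latent vectors, one per sample.
    """
    dims = list(zip(*codes))  # transpose: one tuple per latent dimension
    return max(abs(pearson(dims[i], dims[j]))
               for i in range(len(dims))
               for j in range(i + 1, len(dims)))


# Toy latent codes: the first set is perfectly entangled (dimension 2 is a
# multiple of dimension 1), the second has uncorrelated dimensions.
entangled = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
uncorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
print(max_abs_latent_correlation(entangled))
print(max_abs_latent_correlation(uncorrelated))
```

A value near zero is compatible with a factorial code; a value near one indicates that at least two latent dimensions carry redundant information.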

III. APPLICATION OF VAEs IN BIOMEDICAL INFORMATICS
Over the past decade, there has been a remarkable increase in the amount of available biomedical data. Data types can include images [112], audio [113], textual information [1], high-dimensional omics data [2], heterogeneous data [114], and other information from wearable devices. Recent advances in several areas have led to increased interest in the use of VAE approaches within biomedical informatics and the pharmaceutical industry [9], [115]. The VAE is particularly important not only as a powerful generative model but also for its nonlinear latent feature representation learning. However, using VAEs for validating and visualizing learning in biological datasets is particularly challenging and remains in its infancy. In the following sections, we aim to provide a concise and insightful discussion of the latest advances in applying VAEs to bioinformatics. We particularly highlight the most important techniques in successfully applying VAEs in this field: 1) molecular design; 2) sequence dataset analyses; and 3) medical imaging and image analyses.

A. MOLECULAR DESIGN
VAEs have numerous applications in drug discovery, including de novo molecular design. The general application of VAEs to compound datasets is to generate new chemical/molecular structures. Compound screening refers to the process of selecting compounds with high activity for a specific target through standardized experimental methods. In order to discover and optimize molecules, a chemical space of drug-like molecules, estimated to contain $10^{23}$ to $10^{60}$ compounds, must be searched. Screening for compounds that meet the activity index [116] is a time-consuming process. Moreover, during the optimization process, adjusting one property by changing the molecular structure often has negative effects on another property [31]. VAEs can accelerate this process, with the distribution $p(x \mid z)$ learned by the decoder aiding in the generation of appropriate chemical/molecular structures.
In VAEs, chemicals/molecules are represented as continuous and differentiable vectors residing on a probabilistic manifold. By grouping molecules according to properties of interest, new molecules with desired profiles can be generated by decoding latent vectors from the organized continuous latent space back to discrete molecules. For a given molecule, we can sample nearby points in the latent space to decode similar molecules. As the distance from the original latent point increases, increasingly dissimilar molecules are decoded. A diagram of the VAE used for molecular design is shown in Figure 2.
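The idea of sampling around a molecule's latent point at increasing distances can be sketched as below. The latent vector `z_molecule` and the helper `sample_at_distance` are hypothetical, and no decoder is included: in practice each sampled point would be passed to a trained decoder to obtain a candidate molecule.

```python
import math
import random


def sample_at_distance(z, radius, rng):
    """Return a point at Euclidean distance `radius` from z, in a random direction."""
    direction = [rng.gauss(0.0, 1.0) for _ in z]
    norm = math.sqrt(sum(d * d for d in direction))
    return [zi + radius * d / norm for zi, d in zip(z, direction)]


rng = random.Random(0)
z_molecule = [0.3, -1.2, 0.8, 0.1]  # hypothetical latent code of a seed molecule

for radius in (0.1, 0.5, 2.0):
    z_new = sample_at_distance(z_molecule, radius, rng)
    # A trained decoder would map z_new back to a molecule here.
    print(radius, round(math.dist(z_molecule, z_new), 6))
```

Small radii explore close analogues of the seed molecule, while larger radii move toward regions of the latent space that decode to structurally different compounds.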
The two main molecular representations used by previously reported generative models are 1) a string notation called the Simplified Molecular Input Line Entry System (SMILES) [117], and 2) graphs [118]. Figure 2 shows that, starting from a discrete molecular representation such as a SMILES string, the encoder network converts each molecule into a vector in the latent space, which is effectively a continuous molecular representation. Given a point in the latent space, the decoder network produces a corresponding SMILES string. A molecule represented as a string notation and as a graph is shown in Figure 3 and Figure 4, respectively. Note that the sampling results are generated by the code provided in the original papers [25], [32]. Table 2 lists the literature on the application of VAEs in molecular design.
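Before a SMILES string can be fed to an encoder network, it has to be turned into a fixed-length sequence of integer tokens. The following sketch shows one simplified way to do this. The regular expression treats the two-letter halogens `Cl` and `Br`, bracket atoms, and two-digit ring closures as single tokens; this tokenization scheme, the padding token, and all function names are assumptions for illustration and are not taken from the cited papers.

```python
import re

# Simplified SMILES tokenizer (an assumption, not a full SMILES grammar):
# "Cl", "Br", bracket atoms such as "[NH3+]", and "%NN" ring closures are
# single tokens; every other character is its own token.
TOKEN_PATTERN = re.compile(r"Cl|Br|\[[^\]]+\]|%\d{2}|.")
PAD = "<pad>"


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(smiles_list):
    """Map each token seen in the data to an integer id; id 0 is padding."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate([PAD] + tokens)}


def encode(smiles, vocab, max_len):
    """Convert a SMILES string to a padded list of token ids."""
    ids = [vocab[tok] for tok in tokenize(smiles)]
    if len(ids) > max_len:
        raise ValueError("SMILES string longer than max_len")
    return ids + [vocab[PAD]] * (max_len - len(ids))


training_set = ["CCO", "ClCCBr", "C[NH3+]"]
vocab = build_vocab(training_set)
print(tokenize("ClCCBr"))
print(encode("CCO", vocab, max_len=6))
```

The resulting id sequences can then be one-hot encoded or passed through an embedding layer; the decoder reverses the mapping to emit SMILES strings, which is also where invalid outputs can arise.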
Initial work on generative models for chemical/molecular compound structures focused on SMILES-based methods. Gómez-Bombarelli et al. [25] implemented a VAE-based method that converts the discrete representation of a molecule into a multidimensional continuous representation in order to generate new molecules. The encoder converts the discrete variables of the molecule into continuous variables, and the decoder converts these continuous variables back into discrete variables. New chemical structures can then be generated automatically by decoding vectors from the continuous latent space. However, such methods often produce outputs that are not valid. Kusner et al. [26] used the SMILES grammar, extracted from the training set, to improve the quality of the latent space. Although they saw an improvement, the models still struggled with invalid outputs and undesirable chemical structures such as large carbon rings and uncommon functional groups. Because complex nonlinear methods were used, molecules possessing desired properties could be located in multiple regions of the latent space, which makes it difficult to generate molecules with an entire profile of properties. Mohammadi et al. [31] proposed the Penalized Variational Autoencoder (PVAE), which incorporates a penalty term on the decoder of the VAE and operates directly on SMILES strings. They reported that the PVAE yields a significant improvement in latent space quality and transferability to new chemistry over previous VAE approaches. Schiff et al. [27] examined the latent space of a VAE trained on molecular SMILES representations and demonstrated how well it encodes the 3D topological structure of molecules. Pang et al. [28] proposed a VAE that learns an energy-based prior model in latent space for SMILES molecules. Samanta et al. [29] trained a VAE that provides a rapid and novel metric for calculating molecular similarity. Yan et al. [30] proposed a re-balanced VAE loss to generate more valid SMILES molecules.
As a result of the challenges encountered with SMILES-based methods, attention has shifted to graph-based methods, with the proposal that graphs could be a superior molecular representation for VAE approaches [121]. Jin et al. [32] developed a graph-based VAE that operates on a vocabulary of subgraphs extracted from the training set. Their method greatly improved the quality of the latent space over previous approaches. Liu et al. [33] proposed a variational autoencoder model in which both the encoder and decoder are graph-structured, i.e., a sequential generative model for graphs built from a VAE with Gated Graph Neural Networks (GGNN), for the application of molecule generation. This approach achieved state-of-the-art generation and optimization results. Simonovsky and Komodakis [34] addressed the problem of generating graphs from a continuous embedding in the context of VAEs, using a standard graph matching algorithm to align the output to the ground truth. Tavakoli and Baldi [119] proposed a generative model in the form of a VAE that operates on the 2D graph structure of molecules in order to represent molecules continuously. Shervani-Tabar and Zabaras [35] presented a VAE-based framework for computing the statistics of molecular properties given a small training data set. Nesterov et al. [36] proposed an extended version of the VAE that makes it possible to efficiently generate 3D molecular structures and explore molecular domains within a continuous low-dimensional representation of the molecules. Ragoza et al. [37] proposed a VAE-based model and a fitting algorithm to generate 3D molecular structures by converting continuous grids to discrete molecular structures. Mahmood et al. [38] proposed a mask map model based on VAEs, which can generate novel molecular graphs by iteratively generating a subset of the graph components. Kwon et al. [40] constructed a VAE-based model that compresses the graph representation for scalable molecular graph generation.
While the literature on the creation of novel molecules using VAE models has proliferated, similar works that strive to do the same for inorganic crystal structures are less common but are on the rise. Court et al. [120] presented a VAE-based deep representation learning pipeline for geometrically optimized 3D crystal structures that simultaneously predicts the values of eight target properties. Graph-based molecules sampled from the VAE Gaussian prior distribution are shown in Figure 5. Note that the sampling results are generated by the code provided in the original paper [32]. In addition to generating new molecules, there are also studies on representation learning for decision-making based on VAEs. Koge et al. [39] proposed a method of molecular embedding learning using a combination of a VAE and metric learning. This method simultaneously maintains the consistency of the relationship between molecular structural features and physical properties, resulting in better predictions.

B. SEQUENCE DATASET ANALYSES
Another application of VAEs in bioinformatics is to sequence data such as genetic and amino acid datasets. Using deep learning tools for DNA/amino acid analyses usually requires converting sequences to numbers, because many deep learning algorithms cannot work with categorical data directly. This can be done by one-hot encoding the sequences, i.e., representing categorical variables as binary vectors. Figure 6 shows an example of a wild-type sequence represented as a one-hot m×n matrix, where m is the number of categories and n is the length of the sequence. Red dots indicate which category (amino acid) occupies each position. Note that the sampling results are generated by the code provided in the original paper [46]. As a generative model, the VAE can be used for data augmentation with more meaningful intra-class variations; a basic application of VAEs to sequence data is therefore functional sequence engineering. Moreover, VAEs can be used as a feature dimensionality reduction technique. The main benefit of feature dimensionality reduction is the elimination of redundant features and noise, which can improve prediction accuracy or generalization ability and support the interpretability of research results. Feature dimensionality reduction methods can be divided into two categories: supervised and unsupervised techniques [122]. Supervised techniques such as filter techniques [123], multivariate wrapper techniques [124], and embedded feature reduction techniques [125] select relevant features based on their ability to discriminate between groups. Unsupervised techniques such as Principal Component Analysis (PCA) [126], Independent Component Analysis (ICA) [127], AE, VAE, and Coordinate-Based Meta-Analysis (CBMA) [128] derive relevant features without relying on group labels. The VAE uses the AE structure as a data pre-processing approach that generates representations capturing the structure of the input data and reducing its complexity, without degrading the quality of the data or downstream performance.
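The one-hot m×n representation of a sequence described above can be sketched in a few lines. The 20-letter amino acid alphabet and its ordering are assumptions made for illustration (real pipelines often add gap or unknown-residue symbols), and the function name `one_hot` and the example peptide are hypothetical.

```python
# The 20 standard amino acids; the ordering is an arbitrary illustrative choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence, alphabet=AMINO_ACIDS):
    """Return an m x n matrix (list of rows) for a sequence of length n.

    Row i corresponds to alphabet[i]; column j corresponds to position j.
    Entry [i][j] is 1 if the residue at position j is alphabet[i], else 0.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    matrix = [[0] * len(sequence) for _ in alphabet]
    for position, residue in enumerate(sequence):
        matrix[index[residue]][position] = 1
    return matrix


encoded = one_hot("MKVLA")  # hypothetical five-residue peptide
print(len(encoded), len(encoded[0]))  # m rows, n columns
```

Each column contains exactly one nonzero entry, so the matrix is sparse; flattening it gives the binary input vector that a fully connected encoder would consume.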
In addition, because the structure of a VAE is readily extensible, it can easily handle the integration of multiple heterogeneous data types. Therefore, VAEs can be successfully used in a broader representation learning setting to learn representations in complex integrative analyses of data, eventually leading to more stable and accurate results [55]. Specifically, many sequence databases provide additional information besides the sequences themselves. For example, The Cancer Genome Atlas (TCGA) includes gene expression levels, so such databases can be used for more in-depth biomedical analyses such as predicting the effects of mutations, estimating gene expression levels, and DNA methylation analyses. A pipeline for building a VAE-based system for specific biomedical sequence analyses is shown in Figure 7.
The biggest strength of the VAE in representation learning is that it makes nonlinear latent variable models tractable and able to model complex distributions. This has allowed for efficient extraction of relevant biomedical information from learned features for biological data sets, referred to as unsupervised feature representation learning. Since sequence space is exponentially large and experiments are costly and time-consuming, accurate computational methods are essential for sequence annotation and design. In sequence analyses, modeling every possible higher-order interaction between sequence positions explicitly requires a unique free parameter that must be estimated for each interaction. Consequently, traditional methods such as pairwise models are unable to model higher-order effects. Rather than describing these effects by explicit inclusion of parameters for each type of interaction, it is possible to instead implicitly capture higher-order correlations by means of latent variables. Latent variable models posit hidden factors of variation that explain the observed data, and involve joint estimation of both the hidden variables for each data point and the global parameters describing how these hidden variables affect the observed variables. Towards this goal, numerous machine learning models have been developed to learn features from evolutionary sequence data. Two widely used models for the analysis of genetic data, PCA and admixture analysis [129], can be cast as latent variable models with linear dependencies. Although these linear latent variable models are restricted in the types of correlations that they can model, replacing their linear dependencies with flexible nonlinear transformations can in principle allow the models to capture arbitrary-order correlations between observed variables [57]. Learning robust predictive models based on sequence data is a key step in the development of precision medicine, and extracting meaningful low-dimensional feature representations from molecular data is the key to successfully solving high-dimensional molecular problems. Recent advances in VAEs have made such nonlinear latent variable models tractable for modeling complex distributions. Figure 8 contrasts traditional models that factorize dependencies in sequence families with a VAE model in which a nonlinear latent variable z can jointly influence many positions at the same time.
FIGURE 7. VAEs in sequence analyses. Source: Adapted from [8].

1) FUNCTIONAL SEQUENCE ENGINEERING
Protein engineering is of increasing importance in modern therapeutics. VAEs provide an alternative and potentially complementary approach capable of exploiting the information available in sequence and structure databases. Conditional generation is of particular interest for protein design, where it is frequently desirable to maintain a particular function while modifying a property such as stability or solubility. Guo et al. [41] proposed a VAE-based interpretable approach to generating functionally relevant three-dimensional structures of a protein and showed the promise of generative models in directly revealing the latent space for sampling novel tertiary structures. Killoran et al. [42] proposed VAE-based methods to generate DNA sequences and tune them to have required characteristics. Davidsen et al. [43] proposed a VAE model parameterized by deep neural networks for T cell receptor (TCR) repertoires and showed that simple VAE models can perform accurate cohort frequency estimation, learn the rules of VDJ recombination, and generalize well to unseen sequences. Another approach has been to use trained models to move directly towards sampling sequences that have measurable and desired attributes. Greener et al. [44] showed that conditional VAEs (CVAEs) are able to carry out protein design tasks by conditioning output sequences on desired structural properties. Hawkins-Hooker et al. [45] proposed VAE-based models that operate on aligned sequence input and on raw sequence input, and showed that both can reproduce the amino acid usage pattern of the family.

2) SEQUENCE STRUCTURES DIMENSIONALITY REDUCTION
At present, the use of dimensionality reduction to explain the structure of single-cell sequencing data remains a challenge. Because sequence datasets usually have high dimensionality, performing dimensionality reduction followed by visualization or downstream analyses has become a key strategy for exploratory data analysis of sequence datasets, e.g., single-cell RNA sequencing (scRNA-seq) [130], [131]. Ding et al. [47] provided a VAE-based computational framework to compute low-dimensional embeddings of scRNA-seq data while preserving the global structure of the high-dimensional measurements. Wang and Gu [51] proposed a deep variational autoencoder for unsupervised dimensionality reduction and visualization of scRNA-seq data, which finds nonlinear hierarchical feature representations of the original data and provides better representations of very rare cell populations in the 2D visualization. Hu and Greene [53] evaluated the performance of a simple VAE model developed for bulk gene expression data under various parameter settings and found substantial performance differences with hyperparameter tuning. Ding and Regev [52] introduced a hyperspherical VAE-based model that embeds cells into low-dimensional hyperspherical spaces as a more accurate representation of the data, which overcomes cell crowding and facilitates interactive visualization of large datasets. Rashid et al. [54] introduced a VAE-based method that converts single-cell genomic data into a lower-dimensional feature space so that tumor subpopulations can be separated more effectively. Analysis of the encoded feature space reveals the evolutionary relationship between cell subpopulations. In addition, this method tolerates dropout of gene expression in single-cell RNA sequencing datasets. Dony et al. [48] found that a VAE with a VampPrior is capable of learning biologically informative embeddings without compromising on generative properties. Oh and Zhang [49] used various autoencoders to convert high-dimensional microbiome data into robust low-dimensional representations and applied machine learning classification algorithms to the learned representations. Way et al. [50] compared different methods, including VAEs, for compressing data dimensionality and learning complementary biological representations.

3) INTEGRATED MULTI-OMICS DATA ANALYSES
Multi-omics data cover a wide range of data generated from the transcriptome, proteome, genome, epigenome, and metabolome. To gain a comprehensive understanding of diseases and human health, it is necessary to explain the molecular intricacy and multi-level variations of multi-omics data. By providing an integrated system-level approach, the availability of multi-omics data has revolutionized the fields of biology and medicine. Simultaneously analyzing and integrating different data types can help researchers study the mechanisms and internal structure of biomedical processes at the molecular level. Simidjievski et al. [55] proposed a VAE-based architecture to integrate heterogeneous cancer data types, including multi-omics and clinical data, and performed an extensive analysis. They designed a detailed computational framework for VAEs as a system that can correctly model the nonlinear representations in the integrated data while still reducing the data dimension and achieving good representation learning. He and Xie [56] adopted a VAE-based approach on unlabeled heterogeneous omics data to predict anti-cancer drug sensitivity from somatic mutations with the assistance of gene expression.

4) PREDICT EFFECTS OF MUTATIONS
Accurate prediction of the effects of sequence variation is a major goal in biological research. The traditional approach relies on site-independent or pairwise models [132]-[135]. However, these models are unable to capture higher-order correlations. The most recent improvement in model performance was made possible by the VAE. Riesselman et al. [57] used VAEs to infer biological sequences from large multiple sequence alignments, predict the effects of mutations, and organize sequence information, all grounded in a biologically motivated architecture learned in an unsupervised fashion. The accuracy achieved with the VAE approaches exceeds that of the site-independent and pairwise interaction models. Sinai et al. [46] generated protein sequences using VAEs with the goal of predicting how mutations affect protein function. Figure 8 shows a nonlinear latent variable VAE model that captures higher-order dependencies in proteins.
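To make the baseline concrete, the following is a minimal sketch of a site-independent model of the kind the VAE approaches are compared against. It scores a sequence by summing per-position log frequencies estimated from an alignment and reports a mutation effect as the log-ratio between mutant and wild type. The toy alignment, the pseudocount, and all function names are illustrative assumptions, not the procedure of any cited work.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def site_frequencies(alignment, pseudocount=1.0):
    """Estimate per-column amino acid frequencies with additive smoothing."""
    length = len(alignment[0])
    total = len(alignment) + pseudocount * len(ALPHABET)
    freqs = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        freqs.append({aa: (counts[aa] + pseudocount) / total for aa in ALPHABET})
    return freqs


def log_prob(seq, freqs):
    """Log-probability of a sequence when every position is independent."""
    return sum(math.log(freqs[i][aa]) for i, aa in enumerate(seq))


def mutation_effect(wild_type, mutant, freqs):
    """Log-ratio score; more negative means predicted to be more harmful."""
    return log_prob(mutant, freqs) - log_prob(wild_type, freqs)


# Toy alignment (an assumption for illustration only).
toy_alignment = ["ACDA", "ACDA", "ACEA", "GCDA"]
freqs = site_frequencies(toy_alignment)

# D->E at position 3 is observed in the alignment; C->W at position 2 is not.
conservative = mutation_effect("ACDA", "ACEA", freqs)
disruptive = mutation_effect("ACDA", "AWDA", freqs)
```

Because each position contributes independently to the score, this model cannot represent the case in which a substitution is tolerated only in combination with a change elsewhere in the sequence, which is the kind of higher-order dependency a latent variable model is meant to capture.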

5) GENE EXPRESSION ANALYSES
Since VAE models can perform meaningful representation learning in the latent space, they can generate and explore hypothetical gene expression under the perturbation of different molecules and genetic sequences. As an example, this could be used to predict a tumor's response to specific therapies, or to characterize complex gene expression activations existing in differential proportions in different tumors. The low-dimensional latent space generated by VAEs has been used to reveal complex patterns and novel biological signals from large-scale gene expression data and to carry out drug response predictions. Grønbech et al. [58] proposed a novel variational autoencoder-based method for analyses of scRNA-seq data. It is able to estimate the expected gene expression levels and model a latent representation for each cell, with support for several count likelihood functions. They use a variant of the variational autoencoder that has a priori clustering in the latent space. Way et al. [59] investigated the extent to which a VAE can be trained to model cancer gene expression and capture biologically relevant features. Dincer et al. [60] utilized a VAE-based model to extract low-dimensional features from public unlabeled gene expression data and demonstrated the effectiveness of these low-dimensional representations. They further demonstrated that the learned features are related to drug response predictions. Kim et al. [61] introduced a VAE-based survival prediction model to extract significant gene features that can be used for patient survival prediction. Bica et al. [62] proposed a VAE-based method that can model cell differentiation by constructing meaningful low-dimensional representations from complex, high-dimensional gene expression data.
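The notion of a count likelihood function can be illustrated with two standard choices for read counts, the Poisson and the negative binomial distributions. The sketch below is a minimal illustration under stated assumptions: the function names, the mean/inverse-dispersion parameterization, and the example values are ours and are not taken from [58]. In a VAE for count data, the decoder would output the per-gene means, and the summed log-likelihood would serve as the reconstruction term.

```python
import math


def poisson_logpmf(k, rate):
    """log P(K = k) for a Poisson distribution with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def negative_binomial_logpmf(k, mean, inv_dispersion):
    """log P(K = k) for a negative binomial with mean `mean`.

    `inv_dispersion` (often written r) controls overdispersion; the
    variance is mean + mean**2 / inv_dispersion.
    """
    r = inv_dispersion
    return (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(r / (r + mean))
        + k * math.log(mean / (r + mean))
    )


def reconstruction_log_likelihood(counts, means, logpmf, **params):
    """Sum of per-gene log-likelihoods for one cell."""
    return sum(logpmf(k, m, **params) for k, m in zip(counts, means))


# Hypothetical observed counts for four genes and hypothetical decoder means.
observed = [0, 3, 12, 1]
decoded_means = [0.5, 2.0, 9.0, 1.5]

ll_poisson = reconstruction_log_likelihood(observed, decoded_means, poisson_logpmf)
ll_nb = reconstruction_log_likelihood(
    observed, decoded_means, negative_binomial_logpmf, inv_dispersion=2.0
)
```

The negative binomial adds a dispersion parameter that lets the variance exceed the mean, which is one reason it is a common choice for sequencing counts; as the inverse dispersion grows, it approaches the Poisson.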

6) DNA METHYLATION ANALYSES
DNA methylation is a well-defined epigenetic biomarker for monitoring cancer development and treatment efficacy because of the role it plays in pathways and in the regulation of gene expression. VAEs have been shown to learn latent representations of the DNA methylation landscape from three independent breast tumor datasets, demonstrating the feasibility of using VAEs to track representative differential methylation patterns among clinical subtypes of tumors. Wang and Wang [65] showed that VAEs can learn meaningful signals from merged DNA methylation datasets that credibly represent the subtype of the samples. Qiu et al. [66] presented a VAE-based representation learning framework for transcriptome and methylome data.

C. MEDICAL IMAGING AND IMAGE ANALYSES
Popular medical imaging techniques for the early detection and diagnosis of diseases include magnetic resonance imaging (MRI), computed tomography (CT), ultrasound, X-rays, mammography, and positron emission tomography (PET). Imaging enables scientists to see the phenotype and behavior of host, organ, tissue, cell, and subcellular components. Digital image analyses reveal hidden biology and pathology as well as drug action mechanisms. VAEs have also succeeded in biological image analyses, and many studies report superior performance. The main research areas for VAEs on medical imaging datasets include:
1) Medical image data augmentation for downstream tasks, including image classification [67], [70], [78], [79], image segmentation [71]-[75], [86], image restoration [84], [85], and image reconstruction [71], [80]-[82].
2) Improving the interpretability of representation learning [78], [86], [88], [89].
Table 4 lists the literature on the application of VAEs in medical imaging and image analyses. If a dataset is public, we show its name. If a dataset can only be accessed by its owner or requires appropriate permission, we mark it as private.

1) MEDICAL IMAGE AUGMENTATION FOR DOWNSTREAM TASKS
Deep learning models have shown significant success in analyzing clinical images and utilizing them for downstream tasks such as medical image classification. However, many applications of deep learning can only be realized with a large amount of labeled data, and it is impractical to obtain large amounts of data in clinical medicine. Inspired by humans' ability to learn quickly from a small number of samples, few-sample learning has become a hot topic in artificial intelligence. Few-sample learning addresses the problem of limited training data by imitating the human brain, bringing models closer to real application scenarios. Data augmentation is a strategy that significantly increases the amount of data available for training models without actually collecting new data. The easiest way to generate data is through simple image transformations [104] such as rotation, color jittering, noise addition, and image translation. However, these methods only generate near-duplicate versions of the original data, so the dataset as a whole still lacks intra-class variation. Since VAEs learn variations in the data, they can be used for data augmentation to effectively improve the performance of downstream tasks [136], [137]. This is especially useful in imbalanced dataset problems and few-shot learning, where a few classes may have low representation in the dataset [138].
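The way a trained VAE can supply additional intra-class variation may be sketched as follows: the encoder maps an image to the mean and log-variance of a Gaussian over the latent space, several latent codes are sampled from that Gaussian, and each code is decoded into a new synthetic sample. The code below is a minimal sketch under stated assumptions. The latent dimension, the example values, and in particular `toy_decoder`, which stands in for a trained decoder network, are hypothetical placeholders.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def augment(mu, log_var, decoder, n_samples, seed=0):
    """Decode several latent samples drawn around one encoded input."""
    rng = random.Random(seed)
    return [decoder(sample_latent(mu, log_var, rng)) for _ in range(n_samples)]


def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: maps 2D codes to 2 'pixels'."""
    return [math.tanh(z[0] + z[1]), math.tanh(z[0] - z[1])]


# Hypothetical encoder output for one training image.
mu, log_var = [0.5, -0.2], [-2.0, -2.0]
synthetic = augment(mu, log_var, toy_decoder, n_samples=5)
```

Unlike rotation or translation, the variation here comes from the learned latent distribution, so the quality of the synthetic samples depends entirely on how well the VAE was trained.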
For medical image classification tasks, Pesteie et al. [67] proposed a conditional generative model based on VAEs that learns a latent space independent of the labels to further improve image classification. Biffi et al. [70] proposed a 3D convolutional generative model based on VAEs, which can be used to automatically classify images of patients with cardiac diseases related to structural remodeling and which also improves the interpretability of the VAE, further enhancing its clinical value. Uzunova et al. [78] proposed learning meaningful perturbations of pathological regions by defining a plausible pathology perturbation that replaces pixel values with healthy-looking tissue learned by VAEs, which also improves the interpretability of black-box classifiers. Uzunova et al. [79] used a conditional VAE to learn the entire variability of healthy data and detect pathologies. Díaz Berenguer et al. [68] proposed explainable semi-supervised representation learning based on VAEs for COVID-19 diagnosis from CT imaging. Thiagarajan et al. [69] proposed a VAE-based model to predict the type of lesion. This model distinguishes different types of lesions by extracting interpretable features.
For medical image segmentation tasks, Myronenko [74] proposed a semantic segmentation network for automatically segmenting brain tumors based on the VAE architecture and 3D MRI datasets. This network not only improves the performance of automated segmentation but also consistently achieves good training accuracy under random initialization. For the few-shot learning problem in medical image segmentation, Ouyang et al. [75] proposed a method that combines VAEs with domain adaptation techniques from transfer learning to learn a latent feature space shared by the source domain and the target domain, which can be used for segmentation with only a small target set.
Another approach to few-shot learning is the use of semi-supervised learning in segmentation with VAEs. Sedai et al. [72] proposed a semi-supervised VAE-based method to segment the optic cup, using a small number of labeled samples to accurately localize the anatomical structure. Chen et al. [73] proposed a VAE-based semi-supervised image segmentation method for brain tumor and white matter hyperintensity segmentation. Qian et al. [76] built a novel VAE for estimating object shape uncertainty in medical images.
For medical image reconstruction tasks, Biffi et al. [71] proposed a CVAE-based model that can reconstruct a high-resolution 3D segmentation image of the LV myocardium from three segmentations of a 2D cardiac image. Edupuganti et al. [80] introduced a VAE-based method to analyze the uncertainty in compressed MR image recovery. Gomez et al. [81] proposed an image reconstruction architecture based on β-VAE, which can combine numerous overlapping image patches into a fused reconstruction of real fetal ultrasound images. Mostapha et al. [82] proposed a VAEGAN-based framework for the automatic quality control of structural MR images. Tudosiu et al. [83] proposed a model based on VQ-VAE that can effectively encode a full-resolution 3D brain volume, compressing the data to 0.825% of the original size while maintaining image fidelity. Volokitin et al. [77] proposed a method to model the distribution of 3D MR brain volumes by combining 2D samples obtained by VAEs with a Gaussian model.
Image restoration is the task of removing unwanted noise and distortions to obtain clean images that are closer to the true but unknown signal. Prakash et al. [85] introduced a fully convolutional VAE to generate diverse and plausible denoising solutions sampled from a learned posterior. It can not only produce diverse results but can also be leveraged for downstream processing. Zilvan et al. [84] proposed a denoising convolutional VAE as both a feature extractor and a denoiser for disease detection tasks.

2) REPRESENTATION LEARNING FOR DECISION-MAKING
VAEs can be applied to large datasets with high-dimensional features, although at a high computational cost, and they can learn meaningful fixed-size, low-dimensional feature representations in an unsupervised manner. Lafarge et al. [87] proposed a VAE-based framework to learn orientation-wise disentangled generative factors of histopathology images. Their results show that the aggregated representation of subpopulations of cells yields higher performance in subsequent tasks.
However, given the scarcity of labeled training data in medical datasets, one drawback of such unsupervised deep neural networks is the lack of interpretability. Because the many neurons in a neural network form a many-to-many entangled structure, the learned latent representations are usually not directly interpretable. In medical research, model interpretability is not only important but also necessary, since clinicians are increasingly relying on data-driven solutions for patient monitoring and decision-making.
Established methods such as guided backpropagation [139] and gradCAMs [140] try to gain insight into how neural networks learn and create intuitive visualizations based on the learned network weights. However, they are mostly heuristic and depend on the architecture of the neural network. Another possibility is to use perturbations [141]. For example, Uzunova et al. [78] tackle the interpretability problem by generating plausible explanations through meaningful perturbations using VAEs. Recent scientific advances have combined the interpretability of supervised settings with the power of VAEs. Zhao et al. [88] proposed a VAE-based unified probabilistic model for learning the latent space of imaging data and performing supervised regression. Their results allow for intuitive interpretation of the structural developmental patterns of the human brain. Puyol-Antón et al. [89] proposed a VAE featuring a regression loss in the latent space to simultaneously learn efficient representations of cardiac function and map their change with regard to differences in systolic blood pressure (a measure of hypertension). Chartsias et al. [86] proposed a VAE-based model to learn decomposed, meaningful, spatially disentangled representations of cardiac imaging data and leveraged these for improved semi-supervised segmentation results.

IV. CONCLUSION AND FUTURE OPPORTUNITIES
In this article, we comprehensively summarized the essential concepts of VAEs and their applications in the areas of molecular design, sequence dataset analyses, and medical imaging analyses. Notably, VAEs serve not only as powerful generative models; their nonlinear latent feature representation learning capability has also opened a series of new research directions and applications in biomedical informatics. We particularly highlighted the most important techniques for successfully applying VAEs in the following fields: 1) molecular design, 2) sequence dataset analyses, and 3) medical imaging and image analyses.
In molecular design, VAEs can generate molecules using both SMILES string and graph representations of molecules. The encoder converts the discrete variables of the molecule into continuous variables, and the decoder converts these continuous variables back into discrete variables. However, since molecules possessing desired properties could be located in multiple regions of the latent space, generating molecules with an entire profile of properties is difficult. Therefore, future work may include extensions that tune VAEs to produce molecules with desired properties, together with further experimental validation. In addition, compared with sequence dataset analyses and medical imaging and image analyses, there are few studies on representation learning for decision-making using VAEs on molecular datasets. We hope that more research will appear in this field in the future.
In sequence dataset analyses, VAEs offer advantages not only as generative models and feature dimensionality reduction techniques but also in nonlinear representation learning. In particular, VAEs provide an alternative and potentially complementary approach capable of exploiting the information available in sequences and tuning them to have desired properties. Moreover, VAEs can be used for more in-depth biomedical analyses, such as integrated multi-omics data analyses, prediction of the effects of mutations, gene expression analyses, and DNA methylation analyses, so that more accurate and stable results can be achieved. The main benefit of feature dimensionality reduction is the elimination of redundant features and noise, which can improve prediction accuracy or generalization ability and support the interpretability of research results. Although VAE models in sequence dataset analyses face challenges, such as a lack of interpretability and an increased potential for overfitting, VAEs may become increasingly important in high-throughput design and biological sequence annotation. The applications of VAEs in sequence dataset analyses are currently at a relatively early stage of development. Therefore, we expect both their application in this area and improvements in their generality, such as higher-quality feature reduction approaches [142] and better interpretability, to expand in the future.
In medical imaging and image analyses, since VAEs learn variations in the data, they can be used for image augmentation to effectively improve the performance of downstream tasks such as medical image classification, segmentation, restoration, and reconstruction. This is especially useful in imbalanced dataset problems and few-shot learning, where a few classes may have low representation in the dataset. However, high sensitivity to input parameters and long running times are disadvantages of deep learning-based VAEs [143]. In the future, we expect more research to achieve high classification accuracy with low computational complexity using VAEs. On the other hand, even though a VAE can learn meaningful fixed-size, low-dimensional feature representations in an unsupervised manner, one drawback of such unsupervised deep neural networks is the lack of interpretability.
The application of deep learning to EHR data for clinical informatics research has recently increased. Compared to traditional methods in clinical informatics, deep learning methods offer better performance and require less time, data preprocessing, and representation learning cost. However, there is not yet research on the application of VAEs to EHR data [7]. EHR data are very heterogeneous and can include multiple data types. This is in sharp contrast to the homogeneity of input data consisting only of image pixels or only of natural language processing (NLP) characters. The key requirement of EHR dataset analyses is to obtain recognized and meaningful findings from such high-dimensional, sparse, and complex clinical data. The ability of VAEs to learn feature representations at a higher level of abstraction, as discussed in this article, may be utilized to obtain meaningful and powerful research results from EHRs. In the future, we expect more research to focus on the application of VAEs in EHR data analyses.
One of the most important goals of representation learning for biomedical informatics research is to make decisions based on VAEs. Many biomedical applications require decisions. In this survey, we have described how researchers have accordingly attempted to use VAEs for decision-making applications, such as mutation-effect prediction for genomic sequences [57]. To make decisions based on VAEs, researchers implicitly appeal to Bayesian decision theory, for example by taking the action that minimizes the expected loss under the posterior distribution [144]. Lopez et al. [145] proposed a three-step procedure for using VAEs for decision-making.
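The decision rule of minimizing expected loss under the posterior can be sketched with a Monte Carlo approximation, in which the expectation is replaced by an average over posterior samples. The example below is a minimal, generic illustration and is not the three-step procedure of [145]. The variant-flagging scenario, the loss values, and the posterior samples are hypothetical.

```python
def expected_loss(action, posterior_samples, loss):
    """Monte Carlo estimate of the posterior expected loss of one action."""
    return sum(loss(action, s) for s in posterior_samples) / len(posterior_samples)


def bayes_action(actions, posterior_samples, loss):
    """Return the action with the smallest estimated expected loss."""
    return min(actions, key=lambda a: expected_loss(a, posterior_samples, loss))


def asymmetric_loss(action, is_deleterious):
    """Hypothetical costs: missing a deleterious variant is 10x a false flag."""
    if action == "ignore" and is_deleterious:
        return 10.0
    if action == "flag" and not is_deleterious:
        return 1.0
    return 0.0


def symmetric_loss(action, is_deleterious):
    """Hypothetical costs: both kinds of error cost the same."""
    correct = (action == "flag") == is_deleterious
    return 0.0 if correct else 1.0


# Hypothetical posterior samples: the variant is deleterious in 2 of 10 draws.
samples = [True] * 2 + [False] * 8
actions = ["flag", "ignore"]

choice_asymmetric = bayes_action(actions, samples, asymmetric_loss)
choice_symmetric = bayes_action(actions, samples, symmetric_loss)
```

With the asymmetric loss, flagging has an expected loss of 0.8 against 2.0 for ignoring, so the variant is flagged even though it is probably benign; with the symmetric loss the decision reverses. The quality of such a decision therefore depends on both the loss function and how well the approximate posterior is calibrated.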
One drawback of such unsupervised deep neural networks is the lack of interpretability. Because the many neurons in a neural network form a many-to-many entangled structure, the learned latent representations are usually not directly interpretable. In medical research, model interpretability is not only important but also necessary, as clinicians are increasingly relying on data-driven solutions for patient monitoring and decision-making. Recent scientific advances have combined the interpretability of supervised settings with the power of VAEs. However, more research is needed to ensure the worst-case performance of VAE-based models in diagnostic problems before they are used in high-stakes decision-making scenarios.
We provided a comprehensive review of VAEs and their current representation learning research directions in biomedical informatics. We also provide interactive visual code on GitHub for some of the papers to make this survey more useful. We hope that this will help improve the state of the art and lead to research breakthroughs in related fields.
RUOQI WEI received the M.Sc. degree in computer science from the Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT, USA, in 2016, where she is currently pursuing the Ph.D. degree. Her research interests include machine learning, deep learning, transfer learning, few-shot learning, computer vision, and biomedical informatics.
AUSIF MAHMOOD (Member, IEEE) is currently a Professor with the Department of Computer Science and Engineering. He is also the Director of the School of Engineering, University of Bridgeport. His research interests include computer vision, machine and deep learning, computer architecture, and parallel processing.