Revisiting Bayesian Autoencoders with MCMC

Autoencoders gained popularity in the deep learning revolution given their ability to compress data and provide dimensionality reduction. Although prominent deep learning methods have been used to enhance autoencoders, the need for robust uncertainty quantification remains a challenge, which so far has been addressed mainly with variational autoencoders. Bayesian inference via Markov Chain Monte Carlo (MCMC) sampling has faced several limitations for large models; however, recent advances in parallel computing and advanced proposal schemes have opened routes less traveled. This paper presents Bayesian autoencoders powered by MCMC sampling, implemented using parallel computing and a Langevin-gradient proposal distribution. The results indicate that the proposed Bayesian autoencoder provides accuracy comparable to related methods in the literature while also providing uncertainty quantification in the reduced data representation. This motivates further applications of the Bayesian autoencoder framework for other deep learning models.

Bayesian neural networks incorporate alternative training via Bayesian inference, in which the model parameters (weights and biases) are jointly represented by a probability distribution rather than the point estimates given by conventional gradient-based training methods [23]. The posterior probability distribution naturally accounts for uncertainty in parameter estimates, which is further propagated into the decision-making process [24], [25]. Bayes' theorem is the foundation for inference in Bayesian neural networks. Markov Chain Monte Carlo (MCMC) sampling methods [26] implement Bayesian inference by estimating the posterior distribution, which is often referred to as sampling from the posterior distribution. Variational inference [27] provides another way to approximate the posterior distribution, replacing an intractable posterior with a tractable approximation. This makes it particularly suited to large datasets and models, and so it has been popular for autoencoders and neural networks [13], [28]. Variational autoencoders [13] employ variational inference [27] as a form of regularisation of the model parameters (weights) during training [2], [14]-[16]. A deep learning model can feature tens of thousands to millions of parameters that are typically optimized by gradient-based algorithms. The major challenge of MCMC methods for training (sampling) deep learning models is the computational time required for constructing a good approximation of the posterior distribution, incurring heavy computational costs. However, progress in incorporating gradient-based proposals into MCMC sampling has been inspired by earlier works [29], [30] for the development of Hamiltonian and Langevin-based MCMC methods [24], [25], [31].
Langevin-gradient proposal distributions have been demonstrated to be effective for Bayesian neural learning [32], [33], and combining them with parallel computing via tempered MCMC resulted in further improvements [32]; this motivates their application to deep autoencoders that can feature thousands of model parameters.
This paper presents Bayesian autoencoders powered by tempered MCMC sampling that incorporates parallel computing and a Langevin-gradient proposal distribution. We demonstrate the effectiveness of Bayesian autoencoders on benchmark datasets that involve small-scale to large-scale models featuring a maximum of about a million parameters. We also investigate the effect of two prominent gradient-based training methods and develop an adaptive Langevin-gradient proposal distribution for tempered MCMC. We finally compare our results with the literature where gradient-based methods have been used for the same datasets. The Bayesian autoencoder provides a principled approach to uncertainty quantification in the reduced data representation using MCMC. It can be used to reconstruct an ensemble of reduced datasets from the posterior distribution, rather than the single one given by the point estimates of gradient-descent learning in conventional autoencoders. Hence, once the posterior distribution is obtained, different estimators for the Bayesian autoencoder's parameters are possible, such as the maximum a posteriori probability (MAP) estimate [34].
The rest of the paper is organized as follows. In Section II, we present a background and literature review of related methods. Section III presents the proposed methodology, followed by experiments and results in Section IV. Section V provides a discussion, and Section VI concludes the paper with directions of future work.

A. AUTOENCODERS
Autoencoders compress data by providing a reduced feature set that may be used for supervised learning. Autoencoders have been used with supervised learning methods such as simple neural networks [35], recurrent neural networks [36], long short-term memory (LSTM) networks [37], and graph neural networks [38]. Depending on the dataset, the features from other deep learning methods are used to augment the canonical autoencoder. For instance, a deep convolutional autoencoder incorporates the pooling and convolutional layers used in convolutional neural networks for unsupervised learning tasks [39]. The applications of autoencoders in machine learning problems include data classification, dimensionality reduction, pattern recognition, image denoising, anomaly detection, and recommender systems [40]-[45]. Autoencoders have been successfully applied to a number of real-world applications such as cyber-security [46], medical imaging [47], and biometrics [36]. Furthermore, a review of autoencoders has been done for geoscience and remote sensing with a focus on hyperspectral images [7].
Next, we review some of the key autoencoder architectures from the literature [48]. Sparse autoencoders [49] capture latent, rather than redundant, representations of the data using a loss function that penalizes activation of the hidden layer, hence enforcing a sparsity constraint. Denoising autoencoders [50] add artificial noise to the data, after which the autoencoder attempts to reconstruct the original data from the noisy data. This prevents the autoencoder from capturing an over-complete representation of the data, which is common to many conventional autoencoders. Contractive autoencoders [51] make the reduced representation less sensitive to small variations in the input samples; this is achieved by adding a regularizer to the loss function that penalizes the representation's sensitivity to the input. Convolutional autoencoders [52], [53] create a reduced representation of image data, which is helpful for supervised machine learning.
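To make the denoising objective concrete, below is a minimal numpy sketch (our own illustration, not code from the paper): the input is corrupted with Gaussian noise, and the loss scores the reconstruction of the *clean* input. The `corrupt` and `denoising_loss` helpers and the untrained toy weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, noise_std=0.1):
    """Denoising-autoencoder corruption: add Gaussian noise to the input."""
    return x + rng.normal(0.0, noise_std, size=x.shape)

def denoising_loss(x, encode, decode, noise_std=0.1):
    """Reconstruct the clean x from its corrupted version (MSE objective)."""
    x_noisy = corrupt(x, noise_std)
    x_hat = decode(encode(x_noisy))
    return np.mean((x - x_hat) ** 2)

# Toy linear encoder/decoder; the weights are random, not trained.
W_enc = rng.normal(size=(3, 2)) * 0.1
W_dec = rng.normal(size=(2, 3)) * 0.1
x = rng.normal(size=(5, 3))
loss = denoising_loss(x, lambda v: v @ W_enc, lambda h: h @ W_dec)
```

In training, minimizing this loss over the weights forces the model to learn structure that survives the corruption, rather than copying the input.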

B. BAYESIAN DEEP LEARNING
The major limitation of conventional autoencoders is the lack of uncertainty quantification in the estimation of model parameters, along with the challenge of tuning hyperparameters. Bayesian neural networks address these by treating the parameters of the model (weights and biases) as random variables, in contrast to the single point estimates of backpropagation learning that employs gradient-based methods [23], [26]. The progress in neural networks has led to a revolution in deep learning models; however, progress in Bayesian neural networks and Bayesian deep learning has been relatively slow due to certain challenges [54], [55]. Bayesian inference implemented with canonical MCMC methods has faced major difficulties in handling the large number of model parameters. Variational inference provides an alternative Bayesian inference approach, with successful implementations in variational autoencoders [56], generative adversarial networks (GANs) [57], variational convolutional neural networks (CNNs) [58]-[61], variational recurrent neural networks via LSTM networks [62], and variational graph neural networks [21], [63], [64]. Further details about their applications are given in [54]. We note that variational autoencoders also use a probabilistic representation of model parameters, as opposed to the single-value representation in conventional autoencoders, facilitating uncertainty quantification [65]-[67]. Finally, the importance weighted autoencoder extends the variational autoencoder to use multiple samples to approximate more complex posteriors [68].
The limitations of MCMC sampling have been addressed with better computational resources and advanced proposal distributions, incorporating gradients [24], [25], [31]. Hamiltonian MCMC sampling methods have been used for Bayesian neural networks [69] with enhanced computation strategies [70]. Langevin-based MCMC methods have been successfully used in the implementation of Bayesian neural networks for pattern classification and time series prediction problems [71]. A major challenge in MCMC sampling has been in addressing big data and computationally expensive models; hence, surrogate-assisted estimation, in which a low-cost surrogate model provides an approximation of the likelihood, has been used to address this issue [72], [73]. Langevin-based MCMC methods have also been used in Bayesian neural networks for transfer learning given multiple sources of data [74]. In previous research, simple Bayesian neural networks have been used that at most had several thousand model parameters [72], [74]; hence, the challenge lies in adapting them for autoencoder-based deep learning models that can feature up to a million parameters.

A. MODEL AND PRIORS
An autoencoder consists of two major parts: an encoder function f_φ(x) and a decoder function f_ψ(h), where x is the data that represents a set of features, and h is a set of reduced (latent) features. The autoencoder is assessed by how well the decoder can reconstruct the data from the encoder output, using a loss function

$$ R_{loss}(\phi, \psi) = \frac{1}{N} \sum_{t=1}^{N} \left( x_t - f_\psi(f_\phi(x_t)) \right)^2 $$

where φ and ψ represent the parameters (weights and biases) of the encoder and decoder, respectively.
In particular, z(x, θ) represents a feedforward neural network, with θ being its set of weights and biases. An encoder-decoder pair is then constructed from it, as given below:

$$ h = f_\phi(x) = z(x, \phi), \qquad \hat{x} = f_\psi(h) = z(h, \psi) $$

The encoder extracts the essential features into a reduced representation, while the decoder reconstructs the input from the reduced representation, as shown in Figure 1. The challenge is in determining the optimal θ = (φ, ψ); typically, gradient-based learning algorithms are used to minimize R_loss [1], [2]. In case the data are labeled, another way to measure the performance of the autoencoder is to train another neural network model using the features from the reduced representation (h) and compare its performance with that obtained when trained on the original data. In contrast to principal component analysis (PCA), autoencoders are capable of nonlinear data reduction and can detect repetitive structures in the data [75].
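The encoder-decoder construction can be sketched in a few lines of numpy. This is a minimal illustration assuming a single sigmoid layer for each of z(x, φ) and z(h, ψ); the autoencoders in the paper are deeper, and the dimensions and weights below are placeholders.

```python
import numpy as np

def z(x, W, b):
    """One-layer feedforward map z(x, theta) with sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

rng = np.random.default_rng(1)
D, H = 4, 2                                         # input and latent sizes
phi = (rng.normal(size=(D, H)) * 0.1, np.zeros(H))  # encoder weights/biases
psi = (rng.normal(size=(H, D)) * 0.1, np.zeros(D))  # decoder weights/biases

x = rng.uniform(size=(10, D))   # data normalised to [0, 1]
h = z(x, *phi)                  # reduced (latent) representation
x_hat = z(h, *psi)              # reconstruction
R_loss = np.mean((x - x_hat) ** 2)
```

Training (whether by gradient descent or by the MCMC sampler described later) amounts to choosing θ = (φ, ψ) so that R_loss is small.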
Our likelihood function compares the original data x with the decoder output x̂ = z(x, θ) = z(z(x, φ), ψ). We incorporate the parameter τ², which denotes the variance of the data and is also estimated; hence, our combined set of parameters becomes Θ = (θ, τ²). We only consider continuous data for the Bayesian autoencoder, where mean-squared-error (MSE) is used as the loss function [76]. Hence, we assume the data to have a Gaussian distribution and use a Gaussian likelihood given by

$$ P(x|\Theta) = \frac{1}{(2\pi\tau^2)^{N/2}} \exp\left( -\frac{1}{2\tau^2} \sum_{t=1}^{N} (x_t - \hat{x}_t)^2 \right) $$

where x̂_t is the output of the neural network model for the input features x_t of instance t in the training data, and N is the total number of instances in the training data. The priors are based on a Gaussian distribution for θ and an inverse-gamma distribution for τ², respectively. Hence, the autoencoder weights and biases (θ) are independent a priori, each following a normal distribution with zero mean and variance σ².
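In practice the likelihood is evaluated in log-scale for numerical stability; a minimal sketch of the Gaussian log-likelihood above:

```python
import numpy as np

def log_likelihood(x, x_hat, tau_sq):
    """Gaussian log-likelihood of the data given the reconstruction:
    log P(x|Theta) = -(N/2) log(2 pi tau^2) - (1/(2 tau^2)) sum_t (x_t - x_hat_t)^2
    """
    n = x.size
    resid = np.sum((x - x_hat) ** 2)
    return -0.5 * n * np.log(2 * np.pi * tau_sq) - resid / (2 * tau_sq)
```

For a tempered replica, this value is simply divided by the temperature before the acceptance test.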
If θ features L model parameters in total, its joint prior with τ² is

$$ P(\Theta) \propto \frac{1}{(2\pi\sigma^2)^{L/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{L} \theta_i^2 \right) \times (\tau^2)^{-(1+\nu_1)} \exp\left( -\frac{\nu_2}{\tau^2} \right) $$

where ν_1 and ν_2 are user-chosen constants. We update τ² via a random-walk proposal distribution and use the log-scale to avoid numerical instabilities in computing the prior and likelihood.
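The log of this joint prior is equally simple to evaluate; a sketch (the value of σ² below is an illustrative placeholder, not the paper's setting):

```python
import numpy as np

def log_prior(theta, tau_sq, sigma_sq=25.0, nu1=0.0, nu2=3.0):
    """Log joint prior: Gaussian on the weights theta, inverse-gamma on tau^2.
    log P(Theta) = -(L/2) log(2 pi sigma^2) - sum_i theta_i^2 / (2 sigma^2)
                   - (1 + nu1) log(tau^2) - nu2 / tau^2  (up to a constant)
    """
    L = theta.size
    gauss = -0.5 * L * np.log(2 * np.pi * sigma_sq) - np.sum(theta ** 2) / (2 * sigma_sq)
    inv_gamma = -(1 + nu1) * np.log(tau_sq) - nu2 / tau_sq
    return gauss + inv_gamma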

B. LANGEVIN-GRADIENT METROPOLIS-HASTINGS
We now present the MCMC sampler for training the model parameters (sampling the posterior) of the Bayesian autoencoder. We use a combination of methods: (i) an efficient proposal distribution that uses Langevin-gradients; (ii) parallel computing; and (iii) efficient multimodal sampling via tempered MCMC (parallel tempering). We utilize the Langevin-gradient proposal distribution that incorporates Gaussian noise with gradients for a single iteration (sample). We note that the Langevin-gradient proposal distribution has been effective for Bayesian neural networks in relatively small and large architectures of up to several thousand model parameters [32], [77]-[79]; however, in this work we experiment with deep autoencoders that feature up to a million model parameters.
The Langevin-gradient (LG) proposal distribution for a given sample (n) proposes to update the vector of weights and biases θ_n via a biased multivariate normal distribution

$$ \theta^p \sim \mathcal{N}(\bar{\theta}_n, \nu_4^2 I_L), \qquad \bar{\theta}_n = \theta_n - \nu_3 \nabla E(\theta_n, x) $$

where ν_3 (the learning rate) and ν_4² are user-defined tuning parameters, and I_L is an L × L identity matrix used to generate the stochastic noise. We take into account the loss function E(θ, x) for the autoencoder model f() with data x,

$$ E(\theta, x) = \sum_{t=1}^{N} \left( x_t - f(x_t, \theta) \right)^2 $$

whose gradient ∇E(θ, x) is computed via backpropagation. The proposal attempts to "explore" the posterior density, given our prior P(Θ) and likelihood P(x|Θ) defined in Equations (4) and (3). Furthermore, with Θ_n = (θ_n, τ²), we either accept or reject the proposal Θ^p using the standard Metropolis-Hastings criterion,

$$ \alpha = \min\left\{ 1, \frac{P(\Theta^p)\, P(x|\Theta^p)\, Q(\Theta_n|\Theta^p)}{P(\Theta_n)\, P(x|\Theta_n)\, Q(\Theta^p|\Theta_n)} \right\} $$

where Q(Θ^p|Θ_n) is the conditional proposal density, and vice versa. In general, ∇_θ log{P(Θ_n)P(x|Θ_n)} ≠ ∇_θ log{P(Θ^p)P(x|Θ^p)}, and so Q(Θ^p|Θ_n) ≠ Q(Θ_n|Θ^p); the proposal is asymmetric, and thus the proposal densities do not cancel in Equation (8).
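A single LG proposal plus Metropolis-Hastings test can be sketched as follows in numpy. The `grad_E` callback and tuning values are placeholders, and the log proposal densities are passed in precomputed; this is an illustration of the update rule, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def langevin_proposal(theta, grad_E, nu3=0.01, nu4=0.005):
    """One Langevin-gradient proposal: gradient step plus Gaussian noise."""
    theta_bar = theta - nu3 * grad_E(theta)      # drift towards lower loss
    return theta_bar + rng.normal(0.0, nu4, size=theta.shape)

def mh_accept(log_post_prop, log_post_curr, log_q_back, log_q_fwd):
    """Metropolis-Hastings test with the asymmetric-proposal correction Q.
    log_q_back = log Q(theta_curr | theta_prop), log_q_fwd = log Q(theta_prop | theta_curr)."""
    log_alpha = (log_post_prop - log_post_curr) + (log_q_back - log_q_fwd)
    return np.log(rng.uniform()) < min(0.0, log_alpha)
```

Because the Gaussian is centred at a gradient-shifted mean, the forward and backward proposal densities differ, which is exactly why the Q terms must be kept in the acceptance ratio.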

C. ADAPTIVE LANGEVIN-GRADIENT PROPOSAL DISTRIBUTION
In practice, in our trial experiments, we found that using the LG proposal distribution directly produces relatively slow mixing, leading to inferior results in the case of large neural networks such as autoencoders. Hence, we formulate a new proposal distribution, borrowing ideas from the popular Adam optimizer [80]. The Adam-based weight update in the conventional learning paradigm is expressed as

$$ m_n = \beta_1 m_{n-1} + (1 - \beta_1) \nabla E(\theta_n, x) $$
$$ v_n = \beta_2 v_{n-1} + (1 - \beta_2) \left[ \nabla E(\theta_n, x) \right]^2 $$
$$ \hat{m}_n = \frac{m_n}{1 - \beta_1^n}, \qquad \hat{v}_n = \frac{v_n}{1 - \beta_2^n} $$
$$ \theta_{n+1} = \theta_n - \alpha \frac{\hat{m}_n}{\sqrt{\hat{v}_n} + \epsilon} $$

where β_1 and β_2 control the first and second moment estimates, respectively, and ε is a small scalar used to prevent division by zero. α is the user-defined learning rate; the typical default values are β_1 = 0.9, β_2 = 0.999, α = 10⁻³, and ε = 10⁻⁸ (as in the PyTorch implementation 1 ). In our previous work [79], Adam-based gradients were utilized in an adaptive Langevin-gradient (adapt-LG) proposal distribution for convolutional graph neural networks, which produced better results than the LG proposal distribution. We note that the learning rate is a user-chosen hyperparameter that is fixed during sampling for both the LG and adapt-LG proposal distributions. Stochastic gradient descent (SGD) does not adapt the gradients, while Adam uses a heuristic to adapt them during optimization. Hence, Adam-based updates make use of earlier values of the gradient, which makes the Markov transition kernel in MCMC time-inhomogeneous. In our Bayesian autoencoder framework, we present an enhanced version, denoted adapt-LG*, which only makes use of adapt-LG in the burn-in phase and switches to LG afterwards. In this way, we address the inhomogeneous nature of adapt-LG. Hence, we use the adapted gradient ∇̂E(θ, x) to obtain the adapt-LG proposal distribution

$$ \theta^p \sim \mathcal{N}(\bar{\theta}_n, \nu_4^2 I_L), \qquad \bar{\theta}_n = \theta_n - \nu_3 \frac{\hat{m}_n}{\sqrt{\hat{v}_n} + \epsilon} $$

where ν_3 is the user-defined learning rate previously denoted as α.
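The adapted gradient used by adapt-LG can be sketched as a small helper that threads the moment estimates through successive proposals (an illustrative numpy reimplementation under our own naming, not the paper's code):

```python
import numpy as np

def adam_gradient(grad, m, v, n, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style adapted gradient for the adapt-LG proposal.

    grad: raw gradient of the loss at the current sample
    m, v: running first/second moment estimates; n: 1-based step count.
    Returns (bias-corrected update direction, new m, new v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** n)
    v_hat = v / (1 - beta2 ** n)
    return m_hat / (np.sqrt(v_hat) + eps), m, v
```

The direction returned here replaces the raw gradient in the Langevin drift term; because it depends on earlier gradients through m and v, the resulting kernel is time-inhomogeneous, which is why adapt-LG* confines it to the burn-in phase.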

D. TEMPERED MCMC FRAMEWORK
The development of tempered MCMC (also called parallel tempering and replica-exchange MCMC) has been motivated by the cooling behaviour of certain materials in the field of thermodynamics [81]-[83]. In tempered MCMC, an ensemble of M replicas Ω = [R_1, R_2, ..., R_M] is executed, each with a distinct temperature value from the ladder T = [1, ..., T_max], where T_max defines the extent of exploration. A replica at temperature t samples from an attenuated posterior distribution

$$ P_t(\Theta|x) \propto P(x|\Theta)^{1/t}\, P(\Theta) $$

This serves to "flatten" the distribution, giving replicas with high t a generally higher probability of accepting proposals. The ensemble of replicas facilitates enhanced exploration capabilities; in particular, it enables a given replica to escape from a local maximum, allowing multi-modal and discontinuous posteriors to be explored [84], [85].
Inferentially, this can be viewed as augmenting the sampler state Θ_n into (θ, τ², t) and sampling jointly, where the realisations at t = 1 follow the target distribution. Therefore, periodically, a proposal can exchange the states of neighbouring replicas, swapping Θ^t and Θ^{t'} with t' = t + 1, which is accepted with probability

$$ \alpha = \min\left\{ 1, \frac{P(x|\Theta^{t'})^{1/t}\, P(x|\Theta^{t})^{1/t'}}{P(x|\Theta^{t})^{1/t}\, P(x|\Theta^{t'})^{1/t'}} \right\} $$

since the proposal is symmetric and the priors cancel. We note that in the temperature ladder T, not only T_max is important but also the distribution of the temperature values. Specifically, for the geometric path between prior and posterior, one should ideally choose temperatures such that the sum of the Kullback-Leibler (KL) divergences between successive tempered distributions is minimized. However, we use a geometric temperature ladder that worked well for Bayesian neural networks in our previous research [77], [79].
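In log-scale the replica-exchange acceptance probability reduces to a one-liner, since the priors and the symmetric proposal cancel; a sketch (function name is ours):

```python
import numpy as np

def swap_log_alpha(log_lik_a, log_lik_b, t_a, t_b):
    """Log acceptance probability for exchanging two neighbouring replicas.

    log alpha = (1/t_a - 1/t_b) * (log L_b - log L_a), capped at 0,
    where L_a, L_b are the (untempered) likelihoods of the two states."""
    return min(0.0, (1.0 / t_a - 1.0 / t_b) * (log_lik_b - log_lik_a))
```

Note that a hotter replica holding a higher-likelihood state is always passed down the ladder (log alpha = 0), which is how good states found during exploration reach the t = 1 target chain.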

E. BAYESIAN AUTOENCODER
We now present the details for implementing a tempered MCMC framework for Bayesian autoencoders that utilize parallel computing. Figure 2 presents the framework that features the autoencoder model, tempered MCMC replica ensemble, and the inter-process communication for the exchange of neighboring replica states.
We present further details in Algorithm 1, where we first define the autoencoder topology by specifying the number of inputs, hidden layers, and outputs for the encoder and the decoder (in reversed order). Secondly, we define the hyperparameters for the tempered MCMC sampler, such as the number of replicas (M), the maximum temperature of the geometric temperature ladder (T_max), the replica swap interval (R_swap) that determines how often to propose an exchange between neighboring replicas, and the maximum number of samples (iterations) for each replica (R_max). In parallel processing, it is difficult to determine when to stop sampling; hence, this is done by checking the number of alive (active) replica processes. Therefore, we initialise the number of alive replicas in the ensemble as alive = M. These are the key steps of the initialization shown in Stage 0 of Algorithm 1. The algorithm uses tempered MCMC in its first phase and then switches to canonical MCMC in its second phase, where the switch point is defined by the number of samples R_switch. During the switch, all the temperature values in the ladder are changed to 1, as done in our earlier works [32], [86]. The algorithm implements the adapt-LG* proposal distribution (adapt-LG in the first phase and LG in the second phase) using R_switch.
We then begin within-replica sampling (Stage 1.1) by iterating through the respective replicas, where the manager process executes and manages the parallel processes. In Stage 1.2, each replica creates a proposal using either the random-walk or the Langevin-gradient (either LG or adapt-LG*) proposal distribution (Equation (5)), computes the likelihood (Stage 1.3), and uses the Metropolis-Hastings condition in Stage 1.4 to check whether the proposal should become part of the posterior distribution. In Stage 1.5, the algorithm checks whether it needs to use tempered MCMC or transform into canonical MCMC, where the temperature values of all the replicas in the ensemble are set to 1. Hence, tempered MCMC employs the adapt-LG and canonical MCMC employs the LG proposal distribution. Note that the samples obtained by tempered MCMC are later discarded as part of the burn-in period. In Stage 2.1, the algorithm computes the replica-exchange acceptance probability, similar to Equation (11), and then exchanges the neighboring replicas using the Metropolis-Hastings condition (Stage 2.2). Note that we represent the likelihood in log-scale to avoid numerical instabilities.
In Stage 3, the algorithm checks if (R max ) has been reached, and if so, the number of replicas alive is decremented to eventually stop the entire sampling process. Finally, we reach the end in Stage 4 where the sampling process ends, and we combine the respective replica posterior distributions of the Bayesian autoencoder for further analysis.
Rather than swapping the entire replica state (weights and biases of the autoencoder), it is possible to swap the respective neighboring replica temperature values, which is computationally cheaper. Simply swapping the neighboring replicas' temperature values is useful in a parallel computing implementation, where inter-process communication can be memory intensive and computationally costly.

A. DATASET DESCRIPTION
We select benchmark datasets commonly used in the literature on autoencoders and data reduction methods. The datasets feature distinct properties in terms of size, features, and application. These include Madelon [87], Coil-2000 [88] and Swiss Roll [89], which are summarised in Table 1. Madelon is a synthetic dataset that features data points grouped in 32 clusters, each placed on a vertex of a five-dimensional hypercube. The clusters are randomly labeled +1 or -1. In addition to these 5 coordinates, 15 linear combinations have been added for a total of 20 (redundant) informative features, as well as 480 random distractors. Coil-2000 has 86 variables containing customer information, including product usage data and socio-demographic data derived from postal codes, and whether the customer has a caravan insurance policy. Lastly, the synthetic Swiss Roll dataset features data points in the shape of a Swiss roll in three dimensions. It is a popular dataset for the visualisation of data reduction methods [90]-[92].
The respective datasets were normalized to the range [0, 1] before model training. This was done to ensure that all features had equal importance in the model. Table 2 provides the Bayesian autoencoder (adapt-LG*) topology used in the experiments described in this section. In the respective datasets, the input feature size and the feature size of the compressed data are different. Table 2 highlights the exact dimensions of the data at the input, intermediate, and output layers. Table 2 also demonstrates the sheer number of parameters involved and how steeply it rises with the number of input features in the data.
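The [0, 1] normalization is a standard per-feature min-max scaling; a minimal sketch (the guard for constant columns is our addition, not something the paper specifies):

```python
import numpy as np

def minmax_normalise(X):
    """Scale each feature (column) of X to [0, 1].

    Constant columns are mapped to 0 rather than producing a divide-by-zero."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span
```

In practice the scaling parameters (lo, hi) would be fitted on the training split only and reused for the test split.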

B. IMPLEMENTATION
We evaluate some of the critical parameters in MCMC sampling to observe the effects on the sampling performance, computational time, and reconstruction accuracy. The Bayesian autoencoder is designed to reduce the Madelon dataset from 500 features to 300. In the case of the Coil 2000 dataset, the Bayesian autoencoder reduces 85 features to 50 features. In the Swiss Roll dataset, the Bayesian autoencoder reduces 3 features to 2 features, essentially "unwrapping" the Swiss Roll, as shown in Figure 3.
In all the experiments, we use the following tempered MCMC sampler hyperparameters. In random-walk proposals and the random component of the Langevin proposal, Gaussian noise is added with a standard deviation of ν_4 = 0.005, centered at 0. We use ν_1 = 0 and ν_2 = 3 in the prior distribution given in Equation (4). We use a replica temperature ladder with maximum T_max = 2 and a swap interval of R_swap = 5 samples for all experiments. We use M = 8 replicas with R_max = 6000 MCMC samples per replica. After R_switch = 3000 samples, tempered MCMC converts to canonical MCMC with neighborhood exchange by setting each replica's temperature to 1. Note that we chose R_max in trial experiments by observing the trend and selecting a value beyond which the accuracy and likelihood do not change substantially. In terms of R_swap, we selected a value from trial experiments that ensured that good mixing of the replicas takes place. Finally, we compare our results with several common classification algorithms: k-nearest neighbors (KNN) [93]; support vector machines (SVM) [94]; the random subspaces ensemble (RSE) [95] method, which uses an ensemble of decision trees; and naive Bayes [96]. We implement the Bayesian autoencoder using the PyTorch library 2 with the Python multiprocessing library 3 for parallel tempered MCMC replica processes.

C. PRELIMINARY INVESTIGATION: PARAMETER TUNING
We first carry out a preliminary investigation to understand the effect of certain parameters on tempered MCMC sampling and show a visualization of the results using the Swiss Roll dataset. Table 3 gives the accuracy of different Bayesian autoencoder configurations in MCMC sampling with the adapt-LG* proposal distribution for the Swiss Roll dataset. We show the trend of the reconstruction accuracy (MSE) and acceptance percentage as we change the MCMC hyperparameters, i.e. the step-size (variance) of the proposal distribution and the learning rate for the Langevin-gradients. The highlighted configuration in Table 3 is visualised in Figure 3. We also find that the accuracy for the chosen configuration is close to that of the canonical autoencoder. We note that the step size of 0.03 with a learning rate of 0.04 provides the best overall performance, since it balances accuracy (0.025) against acceptance performance (19.25 %). In MCMC sampling, accuracy alone is not enough; a good acceptance rate is also needed to ensure that the sampler has explored the posterior distribution well. We note that in some models an acceptance rate of around 23.4 % [97] is known to be optimal for MCMC sampling. Figure 3 gives a visual representation of the Swiss Roll dataset for 2500 data points. It shows the 3D and 2D views of the various stages of the visualization process. Panel (a) shows the original dataset; Panel (b) shows the dataset after reducing and then reconstructing it through a canonical autoencoder; and Panel (c) uses the same architecture but with the Bayesian autoencoder. The visualization and analysis (Table 3 and Figure 3) demonstrate the effectiveness of the Bayesian autoencoder and its applicability to larger datasets.

D. RESULTS
We evaluate specific aspects of our proposed Bayesian autoencoder framework using the Madelon dataset. The Bayesian autoencoder reduces the number of features from 500 to 300 (encoder) and then reconstructs the 500 features from the 300 (decoder). We report the accuracy between the reconstructed dataset and the original dataset. We first present results in terms of computational efficiency, the effect of the adapt-LG* rate, and the number of replicas (M) for the Madelon dataset. We note that the Langevin-gradient is computed for one iteration (epoch), which is defined by a single forward and backward pass via backpropagation. Table 4 shows the effect of the adapt-LG* rate (g_prob) on the performance of the Bayesian autoencoder, given by the training and testing accuracy (MSE) in terms of best, mean, and standard deviation (std). Since all of these sample from the same posterior, the accuracy should be the same as well. However, we find that without adapt-LG* (when the rate is 0), the sampler cannot find the posterior mode. On the other hand, adapt-LG* proposals are much more computationally costly, as given by the time in minutes (mins.). Next, we evaluate the effect of the number of replicas (M) on the Bayesian autoencoder using the adapt-LG* proposal distribution. Table 5 summarises the results, where we observe a lower computational time with similar accuracy (MSE) as the number of replicas increases, suggesting that parallel processing is useful in this case. Figure 4 shows trace plots over the samples with training and testing accuracy for one of the selected replicas of the Bayesian autoencoder (adapt-LG*), demonstrating convergence to the neighborhood of a posterior mode. We present a quantitative assessment of convergence in Table 6 in the form of the Gelman-Rubin diagnostic [98]. This diagnostic works by comparing the variance between replicas to the variance within them.
A ratio sufficiently close to 1 indicates convergence to the MCMC sampler's stationary distribution -the posterior [98]. Figure 5 presents the log-likelihood of selected replicas (temperature level of 1.0) of the Bayesian autoencoder (adapt-LG*) for the respective problems. Figure 6 shows the trace plots and posterior distribution for the selected weights of the Bayesian autoencoder (adapt-LG*) with the convergence diagnostic given in Table 6.
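The Gelman-Rubin diagnostic can be computed directly from the sampled chains; below is a textbook sketch for a single scalar parameter (our own implementation, not the paper's code):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) from several chains.

    chains: array of shape (m_chains, n_samples) for one scalar parameter.
    Compares between-chain variance B to within-chain variance W; a value
    sufficiently close to 1 indicates convergence."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled posterior-variance estimate
    return np.sqrt(var_hat / W)
```

For a model with many weights, this is applied per parameter (or to a summary such as the log-likelihood), which is precisely where the diagnostic becomes strained for very large networks, as discussed later.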
The data compression by autoencoders is typically benchmarked by applying machine learning methods to the reduced dataset. We train the Bayesian autoencoder to obtain a reduced dataset and then train another machine learning model for the task of classification on the reduced data, as done in previous studies [76], [95], [96], [99]-[101]. In this way, we can compare the performance of the machine learning model trained on the reduced dataset with that trained on the original dataset. Table 7 presents a comparison of the results with selected methods from the literature that use the compressed datasets for classification tasks. We note that caution should be exercised when making direct comparisons, because the autoencoder architectures may vary somewhat between papers. To the extent that these comparisons are valid, our Bayesian autoencoder combined with SVM outperforms most methods from the literature in classification accuracy for the respective datasets. This shows the effectiveness of the Bayesian autoencoder as an alternative for dimensionality reduction and motivates the use of autoencoders as a possible data pre-processing step to increase classification accuracy. Table 8 compares the canonical autoencoder (using the SGD and Adam optimizers) with the Bayesian autoencoder (using LG and adapt-LG*). We show the best, mean, and standard deviation (std) of the reconstruction accuracy (MSE). As evident in Table 8, we find that the canonical autoencoder with the SGD optimizer does not perform well for the respective datasets. We also find that the Bayesian autoencoder with adapt-LG* performs better than with LG for all three datasets.

V. DISCUSSION
Our primary goal was to incorporate Bayesian inference via MCMC sampling into autoencoders. The proposed Bayesian autoencoder provides an approach for uncertainty quantification in the reduced dataset. We can reconstruct an ensemble of reduced datasets from the posterior distribution, rather than a single one given by single-point estimates. The results show that the Bayesian autoencoder framework can efficiently sample the posterior distribution of the autoencoder's weights and biases, and that it compares very well in accuracy with canonical methods such as SGD and Adam. We also found that our approach is competitive with the other methods found in the literature.
It is unusual to use MCMC methods for deep learning, which often features models with tens of thousands of parameters; however, the field is slowly gaining momentum [71]. We have shown that with parallel computing and sophisticated gradient-based proposal distributions, MCMC is nonetheless feasible for such models. We finally demonstrated how MCMC diagnostics apply to this problem (Table 6).
One limitation is that convergence analyses for MCMC samplers, such as the Gelman-Rubin diagnostic, are generally designed for modest numbers of parameters. More work is needed to develop convergence diagnostics better suited to deep learning models that feature tens of thousands to millions of parameters. Furthermore, in large deep learning models, not all nodes and weights are equally valuable. We note that a significant portion of these model parameters (weights) do not contribute, and hence they can be pruned during or after sampling (training). Optimization and evolutionary computation methods have been used for pruning the weights of large deep learning models; hence, there is potential for Bayesian inference methods to be used in model pruning. Future research can focus on determining and differentiating essential weights from non-essential ones, using specific metrics to determine which weights have a major effect on the performance of the model. Furthermore, a Bayesian framework can be developed to prune the non-essential weights that do not affect the performance of the model, to build a robust and compact Bayesian autoencoder.
We further highlight that we utilized uncorrelated multivariate Gaussian distributions as the prior distribution for the autoencoder model parameters (weights and biases). Although we considered previous research when selecting this prior distribution, we note that our previous research used different neural network architectures, i.e. simple neural networks [32] and graph CNNs [79]. Recent developments on prior elicitation in Bayesian deep learning have spread interest in more general choices [102]. The choice of prior is a fundamental issue when the interest lies in performing Bayesian model selection [103]. Further extensions of the proposed Bayesian autoencoder can consider current developments of priors in the field.

VI. CONCLUSIONS
We presented a framework that employs tempered MCMC sampling for the Bayesian autoencoder, which is an alternative to the variational autoencoder. We addressed some of the known limitations of MCMC methods for Bayesian deep learning with the power of parallel computing and enhanced proposal distributions. Our results indicate that the Bayesian framework is as good as related methods from the literature in terms of performance accuracy, with the additional feature of robust model uncertainty quantification for the generation of reduced datasets. The paper motivates further applications of the Bayesian framework to other deep learning models, such as generative adversarial networks. Furthermore, there is also scope for domain-specific applications such as remote sensing and geoscience, where robust uncertainty quantification is vital.

CODE AND DATA
We provide the code as open-source software via our GitHub repository 4 .