Bayesian Quantum Neural Networks

The astounding acceleration of advances in Artificial Intelligence and Quantum Computing naturally gives rise to a line of research that explores the potential advantages of quantum computing for classical Machine Learning tasks, known as Quantum Machine Learning or Quantum Machine Intelligence. The typical objectives are either (1) exploring potential quantum advantages on classical learning tasks or (2) leveraging well-established classical ML algorithms to tackle quantum-related problems on Noisy Intermediate-Scale Quantum (NISQ) devices. Along the second research direction, we study Quantum Neural Networks (QNNs) for the purpose of Bayesian learning. Observing that, across a wide range of studies on QNNs, the sole training method is frequentist, we find that Bayesian learning benefits QNNs in two aspects. First, Bayesian-trained models enjoy a higher level of generalization than frequentist-trained ones due to the use of prior and posterior distributions, which is justified by this paper's theoretical study of model capacity. Second, Bayesian inference offers epistemic uncertainty estimation, which benefits the decision-making process; frequentist-trained QNNs generally lack this desirable property. Under the Bayesian training procedure, our derived models can be considered a new class of QNNs (called BayesianQNNs) that possesses the desirable properties of Bayesian inference while maintaining predictive performance comparable to frequentist counterparts. The proposed Bayesian Quantum Neural Networks are supported by empirical evidence from numerical experiments.


I. INTRODUCTION
Machine Learning and Deep Learning serve as the foundation of Artificial Intelligence, benefiting various aspects of modern society. Meanwhile, quantum computing offers a new formalism of computing, which shows unparalleled advantages over its classical counterpart. Beyond these computing advantages, the potential of quantum hardware initiates a new formalism toward Quantum Machine Intelligence. In the quest to unlock the power of quantum computing in the field of Artificial Intelligence, an increasing number of studies investigate the applications [1]-[9] and properties [10]-[13] of Quantum Machine Intelligence for AI tasks in the Noisy Intermediate-Scale Quantum (NISQ) era [8], [14]. Moreover, neural architecture design is also an emerging topic for both modern Deep Learning [15] and QML [16].
Amongst the Quantum Machine Intelligence literature, a line of studies addresses the quantum version of neural networks, the central mechanism of Deep Learning, so-called quantum neural networks [1], [17]. (The associate editor coordinating the review of this manuscript and approving it for publication was Sotirios Goudos.) In this type of neural network, parameterized quantum circuits represent neurons whose control parameters are adapted toward minimizing the objective loss function. This design enables quantum approximate optimization algorithms (QAOA) to validate QML algorithms on simulation machines. Such works offer practitioners cost-effective developer kits to build and validate Quantum Machine Intelligence algorithms, paving the way toward practical implementations on NISQ devices. There are two main objectives for studying Quantum Machine Intelligence:
• Exploring the advantages of computing acceleration from quantum computers for Machine Learning. For example, a wide array of studies relies on a well-developed collection of quantum algorithms for basic algebra subroutines, such as matrix inversion [7] or singular value decomposition [18], which have shown quantum advantages over classical counterparts.
• Discovering the advantages of well-studied classical ML/DL algorithms to reveal the properties of quantum algorithms. In this line of research, we leverage well-developed algorithms from the classical ML/DL literature to solve problems inherited from the Quantum Machine Intelligence field. For example, classical search algorithms leveraging Reinforcement Learning (RL [19]) and Sequential Model-based Optimization (SMBO [10]) enable optimization of quantum circuits under constraints of computational efficiency or predictive performance.
The contribution of this paper falls into the second category: we study the impact of Bayesian inference on QNNs, decomposed into two fundamental aspects of statistical learning: (1) model capacity and trainability and (2) predictive performance. Our main motivation comes from the advantages of Bayesian inference over frequentist training, the latter being the more feasible approach for training QNNs. Specifically, conventional frequentist training aims to find point estimates of the neural weights or parameterized quantum gates. Bayesian training, on the other hand, learns the distribution of model weights, which yields two clear advantages over frequentist training from a statistical learning perspective, taking advantage of the nature of quantum states:
• Bayesian inference has a higher level of generalization: fundamentally, the posterior distribution used in Bayesian Neural Networks (BNNs) enables better calibration on unseen data points, since the estimated uncertainty is more consistent with the empirical errors [20].
• Bayesian inference is more beneficial to data-driven decision-making than frequentist approaches: the use of the posterior distribution enables Bayesian-trained models to estimate epistemic uncertainty, the uncertainty due to lack of evidence. In other words, Bayesian QNNs offer more confident inference where data points are dense, while making less certain predictions in sparse regions of the input space.
The above two key characteristics of Bayesian inference motivate us to study the practicality of QNNs trained under Bayesian learning frameworks. Amongst the vast potential properties of these Quantum Machine Intelligence systems, we look at fundamental characteristics that are crucial in ML, which include:
• Model capacity and trainability: we analyze these two quantities based on the theoretical evaluation proposed in [11], which is derived from classical information geometry. Rigorous evaluation highlights the advantages of the proposed BayesianQNNs over their competitors.
• Predictive performance: we further investigate the practicality of the proposed approach. The derived models achieve better expressibility of model inference, since they offer both the class prediction and the uncertainty estimation from observed data. More specifically, epistemic uncertainty estimation significantly benefits data-driven decision-making and cannot be derived from frequentist-trained models.
Since they are trained under Bayesian learning, the neural networks derived by our proposed framework can be considered Bayesian quantum neural networks, named BayesianQNNs, a new class of QNNs. In BayesianQNNs, the model learns the optimal distribution over weights, while frequentist training only offers point estimates. As a result, the inference of BayesianQNNs has a similar effect to ensemble learning, since predictions are derived by averaging multiple hypotheses according to weights drawn from the learned distributions. Moreover, Bayesian learning induces probabilistic properties into QNNs, which are fundamental to the quantum regime.
In the numerical results, we will show a proof of validity of BayesianQNNs in different data scenarios. The structure of this paper is organized as follows. Section II offers the background on Quantum Neural Networks and Bayesian learning in Deep Learning. Section III introduces a framework, adopted from the classical ML literature, that enables training QNNs under Bayesian learning. Section IV shows theoretical and empirical evidence for the practicality and advantages of the derived BayesianQNNs over frequentist counterparts. Section V further provides in-depth analysis of the convergence and the effect of the training set on the derived BayesianQNNs. Moreover, we extend the scope of the proposed framework to continuous-variable QNNs, which are validated on intermediate-scale datasets.

A. QUANTUM MACHINE LEARNING
The QML literature has witnessed a growing number of advantages of quantum machine learning over classical competitors in a wide range of learning tasks. However, the sole approach for training variational quantum machine learning models is the frequentist approach, as in quantum Support Vector Machines [21], [22], quantum-enhanced features [3], [23], quantum generative adversarial networks [6], [24], quantum Boltzmann machines [25], quantum graph neural networks [26] and quantum neural networks [1], [27]-[29]. Supervised quantum machine learning has been shown to be similar to kernel methods in [30], in which the classical data space is mapped to a high-dimensional Hilbert space. The effectiveness of such embeddings is shown in [31] and [10], yielding remarkable results in comparison to classical counterparts. Besides, [32] and [33] present circuit-based quantum models as partial Fourier series, which can access the frequency spectra defined by the nature of the data encoding. The primary approach for training quantum machine learning models is referred to as variational training, inspired by the Quantum Approximate Optimization Algorithm (QAOA) [34]. We can view quantum computers as trainable neural networks in which the physical control parameters of the quantum hardware are trained to perform machine learning tasks. Current hardware implementations of quantum machine learning models are based on both discrete-variable and continuous-variable quantum information. In the former, we train variational circuit-based models according to input features, enabling feature embedding onto a high-dimensional Hilbert space via qubit rotations. Classification with discrete-variable quantum models can be performed by different strategies, such as a linear classifier (hybrid classical-quantum model) [35], the Helstrom or fidelity classifier [7] and bitstring parity-binary mapping [36].
On the other hand, continuous-variable quantum machine learning models encode input features in continuous degrees of freedom, enjoying the properties of quantum states such as those of an electromagnetic field.

B. BAYESIAN INFERENCE OF NEURAL NETWORKS
Bayesian neural networks can be viewed as stochastic neural networks trained using Bayesian inference. Marginalization is the appealing advantage of the Bayesian approach, which improves the correctness and calibration of artificial neural networks [37]. Moreover, [38] provides strong underlying principles for Bayesian inference, while [39], [40] empirically show that Bayesian neural networks possess better generalization and calibration in comparison to classical counterparts. Furthermore, a Bayesian neural network can naturally quantify the uncertainty of deep neural networks. Uncertainty in the Bayesian formalism includes two distinct types: (1) epistemic uncertainty, caused by inadequate knowledge of the data, which is evaluated by the posterior distribution of the model's weights given the observations; and (2) aleatoric uncertainty, due to the nature of the data, which is measured by the marginal distribution of the data given the model's parameters and input features [20]. Pioneering works on Bayesian neural networks are [41]-[43], which make such Bayesian models as flexible as possible. Reference [43] shows that Bayesian neural networks are Gaussian processes when the number of hidden units tends to infinity. Recent advances in modern Bayesian deep learning include scalable inference [44], [45] and function-space inspired priors [46], [47].

III. BAYESIAN QUANTUM NEURAL NETWORKS
In this section, we will introduce the concept of Quantum Neural Networks (QNNs) in Section III-A, highlighting the differences between QNNs and their classical counterparts. Section III-B will focus on the fundamental difference between frequentist and Bayesian training. Finally, we will introduce two variational algorithms, adopted from the classical ML literature, to train BayesianQNNs in Section III-B2.
A. FROM CLASSICAL TO QUANTUM MACHINE LEARNING
1) QUANTUM NEURAL NETWORKS
Let D_x = {x_i}_{i=1}^N be a set of observed random variables in the classical data space X and D_y = {y_i}_{i=1}^N a set of associated target variables. Circuit-based quantum machine learning models are interpreted as parameterized (variational) circuits U(x, w) (also denoted as U_w(x)), in which w are classically trainable parameters with realizations x. Mathematically speaking, the mapping on input data is defined as
$$|\phi_w(x)\rangle = U_w(x)\,|0\rangle^{\otimes n}, \qquad (1)$$
where |φ_w(x)⟩ is the quantum representation of input x in the high-dimensional quantum Hilbert space. The construction of a variational quantum circuit U_w(x) involves a stack of M identical sub-architectures of a data encoding circuit F(x) and a trainable circuit G(w). In particular, F(x) is formed by gates of the form exp(−ixH), where H is the Hamiltonian (total energy) generating the time evolution for data encoding. Thus, we can interpret the overall design of the quantum circuit as
$$U_w(x) = \prod_{m=1}^{M} G(w_m)\,F(x). \qquad (2)$$
The unified notation for a quantum model can be written as
$$f_w(x) = \mathrm{Tr}\big[\,|\phi_w(x)\rangle\langle\phi_w(x)|\,\mathcal{M}\,\big], \qquad (3)$$
where |φ_w(x)⟩⟨φ_w(x)| is the density matrix of the quantum state |φ_w(x)⟩ and M is a Hermitian operator corresponding to the quantum observable, which is associated with the measurements.
a: GENERAL DESIGN OF QUANTUM NEURAL NETWORKS
Figure 1 (a) illustrates the overall principle for designing a discrete-variable quantum neural network. First, the architecture is formed by a stack of N identical ansatz circuits, each including a feature embedding F(x) and a learnable circuit G(w). Second, quantum machine learning enjoys the power of quantum computing via entanglement between qubits, formed by CNOT gates. Another way to establish entanglement is by using the Control-Z operation. In the former design, the permutation of CNOT gates strongly impacts the final result, while the order of Control-Z gates is permutation invariant [10]. As a result, the quantum machine learning model encodes data from the classical space into the quantum Hilbert space. The embedding is trained to maximize the distance between observations from different classes. Thus, the decision boundary constructed in Hilbert space corresponds to a complex decision boundary in the classical feature space [7], providing a more powerful classifier. This is fundamentally different from the concept of neural networks in Deep Learning, in which neural layers are represented as mappings between real spaces. Regarding continuous-variable QNNs, Gaussian operations are used as neurons in the neural architecture, including rotations R(φ), displacement D(α), squeezing S(r) and beamsplitters BS(θ). Like classical neural networks, the final model is formed by stacking N layers of the CV-quantum circuit, which plays the role of feature embedding in neural networks. At the decision level, several choices of measurement, such as homodyne, heterodyne, or photon counting, can be used to generate the model output.
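To make the discrete-variable design concrete, the following is a minimal pure-Python sketch of the forward pass of a two-qubit QNN of the kind described above: alternating R_X data-encoding, CZ entanglement, and trainable R_X rotations, measured in the Pauli-Z basis. The statevector simulation, qubit ordering, and layer count are illustrative assumptions, not the paper's implementation (the experiments use PennyLane).

```python
import math

def apply_rx(state, qubit, phi):
    # RX(phi) acts on one qubit of a 2-qubit statevector (4 complex amplitudes,
    # qubit 0 = least significant bit of the basis index).
    c, s = math.cos(phi / 2), -1j * math.sin(phi / 2)
    new = state[:]
    for i in range(4):
        if not (i >> qubit) & 1:          # pair basis states differing in this qubit
            j = i | (1 << qubit)
            a, b = state[i], state[j]
            new[i] = c * a + s * b
            new[j] = s * a + c * b
    return new

def apply_cz(state):
    # CZ flips the sign of |11> and is permutation invariant in its two qubits.
    new = state[:]
    new[3] = -new[3]
    return new

def qnn_forward(x, w, layers=2):
    # Stack of [data encoding F(x) -> CZ entanglement -> trainable G(w)] blocks;
    # returns the Pauli-Z expectation on qubit 0 as the model output.
    state = [1 + 0j, 0j, 0j, 0j]          # initial state |00>
    for l in range(layers):
        for q in range(2):                # F(x): encode features as X-rotations
            state = apply_rx(state, q, x[q])
        state = apply_cz(state)           # entanglement
        for q in range(2):                # G(w): trainable X-rotations
            state = apply_rx(state, q, w[l][q])
    # <Z_0> = sum_i (+1 if qubit-0 bit of i is 0 else -1) * |amplitude_i|^2
    return sum((1 if not (i & 1) else -1) * abs(a) ** 2 for i, a in enumerate(state))
```

Classical post-processing (e.g., a linear classification layer) would then map this expectation value to class scores.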

2) EVALUATION OF MODEL CAPACITY AND TRAINABILITY
The power of machine learning models relies on their capacity to represent various relationships between variables. Although classical neural networks have established a large number of state-of-the-art benchmarks, quantum machine learning is a potential advancement that leverages the power of quantum computing through quantum effects such as superposition and entanglement. A faster training process is among the potential advantages of quantum machine learning models over classical counterparts. However, comparing the power of quantum and classical machine learning remains a challenging question. Reference [12] provides theoretical prediction error bounds to assess potential quantum advantage by using a geometry test and a complexity test for a specific function or label. Hence, the comparison between quantum and classical machine learning models depends on the data scenario. Moreover, [13] shows a limited quantum advantage by investigating the representation learning ability of variational quantum circuits via memory capacity. This links to a recent investigation of the quantum model's power using the effective dimension [11], which is considered a robust measure of a model's capacity. A downside of quantum machine learning is the barren plateau, in which the loss landscape is flat, rendering quantum models untrainable. By investigating the Hessian matrix and its spectrum, we can determine the trainability of such models. Thus, [11] shows that well-designed quantum machine learning models can possess a higher capacity and speed up the training process. This is consistent with our numerical results, which show that quantum machine learning models require smaller model complexity to outperform classical counterparts. To assess the capacity and trainability of BayesianQNNs, we investigate the effective dimension and Fisher information, which are rigorous measures for both classical and quantum models.
In this study, we analyze the model capacity and trainability following the theoretical evaluation proposed in [11].

a: FISHER INFORMATION
The Fisher information is used to measure the capacity of a class of machine learning models from a statistical theory perspective. Any machine learning model can be interpreted as a mapping from inputs to outputs described by the joint distribution
$$p(x, y|w) = p(y|x, w)\,p(x),$$
where (x, y) ∈ X × Y are data pairs and w ∈ Θ is the learnable parameter. In particular, p(y|x, w) is a discriminative model, usually used in classification problems, while p(x) is a generative model that learns the distribution of the input data. From a statistical perspective, we can refer to p(x) as a prior distribution, while p(y|x, w) is a conditional distribution of outputs parameterized by w with realization x. The Fisher information matrix is defined as
$$F(w) = \mathbb{E}_{p(x,y|w)}\big[\nabla_w \log p(x, y|w)\,\nabla_w \log p(x, y|w)^{\top}\big]. \qquad (4)$$
Given n data samples, the empirical Fisher information matrix can be approximated by
$$\tilde{F}_n(w) = \frac{1}{n}\sum_{i=1}^{n} \nabla_w \log p(x_i, y_i|w)\,\nabla_w \log p(x_i, y_i|w)^{\top}. \qquad (5)$$
Reference [11] shows that Fisher information captures the response of the model's outputs with respect to shifts in the parameter space, which enables natural gradient optimization. Moreover, Fisher information is a convenient statistical quantity to study the interaction between model space and parameter space. Specifically, the Fisher information is a Riemannian metric on the parameter space, in which √det F(w) gives the volume in the model space. Motivated by information geometry, the effective dimension is a useful complexity measure, which can be applied to both classical and quantum machine learning models. Let f_w(.) ∈ M be a statistical model parameterized by w, where M is a model space that contains all possible models under a particular parametrization. Each element in M is the distribution of a model specified by a set of parameters. A conventional method to improve the performance of a classical model is to enlarge the complexity of the parameter space, seeking a better optimal neural solution. However, this approach is not appropriate due to several issues. First, models with extremely high complexity tend to overfit the training set, which leads to performance degradation.
An example of this phenomenon is the scaling of ResNet architectures, in which the performance gain diminishes as additional layers are added. Second, a uniform shift in parameter space can result in an extreme shift in model space. Thus, directly studying the model space enables us to understand the representation learning ability of models. The main objective of the effective dimension is to estimate the volume that a model occupies in the model space M. The effective dimension of a model is given as [11]
$$d_{\gamma,n} = 2\,\frac{\log\left(\frac{1}{V_\Theta}\int_{\Theta}\sqrt{\det\left(I_d + \frac{\gamma n}{2\pi \log n}\,\hat{F}(w)\right)}\,dw\right)}{\log\left(\frac{\gamma n}{2\pi \log n}\right)}, \qquad (7)$$
where Θ ⊂ R^d is the parameter space, γ ∈ (0, 1], n ∈ N is the number of samples, V_Θ = ∫_Θ dw is the volume of the parameter space, and F̂(w) is the Fisher information matrix normalized so that its average trace over Θ equals d. Equation 7 shows that the effective dimension takes into account the eigenvalues of the Fisher information through the integration over its determinant. The number of observations n plays an important role in calculating the effective dimension, determining the resolution used to observe the model space. Moreover, this enables us to study the effect of data availability on the effective dimension and model complexity. Reference [11] further derives a generalization bound (Equation 9) in which the generalization gap is controlled by the effective dimension, under a smoothness condition on the gradient of the Fisher information.
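As an illustration of how these quantities can be estimated in practice, the sketch below computes the empirical Fisher information as an average outer product of per-sample score vectors, and plugs it into the effective dimension formula (Equation 7), with a Monte Carlo average over sampled parameter sets standing in for the integral over Θ. The two-parameter logistic likelihood is a hypothetical stand-in for a QNN likelihood, not the paper's model.

```python
import math
import random

def grad_loglik(w, x, y):
    # Score vector of log p(y|x,w) for a toy 2-parameter logistic model.
    z = w[0] * x[0] + w[1] * x[1]
    p = 1.0 / (1.0 + math.exp(-z))
    return [(y - p) * x[0], (y - p) * x[1]]

def empirical_fisher(w, data):
    # Empirical Fisher: average outer product of the score vectors (2x2 here).
    F = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in data:
        g = grad_loglik(w, x, y)
        for i in range(2):
            for j in range(2):
                F[i][j] += g[i] * g[j] / len(data)
    return F

def effective_dimension(fishers, n, gamma=1.0, d=2):
    # Equation 7 with a Monte Carlo average over parameter draws; the Fisher
    # matrices are normalized so that their average trace equals d, as in [11].
    kappa = gamma * n / (2 * math.pi * math.log(n))
    avg_trace = sum(F[0][0] + F[1][1] for F in fishers) / len(fishers)
    vols = []
    for F in fishers:
        Fh = [[d * F[i][j] / avg_trace for j in range(2)] for i in range(2)]
        det = ((1 + kappa * Fh[0][0]) * (1 + kappa * Fh[1][1])
               - (kappa * Fh[0][1]) * (kappa * Fh[1][0]))
        vols.append(math.sqrt(max(det, 0.0)))
    return 2 * math.log(sum(vols) / len(vols)) / math.log(kappa)
```

For a d-parameter model the effective dimension lies near d for highly expressive models and well below d for degenerate ones, which is how it is used to compare the quantum and classical models in Section IV-A.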

B. FROM FREQUENTIST TO BAYESIAN TRAINING
1) BAYESIAN QUANTUM NEURAL NETWORKS
We illustrate the main differences between frequentist and Bayesian training of quantum neural networks in Figure 2. The dominant frameworks in the literature for training the variational quantum circuits presented in Equation 2 are frequentist, finding point estimates of the model's parameters:
$$w^{*} = \arg\min_{w}\; \mathcal{L}\big(f_w(x), y\big) + \lambda R(w), \qquad (10)$$
where L(.) is the loss function and R(w) is the regularization with penalty term λ. From a statistical perspective, this is Maximum Likelihood Estimation (MLE), which is equivalent to Maximum A Posteriori (MAP) estimation under proper regularization. Although the frequentist approach is computationally convenient in modern implementations, it is prone to overfitting and generalizes poorly to out-of-training observations. In our proposed Bayesian variational quantum circuits (BayesianQNN), we incorporate stochastic components into the circuit structure. Given a variational quantum circuit U_w(x) parameterized by learnable weights w with realization x, we define two stochastic models: (1) the prior distribution of the quantum circuit's parameters p(w) and (2) the likelihood of the predictions p(y|x, w). The posterior distribution of the quantum circuit's weights is given as
$$p(w|\mathcal{D}) = \frac{p(\mathcal{D}|w)\,p(w)}{\int p(\mathcal{D}|w')\,p(w')\,dw'}, \qquad (11)$$
which induces the marginal distribution of possible predictions p(y|x, D). The underlying properties of BayesianQNN are similar to ensemble learning, in which the inference is derived by averaging over multiple abstract explanations.
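The ensemble-like behavior above can be made concrete: the marginal predictive distribution p(y|x, D) is approximated by averaging the likelihood over posterior weight samples. The sketch below uses a hypothetical sigmoid likelihood in place of a measured quantum observable; the spread of the member probabilities serves as a simple epistemic-uncertainty proxy.

```python
import math

def bernoulli_prob(x, w):
    # Toy likelihood p(y=1|x,w) = sigmoid(w.x), a hypothetical stand-in for the
    # class probability read out from a quantum measurement.
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def marginal_prediction(x, posterior_samples):
    # p(y=1|x,D) ~= (1/K) * sum_k p(y=1|x, w_k): Bayesian model averaging over
    # K weight draws w_k from the (approximate) posterior p(w|D).
    probs = [bernoulli_prob(x, w) for w in posterior_samples]
    mean = sum(probs) / len(probs)
    # Disagreement between members reflects epistemic uncertainty.
    var = sum((p - mean) ** 2 for p in probs) / len(probs)
    return mean, var
```

When the posterior samples disagree strongly (sparse data regions), the variance is large and the averaged probability is pulled toward 0.5, which is exactly the conservative behavior discussed in the introduction.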

Algorithm 1 Bayes by Backprop with Gaussian variational posterior for training BayesianQNN
  initialize the prior (µ_prior, σ_prior), the variational posterior (µ_0, ρ_0), parameter-free noise ε, scalar η, and the number of total iterations I
  for i in 1..I: sample circuit weights from the variational posterior and update (µ, ρ) by backpropagating the variational free energy

Moreover, the BayesianQNN offers confidence on its predictions (calibration), which potentially leads to more robust decision-making than predictions from point-estimate counterparts. Although there is no concrete empirical evidence of predictive improvements of the Bayesian over the frequentist approach, the main advantage of BayesianQNN is realistically quantifying the uncertainty associated with the process.
The design of the quantum circuitry depends on the number of input features of the dataset of interest, and can be categorized into two main types: (1) fully quantum and (2) hybrid classical-quantum. Current quantum computers allow a limited number of qubits, and the general design of ansatz circuits commonly requires the number of qubits to be greater than or equal to the number of input features. Hence, we design a fully quantum model in the case of low-dimensional datasets, which have a small number of input features. We use hybrid classical-quantum models for high-dimensional datasets, combining a classical neural network followed by a quantum circuit. In hybrid models, the classical neural network plays the role of an encoder, which reduces the dimension of the input space. Particularly, the classical component transforms the input space R^n → R^m, n > m, while the quantum component transforms R^m → H. The main difference between full and hybrid models is that the quantum circuitry of a fully quantum model embeds the original input features, whereas the hybrid model embeds the latent variables (in-depth features) produced by the classical neural network.

Algorithm 2 Deep Ensembles for training BayesianQNN
  initialize the number of ensemble models N_e, the number of total iterations I, and an SGD optimizer
  W ← ∅
  for n_e in 1..N_e: for i in 1..I: train the n_e-th model by SGD from its own random initialization, then add its point estimate to W

Thus, the general design of a Bayesian quantum model on near-term quantum computers begins with setting the number of qubits used for the model, considering the current computational ability of quantum computers. Then, a classical neural network is used as an encoder to reduce the dimensionality of datasets having more input features than the number of qubits. Otherwise, we can directly embed the original features with a fully quantum model. Finally, the inference is based on the post-measurement of quantum states, which can be discrete or continuous. We will discuss the design of the quantum circuitry for the investigated datasets in Section IV.
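The classical encoder of a hybrid model, R^n → R^m with n > m, can be sketched as follows. A single linear-plus-tanh layer is an illustrative assumption (the paper trains a neural network); its m bounded outputs would serve as rotation angles for the quantum feature map.

```python
import math

def classical_encoder(x, W):
    # Toy encoder R^n -> R^m (n > m): a linear map followed by tanh, so the m
    # outputs are bounded and can be used directly as rotation angles of the
    # quantum embedding circuit. W is an m x n list of weight rows.
    return [math.tanh(sum(row[j] * x[j] for j in range(len(x)))) for row in W]
```

In a fully quantum model this stage is skipped and the raw features are encoded directly, which is why it is reserved for datasets with more features than available qubits.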

2) TRAINING ALGORITHMS FOR BayesianQNNs
The denominator of Equation 11 is computationally intractable, which challenges practical implementation. Computational solutions to this issue fall into two generic approaches: (1) sampling methods based on Markov Chain Monte Carlo (MCMC) and (2) variational inference. Although MCMC algorithms can be considered the most faithful approach for sampling from the exact posterior distribution, they lack scalability even for intermediate-size models. On the other hand, variational inference offers scalable solutions applicable to a wide range of Bayesian learning problems. Within the scope of this study, we investigate two variational inference approaches for training BayesianQNN: Bayes-by-backprop and deep ensembles.

a: BAYES-BY-BACKPROPAGATION
Motivated by [48], [49], variational learning for approximating the exact posterior distribution was introduced in [50]. In particular, we aim to learn the parameters θ of the distribution Q(w|θ) that minimize the Kullback-Leibler (KL) divergence between Q(w|θ) and p(w|D). The optimization problem can be stated as
$$\theta^{*} = \arg\min_{\theta}\; \mathrm{KL}\big[\,Q(w|\theta)\,\big\|\,p(w|\mathcal{D})\,\big] = \arg\min_{\theta}\; \mathcal{E}(\mathcal{D}, \theta). \qquad (12)$$
The term E(D, θ) in Equation 12 is the exact objective function for Bayesian training, also known as the variational free energy [51]-[53], i.e., the negative of the evidence lower bound (ELBO) [51]. Unfortunately, direct computation of the exact cost function is nearly impossible. Instead, the exact objective function is approximated as
$$\mathcal{E}(\mathcal{D}, \theta) \approx \sum_{k=1}^{K} \log Q(w^{(k)}|\theta) - \log p(w^{(k)}) - \log p(\mathcal{D}|w^{(k)}), \qquad (13)$$
where w^{(k)} is the k-th sample from the variational posterior Q(w|θ). A simple but practical assumption for the variational posterior is a diagonal Gaussian distribution. Assuming the variational posterior parameters θ = (µ, ρ), where µ and σ = log(1 + exp(ρ)) are the mean and standard deviation of a normal distribution, the posterior sample of the quantum circuit's weights is given by
$$w = \mu + \log\big(1 + \exp(\rho)\big) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad (14)$$
where ε is parameter-free noise and ⊙ is element-wise multiplication. We adopt Bayesian backpropagation for training BayesianQNN, as illustrated in Algorithm 1.
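The loop can be sketched end-to-end for a single weight: draw reparameterized samples w = µ + log(1 + exp(ρ))ε, form a Monte Carlo estimate of the variational free energy, and update (µ, ρ) by gradient descent. A Gaussian likelihood y ~ N(wx, 1), a standard normal prior, and finite-difference gradients (in place of backpropagation through the sampler) are simplifying assumptions for this toy sketch.

```python
import math
import random

def softplus(rho):
    return math.log1p(math.exp(rho))

def free_energy(mu, rho, data, k=8):
    # Monte Carlo estimate of the variational free energy for a hypothetical
    # 1-parameter model y ~ N(w*x, 1) with standard normal prior p(w) and a
    # Gaussian variational posterior Q(w | mu, rho).
    total = 0.0
    sigma = softplus(rho)
    for _ in range(k):
        eps = random.gauss(0.0, 1.0)          # parameter-free noise
        w = mu + sigma * eps                  # reparameterized posterior sample
        log_q = -0.5 * math.log(2 * math.pi * sigma ** 2) - (w - mu) ** 2 / (2 * sigma ** 2)
        log_prior = -0.5 * math.log(2 * math.pi) - w ** 2 / 2
        log_lik = sum(-0.5 * math.log(2 * math.pi) - (y - w * x) ** 2 / 2 for x, y in data)
        total += log_q - log_prior - log_lik
    return total / k

def train_step(mu, rho, data, eta=0.05, h=1e-4):
    # One Bayes-by-backprop update; finite differences with common random
    # numbers (reseeding) stand in for the backpropagated gradient.
    random.seed(0); f0 = free_energy(mu, rho, data)
    random.seed(0); f_mu = free_energy(mu + h, rho, data)
    random.seed(0); f_rho = free_energy(mu, rho + h, data)
    return mu - eta * (f_mu - f0) / h, rho - eta * (f_rho - f0) / h
```

Run on data generated from y = 2x, the posterior mean µ settles between the prior mean 0 and the maximum-likelihood solution 2, and the posterior scale σ shrinks as evidence accumulates, mirroring the weight-variance decay reported in Figure 7.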

b: DEEP ENSEMBLES
Another approach to variational inference for BayesianQNN is deep ensembles, which can be considered model averaging. Reference [54] introduced a simple and scalable framework that captures the predictive uncertainty of a model's predictions. Deep ensembles leverage SGD dynamics with different initializations to obtain point estimates of different model parameters. Although each model is trained via a frequentist approach, the whole process can be understood from a Bayesian perspective: under regularization, the point estimates of the model's weights can be associated with modes of a Bayesian posterior. We adopt deep ensembles for training BayesianQNN, as illustrated in Algorithm 2.
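A deep-ensembles sketch in the same spirit: each member is trained by SGD from its own random initialization (a frequentist run), and the ensemble's prediction mean and spread approximate the Bayesian predictive mean and uncertainty. One-parameter linear regression members are illustrative stand-ins for the QNN members used in the paper.

```python
import random

def train_member(data, seed, epochs=100, lr=0.1):
    # One frequentist training run from its own random initialization.
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x   # SGD on squared error for y ~ w*x
    return w

def ensemble_predict(data, x_new, n_members=5):
    # Deep-ensembles style inference: average member predictions; the spread of
    # the members acts as an uncertainty proxy (modes of an approximate posterior).
    preds = [train_member(data, seed) * x_new for seed in range(n_members)]
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, var
```

On data all members can fit, the variance collapses; on ambiguous data the members land in different modes and the variance signals disagreement, which is the uncertainty readout compared against Bayes-by-backprop in Section IV.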

IV. NUMERICAL RESULTS
This section will illustrate the effectiveness of Bayesian variational training on discrete-variable QNN models. In the experiments, we create two synthetic datasets. Each dataset contains 1000 observations, which are categorized into two classes. For simplicity of the demonstration, we set the number of features equal to the number of qubits in the designed circuit model. Additionally, we make our generated datasets publicly available for reproduction and comparison.

FIGURE 3. Design of the discrete-variable quantum machine learning model used in the experiments on synthetic datasets. We set the initial states of the two qubits to |0⟩, followed by Rotation-X gates parameterized by the input features. Entanglement is established by a Control-Z gate. The model's parameters correspond to parameterized Rotation-X gates. Measurement over the two qubits is performed in the Pauli-Z basis, followed by classical post-processing to form the classification layer.

A. TRAINABILITY AND MODEL CAPACITY
We study the Fisher information spectrum and the effective dimension of BayesianQNN compared to the frequentist quantum model and a classical counterpart. First, we construct 2-layer QML models over two qubits as in Figure 3, resulting in 8 parameters. For the classical model, we perform a random search to find the best model that achieves results on the moon dataset (Section IV-B) consistent with the QML models, resulting in 48 parameters. To compute the empirical Fisher information, we randomly draw 100 sets of model parameters uniformly from [−1, 1]^d.
First of all, the right panel of Figure 5 shows the Fisher information spectrum of the classical neural network, which is heavily concentrated around zero with several extremely large eigenvalues. Such a Fisher spectrum is often associated with optimization issues, in which extreme eigenvalues tend to prevent convergence and prolong the training session. On the other hand, the eigenvalue distribution of BayesianQNNs is more uniform and free of outliers. This indicates that BayesianQNNs are more resilient to barren plateau loss landscapes, leading to higher trainability. Second, the left panel of Figure 5 illustrates the effective dimension of the three investigated models for different proportions of training data. BayesianQNNs outperform the classical DNN over the entire range of data, yielding a considerable gap of nearly double. Combined with Equation 9, the higher effective dimension of BayesianQNNs leads to better generalization on unseen data. Finally, the effective dimension of BayesianQNNs is slightly lower than that of the frequentist quantum models, a consequence of setting the scale of the parameter-free noise to 1; this gap tends toward zero as the noise scale goes to 0. This preliminary result is consistent with our numerical results, which will be discussed in Section IV-B.

B. PREDICTIVE PERFORMANCE
In this section, we analyze the performance of the proposed BayesianQNNs at the decision boundary level in Section IV-B1. Then, we further analyze the impact of Bayesian learning on the kernels generated by the models in Section IV-B2.

1) ANALYSIS ON DECISION BOUNDARY
We first design the quantum circuit model for classification on the generated datasets as depicted in Figure 3, following the general procedure mentioned in Section III-A. Since our synthesized datasets only have two input features, the quantum circuit of BayesianQNN is initialized with two qubits, followed by CZ gates. We used CZ gates instead of CNOT gates in this experiment due to their permutation invariance. In other words, the order of the qubits to which a CZ gate is applied does not affect the result, while CNOT gates are sensitive to this order. The primitive quantum gate used is the parameterized R_X gate, enabling a single-qubit X-rotation by a learnable parameter φ. The mathematical expression of the R_X gate is
$$R_X(\phi) = \begin{pmatrix} \cos(\phi/2) & -i\sin(\phi/2) \\ -i\sin(\phi/2) & \cos(\phi/2) \end{pmatrix}. \qquad (15)$$
The model can be represented as a unitary transformation over two qubits, in which entanglement is established by the Control-Z gate (CZ gate). We train the network for 200 epochs with the Adam optimizer and learning rate 0.1. The learning rate decays every 2 epochs with a decay rate of 0.97. The hyper-parameter setting for Bayesian backpropagation includes scalar η = 0.01 and (µ_prior, σ_prior) = (0, 1). All experiments are conducted in Python 3.6, PyTorch 1.9.0 and PennyLane 1.15.0. Table 1 reports the predictive performance of the variational quantum model under different training strategies. First, Bayes-by-backprop yields results consistent with the frequentist approach, reaching 93.17% and 86.22% mean accuracy on the circle and moon datasets, respectively. The results of deep ensemble training, based on 100 independent runs initialized with distinct seeds under the SGD optimizer, are very close to the result of a single run with the frequentist approach. The variation of deep ensemble training is higher than that of Bayes-by-backprop, even though it is based on many models. Hence, we observe that Bayes-by-backprop achieves higher efficiency than deep ensemble training in both computational resources and predictive performance.
Figure 6 depicts the decision boundaries and uncertainty estimates of variational quantum models trained with the Bayes-by-backprop algorithm. As the left panels show, the inferred domains correctly split the testing data according to their generating distributions, and the illustrated decision boundaries are consistent with the results reported in Table 1. The right panels illustrate the epistemic uncertainty of our models: the quantum models' inferences have high certainty where observations support them. Take the moon dataset as an example: the epistemic uncertainty is low in regions close to the two classes' centroids and increases toward the decision boundary. The same pattern appears in the circle dataset, where the uncertainty is high close to the class separation, while in the inner and outer regions of the circle it is virtually zero. Figure 7 shows the convergence of the model parameters: the model's weights converge after 100 epochs, and their variance decreases as training progresses, reaching minimal values at the end of the process.
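The epistemic uncertainty maps discussed above are obtained, as is standard in Bayes-by-backprop, by Monte Carlo sampling from the learned weight posterior and measuring the spread of the resulting predictions. A minimal sketch, using a hypothetical `predict` stand-in for the quantum model and illustrative Gaussian posterior parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x, theta):
    # Stand-in for the quantum model's class-1 probability;
    # a hypothetical placeholder, not the paper's circuit.
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

# Learned variational posterior q(theta) = N(mu, diag(sigma^2));
# the values here are illustrative, not fitted.
mu = np.array([0.8, -0.5])
sigma = np.array([0.1, 0.2])

def mc_predictive(x, n_samples=500):
    """Predictive mean and epistemic variance by sampling weights."""
    thetas = mu + sigma * rng.standard_normal((n_samples, len(mu)))
    preds = np.array([predict(x, t) for t in thetas])
    return preds.mean(), preds.var()

mean, var = mc_predictive(np.array([1.0, 2.0]))
```

Points far from the training data tend to receive predictions that disagree across weight samples, giving high variance; well-supported points yield low variance, matching the pattern in Figure 6.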
The experimental results provide an empirical proof of concept for our proposed BayesianQNN in comparison to frequentist-trained QNNs and classical neural networks. First, our goal is not to outperform frequentist-trained QNNs in predictive performance; rather, we highlight the main advantage of the BayesianQNN, which inherits the properties of Bayesian inference and captures the uncertainty in the model's inferences. In other words, the BayesianQNN makes inferences based on supporting observations from the datasets under investigation, which enables a more robust decision-making process. Second, we investigate the effectiveness of quantum models over classical counterparts by constructing feed-forward neural networks on the same datasets. To match the performance of the quantum machine learning models, classical neural networks require roughly 4× and 6× the model complexity of the QML models on the circle and moon datasets, respectively. Classical networks of comparable model complexity to the QML model (8 learnable parameters) only achieve test accuracies of 0.82 and 0.76 on the circle and moon datasets, respectively. The experimental results on the two synthesized datasets are consistent with our preliminary analysis using the effective dimension and the Fisher information spectrum in Section IV-A.

2) ANALYSIS ON KERNELS GENERATED BY BayesianQNNs
Quantum neural networks have a close relationship with kernel methods. It is therefore worth studying the effect of Bayesian training on the kernels generated by QNNs, which significantly influence the decision boundary. Let |φ_θ(x)⟩ be the embedded quantum state of classical input features x, parameterized by the model's weights θ. The quantum kernel is defined via the inner product between two data points x and x′:

k(x, x′) = |⟨φ_θ(x)|φ_θ(x′)⟩|².

Different encoding schemes in the literature, including basis encoding, amplitude encoding, repeated amplitude encoding, rotation encoding, and coherent-state encoding (for CV quantum models), all result in different quantum kernels [3]. Figure 8 illustrates the quantum kernels from our discrete-variable model in Figure 3. We obtain the quantum feature |φ_θ(x)⟩ by rotation embedding using Pauli-Y rotations, given by [30]:

|φ(x)⟩ = ⊗_{i=1}^{n} |q_i(x_i)⟩, where |q_i(x_i)⟩ = cos(x_i)|0⟩ + sin(x_i)|1⟩.

Incorporating this embedding into Equation 16 yields the quantum kernel function illustrated in Figure 8. The left panel of Figure 8 shows the kernel k(0, x′) for increasing values of σ (0.05, 0.1, and 0.2), which depicts the effect of Bayesian inference on rotation quantum kernels. From the right panel, we can see that the kernels generated by the BayesianQNN are not deterministic functions as in the frequentist approach, but instead account for the uncertainty.
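For the (noise-free) rotation embedding above, the kernel has a simple closed form: since ⟨q_i(x_i)|q_i(x′_i)⟩ = cos(x_i)cos(x′_i) + sin(x_i)sin(x′_i) = cos(x_i − x′_i), the inner product factorizes and k(x, x′) = ∏_i cos²(x_i − x′_i). The following sketch verifies this numerically for the parameter-free embedding (the Bayesian version would additionally randomize θ):

```python
import numpy as np
from functools import reduce

def embed(x):
    """Rotation embedding: tensor product of cos(x_k)|0> + sin(x_k)|1>."""
    qubits = [np.array([np.cos(xk), np.sin(xk)]) for xk in x]
    return reduce(np.kron, qubits)

def kernel(x, xp):
    """Quantum kernel k(x, x') = |<phi(x)|phi(x')>|^2 via explicit states."""
    return abs(embed(x) @ embed(xp)) ** 2

def kernel_closed_form(x, xp):
    # The inner product factorizes over qubits: prod_k cos(x_k - x'_k).
    return float(np.prod(np.cos(np.asarray(x) - np.asarray(xp))) ** 2)
```

Plotting `kernel_closed_form(0, x')` over a grid of x′ reproduces the cosine-shaped frequentist kernel; sampling θ from the posterior and averaging gives the smoothed, non-deterministic Bayesian kernels of Figure 8.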

A. EFFECT OF PARAMETER-FREE NOISE
We further investigate the effect of parameter-free noise on discrete-variable quantum models trained by Algorithm 1.

[Figure 9. Effect of noise on the Bayesian quantum classifier. The standard deviation of the parameter-free noise increases from left to right (0.1 → 0.5 → 1). The top row shows the decision boundary and epistemic uncertainty estimation on the circle dataset; the bottom row shows the moon dataset. The impact of the noise is highly data dependent: there is no significant difference on the circle dataset, while the decision boundary on the moon dataset is noticeably affected by high values of σ.]
First, we fix the hyperparameter setting across all experiments and only increase the standard deviation σ from 0.1 to 0.5 to 1. The primary observation from this experiment is that the impact of the noise is highly data-dependent. In the top row of Figure 9, the decision boundaries remain nearly identical as the noise increases. The same pattern appears in the uncertainty estimates on the circle dataset, where the variance of the inferences changes only slightly, by under 0.03. The experiment on the moon dataset, by contrast, shows the opposite behavior, as can be seen from the bottom row of Figure 9. The best model corresponds to σ = 0.5 in terms of predictive power and robustness: with σ = 0.5, the model returns low-uncertainty inferences within regions that have supporting observations (in the green circle), while inferences in regions lacking supporting data carry high uncertainty. A high value of σ = 1, on the other hand, shifts the decision boundary and deteriorates the model's performance. Determining the value of such noise can therefore be cast as a hyperparameter optimization problem, yielding the best parameter setting for a given data scenario.
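Selecting σ as described above is an ordinary hyperparameter search. A minimal sketch, where `validation_score` is a hypothetical stand-in for training the model with a given noise level and returning its validation accuracy:

```python
def validation_score(sigma):
    # Placeholder for: train the Bayesian quantum classifier with
    # parameter-free noise of std `sigma`, return validation accuracy.
    # This toy surrogate simply peaks at sigma = 0.5, mimicking the
    # behavior observed on the moon dataset.
    return 1.0 - (sigma - 0.5) ** 2

# Grid search over the candidate noise levels used in the experiment.
candidates = [0.1, 0.5, 1.0]
best_sigma = max(candidates, key=validation_score)
```

In practice the grid search would be replaced by cross-validation or a Bayesian optimization loop over σ, with the full training run inside `validation_score`.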

B. VALIDITY OF PROPOSED WORK ON CV-QNNs
We validate the practicality of our proposed work on continuous-variable QNNs by investigating their performance on a credit card fraud detection dataset. The dataset includes 284,807 transactions with 28 features and is highly imbalanced: fewer than 0.2% of the observations are ''fraudulent''. We split the original dataset into training and test sets using an 80%/20% ratio, stratified by class to maintain the proportion of each class.
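A stratified split of this kind can be done directly with scikit-learn; the sketch below uses synthetic stand-in data with roughly the same imbalance (the real dataset is not reproduced here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the highly imbalanced fraud data (~0.2% positives),
# with the same 28-feature shape; values are random, not real transactions.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 28))
y = (rng.random(10000) < 0.002).astype(int)

# stratify=y keeps the fraud proportion equal in train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```

Without `stratify`, a random 20% test split of so rare a class could easily contain zero fraudulent examples, making evaluation meaningless.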
We design a hybrid classical-quantum network following [1]. First, owing to the large number of input features, we project the original features into a representation vector of dimension 14 with a small multi-layer perceptron of 3 layers with 10 hidden nodes each. These representations are then fed into the continuous-variable quantum model depicted in Figure 4. Finally, the output layer of the hybrid architecture is based on photon-number measurement. Since the output is fixed in the Fock basis, we assign the ''genuine'' class to a single photon in the first mode, while the second mode corresponds to the ''fraudulent'' class. In contrast to the design of the discrete-variable quantum model, the primitive continuous-variable quantum gates used are single-mode Gaussian gates, including the phase-space squeezing gate S(r), the N-mode interferometer U(θ, φ), and the displacement gate D(α). The Kerr gate K(k) at the end of the architecture is a non-Gaussian gate, which enables non-linearity and universality [1].
The computational requirement for training such a model is significant, demanding approximately 2 GPU-days for a single run. Hence, we only investigate the Bayes-by-backprop approach, since deep ensembles require many independent runs, which are computationally prohibitive on current hardware. Figure 10 shows the receiver operating characteristic (ROC) curves from frequentist and Bayes-by-backprop training. The quantum model under Bayesian training outperforms the frequentist approach, reaching an AUC of 0.9128 ± 0.03 compared to 0.8205. We also construct a classical neural network with nearly the same model complexity as the quantum model, which achieves a comparable AUC of 0.90. To reach the same test accuracy as the QML model, we need to deepen the classical network by one layer of 10 hidden nodes, adding 110 extra parameters. The results provide a proof of principle for Bayesian training of variational quantum circuits in practice, which can achieve results comparable to classical machine learning models while requiring fewer parameters. This is evidence that the QML model possesses a larger model capacity for the patterns learned from the input space, leading to more efficient trade-offs between predictive performance and model complexity.

VI. CONCLUSION
In this work, we introduce the Bayesian quantum machine learning model. We adapt two variational inference methods used to train classical Bayesian neural networks to learn Bayesian quantum machine learning models. Our proposed approaches achieve results comparable to the frequentist approach while also capturing the uncertainty in the model's inferences. Such epistemic uncertainty is valuable for the decision-making process and can be considered the main advantage of Bayesian inference over its frequentist counterpart. Moreover, we provide a theoretical analysis of the model capacity and trainability of QNNs, supported by numerical experiments.
From a broader perspective, we propose an alternative method for training QNNs via Bayesian inference on any dataset of interest. The first step of the procedure is to construct a variational quantum circuit based on Section III-A, which embeds the input features into a high-dimensional Hilbert space. Second, once the structure of the QNN is defined, we propose two algorithms that enable training the QNN by Bayesian inference. Finally, the BayesianQNN offers epistemic uncertainty estimates for the model's predictions, which are useful for decision-making. Furthermore, our numerical results show that the proposed method is effective for both discrete- and continuous-variable quantum models, achieving results comparable to its competitors.
Further study of BayesianQNNs may address scalability, where we aim to train larger BayesianQNNs. Another potential direction is to further study the effect of parameter-free noise, since it plays an important role in shaping the decision boundary.