Antithetic Magnetic and Shadow Hamiltonian Monte Carlo

Hamiltonian Monte Carlo is a Markov Chain Monte Carlo method that has been widely applied to numerous posterior inference problems within the machine learning literature. Markov Chain Monte Carlo estimators have higher variance than classical Monte Carlo estimators due to autocorrelations present between the generated samples. In this work we present three new methods for tackling the high variance problem in Hamiltonian Monte Carlo based estimators: 1) We combine antithetic and importance sampling techniques where the importance sampler is based on sampling from a modified or shadow Hamiltonian using Separable Shadow Hamiltonian Hybrid Monte Carlo, 2) We present the antithetic Magnetic Hamiltonian Monte Carlo algorithm that is based on performing antithetic sampling on the Magnetic Hamiltonian Monte Carlo algorithm and 3) We propose the antithetic Magnetic Momentum Hamiltonian Monte Carlo algorithm based on performing antithetic sampling on the Magnetic Momentum Hamiltonian Monte Carlo method. We find that the antithetic Separable Shadow Hamiltonian Hybrid Monte Carlo and antithetic Magnetic Momentum Hamiltonian Monte Carlo algorithms produce effective sample sizes that are higher than antithetic Hamiltonian Monte Carlo on all the benchmark datasets. We further find that antithetic Separable Shadow Hamiltonian Hybrid Monte Carlo and antithetic Magnetic Hamiltonian Monte Carlo produce higher effective sample sizes normalised by execution time in higher dimensions than antithetic Hamiltonian Monte Carlo. In addition, the antithetic versions of all the algorithms have higher effective sample sizes than their non-antithetic counterparts, indicating the usefulness of adding antithetic sampling to Markov Chain Monte Carlo algorithms. The methods are assessed on benchmark datasets using Bayesian logistic regression and Bayesian neural network models.


I. INTRODUCTION
Predictive models of high parameter dimensionality have become the mainstay across a multitude of critical tasks such as medicine, law enforcement, finance and self-driving automobiles [1]-[5]. Bayesian methods such as Bayesian logistic regression (BLR) and Bayesian neural networks (BNNs) allow for principled inference of posterior distributions of model parameters. The Bayesian framework provides a robust approach to predictive uncertainty, as well as theoretically justified interpretations of regularisation and generalisation through prior distributions.
Markov Chain Monte Carlo (MCMC) methods are fundamental in performing inference of probabilistic machine learning models. Hamiltonian Monte Carlo (HMC) and its variants have been a popular choice of inference due to their ability to suppress random-walk behaviour through the use of first-order gradient information [6]-[9]. Although HMC serves as an improvement over other MCMC methods, it still suffers from the presence of autocorrelations in the generated samples [10], [11]. This results in the high variance of HMC based estimators. One approach to tackling the high variance of MCMC estimators is to use results from MCMC coupling theory [12], [13].
The coupling of MCMC methods has been used as a theoretical tool to prove convergence behaviour of Markov chains [12], [14]-[17]. Johnson [15] presents a convergence diagnostic for MCMC methods that is based on coupling a Markov chain with another chain that is periodically restarted from fixed parameter values. The diagnostic provides an informal lower bound on the effective sample size (ESS) of an MCMC chain. Jacob et al. [12] remove the bias in MCMC chains by using couplings of Markov chains via a telescopic sum argument. The resulting unbiased estimators can then be computed independently in parallel.
More recently, MCMC couplings have been studied for HMC with good results [18], [19]. Markov chain coupling has also been used to provide unbiased HMC estimators [12], [13], [20]. Heng and Jacob [13] propose an approach that constructs a pair of HMC chains that are coupled in such a way that they meet after some random number of iterations. These chains can then be combined to create unbiased chains. Unlike the approach of Heng and Jacob [13] and other authors [12], [19], Bou-Rabee et al. [17] present a new coupling algorithm where the momentum variable is not shared between the coupled chains.
In this paper we present methods that use the results from coupling theory to create anti-correlated HMC based chains where the momentum variable is shared between the chains. These anti-correlated chains are created by running the second chain with the momentum auxiliary variable having the opposite sign of the momentum of the first chain. These chains also share the random uniform variable in the Metropolis-Hastings acceptance step. When these two chains anti-couple strongly, taking the average of the two chains results in estimators that have lower variance, or equivalently, these chains produce samples that have higher effective sample sizes than their non-antithetic counterparts.
Antithetic coupled chains have a rich history in the literature [19], [21]-[23]. Green and Han [22] reduce the variance of samples from a Markov chain using an alternating sign-flip algorithm, which is akin to antithetic sampling. Antithetic Gibbs sampling is presented in Frigessi et al. [23], and is shown to produce estimators that always have lower variance than Gibbs sampling. The work most closely related to ours is the paper by Piponi et al. [19], where the antithetic HMC (A-HMC) and control variate HMC variance reduction techniques are presented. The HMC based antithetic samplers have an advantage over antithetic Gibbs samplers in that they are applicable to problems where conditional distributions are intractable, and where Gibbs sampling may mix slowly [19]. Although the control variate HMC variance reduction technique is more generally applicable than the antithetic approach (which requires the target distribution to be symmetric), it relies on the practitioner being able to construct efficient and accurate approximate distributions, which is difficult for neural networks and is an impediment to broad use [19]. Thus, this work focuses on the antithetic sampling approach, which is straightforward to implement for BNNs as well as for BLR.
We expand on the work of Piponi et al. [19] by presenting three new antithetic sampling MCMC algorithms, being the antithetic Separable Shadow Hamiltonian Hybrid Monte Carlo (A-S2HMC), antithetic Magnetic Hamiltonian Monte Carlo (A-MHMC) and antithetic Magnetic Momentum Hamiltonian Monte Carlo (A-MMHMC) algorithms.
The A-S2HMC algorithm is based on sampling from the shadow Hamiltonian using Separable Shadow Hamiltonian Hybrid Monte Carlo (S2HMC) [24], [25]. Sampling using the S2HMC algorithm provides benefits over sampling using HMC. One benefit is that the shadow Hamiltonian is better conserved by the numerical integrator, which allows one to use larger step sizes (thus reducing autocorrelations) than in HMC without a significant drop in the acceptance rates [24]-[27]. In addition, S2HMC is an importance sampler and already offers variance reduction over HMC. The A-S2HMC algorithm thus combines the benefits of antithetic sampling with importance sampling, which should provide even more variance reduction than S2HMC provides over HMC. The disadvantage of S2HMC relative to HMC is that it consumes more computational resources, which reduces the outperformance on a normalised ESS basis.
The A-MHMC algorithm is based on adding antithetic sampling to the Magnetic Hamiltonian Monte Carlo (MHMC) [28] algorithm. The MHMC algorithm provides an improvement on HMC by adjusting the range of exploration via a magnetic field. This has the effect of enhancing the convergence speed and reducing the autocorrelations of the samples [28], [29]. In the A-MHMC algorithm we combine antithetic sampling with the benefits that MHMC already has over HMC, with the aim of creating a sampler that outperforms A-HMC. Unlike S2HMC, MHMC has an execution time comparable to HMC, which means that the computational burden should not outweigh the benefit of the higher ESSs produced by MHMC and A-MHMC.
The A-MMHMC algorithm is based on performing antithetic sampling on the Magnetic Momentum Hamiltonian Monte Carlo (MMHMC) algorithm [28], [30]. As with MHMC, MMHMC produces higher ESSs than HMC and we thus leverage this to create the A-MMHMC algorithm which should outperform A-HMC. Unlike MHMC, the MMHMC algorithm requires an implicit integration scheme, which causes the method to be slow when compared to MHMC and HMC [28], [30]. This affects its performance on a normalised ESS basis.
The empirical results show that the proposed algorithms provide higher ESSs than A-HMC on the majority of the datasets, especially in higher dimensions. In addition, the antithetic versions of all the algorithms we consider in this paper have higher ESSs than their non-antithetic counterparts. This shows the benefit of adding antithetic sampling to MCMC algorithms.
The main contributions of this work are as follows:
• We present three new antithetic algorithms, namely antithetic Separable Shadow Hamiltonian Hybrid Monte Carlo, antithetic Magnetic Hamiltonian Monte Carlo and antithetic Magnetic Momentum Hamiltonian Monte Carlo.
• We provide the first application of antithetic MCMC algorithms to the inference of Bayesian neural networks.
The remainder of this paper is structured as follows: Sections II to IV discuss the HMC, MHMC, MMHMC and S2HMC algorithms on which the new antithetic algorithms are based, Section V presents the three new antithetic algorithms that we are proposing along with the antithetic HMC method, Section VI outlines the experiments conducted, Section VII presents and discusses the results of the experiments, and Section VIII concludes.

II. HAMILTONIAN MONTE CARLO
Hamiltonian Monte Carlo (HMC) reduces the random walk behaviour of Metropolis-Hastings by adding auxiliary momentum variables to the parameter space [6], [31]. HMC creates a vector field around the current state using gradient information, which assigns the current state a trajectory towards a next state with high probability [31], [32]. The dynamical system formed by the model parameters w and the auxiliary momentum variables p is represented by the Hamiltonian H(w, p), written as follows [33]:

H(w, p) = U(w) + K(p)

where U(w) is the negative log-likelihood of the target posterior distribution, also referred to as the potential energy, and K(p) is the kinetic energy defined by the kernel of a Gaussian with covariance matrix M [32]:

K(p) = (1/2) p^T M^{-1} p

HMC then samples from the joint canonical distribution defined by the Hamiltonian as:

P(w, p) ∝ exp(−H(w, p)) = exp(−U(w)) exp(−K(p))

The Hamiltonian dynamics are defined by the differential equations

dw/dt = ∂H/∂p = M^{-1} p,   dp/dt = −∂H/∂w = −∇U(w)

These differential equations cannot be solved analytically in most instances, and one has to resort to a numerical integration scheme. The leapfrog integration scheme is the most commonly used. In the leapfrog integrator, to reach the next point in the path, we take half a step in the momentum direction, followed by a full step in the direction of the model parameters, ending with another half step in the momentum direction [4], [6]. Since numerical integration is not exact and introduces errors, the Metropolis-Hastings acceptance procedure is used to adjust the acceptance rate of the sampler and ensure detailed balance [6], [34]. Algorithm 1 shows the pseudo-code for HMC and the leapfrog integration scheme. The HMC algorithm has multiple parameters that require tuning for efficient sampling, such as the step size and the trajectory length. A trajectory length that is too short leads to random walk behaviour similar to the Metropolis-Hastings method, while a trajectory length that is too long results in a trajectory that inefficiently traces back on itself [7].
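The leapfrog-within-Metropolis recipe above can be sketched as follows. This is a minimal illustrative implementation, not the paper's Algorithm 1, and it assumes a unit mass matrix M = I; the standard Gaussian target in the usage example is purely for demonstration:

```python
import numpy as np

def hmc_sample(w0, grad_u, u, eps, n_leapfrog, n_samples, rng):
    """Minimal HMC sketch with M = I: leapfrog plus Metropolis-Hastings."""
    w = np.asarray(w0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(w.shape)           # resample momentum
        w_new, p_new = w.copy(), p.copy()
        p_new -= 0.5 * eps * grad_u(w_new)         # half momentum step
        for i in range(n_leapfrog):
            w_new += eps * p_new                   # full position step
            if i != n_leapfrog - 1:
                p_new -= eps * grad_u(w_new)       # full momentum step
        p_new -= 0.5 * eps * grad_u(w_new)         # final half momentum step
        # Metropolis-Hastings correction for the integration error
        h_old = u(w) + 0.5 * p @ p
        h_new = u(w_new) + 0.5 * p_new @ p_new
        if rng.uniform() < np.exp(h_old - h_new):
            w = w_new
        samples.append(w.copy())
    return np.array(samples)

# Usage: standard normal target, U(w) = ||w||^2 / 2, grad U(w) = w
rng = np.random.default_rng(0)
draws = hmc_sample(np.zeros(2), lambda w: w, lambda w: 0.5 * w @ w,
                   eps=0.1, n_leapfrog=20, n_samples=2000, rng=rng)
```

The step size `eps` and number of leapfrog steps are the tuning parameters discussed above; the dual averaging scheme of Section VI tunes `eps` automatically.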

Algorithm 1 Hamiltonian Monte Carlo
We now provide a simple example of how one would practically use the HMC algorithm, and the variants that we present in later sections of this paper, to infer the parameters of a model. Suppose that one wanted to infer the parameters a and b of the function f(x) = ax^2 + bx for x > 0. If we have n independent observations (x_i, y_i), assume Gaussian observation noise and use a uniform prior on the parameters, we would set

U(w) = (1/2) Σ_{i=1}^{n} (y_i − a x_i^2 − b x_i)^2

where w = {a, b}, and K(p) = (1/2) p^T M^{-1} p. The Hamiltonian for this problem would then be given as:

H(w, p) = (1/2) Σ_{i=1}^{n} (y_i − a x_i^2 − b x_i)^2 + (1/2) p^T M^{-1} p

Algorithm 1 would then be run using this Hamiltonian with M = I.
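The potential energy and gradient for this toy model can be sketched as follows, assuming Gaussian observation noise as above. The data-generating values a = 2, b = −1 and the noise scale are illustrative choices, not from the paper:

```python
import numpy as np

# Hypothetical observations from f(x) = a x^2 + b x with Gaussian noise
rng = np.random.default_rng(1)
a_true, b_true = 2.0, -1.0                 # illustrative "true" parameters
x = rng.uniform(0.1, 3.0, size=50)         # x > 0 as in the example
y = a_true * x**2 + b_true * x + 0.1 * rng.standard_normal(50)

def u(w):
    """Potential energy: negative log-likelihood under unit-variance noise."""
    a, b = w
    r = y - a * x**2 - b * x
    return 0.5 * np.sum(r**2)

def grad_u(w):
    """Gradient of U with respect to (a, b)."""
    a, b = w
    r = y - a * x**2 - b * x
    return np.array([-np.sum(r * x**2), -np.sum(r * x)])
```

These two callables are exactly what the HMC sampler needs; the same pattern (a potential and its gradient) carries over to the BLR and BNN models used in the experiments.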

III. MAGNETIC HAMILTONIAN MONTE CARLO METHODS
Magnetic Hamiltonian Monte Carlo (MHMC) expands on HMC by adding a magnetic field in addition to the force field already in HMC. This magnetic field offers a great deal of flexibility over HMC and encourages more efficient exploration of the posterior [28], [29]. This results in lower autocorrelations in the generated samples when compared to HMC. MHMC uses the same Hamiltonian as in HMC, but exploits non-canonical Hamiltonian dynamics where the canonical matrix now has a non-zero diagonal block. The MHMC dynamics are:

dw/dt = ∂H/∂p = M^{-1} p,   dp/dt = −∂H/∂w + G ∂H/∂p = −∇U(w) + G M^{-1} p

where G is the term that represents the magnetic field. This also shows that MHMC only differs from HMC dynamics by G being non-zero; when G = 0, MHMC and HMC have the same dynamics. As with HMC, these dynamics cannot be integrated exactly, and we resort to a numerical integration scheme with a Metropolis-Hastings acceptance step to ensure detailed balance [28]. For MHMC the integrator is not exactly the leapfrog integrator, but is very similar. Algorithm 2 shows the pseudo-code for the MHMC algorithm as well as the leapfrog-like integration scheme.

Algorithm 2 Magnetic Hamiltonian Monte Carlo
The matrix G provides an extra degree of freedom for the MHMC algorithm. It is not clear how one should set this matrix; ideally it should be tuned adaptively [28]. The selection of G is still an open research problem [28], [30]. In this paper we follow the directions of the authors [28] and select only a few dimensions to be influenced by the magnetic field. In particular, G was set such that G_{1i} = G_{5i} = g, G_{i1} = G_{i5} = −g and zero elsewhere, for g = 0.2. We also ensure that G is anti-symmetric [30].
Although MHMC requires matrix exponentiation and inversion (see Algorithm 2 line 13), this only needs to be computed once upfront and stored [28]. Following this approach results in computation time that is comparable to HMC. This becomes more important in models that have many parameters such as neural networks.
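The caching strategy described above can be sketched as follows. This is an illustrative leapfrog-like integrator for the MHMC dynamics with M = I and an invertible G, not the paper's Algorithm 2, and the sign conventions should be checked against a reference implementation:

```python
import numpy as np
from scipy.linalg import expm

def magnetic_leapfrog(w, p, grad_u, eps, n_steps, G):
    """Leapfrog-like MHMC integrator sketch (M = I, G invertible).

    The magnetic subflow dw/dt = p, dp/dt = G p has the exact solution
    p(t) = exp(tG) p0, w(t) = w0 + G^{-1}(exp(tG) - I) p0, so the matrix
    exponential and inverse are computed once upfront and reused.
    """
    rot = expm(eps * G)                             # computed once, cached
    drift = np.linalg.solve(G, rot - np.eye(len(w)))
    for _ in range(n_steps):
        p = p - 0.5 * eps * grad_u(w)               # half momentum kick
        w = w + drift @ p                           # exact magnetic drift
        p = rot @ p                                 # rotation by the field
        p = p - 0.5 * eps * grad_u(w)               # half momentum kick
    return w, p
```

Because `rot` and `drift` depend only on eps and G, each integration step costs the same as plain leapfrog apart from two matrix-vector products, which is why MHMC's execution time stays comparable to HMC.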
As with MHMC, MMHMC [30] differs from HMC by the canonical matrix having a non-zero diagonal block. The dynamics of MMHMC are given as:

dw/dt = E ∂H/∂w + ∂H/∂p = E ∇U(w) + M^{-1} p,   dp/dt = −∂H/∂w = −∇U(w)

and thus only differ from HMC dynamics by E being non-zero. When E = 0, MMHMC and HMC have the same dynamics. As with MHMC, these dynamics cannot be integrated exactly and we resort to numerical integration. However, unlike MHMC, the integration scheme is not leapfrog-like and we resort to an implicit midpoint integration scheme [28], [30].
For the experiments in this paper we set E = G. The selection of E is still an open research problem which we intend to address in future work. Algorithm 3 shows the pseudo-code for the MMHMC algorithm, as well as the implicit midpoint integration scheme. The fixed point iterations in lines 15 to 20 of Algorithm 3 are the reason why MMHMC is much slower than HMC and MHMC. This reduces the algorithm's practical usefulness, especially in high-dimensional problems such as neural networks.

IV. SHADOW HAMILTONIANS
Shadow or modified Hamiltonians are perturbations of the Hamiltonian that are by design exactly conserved by the numerical integrator. In the case of shadow Hamiltonian Hybrid Monte Carlo, we sample from the importance distribution defined by the shadow Hamiltonian as:

P̃(w, p) ∝ exp(−H̃^[k](w, p))

where H̃^[k] is the shadow Hamiltonian defined using backward error analysis of the numerical integrator. When performing backward error analysis, the shadow Hamiltonian can be defined by an asymptotic expansion in powers of the discretisation step size ε around the Hamiltonian:

H̃(w, p) = H(w, p) + ε^2 H_2(w, p) + ε^4 H_4(w, p) + ⋯

This asymptotic expansion diverges in practice; however, a kth order truncation of the expansion is used:

H̃^[k](w, p) = H(w, p) + Σ_{i=1}^{k/2−1} ε^{2i} H_{2i}(w, p)

The H_{2i} terms can be determined by matching the corresponding components of the Taylor series in terms of ε and the expanded exact flow of the modified differential equation of the Hamiltonian. These modified equations can be proved to be Hamiltonian for symplectic integrators such as the leapfrog integrator.
In this work, we focus on a fourth-order truncation of the shadow Hamiltonian under the leapfrog integrator. Since the leapfrog is second-order accurate (O(ε^2)), the fourth-order truncation is conserved with higher accuracy (O(ε^4)) than the true Hamiltonian. The fourth-order shadow Hamiltonian for the leapfrog, after matching coefficients from the flow and the asymptotic expansion, becomes:

H̃^[4](w, p) = H(w, p) + (ε^2 / 24) (2 U_w^T K_pp U_w − K_p^T U_ww K_p)

where U_w, U_ww, K_p and K_pp are the Jacobians and Hessians of the potential and kinetic energies respectively. This shadow Hamiltonian is non-separable in terms of w and p, which necessitates a computationally expensive acceptance criterion for the momenta, and the potential tuning of additional parameters [24], [25]. This additional computational overhead is overcome by preprocessing the positions and momenta before propagating through the leapfrog integrator.
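Evaluating the fourth-order truncation above can be sketched as follows for M = I (so K_p = p and K_pp = I). The coefficient ε^2/24 and the signs follow the truncation as stated here and should be verified against the cited references before use:

```python
import numpy as np

def shadow_hamiltonian_4(w, p, u, grad_u, hess_u, eps):
    """Fourth-order leapfrog shadow Hamiltonian sketch, assuming M = I.

    Implements H + (eps^2/24) * (2 U_w' U_w - p' U_ww p); the coefficient
    and signs mirror the truncation in the text and are worth
    cross-checking against the backward error analysis literature.
    """
    uw = grad_u(w)                     # Jacobian of the potential energy
    uww = hess_u(w)                    # Hessian of the potential energy
    h = u(w) + 0.5 * p @ p             # true Hamiltonian with K = p'p/2
    correction = (eps**2 / 24.0) * (2.0 * uw @ uw - p @ uww @ p)
    return h + correction
```

For small step sizes the correction term is O(ε^2), so the shadow and true Hamiltonians agree closely, and the importance weights of Section IV-A stay well behaved.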

A. SEPARABLE SHADOW HAMILTONIAN HYBRID MONTE CARLO
Separable Shadow Hamiltonian Hybrid Monte Carlo (S2HMC) utilises a processed leapfrog integrator to create a separable Hamiltonian [25]. The separable Hamiltonian in S2HMC is

H̃(w, p) = U(w) + (1/2) p^T M^{-1} p + (ε^2 / 24) U_w^T M^{-1} U_w

Propagation of positions and momenta on this shadow Hamiltonian is performed after applying a reversible mapping (ŵ, p̂) = X(w, p), where (ŵ, p̂) is found through fixed point iterations as a pre-processing step. After the leapfrog is performed, this mapping is reversed through post-processing with analogous fixed point iterations. Once the samples are obtained from S2HMC as depicted in Algorithm 4, importance weights are calculated to allow for the use of the shadow canonical density rather than the true density. These weights are based on the differences between the true and shadow Hamiltonians as follows:

b_m = exp(−(H(w_m, p_m) − H̃(w_m, p_m)))

Mean estimates of observables f(w), which are functions of the parameters w, can then be computed as a weighted average.
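The importance-weighted estimate described above can be sketched as follows. The log-domain shift is a standard numerical-stability device, not something prescribed by the paper:

```python
import numpy as np

def importance_weighted_mean(f_values, h_true, h_shadow):
    """Weighted mean of an observable from shadow-Hamiltonian samples.

    Samples are drawn under exp(-H_shadow); weights b_m = exp(H_shadow - H_true)
    = exp(-(H - H_shadow)) reweight them to the true canonical density.
    """
    log_b = h_shadow - h_true          # log importance weights
    log_b -= log_b.max()               # stabilise the exponential
    b = np.exp(log_b)
    return np.sum(b * f_values) / np.sum(b)
```

When the shadow Hamiltonian tracks the true one closely (small step sizes), the weights are nearly uniform and the weighted mean reduces to the ordinary sample mean.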

V. PROPOSED ANTITHETIC SAMPLING ALGORITHMS
Suppose we have two random variables X and Y, not necessarily independent, with the same marginal distribution U. Then for any function f:

Var[(f(X) + f(Y)) / 2] = (1/2) Var[f(X)] (1 + Corr(f(X), f(Y)))

When f(X) and f(Y) are negatively correlated, the average of the two random variables will produce a lower variance than the average of two independent variables. This equation forms the basis for the antithetic sampling algorithms that we present in this paper. The benefit of antithetic sampling is greatest when the distribution U is symmetric about some vector [19], [35]. For the majority of target distributions of interest to machine learning researchers, the symmetry holds only approximately [19]. The approximate symmetry results in approximate anti-coupling, which nonetheless still provides good variance reduction [19].
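A small simulation illustrating the variance identity above, assuming an exactly symmetric (standard Gaussian) sampling distribution so that the antithetic draw Y = −X has the same marginal as X; the observable f is an arbitrary illustrative choice:

```python
import numpy as np

# Antithetic pairs on a symmetric distribution: X ~ N(0, 1) and Y = -X
# share the same marginal. For the odd part of f the pair is perfectly
# anti-correlated, so averaging cancels it and the variance collapses.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = -x                                   # antithetic draw, same marginal

f = lambda s: s + 0.1 * s**2             # mildly asymmetric observable
plain = f(x)                             # i.i.d. estimator terms
paired = 0.5 * (f(x) + f(y))             # antithetic estimator terms

ratio = np.var(paired) / np.var(plain)   # far below 1: strong reduction
```

In the MCMC setting the two chains are only approximately anti-coupled, so the reduction is smaller than in this idealised example, but the mechanism is the same.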
The pseudo-code for the new antithetic algorithms that we are proposing is presented in Algorithms 6-8, with the algorithm for A-HMC shown in Algorithm 5. The difference between the antithetic algorithms and the original algorithms is that a second chain is run with the negated momentum of the first chain, and the uniform random variable in the Metropolis-Hastings acceptance step is shared between the two chains.

VI. EXPERIMENTS
The performance of the algorithms is compared on real world benchmark datasets. The real world data used are the four classification datasets used in Girolami and Calderhead [10], which we model using BLR. We also apply BNNs to model real world regression benchmark datasets. The BNNs considered in this paper have one hidden layer with ten hidden units. The datasets, their number of features and their associated models are outlined in Table 1. The algorithms are compared on the multivariate ESS, the multivariate ESS normalised by the execution time, as well as predictive performance on unseen data. The predictive performance metric used for classification was the Area Under the Curve (AUC), while for regression the Mean Square Error (MSE) was used. For all the algorithms we set the pre-conditioning matrix M = I, which is the common approach in practice.
The execution time is the time taken to generate the samples after the burn-in period. The ESS calculation used in this paper is the multivariate ESS calculation outlined in [36]. This metric takes into account the correlations between the parameters as well, unlike the minimum univariate ESS measure that is typically used to analyse MCMC results [10], [36]. The minimum univariate ESS calculation results in the estimate of the ESS being dominated by the parameter dimensions that mix the slowest, ignoring all other dimensions [36]. The multivariate ESS is calculated as:

mESS = N (|Λ| / |Σ|)^{1/D}

where N is the number of generated samples, D is the number of parameters, Λ is the sample covariance matrix and Σ is the estimate of the Markov chain asymptotic covariance. When D = 1, mESS is equivalent to the univariate ESS. For S2HMC, which is an importance sampler, the mESS is adjusted to take into account the possibility of non-uniform importance weights {b_m}_{m=0}^{N} through the thinning algorithm outlined in Radivojevic et al. [26].
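The multivariate ESS computation can be sketched as follows, using a simple batch-means estimate of the asymptotic covariance Σ; the batch-size choice and estimator details here are illustrative and may differ from the exact procedure in [36]:

```python
import numpy as np

def multivariate_ess(chain):
    """Multivariate ESS sketch: mESS = N * (|Lambda| / |Sigma|)^(1/D).

    chain: (N, D) array of samples. Sigma is a batch-means estimate of the
    Markov chain asymptotic covariance; Lambda is the sample covariance.
    """
    n, d = chain.shape
    b = int(np.floor(np.sqrt(n)))        # batch size ~ sqrt(N)
    a = n // b                           # number of batches
    batch_means = chain[: a * b].reshape(a, b, d).mean(axis=1)
    mu = chain.mean(axis=0)
    sigma = b / (a - 1) * (batch_means - mu).T @ (batch_means - mu)
    lam = np.cov(chain, rowvar=False)
    return n * (np.linalg.det(lam) / np.linalg.det(sigma)) ** (1.0 / d)
```

For an i.i.d. chain the batch-means estimate agrees with the sample covariance, the determinant ratio is close to one, and the mESS is close to N; positive autocorrelation inflates Σ and pulls the mESS below N.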
The mESS for the variance reduced chain is related to the mESS of the original chain as [19]:

mESS_antithetic = 2 mESS / (1 + ρ)

where ρ is the correlation coefficient between the corresponding pairs of chains. In this paper, the correlation is taken as the maximum correlation across all the parameter dimensions, which creates a lower bound for mESS_antithetic. This relation implies that we can increase the ESS by ensuring that the correlation is as close as possible to −1, which is what the anti-coupling methodology aims to do. This also means that mESS_antithetic can be greater than the total number of generated samples N. For all the datasets, we generated 10 000 samples with a burn-in period of 5 000 samples. This was sufficient for all the algorithms to converge on all the datasets. The number of steps used to generate each sample was fixed at 100. The step sizes were chosen by targeting an acceptance rate of 70% through the dual averaging [7] methodology explained below.

B. INTEGRATION STEP SIZE TUNING
We address the issue of step size adaptation in the algorithms through dual averaging in the burn-in phase [7]. We target a Metropolis-Hastings acceptance rate δ using dual averaging updates as follows:

log ε_t = µ − (√t / γ) H̄_t,   log ε̄_t = η_t log ε_t + (1 − η_t) log ε̄_{t−1}

where µ is a free parameter that log ε_t gravitates towards, γ controls the convergence towards µ, η_t = t^{−κ} is a decaying rate of adaptation in line with [37], and H̄_t is a running average of the difference between the target acceptance rate and the actual acceptance rate. The dual averaging updates are such that H̄_t converges to zero, which has the effect of updating the step size towards the target acceptance rate δ. The dual averaging algorithm for HMC is shown in Algorithm 9, with the implementation for the other algorithms following similarly. More information on the dual averaging algorithm can be found in Hoffman and Gelman [7]. In this paper we use the same parameter settings as in Hoffman and Gelman [7], and these settings are shown in Algorithm 9. It is important to note that for the antithetic chains, we only adapt the step size of the original chain, and feed this step size to the other chain.

The details of the real world datasets used in this paper are displayed in Table 1. The logistic regression datasets are all approached as binary classification tasks as they all have two classes. The datasets modelled using neural networks all have one continuous response variable and are thus approached as regression problems. A 90-10 train-test split was used and the features were normalised for all the datasets. For each dataset, the chain was run 10 times from different starting positions and the results recorded. The prior distribution over the parameters for the logistic regression datasets was Gaussian with standard deviation equal to 10. These settings are in line with those in Girolami and Calderhead [10]. For the neural network datasets the prior was a standard normal distribution.
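A single dual averaging update can be sketched as follows. The parameter values γ = 0.05, t0 = 10 and κ = 0.75 mirror the defaults reported by Hoffman and Gelman; anchoring µ at log(1) and the toy acceptance model in the usage note are purely illustrative:

```python
import numpy as np

def dual_averaging_step(log_eps_bar, h_bar, accept_prob, t,
                        delta=0.7, mu=0.0, gamma=0.05, t0=10, kappa=0.75):
    """One dual averaging update of the step size (sketch after [7]).

    delta is the target acceptance rate; h_bar is the running average of
    (delta - actual acceptance). Returns the new raw and smoothed log step
    sizes along with the updated running average.
    """
    h_bar = (1 - 1 / (t + t0)) * h_bar + (delta - accept_prob) / (t + t0)
    log_eps = mu - np.sqrt(t) / gamma * h_bar       # raw proposal
    eta = t ** (-kappa)                             # decaying adaptation rate
    log_eps_bar = eta * log_eps + (1 - eta) * log_eps_bar
    return log_eps, log_eps_bar, h_bar
```

During burn-in, the raw log ε_t is used to run the sampler and the smoothed log ε̄_t is frozen as the final step size; in [7], µ is set relative to an initial step size (log(10 ε_0)) rather than the fixed anchor used here.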

VII. RESULTS AND DISCUSSION
The experiments were implemented in PyTorch [38] and were carried out on a CPU using 64-bit precision. In evaluating the S2HMC and MMHMC algorithms and their antithetic variants, we set a convergence tolerance of 10^-6 or a maximum of ten fixed point iterations.
The performance of the algorithms across different metrics is shown in Figure 1 and Table 2. In Figure 1, the plots on the first row for each dataset show the mESS, and the plots on the second row show the mESS normalised by execution time (which is in seconds). The results are for the 10 runs of each algorithm.
As with Figure 1, the execution time t in Table 2 is in seconds. The results in Table 2 are the mean results over the 10 runs for each algorithm. When calculating the predictive performance metrics (that is, AUC and MSE) for S2HMC and A-S2HMC, the results are weighted by the S2HMC importance weights. Note that we use the mean values over the 10 runs in Table 2 to form our conclusions about the performance of the algorithms.
As expected, HMC has the lowest execution time t on the majority of the datasets, with MHMC being a close second. The execution time for MHMC is similar to that of HMC on the BLR datasets, and MHMC outperforms HMC on the Concrete and Protein datasets. The increased computation time of the shadow algorithms can be attributed to the fixed point iterations in the pre- and post-processing mappings. The MMHMC algorithm is the slowest method due to the fixed point iterations required in the implicit midpoint integration scheme used for this algorithm. Note that the execution time for the antithetic variants is twice the execution time of the non-antithetic algorithms, because two chains are run for the antithetic versions. The execution time of all the algorithms increases with the dimensionality of the problem, which is expected.
On the majority of the datasets, we find that MHMC, MMHMC and S2HMC have higher ESSs than HMC, with the outperformance being more pronounced on the BNN datasets, which have more parameters. The A-MMHMC and A-S2HMC methods produce superior ESSs on all 8 benchmark datasets when compared to A-HMC. A-MHMC produces higher ESSs than A-HMC on 7 of the 8 benchmark datasets, with the under-performance being on the German credit dataset. These results show that the proposed methods outperform in low dimensions, with the outperformance becoming more pronounced in larger dimensions. This will be particularly important when performing inference on deeper neural networks.

TABLE 2. Mean results over 10 runs of each algorithm. Each column represents the mean value for the specific method; for example, column three shows the mean results for HMC. The execution time t is in seconds. The values in bold indicate that the particular method outperforms the other methods on that specific metric; for example, HMC outperforms in terms of t on the Pima dataset. The AUC metric is used for the BLR classification datasets (the first four datasets) while the MSE metric is used for the BNN regression datasets (the last four datasets). The methods have similar predictive performance across all the datasets, with MMHMC and A-MMHMC outperforming the other methods on all the datasets. For the AUC metric higher is better, while for the MSE metric lower is better.
On a normalised ESS basis, A-S2HMC and A-MHMC outperform A-HMC on all the datasets, except that A-MHMC under-performs A-HMC on the German credit dataset. A-MHMC outperforms all the algorithms on the BNN datasets on a normalised ESS basis. This shows that although the proposed methods in general have a higher computational cost than HMC, they still provide improved normalised ESS rates.
The drop in normalised ESS performance for the A-MMHMC algorithm can be attributed to the increase in execution time of the mid-point integration scheme. Using an explicit integration scheme such as that proposed in Brofos and Lederman [30] for MMHMC would provide an improvement on a normalised ESS basis.
The algorithms have similar predictive performance on the benchmark datasets, with MMHMC and A-MMHMC algorithms consistently outperforming the other methods on all the datasets. The predictive performance is also similar between the antithetic and non-antithetic counterparts.

VIII. CONCLUSION
We present the antithetic Separable Shadow Hamiltonian Hybrid Monte Carlo, the antithetic Magnetic Hamiltonian Monte Carlo and the antithetic Magnetic Momentum Hamiltonian Monte Carlo variance reduction schemes based on approximate Markov chain coupling. We compare these three new algorithms to the antithetic Hamiltonian Monte Carlo algorithm on classification tasks using Bayesian logistic regression, and on regression tasks using Bayesian neural network models.
Although the proposed methods require more computational resources than Hamiltonian Monte Carlo in general, they result in improved effective sample size rates when compared to Hamiltonian Monte Carlo and antithetic Hamiltonian Monte Carlo, especially in higher parameter dimensions. We find that the antithetic versions of all the algorithms have higher effective sample sizes than their non-antithetic variants, which shows the usefulness of antithetic Markov Chain Monte Carlo methods.
This work can be improved by assessing the sensitivity of the methods to the path length as, in this study, the path length was fixed and the step size tuned. In addition, the adaptive tuning of the magnetic term, which is an open research problem, in Magnetic Hamiltonian Monte Carlo could also improve on the results presented in this paper. In future work we plan to consider larger datasets such as the MNIST dataset and deeper neural networks, albeit the computation cost will be higher. A comparison of the performance of the proposed algorithms to their control-variate counterparts, as well as exploration of the possibility of using Riemannian manifold based Monte Carlo algorithms are also of interest.