Adaptively Setting the Path Length for Separable Shadow Hamiltonian Hybrid Monte Carlo

Hybrid Monte Carlo (HMC) has been widely applied to numerous posterior inference problems in machine learning and statistics. HMC has two main practical issues: the first is the deterioration in acceptance rates as the system size increases, and the second is its sensitivity to two user-specified parameters, the step size and the trajectory length. The former issue is addressed by sampling from an integrator-dependent modified or shadow density and compensating for the induced bias via importance sampling. The latter issue is addressed by adaptively setting the HMC parameters, with the state-of-the-art method being the No-U-Turn Sampler (NUTS). We combine the benefits of NUTS with those attained by sampling from the shadow density, by adaptively setting the trajectory length and step size of Separable Shadow Hamiltonian Hybrid Monte Carlo (S2HMC). This leads to a new algorithm, adaptive S2HMC (A-S2HMC), that shows improved performance over S2HMC and NUTS across various targets while leaving the target density invariant.


I. INTRODUCTION
Hybrid Monte Carlo (HMC) [1] is a widely employed inference method for sampling complex posterior distributions in machine learning and statistics. The success of HMC and its variants is fundamentally premised on the principled use of gradient information via Hamiltonian dynamics, which facilitates distant sample proposals that suppress random walk behavior [2,3,4,5,6,7,8].
The performance of HMC, however, suffers from two significant practical issues. The first is the deterioration in acceptance rates due to numerical integration errors as the system size increases [9,10,11]. This deterioration results in strong auto-correlations in the generated samples, which in turn demands large sample sizes. The second issue is the need to tune two performance-sensitive parameters: the step size and the trajectory length. A trajectory length that is too short leads to random walk behaviour [12,13], while one that is too long results in a sampling trajectory that traces back on itself [12,13]. Similarly, small step sizes are computationally inefficient, leading to correlated samples and poor mixing, while large step sizes compound discretisation errors, which also leads to poor mixing [12,13]. Finding the optimal settings for these parameters is of paramount importance, but is far from trivial [12,13].
An approach to address the deterioration in acceptance rates with increases in system size is the use of modified or shadow Hamiltonian based samplers. These modified Hamiltonian methods leverage backward error analysis of the numerical integrator, which results in higher order conservation of the shadow Hamiltonian relative to the true Hamiltonian [14]. Numerous methods have been put forward for sampling from a shadow Hamiltonian [15,11,16,9,8,5].
Izaguirre et al. [9] introduce Shadow Hybrid Monte Carlo (SHMC) by deriving a shadow Hamiltonian corresponding to the leapfrog integrator. Sweet et al. [16] present Separable Shadow Hamiltonian Hybrid Monte Carlo (S2HMC), which leverages a processed leapfrog integrator that results in a separable shadow Hamiltonian. The separable shadow Hamiltonian reduces the need for computationally intensive momentum generation [9] or partial momentum refreshment [17,15] evaluations that are necessitated by non-separable shadow Hamiltonians. Heide et al. [11] derive a non-separable shadow Hamiltonian for the generalised leapfrog integrator in Riemannian manifold Hamiltonian Monte Carlo (RMHMC), which results in improved performance relative to sampling from the true Hamiltonian. These methods still suffer from the practical impediments of setting the integration step size and trajectory lengths.
Hoffman and Gelman [12] present the No-U-Turn Sampler (NUTS), which automatically sets the trajectory length through a termination criterion that avoids retracing of steps. Hoffman and Gelman [12] also address step size adaptation through primal dual averaging in the burn-in phase. These two approaches for setting these parameters have been widely adopted in the literature [18,19,20].
There has been no attempt to adaptively set the step size and trajectory length parameters for shadow Hamiltonian Monte Carlo methods. A particular constraint in trajectory length tuning is that most shadow Hamiltonian methods proposed in literature rely on non-separable Hamiltonians [15,17,9,11]. We overcome this constraint by relying on a separable shadow Hamiltonian based sampler. This allows for both tuning of step sizes through dual averaging and the trajectory length via a binary tree recursion as in Hoffman and Gelman [12].
We show that our adaptive Separable Shadow Hamiltonian Hybrid Monte Carlo (A-S2HMC) method achieves better exploration of the posterior and higher effective sample sizes than standard Separable Shadow Hamiltonian Hybrid Monte Carlo across various benchmarks, while leaving the target density invariant. The analysis is performed on jump-diffusion processes, Neal's funnel [21] and Bayesian logistic regression posterior targets.
Our main contribution is the A-S2HMC algorithm, which adaptively sets the step size and trajectory length of S2HMC while leaving the target density invariant. The remainder of this paper is structured as follows: Sections II and III discuss the Markov Chain Monte Carlo methods that form the basis of the new method, Section IV presents the proposed method, and Section V outlines the experiments conducted and discusses their results. We provide the conclusion in Section VI.

II. HAMILTONIAN MONTE CARLO
The Hamiltonian Monte Carlo (HMC) algorithm uses first order gradient information of the unnormalised target posterior to guide its exploration of the parameter space [1,2]. HMC augments the parameter space w with an auxiliary momentum variable p of the same dimensionality as w. The resulting Hamiltonian H(w, p) of this dynamical system is

H(w, p) = U(w) + K(p),

where U(w) is the negative log-likelihood of the target posterior distribution and K(p) is the kinetic energy defined by the kernel of a Gaussian as

K(p) = (1/2) p^T M^{-1} p,

where M is a positive definite mass matrix. The trajectory vector field is defined by considering the parameter space as a physical system that follows Hamiltonian dynamics [2]. The equations governing the trajectory of the chain are then defined by Hamilton's equations at a fictitious time t as follows [22]:

dw/dt = ∂H/∂p = M^{-1} p,   dp/dt = −∂H/∂w = −∇_w U(w).

The evolution of this Hamiltonian system must preserve both volume and total energy. As the Hamiltonian is separable, to traverse the space we use the leapfrog integrator [4,22]. In the leapfrog integrator, to reach the next point in the path, we take a half step in the momentum direction, followed by a full step in the direction of the model parameters, and then end with another half step in the momentum direction. The update equations for the leapfrog integration scheme are [22]:

p_{t+ε/2} = p_t − (ε/2) ∇_w U(w_t),
w_{t+ε} = w_t + ε M^{-1} p_{t+ε/2},
p_{t+ε} = p_{t+ε/2} − (ε/2) ∇_w U(w_{t+ε}),

where ε is the discretisation step size. The leapfrog steps are repeated until the maximum trajectory length L is reached. Due to the discretisation errors arising from the leapfrog integration, a Metropolis-Hastings acceptance step is then performed in order to accept or reject the proposed sample (w*, p*) [22,4]. In the Metropolis-Hastings step the parameters w* proposed by the HMC are accepted with probability

P(accept w*) = min(1, exp(H(w, p) − H(w*, p*))).

The overall HMC sampling process follows a Gibbs sampling scheme, where we sample the momentum and then sample a new set of parameters given the drawn momentum.
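As a concrete reference, the transition described above can be sketched in a few lines of Python. This is a minimal illustration assuming an identity mass matrix M = I; the function names are ours, not the paper's.

```python
import numpy as np

def leapfrog(w, p, grad_U, eps, L):
    """Leapfrog integration of L steps with an identity mass matrix."""
    w, p = w.copy(), p.copy()
    p -= 0.5 * eps * grad_U(w)        # initial half momentum step
    for _ in range(L - 1):
        w += eps * p                  # full position step
        p -= eps * grad_U(w)          # full momentum step
    w += eps * p
    p -= 0.5 * eps * grad_U(w)        # final half momentum step
    return w, p

def hmc_step(w, U, grad_U, eps, L, rng):
    """One HMC transition: momentum refresh, leapfrog, Metropolis-Hastings test."""
    p = rng.standard_normal(w.shape)
    w_star, p_star = leapfrog(w, p, grad_U, eps, L)
    H_cur = U(w) + 0.5 * p @ p
    H_prop = U(w_star) + 0.5 * p_star @ p_star
    # accept with probability min(1, exp(H_cur - H_prop)); min() avoids overflow
    if rng.uniform() < np.exp(min(0.0, H_cur - H_prop)):
        return w_star
    return w
```

Because the leapfrog is reversible, integrating forward and then backward with negated momentum recovers the starting state up to floating point error.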
For the HMC algorithm, one needs to tune the step size and the trajectory length. A trajectory length that is too short leads to random walk behaviour similar to the Metropolis-Hastings method [12], while a trajectory length that is too long results in a trajectory that inefficiently traces back on itself. The step size is equally critical: small step sizes are computationally inefficient and lead to correlated samples and poor mixing, while large step sizes compound discretisation errors, leading to low acceptance rates. Tuning these parameters typically requires multiple time-consuming pilot runs [12].
The No-U-Turn Sampler (NUTS) of [12] automates the tuning of the leapfrog step size and trajectory length. In NUTS, the step size is tuned through primal dual averaging during an initial burn-in phase by targeting a particular acceptance rate. The trajectory length is tuned by iteratively adding leapfrog steps until either the chain starts to trace back (a U-turn) or the Hamiltonian diverges. A U-turn occurs when the last proposed position w* starts moving back towards the initial position w, that is, when

(w* − w) · p < 0.

Terminating on this condition alone, however, violates the detailed balance condition. To overcome this problem, NUTS generates a path of states B whose size is determined by the termination criterion. The set B satisfies the following property, which is required for detailed balance [13,12]: the probability of generating B is the same starting from any of its members, where z = (w, p) denotes the current state.
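The termination check can be sketched as follows. In the full algorithm the criterion is evaluated at both ends of the doubled trajectory, so the sketch takes the leftmost and rightmost states as arguments (names are illustrative):

```python
import numpy as np

def u_turn(w_minus, w_plus, p_minus, p_plus):
    """NUTS termination: stop once further integration at either end of the
    trajectory would shrink the end-to-end displacement (a U-turn)."""
    dw = w_plus - w_minus
    return (dw @ p_minus < 0) or (dw @ p_plus < 0)
```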
Given B, the current state z and a slice variable u drawn uniformly from the interval [0, π_Z(z)], a set of chosen states

C = {z′ ∈ B : π_Z(z′) ≥ u}    (7)

is constructed, from which the next state is drawn uniformly. In instances where the leapfrog transition does not effectively preserve the Hamiltonian, the algorithm can generate a large or even infinite B containing many states that will never be chosen (i.e. |B| ≫ |C|), wasting computational resources. In the next section we present shadow Hamiltonians, which are better conserved by the leapfrog integrator.

III. SHADOW HAMILTONIANS
The HMC algorithm relies heavily on the integrator's accurate simulation of the trajectories dictated by the Hamiltonian dynamics. However, the leapfrog integrator only preserves the Hamiltonian up to second order. To increase accuracy, one could design more accurate numerical integrators that preserve the Hamiltonian to a higher order [17,8,5], but this approach tends to be computationally expensive. The approach in this work instead relies on backward error analysis to derive a shadow Hamiltonian whose energy is more accurately conserved by the leapfrog algorithm. We thus target the corresponding modified or shadow density, and importance sampling is then utilised to correct the generated samples towards the true density.
Shadow or modified Hamiltonians are perturbations of the Hamiltonian that are by design exactly conserved by the numerical integrator [8,5]. In shadow Hamiltonian Hybrid Monte Carlo, we sample from the importance distribution defined by the shadow Hamiltonian,

π̃(w, p) ∝ exp(−H̃^[k](w, p)),

where H̃^[k] is the shadow Hamiltonian defined using backward error analysis of the numerical integrator up to the k-th order. Under backward error analysis, the shadow Hamiltonian is defined by an asymptotic expansion in powers of the discretisation step size ε around the Hamiltonian:

H̃(w, p) = H(w, p) + ε² H₂(w, p) + ε⁴ H₄(w, p) + ⋯

This asymptotic expansion diverges in practice, so a k-th order truncation of the expansion is used:

H̃^[k](w, p) = H(w, p) + ε² H₂(w, p) + ⋯ + ε^{k−2} H_{k−2}(w, p).

The H_i terms can be determined by matching the corresponding components of the Taylor series in powers of ε with the expanded exact flow of the modified differential equation of the Hamiltonian. These modified equations can be shown to be Hamiltonian for symplectic integrators such as the leapfrog integrator.
In this work, we focus on a fourth-order truncation of the shadow Hamiltonian under the leapfrog integrator [8,5]. Since the leapfrog is second-order accurate, the fourth-order truncation is conserved to higher accuracy, O(ε⁴), than the true Hamiltonian, which is only conserved to O(ε²). After matching coefficients from the flow and the asymptotic expansion, the fourth-order shadow Hamiltonian for the leapfrog becomes:

H̃^[4](w, p) = H(w, p) + (ε²/12) K_p U_ww K_p^T − (ε²/24) U_w K_pp U_w^T + O(ε⁴),    (11)

where U_w, U_ww, K_p and K_pp are the Jacobians and Hessians of the potential and kinetic energies respectively. Note that the shadow Hamiltonian in equation (11) can also be obtained by truncating the Baker-Campbell-Hausdorff (BCH) formula applied to Poisson brackets of the terms of the separable Hamiltonian [23]. The shadow Hamiltonian in equation (11) is non-separable in terms of w and p, which necessitates computationally expensive acceptance criteria for the momenta, and the potential tuning of additional parameters [9,16,11,24,8,5]. This additional computational overhead is overcome by pre-processing the positions and momenta before propagating them through the leapfrog integrator, thus creating a processed leapfrog integration scheme [9,16,8,5]. The processed leapfrog can be shown to be both symplectic and reversible and preserves detailed balance, hence leaving the target density invariant [16]. We present the proof of this in the supplementary information.
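The higher-order conservation of the fourth-order shadow is easy to verify numerically. The sketch below assumes a one-dimensional standard Gaussian target U(w) = w²/2 with unit mass, for which K_p U_ww K_p = p² and U_w K_pp U_w = w², with signs appropriate to the kick-drift-kick leapfrog. It runs the integrator and compares the fluctuation of the true Hamiltonian (O(ε²)) against that of the shadow (O(ε⁴)):

```python
import numpy as np

def leapfrog_traj(w, p, eps, n):
    """Kick-drift-kick leapfrog for U(w) = w^2/2, K(p) = p^2/2 (unit mass)."""
    ws, ps = [w], [p]
    for _ in range(n):
        p = p - 0.5 * eps * w   # half kick
        w = w + eps * p         # drift
        p = p - 0.5 * eps * w   # half kick
        ws.append(w)
        ps.append(p)
    return np.array(ws), np.array(ps)

def hamiltonian(w, p):
    return 0.5 * w**2 + 0.5 * p**2

def shadow_hamiltonian(w, p, eps):
    # Fourth-order shadow for this target: the mixed K_p U_ww K_p term becomes
    # p^2 and the U_w K_pp U_w term becomes w^2.
    return hamiltonian(w, p) + eps**2 / 12.0 * p**2 - eps**2 / 24.0 * w**2

eps = 0.1
ws, ps = leapfrog_traj(1.0, 0.0, eps, 1000)
drift_true = np.ptp(hamiltonian(ws, ps))                 # fluctuates at O(eps^2)
drift_shadow = np.ptp(shadow_hamiltonian(ws, ps, eps))   # fluctuates at O(eps^4)
```

On this toy problem the shadow energy fluctuates roughly two orders of magnitude less than the true energy at ε = 0.1.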
Separable Shadow Hamiltonian Hybrid Monte Carlo (S2HMC) utilises a processed leapfrog integrator to create a separable Hamiltonian [16]. The separable Hamiltonian in S2HMC is given as:

H̃(ŵ, p̂) = H(ŵ, p̂) + (ε²/24) U_ŵ K_p̂p̂ U_ŵ^T,

which is obtained by substituting a canonical transformation (ŵ, p̂) = X(w, p) into (11). The full derivation of the separable Hamiltonian is presented in Appendix VI. The map should commute with reversal of momenta and should preserve phase space volume so that the resulting S2HMC ensures detailed balance [16,8,5]. Propagation of positions and momenta on this shadow Hamiltonian is performed after applying the reversible mapping (ŵ, p̂) = X(w, p), with (ŵ, p̂) found through fixed-point iterations [25,16,8,5]. After the leapfrog is performed, this mapping is reversed using post-processing, again via fixed-point iterations. Once the samples are obtained from S2HMC, importance weights are calculated to correct for the use of the shadow canonical density rather than the true density. These weights are based on the difference between the true and shadow Hamiltonians, b_m = exp[−(H(w, p) − H̃(w, p))]. Mean estimates of observables f(w), which are functions of the parameters w, can then be computed as the weighted average

⟨f⟩ = Σ_m b_m f(w_m) / Σ_m b_m.
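The importance-weighted estimator can be sketched as follows; the max-subtraction is a standard numerical stabilisation of ours, not part of the paper:

```python
import numpy as np

def shadow_importance_mean(f_vals, H_true, H_shadow):
    """Self-normalised importance-sampling mean of an observable f, for samples
    drawn from the shadow density exp(-H_shadow) rather than exp(-H_true)."""
    log_b = -(H_true - H_shadow)      # log importance weights b_m
    log_b = log_b - log_b.max()       # stabilise before exponentiating
    b = np.exp(log_b)
    return np.sum(b * f_vals) / np.sum(b)
```

When the two Hamiltonians coincide the weights are uniform and the estimator reduces to the plain sample mean.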

IV. ADAPTIVE SEPARABLE SHADOW HYBRID HAMILTONIAN MONTE CARLO
We now present the adaptive Separable Shadow Hybrid Hamiltonian Monte Carlo (A-S2HMC) algorithm which automatically tunes the trajectory length and step size parameters of S2HMC. This new approach is based on the NUTS algorithm with dual averaging for the step size. In particular, the step size is tuned via primal dual averaging during the burn-in phase, and the trajectory length is set by a recursive doubling procedure until the particle traces back. The A-S2HMC method differs from NUTS in that a different integrator is used. Instead of using the leapfrog integrator associated with HMC, the processed leapfrog integrator, which is explicit and symplectic, corresponding to S2HMC is used. In Theorem IV.1 we show that A-S2HMC leaves the target distribution invariant.
The processed leapfrog integrator used in the A-S2HMC sampler proceeds by first performing the pre-processing step and then passing the momenta, weights and the shadow density to the leapfrog integrator, after which a post-processing step is employed to recover the momenta and weights. The post-processed weights and momenta are then used in the stopping criterion to prevent a U-turn. Note that the stopping criterion in A-S2HMC is the same as in the original NUTS algorithm, except that the target distribution is now the shadow density and the weights are the post-processed weights, i.e. the criterion in equation (7) changes from using π_Z(z) to using the shadow equivalent π̃_Z(z*), where z* is the state comprising the post-processed position and momenta.
Theorem IV.1. A-S2HMC satisfies detailed balance and leaves the target distribution invariant.
Proof. To show that A-S2HMC satisfies detailed balance, we show that a NUTS algorithm that uses the processed leapfrog algorithm is both symplectic and reversible. The original NUTS algorithm, which uses the leapfrog integrator, has been shown to satisfy detailed balance [27]. We now show that the processing map (ŵ, p̂) = X(w, p) commutes with reversal of momenta and preserves phase space volume, so that the resulting processed leapfrog integrator used in A-S2HMC is both symplectic and reversible and thus ensures detailed balance.
A symplectic map can be obtained from a generating function of the third kind [25]. The map (ŵ, p̂) = X(w, p) derived from this generating function in equation (16) is given precisely by the fixed-point iterations in equations (14). This map is the pre-processing step of A-S2HMC; in other words, it is the canonical change of variables that preserves the symplectic property of the processed leapfrog algorithm. The inverse mapping (w, p) = X⁻¹(ŵ, p̂) is given by equations (15) and is the post-processing step of A-S2HMC. The reversibility of the processed leapfrog algorithm follows from X(w, −p) = diag(I, −I) X(w, p). Thus, since the processed leapfrog algorithm is both symplectic and reversible, A-S2HMC preserves detailed balance and leaves the target distribution invariant.
We present the complete A-S2HMC sampler in Algorithm 1, with the parts in blue indicating the additions to NUTS with dual averaging of [12]. If the blue sections are omitted, the processed leapfrog reduces to the original leapfrog, and A-S2HMC reduces to NUTS. Note that when the dash "−" character is used as an output, the corresponding argument is left unaltered.

V. NUMERICAL EXPERIMENTS
We consider three test problems to demonstrate the performance of A-S2HMC over S2HMC, HMC and NUTS. We first study a Merton [28] jump diffusion process, whose transition density is an infinite mixture of Gaussian distributions with mixing weights given by probabilities from a Poisson distribution. We then turn our attention to Neal's funnel problem [21], which illustrates the difficulty of sampling from Bayesian hierarchical models. Finally, we consider performance on Bayesian logistic regression over various benchmark datasets. Performance is measured via multivariate effective sample size (ESS) and ESS per second. For all the algorithms, the step size was set by targeting an acceptance rate of 70% during the burn-in period. A total of 10 independent chains were run for each method on each target distribution. All experiments were conducted on a 64-bit CPU using PyTorch.
The ESS calculation used in this work is the multivariate ESS of [29]. Unlike the minimum univariate ESS measure typically used to analyse MCMC results [30,29], this metric also takes into account the correlations between the parameters. The minimum univariate ESS results in an estimate that is dominated by the parameter dimensions that mix the slowest, ignoring all other dimensions [29]. The multivariate ESS is calculated as:

mESS = n (det Λ / det Σ)^{1/D},

where n is the number of generated samples, D is the number of parameters, Λ is the sample covariance matrix and Σ is the estimate of the asymptotic covariance of the Markov chain. When D = 1, mESS reduces to the univariate ESS measure. We now address the effective sample size calculation for Markov chains that have been re-weighted via importance sampling, as is the case for S2HMC and A-S2HMC [17,11]. For n samples re-weighted by importance sampling, the common approach is to use the approximation of [31], given by

ESS_IS = 1 / Σ_{j=1}^n b̄_j²,

where b̄_j = b_j / Σ_{k=1}^n b_k. This accounts for the possible non-uniformity in the importance sampling weights. To account for both the effects of sample auto-correlation and re-weighting via importance sampling, we approximate the effective sample size under importance sampling by scaling the multivariate ESS by the relative efficiency of the importance weights:

ESS = mESS × ESS_IS / n.
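The two ingredients can be sketched as follows; the helper names are ours, and how the two are combined is left to the practitioner:

```python
import numpy as np

def kish_ess(b):
    """Effective sample size implied by importance weights b (Kish/Kong estimate)."""
    b_bar = b / b.sum()
    return 1.0 / np.sum(b_bar**2)

def multivariate_ess(samples, Sigma):
    """mESS = n * (det(Lambda) / det(Sigma))^(1/D), where Lambda is the sample
    covariance and Sigma estimates the chain's asymptotic covariance."""
    n, D = samples.shape
    Lam = np.cov(samples, rowvar=False)
    logdet_lam = np.linalg.slogdet(Lam)[1]   # log-determinants for stability
    logdet_sig = np.linalg.slogdet(Sigma)[1]
    return n * np.exp((logdet_lam - logdet_sig) / D)
```

Uniform weights recover the nominal sample size, while a single dominant weight collapses the Kish estimate towards one.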

A. JUMP DIFFUSION PROCESS
The modelling of asset price returns has been of interest since the 1970s. The majority of the models used for returns, such as geometric Brownian motion, are based on the normal distribution. The problem with assuming normally distributed returns is that the model fails to capture extreme events such as those that occurred during the great financial crisis of 2008 [32,33]. Jump diffusion processes are appealing for modelling asset returns, instead of traditional diffusion models such as geometric Brownian motion, because they produce returns that are leptokurtic [32,33]. The jump diffusion model analysed in this work is the [28,33] one-dimensional Markov process {S_t, t ≥ 0} characterised by the following stochastic differential equation (SDE):

dS_t = S_{t−} [ μ dt + σ dB_t + d( Σ_{i=1}^{N_t} (e^{Y_i} − 1) ) ],    (20)

where μ is the drift coefficient, σ is the diffusion coefficient, {B_t, t ≥ 0} is a standard Brownian motion process, Y_i is the random size of the i-th jump and {N_t, t ≥ 0} is a Poisson process with intensity λ. Furthermore, the jump sizes Y_i are assumed to be independent and identically distributed random variables, independent of both {N_t, t ≥ 0} and {B_t, t ≥ 0}. In addition, it is assumed that Y_i ∼ N(μ_jump, σ²_jump). The SDE in equation (20) implies the following transition density:

p(w | x, τ) = Σ_{k=0}^∞ [e^{−λτ} (λτ)^k / k!] (1/√(σ²τ + kσ²_jump)) φ( (w − x − (μ − σ²/2)τ − kμ_jump) / √(σ²τ + kσ²_jump) ),    (21)

where w and x are real numbers, τ is the time difference between S(t + τ) and S(t), and φ is the probability density function of a standard normal random variable. Given that the transition density in equation (21) is an infinite mixture of normal distributions, it can be shown that the likelihood function has many points where it is not defined. In this work we truncate the infinite summation in equation (21) to the first 10 terms. We simulated 4 years' worth of returns using the SDE in equation (20); we refer to this dataset as the simulated dataset. We further calibrate jump diffusion processes to real world datasets across different financial markets.
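The truncated transition density can be sketched as below. The (μ − σ²/2)τ drift correction and normal jumps in the log-price follow the common Merton parameterisation and are assumptions of this sketch; the exact parameterisation used in the experiments may differ.

```python
import numpy as np
from math import lgamma

def merton_log_density(r, tau, mu, sigma, lam, mu_j, sig_j, K=10):
    """Log transition density of a log-return r over horizon tau under a Merton
    jump diffusion, truncating the infinite Poisson mixture at K terms."""
    k = np.arange(K)
    # log Poisson weights: -lam*tau + k*log(lam*tau) - log(k!)
    log_pois = -lam * tau + k * np.log(max(lam * tau, 1e-300)) \
        - np.array([lgamma(i + 1.0) for i in k])
    means = (mu - 0.5 * sigma**2) * tau + k * mu_j
    variances = sigma**2 * tau + k * sig_j**2
    # log of each Gaussian mixture component evaluated at r
    log_comp = -0.5 * np.log(2 * np.pi * variances) - 0.5 * (r - means) ** 2 / variances
    # log-sum-exp over the K Poisson-weighted components
    a = log_pois + log_comp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())
```

As the jump intensity λ tends to zero the mixture collapses to the single diffusive Gaussian, which provides a convenient sanity check.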
The datasets consist of daily prices obtained from Google Finance. The specifics of the datasets are as follows:
• Bitcoin dataset: daily data for the crypto-currency from 1 Jan 2017 to 31 Dec 2020. The prices were converted into log returns.
• USDZAR dataset: daily data for the currency pair from 1 Jan 2017 to 31 Dec 2020. The prices were converted into log returns.
We generated 2 000 samples after a burn-in period of 500 samples using the four MCMC methods. A trajectory length of 200 was used for both HMC and S2HMC.

B. NEAL'S FUNNEL DENSITY
Neal [21] introduced the funnel density as an example of a sampling problem exhibiting the pathologies that arise in Bayesian hierarchical and latent variable models [34,11]. The model treats the variance of the parameters as a latent lognormal random variable, which leads to non-convex, exponentially cusping behaviour in the negative log-density [34,11]. Neal [21] considered a 10-dimensional funnel. In this work we instead consider increasing dimensions in the set {50, 100, 200} to illustrate the robustness of the proposed method as the dimension grows. The target density is

p(v, x) = N(v | 0, 3²) Π_{i=1}^{D−1} N(x_i | 0, e^v).

For our experiments we produced 5 000 samples, after a burn-in period of 500 samples. A trajectory length of 200 was used for both HMC and S2HMC on all the funnels. The results in Table 2 show that A-S2HMC produces the highest ESS, and improves on a normalised ESS basis as the dimensionality of the problem increases. Furthermore, A-S2HMC outperforms on an average negative log-likelihood basis.
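In Neal's standard parameterisation, v ∼ N(0, 3²) and x_i | v ∼ N(0, e^v), exact ancestral samples are available, which makes the funnel a convenient ground truth when judging samplers. A minimal sketch:

```python
import numpy as np

def funnel_log_density(v, x):
    """Neal's funnel: v ~ N(0, 3^2) and x_i | v ~ N(0, e^v) independently."""
    lp_v = -0.5 * np.log(2 * np.pi * 9.0) - v**2 / 18.0
    lp_x = np.sum(-0.5 * (np.log(2 * np.pi) + v) - x**2 / (2 * np.exp(v)))
    return lp_v + lp_x

def funnel_exact_samples(n, dim, rng):
    """Exact ancestral draws: sample v first, then x given v."""
    v = 3.0 * rng.standard_normal(n)
    x = np.exp(v / 2.0)[:, None] * rng.standard_normal((n, dim - 1))
    return v, x
```

The exact draws recover the marginal moments of v, against which a sampler's output (e.g. the recovered distribution of v in Figure 1) can be compared.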
In Figure 1 we show how HMC, NUTS, S2HMC and A-S2HMC explore a 10-dimensional Neal [21] funnel, using the first two dimensions of the funnel. The results show that A-S2HMC explores the "throat" of the funnel better than the other algorithms, and accurately recovers the distribution of the first dimension v when compared to the other samplers.

C. BAYESIAN LOGISTIC REGRESSION
Bayesian logistic regression is a standard tool for binary classification that is used extensively across various disciplines. We present results from the analysis of four datasets used in [30], as well as the Sonar dataset from the UCI machine learning repository. The features of the datasets were normalised to have zero mean and unit variance. The potential U(w) is taken to be the negative log-likelihood of the data at w, and a Gaussian prior with variance 100 is imposed on each of the parameters. A total of 2 000 samples were drawn after 500 burn-in samples. A trajectory length of 50 was used for both HMC and S2HMC. The results in Tables 3 and 4 show that A-S2HMC improves in performance as the dimensionality increases.
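The potential and its gradient for this model can be sketched as follows, assuming labels y ∈ {0, 1} and the stated N(0, 100) prior on each parameter (the helper names are ours):

```python
import numpy as np

def U(w, X, y, prior_var=100.0):
    """Negative log posterior (potential) for Bayesian logistic regression."""
    z = X @ w
    # per-example NLL is log(1 + exp(z)) - y*z, computed stably via logaddexp
    nll = np.sum(np.logaddexp(0.0, z) - y * z)
    return nll + w @ w / (2.0 * prior_var)

def grad_U(w, X, y, prior_var=100.0):
    """Gradient of the potential: X^T (sigmoid(Xw) - y) plus the prior term."""
    prob = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (prob - y) + w / prior_var
```

A finite-difference check confirms the gradient, which is the quantity the leapfrog integrator consumes.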

VI. CONCLUSION
We present the adaptive Separable Shadow Hamiltonian Hybrid Monte Carlo algorithm, which combines the benefits of sampling from a shadow Hamiltonian with adaptive setting of the path length, while leaving the target density invariant. The empirical results show that the new algorithm provides a significant improvement over the Separable Shadow Hamiltonian Hybrid Monte Carlo algorithm. The performance of the new method improves as the dimensionality of the problem increases across multiple benchmark problems.
A key limitation of the proposed algorithm is the computational time associated with the method. This leads to poor performance on a normalised ESS basis, particularly in low dimensional problems. We aim to address this issue in future work by using a surrogate model to approximate the shadow Hamiltonian during the burn-in period as an active learning task.
This work has a broader impact on machine learning applications where predictive uncertainty is of high interest. These include highly sensitive application domains such as medicine and law enforcement. Recent advances in Bayesian inference of Neural Networks also suggest that Hamiltonian dynamics based Markov Chain Monte Carlo samplers remain a crucial benchmark for new approximate inference methods [35]. Adaptive Shadow Hamiltonian methods thus contribute to both the tuning and handling of high dimensional parameter spaces in Hamiltonian dynamics based samplers.

DERIVATION OF THE SEPARABLE SHADOW HAMILTONIAN
Using the BCH formula it can be shown that the fourth-order shadow Hamiltonian is given as:

H̃^[4](w, p) = H(w, p) + (ε²/12) K_p U_ww K_p^T − (ε²/24) U_w K_pp U_w^T + O(ε⁴).    (23)

One can interpret the first error term in equation (23) as a perturbation to the kinetic energy K and the second as a perturbation to the potential energy U. The perturbation to the kinetic energy changes the constant mass matrix M into a different and possibly position-dependent matrix, which makes the distribution of the momentum non-Gaussian. We therefore seek a canonical transformation that renders the Hamiltonian in equation (23) separable. By definition, a canonical transformation yields a new Hamiltonian system whose Hamiltonian is simply the result of substituting the change of variables into the original Hamiltonian [25]. The desired canonical change of variables (Q, P) = X(q, p) should satisfy X = id + O(ε²), where id is the identity mapping id(q, p) = (q, p) [25]. We set the canonical transformation to a mapping parameterised by a constant λ, and write the transformed Hamiltonian as a sum of four terms, H̃^[4](X(w, p)) = A + B + C + D. Taking the first-order Taylor expansion of U(W) at w and substituting W, substituting P into B, and using the second-order Taylor approximation of U(W) at w, it follows that U_W = U_w + O(ε²); noting that K_PP = K_pp, as K is quadratic, we then have that

H̃^[4](X(w, p)) = A + B + C + D = H(w, p) + ε²(1/12 − λ) K_p U_ww K_p^T + ⋯ + O(ε⁴).

To remove the second term, which contains mixed derivatives of w and p, and noting that K_pp is a constant, we set λ = 1/12 and obtain:

H̃^[4](X(w, p)) = H(w, p) + (ε²/24) U_w K_pp U_w^T + O(ε⁴).    (32)