Latent Gaussian Model Boosting

Latent Gaussian models and boosting are widely used techniques in statistics and machine learning. Tree-boosting shows excellent prediction accuracy on many data sets, but potential drawbacks are that it assumes conditional independence of samples, produces discontinuous predictions for, e.g., spatial data, and it can have difficulty with high-cardinality categorical variables. Latent Gaussian models, such as Gaussian process and grouped random effects models, are flexible prior models which explicitly model dependence among samples and which allow for efficient learning of predictor functions and for making probabilistic predictions. However, existing latent Gaussian models usually assume either a zero or a linear prior mean function which can be an unrealistic assumption. This article introduces a novel approach that combines boosting and latent Gaussian models to remedy the above-mentioned drawbacks and to leverage the advantages of both techniques. We obtain increased prediction accuracy compared to existing approaches in both simulated and real-world data experiments.


Introduction
Boosting [Freund andSchapire, 1996, Friedman, 2001] is a machine learning technique that achieves state-of-the-art prediction accuracy Guestrin, 2016, Shwartz-Ziv andArmon, 2021]. This is reflected in statements such as "[i]n general 'boosted decision trees' is regarded as the most effective off-the-shelf non-linear learning method for a wide range of application problems" [Johnson and Zhang, 2013]. In boosting, and in many other supervised machine learning algorithms, it is assumed that a potentially complex predictor function F (·) relates a set of predictor variables to a response variable, and that conditional on F (·) evaluated at the predictor variables, different samples are independent. Apart from this potentially unrealistic independence assumption, tree-boosting can have difficulty with high-cardinality categorical variables, and it produces discontinuous predictions. The latter is often unrealistic for spatial and spatio-temporal data.
Latent Gaussian models are a broad class of flexible prior models in which, conditional on latent Gaussian variables, a response variable is assumed to follow a known parametric distribution, and parameters of this distribution are related to the latent Gaussian variables. Two widely known types of latent Gaussian models are Gaussian process [Williams and

The View of Latent Gaussian Models as Priors and Regularizers
An algorithm for learning a predictor function F (·), which relates predictor variables to a response variable, should result in an estimateF (·) that has both low bias and low variance. Intuitively, a low bias estimatorF (·), such as a flexible machine learning model, can have high variance if the complexity of the function F (·) is large relative to the sample size. Examples of data for which this can occur include, first, time series, spatial, and spatiotemporal data where the amount of variation over space and/or time is large relative to the sample size and, second, data with high-cardinality categorical variables where the number of categories is large relative to the sample size.
Modern supervised machine learning approaches such as deep neural networks and treeboosting typically have low bias but need to apply some form of regularization to avoid high variance inF (·). General-purpose regularization options for boosting include early stopping, learning rate shrinkage, and restrictions on the base learners such as the depth of trees and the minimal number of samples per leaf. However, as we argue in this article, for applications involving, e.g., spatial data or high-cardinality categorical variables, it can be advantageous to apply problem-specific regularization which incorporates available prior knowledge instead of relying on agnostic general-purpose regularization.
Prior models such as latent Gaussian models which explicitly model residual dependence among data can be interpreted as applying a form of regularization. For instance, an important prior assumption of Gaussian processes is that observations that are close together in space and/or time, or any other feature that defines a Gaussian process, are "more similar to each other than distant samples". For spatial data, this prior assumption is often referred to as Tobler's first law of geography [Tobler, 1970]. Such a prior model implies regularization in the sense that predictions for points that are close together are similar, and that the amount of similarity varies in a continuous, or potentially smooth, manner with distance. Further, heuristically, a prior assumption of grouped random effects models is that different group effects are similar to some degree, and deviations from a global average are stochastic and identically distributed. Crucially, important characteristics of a latent Gaussian model such as the speed at which the dependency decays over space and/or time, the smoothness, the amount of variation over space and/or categories, and thus the amount of regularization implied by the prior is characterized by hyperparameters which can be learned from data. Our proposed approach allows for incorporating this reasonable prior knowledge and thus for applying explicit data-specific regularization in boosting algorithms.
Intuitively, we conjecture that the improvement in prediction accuracy of our novel approach over classical independent tree-boosting is the larger, the more categories a categorical variable has and the faster the covariance decays over space and/or time or, in other words, the higher the complexity of F (·) is compared to the sample size since appropriate regularization is more important in these cases. This hypothesis is confirmed in simulated experiments in Section 4.1.
For non-Gaussian data, there exists little prior work on combining non-linear machine learning methods with latent Gaussian models. For the special case of grouped random effects, Hajjem et al. [2017], Fokkema et al. [2018], and Speiser et al. [2020] propose algorithms that use regression trees to model the function F (·). Speiser et al. [2019] and Pellagatti et al. [2020] extend these algorithms by replacing trees with random forests. However, all of these methods are heuristically motivated. In particular, it is unclear which objective functions these algorithms minimize -they do not maximize a marginal or approximate marginal log-likelihood neither in a component-wise way nor using an EM algorithm -and whether and to which values they converge.
A straightforward alternative to the use of Gaussian processes and grouped random effects is to simply include the variables that define the latent Gaussian model, such as spatial coordinates, time points, and categorical variables, in the deterministic predictor function F (·) of a statistical or machine learning model. A special example of this is the approach Hothorn et al. [2010] where splines are used to model spatial effects and ridge regression is used to model grouped random effects. However, while the adoption of splines avoids discontinuities in predictions, this approach has several drawbacks compared to using latent Gaussian models. First, the hyperparameters, and thus the amount of regularization or smoothing, cannot be learned from data and need to be chosen using, e.g., cross-validation and, second, since the base learners are deterministic, probabilistic predictions cannot be made. Further, splines have the disadvantage that they suffer from the so-called "curse of dimensionality" when the dimension of the "locations" is large and the locations are thus sparse in space. This approach can thus not be used in situations where Gaussian processes are applied to higher-dimensional non-spatial "locations" as is often done in machine learning applications of Gaussian processes.
The linearity assumption in mixed effects models can also be relaxed by using splines or generalized additive models Tibshirani, 1986, Wood, 2017] for modeling the predictor function F (·); see, e.g., Tutz and Reithinger [2007] and Groll and Tutz [2012]. However, one has to assume a certain functional form with only limited possibility for interaction effects for the predictor function by specifying, for instance, main and secondorder interaction effects. In general, this can thus result in model misspecification.

A Non-parametric Latent Gaussian Model
We assume that the response variable y = (y 1 , . . . , y n ) T ∈ R n follows a parametric distribution which has a density p(y|µ, ξ) with respect to a sigma finite product measure with parameters µ ∈ R n and ξ ∈ Ξ ⊂ R r . The focus of this article is on non-Gaussian densities p(y|µ, ξ). If p(y|µ, ξ) is a Gaussian density, calculations simplify as the required marginalization can be done analytically; see Sigrist [2020]. Examples of p(y|µ, ξ) include Bernoulli and Poisson densities for binary classification and Poisson regression. The parameter µ is related to the sum of a predictor function F (·) evaluated at predictor variables and a latent Gaussian variable: where F (X) is the row-wise evaluation of a function F (·) : R p → R, F (X) = (F (X 1 ), . . . , F (X n )) T , and X i = (X i1 . . . , X ip ) T ∈ R p is the i-th row of X ∈ R n×p containing predictor variables for observation i, i = 1, . . . , n. For notational simplicity, we assume that the distribution p(y|µ, ξ) is parameterized in a way such that µ ∈ R n . Otherwise, if the support of µ is not R n , the model needs to be reparametrized using, e.g., a so-called link function. Further, we assume that conditional on µ, the data is independent: Any additional, auxiliary or hyper-, parameters of the likelihood p(y|µ, ξ) are denoted by ξ. In many situations such as classification and Poisson regression, there are no additional parameters. We assume that F (·) is a function in a function space H that is the linear span of a set S of so-called base learners f j (·) : R p → R. Classes of base learners include, e.g., linear functions [Bühlmann et al., 2006], smoothing splines [Bühlmann and Yu, 2003], wavelets [Saberian et al., 2011], reproducing kernel Hilbert space (RKHS) regression functions [Sigrist, 2021b], and regression trees [Breiman et al., 1984], with the latter being the most popular choice. For defining functional derivatives, we additionally assume that the space H is normed. For instance, assuming that the X i 's are identically distributed and that all F ∈ H are square integrable with respect to the law of X 1 , a norm on H can defined by the inner product F, G = E X 1 (F (X 1 )G(X 1 )) for F, G ∈ H.
Examples of latent Gaussian variables b ∈ R m include finite-dimensional versions of Gaussian processes and/or grouped random effects. We assume that the covariance matrix Cov(b) = Σ is parametrized by a set of parameters θ ∈ Θ ⊂ R q whose dimensionality is often relatively low, and Σ can depend on predictor variables S ∈ R n×d . For instance, for spatial and temporal Gaussian processes, these predictor variables S are locations and time points, respectively. For notational simplicity, we suppress the dependence of Σ on its parameters θ and on S. Further, Z ∈ R n×m are predictor variables which relate the random variable b to µ. Often, Z is simply an incidence matrix with entries in {0, 1}. For instance, for grouped random effects, Z consists of dummy variables that encode categorical variables. In general, Z can also contain continuous predictor variables, e.g., in the case of random coefficient models [Gelfand et al., 2003]. Note that, conditional on F (X) and Z, dependence among the response variable y can arise either due to the matrix Z being non-diagonal or due to the covariance matrix Σ being non-diagonal.
In summary, we distinguish between three sets of predictor variables: X with input variables for the predictor function F (·), S which determines the covariance structure of the random variable b, and Z which relates b to µ and thus also determines the covariance structure of µ and y. Note that these three sets of predictor variables may or may not be over-lapping. If e.g., X and S contain disjoint sets of predictor variables, one assumes that there are no interactions among them. On the other hand, if, for instance, spatial locations in S are also included in X, interactions among locations and other predictor variables in X can be modeled.
In comparison to our approach, existing boosting algorithms, and many supervised machine learning algorithms in general, do not distinguish between the different types of predictor variables X, S, and Z, and one essentially has two options: either ignore the additional predictor variables in S and Z or include them in the set of predictor variables X for the predictor function F (·). It goes without saying that the former is not a good option as potentially important information is neglected. Furthermore, the second option can result in the high variance problem mentioned in the introduction, and this translates into inferior prediction accuracy; see, e.g., our experiments in Sections 4 and 5. Besides, existing boosting algorithms assume that the data y is independent conditional on F (X) and thus ignore any potential residual correlation. Further, in most latent Gaussian models, it is assumed that F (·) is either a linear function, F (X) = Xβ, or that F (·) is simply zero, For notational simplicity, we assume that only one parameter µ of the data distribution p(y|µ, ξ) is related to a latent Gaussian variable. However, the extension to multivariate data and/or the situation where multiple parameters depend on potentially multiple Gaussian variables is straightforward. Also note that we assume that the latent variable b follows a Gaussian distribution, but moderate violations of this assumption have been shown to have only a small effect on prediction accuracy in the context of generalized linear mixed models [McCulloch and Neuhaus, 2011].

Definition of Learners
The marginal density of the response y is given by p(y|F, θ, ξ) = p(y|µ, ξ)p(b|θ)db.
( 2) Ideally, we would like to minimize the empirical risk functional .
If p(y|µ, ξ) is a Gaussian distribution, the marginalization in (2) can be done analytically.
For non-Gaussian data, however, an approximation has to be used. In order that an approximation is applicable for the boosting algorithms presented in this article, it needs to fulfill two requirements. First, one must be able to compute it efficiently as this needs to be done repeatedly. Second, the gradient with respect to F (·) must be computable in an efficient way. Our goal is thus to find the joint minimizer where R A (F (·), θ, ξ) is an approximate empirical risk functional and L A (y|F, θ, ξ) is an approximation to the negative logarithmic marginal likelihood − log(p(y|F, θ, ξ)). Note that R A (F (·), θ, ξ) is calculated by evaluating F (·) at X and then calculating L A (y|F = F (X), θ, ξ). I.e., the risk functional R A (F (·), θ, ξ) is, in general, infinite dimensional in its first argument and finite dimensional in its other arguments.

The Laplace Approximation
In this article, we focus on the Laplace approximation [Tierney and Kadane, 1986] for approximating the marginal likelihood in (2). However, other approximations that satisfy the above-mentioned requirements can equally well be used. For instance, if p(y|µ, ξ)p(b|θ) factors into low-dimensional components, numerical integration, such as adaptive Gauss-Hermite quadrature, can be used to approximate (2). Examples, where this applies, are single-level grouped random effects models. Another potential approximation is expectation propagation (EP) [Minka, 2001]. Depending on the data distribution, for instance, for binary classification, this can lead to more accurate approximations [Kuss and Rasmussen, 2005], but it is computationally more demanding than the Laplace approximation. For applying the Laplace approximation, we assume that p(y i |µ i , ξ) is three times differentiable in µ i . The Laplace approximation for (2) is given by whereb is the mode of p(y|b, F, ξ)p(b|θ), Note thatb depends on F = F (X), θ, and ξ, but we suppress this dependence for notational simplicity. The mode can be found, for instance, using Newton's method. Modulo constant terms that do not depend on θ, ξ, or F , the Laplace approximation to the negative log-marginal likelihood − log(p(y|F, θ, ξ)) is given by Since the Laplace approximation in (5) is equivalent to the following Gaussian approximation to the posterior p(b|y, θ, ξ):

Gradients
For the boosting algorithms introduced in the following, we need to calculate ∂L LA (y,F,θ,ξ) ∂F and also ∂L LA (y,F,θ,ξ) ∂θ and ∂L LA (y,F,θ,ξ) ∂ξ if, e.g., a first-order optimization method is used for minimizing with respect to θ and ξ. These are obtained as follows.
Proposition 2.1. The gradients with respect to F , θ, and ξ of the approximate negative logarithmic marginal likelihood of the Laplace approximation L LA (y, F, θ, ξ) in (6) are given by where Proof of Proposition 2.1. The derivation is similar as in Williams and Rasmussen [2006, Chapter 5.5.1]. All three gradients are sums of the explicit derivatives of L LA (y, F, θ, ξ) and implicit derivatives through the dependency ofb on F , θ, and ξ. The explicit derivatives with respect to F , θ, and ξ, ignoring any dependency throughb, are given in the first two summands in (8), (9), and (10). For the implicit derivatives, we first note that where we use the fact that the derivative of the first two terms in (6) with respect tob vanishes, and with respect to F i and obtain The statement in Equation (12) thus follows. Similarly, multiplying Equation (15) with Σ and differentiating it with respect to θ k gives from which we obtain Equation (13) by multiplying with Σ −1 . Equation (14) follows analogously.
We note that in our software implementation, we use different equivalent versions of the above result depending on the specific latent Gaussian model for computational efficiency and stability. If Zb consists of only grouped random effects, we use the version presented in Proposition 2.1 except that in (9) In this case, Z and Σ −1 are sparse, and the random effects dimension m is smaller than the number of samples n. It follows that a Cholesky factor for Z TW Z + Σ −1 can be computed efficiently using sparse matrix algebra, and also the remaining calculations for obtaining the gradients in Proposition 2.1 can be done efficiently. If Zb contains a finite dimensional versions of a Gaussian process, we use the Sherman-Morrison-Woodbury formula factorize the matrix I m +W 1/2 ZΣZ TW 1/2 , and, similarly as in Williams and Rasmussen [2006, Chapter 5.5.1], adapt all calculations in Proposition 2.1 accordingly.

Latent Gaussian Model Boosting
We propose to do the minimization of the risk functional in (3) using a novel boosting algorithm presented in the following. For known and fixed θ and ξ, boosting finds a minimizer of the approximate empirical risk functional R A (F (·), θ, ξ) in a greedy way by sequentially adding an update f m (·) to the current estimate F m−1 (·): where f m (·), m = 1, . . . , M , is chosen such that its addition results in the minimization of the risk. This minimization cannot be done analytically and an approximation is thus used.
In general, such an approximation consists of either a penalized functional first-order or a functional second-order Taylor expansion of the risk around the current estimate F m−1 (·).
This corresponds to functional gradient descent or functional Newton steps. See Sigrist [2021a] for more information on the distinction between gradient and Newton boosting.
In our case, we use functional gradient descent. Specifically, f m (·) is given by the least squares approximation to the vector obtained when evaluating the negative functional gradient of R A (F (·), θ, ξ) at (F m−1 (·), I X i (·)), i = 1, . . . , n, where I X i (·) are indicator functions which equal 1 at X i and 0 otherwise. Equivalently, f m (·) is the minimizer of a first-order functional Taylor approximation of R A (F (·), θ, ξ) around F m−1 (·) with an L 2 penalty on f (·) evaluated at (X i ); see, e.g., Sigrist [2021a] for more information. It is easily seen that the negative Gâteaux derivative of R A (F (·), θ, ξ) evaluated at (F m−1 (·), I X i (·)) is given by which we denote shortly as − ∂L A (y,F m−1 ,θ,ξ)

∂F
. This means that f m (·) can be found as the following least squares approximation: where f = (f (X 1 ), . . . , f (X n )) T . Note that ∂L A (y,F m−1 ,θ,ξ) ∂F depends on the approximation used for the marginal log-likelihood. For the Laplace approximation, this is given in Proposition 2.1.
It has been empirically observed that damping the update in (16), results in higher prediction accuracy [Friedman, 2001]. Further, functional gradient descent can also be accelerated using momentum. For instance, Biau et al. [2019] and Lu et al. [2019] propose to use Nesterov acceleration [Nesterov, 2004] for gradient boosting.
To jointly learn F (·) and (θ, ξ), we propose to combine functional boosting updates in the direction of F (·) with coordinate descent steps in θ and ξ. The reasons for this choice are outlined in Sigrist [2020]. The LaGaBoost Algorithm 1 summarizes our approach. Note that, despite not being explicitly stated in Algorithm 1, the approximation for the negative logarithmic marginal likelihood needs to be calculated repeatedly in the algorithm whenever L A (y, F, θ, ξ) is evaluated or a gradient of it is calculated.

Algorithm 1: LaGaBoost: Latent Gaussian model Boosting
Input : Initial values θ 0 ∈ Θ and, if applicable, ξ 0 ∈ Ξ, learning rate ν > 0, number of boosting iterations M ∈ N, approximation L A (y, F, θ, ξ) Output: FunctionF (·) = F M (·), hyperparametersθ = θ M , and auxiliary parametersξ = ξ M 1: Initialize F 0 (·) = argmin c∈R L A (y, c · 1, θ 0 , ξ 0 ) 2: for m = 1 to M do Update F m (·) = F m−1 (·) + νf m (·) 6: end for If the risk functional R A (F (·), θ, ξ) is convex in its arguments and Θ and Ξ are convex sets, then (3) is a convex optimization problem since H = span(S) is also convex. This means that there exists a unique minimizer and the LaGaBoost algorithm converges to the minimum, as long as the learning rate ν is not too large to avoid overshooting, i.e., that the risk increases when doing too large steps. Further, the computational complexity of the algorithm depends on the specific latent Gaussian variable model and the marginal likelihood approximation used. For instance, for the Laplace approximation, the calculation of Cholesky factors is usually the bottleneck.

Out-of-sample Learning for Hyperparameters
It has recently been observed that state-of-the-art machine learning techniques such as neural networks, kernel machines, or boosting can achieve a zero training loss and interpolate the training data while at the same time having excellent generalization properties [Zhang et al., 2017, Wyner et al., 2017, Belkin et al., 2018, Bartlett et al., 2020. Such an interpolation of the training data could be problematic for the hyperparameter estimation in the LaGaBoost algorithm. We propose to circumvent this potential problem by estimating the hyperparameters θ and the auxiliary parameters ξ using out-of-sample validation data obtained by applying cross-validation or by partitioning the data into two disjoint training and validation sets. To avoid that the function F (·) and/or the parameters θ and ξ are only learned on a fraction of the full data, we propose a two-step approach presented in the LaGaBoostOOS Algorithm 2. In brief, the LaGaBoostOOS algorithm first runs the LaGaBoost algorithm on the training data and obtains predictionsF val for the function F (·) on the left out validation data. The parameters θ and ξ are then estimated on the validation data using the predicted valuesF val . Finally, the LaGaBoost algorithm is run a second time on the full data while holding θ and ξ fixed. When k-fold cross-validation is used, both the function F (·) and the parameters θ and ξ are thus learned using the full data.

Prediction
In the following, we show how predictions can be made. We distinguish between predicting observables variables y p and latent variables µ p . Let y p ∈ R np and µ p ∈ R np denote the observable and latent random variables for which predictions should be made. The following holds true: where b p ∈ R mp is a latent random variable for which no corresponding data has been observed in y, (I m , 0 m×mp ) ∈ R m×(m+mp) , I m ∈ R m×m is an identity matrix, 0 m×mp ∈ R n×mp is a matrix of zeros, the matrix Z p ∈ R np×(m+mp) relates the vector of observed and new latent variables , and X p ∈ R np×p is the predictor variable matrix of the predictions.
If we apply the Laplace approximation, then by (7), (18), standard results for conditional distributions of multivariate Gaussian distributions, and the law of total variance, we have where in the last line, we have used the Sherman-Morrison-Woodbury formula. Further, the integral in (19) is analytically tractable for a Bernoulli likelihood with a probit link [see, e.g., Williams and Rasmussen, 2006, Chapter 3.9], but for other likelihoods, it needs to be numerically approximated. In our software implementation and the experiments below, we use adaptive Gauss-Hermite quadrature [Liu and Pierce, 1994] as numeric integration technique.

Software Implementation
The LaGaBoost and LaGaBoostOOS algorithms based on the Laplace approximation are implemented in the GPBoost library written in C++ with corresponding Python and R packages; see https://github.com/fabsig/GPBoost for more information. For linear algebra calculations, we rely on the Eigen library [Guennebaud et al., 2010]. Sparse matrix algebra is used, in particular for calculating Cholesky decompositions, whenever covariance matrices are sparse, e.g., in the case of grouped random effects. Further, multi-processor parallelization is done using OpenMP. For the tree-boosting part, in particular the tree growing algorithm, we use the LightGBM library [Ke et al., 2017]. The GPBoost library allows for modeling Gaussian processes, grouped random effects including nested and crossed ones, random coefficients, and combinations of the former. Further, the GPBoost library currently implements gradient descent with optional Nesterov acceleration and the Nelder-Mead method for minimizing with respect to the parameters θ and ξ in line 3 of the LaGaBoost Algorithm 1.

Simulated Experiments
In the following, we perform simulated experiments to compare the novel LaGaBoost algorithm to alternative approaches. We simulate binary classification data from a latent Gaussian model as in (1) assuming a Bernoulli likelihood with a probit link function: y i ∈ {0, 1}, P (y i = 1) = Φ(µ i ), i = 1, . . . , n, where Φ(·) denotes the standard normal cumulative distribution function. For the latent Gaussian variable Zb, we consider both grouped random effects with a single grouping level and a spatial Gaussian process model with an exponential covariance function c(s, s ) = σ 2 1 exp(− s − s /ρ) where the locations s are in [0, 1] 2 and ρ = 0.1. The marginal variance in both models is set to σ 2 = 1. Concerning the function F (·) and the predictor variables X, we sample independently from This function has been used previously in Hajjem et al. [2014] and Sigrist [2020] to compare non-parametric mixed effects models for Gaussian data. The constant C 1 is chosen such that the mean of F (x) is approximately 0, and C 2 is chosen such that the variance of F (x) equals approximately 1, i.e., that F (x) has the same signal strength as the latent Gaussian variable. We simulate 100 times training data sets of size n and two test data sets each also of size n. All models are trained on the training data and evaluated on the test data. We use a sample size of n = 5000 for the grouped random effects with m = 500 different groups. This corresponds to a categorical variable with 500 different categories and 10 samples per category. For the Gaussian process model, we use a sample size of n = 500. The reason for using a smaller sample size is that this allows us to avoid any additional approximation error due to a large data approximation. In every simulation run, two test data sets, denoted as "interpolation" and "extrapolation" test sets, are generated as follows. For the grouped random effects model, the interpolation test data set consists of n samples from the same m groups as in the training data, and the extrapolation test data consists of n samples for m new groups that have not been observed in the training data. For the Gaussian process model, training data locations are samples uniformly from [0, 1] 2 excluding [0.5, 1] 2 , the interpolation test data sets are obtained by also simulating locations uniformly in the same area, and the extrapolation test data contains locations sampled uniformly from the excluded square [0.5, 1] 2 . Figure 1 illustrates this. Figure 1: Example of locations for training and test data for the spatial data. 'Test' and 'Test ext' refers to locations of the interpolation and extrapolation test data sets, respectively.
We compare the LaGaBoost algorithm based on the Laplace approximation to the following alternative approaches: linear Gaussian process and grouped random effects models for binary data with a probit link function and F (x) = x T β, β ∈ R p , independent Newton boosting for binary data with the log loss ('LogitBoost') [Friedman et al., 2000], and model-based gradient boosting ('mboost') [Hothorn et al., 2010] with the log loss, i.e., a negative Bernoulli log-likelihood, and a probit link function. For the LogitBoost algorithm, we include the locations for the spatial data and the categorical grouping variable as additional predictor variables in the function F (·). For all boosting algorithms, we use trees as base learners, except for the grouped and spatial random effects in mboost. Learning and prediction with the LaGaBoost and LaGaBoostOOS algorithms, the linear latent Gaussian models, and LogitBoost is done using the GPBoost library version 0.7.0 compiled with the MSVC compiler version 19.24.28315.0 and OpenMP version 2.0. For the linear latent Gaussian models, the LaGaBoost algorithm, and the LaGaBoostOOS algorithm, optima for hyperparameters θ are found using Nesterov accelerated gradient descent. Further, for the linear models, the coefficients β are also learned using Nesterov accelerated gradient descent. Note that the gradient of L A (y, F, θ, ξ) with respect to β is given by For LogitBoost applied to the grouped random effects data, we consider the grouping variable as a numeric variable and not as a categorical variable as suggested by the authors of LightGBM 1 since the number of categories is large. Concerning the mboost algorithm, we use the mboost R package [Hofner et al., 2014] version 2.9-2, where spatial effects are modeled using bivariate P-spline base learner (bspatial with df=6), grouped random effects are modeled using random effects base learners (brandom with df=4), and all other predictor variables are modeled using trees as base learners. All calculations are done on a laptop with a 2.9 GHz quad-core processor and 16 GB of random-access memory (RAM). Tuning parameters are chosen by simulating 10 additional training and test sets and choosing the parameter combinations that minimize the average log loss on the test sets. In doing so, we use the union of both the interpolation and extrapolation test data sets to calculate test losses. For all boosting algorithms, we consider the following grid of tuning parameters: the number of boosting iterations M ∈ {1, . . . , 1000}, the learning rate ν ∈ {0.1, 0.05, 0.01}, the maximal depth of the trees ∈ {1, 2, 5, 10}, and the minimal number of samples per leaf ∈ {1, 10, 100}.
The results for the grouped and spatial data are reported in Tables 1 and 2. We report average test error rates ('Error') and test log losses ('NegLL') for both the interpolation and extrapolation (' ext') test sets. Further, we calculate p-values of paired t-tests comparing the LaGaBoost algorithm to the other approaches. We find that the LaGaBoost algorithm significantly outperforms all alternative approaches in all prediction accuracy measures for both the grouped and spatial data. In Tables 1 and 2, we additionally report the results for the LaGaBoostOOS algorithm, root mean square errors (RMSEs) and biases for the hyperparameters, and wall-clock time. Overall, we observe no large differences between the LaGaBoost and the LaGaBoostOOS algorithms. However, for the spatial data, the hyperparameter estimates of the LaGaBoostOOS algorithm have smaller RMSEs and biases compared to the LaGaBoost algorithm in line with our arguments laid out in Section 3.1. As expected, the LaGaBoostOOS algorithm has a higher computational time.  An alternative option to simulating additional training and test sets for choosing tuning parameters is to use cross-validation on the training data sets in every of the 100 simulation runs. However, this is computationally more expensive as the number of simulation runs  is relatively large. To investigate the differences between these two options for choosing tuning parameters, we redo the simulated experiments with 10 simulation runs and choose tuning parameters using 4-fold cross-validation on the training data in every simulation run. The results of this are reported in Tables A.1 and A.2 in the appendix. Overall, we observe only minor differences. Next, we also perform the same simulated experiments using a Poisson likelihood with a logarithmic link function instead of a binary Bernoulli likelihood. Specifically, we simulate grouped and spatial random effects as described above with σ 2 = 0.2, and we simulate F (X) according to (20) with C 1 chosen as described above and C 2 chosen such that the variance of F (X) is approximately 0.2. Response variable data is then simulated from a Poisson distribution with mean equaling exp(F (X) + Zb). Tuning parameters are chosen similarly as for the binary data by minimizing the test negative Poisson likelihood. Further, we use the RMSE and the negative Poisson likelihood for evaluating prediction accuracy. The results of this are reported in Tables 3 and 4. We find qualitatively very similar results as for the binary data. In particular, the LaGaBoost algorithm significantly outperforms all alternative approaches in all prediction accuracy measures for both the grouped and the spatial data.

When Does the LaGaBoost Algorithm Outperform Independent Boosting?
It is relatively obvious that the LaGaBoost algorithm tends to outperform linear latent Gaussian models when there are non-linearities and interactions. It is less clear in which situations the LaGaBoost algorithm outperforms classical boosting algorithms which include categorical variables and/or spatial locations in the predictor variables X for F (·) and, conditionally on this, assume independence among samples. As mentioned in Section 1.1, intuitively, we expect that the improvement in prediction accuracy of our novel approach over independent tree-boosting is the larger, the smaller the number of observations per category of a categorical variable is and the faster the covariance decays over space and/or time. To analyze this, we repeat the above simulated experiments for the binary   data with varying numbers of samples per group and varying range parameters ρ. Specifically, for the grouped random effects with n = 5000 samples, we consider the following number of samples per group: 10, 20, 50, 100, and 200. For the spatial data, we consider the following range parameters ρ: 0.1, 0.2, 0.5, and 1. Apart from this, we use the same experimental setup as above for the Bernoulli likelihood. Figure 2 reports the relative decrease in the test error of the LaGaBoost algorithm compared to the LogitBoost algorithm as well as the average test error of the two algorithms for the interpolation test data sets. These results confirm our hypothesis that the improvement in prediction accuracy is the larger, the smaller the number of observations per group is and the faster the covariance decays over space. In other words, the higher the complexity of the underlying true function relative to the sample size, the larger is the improvement obtained by the LaGaBoost algorithm. We conjecture that this is not just due to more accurate learning of the random effects themselves but also more efficient learning of the remaining part of the predictor function F (·). Figure 2 also shows that, as expected, average test errors of both algorithms decrease when having fewer categories for the categorical grouping variables and when the correlation decays slower of space. Figure 2: Comparison of the LaGaBoost and LogitBoost algorithms for grouped data with varying number of samples per group and spatial data with varying range parameters ρ. The top row shows the relative decrease in test error of the LaGaBoost vs. the LogitBoost algorithm visualized using violin plots. The red rhombi representing means over the simulation runs. The bottom row shows the average test error of the two algorithms.

Real-world Applications
In the following, we apply the LaGaBoost algorithm to two real-world binary classification data sets and compare its prediction accuracy to alternative approaches. We consider both grouped data with a high-cardinality categorical variable and a spatial data set. For the former, we consider data on poverty among young females in the US collected by the National Longitudinal Study of Youth (NLSY) and available from https: //www3.nd.edu/~rwilliam/statafiles/teenpovxt.dta. Here, the person id number is the high-cardinality categorical variable that determines the grouped random effects, and the goal is to predict the binary poverty indicator. As a spatial data set, we consider species distribution data. Specifically, we use presence-absence data on rainforest understorey vascular plants in North-east New South Wales, Australia, species "nsw43" obtained from the disdat R package [Elith et al., 2020]. The goal is to predict the presence or absence of the species.  We compare the LaGaBoost algorithm to the same alternative approaches as in the simulated experiments in Section 4 using nested 4-fold cross-validation. For the grouped poverty data, we perform stratified cross-validation such that every fold contains approximately the same amount of data for every category of the grouping variable. A reason for doing this is that one of the alternative approaches, the mboost R package, does not allow for making predictions for unobserved groups. Tuning parameters are chosen by doing an additional inner 4-fold cross-validation on every of the four training data sets. 2 We consider the same set of tuning parameters and selection criterion as in the simulated experiments The results are reported in Table 6. In addition to the test error and the test log loss, we also report the test area under the ROC curve (AUC). We find that the LaGaBoost algorithm outperforms all alternative methods in all three prediction accuracy metrics for both the grouped data with a high-cardinality categorical variable and the spatial data.  Table 6: Results for the real-world data sets. 'Linear' denotes the linear grouped mixed effects and linear Gaussian process models.

Conclusion
We have introduced a novel way for combining latent Gaussian models, such as Gaussian processes and random effects models, with boosting. This is done by applying functional gradient descent to the negative logarithmic marginal likelihood of a generalized mixed effects model in a boosting framework while jointly learning hyperparameters. We have obtained increased prediction accuracy compared to existing approaches in both simulated and real-world data experiments. Future research can investigate how the approximation used for the marginal likelihood impacts properties such as prediction accuracy and computational time of the LaGaBoost algorithm.

A Appendix
Additional Results for the Simulated Experiments   Table A.2: Results for the spatial data and a binary Bernoulli likelihood based on only 10 simulation runs and using 4-fold cross-validation on the training data for determining tuning parameters.