Sequential Gaussian Processes for Online Learning of Nonstationary Functions

Many machine learning problems can be framed in the context of estimating functions, and often these are time-dependent functions that are estimated in real-time as observations arrive. Gaussian processes (GPs) are an attractive choice for modeling real-valued nonlinear functions due to their flexibility and uncertainty quantification. However, the typical GP regression model suffers from several drawbacks: 1) Conventional GP inference scales $\mathcal{O}(N^{3})$ with respect to the number of observations; 2) Updating a GP model sequentially is not trivial; and 3) Covariance kernels typically enforce stationarity constraints on the function, while GPs with non-stationary covariance kernels are often intractable to use in practice. To overcome these issues, we propose a sequential Monte Carlo algorithm to fit infinite mixtures of GPs that capture non-stationary behavior while allowing for online, distributed inference. Our approach empirically improves performance over state-of-the-art methods for online GP estimation in the presence of non-stationarity in time-series data. To demonstrate the utility of our proposed online Gaussian process mixture-of-experts approach in applied settings, we show that we can successfully implement an optimization algorithm using online Gaussian process bandits.


Introduction
Data are often observed as streaming observations that arrive sequentially across time. Examples of streaming data include hospital patients' vital signs and telemetry data.

Problem Statement and Background

Gaussian Processes
Consider the problem of estimating an unknown function $f: \mathbb{R}^D \to \mathbb{R}$ from streaming data. In detail, we have $D$-dimensional input data $x_i \in \mathbb{R}^D$ and output data $y_i \in \mathbb{R}$, which arrive at times $i = 1, 2, \ldots, N$. We assume a functional relationship $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ and the function $f$ is assumed to be distributed according to a Gaussian process. Gaussian processes (GPs) are a popular approach to modeling distributions over arbitrary real-valued functions $f(\cdot)$, with applications in regression, classification, and optimization, among others. Characterized by a mean function $\mu(\cdot)$ and a covariance function $\Sigma(\cdot, \cdot)$, we can sample instantiations from a GP at a fixed set of locations $X = [x_1, \ldots, x_N]^T$ according to an $N$-dimensional multivariate normal, $f(X) \sim \mathcal{N}_N(\mu(X), \Sigma_{XX})$, where $\mu(X)$ and $\Sigma_{XX}$ are the mean and covariance functions evaluated at the input data $X$. Typically, the mean function is set to zero in Gaussian process regression. In the presence of noisy observations $y = [y_1, \ldots, y_N]^T$, whose mean value is a GP-distributed function, we observe the data generating process $y \mid f, X \sim \mathcal{N}_N(f(X), \sigma^2 I)$. Due to conjugacy between the likelihood and the prior, we can marginalize out $f$ analytically such that $y \mid X \sim \mathcal{N}_N(0, \Sigma_{XX} + \sigma^2 I)$.
We typically fit a GP regression model by optimizing the marginal likelihood with respect to the model hyperparameters. Given some previously unseen inputs $X_*$, the posterior predictive distribution of the outputs $y_*$ is
$$ y_* \mid y, X, X_* \sim \mathcal{N}(m_{y_*}, S_{y_*}), \qquad m_{y_*} = \Sigma_{X_* X} \left( \Sigma_{XX} + \sigma^2 I \right)^{-1} y, \qquad S_{y_*} = \Sigma_{X_* X_*} - \Sigma_{X_* X} \left( \Sigma_{XX} + \sigma^2 I \right)^{-1} \Sigma_{X X_*} + \sigma^2 I. $$
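As a concrete illustration of the posterior predictive equations above, the following minimal NumPy sketch computes $m_{y_*}$ and $S_{y_*}$ under an RBF covariance; the kernel choice, length scale, and noise level here are illustrative assumptions, not settings used in our experiments.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.3):
    """Squared-exponential covariance between row-vector inputs A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(X, y, X_star, length_scale=0.3, noise_var=0.1):
    """Posterior predictive mean m_{y*} and covariance S_{y*} of noisy outputs."""
    K = rbf_kernel(X, X, length_scale) + noise_var * np.eye(len(X))
    K_star = rbf_kernel(X_star, X, length_scale)
    K_ss = rbf_kernel(X_star, X_star, length_scale)
    K_inv_y = np.linalg.solve(K, y)                  # (Sigma_XX + sigma^2 I)^{-1} y
    mean = K_star @ K_inv_y
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T) + noise_var * np.eye(len(X_star))
    return mean, cov

X = np.linspace(0, 1, 20)[:, None]
y = np.sin(4 * X[:, 0])
m, S = gp_posterior(X, y, X, noise_var=1e-4)
# With a tiny noise variance, the posterior mean nearly interpolates the targets.
```

Solving the linear system with `np.linalg.solve` avoids explicitly forming the $\mathcal{O}(N^3)$ inverse, but the dominant cost is unchanged, which motivates the scalable methods discussed next.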

Fast Gaussian Process Inference
Fitting GP-based models is dominated by $\mathcal{O}(N^3)$ covariance matrix inversion operations, meaning that these models are challenging to fit to data with large sample sizes. To make GP models computationally tractable for large data, numerous approaches have been developed for scalable GP inference. These approaches largely fall into two groups: sparse methods and local methods. Sparse methods approximate the GP posterior distribution with $M \ll N$ inducing points that act as pseudo-inputs to the GP function, thereby reducing the computational complexity to $\mathcal{O}(NM^2)$ [3, 4]. We can further speed up these sparse methods using stochastic variational inference (SVI), which reduces the cost by fitting the model on minibatches of the data to form stochastic gradients at each iteration [5].
Local methods, on the other hand, exploit structure among samples to represent the covariance matrix using a low-rank approximation [6, 7]. To do this, they partition the data into $K$ sets and approximate the covariance matrix by setting the inter-partition covariances to zero. They invert this approximate covariance matrix by inverting $K$ dense matrices of size $N/K$, resulting in a complexity of $\mathcal{O}(N^3/K^2)$. An additional benefit of local approaches is that we can estimate GP hyperparameters within each block; when samples are partitioned based on input location, this implicitly captures non-stationary functions. A product-of-experts model, however, implies the strong assumption that observations are independent across experts. This independence assumption ignores the correlation between experts and can lead to poor posterior predictive uncertainty quantification.
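To make the local-method computation concrete, the following sketch (with arbitrary synthetic SPD blocks) verifies that inverting a block-diagonal covariance, in which inter-partition covariances are zeroed, reduces to inverting the $K$ blocks independently:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 12, 3
block_size = N // K

# K synthetic symmetric positive-definite covariance blocks.
blocks = [rng.standard_normal((block_size, block_size)) for _ in range(K)]
blocks = [b @ b.T + np.eye(block_size) for b in blocks]

# Full covariance with inter-partition covariances set to zero.
full = np.zeros((N, N))
for k, b in enumerate(blocks):
    s = k * block_size
    full[s:s + block_size, s:s + block_size] = b

# Invert K small matrices instead of one large one: O(K * (N/K)^3) = O(N^3 / K^2).
inv_blocks = [np.linalg.inv(b) for b in blocks]
recon = np.zeros((N, N))
for k, b in enumerate(inv_blocks):
    s = k * block_size
    recon[s:s + block_size, s:s + block_size] = b

full_inv = np.linalg.inv(full)   # the two inverses agree
```

Each block inverse costs $(N/K)^3$, and there are $K$ of them, giving the $\mathcal{O}(N^3/K^2)$ complexity stated above.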
Instead, a more flexible approach to local methods in the GP domain is to assume a mixture of GP-distributed experts. Mixture-of-experts GP models can model non-stationary functions [8, 9, 10, 11], but integration over partitions makes inference computationally intractable in most realistic settings. In response, variational methods for mixtures of GPs have proven effective at speeding up computation [12, 13, 14]. Alternate methods using sum-product networks and importance sampling have also proven fast and effective for Bayesian inference of GP mixtures [15, 2].

Online Gaussian Process Inference
The online product-of-experts (POE) variant of GP regression assumes the data are strictly partitioned, leading to a block-diagonal structure in the covariance matrix [16]. Contrast this with the mixture-of-experts approach, where the partition is integrated out, leading to a more expressive covariance structure that is not, however, inherently amenable to online updating.
Although local methods are simple to update in real time, as we can send new data to clusters and update only the clusters receiving new data, the sparse and mixture-of-experts GP inference methods discussed above do not trivially allow model updates as new data arrive. Previous work on online GP inference required non-trivial adaptations to allow real-time updates. The sparse online Gaussian process is an inducing-point method that approximates the posterior using variational inference; however, this approach assumes the kernel hyperparameters are fixed and known a priori [17]. The online sparse variational Gaussian process (OSVGP) is another inducing-point variational method for online GP regression that updates the hyperparameters as new data arrive [18].
Many scalable inference algorithms take advantage of training the model using only a small subset of the data in a "minibatching" approach. Examples of minibatches being used for scalable inference include stochastic gradient descent [19], stochastic variational inference (SVI) [20], and stochastic gradient MCMC [21]. However, in the sparse online setting we cannot use minibatching as a stochastic approximation, because SVI requires minibatches sampled uniformly i.i.d. from the entire dataset, and in the streaming case new samples may not come from the same distribution as the previous observations [18]. Moreover, minibatching in an SVI-type setting requires constant updating of the entire model, unlike local methods, which can take minibatches of the incoming data and update only a small subset of experts.
One notable variant of exact inducing-point online methods for streaming GP regression, Woodbury inversion with structured kernel interpolation (WISKI) [22], replaces the covariance kernel with a structured, sparse matrix approximation [23]. Using this "structured kernel interpolation" method in conjunction with Woodbury rank-one updates to the kernel inverse yields an online GP method whose computational complexity is constant with respect to the number of observations. Empirically, the online variational method of OSVGP tends to suffer from numerical stability issues and is vulnerable to underfitting in the streaming setting [22]. Alternatively, we may interpret the sparse online GP under a different formulation: instead of the approaches in the aforementioned papers, one can reframe the sparse model as a state space model and use the Kalman filtering update equations to train the model sequentially [24].

Modeling Non-Stationary Functions with Gaussian Processes
Modeling non-stationary functions using GPs is often computationally challenging because non-stationary kernels generally include more parameters than stationary kernels and are more computationally intensive to fit. Many non-stationary kernels model the hyperparameters as a GP-distributed function of the input space [25, 26, 27, 28]. In this setting, if we have a GP-distributed kernel hyperparameter for each observation, then the computational complexity of posterior sampling the hyperparameters scales as $\mathcal{O}(N^4)$, because we must update all $N$ hyperparameter values.
The local Gaussian process product-of-experts approach fits online GPs that can account for non-stationary behavior, but inference depends on a single partitioning of the input data using K-means [16]. Hard partitioning may create undesirable edge effects due to its implicit assumption that partitions have zero correlation. The Bayesian treed GP addresses the problem of hard partitioning by integrating over the space of decision tree partitions via reversible jump MCMC [8]. While this method flexibly captures non-stationary functions, reversible jump is slow to marginalize over the space of treed GP partitions and is not parallelizable due to the Markov dependencies in MCMC.
The Gaussian process change point model captures non-stationary behavior in GPs by jointly modeling a GP regression model and a change point generating process [29]. The prior distribution on the change point locations is assumed to follow a survival function that is known a priori, which represents the run time of a functional regime. This model estimates the change point run time in an online setting, based on a Bayesian online change point detection model [30].

Dirichlet Process Mixture Modeling
For a finite mixture model, we assume that there are $K$ mixtures in which the observations $y$ are parameterized by $\theta_k \sim H$ for some distribution $H$ defined on a parameter space $\Theta$, with mixture weights $\pi = [\pi_1, \ldots, \pi_K]$, where $0 < \pi_k < 1$ and $\sum_{k=1}^{K} \pi_k = 1$:
$$ y_i \sim \sum_{k=1}^{K} \pi_k F(\theta_k). $$
For the purpose of computational tractability, we can augment the mixture model with a latent indicator $z_i \in \{1, \ldots, K\}$, which tells us the latent mixture membership for observation $y_i$, so that our mixture model is now parameterized as:
$$ z_i \sim \mathrm{Categorical}(\pi), \qquad y_i \mid z_i \sim F(\theta_{z_i}). $$
In our proposed approach, we assume that there is an a priori infinite number of mixtures and that the mixture distribution is generated from a Dirichlet process. The Dirichlet process (DP) is a distribution over probability measures $G \sim \mathrm{DP}(\alpha, H)$, defined on some parameter space $\Theta$, with a concentration parameter $\alpha \in \mathbb{R}^+$. We define a Dirichlet process by its finite-dimensional marginals: for a finite partition of $\Theta$, $(A_1, \ldots, A_K)$, if the probability masses $G(A_1), \ldots, G(A_K)$ are $\mathrm{Dirichlet}(\alpha H(A_1), \ldots, \alpha H(A_K))$-distributed, then $G \sim \mathrm{DP}(\alpha, H)$ [31]. We obtain the Dirichlet process mixture model (DPMM) by taking the limit of the augmented mixture model as $K \to \infty$.
Furthermore, we can integrate out the mixing proportion $\pi$ and sample the latent indicators $z$ from the predictive distribution of the Dirichlet process, the Chinese restaurant process [32, 33]. In the Chinese restaurant process (CRP), the $i$-th latent indicator probability is
$$ P(z_i = k \mid z_{1:i-1}) = \begin{cases} \dfrac{N_k}{i - 1 + \alpha} & \text{for an existing mixture } k, \\[4pt] \dfrac{\alpha}{i - 1 + \alpha} & \text{for a new mixture,} \end{cases} $$
where $N_k$ is the number of previous observations from $z_1$ to $z_{i-1}$ assigned to mixture $k$.
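The CRP predictive above can be simulated directly. This short sketch (the helper name `sample_crp` is ours, for illustration) draws assignments sequentially, choosing an existing cluster with probability proportional to its count and a new cluster with probability proportional to $\alpha$:

```python
import numpy as np

def sample_crp(n, alpha, rng):
    """Draw mixture assignments z_1, ..., z_n from a Chinese restaurant process."""
    z = [0]          # first observation starts the first cluster
    counts = [1]     # N_k for each instantiated cluster
    for i in range(1, n):
        # Existing cluster k with prob N_k / (i + alpha); new one with prob alpha / (i + alpha).
        probs = np.array(counts + [alpha], dtype=float)
        probs /= i + alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)   # instantiate a new cluster
        else:
            counts[k] += 1
        z.append(k)
    return z, counts

rng = np.random.default_rng(0)
z, counts = sample_crp(1000, alpha=2.0, rng=rng)
```

Consistent with the discussion below, the number of instantiated clusters stays finite for finite data and grows only logarithmically with the number of observations.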
We place this non-parametric assumption on the mixture distribution because we expect there will be an infinite number of mixtures as we observe an infinite amount of data, but the DPMM will adaptively instantiate a finite number of mixtures for a finite number of observations.

Importance Sampling and Sequential Monte Carlo
A central issue in Bayesian inference is the computation of the often intractable integral $\bar{f} = \int f(\theta) P(\theta)\, d\theta$. For this task, importance sampling (IS) is a popular tool. To perform importance sampling, $J$ samples $\theta^{(j)}$ (also known as "particles") are drawn from a proposal distribution $Q(\theta)$ that is simple to sample from and approximates $P(\theta)$ well. The integral is estimated using the weighted empirical average of these samples $f(\theta^{(j)})$ with weights $w^{(j)} = P(\theta^{(j)}) / Q(\theta^{(j)})$:
$$ \bar{f} \approx \frac{1}{J} \sum_{j=1}^{J} w^{(j)} f(\theta^{(j)}). $$
For problems where $\theta$ arrives in a sequence, $\theta_{1:N} = (\theta_1, \ldots, \theta_N)$, we can rewrite $w^{(j)}$ as a sequential update of the importance weight:
$$ w_i^{(j)} = w_{i-1}^{(j)} \, \frac{P(\theta_{1:i}^{(j)})}{P(\theta_{1:i-1}^{(j)}) \, Q(\theta_i^{(j)} \mid \theta_{1:i-1}^{(j)})}. $$
In a Bayesian context, one can view this sequential importance sampler as sequentially updating a prior distribution at $i = 0$ to the posterior of all observed data at time $N$. Moving from $i = 1$ to $N$ in this sampler may lead to a situation where the weight of one particle dominates the others, thereby increasing the variance of our Monte Carlo estimate and propagating suboptimal particles to future time steps. To avoid this, we can resample the particles according to their weights $w^{(j)}$ for $j = 1, \ldots, J$, leading to the sequential Monte Carlo (SMC) sampler [34]. Monte Carlo techniques like IS and SMC are appealing from a computational perspective because each weight $w_i^{(j)}$ is independent of the others, so we can trivially parallelize the computation across as many processors as are available. We can sequentially update a Dirichlet process mixture model for online inference using an SMC sampler under general proposal distributions and modeling assumptions [35]. In the particle filtering algorithm for DP mixtures, we have an analytically convenient expression for the weight updates in the special case where the observation model is a Dirichlet process mixture of normal-inverse-Wisharts [36]. Moreover, SMC methods have also been used in online GP learning for sequentially updating GP hyperparameters, but they rely on strong modeling and conjugate prior choices, as this SMC sampler takes advantage of closed-form marginalization of nuisance parameters for efficient inference [37]. Later work generalized the SMC sampler for updating the GP hyperparameters sequentially without restrictive conjugacy requirements [38]. However, these methods are not truly online and are unsuitable for large-scale streaming scenarios, because the complexity of updating the particle weights still grows cubically with the number of observations. Finally, IS-MOE is a fast method that unifies non-stationary function learning and parallel GP inference [2]. To achieve scalability, IS-MOE integrates over the space of partitions using importance sampling, which allows distributed computation. This method uses "minibatched" stochastic approximations, where the model is fit with only a subset of the training set and the likelihood is upweighted to approximate the full-data likelihood. However, IS-MOE cannot update GP models as new data arrive. For this purpose, we develop a new approach for online, parallelizable inference of a mixture-of-experts GP model that we call "GP-MOE".
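As a toy illustration of self-normalized importance sampling, the sketch below estimates $\mathbb{E}_P[\theta^2] = 1$ under a standard-normal target using a wider Gaussian proposal (both distributions are illustrative choices), along with the effective-sample-size diagnostic used to decide when to resample:

```python
import numpy as np

rng = np.random.default_rng(1)
J = 200_000

# Target P = N(0, 1); proposal Q = N(0, 2^2); we estimate E_P[theta^2] = 1.
theta = rng.normal(0.0, 2.0, size=J)
log_p = -0.5 * theta**2                       # log target density (up to a constant)
log_q = -0.5 * theta**2 / 4.0 - np.log(2.0)   # log proposal density (up to a constant)
log_w = log_p - log_q                         # log w^(j) = log P(theta^(j)) - log Q(theta^(j))
w = np.exp(log_w - log_w.max())               # subtract max for numerical stability
w /= w.sum()                                  # self-normalized weights

estimate = np.sum(w * theta**2)               # weighted empirical average, approx. 1
ess = 1.0 / np.sum(w**2)                      # effective sample size
```

Working with log weights and normalizing (so unknown constants in $P$ and $Q$ cancel) is standard practice and mirrors the weight bookkeeping our sampler uses.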

Gaussian Processes and Optimization: A Bandit Perspective
Bayesian optimization techniques seek to find parameters that best model a conditional probability. Many approaches optimize parameter configurations adaptively [39, 40, 41], with bandit formulations being particularly successful in GP contexts [42, 43, 44]. The bandit framework considers the problem of optimizing a function $f$, sampled from a GP, by sequentially selecting from a set of arms corresponding to inputs $x_i$, for which noisy values $y_i$ are observed.
The goal of a bandit algorithm is to sequentially select arms that maximize long-term rewards; to do this, one needs to approximate the expected rewards for each arm (exploration) and then select the arms that maximize rewards (exploitation). There are two common algorithmic strategies for choosing arms: optimization-based algorithms and sampling-based approaches. Here, we consider the use of GP-distributed functions for bandit optimization. At time $i$, we select a new point $x_i$ that maximizes some acquisition function. We can select the next point to evaluate by optimizing with respect to the predictive reward,
$$ x_i = \arg\max_{x_*} m(x_*), $$
which can quickly converge to a poor local maximum. To better explore the function, we can instead select $x_i$ with respect to the GP upper confidence bound (GP-UCB),
$$ x_i = \arg\max_{x_*} \; m(x_*) + \beta_i^{1/2} s(x_*), $$
where $m(\cdot)$ and $s(\cdot)$ are the predictive posterior mean and standard deviation for a test point $x_*$, and $\beta_i$ is a tuning parameter that acts as the trade-off between exploration and exploitation. The GP-UCB acquisition function prefers to select new points where there is both high uncertainty and a high predicted reward [43].
In Thompson sampling (TS), we take samples from the posterior predictive distribution evaluated at $X_*$ and select the test input that yields the highest sample of $y_*$:
$$ \tilde{y}_* \sim P(y_* \mid y, X, X_*), \qquad x_i = \arg\max_{x_* \in X_*} \tilde{y}_*(x_*). $$
The TS approach to the bandit problem is closely related to optimizing the acquisition function with respect to the UCB bound and can be seen as a sampling-based variant balancing the exploration-exploitation trade-off [45].
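A minimal sketch of both acquisition rules on a toy one-dimensional problem, assuming a zero-mean GP with a unit-amplitude RBF kernel (the length scale, $\beta$, and candidate grid are all illustrative assumptions):

```python
import numpy as np

def rbf(A, B, ell=0.2):
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)

rng = np.random.default_rng(0)
X = np.array([0.1, 0.4, 0.9])             # arms evaluated so far
y = np.sin(6 * X)                         # observed rewards (noise-free toy data)

X_star = np.linspace(0, 1, 201)           # candidate arms
K = rbf(X, X) + 1e-6 * np.eye(3)          # small jitter for numerical stability
K_star = rbf(X_star, X)
m = K_star @ np.linalg.solve(K, y)        # predictive mean m(x_*)
S = rbf(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)
s = np.sqrt(np.maximum(np.diag(S), 0.0))  # predictive standard deviation s(x_*)

# GP-UCB: deterministic trade-off controlled by the tuning parameter beta.
beta = 4.0
x_ucb = X_star[np.argmax(m + np.sqrt(beta) * s)]

# Thompson sampling: one joint posterior draw, then pick its argmax.
y_draw = rng.multivariate_normal(m, S + 1e-8 * np.eye(len(X_star)))
x_ts = X_star[np.argmax(y_draw)]
```

Note that UCB deterministically combines mean and uncertainty, while TS randomizes the trade-off through the posterior draw, which is the sense in which TS is a sampling-based variant of UCB.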
Online GPs are naturally appealing for Bayesian optimization. The function being optimized, $f$, is usually difficult to evaluate, so we are unlikely to observe more than a handful of evaluations of $f$ when initially fitting the GP model, and we are therefore unlikely to have an accurate estimate of the hyperparameters at the initial fit. Since we do not know the GP hyperparameters a priori in Bayesian optimization either, we should update the GP as new function queries arrive.

Sequential Gaussian Processes for Online Learning
We assume that our data are generated from a Gaussian process mixture, similar to previous mixture-of-experts models for Gaussian processes. This hierarchical model allows for greater flexibility in modeling functions, at the cost of more difficult inference, for which we propose a distributable solution in the next section. Our approach adopts the following generative model:
$$ z \sim \mathrm{CRP}(\alpha), \qquad x_i \mid z_i = k \sim F(\phi_k), \quad \phi_k \sim \mathrm{NIW}, \qquad \theta_k \sim \mathrm{LogNormal}, \qquad y_k \mid X_k, \theta_k \sim \mathcal{N}\!\left(0, \Sigma^{\theta_k}_{X_k X_k} + \sigma^2 I\right), $$
where $(X_k, y_k) = (x_i, y_i : z_i = k)$ represent the data associated with mixture $k$. We assume that the inputs are distributed according to a Dirichlet process mixture of normal-inverse-Wishart distributions, and we marginalize out the parameter locations from the inputs. The outputs are then assumed to be generated by independent GPs, given the mixture indicator.
The GP parameters, $\theta_k$, are assumed to be log-normally distributed, and we update the state of $\theta_k$ using the elliptical slice sampler [46].
The elliptical slice sampler (ESS) is an MCMC technique for sampling from a posterior distribution in which the prior is a multivariate Gaussian but the likelihood is not conjugate. Unlike typical Metropolis-Hastings samplers, the ESS algorithm is guaranteed to accept a transition to a new state at every iteration, and it is therefore very efficient for MCMC sampling. Indeed, slice sampling techniques are preferable for posterior sampling of the covariance function hyperparameters in GP models [47].
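A compact sketch of one ESS transition follows, checked on a toy conjugate problem where the answer is known (a standard-normal prior times a Gaussian pseudo-likelihood centered at 1 yields a $\mathcal{N}(0.5, 0.5 I)$ posterior); the helper name and test problem are ours, not the paper's:

```python
import numpy as np

def elliptical_slice(f, prior_chol, log_lik, rng):
    """One elliptical slice sampling transition: prior N(0, L L^T), any log likelihood."""
    nu = prior_chol @ rng.standard_normal(f.shape)    # auxiliary draw from the prior
    log_u = log_lik(f) + np.log(rng.uniform())        # slice threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:                                       # the bracket shrinks toward the
        f_prop = f * np.cos(theta) + nu * np.sin(theta)   # current state, so an accept
        if log_lik(f_prop) > log_u:                   # is guaranteed eventually
            return f_prop
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

rng = np.random.default_rng(0)
log_lik = lambda f: -0.5 * np.sum((f - 1.0) ** 2)     # Gaussian pseudo-likelihood
f = np.zeros(2)
samples = []
for _ in range(3000):
    f = elliptical_slice(f, np.eye(2), log_lik, rng)
    samples.append(f)
post_mean = np.mean(samples, axis=0)                  # should approach [0.5, 0.5]
```

The proposal $f' = f \cos\theta + \nu \sin\theta$ stays on an ellipse through the current state and the auxiliary prior draw, which is why every transition eventually accepts without a rejection step.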
We assign the $i$-th input sequentially to clusters according to the Chinese restaurant prior with the mixture locations marginalized out [48]:
$$ P(z_i = k \mid x_i, z_{1:i-1}) \propto \begin{cases} N_k \, T(x_i; \mu_k, \Psi_k, \nu_k) & k \in K^+, \\ \alpha \, T(x_i; \mu_0, \Psi_0, \nu_0) & k \text{ new}, \end{cases} $$
where $T(\mu_k, \Psi_k, \nu_k)$ is a multivariate-$t$ distribution with the mean, covariance, and degrees-of-freedom parameters for observation $i$'s sequential assignment to mixture $k$, and $K^+$ refers to the previously occupied clusters, $\{k : N_k > 0\}$.
We use the $(\cdot)$ notation to indicate that the summary statistics are conditioned only on observations $1, \ldots, i-1$. We also place a Gamma prior on the DP concentration parameter, $\alpha$, which allows us to easily sample its full conditional up to observation $i$ with a variable augmentation scheme [49].

SMC for Online GP-MOE
In an SMC setting with $j = 1, \ldots, J$ particles, we first propagate the particles $(z^{(j)}, \theta^{(j)}, \alpha^{(j)})$ from time $i-1$ to $i$ and fit a GP product-of-experts model. Then we calculate the particle weights. At the initial time, $i = 1$, particle $j$'s weight is proportional to the likelihood of the first observation under that particle's parameters. For $i > 1$, the particle weight in the sequential update can be written as the product of:
1. the previous weight, $w_{i-1}^{(j)}$;
2. the ratio of the model's likelihood up to observation $i$ over the likelihood up to observation $i-1$ [38]; and
3. the particle weight of $z_i^{(j)}$ for the Dirichlet process mixture model [36].
The GP term of the particle weight from [38] simplifies in this setting to the ratio of the new likelihood (including observation $i$) and the old likelihood (excluding observation $i$) of mixture $z_i$. We can store the old likelihood in memory from the last time we updated the particle weights, so the only computationally intensive step is computing the new likelihood for mixture $z_i$. This particle update is:
$$ w_i^{(j)} \propto w_{i-1}^{(j)} \cdot \frac{P\big(y_{z_i} \mid X_{z_i}, \theta_{z_i}^{(j)}\big)}{P\big(y_{z_i}^{-i} \mid X_{z_i}^{-i}, \theta_{z_i}^{(j)}\big)} \cdot P\big(x_i \mid z_{1:i-1}^{(j)}\big), $$
where $(X_k, y_k)$ now include observation $i$ and $(X_k^{-i}, y_k^{-i})$ exclude it. After calculating the particle weights, we calculate the effective sample size, $N_{\mathrm{eff}} = 1 / \sum_{j=1}^{J} (w_i^{(j)})^2$. If the effective number of samples drops below a certain threshold (typically $J/2$), then we resample the particles with probabilities $w_i^{(1)}, \ldots, w_i^{(J)}$ to avoid the particle degeneracy problem. The details for updating the particles are in Algorithm 1.
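The effective-sample-size check and multinomial resampling step can be sketched as follows (helper name and threshold default are ours; the $J/2$ rule matches the text above):

```python
import numpy as np

def maybe_resample(particles, w, rng, threshold=None):
    """Resample particles when the effective sample size drops below J/2."""
    J = len(w)
    threshold = J / 2 if threshold is None else threshold
    n_eff = 1.0 / np.sum(w**2)            # N_eff = 1 / sum_j (w_i^(j))^2
    if n_eff < threshold:
        idx = rng.choice(J, size=J, p=w)  # draw indices with probabilities w^(j)
        particles = [particles[i] for i in idx]
        w = np.full(J, 1.0 / J)           # reset to uniform weights after resampling
    return particles, w, n_eff

rng = np.random.default_rng(0)
w = np.array([0.97, 0.01, 0.01, 0.01])    # one particle dominates: degeneracy
parts, w_new, n_eff = maybe_resample(list("abcd"), w, rng)
```

With the dominant weight above, $N_{\mathrm{eff}} \approx 1.06 < J/2 = 2$, so the particles are resampled and the weights reset to uniform.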
To calculate the predictive posterior distribution of GP-MOE for test data $x_*$, we calculate the predictive mean and variance on each individual particle, averaged over the mixture assignment for the test data. Then, we average the predictive distribution across particles, weighted by $w_i^{(j)}$. Assuming each batch is on average of size $N/K$, the computational complexity of fitting our model is $\mathcal{O}(J N^3 / K^2)$ for $J$ particles. SMC methods allow for parallel computation, as we can update the particles independently; thus, with one processor per particle, we can reduce the computational complexity of fitting GP-MOE to $\mathcal{O}(N^3 / K^2)$. While the computational complexity of GP-MOE is considerably reduced from the typical $\mathcal{O}(N^3)$ cost, the complexity still grows considerably as new data arrive.
To mitigate this problem, we can subsample $B \ll N_k$ observations uniformly without replacement within a mixture assignment to approximate the likelihood $P(y_k \mid X_k, \theta_k)$, upweighting the subsampled likelihood by a power of $N_k / B$. This stochastic approximation provides an estimate of the full likelihood and has been used successfully in other Bayesian inference algorithms [50, 51].
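A sketch of this upweighted-subsample approximation, using a simple i.i.d. Gaussian log likelihood as a stand-in for the GP marginal likelihood (the helper names are hypothetical):

```python
import numpy as np

def approx_log_lik(log_lik_fn, X_k, y_k, B, rng):
    """Estimate the full-data log likelihood from B subsampled points,
    upweighting by N_k / B (i.e. raising the likelihood to that power)."""
    N_k = len(y_k)
    if N_k <= B:
        return log_lik_fn(X_k, y_k)
    idx = rng.choice(N_k, size=B, replace=False)   # uniform, without replacement
    return (N_k / B) * log_lik_fn(X_k[idx], y_k[idx])

# Stand-in likelihood: i.i.d. standard-normal log density of the outputs.
log_lik = lambda X, y: np.sum(-0.5 * y**2 - 0.5 * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = np.arange(10_000)
y = rng.standard_normal(10_000)
full = log_lik(X, y)
approx = approx_log_lik(log_lik, X, y, B=1_000, rng=rng)
# approx is an unbiased estimate of full, at a tenth of the evaluation cost.
```

Because the estimate touches only $B$ points, the per-update cost stays bounded at $\mathcal{O}(B^3)$ per expert regardless of how large $N_k$ grows.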
We use this stochastic approximation when the number of observations in a mixture, $N_k^{(j)}$, exceeds $B$, so that the complexity of fitting GP-MOE does not increase once $N_k^{(j)} > B$. With this stochastic approximation, the complexity of our model is $\mathcal{O}(J \min\{N_k, B\}^3 / K^2)$. In Table 1, we compare the computational complexity of our GP-MOE method, POE (our method using only one particle), WISKI [22], and OSVGP [18].

Algorithm 2: GP-MOE Prediction
for $j = 1, \ldots, J$ in parallel do: predict new observations on particle $j$.
Average predictions: $P(\bar{y}_* \mid y, X, x_*) = \sum_{j=1}^{J} w_i^{(j)} P(y_*^{(j)} \mid y, X, x_*)$.

Table 1: Comparison of inference complexity. $N$ is the number of data points, $K$ is the number of experts, $M$ is the number of inducing points, and $J$ is the number of particles.

Optimization in a Bandit Setting with Online GP-MOE
To motivate using the online GP-MOE for an applied problem, we show how this approach can be useful in a setting where GPs are popularly used: optimization. We can perform bandit Gaussian process optimization using our online GP-MOE method. In this setting, we want to optimize the function $f$ using a mixture of GPs, where the arms of the bandit are indexed by $k$, corresponding to the kernel-specific hyperparameters of the mixture components. We update the model by first selecting a new test point $x_i$ using UCB Thompson sampling. Then, we evaluate $f(x_i)$ and update the model using Algorithm 1 at $(x_i, f(x_i))$. The point $x_i$ stochastically selects the arm $z_i$ that produces the best reward, corresponding to the marginal likelihood $P(y_k \mid X_k, \theta_k, z_{1:i})$. Lastly, we update the particle weights $w^{(j)}$, along with $\theta_k^{(j)}$ and $\alpha^{(j)}$, after selecting the arm. We continue this procedure for $N$ total function evaluations.

Empirical Analyses for Online GP-MOE
To demonstrate the ability of our algorithm to fit streaming non-stationary GPs, we apply our online GP-MOE to a collection of empirical time-series datasets that exhibit non-stationary behavior. In our comparisons, we use the following datasets: 1) an accelerometer measurement of a motorcycle crash; 2) the price of Brent crude oil; 3) the annual water level of the Nile river; 4) the exchange rate between the Euro and the US Dollar; 5) the annual carbon dioxide output in Canada; and 6) the heart rate of a hospital patient in the MIMIC-III data set [52]. We pre-process the data so that the inputs and outputs have zero mean and unit variance.
We compare our method against three alternative approaches: a product-of-experts model (POE), which is a special case of our algorithm with only one particle; a sparse online GP method using the Woodbury identity and structured kernel interpolation (WISKI) [22]; and an online sparse variational GP method (OSVGP) [18]. Our choice of kernel for each of these methods is the radial basis function (RBF),
$$ \Sigma(x, x') = \exp\left( -\frac{\lVert x - x' \rVert^2}{2 \theta^2} \right), $$
for length-scale parameter $\theta$.
In these experiments, we set the number of inducing points to 50 for each of the sparse methods. We evaluate our method with $J$ equal to 1 (equivalent to a POE), 100, and 500 particles. The OSVGP method requires a fixed number of optimization iterations to estimate the variational parameters, and the results are highly sensitive to this setting. To make the model settings comparable to those of GP-MOE, we also set the optimizer iterations to 1, 100, and 500. In the GP-MOE model, we distribute the inference of each particle over 16 cores on a shared-memory process using OpenMP. We run an online prediction experiment in which we initialize the model using the first observation in the time-series data set; then, we sequentially predict the subsequent observation and update the model with the next data point.
Our method generally performs better than the competing online GP methods (Tables 2 and 3). We broadly observe that GP-MOE performs better than POE because we can integrate over the space of partitions and thus better capture the predictive uncertainty. WISKI is undoubtedly the fastest method, as its computational complexity is constant with respect to the number of observations. But because these data sets exhibit non-stationarity, WISKI is unable to handle changes in the kernel behavior (such as heteroscedasticity) and therefore performs poorly in terms of MSE and log likelihood in these experiments. The GP-MOE methods perform comparably to OSVGP, which only uses a single GP, in terms of wall time (Table 4).
The performance of OSVGP strongly depends on the number of optimizer iterations, and OSVGP performs poorly when only one iteration is used as opposed to 500. OSVGP often runs into numerical stability issues relating to calculating Cholesky decompositions in the sparse variational GP [22]. Like WISKI, OSVGP can sometimes obtain adequate point estimates of the posterior mean but generally mischaracterizes the posterior predictive variance in these non-stationary examples. While OSVGP can partially capture non-stationary behavior as the hyperparameters update with new data, its posterior predictive variance is generally wide compared to that of GP-MOE, as evidenced by the superior performance of GP-MOE with respect to predictive log likelihood.
For the motorcycle, Brent, Canadian carbon dioxide, and heart rate datasets, GP-MOE performs the best in terms of predictive mean squared error and log likelihood. On the Nile river data set, OSVGP performs the best in terms of online predictive log likelihood. This could be because the only non-stationary component of the Nile river data set is the mean value, which GP-MOE assumes to be constant at all values of $x$. OSVGP and WISKI also obtain the best MSE and log likelihood results on the EUR-USD data set, which exhibits only time-varying noise, a zero mean, and a stationary length scale for the entire duration of the time series; here, OSVGP and WISKI produce wider noise estimates than GP-MOE. However, a stochastic volatility model would perhaps be a better fit for this type of data than a mixture of GPs. The other datasets exhibit time-varying noise terms and length scales, which our mixture of GPs can capture, and thus GP-MOE exhibits superior performance.
The online predictive performance of each of these online GP models shows that WISKI and OSVGP produce overly wide credible intervals in comparison to GP-MOE and POE (Figures 1-5). In some instances, WISKI and OSVGP produce inaccurate predictive means, as in the motorcycle, Brent, and Canada data sets. In the GP-MOE and POE results, we have colored the observations based on the mixture assignments of the largest particle. We can see from these plots that the mixture assignments largely correspond to changes in the underlying functional behavior. For example, in the plot for the motorcycle data set, we observe an initial low-noise regime and a subsequent high-noise regime. GP-MOE assigns data to different mixtures in these separate parts of the time series where the noise and length-scale behavior changes (Figures 1-5). In contrast, stationary methods like WISKI or OSVGP model the motorcycle data with a constant noise term across time, and the initial low-noise regime of the data is modeled with an extremely large 95% predictive credible interval.

Hyperparameter Settings
Our approach has important hyperparameters that can affect the performance of the online GP-MOE algorithm: the number of particles and the minibatch size. The authors of the IS-MOE method provide an in-depth examination of how changing these hyperparameters affects predictive performance in terms of test-set log likelihood and mean squared error [2]. Increasing the number of importance samples, $J$, and the minibatch size, $B$, will improve model performance and increase computation time, while increasing the number of experts, $K$, will decrease both computation time and model performance (as $K$ increases, the covariance matrix becomes diagonally dominant and is unable to capture input correlation). However, there are intermediate settings for each parameter that can drastically reduce computation time without a decrease in predictive model performance.
We can examine the effect of increasing $J$ and $B$ on the online predictive MSE and log likelihood (Figure 7). Unsurprisingly, increasing $J$ and $B$ improves the predictive performance of GP-MOE. However, there is a point at which the performance of GP-MOE plateaus with respect to increasing $J$ and $B$, indicating that there are lower settings at which we can still obtain accurate performance.

Bandit Optimization
Next, we use GP-MOE in a bandit optimization setting with Thompson sampling. Here, we maximize four functions typically used to evaluate optimization algorithms (Figure 8). We choose these functions because they are multimodal, and we want to examine how well the competing GP optimization methods both explore and exploit the target function.
We compare GP-MOE, POE, WISKI, OSVGP, and GP-TS on these four optimization tasks, setting $J = 100$ for GP-MOE and the number of inducing points to 50 for the sparse methods. We run each method for $N = 500$ iterations. In bandit optimization, we focus on two results: minimizing the mean average regret (MAR) and obtaining the maximum value of the function, $\max_x f(x)$. GP-MOE and POE typically obtain the highest (best) maximum value, while the stationary methods (especially the basic GP-TS) obtain better MAR (Table 5).
This suggests that basic online GP approaches are liable to explore only a suboptimal mode. While the bandit can reliably estimate the neighborhood surrounding this mode, thus resulting in a low MAR, it cannot actually find a better optimum compared to GP-MOE and POE. The only setting where a stationary method competes with GP-MOE and POE is the Parsopoulos function, where GP-TS performs comparably. Though POE can attain a maximum function value on par with GP-MOE, GP-MOE produces a maximum value with typically lower variance than POE, an effect we would expect from an ensemble method like SMC. In these optimization experiments, POE has the lowest CPU wall time. Thus, for an optimization task like the ones analyzed in this section, the optimal setting may be one where the number of particles, $J$, is rather low, taking advantage of the variance reduction of SMC while maintaining the speed of POE.

Conclusion and Future Directions
In this paper, we introduced an online inference algorithm for fitting mixtures of Gaussian processes that can perform online estimation of non-stationary functions. We show that we can apply this mixture of GP experts in an optimization setting. In these bandit optimization experiments, we observe that GP-MOE finds the highest maximum value compared to the competing methods while remaining competitive in terms of regret.
For future work, we are interested in extending this method in a few different directions. First, we want to implement additional features that may improve the performance of our online method with regard to the sequential Monte Carlo sampler. Among these, we want to investigate incorporating retrospective sampling of the mixture assignments, z, as we only sample these indicators once, when the i-th observation arrives. Retrospective sampling could lead to better predictive performance on subsequent observations, though the additional computational cost could be prohibitive for extremely large data sets.
Moreover, we are interested in introducing control variates in conjunction with the SMC sampler in order to reduce the variance of the Monte Carlo estimates. Basic Monte Carlo methods tend to exhibit high estimator variance; but if we have random variables correlated with our estimator whose expectations are known, we can reduce the variance considerably. We plan to investigate these extensions in future work.
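The control-variate idea above can be illustrated with a toy example unrelated to the SMC sampler: estimating E[exp(U)] for U ~ Uniform(0, 1) (true value e - 1), using U itself as the control since E[U] = 1/2 is known. The coefficient and sample size below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(100_000)
plain = np.exp(u)                        # naive Monte Carlo samples of exp(U)
# Near-optimal coefficient beta* = Cov(exp(U), U) / Var(U), estimated from the draws.
beta = np.cov(plain, u)[0, 1] / np.var(u)
cv = plain - beta * (u - 0.5)            # control-variate corrected samples; E[U] = 1/2
est_plain, est_cv = plain.mean(), cv.mean()
var_plain, var_cv = plain.var(), cv.var()
```

Both estimators are unbiased for e - 1, but the corrected samples have far smaller variance because exp(U) and U are highly correlated; the hope is that analogous correlated quantities with known expectations can be found for the SMC estimates.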
Next, we are interested in applying our online mixture-of-experts approach to modeling patient vital signs in a hospital setting. Gaussian processes have enjoyed notable success in the health-care domain, for example, in predicting sepsis in hospital patients [53, 54], and in jointly modeling patients' vital signals through multi-output GPs [55]. In future research, we will extend our approach to multi-output GP models and implement kernel functions customized for health-care scenarios. By combining fast inference with flexible modeling, these approaches could have a profound impact on real-time monitoring and decision-making in patient health.
Details for prediction in GP-MOE are available in Algorithm 2.

Figure 1: Online posterior predictive mean (plotted with solid red lines) and 95% credible intervals (plotted with dashed black lines) for the motorcycle dataset. The color of the data points for GP-MOE and POE represents the mixture assignment of that observation for the particle with the highest weight. N = 94.

Figure 2: Online posterior predictive mean (plotted with solid red lines) and 95% credible intervals (plotted with dashed black lines) for the Nile river dataset. The color of the data points for GP-MOE and POE represents the mixture assignment of that observation for the particle with the highest weight. N = 100.

Figure 3: Online posterior predictive mean (plotted with solid red lines) and 95% credible intervals (plotted with dashed black lines) for the Brent crude oil dataset. The color of the data points for GP-MOE and POE represents the mixture assignment of that observation for the particle with the highest weight. N = 1025.

Figure 4: Online posterior predictive mean (plotted with solid red lines) and 95% credible intervals (plotted with dashed black lines) for the Canadian CO2 dataset. The color of the data points for GP-MOE and POE represents the mixture assignment of that observation for the particle with the highest weight. N = 215.

Figure 5: Online posterior predictive mean (plotted with solid red lines) and 95% credible intervals (plotted with dashed black lines) for the USD-EUR exchange rate dataset. The color of the data points for GP-MOE and POE represents the mixture assignment of that observation for the particle with the highest weight. N = 3139.

Figure 6: Online posterior predictive mean (plotted with solid red lines) and 95% credible intervals (plotted with dashed black lines) for the heart rate dataset. The color of the data points for GP-MOE and POE represents the mixture assignment of that observation for the particle with the highest weight. N = 10000.

Figure 7: Left: Performance of GP-MOE with increasing values of J on the motorcycle dataset. Right: Performance of GP-MOE with increasing values of B on the motorcycle dataset.

Figure 8: Test functions for the bandit optimization experiments.

Table 2: Online predictive log likelihood over five trials. One standard error reported in parentheses.

Table 3: Online predictive mean squared error over five trials. One standard error reported in parentheses.

Table 4: CPU wall time in seconds for online prediction over five trials. One standard error reported in parentheses.

Table 5: Mean absolute regret and maximum function value for optimization experiments over five trials for 500 iterations. One standard error reported in parentheses.

Table 6: CPU wall time for optimization experiments over five trials for 500 iterations. One standard error reported in parentheses.