Streaming Variational Monte Carlo

Nonlinear state-space models are powerful tools to describe dynamical structures in complex time series. In a streaming setting where data are processed one sample at a time, simultaneous inference of the state and its nonlinear dynamics has posed significant challenges in practice. We develop a novel online learning framework, leveraging variational inference and sequential Monte Carlo, which enables flexible and accurate Bayesian joint filtering. Our method provides an approximation of the filtering posterior which can be made arbitrarily close to the true filtering distribution for a wide class of dynamics models and observation models. Specifically, the proposed framework can efficiently approximate a posterior over the dynamics using sparse Gaussian processes, allowing for an interpretable model of the latent dynamics. Constant time complexity per sample makes our approach amenable to online learning scenarios and suitable for real-time applications.


INTRODUCTION
Nonlinear state-space models are generative models for complex time series with underlying nonlinear dynamical structure [1], [2], [3]. Specifically, they represent nonlinear dynamics in the latent state-space, x_t, that capture the spatiotemporal structure of noisy observations, y_t:

x_t = f_θ(x_{t−1}) + ε_t,  y_t ∼ P(g_ψ(x_t)),  (1)

where f_θ and g_ψ are continuous vector functions, P denotes a probability distribution, and ε_t is intended to capture unobserved perturbations of the state x_t. Such state-space models have many applications (e.g., object tracking) where the flow of the latent states is governed by known physical laws and constraints, or where learning an interpretable model of those laws is of great interest, especially in neuroscience [4], [5], [6], [7], [8], [9]. If the parametric form of the model and the parameters are known a priori, then the latent states x_t can be inferred online through the filtering distribution, p(x_t | y_1:t), or offline through the smoothing distribution, p(x_1:t | y_1:t) [10], [11]. Otherwise, the challenge is to learn the parameters of the state-space model, {θ, ψ}, which is known in the literature as the system identification problem.
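As a concrete illustration of the generative model in (1), the following sketch simulates a toy nonlinear state-space model. The rotation dynamics, tanh readout, and Gaussian observation noise are illustrative assumptions, not a model from the paper:

```python
import numpy as np

def simulate_ssm(f, g, x0, T, state_noise_std=0.1, obs_noise_std=0.05, rng=None):
    """Simulate x_t = f(x_{t-1}) + eps_t, y_t = g(x_t) + obs noise."""
    rng = np.random.default_rng(rng)
    d = x0.shape[0]
    x = np.empty((T, d))
    ys = []
    x_prev = x0
    for t in range(T):
        x_prev = f(x_prev) + state_noise_std * rng.standard_normal(d)
        x[t] = x_prev
        ys.append(g(x_prev) + obs_noise_std * rng.standard_normal())
    return x, np.array(ys)

# Example: contracting 2-D rotation observed through a scalar nonlinearity.
theta = 0.1
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
f = lambda x: 0.99 * A @ x        # nonlinear models simply swap this function
g = lambda x: np.tanh(x[0])       # bounded nonlinear readout
x, y = simulate_ssm(f, g, np.array([1.0, 0.0]), T=100, rng=0)
```

Swapping `f` for an arbitrary (e.g., neural-network) map gives exactly the class of models the paper targets.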
In a streaming setting where data are processed one sample at a time, joint inference of the state and its nonlinear dynamics has posed significant challenges in practice. In this study, we are interested in online algorithms that can recursively solve the dual estimation problem of learning both the latent trajectory, x_1:t, in the state-space and the parameters of the model, {θ, ψ}, from streaming observations [12].
Popular solutions, such as the extended Kalman filter (EKF) or the unscented Kalman filter (UKF) [13], build an online dual estimator using nonlinear Kalman filtering by augmenting the state-space with the parameters [13], [14], [15], [16]. While powerful, they usually provide coarse approximations to the filtering distribution and involve many hyperparameters that must be tuned, which hinders their practical performance. Moreover, they do not take advantage of modern stochastic gradient optimization techniques commonly used throughout machine learning, and they are not easily applicable to arbitrary observation likelihoods.
Recursive stochastic variational inference has been proposed for streaming data assuming either independent [17] or temporally-dependent samples [6], [18], [19]. However, the proposed variational distributions are not guaranteed to be good approximations to the true posterior. As opposed to variational inference, sequential Monte Carlo (SMC) leverages importance sampling to build an approximation to the target distribution in a data streaming setting [20], [21]. However, its success heavily depends on the choice of proposal distribution, and the (locally) optimal proposal distribution is usually available only in the simplest cases [20]. While work has been done on learning good proposals for SMC [22], [23], [24], [25], most methods are designed only for offline scenarios, targeting the smoothing distribution instead of the filtering distribution. In [22], the proposal is learned online, but the class of dynamics to which this is applicable is extremely limited.
In this paper, we propose a novel sequential Monte Carlo method for inferring a state-space model in the streaming time series scenario that adapts the proposal distribution on-the-fly by optimizing a surrogate lower bound to the log normalizer of the filtering distribution. Moreover, we choose the sparse Gaussian process (GP) [26] for modeling the unknown dynamics, which allows for O(1) recursive Bayesian inference. Specifically, our contributions are: 1) we prove that our approximation to the filtering distribution converges to the true filtering distribution; 2) our objective function allows for unbiased gradients, which lead to improved performance.

STREAMING VARIATIONAL MONTE CARLO
The filtering posterior is given by

p(x_t | y_1:t) ∝ p(y_t | x_t) p(x_t | y_1:t−1),  (2)

which recursively uses the previous filtering posterior through the predictive distribution, p(x_t | y_1:t−1) = ∫ p(x_t | x_t−1) p(x_t−1 | y_1:t−1) dx_t−1. However, the above posterior is generally intractable except in limited cases [12], and thus we turn to approximate methods. Two popular approaches for approximating (2) are sequential Monte Carlo (SMC) [20] and variational inference (VI) [27], [28], [29]. In this work, we propose to combine sequential Monte Carlo and variational inference, which allows us to utilize modern stochastic optimization while leveraging the flexibility and theoretical guarantees of SMC. We refer to our approach as streaming variational Monte Carlo (SVMC). For clarity, we review SMC and VI in the following sections.
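The predict-update recursion above can be checked numerically in a small discrete-state example, where the integral becomes a matrix-vector product; the transition matrix, prior, and likelihood values below are made up for illustration:

```python
import numpy as np

# Hypothetical 3-state example of one step of Bayesian filtering.
P = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])     # P[i, j] = p(x_t = j | x_{t-1} = i)
prior = np.array([0.5, 0.3, 0.2])      # p(x_{t-1} | y_{1:t-1})
lik = np.array([0.1, 0.7, 0.2])        # p(y_t | x_t) for each state

predict = prior @ P                    # p(x_t | y_{1:t-1}): the "integral"
posterior = lik * predict              # Bayes rule, unnormalized
posterior /= posterior.sum()           # p(x_t | y_{1:t})
```

For continuous, nonlinear dynamics this integral has no closed form, which is precisely why the paper turns to SMC and VI.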

Sequential Monte Carlo
SMC is a sampling-based approach to approximate Bayesian inference that is designed to recursively approximate a sequence of distributions p(x_0:t | y_1:t) for t = 1, . . ., using samples from a proposal distribution, r(x_0:t | y_1:t; λ_0:t), where λ_0:t are the parameters of the proposal [20]. Due to the Markovian nature of the state-space model in (1), the smoothing distribution, p(x_0:t | y_1:t), can be factorized as in (3). We enforce the same factorization for the proposal, r(x_0:t | y_1:t; λ_0:t) = r_0(x_0; λ_0) ∏_{j=1}^{t} r_j(x_j | x_{j−1}, y_j; λ_j). A naive approach to approximating (3) is standard importance sampling (IS) [30]: N samples are drawn from the proposal distribution, x^1_0:t, . . ., x^N_0:t ∼ r(x_0:t; λ_0:t), and are assigned importance weights, which can be computed recursively as in (6). The samples and their corresponding weights, {(x^i_0:t, w^i_0:t)}_{i=1}^{N}, are used to build an approximation to the target distribution,

p(x_0:t | y_1:t) ≈ Σ_{i=1}^{N} w̄^i_0:t δ_{x^i_0:t}(x_0:t),

where δ_x is the Dirac delta function centered at x. While straightforward, naive IS suffers from weight degeneracy: as the length of the time series, T, increases, all but one of the importance weights go to 0 [20]. To alleviate this issue, SMC leverages sampling-importance-resampling (SIR). Suppose at time t−1 we have an approximation to the smoothing distribution, where w^i_{t−1} is computed according to (6). Given a new observation, y_t, SMC starts by resampling ancestor variables, a^i_t ∈ {1, . . ., N}, with probability proportional to the importance weights, w^j_{t−1}. N samples are then drawn from the proposal, x^i_t ∼ r_t(x_t | x^{a^i_t}_{t−1}, y_t; λ_t), and their importance weights, w^i_t, are computed according to (6). The introduction of resampling allows for a (greedy) solution to the weight degeneracy problem: particles with high weights are deemed good candidates and are propagated forward, while those with low weights are discarded.
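A minimal sketch of one SIR step (resample ancestors, propose, reweight) might look as follows; the 1-D random-walk model, bootstrap proposal, and Gaussian likelihood are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def sir_step(particles, log_w, y_t, propose, log_weight, rng):
    """One sampling-importance-resampling step of SMC.
    propose(x_prev, y_t, rng)      -> sample x_t from the proposal r_t
    log_weight(x_prev, x_t, y_t)   -> log of p(y|x) p(x|x_prev) / r_t"""
    N = len(particles)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    ancestors = rng.choice(N, size=N, p=w)          # resample ancestor indices
    x_prev = particles[ancestors]
    x_new = np.array([propose(xp, y_t, rng) for xp in x_prev])
    log_w_new = np.array([log_weight(xp, xn, y_t)
                          for xp, xn in zip(x_prev, x_new)])
    return x_new, log_w_new

# Bootstrap-style toy example: 1-D random walk with Gaussian likelihood.
rng = np.random.default_rng(0)
def propose(xp, y, rng):            # proposal = transition (bootstrap)
    return xp + 0.1 * rng.standard_normal()
def log_weight(xp, xn, y):          # weight then reduces to the likelihood
    return -0.5 * (y - xn) ** 2
parts = rng.standard_normal(200)
logw = np.zeros(200)
parts, logw = sir_step(parts, logw, y_t=0.5, propose=propose,
                       log_weight=log_weight, rng=rng)
```

The weighted particle set after each step is the discrete approximation to the filtering distribution described in the text.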
The updated approximation to p(x_0:t | y_1:t) is given by (9), where x^i_0:t = (x^{a^i_t}_{0:t−1}, x^i_t). Marginalizing out x_0:t−1 in (9) gives an approximation to the filtering distribution, p(x_t | y_1:t) = ∫ p(x_0:t | y_1:t) dx_0:t−1. As a byproduct, the weights produced in an SMC run yield an unbiased estimate of the marginal likelihood of the smoothing distribution [21] and a biased but consistent estimate of the marginal likelihood of the filtering distribution [21], [31]. For completeness, we reproduce the consistency proof of (12) in section A of the appendix. The recursive nature of SMC makes it constant complexity per time step and constant memory, because only the samples and weights generated at time t, {w^i_t, x^i_t}_{i=1}^{N}, are needed, making it a perfect candidate for use in an online setting [32]. These attractive properties have allowed SMC to enjoy much success in fields such as robotics [33], control engineering [34], and target tracking [35].
The success of an SMC sampler crucially depends on the design of the proposal distribution, r_t(x_t | x_t−1, y_t; λ_t). A common choice for the proposal distribution is the transition distribution, r_t(x_t | x_t−1, y_t; λ_t) = p(x_t | x_t−1), which is known as the bootstrap particle filter (BPF) [36]. While simple, it is well known that the BPF needs a large number of particles to perform well and suffers in high dimensions [37]. In addition, the BPF requires knowledge of p(x_t | x_t−1), which may not be available.
Designing a proposal is even more difficult in an online setting, because a proposal distribution that was optimized for the system at time t may not be the best proposal K steps ahead. For example, if the dynamics were to change abruptly, a phenomenon known as concept drift [38], the previous proposal may fail for the current time step. Thus, we propose to adapt the proposal distribution online using variational inference. This allows us to utilize modern stochastic optimization to adapt the proposal on-the-fly while still leveraging the theoretical guarantees of SMC.

Variational Inference
In contrast to SMC, VI takes an optimization approach to approximate Bayesian inference. In VI, we approximate the target posterior, p(x_t | y_1:t), by a class of simpler distributions, q(x_t; ϑ_t), where ϑ_t are the parameters of the distribution. We then minimize a divergence (usually the Kullback-Leibler divergence (KLD)) between the posterior and the approximate distribution in the hope of making q(x_t; ϑ_t) closer to p(x_t | y_1:t). If the divergence used is the KLD, then minimizing it is equivalent to maximizing the so-called evidence lower bound (ELBO) [29], [27], given in (13). For filtering inference, the intractability introduced by marginalizing over p(x_t−1 | y_1:t−1) in (13) makes the problem much harder to optimize, rendering variational inference impractical in a streaming setting where incoming data are temporally dependent.

A Tight Lower Bound
Due to the intractability of the filtering distribution, the standard ELBO is difficult to optimize, forcing us to define a different objective function. As stated above, the sum of importance weights is an unbiased estimator of p(y_1:t). Jensen's inequality applied to (11) [25], [39] gives (14); expanding (14) isolates a log p(y_t | y_1:t−1) term. Leveraging this, we propose to optimize (17). We call L_t the filtering ELBO; it is a lower bound to the log normalization constant (log partition function) of the filtering distribution, where R_t(N) accounts for the bias of the estimator (12). As R_t(N) is not a function of Θ_t, it can be ignored when computing gradients.
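The core computation behind the filtering ELBO, the log of the average unnormalized importance weight, can be sketched as follows; the Gaussian toy target and sample count are illustrative assumptions used only to check the bound against a known normalizer:

```python
import numpy as np

def filtering_elbo_estimate(log_w):
    """Log of the average unnormalized weight, log((1/N) sum_i w^i).
    Up to the additive constant -log N this matches log(sum_i w^i).
    Computed via log-sum-exp for numerical stability."""
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).mean())

# Toy check: prior N(0,1) as proposal, likelihood N(y | x, 1).
# The true log normalizer is log N(y; 0, 2), available in closed form.
rng = np.random.default_rng(0)
y = 1.0
x = rng.standard_normal(5000)                            # proposal samples
log_w = -0.5 * np.log(2 * np.pi) - 0.5 * (y - x) ** 2    # log p(y | x)
L_hat = filtering_elbo_estimate(log_w)
true_log_Z = -0.5 * np.log(2 * np.pi * 2.0) - y ** 2 / 4.0
```

By Jensen's inequality the estimate is biased downward, and the bias vanishes as the number of samples grows, which is the content of Theorem 2.1.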
As the number of samples goes to infinity, (17) can be made arbitrarily tight; as a result, the implicit approximation to the filtering distribution (18) becomes arbitrarily close to the true posterior, p(x_t | y_1:t), almost everywhere, which allows for a trade-off between accuracy and computational complexity. We note that this result is not applicable in most cases of VI due to the simplicity of the variational families used.
We summarize this result in the following theorem (see the proof in section B of the appendix).
Theorem 2.1 (Filtering ELBO). The filtering ELBO (17), L_t, is a lower bound to the logarithm of the normalization constant of the filtering distribution, p(x_t | y_1:t). As the number of samples, N, goes to infinity, L_t becomes arbitrarily close to log p(y_t | y_1:t−1).
Theorem 2.1 leads to the following corollary [41] (proof in section C of the appendix).

Stochastic Optimization
As in variational inference, we fit the parameters of the proposal, dynamics, and observation model, Θ_t = {λ_t, θ_t, ψ_t}, by maximizing the (filtering) ELBO (Alg. 1). While the expectations in (17) are not in closed form, we can obtain unbiased estimates of L_t and its gradients with Monte Carlo. Note that when obtaining gradients with respect to Θ_t, we only need to take gradients of E[log p(y_t | y_1:t−1)]. We also assume that the proposal distribution, r(x_t | x_t−1, y_t; λ_t), is reparameterizable, i.e., we can sample from r(x_t | x_t−1, y_t; λ_t) by setting x_t = h(x_t−1, y_t, ε_t; λ_t) for some function h, where ε_t ∼ s(ε_t) and s is a distribution independent of λ_t. Thus we can express the gradient of (17) using the reparameterization trick [42] as in (19), where L ≤ N is the number of subsamples used to accelerate the calculation. In Algorithm 1, we perform N_SGD stochastic gradient descent (SGD) updates at each time step. While using more samples, N, will reduce the variational gap between the filtering ELBO, L_t, and log p(y_t | y_1:t−1), using more samples, L, for estimating (19) may be detrimental for optimizing the parameters, as it has been shown to decrease the signal-to-noise ratio (SNR) of the gradient estimator for importance-sampling-based ELBOs [43]. The intuition is as follows: as the number of samples used to compute the gradient increases, the bound gets tighter and tighter, which in turn causes the magnitude of the gradient to become smaller. The magnitude decreases much faster than the variance of the estimator, driving the SNR to 0. In practice, we found that using a small number of samples to estimate (19), L < 5, is enough to obtain good performance.
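The reparameterization trick can be illustrated on a toy objective where the true gradient is known in closed form; the quadratic objective and sample count below are assumptions for demonstration (the paper uses small L for optimization, while a large L is used here only to verify the estimator is unbiased):

```python
import numpy as np

def reparam_grad_mu(mu, sigma, f_grad, L, rng):
    """Reparameterized estimate of d/dmu E_{x ~ N(mu, sigma^2)}[f(x)]:
    sample eps ~ N(0,1), set x = mu + sigma * eps, average f'(x)."""
    eps = rng.standard_normal(L)       # noise independent of (mu, sigma)
    x = mu + sigma * eps               # differentiable transformation
    return f_grad(x).mean()

rng = np.random.default_rng(0)
# f(x) = -x^2, so E[f] = -(mu^2 + sigma^2) and d/dmu E[f] = -2*mu.
g = reparam_grad_mu(1.5, 0.5, lambda x: -2.0 * x, L=4000, rng=rng)
# g should be close to the exact gradient -3.0
```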

Learning Dynamics with Sparse Gaussian Processes
State-space models allow various time series models to represent the evolution of the state and ultimately predict the future [44]. While in some scenarios there exists prior knowledge of the functional form of the latent dynamics, f_θ(x), this is rarely the case in practice; thus f_θ(x) must be learned online as well. While one can assume a parametric form for f_θ(x), e.g., a recurrent neural network, and learn θ through SGD, this does not allow uncertainty over the dynamics to be expressed, which is key for many real-time, safety-critical tasks. An attractive alternative to parametric models is the Gaussian process (GP) [45]. Gaussian processes do not assume a functional form for the latent dynamics; rather, general assumptions, such as continuity or smoothness, are imposed. Gaussian processes also allow for a principled notion of uncertainty, which is key when predicting the future.
A Gaussian process is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution. It is completely specified by its mean and covariance functions. A GP allows one to specify a prior distribution over functions, where m(•) is the mean function and k(•, •) is the covariance function; in this study, we assume that m(x) = x. With the GP prior imposed on the dynamics, one can do Bayesian inference with data. With the current formulation, a GP can be incorporated by augmenting the state-space to (x_t, f_t), where f_t ≡ f(x_{t−1}). The importance weights are then computed according to (21). Examining (21), it is evident that naively using a GP is impractical for online learning, because its space and time complexity are proportional to the number of data points, which grows with time t, i.e., O(t^2) and O(t^3) respectively. In other words, the space and time costs increase as more and more observations are processed.
To overcome this limitation, we employ the sparse GP method [26], [46]. We introduce M inducing points, z = (z_1, . . ., z_M), where z_i = f(u_i) and u_i are pseudo-inputs, and impose that the prior distribution over the inducing points is p(z) = N(0, k(u, u′)). In the experiments, we spread the inducing points uniformly over a finite volume in the latent space. Under the sparse GP framework, we assume that z is a sufficient statistic for f_t, i.e.,
where m_t = m(x_{t−1}). Note that the inclusion of the inducing points in the model reduces the computational complexity to be constant with respect to time. Marginalizing out f_t in (22) yields (23). Equipped with (23), we can express the smoothing distribution as (24), and the importance weights can be computed according to (25). Due to the conjugacy of the model, p(z | x_0:t−1) can be updated recursively and efficiently. Let p(z_t | x_0:t−1) = N(z_t | µ_{t−1}, Γ_{t−1}). Given x_t, Bayes' rule yields the recursive updating rule (26)-(27). To facilitate learning in non-stationary environments, we impose a diffusion process over the inducing variables: letting p(z_{t−1} | x_0:t−1) = N(µ_{t−1}, Γ_{t−1}), we impose z_t = z_{t−1} + η_t, where η_t ∼ N(0, σ_z^2 I), and rewrite (25) accordingly. To lower the computational cost, we marginalize out the inducing points from the model, simplifying (29). For each stream of particles, we store µ^i_t and Γ^i_t. Due to the recursive updates (27), maintaining and updating µ^i_t and Γ^i_t is of constant time complexity, making it amenable to online learning. The use of particles also allows for easy sampling of the predictive distribution (details are in section E of the appendix). We call this approach SVMC-GP; the algorithm is summarized in Algorithm 2.
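The recursive conjugate update of the inducing-point posterior can be sketched as a Kalman-style rank-one update; the scalar-observation form and variable names below are illustrative, not the paper's exact equations (26)-(27):

```python
import numpy as np

def gp_inducing_update(mu, Gamma, a, f_obs, noise_var):
    """One recursive conjugate update of the inducing-point posterior
    N(mu, Gamma), assuming a scalar pseudo-observation f_obs = a @ z + noise,
    where a is the sparse-GP predictive weight vector for the new input.
    Cost is O(M^2) per step, constant in the length of the time series."""
    s = a @ Gamma @ a + noise_var        # predictive variance of f_obs
    k = Gamma @ a / s                    # Kalman-style gain
    mu_new = mu + k * (f_obs - a @ mu)   # posterior mean update
    Gamma_new = Gamma - np.outer(k, a @ Gamma)   # posterior covariance update
    return mu_new, Gamma_new

# Demo: prior N(0, I) over M = 3 inducing values; one update.
mu, Gamma = np.zeros(3), np.eye(3)
a = np.array([1.0, 0.0, 0.0])
mu, Gamma = gp_inducing_update(mu, Gamma, a, f_obs=2.0, noise_var=1.0)
```

Each particle stream would carry its own (mu, Gamma) pair and apply this update once per time step, which is what keeps SVMC-GP constant-time online.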

Design of Proposals
As stated previously, the accuracy of SVMC depends crucially on the functional form of the proposal. The (locally) optimal proposal is (33), which is a function of y_t and f_t [47]. In general, (33) is intractable; to emulate it, we parameterize the proposal as r(x_t | x_t−1, y_t; λ_t) = N(µ_{λ_t}(x_t−1, y_t), diag(σ^2_{λ_t}(x_t−1, y_t))), where µ_{λ_t} and σ_{λ_t} are neural networks with parameters λ_t.
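A sketch of such a proposal, assuming a one-hidden-layer MLP and a diagonal Gaussian; the layer sizes and initialization are illustrative, not the paper's architecture:

```python
import numpy as np

class GaussianProposal:
    """Maps (x_{t-1}, y_t) to the mean and log-std of a diagonal Gaussian,
    as in the parameterized proposal r(x_t | x_{t-1}, y_t; lambda_t)."""
    def __init__(self, dx, dy, hidden=32, rng=None):
        rng = np.random.default_rng(rng)
        d_in = dx + dy
        self.W1 = rng.standard_normal((hidden, d_in)) / np.sqrt(d_in)
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((2 * dx, hidden)) / np.sqrt(hidden)
        self.b2 = np.zeros(2 * dx)
        self.dx = dx

    def sample(self, x_prev, y_t, rng):
        h = np.tanh(self.W1 @ np.concatenate([x_prev, y_t]) + self.b1)
        out = self.W2 @ h + self.b2
        mu, log_sigma = out[:self.dx], out[self.dx:]
        eps = rng.standard_normal(self.dx)        # reparameterized sample
        return mu + np.exp(log_sigma) * eps, mu, np.exp(log_sigma)

rng = np.random.default_rng(1)
proposal = GaussianProposal(dx=2, dy=3, rng=0)
x_t, mu, sigma = proposal.sample(np.zeros(2), np.ones(3), rng)
```

Because the sample is an explicit function of the network weights and an independent noise variable, gradients of the filtering ELBO can flow through it via the reparameterization trick.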

RELATED WORKS
Much work has been done on learning good proposals for SMC. The method proposed in [24] iteratively adapts its proposal for an auxiliary particle filter. In [22], the proposal is learned online using expectation-maximization, but the class of dynamics for which the approach is applicable is extremely limited. The method proposed in [23] learns the proposal by minimizing the KLD between the smoothing distribution and the proposal, D_KL[p(x_0:t | y_1:t) || r(x_0:t; λ_0:t)]; while this approach can be used to learn the proposal online, biased importance-weighted estimates of the gradient are used, which can suffer from high variance if the initial proposal is bad. Conversely, [25] maximizes E[log p(y_1:t)], which can be shown to minimize the KLD between the proposal and the implicit smoothing distribution, D_KL[q(x_0:t | y_1:t) || p(x_0:t | y_1:t)]; biased gradients were used to lower the variance. In contrast, SVMC allows for unbiased and low-variance gradients that target the filtering distribution as opposed to the smoothing distribution. In [48], the proposal is parameterized by Riemannian Hamiltonian Monte Carlo and the algorithm updates the mass matrix by maximizing E[log p(y_1:t)]. At each time step (and for every stochastic gradient), the Hamiltonian must be computed forward in time using numerical integration, making the method impractical for an online setting. Previous works have also tackled the dual problem of filtering and learning the parameters of the model online. A classic approach is to let the parameters of the generative model evolve according to a diffusion process, θ_t = θ_t−1 + υ_t: one can then create an augmented latent state, x̃_t = [x_t, θ_t], and perform filtering over x̃_t either using particle filtering [49] or joint extended/unscented Kalman filtering [16], [15]. One can also use a separate filter for the latent state and the parameters, which is known as dual filtering [16], [15]. As SVMC is a general framework, we could also let the
parameters of the generative model evolve according to a diffusion process and do joint/dual filtering; the availability of the filtering ELBO allows us to learn the variance of the diffusion online, while previous approaches treat this as a fixed hyperparameter. Moreover, as we demonstrate in later experiments, we can learn the parameters of a parametric model online by performing SGD on the filtering ELBO. In [50], extended Kalman filtering (EKF) is combined with Gaussian processes for dual estimation; the use of the EKF involves approximations and restricts the observation models to which it can be applied. Moreover, the use of a full Gaussian process (as opposed to a sparse alternative) prevents it from being deployed for long time series. In [2], particle filtering is combined with a sparse Gaussian process to learn the latent dynamics and the emission model online; while similar to SVMC-GP, there are important differences between the two works. First, and most importantly, the latent dynamics are not learned online in [2]: training data are collected a priori and used to pre-train the GP, which is kept fixed during the filtering process. While a priori training data can also be used for SVMC-GP, our formulation allows us to continuously learn the latent dynamics in a purely online fashion. Second, a fixed proposal (similar to the one found in bootstrap particle filtering) is used, while SVMC-GP allows the proposal to adapt on-the-fly. In [19], the dual estimation problem is tackled by leveraging the recursive form of the smoothing distribution to obtain an ELBO that can be easily computed online, allowing the parameters of the generative model to be inferred using SGD. While similar to SVMC, we note that their approach relies on simple parametric variational approximations, which are not as expressive as the particle-based ones used in SVMC.
Figure 1. Investigating how the performance of SVMC, measured via RMSE (lower is better) and ELBO (higher is better), depends on the number of gradient steps, N_SGD, and the number of particles used to compute the stochastic gradient, L. For each setting, we run 100 realizations of SVMC on the chaotic RNN data from section 4.1.2. Solid lines are the average; error bars are the standard error. (A) For a fixed number of particles used to compute the stochastic gradient, L = 4, the number of gradient steps, N_SGD, taken at every time step is varied. (B) For a fixed number of gradient steps, N_SGD = 15, the number of particles used to compute the stochastic gradient, L, is varied.

EXPERIMENTS
To showcase the power of SVMC, we employ it in a number of simulated and real experiments. For all experiments, the Adam optimizer was used [52].

Linear Dynamical System
As a baseline, we apply SVMC to data generated from a linear dynamical system (LDS), (35). The LDS is the de facto dynamical system for many fields of science and engineering due to its simplicity and efficient exact filtering (i.e., Kalman filtering). The use of an LDS also endows a closed-form expression of the log marginal likelihood for the filtering and smoothing distributions. Thus, as a baseline we compare the estimates of the negative log marginal likelihood, −log p(y_1:T), produced by SVMC, variational sequential Monte Carlo (VSMC) [25] (which is an offline method), and the BPF [36] in an experiment similar to the one used in [25]. We generated data from (35) with T = 50, d_x = d_y = 10, (A)_ij = α^(|i−j|+1) with α = 0.42, and Q = R = I, where the state and emission parameters are fixed; the true negative log marginal likelihood is 1168.2. For SVMC and VSMC, we used the same proposal parameterization as [25], r(x_t | x_t−1; λ_t) = N(µ_t + diag(β_t) x_t−1, diag(σ^2_t)), where λ_t = {µ_t, β_t, σ^2_t}. To ensure VSMC has enough time to converge, we use 25,000 gradient steps. To equate the total number of gradient steps between VSMC and SVMC, 25,000/50 = 500 gradient steps were done at each time step for SVMC. For both methods, a learning rate of 0.01 was used, and L = 4 particles were used for computing gradients, as in [25]. To equate the computational complexity between SVMC and BPF, we ran the BPF with 125,000 particles. We fixed the data generated from (35) and ran each method for 100 realizations; the average negative ELBO and its standard error for each of the methods are shown in Table 1. To investigate the dependence of the ELBO on the number of particles, we show results for SVMC and VSMC using a varying number of particles.
From Table 1, we see that SVMC performs better than VSMC for all numbers of particles considered. While SVMC with 100 particles is outperformed by BPF, SVMC with 1,000 particles matches the performance of BPF with a smaller computational budget.

Chaotic Recurrent Neural Network
To show the performance of our algorithm in filtering data generated from a complex, nonlinear, and high-dimensional dynamical system, we generate data from a continuous-time "vanilla" recurrent neural network (vRNN), (37), where W(t) is Brownian motion. Using Euler integration, (37) can be described as a discrete-time dynamical system, (38), where ∆ is the Euler step. The emission model is (39), where each dimension of the emission noise, v_t, is independently and identically distributed (i.i.d.) according to a Student's t distribution, ST(0, ν_y, σ_y), where ν_y is the degrees of freedom and σ_y is the scale.
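A minimal sketch of the Euler-discretized vRNN simulation, using parameter values of the same order as those quoted in this experiment; the exact noise scaling is an assumption, since the discretized equation (38) is not reproduced here:

```python
import numpy as np

def simulate_vrnn(W, tau, gamma, delta, noise_std, T, rng):
    """Euler discretization of the continuous-time vanilla RNN:
    tau * dx = (-x + gamma * W @ tanh(x)) dt + noise."""
    d = W.shape[0]
    x = np.zeros((T, d))
    x_t = rng.standard_normal(d)
    for t in range(T):
        drift = (-x_t + gamma * W @ np.tanh(x_t)) / tau
        x_t = x_t + delta * drift \
              + np.sqrt(delta) * noise_std * rng.standard_normal(d)
        x[t] = x_t
    return x

rng = np.random.default_rng(0)
d = 10
W = rng.standard_normal((d, d)) / np.sqrt(d)   # i.i.d. N(0, 1/d) weights
x = simulate_vrnn(W, tau=0.025, gamma=2.5, delta=0.001,
                  noise_std=0.1, T=500, rng=rng)
```

With gamma above 1, such networks typically exhibit chaotic activity, which is what makes this a demanding filtering benchmark.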
We set d_y = d_x = 10, and the elements of W_x are drawn i.i.d. from N(0, 1/d_x). We set γ = 2.5, which produces chaotic dynamics at a relatively slow timescale compared to τ [53]. The rest of the parameter values are τ = 0.025, δ = 0.001, Q = 0.01I, ν_y = 2, and σ_y = 0.1, which are kept fixed. We generated a single time series of length T = 500 and fixed it across multiple realizations. SVMC was run using 200 particles with proposal distribution (34), where the neural network was a multi-layer perceptron (MLP) with 1 hidden layer of width 100 and ReLU nonlinearities; 15 gradient steps were performed at each time step with a learning rate of 0.001 and L = 4. For comparison, a BPF with 10,000 particles and an unscented Kalman filter (UKF) were run. Each method was run over 100 realizations. We compare the ELBO and root mean square error (RMSE) between the true latent states, x_1:T, and the inferred latent states, x̂_1:T, using the posterior mean as our estimate of the latent states. With a similar computational budget, SVMC can achieve better performance than a BPF using almost two orders of magnitude fewer samples. To investigate the effect the number of gradient steps has on the performance of SVMC, we plot the RMSE and ELBO as a function of the number of gradient steps (Fig. 1A); taking more gradient steps leads to a decrease in RMSE and an increase in the ELBO. We next investigate the effect of the number of samples used to compute the stochastic gradient, L (Fig. 1B); as was demonstrated in [43], larger L leads to a decrease in performance.

Synthetic NASCAR ® Dynamics
We test learning dynamics online with the sparse GP on synthetic data whose underlying dynamics follow a recurrent switching linear dynamical system [51]. The simulated trajectory resembles the NASCAR® track (Fig. 2A). We train the model with 2,000 observations simulated from y_t = Cx_t + ξ_t, where C is a 50-by-2 matrix. The proposal is defined as N(µ_t, Σ_t), where µ_t and Σ_t are linear maps of the concatenation of the observation y_t and the previous state x^i_t−1. We use 50 particles, a squared exponential (SE) kernel, 20 inducing points for the GP, and a 1E-4 learning rate. We also run SVMC (with the same settings for particles and learning rate) with MLP dynamics (1 hidden layer, 20 hidden units) for comparison. The GP dynamics not only estimate the velocity field but also give the uncertainty over the estimate, while the MLP dynamics give only a point estimate. To investigate the learning of dynamics, we control for other factors, i.e., we fix the observation model and other hyperparameters, such as the noise variances, at their true values (see the details in section D of the appendix). Figure 2A shows the true (blue) and inferred (red) latent states. The inference quickly catches up with the true state and stays on the track. As the state oscillates on the track, the sparse GP learns a reasonable limit cycle (Fig. 2F) without seeing much data outside the track. The velocity fields in Figure 2D-F show the gradual improvement during online learning. The 500-step prediction also shows that the GP captures the dynamics (Fig. 2B). We compare SVMC with the Kalman filter in terms of mean squared error (MSE) (Fig. 2C). The transition matrix of the LDS of the Kalman filter (KF) was learned by expectation-maximization, which is an offline method, hence not truly online. Yet SVMC performs better than the KF after 1000 steps.

Winner-Take-All Spiking Neural Network
The perceptual decision-making paradigm is a well-established cognitive task in which a typically low-dimensional decision variable needs to be integrated over time, and subjects are close to optimal in their performance. To understand how the brain implements such neural computation, many competing theories have been proposed [54], [55], [56], [57], [58]. We test our method on a simulated, biophysically realistic cortical network model for a visual discrimination experiment [57]. In the model, there are two excitatory subpopulations that are wired with slow recurrent excitation and feedback inhibition to produce attractor dynamics with two stable fixed points. Each fixed point represents a final perceptual decision, and the network dynamics amplify the difference between conflicting inputs and eventually generate a binary choice. The simulated data were organized into decision-making trials. Because the original network was designed for a one-time experiment, we modified it by injecting a 60 Hz Poisson input into the inhibitory subpopulation at the end of each trial to "reset" the network, producing uninterrupted trials that fit the streaming case. In each trial, the input to the network consisted of two periods: one 2-sec stimulus signal with different strengths of visual evidence controlled by "coherence", and one 2-sec 60 Hz reset signal, each following a 1-sec spontaneous period (no input). We subsampled 480 selective neurons out of the 1600 excitatory neurons from the simulation to be observed by our algorithm.

Figure 5. Prediction performance on 3D data generated from an analog stable oscillator circuit. We compare dual EKF and SVMC, both with dynamics parameterized by an MLP (2-20-2). (A) Normalized MSE of 100-time-step prediction using the filtered system. Median and best out of 11 randomly initialized filters are shown. To estimate the normalized MSE, 11 realizations were used, and for ease of visual parsing an 11-bin moving-window average was applied. (B) Comparison of normalized MSE over the last 500 time steps.

Our method, run with L = 2, a learning rate of 1E-4, and 15 gradient steps, did well at learning the dynamics of the target network. The inferred latent trajectories of several trials are shown in Fig. 3B. In each trial, the network started at the initial state, eventually reached either choice (indicated by the color) after the stimulus onset, and finally returned near the initial state after receiving the reset signal. The other three panels of Fig. 3B show the phase portraits of the inferred dynamical system, revealing the bifurcation and how the network dynamics are governed during different phases of the experiment. In the spontaneous phase (when the network receives no external input), the latent state is attracted by the middle sink. After the stimulus onset, the middle sink disappears and the latent state falls to either side, driven by noise, to form a choice. When the reset comes on, the latent state is pushed back to the only sink, which is close to the middle sink of the spontaneous phase, and then the network is ready for a new trial. We generated a latent trajectory and corresponding spike train by replicating the experiments on the fitted model (Fig. 3C). The result shows that the model can replicate the dynamics of the target network.
The mean-field reduction of the network (Fig. 6 [58]) also confirms that our inference accurately captured the key features of the network. Note that our inference was done without knowledge of the true underlying dynamics, which means our method can recover the dynamics from data in a bottom-up fashion.

Nonstationary System
Another feature of our method is that it never stops updating its estimate of the state dynamics. As a result, the algorithm is adaptive and can potentially track slowly varying (nonstationary) latent dynamics. To test adaptation to perturbations of both the state and the system, we compared the dual EKF and the proposed approach (50 particles, GP dynamics with 10 inducing points and a squared exponential kernel, linear proposal, 1E-4 learning rate) on a 2D nonstationary linear dynamical system. A spiral-in linear system was suddenly changed from clockwise to counter-clockwise rotation at the 2000th step, and the latent state was perturbed (Fig. 4). To accommodate the EKF, we used Gaussian observations generated through a linear map from the 2-dimensional state to 200-dimensional observations with additive noise (N(0, 0.5)). To focus on the dynamics, we fixed all parameters except the transition matrix for both methods, while our approach additionally had to learn the recognition model. Our method adapted within a few steps.
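The nonstationary test system can be sketched as follows. The rotation angle, the decay rate, and the random observation map are illustrative assumptions; the text specifies only a 2D spiral-in linear system whose rotation flips at the 2000th step, a fixed linear map to 200 dimensions, and N(0, 0.5) observation noise:

```python
import numpy as np

def rotation(theta, decay=0.99):
    """Spiral-in linear dynamics: a contraction composed with rotation by theta."""
    c, s = np.cos(theta), np.sin(theta)
    return decay * np.array([[c, -s], [s, c]])

def simulate_switching(T=4000, switch=2000, dy=200, seed=0):
    """Simulate the nonstationary system described in the text: rotation
    flips from clockwise to counter-clockwise at step `switch`, the state
    is perturbed at the switch, and observations are a fixed linear map
    into `dy` dimensions plus N(0, 0.5) noise."""
    rng = np.random.default_rng(seed)
    A_cw, A_ccw = rotation(-0.1), rotation(+0.1)  # clockwise, then counter-clockwise
    C = rng.standard_normal((dy, 2))              # fixed linear observation map
    x = np.array([1.0, 0.0])
    X, Y = [], []
    for t in range(T):
        A = A_cw if t < switch else A_ccw
        x = A @ x
        if t == switch:                           # perturb the state at the switch
            x = x + rng.standard_normal(2)
        y = C @ x + np.sqrt(0.5) * rng.standard_normal(dy)
        X.append(x)
        Y.append(y)
    return np.array(X), np.array(Y)
```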

Real Data: Learning an Analog Circuit
The above synthetic experiments verified that the proposed methodology is capable of learning the underlying dynamics from noisy streaming observations. To test it in the real world, we applied our method to the voltage readout of an analog circuit [59]. We designed and built this circuit to realize a system of ordinary differential equations. The circuit oscillated at approximately 2 Hz, and the sampling rate was 2000 Hz.
We assume the following model: x_t = f(x_{t-1}) + ε_t, y_t = C x_t + d + ξ_t, where x_t ∈ R^2, y_t ∈ R^3, ε_t ∼ N(0, 10^-3) and ξ_t ∼ N(0, 10^-3). We parameterize f(·) with an MLP (1 hidden layer, 20 hidden units) and perform dual estimation using SVMC and the dual EKF on 3,000 time steps (Fig. 5A). For the dual EKF we chose two levels of diffusion (0.001, 0.0001) on the parameters to implement different learning rates. At every filtering step we forecast 10 realizations 100 steps ahead and show the mean and standard deviation of the log MSE against the true observations (Fig. 5A). As the dual EKF has trouble learning the parameters of the observation model, we fixed C and d for the dual EKF, while SVMC (500 particles, learning rate 1E-4, 15 gradient steps) learned the parameters of the latent dynamics as well as C and d. The figure shows that SVMC achieves the same level of performance with less variance; its slower convergence at the beginning is due to learning more parameters. The inferred dynamics show that a limit cycle can implement the oscillation (Fig. 5B). The predictions of future observations (500 steps) resemble the oscillation, and the true observations are covered by 100 repeated predictions (Fig. 5C). The predictions started at the final state of the training data, and we simulated future observation trajectories from the trained model without seeing any new data; this prediction procedure was repeated 100 times. Figure 5D shows the normalized predictive MSE (relative to the mean observation over time); the solid line is the mean normalized MSE and the shading is the standard error. Since the simulation included state noise, the predictions diverge from the true observations over time.
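As a minimal sketch, the assumed state-space model can be simulated as follows. The MLP weights are randomly initialized stand-ins for the learned dynamics, and the readout parameters are illustrative assumptions rather than the fitted C and d:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer MLP f: R^2 -> R^2 with 20 tanh units, standing in for
# the learned dynamics; these random weights are illustrative only.
W1, b1 = 0.5 * rng.standard_normal((20, 2)), np.zeros(20)
W2, b2 = 0.5 * rng.standard_normal((2, 20)), np.zeros(2)
f = lambda x: W2 @ np.tanh(W1 @ x + b1) + b2

C, d = rng.standard_normal((3, 2)), np.zeros(3)  # linear readout R^2 -> R^3

def simulate_ssm(T=1000):
    """Simulate x_t = f(x_{t-1}) + eps_t, y_t = C x_t + d + xi_t with
    N(0, 1e-3) state and observation noise, as in the assumed model."""
    x = np.zeros(2)
    xs, ys = [], []
    for _ in range(T):
        x = f(x) + np.sqrt(1e-3) * rng.standard_normal(2)
        y = C @ x + d + np.sqrt(1e-3) * rng.standard_normal(3)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```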

DISCUSSION
In this study we developed a novel online learning framework, leveraging variational inference and sequential Monte Carlo, that enables flexible and accurate Bayesian joint filtering. Our derivation shows that the filtering posterior can be made arbitrarily close to the true one for a wide class of dynamics models and observation models. Specifically, the proposed framework can efficiently infer a posterior over the dynamics using sparse Gaussian processes by augmenting the state with inducing variables that follow a diffusion process. By exploiting Bayes' rule, our recursive proposal on the inducing variables does not require gradient-based optimization. Constant time complexity per sample makes our approach amenable to online learning scenarios and suitable for real-time applications. In contrast to previous work, we demonstrate that our approach can accurately filter the latent states under linear and nonlinear dynamics, recover complex posteriors, faithfully infer dynamics, and provide long-term predictions. In the future, we plan to focus on reducing the computation time per sample, which could allow real-time application to faster systems. On the Gaussian process side, we would like to investigate priors and kernels that characterize properties of dynamical systems, as well as their hyperparameters.
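As a rough illustration of the constant per-sample cost, a single filtering step in the sequential Monte Carlo family touches each of the N particles exactly once. The sketch below shows a plain bootstrap particle filter step (resample, propagate, reweight), not the full SVMC update with its variational proposal and gradient steps:

```python
import numpy as np

def pf_step(particles, log_w, y, f, log_lik, rng):
    """One constant-time step of a bootstrap particle filter, a simplified
    stand-in for the SVMC update. `f` propagates an array of particles
    through the (stochastic) dynamics; `log_lik(y, x)` is the observation
    log-likelihood of one particle."""
    N = len(particles)
    # Resample according to the normalized weights
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)
    # Propagate the survivors through the dynamics model
    particles = f(particles[idx])
    # Reweight by the observation likelihood of the new measurement y
    log_w = np.array([log_lik(y, x) for x in particles])
    return particles, log_w
```

Each call costs O(N) regardless of the time index, which is what makes streaming operation feasible.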

Figure 2 .
Figure 2. NASCAR® Dynamics [51]. (A) True and inferred latent trajectory using SVMC-GP. (B) Filtering and prediction. We show the last 500 steps of filtered states and the following 500 steps of predicted states. (C) Forecasting error. We compare the 500-step predictive MSE (averaged over 100 realizations) of SVMC-GP, SVMC-MLP, and the Kalman filter. The transition matrix of the Kalman filter was learned by EM (offline). The periodic tendency is due to the periodic nature of the ground truth. (D)-(E) Inferred dynamics as a velocity field. For SVMC-GP, the posterior variance of the dynamics is additionally shown as uncertainty.

Figure 3 .
Figure 3. Winner-Take-All Spiking Neural Network. (A) 4 trials of training data. The neuronal activity is drawn over a 25-sec time window; each row represents one neuron, and each dot indicates that the neuron fired within that time bin. (B) Inference. The top-left panel shows the inferred latent trajectories of several trials. In each trial the network started at the initial state, eventually reached either choice (indicated by the color) after stimulus onset, and finally returned to near the initial state after receiving the reset signal. The other three panels show phase portraits of the inferred dynamical system, revealing the bifurcation and how the network dynamics were governed during different phases of the experiment. In the spontaneous phase (when the network receives no external input), the latent state is attracted to the middle sink. After stimulus onset, the middle sink disappears and the latent state, driven by noise, falls to either side to form a choice. At reset onset, the latent state is pushed back to the only remaining sink, which is close to the middle sink of the spontaneous phase, and the network is then ready for a new trial. (C) Simulation from the fitted model. We generated a latent trajectory and spike train by replicating the experiments on the fitted model. The result shows that the model can replicate the dynamics of the target network.

Figure 4 .
Figure 4. Prediction of a nonstationary dynamical system. The colored curves (blue: EKF, red: SVMC) are the RMSEs (solid line: mean, shading: standard error) of one-step-ahead prediction of the nonstationary system during online learning (50 trials per run, 10 runs). The linear system was changed and the state was perturbed at the 2000th step (center). Both online algorithms learned the change within a few steps.

Table 1
Experiment 1 (LDS) with 100 replication runs (true negative log-likelihood: 1168.12). The average negative ELBO and runtime are shown with the standard error for SVMC, VSMC, and BPF, where the number in parentheses is the number of particles used.