Variational Hierarchical Mixtures for Probabilistic Learning of Inverse Dynamics

Well-calibrated probabilistic regression models are a crucial learning component in robotics applications as datasets grow rapidly and tasks become more complex. Unfortunately, classical regression models are usually either probabilistic kernel machines with a flexible structure that does not scale gracefully with data or deterministic and vastly scalable automata, albeit with a restrictive parametric form and poor regularization. In this paper, we consider a probabilistic hierarchical modeling paradigm that combines the benefits of both worlds to deliver computationally efficient representations with inherent complexity regularization. The presented approaches are probabilistic interpretations of local regression techniques that approximate nonlinear functions through a set of local linear or polynomial units. Importantly, we rely on principles from Bayesian nonparametrics to formulate flexible models that adapt their complexity to the data and can potentially encompass an infinite number of components. We derive two efficient variational inference techniques to learn these representations and highlight the advantages of hierarchical infinite local regression models, such as dealing with non-smooth functions, mitigating catastrophic forgetting, and enabling parameter sharing and fast predictions. Finally, we validate this approach on large inverse dynamics datasets and test the learned models in real-world control scenarios.


INTRODUCTION
P RINCIPLED data-driven, adaptive, and incremental learning is desirable in domains in which datasets are dynamic and accumulate slowly over time.For example, robots must build models of their dynamics and the environment as they interact with the world.Moreover, these models must be computationally efficient during both learning and evaluation.In the case of general-purpose robots, these models must incorporate different modalities of continuous and discrete stochastic random variables and possibly incorporate heteroscedastic noise [1], [2].Predominant and successful regression techniques, such as Gaussian process regression (GPR) [3], artificial neural networks (ANNs) [4], and local regression (LR) [5], have a mixed set of properties that are useful in different scenarios.
Two categories of LR exist [40], lazy learners that maintain all seen data points in memory [29], [31], and memoryless learners that compress data by constructing basis functions in the input space and fitting and storing locally parameterized regression models [35].Prominent examples of the latter include receptive field weighted regression (RFWR) [30] and locally weighted projection regression (LWPR) [32].However, these methods are often difficult to tune as they possess many hyperparameters.
A limited attempt at a Bayesian treatment of LR is made in [41] by constructing local nonparametric kernels and placing gamma priors on the kernel widths to alleviate the need to tune the basis functions.This approach leads to a localized GP formulation that needs to retain the training data in memory, again leading to the computational issues of vanilla GPR.Local Gaussian regression (LGR) is a further Bayesian generalization of LR [42].The authors treat the local models in a Bayesian framework and couple them via the loss function that reinforces global coordination.Nonetheless, both approaches rely on heuristics and thresholds for adding and pruning local models and fall short of formulating a generative model over input and output.
Following this introduction, we believe that local regression with a proper generative Bayesian treatment can x Fig. 1: Left, gap data learned with infinite local regression (ILR).The top plot depicts the mean prediction (red) on the training data (dots) and the true mean function (dashed).The blue area represents the predictive uncertainty of two standard deviations.The predictive uncertainty in areas lacking training data is large, and the mean prediction falls back to the prior.Right, the CMB dataset learned by ILR.The top figure depicts the mean prediction (red) with a predictive uncertainty of three standard deviations (shaded blue).ILR captures the heteroscedastic spread of the data with a handful of local regression models.The bottom plots show the activation of the models over the input space.
potentially serve as a powerful general-purpose function approximator.Moreover, as a probabilistic and efficient representation, it can drive many low-level applications in control and robotics that favor fast predictions and do not require deep representations.
We introduce two probabilistic graphical mixture models for local regression in the upcoming sections.The first, infinite local regression (ILR) [43], is a generative formulation that relies on the paradigm of Bayesian nonparametrics (BNP) [44] to automatically grow the mixture size based on observed data.This technique ultimately results in a general formulation of related methods that alleviates the need for any heuristic considerations.However, despite the effectiveness of ILR, and like other local regression techniques that rely on locally linear or polynomial approximations, it maintains a one-to-one correspondence between the activations and local regression units.This effect limits the model's capacity to share parameters across the input space and often forces the generation of duplicate components, needlessly increasing the overall number of parameters.To address this limitation, we introduce hierarchical infinite local regression (HILR), a multi-level development of ILR that enables multi-modal activations of the same regression component, giving the model a structure that allows sharing of regression parameters across repeating local patterns in the data.This architecture increases the flexibility of the representation and contributes towards its compression.
For learning these models, we derive two general variational Bayes (VB) schemes [45] that efficiently infer the posterior parameters and overcome the need for computationally heavy sampling methods.We benchmark the models on a range of tasks highlighting their strengths, such as dealing with heteroscedasticity, non-continuous functions, and multi-modal activation.Additionally, we test on large real-world high-dimensional datasets to benchmark learn-ing the inverse dynamics of robotic manipulators.Most importantly, we deploy an instance of ILR to perform inverse dynamics control on a real Barrett-WAM manipulator.
Alternative Bayesian extensions of the generative mixture of experts exist in the literature.A prominent area of research focused on Gaussian processes mixtures, in which local components are modeled by separate GPs [46], [47].Given that a single Gaussian process is an excellent approximator of nonlinear trends, the motivation for constructing such experts is not to improve the quality of approximation but rather to reduce the computational complexity and memory requirements of GPs by associating every component only with a slice of the overall data, thus limiting the negative effects of cubic scaling.While this motivation is understandable from a computational standpoint, it is often unclear how to decompose a nonlinear function into a mixture of nonlinear experts, especially when each expert has the representational flexibility of a GP.In contrast, models like ILR and HILR rely on locally linear experts that offer a natural unit of local approximation.
Finally, the infinite mixtures used in ILR and HILR to regularize model complexity are rooted in Bayesian nonparametrics.Here we reference influential work on Markov chain Monte Carlo (MCMC) techniques for Bayesian nonparametric density estimation [48], [49], [50], [51], which developed the first seeds of Bayesian inference for Dirichlet processes [52], [53].These concepts inspired comparable infinite mixture regression models.However, these attempts exclusively relied on expensive Gibbs sampling algorithms [54], [55], [56], [57].Instead, we focus on developing efficient deterministic variational inference (VI) algorithms that dramatically improve the practical aspects of training and deploying Bayesian finite and infinite mixture models [58], [59].For a detailed discussion on the scalability of different paradigms of Bayesian inference, we refer the reader to [60].
This paper is organized as follows.In Section 2, we review Bayesian linear regression, finite and infinite mixture models, and variational inference.Section 3 presents the infinite local regression (ILR) model and a variational Bayes inference scheme [43].In Section 4, we introduce the hierarchical infinite local regression (HILR), an extension of ILR, that incorporates multi-modal activations and parameter sharing.Finally, in Section 5, we test ILR and HILR on a set of simulated and real-world benchmarks.

PRELIMINARIES
This section introduces related concepts, such as Bayesian linear regression, Bayesian mixture models, Dirichlet processes, and variational inference.

Bayesian Linear Regression
We start by discussing the Bayesian treatment of a single component of a Bayesian local regression model, namely Bayesian linear regression [61].The conditional data likelihood takes a feature vector x ∈ R m as a random input variable and returns a random response variable y ∈ R d according to a linear mapping A : R m → R d , a bias vector c, and additive zero-mean noise with a precision matrix V y = Ax + c + e, e ∼ N(0, V).
For a fully Bayesian treatment, we consider all parameters of this model to be random variables on which we place proper conjugate priors.In this case, we place matrix-normal (MN) and normal-Wishart (NW) priors on the matrix A, the bias coefficient c, and precision matrix V where M, the mean of A, is a d × m matrix and V and K are d × d and m × m that serve as row and column precision matrices of A, respectively.The mean θ is an m-dimensional vector, and the scalar ρ modulates the amplitude of the precision.Finally, the parameters of the Wishart distribution are the d × d positive definite scale matrix Φ and the degrees of freedom η.Due to the conjugate nature of the priors, the joint posteriors p(A, c, V | D) are matrix-normal and normal-Wishart distributions, conditioned on the data of N independent and identically distributed data pairs D = {(x 1 , y 1 ), . . ., (x N , y N )}.

Bayesian Finite Mixture Models
Gaussian mixture models (GMMs) are hierarchical latent variable models with universal approximation capabilities for arbitrary continuous densities.This insight is of central interest when connected to density estimation for local regression models, which are themselves universal nonlinear function approximators [5].A finite K-component Gaussian mixture of a random variable x is a weighted linear combination of densities with K unique mean vectors µ k and precision matrices Λ k .The latent quantity z is a one-hot random variable distributed according to a categorical distribution p(z) = Cat(π), governed by the weights π = {π 1 , . . ., π K } that satisfy 0 ≤ π k ≤ 1 and K k=1 π k = 1.The Bayesian extension of this model [58] introduces a conjugate normal-Wishart prior on the means and precision matrices (µ k , Λ k ) ∼ NW(λ), where λ contains the hyperparameters.Furthermore, a conjugate Dirichlet prior, with a concentration parameter α, is placed on the mixing weights π ∼ Dir(α).This Bayesian perspective has proven effective in regularizing the shortcomings of GMMs by allowing superfluous components to fall back onto their priors instead of severely over-fitting to small clusters.This effect can be understood as sparsification bias over K [62], [63].

Dirichlet Process and Stick-Breaking
Scaling the finite mixtures in Section 2.2 to infinite components requires placing a nonparametric Dirichlet process (DP) prior on the parameters of the mixture.A DP is a distribution over probability measures G.We write G ∼ DP(α, H), where α is the concentration parameter and H is the base measure [64], [65].Intuitively, a Dirichlet process is a distribution over distributions, meaning each draw G is itself a distribution.The base distribution H is the mean of the DP, and the concentration parameter α can be interpreted as an inverse variance.The larger α is, the smaller the variance, and the process concentrates more of its mass around the mean distribution H.
We will rely on the stick-breaking construction [66] of a DP as an algorithmic realization.Stick-breaking delivers an infinite sequence of mixture weights π k of an infinite mixture model from the stochastic process This process is sometimes denoted as π ∼ GEM(α) [64].The stick-breaking procedure describes how the random variables s k , representing stick lengths, are drawn from a beta distribution and combined to obtain the mixture weights π k .If the concentration parameter α increases, the magnitude of the mixing weights π k decreases on average, and the number of possible active components increases.
This representation of DPs can be used to replace the priors placed on the finite Gaussian mixture model [59].In such a setting, the base H is a normal-Wishart distribution, and the sampled measure G ∼ DP(α, NW) is a draw of an unbounded number of parameters (µ k , Λ k ) ∼ NW(λ) for an infinite number of clusters, associated with an infinite number of weights π k generated by the stick-breaking process.These draws from a Dirichlet process are discrete with probability one, which leads to the clustering effect of the DP.Eventually, the same parameters will be sampled over and over, forcing the associated data points to cluster.

Variational Inference of Structured Models
MCMC [67], and VI [68] have become the two main approaches for (approximate) probabilistic inference in graphical models.While MCMC constructs a stochastic sampling process that converges to the posterior, VI formulates the inference task as a deterministic optimization problem.Despite its reliance on a coarser functional posterior approximation, VI is often preferred in settings with large number of parameters.Moreover, the deterministic nature of VI circumvents the issue of label switching in Monte-Carlobased posterior inference of mixture models.
In a nutshell, in variational inference, a typically intractable posterior is approximated by a tractable functional distribution q(β) that minimizes the Kullback-Leibler divergence (KL) to true posterior p(β | D) where D is observed data and the vector β subsumes both the parameters θ and hidden indicators z in structured models.Note that the KL in the above objective is modeseeking, meaning it will lock on one mode of a possibly multi-modal posterior.In general, the posterior p(β | D) is unknown, because the normalizer p(D) is not tractable.In consequence, the KL cannot be minimized directly but rather optimized via a related objective that is equal up to the constant term equivalent to the evidence p(D) This modified objective is denoted as the negative evidence lower bound (ELBO) and can be reformulated to take the traditional maximization objective of VI algorithms ).The choice of the approximate posterior distribution q(β) is open.In this paper, we focus on variational Bayes methods that rely on the (structured) mean-field assumption as a general recipe for maximizing the ELBO [45], [69].This approximation requires that the posterior factorizes over the set of the hidden variables q(β) = M i=1 q i (β i ).It is emphasized that no other assumptions are made about q(β).The resulting posterior will be determined solely by the assumed likelihood and priors.
Specifically, we follow the scheme of variational Bayes expectation-maximization (VBEM) as a probabilistic generalization of expectation-maximization (EM).This approach constitutes a coordinate ascent scheme that iteratively optimizes the ELBO for individual factors of the approximate posterior q(β) while holding the others constant A more practical version of this optimization can be achieved by using stochastic variational inference (SVI), a batched stochastic gradient ascent approach.In the case of the conjugate exponential family, SVI not only facilitates scalability over large datasets but also resembles a natural gradient ascent algorithm on the ELBO with favorable convergence properties [70].

THE INFINITE MIXTURE OF LOCAL LINEAR RE-GRESSION MODEL
Using the previously presented concepts of Bayesian linear regression, Bayesian mixture models, and Dirichlet processes, we now construct the Bayesian infinite local regression (ILR) model.
Our approach to tackling the regression problem is a Bayesian joint density estimation technique over the input x and target y, both conditioned on a latent discrete label z.To avoid the need for a fixed number of components over z, we assume a generative model driven by a stickbreaking process with a concentration parameter α and a base distribution H, which is a product of a normal-Wishart, a Matrix-normal, and Wishart distribution.
The data generation proceeds as depicted in Figure 2. The stick-breaking process is sampled to generate the categorical weights π, mixture activations {µ k , Λ k } ∞ k=1 , and and those parameters are used to generate the one-hot labels z n and input-output data pairs x n , y n Notice that the densities over the input space, parameterized by (µ k , Λ k ), naturally play the role of basis functions or so-called receptive fields as in the receptive field weighted regression (RFWR) [31] and locally weighted projection regression (LWPR) [32] algorithms.Given these modeling assumptions, the regression objective can be cast as a joint density inference problem of the posterior p(Z, s, µ, The result of this inference is then leveraged to perform predictions via conditioning and marginalization.

Variational Bayes for ILR
For inference, we focus on a variational Bayes EM algorithm that alternates between a variational expectation step (Estep) and a maximization step (M-step).Deriving such an algorithm for this model requires pinning down the following definitions of the likelihood, prior, and posterior.
Complete Data Likelihood.For the general case of multivariate regression with m inputs and d outputs, we assume the following structured joint likelihood over the inputs, outputs, and indicator variables where the dimensions of all quantities follow the notation of Bayesian linear regression.
Infinite Conjugate Prior.We construct a factorized conjugate infinite mixture prior This prior assumes that the cluster means µ k and precision matrices Λ k are sampled from normal-Wishart distributions while matrix-normal-Wishart and normal-Wishart priors are placed on the regression coefficients (A k , c k ) and the precision matrices The parameters π k are generated by a stick-breaking process , where the stick lengths s are independently beta distributed Truncated Mean-Field Posterior.We rely on a structured mean-field approximation of the posterior [69] that factorizes between the labels q(Z) and the remaining parameters q(s, µ, Λ, A, c, V), thus automatically leading to the decomposition p(.| D) ≈ q(Z) q(s) q(µ, Λ) q(A, c, V).Further, we follow [59] by allowing a truncation of the posterior while maintaining an infinite prior, so that q(s K where r n are the expected responsibilities of the mixture.During evaluation, the truncation threshold K is chosen to be very high and is seldom reached.

Variational Expectation
Step.In the E-step, the responsibilities are computed by following the recipe of VBEM in Section 2.4.The responsibilities are the variational parameters of the posterior categorical The expectations associated with the data likelihoods of x n and y n can be straightforwardly computed [71].However, the expectations associated with infinite-dimensional categorical require more careful consideration [59].We provide the necessary details in the appendix.

Variational Maximization
Step.The M-step updates the remaining variational distributions given an approximation of the categorical posterior as follows Each of these updates reflects a conjugate computation of K log-posterior densities given a log-prior and a log-likelihood weighted by the responsibilities r nk = E[z nk ].We provide more details and general computational recipes based on conjugacy rules in the appendix.

Posterior Predictive Distribution
For predicting the function value ŷ conditioned on a test query x, we marginalize the likelihood over the posterior parameters to get the joint posterior predictive density.To make the marginalization tractable, we replace the true posterior with our approximate variational posterior q(.| D) inferred under a training dataset D. The conditional predictive for a single component z = k is a conditional multivariate Student's t-distribution of the form where we have defined Additionally, the joint activation of a component k is a Student's t-distribution weighted by the expected categorical probability under the posterior stick-breaking process These K-activation probabilities enable two prediction techniques.A mode-prediction, where the most likely active component is selected and used to perform prediction with the corresponding linear regression model, or a mean-prediction, that averages the predictions of all components weighted by their activation probabilities.

Computational Complexity
We calculate the training-time computational cost to be O(N K(d + m) 3 ), which can be straightforwardly reduced to O(LK(d + m) 3 ) by applying stochastic updates [70], where L is the batch size.This result shows linear scalability with the data, which is considerably more efficient than simple variants of GPR.The test-time complexity of a mean prediction is O(K(d 3 + dm)), which combines the input membership query and the linear matrix transformation for every model k.This computation is, in contrast to GPR, independent of the training data size, hence the advantage of memoryless locally-parametric representations during realtime critical applications.

THE HIERARCHICAL INFINITE MIXTURE OF LO-CAL LINEAR REGRESSION MODEL
The local regression model presented in Section 3 offers a very flexible and well-regularized alternative to previously developed approaches [30], [32], [42].However, like similar local regression representations, it cannot account for shiftinvariance across the input space.This drawback can cause it to generate duplicate regression components to account for similar local function trends across disconnected regions of the input space.This limitation results from the hierarchical design that directly couples activations and local function units via a one-to-one correspondence and enforces a uni-modal activation per regression component.This coupling can be observed in the definition of the likelihood in Section 3, where the activation and the local regression units share the same assignment variable.It stands to reason that ILR does not offer enough flexibility and hinders parameter sharing between components.Therefore, we present a modified formulation of ILR that explicitly accounts for shift-invariance in the input space and provides the freedom to create regression units with multimodal, theoretically infinitely-modal, activations, if needed.The resulting model, hierarchical infinite local regression (HILR), is an infinite mixture over infinite mixtures that shares similarities with existing representations developed for hierarchical clustering [72], [73], [74].
We start by describing the generative process of the hierarchical mixture over mixtures as depicted in Figure 3. HILR consists of two levels.At the upper level, a meta Dirichlet process generates the stick-breaking weights ω, the meta-activations {τ m , Λ m } ∞ m=1 , and the shared slope and output precision matrices where β is the concentration parameter of the upper-level DP and t are the generated stick lengths.
HILR's ability to account for shift-invariance is a result of allowing multi-modal activations.That way, every regression unit can be associated with multiple local regions in the input space.This structure is realized by endowing every upper-level component m with its own local DP.These lower-level DPs, with a concentration parameter α, generate the weights π m , the activation centers {µ mk } ∞ k=1 and the shift coefficients We assume here that the lower-level centers µ mk and coefficients c mk are tied via their respective upper-level precision matrices Λ m and V m .
Finally, given all necessary parameters, the upper-and lower-level labels h n , z n and the data pairs x n , y n are sampled according to the likelihoods Notice how the upper-level generative process resembles the structure of ILR.However, in this model, the metaactivations (τ m , Λ m ) induce another hierarchy over the lower-level multi-modal activation centers µ mk .Moreover, the upper-level DP only accounts for the shared slope of the local regression units A m , whereas the shift coefficients c mk are induced at the lower level, as they are influenced by the individual activation centers µ mk .
Similar to the procedure in Section 3, we aim to infer a posterior over the parameters of this model given a set of inputs X = {x 1 , . . ., x N } and outputs Y = {y 1 , . . ., y N }.
To that end, we derive a structured variational Bayes expectation-maximization algorithm for this model.We start the derivation by defining the complete data likelihood, the infinite prior, and the mean-field posterior factorization.Fig. 3: A unified plate notation for hierarchical infinite tied mixtures of Bayesian local regression models.This model outlines a two-level architecture that allows sharing of parameters between single components in order to compress the representation.It can be interpreted as a local regression model with multi-modal activations.Each unit of the upper level is itself a mixture of local regression models that share the same slope A m and output precision V m .Each of these m different slopes can be activated at k unique lower-level input regions centered around µ mk and tied via a shared input precision Λ m .Both the upper-and lower-level mixtures are governed by independent Dirichlet process priors.
Complete Data Likelihood.The likelihood model is a two-level precision-tied joint density over the observations (X, Y) and the one-hot upper-and lower-labels (H, Z) where the conditioning of the lower-level labels z n on the upper-level labels h n manifests as a one-hot selector of the corresponding lower-level categoricals Infinite Conjugate Prior.We assume a factorized twolevel precision-tied conjugate infinite mixture prior The meta-activation prior is a normal-Wishart distribution over the meta-centers τ m and precision matrices while the activation centers µ mk are sampled from a conditional normal distribution The mappings A m and precision matrices V m are sampled form a matrix-normal-Wishart while the biases c mk are drawn from a K-tied conditional normal distribution Finally, the stick-breaking priors p(t, s) follow the definitions from Section 3 Truncated Mean-Field Posterior.We assume a structured decomposition of the posterior that leads to conjugate computation while maintaining the dependencies between the discrete labels, the input activations, and the regression parameters, respectively.Moreover, we apply the truncation scheme from [59] to establish a tractable posterior approximation while maintaining an infinite prior p(.| D) ≈ q(H) q(Z | H) q(t) q(s) q(µ, τ , Λ) q(A, c, V) where g n and r n are the upper-and lower-level posterior responsibilities, respectively.

Variational Expectation
Step.The E-step computes the joint posterior categorical over joint labels H and where these expectations can computed in a similar fashion to Section 3.More details can be found in the appendix.

Variational Maximization
Step.The M-step updates the variational gating, activations, and regression parameters As previously stated, these updates resemble posterior computations weighted by the responsibilities g nm = E[h nm ] and r nmk = E[z nmk ].The appendix provides further details.

Posterior Predictive Distribution
Prediction with HILR is akin to that with ILR, as described in Section 3.2.We briefly state the conditional predictive for a component h = m and an activation z = k as where the computation of a mk is similar to that in Section 3.2.Further, the weight of the k−th activation of the m−th component is computed as follows p(x, ĥ = m, ẑ = k | D) ∝ E q(t) p( ĥ = m | ω(t))

EMPIRICAL EVALUATION
We evaluate the presented models on a range of tasks.Our goals are (1) to highlight the advantages of ILR and HILR, such as dealing with out-of-distribution predictions, recovering an input-dependent noise function, hierarchical gating, sharing parameters, and the ability to perform Bayesian sequential updates, (2) to benchmark the models on high dimensional datasets from real robots, and (3) to deploy the models in a real-world scenario to further empirically demonstrate its validity.A reference implementation can be found under https://github.com/hanyas/mimo.Out-of-distribution Uncertainty.In Figure 1 (left), we apply ILR on a synthetic dataset with two large gaps.We observe how the predictive uncertainty strongly reflects the lack of training data in these regions and how the mean prediction falls back to the prior values.This example highlights the model's reasonable uncertainty quantification.The out-of-distribution behavior of ILR is strongly influenced by the discrete gating and a query's activation probability that jointly define the overall membership weights.
Heteroscedastic Noise.We test on two different problems with input-dependent noise, the cosmic microwave background (CMB) [75], Figure 1 (right), and a synthetic dataset from a stochastic sinc function y(x) = sinc(x) + ϵ, where the noise ϵ is distributed according to zero-mean normal with a standard deviation σ ϵ (x) = 0.05 + 0.2(1 + sin(2x))/(1 + e −0.2x ).In Figure 4 (left), we see that ILR can approximate the nonlinear functions well.In particular, the heteroscedastic noise function is recovered in great detail.
Hierarchical Parameter Sharing.In Figure 4 (right), we test HILR's ability to share slope parameters via multimodal activations.We consider a dataset stemming from a periodic triangle signal overlayed with additive noise.HILR decides to activate two upper-and two lower-level regions to match the data structure, despite having more degrees of freedom at each level.5: Left and middle, learning discontinuous functions with ILR.The top figures show the mode-prediction (red) and two standard deviations confidence (shaded blue).The left example is a simple step function that can be captured with linear features, while in the middle, we use a polynomial transformation of the input for more flexibility.Right, we tackle inverse mapping problems with ILR.This example includes scattered data that maps the input x to multiple output values y.A discriminative modeling approach fails in these scenarios, as it tries to capture the ambiguous mean of the function f : x → y.By approximating the joint density over both input and output and using mode-prediction, ILR can reconstruct these non-unique relations via local linear approximations.The bottom plots show the activation over the input space.Discontinuous (polynomial) Functions.In Figure 5 (left), a combination of step functions is fitted using modeprediction, as described in Section 3.2.By using polynomial input features, more expressive local models can be realized.Figure 5 (middle) depicts an example of cubic regressors, which are still linear in the parameters, fitted to data sampled from noisy cubic polynomials.
Inverse Mapping.One crucial advantage of generative over discriminative modeling is the ability to deal with non-unique inverse mapping problems.Such scenarios arise when the same input can be mapped to multiple output values.Joint modeling of the input-output data allows for flexible conditioning and alleviates the directional graph constraints.In Figure 5 (right), we show a simple example of how ILR can learn these mappings, adapted from [64].
Bayesian Sequential Updates.In Figure 6, we demonstrate a sequential learning problem.Data from a chirp signal arrives in three batches so that the posterior of previous batches serves as the prior for the next learning phase.ILR successfully captures the data trend, and no significant  6: Bayesian sequential updates.Mean (red) and a two standard deviations interval (shaded blue) of the predictive distribution fitted to sequentially arriving data (three batches) from the chirp dataset (gray dots).For the second and third plots, the posterior fitted to the previous batches is used as a prior to perform a Bayesian sequential update.There is no catastrophic forgetting of previously learned knowledge, and in regions with no data, the prediction falls back to the prior.32 000 GPR N/A N/A -SGPR (8.50 ± 0.03) × 10 −1 (6.000 ± 0.008) × 10 −3 -rBCM (4.52 ± 0.05) × 10 −1 (2.600 ± 0.030) × 10 −3 315 gPoE (4.60 ± 0.06) × 10 −1 (3.000 ± 0.075) × 10 −3 315 catastrophic forgetting is observed.The approximation errors due to the mean-field posterior factorization assumption have little influence because the posterior updates are localized in the input domain.Robot Inverse Dynamics.Next, we use ILR and HILR to learn the inverse dynamics of anthropomorphic manipulators.The dynamics of these manipulators are governed by the following equation u = M(q)q + C(q, q) + G(q) + ϵ(q, q, q), where (q, q, q) are joint angles, velocities and accelerations, and u are torques.M(q) is the inertia matrix, C(q, q) are the Coriolis and centripetal forces, and G(q) is the gravity force.ϵ(q, q, q) are general unmodelled nonlinearities such as sticktion/friction and hydraulic and tendon/cable dynamics, that motivate a data-driven approach to learn the mapping (q, q, q) → u.Later, we use the learned ILR model for online inverse dynamics control.
We use the mean squared error (MSE), normalized mean squared error (NMSE), and the number of local experts as evaluation criteria.These measures cover the prediction accuracy as well as the complexity of the learned model.We compare to popular (probabilistic) methods such as local Gaussian regression (LGR) [42], locally weighted projection regression (LWPR) [32], Gaussian process regression (GPR) [3] and sparse Gaussian process regression (SGPR) [10], and two scalable Gaussian process product of experts: the robust Bayesian committee machine (rBCM) [12] and the generalized product of experts (gPoE) [11].
We benchmark the prediction accuracy of all regression techniques on a high-dimensional dataset collected from a 7-DoF (degrees of freedom) anthropomorphic SARCOS arm  (2.30 ± 0.01) × 10 −3 -SGPR (1.80 ± 0.05) × 10 −1 (6.30 ± 0.02) × 10 −3 -rBCM (3.80 ± 0.35) × 10 −1 (19.00 ± 1.80) × 10 −3 100 gPoE (3.40 ± 0.13) × 10 −1 (16.00 ± 0.60) × 10 −3 100 [32].The dataset consists of 44484 training points and 4449 test cases.Overall there are 21 input variables (q, q, q) mapping to 7 motor torques u.We also benchmark on an inverse dynamics dataset from a 4-DoF Barrett-WAM manipulator, mapping from a 12-D to 4-D space.This dataset contains 25000 training and 5000 test pairs.Table 1 and Table 2 list the results for both datasets.We report the MSE, NMSE, and the number of active models over all joints, averaged over five seeds.We compute the mean and standard deviation for every cell in the table, except for LGR * , because of the unreasonable training times achieved while using the authors' code.When evaluating GPR on the SARCOS dataset, we faced GPU-memory constraints (32 GB), and we have discarded this evaluation.For rBCM and gPoE, we assigned an expert to every 1000 data points, repeated for every output dimension.
The results show that ILR and HILR outperform the related local regression methods LWPR and LGR, both in terms of prediction accuracy and number of active experts.Nonetheless, GPR is still the gold standard when the kernel size is within memory limits.Interestingly, the figures also indicate that ILR and HILR are competitive with sparse Gaussian process regression (SGPR) and the two product of experts rBCM and gPoE.Finally, the evidence reveals that HILR tends to activate roughly 10 − 15% fewer components than ILR.This observation hints that HILR may be taking advantage of shift-invariance patterns in the data and avoiding duplicate regression units.However, this hypothesis is hard to validate due to the data's high dimensionality.
Real Inverse Dynamics Control.Finally, we demonstrate the validity of the learned dynamics in a real-robot control scenario.In this evaluation, we tackle two exper- iments.In the first, we focus on the Bayesian sequential learning aspect, whereas the second highlights our model's ability to deal with massive amounts of data.Both experiments deploy an ILR model on the Barrett-WAM to track an 8-shaped trajectory in the xy-plane of the end-effector.
For the first experiment, we collect 30000 training samples (roughly 1 minute) from multiple trajectories with different velocity profiles and Bayesian sequential learning over 15 batches for multiple seeds.Figure 7 depicts the progression of the learning process.We then select the best model and perform model-based control to track held-out test trajectories with unseen velocity profiles.ILR predicts the feed-forward torques while supported by a low-gain PD-controller for safety considerations.We compare this controller to one with access to the analytical dynamics and the same low-gain PD-controller and to a model-free PDcontroller.Figure 8 depicts the tracking performance of the different controllers on two test trajectories.
In the second experiment, we learn a model covering a larger region of the state-action space.We generate a larger Barrett-WAM dataset consisting of 150000 training examples (roughly 5 minutes).We test the model learned by ILR on held-out test trajectories and compare it to the same controller constellation as in the first experiment.As benchmarking criteria, we evaluate the MSE with respect to the desired trajectory and the mean torque contribution of the PD-controller to the overall control signal.The rationale is that a good inverse dynamics model will consistently produce a low MSE without relying on the PD-controller's assistance.Table 3 shows the benchmarks on three test trajectories.The results indicate that ILR significantly improves tracking performance with minimal contribution from the PD-controller.Moreover, we are able to consistently achieve a prediction frequency of 2000 Hz during both tasks, although the Barrett-WAM robot requires only 500 Hz.

CONCLUSION
In this paper, we presented two probabilistic hierarchical local regression models, ILR and HILR, and derived an efficient variational inference technique for data-driven learning.These representations are based on infinite mixtures from Bayesian nonparametrics.We situate our contributions as the next iteration in a large family of local linear regression techniques such as RFWR, LWPR, and LGR.We have shown that placing Dirichlet process priors on Bayesian mixtures of local regression units can regularize model complexity with minor loss in performance and without relying on heuristics.Moreover, we have highlighted the advantage of the two-level architecture of HILR in compressing the learned models and reducing the number of active experts.Empirical evaluation indicates that the models outperform LWPR and LGR, and are competitive with sparse GPR and products of experts.Finally, we have empirically confirmed the practicality of this approach for online inverse dynamics control on a Barrett-WAM robot.Nonetheless, these presented concepts still suffer from multiple drawbacks.The mean-field assumption is a source of significant errors in posterior inference.Collapsed formulations of Dirichlet process priors promise better approximations [76].In addition, Bayesian mixture models are influenced by a large number of hyperparameters, which cannot be directly optimized via empirical Bayes [77], leading to lower predictive performance when compared to optimized GPR.Nonetheless, the ELBO offers an objective for hyperparameter optimization.Naive gradient-based techniques have proven too brittle due to their reliance on Euclidean distance metrics.A natural-gradient approach appears to be a suitable alternative in the future.
Further development of hierarchical local regression may focus on treating ILR and HILR as layers in a multilayered representation.This extension would allow the models to benefit from intermediate nonlinear projections into high dimensional spaces that have proven powerful in deep neural networks.Another practical consideration is to incorporate physical inductive biases such as inverse dynamics [78] to facilitate learning meaningful quantities.

ACKNOWLEDGMENTS
This work has received funding from the European Unions Horizon 2020 research and innovation program under grant agreement No. 640554 (SKILLS4ROBOTS).

APPENDIX A
In the following, we present more detailed derivations of the E-step and M-step of ILR and HILR.

E-Step of Infinite Linear Regression
where The individual expectations of the log-sticks under the beta posteriors are given by where Ψ is the Digamma function.

M-Step of Infinite Linear Regression
where r nk = E q(Z) [z nk ] are the expected responsibilities computed in the E-step.Further details on performing these updates via general exponential family recipes are provided in the upcoming sections.

M-Step of Hierarchical Infinite Linear Regression
log q(t) = E q(H) log p(H|ω(t)) + log p(t) where the quantities ĝ = g nm = E q(H) [h nm ] and r = r nmk = E q(Z) [z nmk ].After computing the necessary expectations, these updates correspond to the posterior update recipes described in later appendix sections.

APPENDIX B
Our work mainly considers random variables with probability density functions belonging to the exponential family.
The unified minimal parameterization of this class of distributions lends itself to convenient and efficient posterior computation when paired with conjugate priors.We assume the natural form for a probability density of a random variable x where h(x) is the base measure, η are the natural parameters, t(x) are the sufficient statistics and a(η) is the logpartition function, or log-normalizer.Following the same notation, a conjugate prior g(η|λ) to the likelihood f (x|η) has the form with prior sufficient statistics t(η) = [η, −a(η)] ⊤ and hyperparameters λ = [α, β] ⊤ .By applying Bayes' rule, we can directly infer the posterior q(η|x) q(η|x) ∝ f (x|η)g(η|λ) ∝ exp ρ(x, λ) • t(η) − a(ρ) , where the posterior natural parameters ρ(x, λ) are a function of the likelihood sufficient statistics t(x) and prior hyperparameters [α, β] ρ(x, λ) = α + t(x), β + 1 ⊤ .
The structure of the resulting posterior reveals a simple recipe for data-driven inference.By moving into the natural space, the posterior parameters are computed by combining the prior hyperparameters with the likelihood's sufficient statistics and log-partition function.By definition, every exponential family distribution has a minimal natural parameterization that leads to a unique decomposition of these quantities [79].

APPENDIX C
We present an outline of all M-step updates.We use an adapted form of the exponential natural parameterization, as it offers a clear methodology on how to derive and implement such updates for all relevant distributions.
The standard-form posterior parameters are

Infinite Categorical with Stick-Breaking Prior
We assume one-hot random variable z of infinite size with a categorical likelihood p(Z|π(s)) = arXiv:2211.01120v2 [cs.LG] 10 Sep 2023

Fig. 2 :
Fig.2:A unified plate notation for infinite mixtures of Bayesian local regression.Assuming Gaussian and linear Gaussian densities, the basis parameters (µ k , Λ k ) are sampled from a normal-Wishart distribution, while the regression parameters (A k , c k ) and precision matrices V k are sampled from a matrix-Normal-Wishart, for every component k.The latent variables z n assign every x n and y n to a component and are drawn from a categorical distribution parameterized by π.The mixture weights π are generated by a stick-breaking process with a concentration parameter α.

Fig. 4 :
Left, a challenging heteroscedastic example of a sinc function heavily overlayed with input-dependent noise.The figure shows the mean prediction (red) on the training data (dots) and the true mean function (dashed black) corrupted by noise (dashed green).The blue dashed lines represent the recovered noise process.Right, hierarchical local regression with HILR using parameter sharing in shift-invariant functions.The top figure shows the mode-prediction (red) along with two standard deviations of predictive uncertainty (shaded blue).The bottom plot highlights the multi-modal activation, leading to shared slope information over non-adjacent regions.

Fig. 7 : 8 -Fig. 8 : 8 -
Fig.7: 8-Shaped trajectory learning.Bayesian sequential updates on a dataset collected from a Barrett-WAM.We plot the NMSE for five different seeds on accumulated data over the number of batches.The NMSE consistently improves with new data, and no catastrophic forgetting is observed.
A weighted Gaussian likelihood over a random variable x ∈ R d has the following precision-based parameterization p(X|µ, Λ) = N n=1 N(x n |µ, Λ) wn ∝ exp

.w
The resulting posterior parameters areκ = κ 0 + n x n x ⊤ n − κ m m ⊤ ) −1 .Linear-Gaussian with Matrix-Normal-Wishart PriorA weighted linear-Gaussian likelihood takes a random input variable x ∈ R d and returns a response random variable y ∈ R m according to a linear mappingA : R d → R m p(Y|X, A, V) = N n=1 N(y n |x n , A, V) wn ∝ exp

,( 1 − s k ) α0− 1 ,
where w n are the weights.The prior is an infinitely factorized beta distribution over the stick lengths s kp(s) = ∞ k=1 Beta(s k |γ 0 , α 0 )and the posterior is also a factorized beta density truncated up to a level K[59] q(s) = K k=1 Beta(s k |γ k , α k ),whereγ k = γ 0 +

TABLE 2 :
Accuracy on the Barrett-WAM dataset

TABLE 3 :
Tracking error and torque contributed by the PDcontroller during the Barrett-WAM real robot task.