Adaptive Uncertainty-Guided Model Selection for Data-Driven PDE Discovery

We propose a new parameter-adaptive uncertainty-penalized Bayesian information criterion (UBIC) to prioritize the parsimonious partial differential equation (PDE) that sufficiently governs noisy spatial-temporal observed data with few reliable terms. Since the naive use of the BIC for model selection is known to yield an undesirable overfitted PDE, the UBIC penalizes the found PDE not only by its complexity but also by the quantified uncertainty, derived from the coefficient of variation of the model supports in a probabilistic view. We also introduce physics-informed neural network learning as a simulation-based approach to flexibly validate the selected PDE against the other discovered PDEs. Numerical results affirm the successful application of the UBIC in identifying the true governing PDE. Additionally, we reveal an interesting effect of denoising the observed data on improving the trade-off between the BIC score and model complexity. Code is available at https://github.com/Pongpisit-Thanasutives/UBIC.


Introduction
Data-driven discovery has emerged as a popular approach for uncovering the governing partial differential equation (PDE) of a dynamical system, offering flexibility and satisfactory accuracy without relying heavily on domain knowledge. Sparse regression is typically leveraged to approximate a linear combination of user-specified candidate terms that optimally balances the capability to predict temporal dynamics against model complexity; hence the so-called SINDy (sparse identification of nonlinear dynamics)-based approaches (Brunton, Proctor, and Kutz 2016), which have been successfully applied in diverse disciplines, from aerodynamics (Li et al. 2019) and biological transport (Lagergren et al. 2020) to epidemiology (Horrocks and Bauch 2020).
Diverse regularized regressors, designed to achieve the sparse identification, include STRidge (sequential threshold ridge regression in PDE-FIND) (Rudy et al. 2017), LASSO (least absolute shrinkage and selection operator) (Tibshirani 1996), and SR3 (sparse relaxed regularized regression) (Zheng et al. 2018). However, tuning the regularization hyperparameter(s) of these methods for model selection is challenging, because a chosen hyperparameter may deliver an overfitted model. It is also difficult to control the model complexity. As a result, there is a risk of overlooking the true governing equation of a particular complexity.
The mixed-integer optimization (MIO) for best-subset selection (Bertsimas, King, and Mazumder 2016) has been introduced to customize the desirable sparsity in the provably optimal MIO-SINDy (Bertsimas and Gurnee 2023). Various best-subset solvers are utilized to collect potential PDEs, from which one is automatically selected as the most suitable underlying PDE by an algorithm (Thanasutives et al. 2023). Although impressive progress has been made in deliberate consideration of likely PDEs consecutively arranged with a maximum complexity constraint, an important question remains: how do we select the best model that reveals the true governing PDE form?
Before the model selection, we assure the existence of such a suitable PDE within our collection of potential models by enhancing the quality of the observed data, which may be noisy. A few previous attempts to denoise the distorted observed data (and its derivatives) relied on derivative computation using polynomials, spline-based models, and robust PCA (principal component analysis) (Rudy et al. 2017; Sun et al. 2022; Li et al. 2020). To simplify the experimental design, we concentrate on our preferred denoising technique, which has numerically shown a positive impact on the model selection step: regularized K-SVD (Dumitrescu and Irofti 2017), a dictionary-learning method generalizing K-means clustering.
In the model selection given a finite set of potential PDEs, the Akaike information criterion (AIC) (Akaike 1974; Mangan et al. 2017; Dong et al. 2023) and the Bayesian information criterion (BIC) (Schwarz 1978) are commonly adopted as metrics evaluating point estimates of model parameters. However, in practice, the AIC and BIC values associated with a sequence of complexity-increasing linear models, fitted on an overcomplete candidate library to estimate the temporal dynamics, tend to keep decreasing, albeit insignificantly, even after the actual PDE is found. Therefore, naively selecting the PDE that minimizes the information criterion could lead to erroneous equations (Thanasutives et al. 2023).
In this paper, we suggest a new way to assess the goodness of an approximated model by quantifying its associated uncertainty. We go beyond point estimates of the model parameters or coefficients and instead put a posterior belief on them. For each potential PDE, Bayesian linear regression is modeled on its corresponding subset to obtain the posterior distribution of the coefficients. Then, the coefficients' uncertainty is quantified from this posterior. The UBIC-selected PDE can optionally be validated using simulation-based model selection, measuring the BIC of the simulated state solution of the PDE in directly predicting the denoised observed data. We investigate the use of a physics-informed neural network (PINN) (Raissi, Perdikaris, and Karniadakis 2019) as a differentiable automatic PDE solver. By incorporating physical laws as part of its learning constraints, the PINN enables a flexible PDE-solving approach.
We list the main contributions in what follows.
• The UBIC integrates the quantified uncertainty of each potential PDE to penalize the BIC, resulting in the true identification of the parsimonious and stable governing PDE without heavy reliance on hyperparameter tuning.
• We numerically exhibit the positive impact of denoising in terms of improving the trade-off given by the BIC.
• We explore simulation-based model selection that evaluates the efficiency of the PDE state solution, simulated by a PINN, in predicting the denoised data directly.
Bayesian PDE Discovery Methods Although uncertainty-quantified SINDy (UQ-SINDy) (Hirsh, Barajas-Solano, and Kutz 2022) and threshold sparse Bayesian regression with error bars (Zhang and Lin 2018) were previously introduced, uncertainty-guided model selection remains underexplored. Notably, UQ-SINDy employs sparsifying priors to induce nonzero terms, identified by their posterior inclusion probabilities generally in a single pass, leading to risks of missing some correct terms or including incorrect ones. The latter work also risks excluding small yet important PDE coefficients due to a sensitive threshold. The comparisons are discussed thoroughly in the supplement.

Methodology Problem Formulation
Let us assume, without loss of generality, that the system state u on a two-dimensional (2D) spatio-temporal grid of a spatially distributed physical system satisfies u_t = N(u, u_x, u_xx, ...; ξ). (1) We aim to discover N, a linear or nonlinear operator involving spatial derivatives of the state variable u only. The parametric dependency ξ is a constant vector-valued coefficient. For convenience, we write ∂_x u ≡ u_x and other notations alike. Since our observation input ũ may be disturbed by noise, or mathematically ũ_ij = u(x_i, t_j) + z(x_i, t_j), we begin our PDE discovery approach by denoising ũ.
The noise z(x_i, t_j) ∼ (ϵσ_u/100) N(0, 1) is drawn from a standard Gaussian distribution and scaled proportionally to ϵ% of the standard deviation (sd) σ_u calculated over the domain.
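This noise model can be sketched in a few lines of NumPy (the grid, field, and seed below are illustrative assumptions, not the paper's datasets):

```python
import numpy as np

def add_noise(u, eps_percent, rng):
    """Perturb u with zero-mean Gaussian noise scaled to eps% of u's sd."""
    sigma_u = u.std()
    return u + (eps_percent * sigma_u / 100.0) * rng.standard_normal(u.shape)

# Hypothetical clean field on a 256 x 101 space-time grid.
x = np.linspace(0.0, 2.0 * np.pi, 256)
u = np.sin(x)[:, None] * np.ones((1, 101))
u_noisy = add_noise(u, 1.0, np.random.default_rng(0))  # 1%-sd noise
```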

Denoising Data
Supposing the grid lies in a two-dimensional (2D) space, we turn ũ into a zero-mean array stacking flattened patches, regarded as signals S_p(ũ) ∈ ℝ^(p²×f), where p and f determine the patch size and the number of features. In Eq. (2), we seek a dictionary D of c atoms and a corresponding sparse code A to represent S_p(ũ), solving a ρ-regularized dictionary learning problem. A pair of optimized D and A is obtained through regularized K-SVD training iterations. With the final fixed D, the ultimate sparse code A is then found using the orthogonal matching pursuit (OMP) algorithm with ⌊p²/10⌋ target sparsity to reconstruct the denoised observed data û = S_p⁻¹(DA) from the patches, via the inverse S_p⁻¹. ∥·∥_F is the Frobenius matrix norm. We define G_û(x_i, t_j) = û_ij as the denoised state function.

PDE Identification
Best-subset regression is used next to recover a sequence of potential parsimonious PDEs along with their corresponding coefficients, under a maximal bound on the support size (i.e., the number of nonzero coefficients). Basically, a careful forward-backward elimination algorithm (Guyon et al. 2002) can be implemented to approximate the best subsets.
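For small candidate libraries, the best subsets can even be found exhaustively; a minimal sketch (illustrative only — the paper's solvers include MIQP and forward-backward elimination for larger problems):

```python
import itertools
import numpy as np

def best_subsets(Phi, q, max_support):
    """For each support size s, exhaustively choose the s columns of Phi
    minimizing the residual sum of squares against the target q."""
    subsets = {}
    for s in range(1, max_support + 1):
        best_rss, best_idx, best_coef = np.inf, None, None
        for idx in itertools.combinations(range(Phi.shape[1]), s):
            A = Phi[:, idx]
            coef, *_ = np.linalg.lstsq(A, q, rcond=None)
            rss = float(np.sum((q - A @ coef) ** 2))
            if rss < best_rss:
                best_rss, best_idx, best_coef = rss, idx, coef
        subsets[s] = (best_idx, best_coef, best_rss)
    return subsets
```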
We compute a presumably overcomplete (up to a maximum derivative order) library Φ(û), collecting candidate terms of the denoised observed data. Following the weak formulation (Reinbold, Gurevich, and Grigoriev 2020), the i-th numerical value of the j-th candidate (column-wise) in Φ(û) ∈ ℝ^(N_Ω×N_q) is given by integrating over a local spatio-temporal subdomain Ω_i, whose (rectangular) lengths are H_x and H_t.
ϕ_j is regarded as a candidate function, for example, G_û² or ∂²_x G_û. N_Ω is the number of domain centers (x^c_i, t^c_i). A smooth weight, e.g., w = (x²−1)²(t²−1)², is a viable function for discovering the Burgers' PDE as it vanishes along the boundary ∂Ω_i. Certainly, higher polynomial orders are possible. By integration by parts on Eq. (3), the numerically noisy derivative evaluation of ϕ_j is instead carried out on the noiseless w. We use the implementation provided in the PySINDy package (de Silva et al. 2020; Kaptanoglu et al. 2022). Remark that the noise-tolerant representation by the convolutional weak formulation (CWF) (Messenger and Bortz 2021) could be leveraged at the library-construction stage as well.
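To make the integration-by-parts step concrete, the following sketch evaluates one weak-form entry over a uniform subdomain grid (an illustrative stand-in for the PySINDy implementation: the weight follows the w above, its derivatives are taken numerically, and each part onto w flips the sign once since w vanishes on ∂Ω_i):

```python
import numpy as np

def weak_value(u_sub, x_sub, t_sub, a=0, b=0):
    """Weak-form value of ∂_x^a ∂_t^b u over one rectangular subdomain,
    moving every derivative onto the smooth test weight by parts."""
    X, T = np.meshgrid(x_sub, t_sub, indexing="ij")
    xb = 2.0 * (X - x_sub[0]) / (x_sub[-1] - x_sub[0]) - 1.0  # map to [-1,1]
    tb = 2.0 * (T - t_sub[0]) / (t_sub[-1] - t_sub[0]) - 1.0
    w = (xb**2 - 1.0) ** 2 * (tb**2 - 1.0) ** 2
    for _ in range(a):                  # ∫(∂_x u)w = -∫u(∂_x w), repeatedly
        w = -np.gradient(w, x_sub, axis=0)
    for _ in range(b):
        w = -np.gradient(w, t_sub, axis=1)
    dx, dt = x_sub[1] - x_sub[0], t_sub[1] - t_sub[0]
    return float(np.sum(u_sub * w) * dx * dt)
```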

Uncertainty-penalized Bayesian Information Criterion (UBIC) for Adaptive Model Selection
This section finds the best number of supports presented in Eq. (4) within a given range. The base metric, on which we rely to penalize the maximized log-likelihood of a regression model with its complexity, is the BIC: BIC(k) = −2 ln L̂_k + s_k log(N_Ω), (5) where L̂_k is the maximized model likelihood. Assuming the true governing PDE is parameterized by a vector-valued coefficient that distributes with relatively less uncertainty than those of the other potential PDEs, we define the UBIC as UBIC(k) = BIC(k) + λ_U U_k log(N_Ω), (6) where U_k represents an estimated total uncertainty for ξ^k, scaled proportionally to log(N_Ω), similar to the complexity penalization in the BIC, for convenient unification. The data-dependent λ_U, controlling the influence of U_k on model selection, is adaptively adjusted by Algorithm 1. Typically, a lower UBIC means a better-discovered PDE. A Bayesian linear regression probabilistic view is placed on ξ^k with the Gaussian conjugate prior N(ξ^k | ξ^k_0, V^k_0) to set up the posterior N(ξ^k | ξ^k_μ, V^k). (7) The experimental results presented later are produced with ξ^k_0 = ξ̂^k and V^k_0 = I_{∥ξ̂^k∥_0}, an identity matrix of size ∥ξ̂^k∥_0.
Note that ξ^k_0 = 0⃗, which reduces the posterior mean to the ridge estimate, is also a feasible option. By maximum likelihood estimation (MLE), we accordingly set the error variance σ². Inspired by the coefficient-of-variation formula, the uncertainty U_k is defined from CV_k = ∥diag(V^k)^(1/2)∥_1 / ∥ξ^k_μ∥_1: we essentially compute the relative standard deviation of the covariance matrix V^k by taking an element-wise square root (the ·^(1/2) exponent) of its diagonal vector (diag) and then dividing by ∥ξ^k_μ∥_1. Each CV_k is rescaled by the minimum over the range of available support sizes, resulting in U_k, whose numerical value is comparable to s_k.
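A compact sketch of these quantities under the stated conjugate setup (the closed-form posterior update and the Gaussian-BIC constants are our assumptions, since the original equations are not reproduced here):

```python
import numpy as np

def posterior(Phi_k, q, xi_hat):
    """Gaussian conjugate posterior with prior mean ξ̂ (best-subset estimate),
    identity prior covariance, and MLE error variance σ²."""
    n, s = Phi_k.shape
    sigma2 = float(np.sum((q - Phi_k @ xi_hat) ** 2) / n)
    V = np.linalg.inv(np.eye(s) + Phi_k.T @ Phi_k / sigma2)
    mu = V @ (xi_hat + Phi_k.T @ q / sigma2)
    return mu, V

def coefficient_of_variation(mu, V):
    """CV = ||diag(V)^(1/2)||_1 / ||μ||_1, the relative posterior spread."""
    return float(np.sum(np.sqrt(np.diag(V))) / np.sum(np.abs(mu)))

def ubic(bics, cvs, lam, n_samples):
    """UBIC(k) = BIC(k) + λ_U · U_k · log N_Ω, with U_k = CV_k / min_k CV_k."""
    U = np.asarray(cvs) / np.min(cvs)
    return np.asarray(bics) + lam * U * np.log(n_samples)
```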
Once U_k is obtained, we converge the UBIC by Algorithm 1, iteratively decreasing λ_U from its maximum bound λ_U^max, derived to maintain the influence of the log-likelihood value (presumably greater than zero) in Eq. (6). We impose the constraint based on the discovered s_k-support PDE.
Algorithm 1 finds a proper λ_U = 10^λ by reducing λ iteratively. We track the current and competitive optimal support sizes (s_k*, s_kc) and test the termination condition, which essentially checks whether we have reached an increased complexity with unsatisfactory improvement (see τ), or a decreased complexity with already satisfying improvement. The algorithm takes Φ(û), q_0, and ξ̂^k as inputs, uses the improvement-acceptance threshold τ_0 (default 0.02) and N_δ = 3 evenly spaced values in [0, log_10 λ_U^max] as parameters, and outputs the optimal support size s_k* and the tuned λ_U; if an overfitted model is detected, it retries with an increased τ_0. Also, τ_0 can be set adaptively, yielding the same correct selection as the default value. An effective heuristic is τ_0 = P_75(S), the 75th percentile of successive improvement factors over the BIC-decreasing models only. If an overfitted model is detected, we retry with a stricter percentile of S, e.g., P_80(S). Contrarily, lessening τ_0 helps discern a supposedly underfitted 1-support PDE.
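The outer loop can be paraphrased in a simplified sketch (our loose reading of Algorithm 1, not a faithful reimplementation; the BIC/CV inputs, λ_U^max, and the relative-improvement test are placeholders):

```python
import numpy as np

def adaptive_lambda(bics, cvs, n_samples, lam_max, n_delta=3, tau=0.02):
    """Sweep λ_U down from its bound; keep the UBIC-optimal support as long
    as moving to a larger support is justified by enough BIC improvement."""
    bics = np.asarray(bics, float)
    U = np.asarray(cvs, float) / np.min(cvs)
    best_s, lam_star = None, lam_max
    for e in np.linspace(np.log10(lam_max), 0.0, n_delta + 1):
        lam = 10.0 ** e
        scores = bics + lam * U * np.log(n_samples)
        s = int(np.argmin(scores))
        if best_s is not None and s > best_s:
            # larger support: accept only if relative BIC gain exceeds τ
            if (bics[best_s] - bics[s]) / abs(bics[best_s]) <= tau:
                break
        best_s, lam_star = s, lam
    return best_s + 1, lam_star  # index k maps to support size k+1
```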

Simulation-based Model Selection
As Eq. (4) optimizes for the regression model that fits Φ(û) to approximate q_0 (a weak form of u_t), the attained models do not offer a metric for direct comparison with the PDE solution u. However, they are expected to yield the true PDE, reducing unnecessary costs of simulating false governing PDEs.
As an additional validation process, which can be computationally expensive or numerically infeasible for some ill-posed PDEs, we solve the found PDEs most likely to be the true PDE form based on our UBIC scores. Well-known numerical PDE solvers based on spectral methods are available in Chebfun (Battles and Trefethen 2004) and Dedalus (Burns et al. 2020). The software can accurately solve canonical stiff PDEs given initial and boundary conditions. Nevertheless, when solving a PDE containing extraneous high-order derivatives, the simulated solution may explode over time. Mathematical analysis is needed to rigorously address ill-posed PDEs, characterized by numerical instability in their solutions. Though such analysis exceeds our scope, we offer a flexible neural-network-based treatment.
Physics-informed Neural Network (PINN) Learning PINN learning is an alternative approach for solving PDEs. The learned solution not only satisfies a specified PDE but is also stabilized to fit observational data. The general principle is to learn a mapping F_Θ: (x_i, t_j) ↦ û_ij from spatio-temporal inputs by minimizing the physics-informed loss respecting the discovered function N. D_Train is the train split containing subsampled discretized spatio-temporal points, on which the PINN is trained, and likewise D_Val for the holdout validation dataset. Automatic differentiation is used to compute derivative terms, e.g., ∂_x F_Θk. Θ_k is optimized by the second-order L-BFGS (Liu and Nocedal 1989). The trainable ξ̂^k is initialized beforehand from Eq. (4).
To decide whether the s_k-support PDE is better or worse than the s_k+1-support PDE, we adjust the BIC in Eq. (5) so that the complexity penalization avoids choosing an overfitted PDE whose PINN-predicted solution is also close to û_DVal on the validation split. If the state-level BIC of the s_k-support PDE is lower, it is advisable not to increase the support size to s_k+1 under a fair circumstance where the PINN architecture and learning procedure are identical. We adhere to the vanilla paradigm for simplicity, though the PINN training could be improved, e.g., using multi-task learning techniques (Thanasutives, Numao, and Fukui 2021), since the network may overfit on D_Train, stuck in a physics-obeying local minimum (Bajaj et al. 2023).
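Assuming a Gaussian likelihood, the state-level comparison can be sketched as follows (the n·log(RSS/n) + s·log(n) form is our assumption for the state-level BIC; the predictions below are placeholders for simulated solutions):

```python
import numpy as np

def state_bic(u_pred, u_denoised, support_size):
    """Gaussian state-level BIC comparing a simulated solution against the
    denoised observations on a validation split."""
    resid = (np.asarray(u_pred) - np.asarray(u_denoised)).ravel()
    n = resid.size
    rss = float(resid @ resid)
    return n * np.log(rss / n) + support_size * np.log(n)
```

A lower value favors the candidate PDE: the first term rewards fit, the second penalizes each extra support.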

Numerical Results
The PDE dataset description is given in Table 1. Experiments were run on a 2.6 GHz 6-Core Intel i7 CPU with 32 GB RAM. Further results and analysis are in the supplement.
Burgers' PDE To denoise ũ, we ran regularized K-SVD with ρ = 0.05 on the stack S_p(ũ) created with square patches of size 8 × 8. For sparse encoding during the training, OMP was configured with one target sparsity.
We gathered an overcomplete set of candidate functions for transforming to the integral weak forms, (Φ(û), q_0). Exhaustive all-subsets selection solved Eq. (4), attaining ξ̂^k for every k ≤ N_q = 8 (a constant intercept or bias excluded).
Figure 2a (top) shows the BIC scores of the found PDEs, where the support sizes are arranged in increasing order. After the 2-support PDE, the improvement in BIC becomes stagnant. However, model selection based on the BIC would not choose the 2-support PDE as the optimal choice. This is because the scores continue decreasing beyond the plateau, and the 5-support PDE actually yields the lowest BIC score. We observe that the log-likelihood dominates the BIC score when the number of samples in the library, N_Ω, is large. The evidence reveals the impropriety of using the BIC for identifying the true governing PDE.
To leverage the proposed UBIC, we quantify the uncertainty U_k from all the best subsets, as plotted in Figure 2a. The PDE with 2 supports exhibits the highest stability. These uncertainties are incorporated to penalize the previously obtained BIC scores, preferring the parsimonious PDE with reliable coefficient estimates. Algorithm 1 suggests the UBIC scores with λ_U = 10^0 = 1. As a result, the selected PDE aligns with the true Burgers' PDE form.
Korteweg-De Vries (KdV) PDE We generated the two-soliton u with the initial condition u(x, 0) = −sin(πx/20). We employed denoising regularized K-SVD with ρ = 0.01 on the stack S_p(ũ) created with 25 × 25 patches. We set the OMP algorithm with one target sparsity during the training.
In Figure 2b, we observe that the true equation, favored by the UBIC with tuned λ_U = 10^2.28, stands out from the other potential PDEs, in accordance with its minimal uncertainty.
Kuramoto-Sivashinsky (KS) PDE We used the same regularized K-SVD settings as detailed in the KdV example.
Since the set of candidate functions adopted in the KdV example was sufficient to build an overcomplete weak-form library for recovering the KS PDE, we followed the same best-subset regression strategy.
As shown in Figure 2c, the lowest BIC and UBIC scores, with just λ_U = 1, offer the desirable outcome, pointing to the true equation form.

Navier-Stokes (NS) PDE
We consider the explicit form of the NS equation given on the 3D spatial-temporal grid. As seen in Table 1, w denotes the vorticity. The components of the velocity field are denoted by u (x-component) and v (y-component), which are both treated as known terms to construct an overcomplete library. We generated the dataset according to the instructions given in the PDE-FIND paper and focused on the bounded spatial domain (x, y) ∈ [2, 8.48] × [0.3, 3.68] downstream of the cylinder. We were left with, in total, 8,342,750 data points for each variable (w, u, v), to which 1%-sd noise was added after the subsampling.
To obtain each noise-reduced variable, we applied 2D Savitzky-Golay filters along the temporal axis and then employed the denoising SVD. As experimented in PDE-FIND, we specifically retained the top singular values: 26, 20, and 20 for w, u, and v (each reshaped as a matrix with 325 × 170 rows and 151 columns), respectively. The process enhanced the variables' quality, thereby preserving the correct equation form to be captured at the discovered 4-support PDE. There were 19 non-weak terms listed from the variables ψ_1 ∈ {w, u, v}, the spatial derivatives of the vorticity ψ_2 ∈ {w_x, w_xx, w_y, w_yy}, and the interactions ψ_1ψ_2. Every best subset, whose cardinality ranges from 1 to 10, is initially approximated by MIQP (mixed-integer quadratic programming) with an L0-norm-based budget constraint (Bertsimas, King, and Mazumder 2016). Then we carried out an exhaustive all-subsets search over the top candidates, each existing in at least one of the 10 best subsets.
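The denoising-SVD step amounts to a truncated SVD of each reshaped data matrix; a minimal sketch (the shapes and rank below are illustrative, not the paper's settings):

```python
import numpy as np

def svd_denoise(M, rank):
    """Keep only the top `rank` singular values of a (space x time) matrix."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]
```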
Although the underlying PDE form depends on 4 supports, it is the most stable one, as illustrated in Figure 2d (bottom). Undoubtedly, the quantified uncertainty associated with each discovered PDE is a beneficial factor for finding the correct model by the UBIC with λ_U = 10^6.11.

Denoising Effect
The positive effect of denoising is clearly evident from the major drop in BIC observed throughout the previously studied examples. We measure the improvement by the maximum reduction in the BIC, R_BIC. In Figure 2c, the reduction in BIC scores is not as pronounced as in the Burgers and KdV examples, implying the challenge of restoring the chaotic pattern of the KS PDE. Meanwhile, in the NS example, the omission of the true PDE when no denoising was performed causes the noticeable gap in BIC values between the two cases, as illustrated in Figure 2d, confirming the usefulness of the denoising step for model selection. R_BIC, or more generally the area between trade-off curves, is a prospective metric for tuning denoising hyperparameters.

Robust Adaptive Model Selection
We fully harness the capability of our proposed method in such extremely noisy scenarios that, without the denoising step, the true governing PDE would be omitted or poorly recovered from the potential best subsets. As visualized in Figure 3, we correctly identify the governing PDEs selected using the adaptive UBIC despite the severe noise interference. Particularly noteworthy is the Burgers example, where neither the BIC nor the uncertainty alone can recover the true equation form.
Discovery Accuracy We evaluate the proximity between the discovered ξ̂^k_j and the ground truth ξ_j by the %CE (percentage coefficient error): 100 × |ξ̂^k_j − ξ_j| / |ξ_j|. The average over every effective coefficient is reported in Table 2 for each dataset in Table 1. ξ̂^k obtained on the denoised data delivers a lower average %CE than the case without denoising.
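The %CE metric is straightforward to compute; a short sketch averaging over effective (true-nonzero) coefficients:

```python
import numpy as np

def percent_coefficient_error(xi_hat, xi_true):
    """Average %CE = mean over true-nonzero coefficients of
    100 * |xi_hat_j - xi_true_j| / |xi_true_j|."""
    xi_hat = np.asarray(xi_hat, float)
    xi_true = np.asarray(xi_true, float)
    mask = xi_true != 0  # only effective coefficients contribute
    return float(np.mean(100.0 * np.abs(xi_hat[mask] - xi_true[mask])
                         / np.abs(xi_true[mask])))
```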

Reaction-Diffusion (RD) PDE
The PDE governs a system that simulates double spiral waves on a periodic domain, consisting of 7 actual terms. To counteract the 10%-sd noise that perturbed a stack of the u and v variables, 2D Savitzky-Golay filters were employed along the temporal axis, and the results were then collected to construct the noise-reduced data.
The candidates were chosen from the variables and their transformations: ψ_1 ∈ {u, v, u³, v³, u²v, uv²}, the spatial derivatives (up to second order) of either u or v: ψ_2 ∈ {u_x, u_y, u_xx, u_yy, u_xy, v_x, v_y, v_xx, v_yy, v_xy}, or possible interactions ψ_1ψ_2. To identify the best subset for each cardinality from 1 to 10, an initial approximation was obtained using MIQP with the L0-norm-based budget constraint. Subsequently, we validated and ensured the optimality of the subsets, whose support sizes were not greater than 10, respecting the set of unique effective candidates.
Figure 4a shows that the uncertainty alone is not enough for the model selection, as it correlates positively with the number of supports, signifying the necessity of the base BIC in Eq. (6). The single-support PDE exhibits the least uncertainty. However, an intriguing observation emerges at the 7-support PDE, where the uncertainty drops relative to the surrounding PDEs plotted alongside. This stability is effectively exploited by the tuned UBIC with λ ≈ 1.31 to successfully identify the true 7 candidates.

Gray-Scott (GS) PDE
The GS PDE models the reaction-diffusion system on the 4D spatial-temporal grid. We looped over the stack of the variables along the t-temporal and then z-spatial axes to mitigate the noise effect by 2D Savitzky-Golay filters. Hereafter, similarly to the NS example, each variable was reconstructed via the denoising SVD, retaining the 10 most significant singular values.
We composed an overcomplete candidate library with the variables (including an intercept) and their transformations {1, u, v, u³, v³, u²v, uv²}, the spatial derivatives of u: {u_x, u_y, u_z, u_xx, u_yy, u_zz, u_xy, u_xz, u_yz}, and the spatial derivatives of v, defined analogously. The approximated best subsets were retrieved using the FROLS solver with a maximum support size of 12. These subsets were guaranteed to be at their optimum within all the effective candidates, each delivered at least once by the solver.
According to Figure 4b, it is evident that the best subsets aligning with the true complexity of the PDE system, with support sizes of 6 and 5 for the targets u_t and v_t, exhibit the minimal uncertainties. The tuned UBIC (λ ≈ 1.51) leverages the uncertainty pattern to penalize the BIC values, hence the successful identification of the true PDE system.

Simulation-based Model Comparison
We simulated the PDE selected by the tuned UBIC and another PDE with an additional candidate, using PINN learning. Comparing the two PDEs, we measured the proximity of their simulated solutions to the denoised version of the observed state variable using the BIC in Eq. (11). Table 3 warrants that the s_k*-support PDE indeed has sufficient complexity, yielding a lower-BIC simulated state variable than its competitor, the (s_k*+1)-support PDE with the extra candidate.

Conclusion
We extend the BIC to the parameter-adaptive UBIC, which incorporates model uncertainty for selecting the governing PDE amidst noisy data. The uncertainty, quantified from the posterior of the model coefficients, brings reliability into the model selection, preventing the rise of overfitted PDEs unaddressed by the BIC. The proposed UBIC demonstrates successful identification of the true hidden equation while maintaining computational efficiency thanks to the analytical posterior form. To validate the consistency of the model selection results, we compare the PDE selected by the UBIC against its competitor with an extra support, finding the better PDE that delivers a lower BIC value calculated between the simulated state and the denoised observed data. Notably, the discovery on denoised data positively improves the BIC trade-off. Towards more complicated examples, we will explore applying the UBIC to evolutionary-based discovery methods (Xu, Chang, and Zhang 2020) to relax the overcompleteness assumption.

Supplemental Material Generalized UBIC (G-UBIC)
The generalized UBIC (G-UBIC) is defined with a scaling parameter Γ as UBIC_Γ(k) = BIC(k) + λ_U Γ U_k, where Γ = log(N_Ω) recovers Eq. (6). λ_U Γ determines the influence of the uncertainty U_k on the model selection process. If the influence were to be preserved, we would easily find λ_U Γ = λ_U^0 log(N_Ω), supposing that λ_U^0 were the values presented in the main text. Actually, it is not necessary to keep the influence unvaried, since we only need appropriate model selection results, which are automatically taken care of by Algorithm 1, iteratively checking the likely optimal support sizes (obtained by the G-UBIC) based on the BIC improvement if we transition to the corresponding PDE model. The bound on the maximum λ_U, which preserves the impact of the log-likelihood on deciding the best PDE, is calibrated according to Γ.
The bound reduces to Eq. (9) with Γ = log(N_Ω). As seen in Figure 5, we test the adaptability of Algorithm 1 in tuning λ_U = 10^λ for the basic G-UBIC with Γ = 1. The correct identification of the true governing form is shown for every canonical PDE dataset we experimented with.

Uncertainty-penalized WAIC (UWAIC)
Demonstrating the versatility of the core proposals, we conduct a pilot extension of the uncertainty penalization to the WAIC (widely applicable information criterion) (Watanabe 2010), giving UWAIC = T + FV/N_Ω + λ_U CV_k, where T and FV are the training loss and functional variance. The UWAIC is more computationally expensive than the UBIC or G-UBIC because it involves full Bayesian inference to compute the WAIC before adding the penalizing unnormalized uncertainty CV_k. The UWAIC with the λ_U specified in Figure 6 is capable of identifying the true PDE terms, similar to the UBIC. These results signify the flexibility of incorporating the proposed uncertainty penalization into other information criteria.

Sparse Bayesian Regression by Sparsifying Priors
We study the PDE discovery approach based on sparsifying priors as inspired by (Hirsh, Barajas-Solano, and Kutz 2022). Particularly, we examine sparse Bayesian regression with the spike-and-slab prior (SS) (Ishwaran and Rao 2005; Madigan and Raftery 1994; Mitchell and Beauchamp 1988) and the regularized (Finnish) horseshoe prior (RH) (Piironen and Vehtari 2017) to solve Eq. (3) through Bayesian inference, where we draw samples from the posterior distribution using either Markov chain Monte Carlo (MCMC) or sequential Monte Carlo (SMC) methods implemented in the PyMC package (Salvatier, Wiecki, and Fonnesbeck 2016).
For the SS prior, each coefficient is modeled hierarchically with a Bernoulli inclusion variable B_j in Eq. (14), where β is the probability of success characterizing the Bernoulli distribution. Since B_j ∈ {0, 1}, the spike-and-slab prior enables sparse coefficients. Here, k should be set to the index of the maximum support size, depending on available computational resources.
After completing the Bayesian inference, the quantified uncertainty factor U^SS_j of each coefficient is computed over its posterior samples of ξ^SS_j. Under our core assumption, U^SS_j would be comparatively low for the j-th effective candidate corresponding to one of the true terms. For a candidate to be considered effective, its posterior inclusion probability (PIP) must be greater than zero. By counting the number of times a unique subset occurs in the posterior samples, we can also estimate the PIP of each (best) subset. The uncertainty factor is computed likewise for the other sparsity-inducing prior (i.e., U^RH_j for the RH prior). In Figures 7 and 8, the desirable outcomes are verified by inspecting that the best subsets with the correct support sizes (2 in these cases) present the highest PIPs. The marginal posterior of the negligible candidates mostly concentrates in a spike around the origin, whereas wider, slab-like distributions are observed for the true nonzero candidates. However, achieving these results comes at the cost of setting an appropriate probability of success for the Bernoulli distribution in Eq. (14), i.e., β = 0.3125 and 0.01 for the Burgers and KdV examples, respectively. If a bigger β = 0.1 is asserted for the KdV example, the best subset with the highest PIP comprises 3 candidates instead, which does not convey the true sparsity, as shown in Figure 9; hence the troublesome sensitivity caused by β. On the contrary, the uncertainty pattern remains insensitive to the change in β from 0.01 to 0.1. Also, the u_xxx and uu_x (5th and 7th) candidates yield the least uncertainty-factor values. The uncertainty factors of u_xx and uu_x (4th and 5th candidates) are found to be minimal for the Burgers' PDE case, as illustrated in Figure 7.
By the RH prior design (Hirsh, Barajas-Solano, and Kutz 2022), the coefficients sampled from the posterior distribution are not strictly sparse, exhibiting values close to, but not precisely, zero. Thus, a definition of pseudo-probabilities may be introduced. In this study, we are inclined to take the uncertainty factor as our preferred alternative. Figure 10 reveals that the true nonzero candidates have the lowest uncertainty factors, akin to the cases where the spike-and-slab priors are utilized. We set the global shrinkage parameter of the RH prior equal to 10⁻³ for both examples.

Threshold Sparse Bayesian Regression
Another approach for achieving sparse identification of the governing PDE with quantified uncertainty (error bars) is iterative thresholding until no further changes in sparsity are detected. This approach is known as threshold sparse Bayesian regression (Zhang and Lin 2018). We adhere closely to their iterative thresholding algorithm. In our implementation, we leveraged fast automatic relevance determination, which uses sparse Bayesian learning (Tipping and Faul 2003; Faul and Tipping 2001), to estimate mean regression coefficients from the posterior distribution.
We showcase the experimental results with 3 different pre-specified threshold values, as depicted in Figure 11. We observe sensitivity with respect to the threshold: overly high or low threshold values yield underfitted and overfitted PDE models, respectively. This issue is simply addressed by aggregating the best subsets from the three cases (Threshold = 10^-1, 10^-2, 10^-3) and selecting the one that minimizes the UBIC, as seen in the rightmost subfigure.

Improving PDE Discovery Accuracy
After the model selection is finalized, the coefficient values may be refitted using a different candidate representation such as the convolutional weak formulation (CWF). Alternatively, we suggest that the quality of the estimated coefficients can be improved simply by further denoising û over each subdomain Ω_i in Eq. (3). Particularly, we substitute G_{u+} = S^α_{Ω_i} ∘ G_û for G_û, where S^α_{Ω_i} is the composite denoising function parameterized by α.

Practically, S^α_{Ω_i} is implemented as Savitzky-Golay filtering with window size α, applied to the bounded û over each Ω_i before computing q_ij for all j. Note that the order of the polynomial used for the filtering is 2. The resulting modification is called the denoised WF (weak formulation).
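A minimal sketch of this preprocessing step, assuming the composite filter S^α_{Ω_i} smooths the field along the spatial and then the temporal axis (the exact composition is our assumption), with `scipy.signal.savgol_filter` and polynomial order 2 as stated above:

```python
import numpy as np
from scipy.signal import savgol_filter

def denoise_subdomain(u_hat, alpha):
    """Denoised-WF preprocessing: Savitzky-Golay smoothing of the noisy
    field over one subdomain before building the weak-form features.

    u_hat: (n_x, n_t) observed field restricted to a subdomain Omega_i.
    alpha: odd window size of the filter (polynomial order fixed at 2).
    """
    u = savgol_filter(u_hat, window_length=alpha, polyorder=2, axis=0)  # space
    u = savgol_filter(u, window_length=alpha, polyorder=2, axis=1)      # time
    return u
```

The weak-form integrals q_ij are then computed from the smoothed field; other denoising filters could be swapped in for the Savitzky-Golay choice.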
Table 4 compares how close the estimated coefficients are to the true values, in terms of the average percentage coefficient error (%CE), for the different library representations. With this simple adjustment, the denoised WF outperforms the counterpart CWF method on average over the 3 PDE datasets. S^α_{Ω_i} could also have been implemented with other denoising filters, which may enhance the discovery accuracy even further. Therefore, our work demonstrates not only the effectiveness of the PDE discovery approach using the proposed UBIC but also accurately estimated coefficients. These outcomes emphasize the advantage of incorporating the penalizing uncertainty information into the model selection, especially in scenarios contaminated with substantial noise.

Simulation-based Model Selection: Extended Results
The evidence presented in Table 3 indicates that the s_k-support PDE has sufficient complexity to generate a lower-BIC simulated state variable than the s_{k+1}-support PDE, as demonstrated via the PINN learning. To strengthen the results, we also solve the PDEs on their entire spatio-temporal domain using the Chebfun and Dedalus software, unlike the PINN approach, which necessitates a train-validation data split. The PySR package (Cranmer 2023) is used to recover the analytical equation of the initial condition. The results, given in Table 5, reveal the same conclusion.

Figure 1 :
Figure 1: Schematic diagram of the Burgers' PDE discovery example incorporating the UBIC for the model selection (Eq. 11), where |Θ*_k| denotes the network's number of trainable parameters. The modification involves utilizing the validation data points in D_Val for the comparison.

Figure 2 :
Figure 2: We plot the BIC, uncertainty U_k, and UBIC with tuned λ_U = 10^λ for the model selection in the Burgers, KdV, KS, and NS examples, arranged from left to right. "✓" indicates that the UBIC selects the true PDE form.

Figure 3 :
Figure 3: Robust adaptive model selection by the UBIC under the extremely noisy scenarios.

Figure 4 :
Figure 4: Model selection results for the RD and GS PDEs.

Figure 5 :
Figure 5: Successful model selection results by G-BIC with Γ = 1. In each subfigure, the (re)tuned λ(s) is/are annotated. Note that all the U_k values are already given in Figures 2 and 4 in the main text.

Figure 7: Sparse Bayesian regression with the SS prior (β = 0.3125) for discovering the Burgers' PDE. The candidates are listed in the following order: [u, u², u_x, u_xx, uu_x, u²u_x, uu_xx, u²u_xx].

Figure 8: Sparse Bayesian regression with the SS prior (β = 0.01) for discovering the KdV PDE.

Figure 9 :
Figure 9: Sparse Bayesian regression with the SS prior (β = 0.1) for discovering the KdV PDE. Please refer to the candidate order listed in Figure 8.

Figure 10 :
Figure 10: Sparse Bayesian regression with the RH prior for discovering the Burgers and KdV PDEs. The global shrinkage parameter is set equal to 10^-3. Please refer to the candidate order listed in Figures 7 and 8.
Figure 11: Threshold sparse Bayesian regression for discovering the Burgers' PDE. Please refer to the candidate order listed in Figure 7.

Figure 13 :
Figure 13: Robust model selection by the UBIC without preceding denoising under the highly noisy scenarios. For the Burgers and KS cases, we assign τ_0 = P_75(S). For the KdV cases in this figure, we assign τ_0 = P_85(S).

Table 2 :
The better value is underlined (%CE) or in bold (R_BIC).

Table 4 :
Comparison based on %CE between the denoised WF and CWF methods for the library preparation. The lower %error is in bold.

Table 5 :
Simulation-based model comparison between the two likely PDEs, whose support sizes are s_k* and s_{k*+1}.