TRustworthy Uncertainty Propagation for Sequential Time-Series Analysis in RNNs

The massive volume of time-series data produced by the Internet of Things and digital healthcare requires novel approaches to data modeling and prediction. Recurrent neural networks (RNNs) are extensively used for analyzing time-series data. However, these models are unable to assess prediction uncertainty, which is particularly critical in heterogeneous and noisy environments. Bayesian inference allows reasoning about predictive uncertainty by estimating the posterior distribution of the parameters. The challenge lies in propagating this high-dimensional distribution through the sequential, non-linear layers of RNNs; naive approaches suffer from mode collapse, leading to erroneous uncertainty estimation and exacerbating the exploding-gradient problem. This paper proposes TRustworthy Uncertainty propagation for Sequential Time-series analysis (TRUST) in RNNs by introducing a Gaussian prior over the network parameters and estimating the first two moments of the Gaussian variational distribution using the evidence lower bound. We propagate the variational moments through the sequential, non-linear layers of RNNs using a first-order Taylor approximation. The propagated covariance of the predictive distribution captures the uncertainty in the output decision. Extensive experiments using ECG5000 and PeMS-SF classification tasks and weather and power consumption prediction tasks demonstrate 1) significant robustness of TRUST-RNNs against noise and adversarial attacks and 2) self-assessment through an uncertainty estimate that increases significantly with increasing noise.


I. INTRODUCTION
TIME-SERIES analysis and prediction are essential areas of machine learning (ML). Many critical applications rely on time-series analysis, including earthquake prediction, economic forecasting, and healthcare monitoring. Recurrent neural networks (RNNs) are specialized artificial neural networks designed to process and learn from sequential and time-series data [1], [2]. Popular variants of RNNs include long short-term memory (LSTM) and gated recurrent unit (GRU) networks [3], [4], [5]. LSTMs and GRUs employ multiple gates in their architecture to control the information flow and overcome the vanishing gradients problem of traditional RNNs [4]. These models have shown remarkable success in dealing with the time dependencies of sequential data and have achieved promising performance in various time-series classification and prediction tasks [1], [2], [3], [5]. Applications that demonstrate the success of LSTMs and GRUs include power consumption estimation [6], [7], weather forecasting [8], and healthcare diagnoses and disease prognoses [9], [10].
RNNs, like other traditional artificial neural networks, use training data to learn point estimates of the network parameters by minimizing a loss function, a measure of the discrepancy between the ground truth and model predictions. During testing, the learned network parameters are used to provide deterministic predictions for any new data examples. RNN architectures do not provide an estimate of the uncertainty (or confidence) in the learnable parameters or model predictions. However, acknowledging the level of uncertainty in the model's parameters and predictions is critical for high-stakes applications, e.g., financial prediction, legal decision-making and medical diagnosis. Consider the examples of detecting heart failure or life-threatening arrhythmias by analyzing the electrocardiogram (ECG) signal [10], or monitoring smart grids, gas pipelines or aircraft engines [11], [12], [13]. Missing a prediction due to artifacts in the time-series signals could endanger human lives and significantly impact productivity. Uncertainty in a model's decisions may serve as a warning and allow users to explore alternative solutions, such as recommending additional diagnostic tests, thus preventing tragic health, financial or societal damage due to highly uncertain decisions.
Quantifying uncertainty in model predictions can justify the behavior of a model under input data distribution shift (due to predictions in noisy environments) and improves generalization on out-of-distribution inputs [14]. Moreover, the vulnerability of neural networks to adversarial attacks, i.e., crafted noise imperceptible to humans that misleads ML algorithms into producing erroneous outputs, has raised concerns and even halted the deployment of RNNs and their variants in healthcare and safety-critical applications [15], [16], [17]. The uncertainty or model confidence provides valuable information to users for detecting and tackling adversarial attacks [14].
The Bayesian formulation facilitates a mathematically grounded approach for estimating uncertainty in deep neural networks (DNNs) [14]. In the Bayesian setting, we define a prior distribution over the network parameters and estimate their posterior distribution using Bayes' rule after observing the training data. The predictive distribution of any new data example can then be computed by marginalizing out the network parameters. The variance (or variance-covariance matrix in the multivariate case) of the predictive distribution provides a quantitative measure of the uncertainty associated with the model's prediction. However, due to the non-linear structure of DNNs and the high dimensionality of the parameter space, exact Bayesian inference on the parameters is intractable. Variational inference is an effective and extensively used method in the literature for approximating the posterior distribution of DNN parameters [18], [19], [20], [21].

A. Our Contributions
In this paper, we propose new sequential machine learning models (i.e., new Bayesian recurrent neural networks) that are robust to artifacts, noise, and adversarial attacks. These models can quantify uncertainty in their predictions and self-assess their performance. We start by defining a Gaussian distribution as a prior over the RNN parameters. The variational posterior distribution is then estimated by minimizing the evidence lower bound (ELBO) objective function. In the proposed Bayesian framework, the learnable network parameters are random variables defined by a probability distribution function. Therefore, all operations at each layer of an RNN, LSTM or GRU are derived considering the parameters as random variables. These operations include (1) inner products between two random vectors; (2) Hadamard products between two random vectors; and (3) non-linear transformations operating over random vectors, such as the hyperbolic tangent (Tanh), sigmoid and rectified linear unit (ReLU). We use the first-order Taylor series to approximate the mean and covariance matrix of the variational distribution after the non-linear activation functions. At the network's output, the mean vector represents the prediction or classification decision, and the variance-covariance matrix reflects the uncertainty associated with the output decision. We evaluate the models' performance for sequential time-series analysis. More specifically, the contributions are summarized as follows.
1) Introduce the TRustworthy Uncertainty propagation for Sequential Time-series analysis (TRUST) framework for RNNs and their variants, including LSTMs and GRUs. We adopt powerful and computationally efficient statistical frameworks from sequential Bayesian estimation used for tracking probability distributions in non-linear dynamical systems [30].
2) Establish the mathematical foundations for propagating the first two moments (mean and covariance matrix) of the variational probability distribution through non-linear layers of various RNN, LSTM, and GRU architectures for analyzing time-series data and quantifying uncertainty in the networks' predictions.
3) Perform an extensive performance evaluation and analysis on a variety of benchmark supervised classification and prediction time-series datasets. Our experiments include weather and power consumption as prediction tasks [31], [32], and the PeMS-SF and ECG5000 datasets as classification tasks [33].
4) Demonstrate significant robustness of the proposed TRUST models (compared to the state-of-the-art Bayesian and deterministic models) against random noise and adversarial attacks. The TRUST robustness is achieved without significantly increasing the number of parameters or the computational cost at inference time.
5) Exploit the uncertainty information at the output of the TRUST models (in the form of the variance-covariance matrix of the predictive distribution) to self-assess performance degradation and failure under noisy conditions and adversarial attacks.
The remainder of this paper is organized as follows. Section II presents a review of the related work in the area of Bayesian RNNs. Section III explains the proposed TRUST framework in detail for RNNs, LSTMs and GRUs. Section IV elaborates on the experimental settings used to evaluate the proposed TRUST models compared to the state-of-the-art Bayesian and deterministic homologs. Experimental results are presented and discussed in Section V. Finally, we conclude this paper in Section VI.

II. RELATED WORK
Recently, Fortunato et al. extended Bayes-by-Backprop (BBB) to RNNs (BBB-RNNs) by introducing a fully factorized Gaussian distribution over the network parameters and formulating the ELBO objective function for a truncated sequence of an unrolled RNN [21]. The authors introduced a hierarchical posterior distribution over the parameters to allow the networks to adapt locally to batches of data. Gal and Ghahramani extended the dropout-CNN [22] formulation to RNNs, termed variational RNNs or VAR-RNNs [23]. The dropout in RNNs was implemented by removing the same recurrent network units at each time step and randomly eliminating inputs, outputs, and recurrent connections. Molchanov et al. applied sparse variational dropout (SparseVD) to RNNs with unbounded dropout rates as a Bayesian compression approach that induced sparsity during the training of RNNs [34]. Lobacheva et al. extended SparseVD to GRU and LSTM models [35]. Later, Goel and Bajpai used VAR-LSTM to quantify uncertainty in forecasting global sales in hotel businesses [36], and Zhu and Laptev used the same network for time-series anomaly detection at Uber [37]. The DeepAR (DAR) model performs a univariate forecast using an encoder-decoder LSTM architecture [40]. The decoder consists of a fully-connected layer with two outputs, one for the mean and one for the standard deviation, followed by a softplus activation function. The DAR model parameters were learned by optimizing a log-likelihood loss function [40].
These state-of-the-art uncertainty quantification techniques in RNNs and their variants rely on a frequentist probability approach or adopt dropout to quantify uncertainty [21], [23], [36], [37], [38]. They follow Monte Carlo (MC) sampling by drawing one random sample from the variational distribution and passing it forward through the network layers. The moments of the variational distribution are generally not propagated through the recurrent layers or the various operations within the RNN network. At test time, the uncertainty in the model prediction is always estimated by performing multiple passes (MC sampling) through the trained model and computing the sample variance of the predictions. Moreover, such measures of uncertainty in the model output have not been extensively analyzed under noisy conditions, out-of-distribution inputs, or adversarial attacks, nor was there a variance-covariance versus signal-to-noise ratio (SNR) analysis to evaluate the proposed measure of uncertainty.
There are other methods in the literature for estimating confidence intervals of point predictions, including ensemble and frequentist approaches [41], [42], [43], [44]. These ensemble and frequentist RNNs are built by creating multiple perturbed versions of the original RNN by re-sampling the optimal parameters. These methods are usually post-hoc, that is, the RNNs are trained using deterministic settings, ignoring the computation of uncertainty or confidence during the training phase. We direct the reader to survey articles for further details on estimating confidence intervals of point predictions [45], [46].

III. THE PROPOSED TRUST FRAMEWORK

A. Mathematical Foundations of TRUST Models
This section provides the mathematical foundations of the TRUST models, including RNNs, LSTMs and GRUs. We adopt variational inference and propagate the first two moments, i.e., the mean vector and the variance-covariance matrix, of the (multivariate) variational distribution q_φ(Ω) through the RNN layers, including the non-linear activation functions. We derive mathematical relations for the various operations within one cell of an RNN and extend the framework to LSTM and GRU cells. By propagating the mean and the covariance of the variational distribution, we obtain the mean and the covariance of the predictive distribution p(y*|X*, D) at the network output for any test sample (X*, y*). The mean of p(y*|X*, D) represents the network's prediction, while the covariance matrix reflects the uncertainty in the prediction.

B. Bayesian Approximation and Variational Inference
We consider an RNN with L stacked layers, assuming the parameters (weights and biases) are random variables shared across recurrent states. The parameters are represented by Ω = {W^(l)}_{l=1}^{L}, with the biases augmented into the weight matrices. The training dataset D is a set of N sequences, D = {X^(n), y^(n)}_{n=1}^{N}, where X^(n) ∈ R^{τ×K}, with τ denoting the length of the sequences and K the input vector size (number of features), and y^(n) ∈ R^C, with C representing the number of classes for classification tasks or the number of output neurons for prediction tasks.
We introduce a Gaussian distribution as the prior probability distribution over the network parameters, Ω ∼ p(Ω). We assume the parameters are independent across layers to 1) extract uncorrelated features across different network layers and 2) develop a feasible optimization problem, as estimating the joint distribution of all layers is mathematically intractable in large ML models. Given the training sequences D and the prior distribution p(Ω), the estimation of the posterior distribution p(Ω|D) is typically intractable. Variational inference approximates the true posterior p(Ω|D) with a parametrized variational distribution q_φ(Ω). The optimization problem is formulated by minimizing the Kullback-Leibler (KL) divergence, KL[q_φ(Ω) ‖ p(Ω|D)], which is equivalent to optimizing the evidence lower bound (ELBO) objective (or loss) function L(φ; D) using the gradient descent update rule during training of the Bayesian neural network [47]:

L(φ; D) = −E_{q_φ(Ω)}[log p(D|Ω)] + KL[q_φ(Ω) ‖ p(Ω)].   (1)

The ELBO loss function consists of two parts: 1) the negative expected log-likelihood of the training data given the network parameters; and 2) the regularization term, which is defined by the KL-divergence between the proposed variational distribution q_φ(Ω) and the prior distribution p(Ω).
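Since both the prior and the variational posterior are Gaussian, the KL term in (1) has a closed form. The sketch below is our own illustration (not the paper's code), assuming a diagonal-covariance posterior and an isotropic zero-mean Gaussian prior for one weight vector:

```python
import numpy as np

def kl_gaussian_diag(mu_q, var_q, var_p=1.0):
    """KL[q || p] between q = N(mu_q, diag(var_q)) and p = N(0, var_p * I).

    Closed form for the regularization term in the ELBO (1); the isotropic
    zero-mean prior is an illustrative assumption, not taken from the paper.
    """
    return 0.5 * np.sum(
        var_q / var_p + mu_q**2 / var_p - 1.0 + np.log(var_p) - np.log(var_q)
    )

# Example: variational posterior over one weight vector of an RNN layer.
mu = np.array([0.3, -0.1, 0.7])
var = np.array([0.05, 0.02, 0.10])
print(kl_gaussian_diag(mu, var))  # small positive scalar
```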

C. Variational Moments Propagation in TRUST-RNNs
The operations in an RNN cell include a matrix-vector multiplication followed by an element-wise non-linear activation function, e.g., the hyperbolic tangent (Tanh). Fig. 1 illustrates the propagation of the moments of the variational distribution through the recurrent states, with an expanded view of the operations performed within one RNN cell. The current state s^(t) at time t is given by

s^(t) = ψ(U x^(t) + W s^(t−1)),   (2)

where x^(t) ∈ R^{K×1} is the input vector at time step t, U ∈ R^{H×K} is the input-hidden weight matrix, W ∈ R^{H×H} is the recurrent weight matrix, s^(t−1) ∈ R^{H×1} is the hidden state at time t − 1, H is the number of hidden units, and ψ is the non-linear activation function (Tanh in this case). We concatenate the matrices U and W as one large matrix, W = [U W]. Similarly, the vectors x^(t) and s^(t−1) are concatenated as a column vector, x̃ = [x^(t) s^(t−1)]^T, where T represents the transpose operation. We eliminate the superscript t from x̃ to simplify the notation. Equation (2) is re-written as

b = W x̃,  s^(t) = ψ(b).   (3)

Let w_i^T ∈ R^{1×(K+H)} be the i-th row of the matrix W, where i = 1, ..., K + H. We introduce a prior, i.e., a Gaussian distribution, over the weight vector w_i. The variational posterior will also be a Gaussian distribution, w_i ∼ N(μ_{w_i}, Σ_{w_i}), where the mean and covariance are estimated by optimizing the ELBO objective function (1) during training of the network. The input x^(t) and the state s^(t) are assumed to be mutually independent random vectors with the mean and covariance defined as x^(t) ∼ (μ_{x^(t)}, Σ_{x^(t)}) and s^(t−1) ∼ (μ_{s^(t−1)}, Σ_{s^(t−1)}), respectively. Although x^(t) and s^(t) may not necessarily follow Gaussian distributions, we assume that their distribution functions can be approximated by a mean vector and a variance-covariance matrix. We further assume that the weight vectors w_i are independent of each other and also independent of the input x^(t) and the state s^(t). The mean and the covariance of the concatenated random vector x̃ are then given by

μ_{x̃} = [μ_{x^(t)} μ_{s^(t−1)}]^T,  Σ_{x̃} = blkdiag(Σ_{x^(t)}, Σ_{s^(t−1)}).   (4)

Every element of b in (3) is the result of an inner product between two independent random vectors w_i and x̃, that is, b_i = w_i^T x̃. The mean and the covariance of the resulting random vector b are derived using Proposition 1. The proof of Proposition 1 is in Appendix A in the supplementary materials, available online.

Fig. 1. A schematic layout of the proposed TRUST-RNN with the variational moments propagation (represented by the bell-shaped distribution). The bottom box shows an expanded view of the operations performed inside a single RNN cell. The blue, green, and yellow colors represent the input, hidden state, and non-linear activation function, respectively. In the proposed settings, the weights w_i, the recurrent states s^(t), and the output b of the inner product (between the weights and the concatenated state and input) are all random vectors.
Proposition 1 (mean and covariance propagation through an inner product of two independent random vectors): Let b_i = w_i^T x̃, where w_i ∼ N(μ_{w_i}, Σ_{w_i}) and x̃ ∼ (μ_{x̃}, Σ_{x̃}) are independent. Then

μ_{b_i} = μ_{w_i}^T μ_{x̃},
Σ_{b_ii} = tr(Σ_{w_i} Σ_{x̃}) + μ_{w_i}^T Σ_{x̃} μ_{w_i} + μ_{x̃}^T Σ_{w_i} μ_{x̃},
Σ_{b_ij} = μ_{w_i}^T Σ_{x̃} μ_{w_j},  i ≠ j,   (5)

where i, j = 1, ..., K + H, and tr represents the trace operator.
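As an illustration (not the authors' reference code), Proposition 1 can be transcribed into a few lines of NumPy and verified against Monte Carlo sampling:

```python
import numpy as np

def inner_product_moments(mu_w, Sigma_w, mu_x, Sigma_x):
    """Mean/variance of b_i = w^T x for independent random vectors w and x.

    Implements the moment formulas of Proposition 1 (diagonal entry only);
    a sketch for illustration, not the paper's reference implementation.
    """
    mu_b = mu_w @ mu_x
    var_b = (np.trace(Sigma_w @ Sigma_x)
             + mu_w @ Sigma_x @ mu_w
             + mu_x @ Sigma_w @ mu_x)
    return mu_b, var_b

# Sanity check against Monte Carlo sampling.
rng = np.random.default_rng(0)
d = 4
mu_w, mu_x = rng.normal(size=d), rng.normal(size=d)
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Sigma_w, Sigma_x = A @ A.T, B @ B.T  # random SPD covariances
w = rng.multivariate_normal(mu_w, Sigma_w, size=200_000)
x = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
b = np.sum(w * x, axis=1)
print(inner_product_moments(mu_w, Sigma_w, mu_x, Sigma_x))
print(b.mean(), b.var())  # should approximately match
```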

Non-linear Activation Function:
The non-linear activation function ψ operates element-wise on b. The mean and the covariance at the output of ψ are approximated using the first-order Taylor series [48] as

μ_{ψ(b)} ≈ ψ(μ_b),  Σ_{ψ(b)} ≈ Σ_b ⊙ (∇ψ(μ_b) ∇ψ(μ_b)^T),   (6)

where ∇ represents the gradient of the function ψ with respect to b and ⊙ is the Hadamard product. The equations in (6) hold true for any non-linear activation function that operates element-wise, including Tanh, sigmoid, or the rectified linear unit (ReLU).
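A minimal sketch of (6) for the Tanh activation, written in NumPy under our own conventions:

```python
import numpy as np

def tanh_moments(mu_b, Sigma_b):
    """First-order Taylor propagation of (mu, Sigma) through elementwise tanh,
    as in (6). Illustrative sketch, not the authors' implementation."""
    mu_out = np.tanh(mu_b)
    grad = 1.0 - np.tanh(mu_b) ** 2             # d/db tanh(b) evaluated at mu_b
    Sigma_out = Sigma_b * np.outer(grad, grad)  # Sigma ⊙ (∇ψ ∇ψ^T)
    return mu_out, Sigma_out
```

The same two lines apply to sigmoid or ReLU after swapping in the corresponding derivative.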

D. Variational Moments Propagation in TRUST-LSTMs
1) The Structure of an LSTM: The LSTM network was introduced by [49] to overcome the vanishing gradient problem in RNNs. An LSTM cell consists of four gates, i.e., the input, forget, output, and gate gates, and an additional state, referred to as the cell state or memory cell c^(t). The four gates, along with the two states (c^(t) and s^(t)), control the flow of information inside the LSTM cell and help avoid the gradient vanishing/exploding problem. The input gate i^(t) controls the information from the gate gate g^(t) that is read into the cell state c^(t). The forget gate f^(t) removes the content of the cell state, and the output gate o^(t) reads the output from the cell state. Similar to the RNN cell, the input and hidden state vectors are concatenated together, x̃ = [x^(t) s^(t−1)]^T. The input, forget, output and gate gates perform the following operations:

i^(t) = ψ_s(ĩ), where ĩ = W^(i) x̃,
f^(t) = ψ_s(f̃), where f̃ = W^(f) x̃,
o^(t) = ψ_s(õ), where õ = W^(o) x̃,
g^(t) = ψ(g̃), where g̃ = W^(g) x̃,   (7)

where ψ_s and ψ refer to the sigmoid and Tanh activation functions, and W^(i), W^(f), W^(o) and W^(g) are the weight matrices of the input, forget, output and gate gates, respectively. Note that we have dropped the superscript t from the vectors ĩ, f̃, õ, and g̃ to simplify the mathematical notation. At time t, the cell state c^(t) and the hidden state s^(t) are updated as follows:

c^(t) = f^(t) ⊙ c^(t−1) + i^(t) ⊙ g^(t),   (8)
s^(t) = o^(t) ⊙ ψ(c^(t)).   (9)

Fig. 2. A schematic layout of TRUST-LSTM with a detailed view of the operations in an LSTM cell. The following color coding is used: blue for the input {x^(t)}_{t=1}^{τ}, green for the hidden state {s^(t)}_{t=1}^{τ}, red for the concatenated input and hidden state x̃ = [x^(t) s^(t−1)]^T, purple for the outputs of the gates, grey for the cell memory/state {c^(t)}_{t=1}^{τ}, and orange for the activation functions.

2) TRUST-LSTM: Fig. 2 presents the TRUST-LSTM network with the detailed operations performed inside one cell. We first consider (w_i^i)^T, (w_i^f)^T, (w_i^o)^T and (w_i^g)^T ∈ R^{1×(K+H)}, which represent the i-th rows of the matrices W^(i), W^(f), W^(o), and W^(g), respectively, where i = 1, ..., K + H. We introduce Gaussian prior distributions over the weight vectors w_i^i, w_i^f, w_i^o, and w_i^g. The variational distributions are given by

w_i^i ∼ N(μ_{w_i^i}, Σ_{w_i^i}),  w_i^f ∼ N(μ_{w_i^f}, Σ_{w_i^f}),  w_i^o ∼ N(μ_{w_i^o}, Σ_{w_i^o}),  w_i^g ∼ N(μ_{w_i^g}, Σ_{w_i^g}).   (10)

The weight vectors are assumed to be independent of each other and independent of the input x̃. We write the individual elements of the vectors ĩ, f̃, õ, and g̃ in (7) as

ĩ_i = (w_i^i)^T x̃,  f̃_i = (w_i^f)^T x̃,  õ_i = (w_i^o)^T x̃,  g̃_i = (w_i^g)^T x̃.   (11)

Each of these elements is the result of an inner product between two independent random vectors (weight-input vectors). Therefore, the mean and the covariance of ĩ, f̃, õ and g̃ are derived using Proposition 1.

Gates Activation Functions in LSTM:
The mean and variance-covariance matrices at the output of the non-linear activation functions ψ_s and ψ in (7) are derived using the first-order Taylor series approximation [48], following (6).

Cell State/Memory c^(t): The derivation of the mean and covariance of the cell state c^(t) requires propagating the moments through the Hadamard product. We rewrite (8) as

c^(t) = c̃_1 + c̃_2, where c̃_1 = f^(t) ⊙ c^(t−1) and c̃_2 = i^(t) ⊙ g^(t).   (12)

The three gates f^(t), g^(t), i^(t) and the cell state c^(t−1) are assumed to be uncorrelated with each other; consequently, c̃_1 and c̃_2 are also uncorrelated. Under this condition, the mean and covariance of c̃_1 (and similarly of c̃_2) are derived according to Proposition 2 (the proof is in Appendix B in the supplementary materials, available online).

Proposition 2 (mean and variance-covariance propagation through a Hadamard product): Let c̃_1 = f^(t) ⊙ c^(t−1), where f^(t) and c^(t−1) are independent random vectors. Then

μ_{c̃_1} = μ_{f^(t)} ⊙ μ_{c^(t−1)},
Σ_{c̃_1} = Σ_{f^(t)} ⊙ Σ_{c^(t−1)} + diag(μ_{f^(t)}) Σ_{c^(t−1)} diag(μ_{f^(t)}) + diag(μ_{c^(t−1)}) Σ_{f^(t)} diag(μ_{c^(t−1)}),   (13)

where diag(μ_{f^(t)}) represents the diagonal matrix whose entries are given by the column vector μ_{f^(t)}. Finally, the mean and covariance of the cell state c^(t) are derived as follows:

μ_{c^(t)} = μ_{c̃_1} + μ_{c̃_2},  Σ_{c^(t)} = Σ_{c̃_1} + Σ_{c̃_2}.   (14)

Hidden State s^(t) in LSTM: To find the mean and variance-covariance of the hidden state, we rewrite (9) as

s^(t) = o^(t) ⊙ ψ(c^(t)).   (15)

The moments after the element-wise non-linear transformation in (15) are approximated using the Taylor series

μ_{ψ(c^(t))} ≈ ψ(μ_{c^(t)}),  Σ_{ψ(c^(t))} ≈ Σ_{c^(t)} ⊙ (∇ψ(μ_{c^(t)}) ∇ψ(μ_{c^(t)})^T).   (16)

Then, the moments of the hidden state s^(t) in (15) are derived using Proposition 2 (propagation through the Hadamard product).
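To make Propositions 1 and 2 concrete, a small NumPy sketch of the cell-state update (12)-(14) follows. This is our own illustration under the paper's uncorrelatedness assumptions, not the authors' implementation:

```python
import numpy as np

def hadamard_moments(mu_a, Sigma_a, mu_b, Sigma_b):
    """Moments of a ⊙ b for independent random vectors a and b (Proposition 2).
    Illustrative sketch only."""
    mu = mu_a * mu_b
    Sigma = (Sigma_a * Sigma_b
             + np.diag(mu_a) @ Sigma_b @ np.diag(mu_a)
             + np.diag(mu_b) @ Sigma_a @ np.diag(mu_b))
    return mu, Sigma

def cell_state_moments(mu_f, S_f, mu_c_prev, S_c_prev, mu_i, S_i, mu_g, S_g):
    """Mean/covariance of c^(t) = f ⊙ c^(t-1) + i ⊙ g, cf. (12)-(14);
    the two Hadamard terms are treated as uncorrelated."""
    mu1, S1 = hadamard_moments(mu_f, S_f, mu_c_prev, S_c_prev)
    mu2, S2 = hadamard_moments(mu_i, S_i, mu_g, S_g)
    return mu1 + mu2, S1 + S2
```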

E. Variational Moments Propagation in TRUST-GRUs
1) The Structure of a GRU: The GRU network, introduced by [50], has a simpler internal structure than the LSTM, i.e., two gates, a reset gate and an update gate. The reset gate r^(t) filters out previously retained irrelevant information from the hidden layer, and the update gate z^(t) controls the information added to the hidden state. The input and the hidden state from the previous time step are concatenated, x̃ = [x^(t) s^(t−1)]^T, and linearly combined with the weight matrices. Using sigmoid activation functions ψ_s, the outputs of the reset and update gates are given as follows:

r^(t) = ψ_s(r̃), where r̃ = W^(r) x̃,
z^(t) = ψ_s(z̃), where z̃ = W^(z) x̃,   (17)

where W^(r) and W^(z) are the weight matrices of the reset and update gates, respectively. Before we calculate the current hidden state s^(t), we need to build a candidate hidden state h^(t) using the concatenation of the input x^(t) with [s^(t−1) ⊙ r^(t)], that is,

x̄ = [x^(t)  s^(t−1) ⊙ r^(t)]^T,   (18)
h^(t) = ψ(W^(h) x̄),   (19)

where W^(h) is the weight matrix of the candidate hidden state. Now, the hidden state s^(t) is given by

s^(t) = z^(t) ⊙ s^(t−1) + (1 − z^(t)) ⊙ h^(t).   (20)

Fig. 3. A schematic layout of the proposed TRUST-GRU. One GRU cell is expanded in the bottom view. The following color scheme is used: blue for the input vector x^(t), green for the hidden states s^(t), red for the concatenated input-state vector x̃ = [x^(t) s^(t−1)]^T, gray for the concatenated vector x̄ = [x^(t) s^(t−1) ⊙ r^(t)]^T, purple for the output of a Hadamard product between two vectors or the output of a non-linear activation function, and orange for the activation functions.

2) TRUST-GRU: Fig. 3 presents the general structure of a TRUST-GRU network and the operations performed within one GRU cell. Similar to TRUST-LSTM, we derive the propagation of the mean and the variance-covariance matrix of the variational posterior distribution through a single GRU cell.
Consider (w_i^r)^T, (w_i^z)^T and (w_i^h)^T ∈ R^{1×(K+H)} to be the i-th rows of the matrices W^(r), W^(z) and W^(h), respectively, where i = 1, ..., K + H. We introduce a Gaussian distribution as the prior over the weight vectors w_i^r, w_i^z, and w_i^h. The variational distributions are given as

w_i^r ∼ N(μ_{w_i^r}, Σ_{w_i^r}),  w_i^z ∼ N(μ_{w_i^z}, Σ_{w_i^z}),  w_i^h ∼ N(μ_{w_i^h}, Σ_{w_i^h}).

In our settings, the weight vectors are assumed to be independent of each other and of the inputs x̃ and x̄. Every element of the random vectors r̃ and z̃ in (17) can be written as an inner product between two independent random vectors, i.e., r̃_i = (w_i^r)^T x̃ and z̃_i = (w_i^z)^T x̃. Therefore, the mean and covariance of the random vectors r̃ and z̃ are derived using Proposition 1.
Gates Activation Functions in GRU: We approximate the mean and covariance at the output of the non-linear activation function ψ_s in the reset and update gates, given in (17), using the first-order Taylor series as

μ_{r^(t)} ≈ ψ_s(μ_{r̃}),  Σ_{r^(t)} ≈ Σ_{r̃} ⊙ (∇ψ_s(μ_{r̃}) ∇ψ_s(μ_{r̃})^T),   (21)

and similarly for z^(t). Hidden State s^(t) in GRU: We start with the candidate state h^(t) and the concatenation operation defined in (18) and (19). By introducing a = s^(t−1) ⊙ r^(t), we can derive the mean and the covariance of a using Proposition 2. Thus, the mean and the covariance matrix of x̄ in (18) are given as

μ_{x̄} = [μ_{x^(t)} μ_a]^T,  Σ_{x̄} = blkdiag(Σ_{x^(t)}, Σ_a).   (22)

The mean and covariance matrix of h^(t) as defined in (19) are derived using Proposition 1 and then the first-order Taylor series approximation (following (21)). Finally, we rewrite the hidden state s^(t) (defined in (20)) as s^(t) = s̃_1 + s̃_2, where s̃_1 = z^(t) ⊙ s^(t−1) and s̃_2 = (1 − z^(t)) ⊙ h^(t). The mean and the covariance of s̃_1 and s̃_2 are derived using Proposition 2. Since s̃_1 and s̃_2 are correlated, the mean and the variance-covariance matrix of the current hidden state s^(t) are derived as follows:

μ_{s^(t)} = μ_{s̃_1} + μ_{s̃_2},  Σ_{s^(t)} = Σ_{s̃_1} + Σ_{s̃_2} + Σ_{s̃_1 s̃_2} + Σ_{s̃_1 s̃_2}^T,   (23)

where Σ_{s̃_1 s̃_2} is the cross-covariance matrix of the two random vectors s̃_1 and s̃_2 (the detailed derivation of Σ_{s̃_1 s̃_2} is in Appendix C in the supplementary materials, available online).
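The final combination step in (23) is simple once the cross-covariance is available; the following sketch (our own, treating Σ_{s̃_1 s̃_2} as a given input, since its derivation is in Appendix C) shows the combination:

```python
import numpy as np

def correlated_sum_moments(mu1, S1, mu2, S2, S12):
    """Moments of s = s1 + s2 when s1 and s2 are correlated with
    cross-covariance S12, as in (23). S12 is assumed to be supplied
    by the derivation in Appendix C."""
    return mu1 + mu2, S1 + S2 + S12 + S12.T
```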

F. Model Output and Loss Function
For the prediction tasks, the output of the network (RNN, LSTM, or GRU) consists of a fully-connected layer,

ỹ = W^(y) s^(T),   (24)

where W^(y) is the weight matrix. We use Proposition 1 to propagate the mean and the variance of the output.
In the case of classification problems, the network's output layer applies a softmax function ϕ, i.e., ŷ = ϕ(ỹ), after the fully-connected layer. Even though the softmax function does not operate element-wise, we can use the Taylor series to approximate the mean μ_ŷ and the covariance matrix Σ_ŷ as follows [30]:

μ_ŷ ≈ ϕ(μ_ỹ),  Σ_ŷ ≈ J_ϕ Σ_ỹ J_ϕ^T,   (25)

where J_ϕ is the Jacobian matrix of the softmax function ϕ evaluated at μ_ỹ.
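As an illustration of (25), the softmax Jacobian has the closed form J_ϕ = diag(ϕ(μ_ỹ)) − ϕ(μ_ỹ) ϕ(μ_ỹ)^T, which a short sketch can exploit (our code, not the authors'):

```python
import numpy as np

def softmax_moments(mu_y, Sigma_y):
    """First-order Taylor propagation through softmax, cf. (25).
    Illustrative sketch; J = diag(p) - p p^T is the softmax Jacobian."""
    p = np.exp(mu_y - mu_y.max())  # shift for numerical stability
    p /= p.sum()
    J = np.diag(p) - np.outer(p, p)  # Jacobian of softmax at mu_y
    return p, J @ Sigma_y @ J.T
```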
In the ELBO loss function in (1), we assume a diagonal covariance matrix for the initial variational distribution and
use the first-order Taylor series to approximate the expectation. Thus, the expected log-likelihood in (1) is written as

E_{q_φ(Ω)}[log p(D|Ω)] ≈ −(1/2) Σ_{i=1}^{N} Σ_{t=1}^{τ} [ (y_t^i − μ_{ŷ_t}^i)^T (Σ_{ŷ_t}^i)^{−1} (y_t^i − μ_{ŷ_t}^i) + log |Σ_{ŷ_t}^i| ] + const.,   (26)

where N refers to the number of independently and identically distributed (iid) data points (i.e., sequences of data observations), y_t^i is the ground-truth output of the i-th data sequence (t = 1, ..., τ time steps), and μ_{ŷ_t}^i and Σ_{ŷ_t}^i are the mean and the covariance matrix of the predicted output ŷ_t of the TRUST model at time t.
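A direct transcription of one summand of (26) into NumPy might read as follows (a sketch; constants independent of the parameters are dropped):

```python
import numpy as np

def gaussian_nll_term(y, mu_hat, Sigma_hat):
    """One summand of the negative expected log-likelihood in (26);
    parameter-independent constants are dropped. Illustrative sketch."""
    resid = y - mu_hat
    _, logdet = np.linalg.slogdet(Sigma_hat)  # stable log|Sigma|
    return 0.5 * (resid @ np.linalg.solve(Sigma_hat, resid) + logdet)
```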
The regularization term in (1) is the KL-divergence between two multivariate Gaussian distributions, i.e., the variational posterior distribution and the prior distribution, defined over the network parameters for all hidden units H in the RNN, LSTM, or GRU network and the fully-connected layers [51]. The KL regularization terms for TRUST-RNN, TRUST-LSTM and TRUST-GRU are given in Appendix D in the supplementary materials.
RNNs, LSTMs and GRUs can be set up in many different ways depending upon the problem and the dataset, such as one-to-one, one-to-many, many-to-one, many-to-many, and many-to-many with a bottleneck (encoder-decoder type) [52]. The expected log-likelihood in the ELBO loss function presented in (26) is applicable to all arrangements. For the one-to-one and many-to-one arrangements, we have t = τ. The propagation of the mean and the variance-covariance, as well as the KL-divergence regularization terms, do not require any modification for handling different output arrangements. Moreover, the mean and covariance propagation through the TRUST models presented above can be easily extended to different variations of LSTM and GRU models, including stacked and bidirectional architectures [53], [54], [55], [56], [57], [58], [59], [60], [61], [62].

G. Back-Propagation Through Time (BPTT)
The back-propagation operation, referred to as BPTT in sequence-based models, involves computing the gradient of the ELBO loss function L(φ; D) with respect to the variational parameters φ. BPTT for the TRUST models is performed using libraries such as PyTorch, TensorFlow or JAX. For the TRUST models, the set of parameters φ includes the mean vectors and variance-covariance matrices of the weights and biases in the RNN, LSTM or GRU network and in the output layer. Finally, the gradient ∇_φ L(φ; D) is used to update the parameters φ using the gradient descent update rule.
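For concreteness, one training step could be organized as below; trust_rnn, its kl_term() method, and gaussian_nll are hypothetical placeholders for a module implementing the moment propagation of Section III, not an API from the paper:

```python
import torch

def gaussian_nll(y, mu, Sigma):
    # Negative expected log-likelihood term of the ELBO, cf. (26).
    resid = (y - mu).unsqueeze(-1)
    quad = resid.transpose(-1, -2) @ torch.linalg.solve(Sigma, resid)
    return 0.5 * (quad.squeeze() + torch.logdet(Sigma)).sum()

def train_step(trust_rnn, optimizer, x_seq, y_seq, kl_weight):
    """One BPTT update; trust_rnn returns the propagated predictive moments
    and exposes a kl_term() method (both are our own placeholders)."""
    optimizer.zero_grad()
    mu_y, Sigma_y = trust_rnn(x_seq)  # propagated moments at the output
    loss = gaussian_nll(y_seq, mu_y, Sigma_y) + kl_weight * trust_rnn.kl_term()
    loss.backward()                   # BPTT via automatic differentiation
    optimizer.step()
    return loss.item()
```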

IV. EXPERIMENTS
This section explains the experimental settings used to evaluate the proposed TRUST-LSTMs and TRUST-GRUs against the state-of-the-art Bayesian and deterministic homologs. We exclude the simple RNN architecture (i.e., without the gate structure) from the simulation because it is well known that simple RNNs suffer from vanishing and exploding gradient problems due to long-term dependencies, which makes them impractical [28], [63], [64], [65], [66]. We focus on classification and univariate prediction (or forecasting) tasks in the experiments. The simulation can be easily extended to multivariate forecasting cases and other time-series tasks. In our experiments, TRUST models are compared with Bayes-by-Backprop recurrent neural networks (BBB-RNNs) [21], variational or VAR-RNNs [23], DeepAR (DAR) [40] and deterministic RNNs (DET-RNNs) in both LSTM and GRU configurations using four different datasets. The original DAR model architecture in the literature includes only the LSTM network for prediction tasks [40]; thus, we compare with DAR-LSTM in our simulation of the prediction datasets. The experiments include training, validating and testing 34 different neural networks, i.e., four models (TRUST, BBB, VAR and DET), two configurations (LSTM and GRU), four datasets, and the DAR model on the two prediction datasets. The datasets include the weather and power consumption datasets for the prediction tasks [31], [32], and ECG5000 and PeMS-SF for the classification tasks [33]. Appendix E in the supplementary materials, available online, provides a detailed description of the datasets. Table I presents the architectural hyper-parameters, including the number of layers, the number of hidden units in each layer, and the batch size for each dataset. These hyper-parameters are the same for all five models (TRUST, BBB, VAR, DAR, and DET) and both configurations (LSTM and GRU). The Adam algorithm is used as the optimizer [67] with a decaying learning rate (LR) and a polynomial schedule [68]. The number of epochs, the initial and final learning rates, and the KL weighting factors for the TRUST models are presented in Table I (columns 5-8). For all other models, we fine-tune these four hyper-parameters to ensure the convergence of each model to its best performance. The KL weighting factor for the PeMS-SF dataset is set to 10^(-3) for the TRUST-LSTM model and 0.01 for the TRUST-GRU model.

A. Robustness Analysis
We establish the robustness of the TRUST models against additive Gaussian noise and adversarial attacks compared to the BBB, VAR, DAR, and DET models. The robustness analysis is performed on the trained and validated models for the respective datasets. First, we evaluate the performance of each model on clean test data (without noise). Then, different levels of Gaussian or adversarial noise are added to the test examples to evaluate each model's performance.
Three levels of Gaussian noise are used, low, medium, and high, determined by the noise required to introduce sufficient distortion in the test examples. We use the standard deviation (SD) to define the noise levels for each dataset (Table II).
The adversarial examples are generated using the fast gradient sign method (FGSM) and the basic iterative method (BIM) [69], [70]. We use untargeted attacks for the prediction tasks and both targeted and untargeted attacks for the classification tasks. We use three levels of severity for both types of adversarial attacks. The severity of the attacks is defined using ε, as given below [70]:

X_FGSM = X + ε sign(∇_X J(X, y)),   (27)
X_BIM^(k+1) = clip_{X,ε}{ X_BIM^k + α sign(∇_X J(X_BIM^k, y)) },   (28)

where J is the ELBO objective function in (1), α is the step size, k is the number of iterations for the BIM attack, and X_BIM^0 = X. We choose α = 1 and set the maximum number of iterations to 100. The clip operation in (28) keeps the perturbed example within the ε-neighborhood of the original input X. Table II provides the ε values for the three levels of attacks applied to the test examples of each dataset.
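The attacks in (27) and (28) can be sketched in PyTorch as follows; model and loss_fn are placeholders standing in for a trained TRUST network and its ELBO objective:

```python
import torch

def fgsm(model, loss_fn, x, y, eps):
    """Untargeted FGSM attack, cf. (27)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def bim(model, loss_fn, x, y, eps, alpha=1.0, iters=100):
    """Basic iterative method, cf. (28); each step is clipped to the
    eps-neighborhood of the original input x."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss_fn(model(x_adv), y).backward()
        x_adv = torch.clamp(x_adv.detach() + alpha * x_adv.grad.sign(),
                            x - eps, x + eps)
    return x_adv
```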

B. Variance-vs-SNR Analysis
The proposed TRUST models provide the output prediction and uncertainty information simultaneously, in the form of the mean and variance-covariance matrix of the predictive distribution. The analysis of uncertainty under noisy conditions (when the networks are subject to Gaussian noise or adversarial attacks) provides insights into the network's performance after deployment and enables possible detection of model failure due to changes in the input.
We perform a detailed analysis of the predictive variance for the TRUST, BBB, VAR and DAR models at various levels of Gaussian noise and adversarial attacks. The variance analysis involves first testing the trained models on clean test data and then gradually increasing the noise level (Gaussian or adversarial) in the test data. The amount of noise at each level is measured using the signal-to-noise ratio (SNR). The average predictive variance is calculated over all the test examples at each noise level.
For the prediction tasks, the outputs of the TRUST models consist of a predictive mean value and a single variance value for each sample at the next time step. In contrast, for the classification tasks, the outputs contain a predictive mean vector and a variance-covariance matrix that reflects the uncertainty in all classes (diagonal elements) and the correlation between different classes (off-diagonal elements). We use the variance, i.e., the diagonal value corresponding to the predicted class, for our analysis.
The predictive variance for the BBB and VAR models is the sample variance of the predictions over 20 forward passes (MC samples) through the respective network for each test example. We subtract the average predictive variance at zero noise (clean test examples) from the variance values at each noise level. The resulting average predictive variance values are plotted against the respective SNR values to produce variance-vs-SNR curves.
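For reference, the MC-based predictive variance used for the BBB and VAR baselines, and the SNR computation, might look as follows (a sketch; predict_fn is a placeholder for one stochastic forward pass):

```python
import numpy as np

def mc_predictive_variance(predict_fn, x, n_samples=20):
    """Sample mean/variance of predictions over stochastic forward passes,
    as used for the BBB and VAR baselines (20 MC samples)."""
    preds = np.stack([predict_fn(x) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.var(axis=0)

def snr_db(clean, noisy):
    """Signal-to-noise ratio of a noisy test example, in dB."""
    noise = noisy - clean
    return 10.0 * np.log10(np.sum(clean**2) / np.sum(noise**2))
```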
The variance-vs-SNR curves in Section V are interpreted from right to left. The average predictive variance for test data with a very high SNR (low noise) is plotted as a point on the extreme right side of the graph. The addition of noise leads to a decrease in the SNR values, moving from right to left. The extreme left point presents the average variance at the lowest SNR (highest noise).

C. Statistical Analysis
We perform statistical analysis to establish whether the TRUST models perform significantly better than the BBB, VAR, DAR, and DET models. Given the non-normal nature of the data, i.e., RMSE and classification accuracy, we use the two-sided Wilcoxon signed-rank test to perform pair-wise comparisons between TRUST and the other models, e.g., TRUST-GRU vs. BBB-GRU or TRUST-LSTM vs. VAR-LSTM. We mark statistically significant differences in the RMSE or classification accuracy of TRUST models compared to the BBB, VAR, DAR, or DET models with symbols: a (†) denotes statistical significance at the 99% level, and a second symbol denotes significance at the 95% level.
In the variance-vs-SNR analysis, we aim to identify the noise level that results in a statistically significant increase in the predictive variance. Therefore, we perform pair-wise comparisons between the average predictive variance at each noise level and the variance at zero noise (clean test data) using the Wilcoxon signed-rank test. The significance level is 95%, and the point of the significant increase in the predictive variance is marked with a (*) on the variance-vs-SNR curves.
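A typical invocation of this test with SciPy is shown below; the per-sequence RMSE arrays are synthetic stand-ins for real results:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-sequence test RMSEs for two models on the same test set.
rng = np.random.default_rng(1)
rmse_trust = rng.normal(0.03, 0.005, size=50)
rmse_var = rng.normal(0.05, 0.010, size=50)

# Two-sided Wilcoxon signed-rank test for paired, non-normal samples.
stat, p_value = wilcoxon(rmse_trust, rmse_var, alternative="two-sided")
print(p_value < 0.01)  # significant at the 99% level?
```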

V. RESULTS

A. Performance Analysis and Robustness
This section evaluates the performance and establishes the robustness of the proposed TRUST models compared to the BBB, VAR, DAR, and DET models. The root mean square error (RMSE) is the performance metric for the prediction tasks, and the classification accuracy is the metric for the classification tasks.
1) Prediction Tasks: Table III presents the RMSE values of the TRUST, BBB, VAR, and DET models in both LSTM and GRU configurations, and the DAR-LSTM model, for the weather and power consumption datasets. The RMSE values are reported for the test data before and after adding three levels of Gaussian noise and adversarial attacks. Lower RMSE values demonstrate better performance. We observe that increasing the level of Gaussian noise or adversarial attacks results in an increase in the RMSE for all models. However, at higher noise levels, TRUST models perform significantly better than all other models. The bold font refers to the model that significantly outperforms all tested models.

TABLE III: TEST RMSE OF TRUST-LSTM/GRU COMPARED TO BBB, VAR, DAR, AND DET MODELS. ALL MODELS ARE TESTED USING (1) NOISE-FREE TEST DATA, (2) THREE LEVELS OF GAUSSIAN NOISE, AND (3) TWO TYPES OF ADVERSARIAL ATTACKS, FGSM AND BIM, WITH THREE LEVELS OF ATTACK SEVERITY.
Weather dataset: In Table III(a), we observe that for the noise-free case, the proposed TRUST-LSTM/GRU models perform significantly better than the VAR, DAR, and DET models, yet slightly worse than the BBB models. However, as the level of Gaussian noise increases, the TRUST models maintain a significantly lower RMSE. Likewise, TRUST models significantly outperform all other models (p < .01 for all pair-wise comparisons) at higher levels of FGSM and BIM adversarial noise.
Power consumption dataset: Table III(b) shows that TRUST-LSTM significantly outperforms all other models (p < .01) when Gaussian noise or adversarial attacks are added to the test set. TRUST-GRU, in turn, performs significantly better than all other models only at high levels of Gaussian noise and adversarial attacks.
2) Classification Tasks: Table IV presents the classification accuracy for the TRUST, BBB, VAR, and DET models in both LSTM and GRU configurations for the PeMS-SF and ECG5000 datasets. The classification accuracy is measured on the clean test data and after adding three levels of Gaussian and adversarial noise (using targeted and untargeted settings) to the test data. For the targeted adversarial attacks, the target class is class label 3. We notice that there are no statistical differences between the performance of TRUST and the other models on the clean test dataset. However, when Gaussian and particularly targeted/untargeted adversarial noise is added to the test data, the TRUST models significantly outperform the BBB, VAR, and DET models. The bold font in Table IV refers to the model that significantly outperforms all tested models.

B. Uncertainty and Self-Assessment
We use the predictive variance at the output of the TRUST models as a quantitative metric to assess their performance without any additional data processing or computational burden. We refer to this as self-assessment, since the TRUST models can ascertain whether their predictions are trustworthy based on the variance information. For example, high variance reflects high uncertainty, or low confidence, in the prediction. We notice that with an increasing noise level (or, equivalently, a decreasing SNR) for both Gaussian and adversarial noise, the RMSE values of all models increase. However, TRUST models maintain a lower RMSE, and the average predictive variance of the TRUST models increases significantly. The TRUST models can use this significant increase in the predictive variance to assess their own performance (self-assessment). The predictive variance of the BBB, VAR, and DAR models does not show such behavior.
1) Prediction Tasks: Weather dataset: With a decreasing SNR, we observe that the TRUST-LSTM variance significantly increases at SNR ≤ 8 dB for the Gaussian noise (Fig. 4(i)(a)) and for both types of adversarial attacks, FGSM (Fig. 4(i)(c)) and BIM (Fig. 4(i)(e)). The right sub-figures in Fig. 4(i) show that the RMSE increases from 0.024 at high SNR values to 0.063 for Gaussian noise (Fig. 4(i)(b)) and 0.2 for both FGSM and BIM (Fig. 4(i)(d) and (f)) at the (*) point. Thus, when the TRUST models are less accurate (higher RMSE values), they become uncertain (significant increase in the variance). The DAR variance increases only at high levels of adversarial attacks (SNR ≤ 4 dB), failing to detect the attacks at multiple earlier levels.
In Fig. 4(ii), the significant increase in the TRUST-GRU variance is observed at SNR ≤ 8 dB for the Gaussian noise (Fig. 4(ii)(a)) and at SNR ≤ 14 dB for FGSM (Fig. 4(ii)(c)) and BIM (Fig. 4(ii)(e)). The right sub-figures in Fig. 4(ii) show that the RMSE of TRUST-GRU increases from 0.028 at high SNR values to 0.063 for Gaussian noise (Fig. 4(ii)(b)) and 0.11 for both FGSM and BIM (Fig. 4(ii)(d) and (f)). The variance-vs-SNR curves of the BBB and VAR models (Fig. 4(i) and (ii)) do not demonstrate a significant increase in the predictive variance with increasing levels of Gaussian noise or severity of adversarial attacks.
Power consumption dataset: In Fig. 5(ii), the significant increase in the TRUST-GRU variance is observed at SNR ≤ 30 dB for the Gaussian noise and the BIM attack (Fig. 5(ii)(a) and (e)) and at SNR ≤ 16 dB for FGSM (Fig. 5(ii)(c)). The RMSE of TRUST-GRU increases from 0.5 at high SNR values to 0.51 for Gaussian noise (Fig. 5(ii)(b)), 0.88 for FGSM (Fig. 5(ii)(d)) and 0.56 for BIM (Fig. 5(ii)(f)). We also notice that the average predictive variance for the BBB and VAR models increases significantly with decreasing SNR for the Gaussian noise only (Fig. 5(i)(a) and (ii)(a)). However, the TRUST-LSTM/GRU models are more sensitive to noise, i.e., they detect the noise at higher SNR values than their BBB and VAR homologs. The variance-vs-SNR curves of the BBB, VAR and DAR models do not demonstrate a significant increase in the predictive variance with increasing severity of FGSM or BIM adversarial attacks (Fig. 5(i)(c) and (e), and (ii)(c) and (e)).
2) Classification Tasks: Fig. 6(i) and (ii) present the predictive variance (left sub-figures) and the classification accuracy (right sub-figures) plotted against SNR in the LSTM configuration for PeMS-SF and ECG5000.
PeMS-SF dataset: Although the classification accuracy of the BBB and VAR LSTMs severely declines at low SNR values (for both Gaussian and adversarial noise), the variance-vs-SNR curves (Fig. 6(i)) do not demonstrate a significant increase in the predictive variance.
ECG5000 dataset: With a decreasing SNR, we observe that the TRUST-LSTM variance significantly increases at SNR ≤ 8 dB for Gaussian noise (Fig. 6(ii)(a)) and SNR ≤ 22 dB for targeted FGSM and BIM (Fig. 6(ii)(c) and (g)). For untargeted FGSM and BIM, the TRUST-LSTM variance significantly increases at SNR ≤ 36 dB (Fig. 6(ii)(e) and (i)). The right sub-figures in Fig. 6(ii) show that the TRUST-LSTM classification accuracy declines from 93% at high SNR values to ∼90% under Gaussian noise (Fig. 6(ii)(b)) and ∼84% under targeted or untargeted FGSM and BIM at the (*) point (Fig. 6(ii)(d), (f), (h) and (j)). The variance-vs-SNR curves of the BBB and VAR models (Fig. 6(ii)) do not demonstrate a significant increase in the predictive variance with increasing levels of Gaussian noise or adversarial attacks. The variance-vs-SNR and classification-accuracy-vs-SNR curves of the TRUST, BBB and VAR models in the GRU configuration are provided in Appendix F in the supplementary materials.

C. Computational Comparison
The computational demand of the proposed TRUST models is comparable to that of their DET and VAR homologs and significantly less than that of the BBB and DAR models. Since the prior variational distribution has a diagonal covariance matrix, TRUST models add a single parameter (the variance) for every weight vector in a fully-connected (FC) layer. Thus, for an FC layer with H hidden units, the number of additional parameters is H. For an LSTM or GRU layer, the number of additional parameters is H times the number of weight matrices. The first-order Taylor series approximation of the mean and the variance-covariance matrix after the non-linear activation functions requires no additional parameters.
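As a quick sanity check on these counts, a few lines of Python compute layer totals under our own counting convention (biases folded into the weight matrices; dimensions H = 100, K = 30, C = 5 as assumed below for Table V). The exact convention used in the paper's table may differ:

```python
# Layer parameter counts; the counting convention (biases augmented into the
# weight matrices) is our assumption for illustration.
H, K, C = 100, 30, 5            # hidden units, input features, outputs

det_lstm = 4 * H * (K + H + 1)  # 4 gate weight matrices
det_gru = 3 * H * (K + H + 1)   # 3 gate weight matrices
det_fc = C * (H + 1)

trust_lstm = det_lstm + 4 * H   # one extra variance per weight vector per gate
trust_gru = det_gru + 3 * H
trust_fc = det_fc + C

bbb_lstm = 2 * det_lstm         # BBB keeps a mean and a variance per weight

print(det_lstm, trust_lstm, bbb_lstm)  # 52400, 52800, 104800
```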
Table V presents the total number of parameters for one LSTM layer (4 gates), one GRU layer (3 gates), and one FC layer for the TRUST, BBB, DAR, and VAR/DET models. We assume that the total number of hidden units is H = 100, the input vector size is K = 30, and the number of output neurons (or classes for classification problems) is C = 5. The increase in the number of parameters for TRUST models is ≈1.5% for the LSTM/GRU layers and for the FC layer compared to the DET/VAR models. The increase in the number of parameters of TRUST models is significantly lower than that of the BBB models, which require twice the number of parameters compared to the DET/VAR models. The number of parameters for the DAR-LSTM model is comparable to the number of DET-LSTM parameters, with an additional FC layer with two outputs for the mean and the standard deviation. The VAR models have the same number of parameters as the DET models because VAR models use the dropout approximation to estimate the variational distribution without adding parameters. Table VI presents the inference time, measured in seconds and calculated using all the test examples for each one of the datasets. We notice that the inference time of the TRUST models is higher than that of the DET models but generally lower than the inference time of the BBB, VAR and DAR models. The BBB, VAR, and DAR models require Monte Carlo sampling at test time to compute the average prediction and sample variance, which increases the inference time. Therefore, the robustness of the TRUST models is maintained without a notable increase in computational complexity or inference time, which makes the proposed models applicable to real-world applications.

D. Discussion
This paper introduces uncertainty estimation in deep sequence models, i.e., RNNs, LSTMs, and GRUs, focusing on time-series analysis. We propose the TRUST models based on Bayesian density propagation. TRUST-LSTMs and GRUs propagate the first two moments of the variational posterior distribution of the models' parameters and estimate the uncertainty in the models' predictions via the variance of the predictive distribution.
A detailed analysis of the proposed TRUST models is performed using four different time-series prediction and classification datasets. The performance of the TRUST models is compared with state-of-the-art LSTM/GRU networks, including BBB, VAR, DAR, and DET, under various levels of Gaussian noise and two different types of adversarial attacks, i.e., FGSM and BIM, in both targeted and untargeted settings.
Our analysis reveals that TRUST models maintain their performance and outperform the other models (BBB, VAR, DAR, and DET) when subject to Gaussian noise or adversarial attacks. Furthermore, the predictive variance of the TRUST models shows a statistically significant increase when the noise is high and the performance of the models starts to decrease. In general, we do not observe such an increase in the average predictive variance for the BBB, VAR or DAR models, which indicates that the variance at the output of these models does not capture the uncertainty in the model decision. We believe that propagating the moments of the variational distribution through the network layers transmits vital information about the strong and weak features of the data from the hidden states to the output layer. The second moment (i.e., the variance) of the variational distribution over the parameters acts as a filter on the input features and modulates them according to their importance. This additional filtering of features via the variance of the variational distribution forces the predictive variance to increase statistically when these features are corrupted with noise or adversarial attacks.

VI. CONCLUSION
This paper proposes a new framework for TRustworthy Uncertainty propagation for Sequential Time-series analysis in recurrent neural networks, i.e., RNN, LSTM, and GRU networks, named TRUST-RNNs. TRUST models consider the network parameters as random variables and approximate the variational posterior distribution by minimizing the evidence lower bound, which allows for estimating the output uncertainty. TRUST-RNNs propagate the first two moments of the variational distribution through the sequential network layers. Extensive experiments on various time-series classification and prediction datasets have demonstrated statistically significant robustness against Gaussian noise and adversarial attacks compared to state-of-the-art Bayesian and deterministic RNNs. We have also shown that propagating uncertainty in RNNs and their variants inherently results in self-aware models that can assess their own performance and produce a statistically significant increase in the output uncertainty (measured by the predictive variance) in the presence of Gaussian noise and particularly adversarial attacks (targeted or untargeted).

Fig. 4. Weather dataset: The average predictive variance and RMSE, both plotted against SNR, for the TRUST, BBB, and VAR models in both (i) LSTM and (ii) GRU configurations, and DeepAR (DAR) LSTM. Various levels of noise (Gaussian, FGSM or BIM) are added to the test data, and the SNR values are calculated and plotted on the x-axes in all sub-figures. The variance values are averaged over all test samples. Interpreting the sub-figures from right to left: we observe a significant increase (indicated by *) in the predictive variance of TRUST models with increasing noise levels or severity of adversarial attacks (or, equivalently, decreasing SNR); however, the BBB, VAR and DAR models do not demonstrate such behavior.

Fig. 5. Power consumption dataset: The average predictive variance and RMSE, both plotted against SNR, for the TRUST, BBB, and VAR models in both (i) LSTM and (ii) GRU configurations, and DeepAR (DAR) LSTM. Various levels of noise (Gaussian, FGSM or BIM) are added to the test data, and the SNR values are calculated and plotted on the x-axes in all sub-figures. The variance values are averaged over all test samples. Interpreting the sub-figures from right to left: we observe a significant increase (indicated by *) in the predictive variance of TRUST models with increasing noise levels or severity of adversarial attacks (or, equivalently, decreasing SNR). The BBB and VAR variance shows an increase when Gaussian noise is added; however, no significant increase in the variance is noticed when the models are subject to FGSM or BIM attacks.

Fig. 6. The average predictive variance and classification accuracy, both plotted against SNR, for the TRUST, BBB, and VAR LSTMs for the (i) PeMS-SF and (ii) ECG5000 datasets. Various noise levels (Gaussian, FGSM, and BIM, targeted and untargeted) are added to the test data, and the SNR values are calculated and plotted on the x-axes in all sub-figures. The variance values are averaged over all test samples. The (*) indicates the statistically significant increase in the predictive variance as the noise increases (or the SNR decreases).

TABLE I: DATASETS AND HYPER-PARAMETERS USED IN THE EXPERIMENTS.

TABLE II: THE LEVELS OF GAUSSIAN NOISE (STANDARD DEVIATION (SD)) AND THE SEVERITY OF ADVERSARIAL ATTACKS (ε) USED IN THE EXPERIMENTS FOR ALL FOUR DATASETS.

TABLE IV: THE CLASSIFICATION ACCURACY OF TRUST, BBB, VAR, AND DET MODELS. ALL MODELS ARE TESTED USING (1) NOISE-FREE TEST DATA, (2) THREE LEVELS OF GAUSSIAN NOISE, AND (3) TWO TYPES OF ADVERSARIAL ATTACKS, FGSM AND BIM (TARGETED/UNTARGETED), EACH WITH THREE LEVELS OF ATTACK SEVERITY.

TABLE VI: THE INFERENCE TIME IN SECONDS FOR ALL MODELS IN BOTH LSTM AND GRU CONFIGURATIONS, CALCULATED USING CLEAN TEST EXAMPLES.