Causality-Aware Predictions in Static Anticausal Machine Learning Tasks

Elias Chaibub Neto

Abstract - We propose a counterfactual approach to train "causality-aware" predictive models that are able to leverage causal information in static anticausal machine learning tasks (i.e., prediction tasks where the outcome influences the inputs). In applications plagued by confounding, the approach can be used to generate predictions that are free from the influence of observed confounders. In applications involving observed mediators, the approach can be used to generate predictions that only capture the direct or the indirect causal influences. Mechanistically, we train supervised learners on (counterfactually) simulated inputs that retain only the associations generated by the causal relations of interest. We focus on linear models, where analytical results connecting covariances, causal effects, and prediction mean square errors are readily available. Quite importantly, we show that our approach does not require knowledge of the full causal graph. It suffices to know which variables represent potential confounders and/or mediators. We investigate the stability of the method with respect to dataset shifts generated by selection biases and also relax the linearity assumption by extending the approach to additive models better able to account for nonlinearities in the data. We validate our approach in a series of synthetic data experiments and illustrate its application to a real dataset.

Index Terms - Anticausal prediction task, causality, dataset shifts, out-of-distribution generalization, regression.

I. INTRODUCTION
Causal modeling has been recognized as a potential solution to many challenging problems in machine learning (ML) [30], [37]. Current approaches operating at the intersection between causality and ML can be roughly split into three different classes. The first focuses on the prediction of the consequences of different actions/interventions, aiming to improve decision making [6], [18], [38], [44]. These approaches attempt to answer "what if" counterfactual questions, such as "what if I had treated a patient differently?" The second class focuses on the generation of invariant/stable predictions [4], [13], [16], [22], [23], [24], [31], [34], [40], [41], aiming to improve model generalization under dataset shifts [27], while the third class is largely concerned with the estimation of causal effects and only uses ML techniques as a tool to improve causal effect estimation [21].
In this article, our goal is to generate causality-inspired predictions that only leverage associations generated by the causal mechanisms that we are interested in modeling. To this end, we propose a simple counterfactual approach to train "causality-aware" predictive models, where we train and evaluate ML algorithms on (counterfactually) simulated inputs that retain only the associations of interest. For instance, in anticausal prediction tasks influenced by mediators and/or confounders where we are interested in the direct effects of the outcome on the inputs, we simulate counterfactual inputs containing only the associations generated by the direct causal effects (see Section IV-A for a simple illustrative example). This ability to generate learners that only leverage associations generated by the causal relations of interest is important in practice. For instance, in situations where confounding is unstable across the training and target populations (due to selection biases), the approach can be used to generate more stable predictions. Furthermore, in situations where the confounders and/or mediators represent sensitive variables (e.g., race), the approach can also be used to generate predictions that are free from the (direct or indirect) influence of the sensitive variables.
As described in detail in Section IV, our counterfactual inputs are generated using a type of soft intervention [11], [12], [20], [25] on the functional dependence of the inputs on the confounders. They differ in important ways from the counterfactual/potential outcome random variables that are commonly used in the statistics literature and are generated using atomic interventions (which set the values of the random variables to specific values, such as in Pearl's "do" interventions [28]).
We focus on linear models, where analytical results connecting covariances, causal effects, and prediction mean square error (mse) are readily available. Quite importantly, the approach does not require full knowledge of the causal graph describing the data generation process. It only requires partial domain knowledge about which variables represent potential confounders and/or mediators. (Notably, we describe how we can always reparameterize a linear model in a way that the covariance generated by the causal relations among the inputs is pushed toward the input error terms (and similarly for the covariances among the mediators and the covariances among the confounders) so that we can safely generate counterfactual data without even knowing how these variables are causally related.) In practice, this is an important advantage in applications for which full domain knowledge is not available.
We also investigate the stability of the proposed approach with respect to (w.r.t.) dataset shifts [27]. A standard assumption in supervised ML is that the training and test sets are independent and identically distributed. In practice, however, this assumption is often violated, and dataset shifts are commonly observed in the real world. At the same time, ML models are often capable of leveraging subtle statistical associations between the input (X) and outcome (Y) variables in the training data, including spurious associations generated by confounders (C) and other sources of biases in the data. As a consequence, predictions from confounded learners are often unstable across shifted test sets and can fail to generalize.
We focus on dataset shifts generated by selection biases [1], [15], [17] affecting the joint distribution of the confounders and outcome variable, P(C, Y). In real-world applications, selection biases often lead to the collection of nonrepresentative training sets and represent an important challenge for ML. We focus on the case where the test set can be shifted in unknown ways w.r.t. P(C, Y), a setting where adjustment approaches are sometimes applied to the training data alone [7], [19] with the hope that training an unconfounded model will be enough to generate stable predictions in shifted test sets (or to generate deconfounded predictions that are no longer influenced by the sensitive variables). Using both analytical results and synthetic/real data experiments, we show that this is insufficient, and deconfounding both the training and test set inputs can produce more stable predictions.
Furthermore, in order to relax the linearity assumption, we extend the causality-aware approach to additive models [14] that are better able to handle nonlinear associations in the data. While we do not provide analytical results in the context of additive models, our empirical evaluations on synthetic data experiments illustrate how the additive model-based causality-aware adjustment is still able to generate stable predictions under dataset shifts.
In summary, this article's contributions are fourfold: 1) we describe a novel counterfactual approach to train "causality-aware" predictive models, which leverages only the associations generated by the causal mechanisms of interest; 2) by leveraging a reparameterization of the linear structural causal models, we show that the approach does not require full knowledge of the data generation process; 3) we investigate the stability properties of the method w.r.t. dataset shifts generated by selection biases; and 4) we extend the causality-aware approach to more flexible additive models.
The remainder of this article is organized as follows. Section II describes related work. Section III presents notation and some key background definitions that are used throughout the text. Section IV presents the causality-aware approach. (To simplify the exposition, it first describes the method using univariate examples, before moving to the general multivariate case.) Section V presents an algorithmic description of the approach in the context of confounding adjustment. Section VI presents the stability investigations (both analytical and empirical results from synthetic data experiments). Section VII describes the extension to additive models alongside additional synthetic data experiments. Section VIII illustrates the application of the method to real data from a mobile health study on Parkinson's disease (PD). Finally, Section IX summarizes the contributions and presents further final remarks.

II. RELATED WORK
Causal approaches based on counterfactual thinking have been used in the context of ML applications to predict the outcomes of different actions, policies, and interventions using nonexperimental data [6], [18], [38], [44]. The goal is to make "what if" predictions of the consequences of different actions in order to guide decisions. These approaches, however, are only applicable to causal prediction tasks [36], where the "treatment" variables correspond to the inputs of the ML model and are the causes of the outcome variable. Our approach, on the other hand, focuses on static anticausal ML tasks [36] where the outcome influences the inputs.
Our work is similar in spirit to invariant prediction approaches [4], [13], [16], [24], [31], [34] or stable prediction approaches [22], [23], [40], [41] in the sense that it can also be used to generate predictions based on the stable properties of the data without absorbing unstable spurious associations. Invariant prediction approaches, however, rely on multiple training sets to learn invariances, while the causality-aware approach (and other stable prediction methods) only requires a single training set. Some stable prediction approaches require, nonetheless, full knowledge of the causal graph [40] or can only be directly used in causal prediction tasks [22], [23], while the causality-aware method only requires partial knowledge of the causal graph and is suited to anticausal tasks. In the particular context of linear models, anchor regression [35] represents an alternative invariant/stable prediction framework, which also does not require full knowledge of the causal graph and can be applied to both causal and anticausal prediction tasks. (We compare the causality-aware approach against anchor regression in Section VI.)

Supervised ML has also been extensively used to aid the estimation of causal effects, where it can potentially attenuate model misspecification issues. In particular, supervised ML has been used to improve the calculation of propensity scores, outcome models, and double-robust estimation (see [21] for a review). In this article, however, we take the opposite strategy where, instead of using ML to improve causal effect estimation, we leverage (partial) causal knowledge to improve the stability/robustness of ML predictions or remove the influence of sensitive variables from the predictions.

III. PRELIMINARIES
Throughout the text, we let X, C, and M represent, respectively, sets of inputs (features), confounders, and mediators, while Y represents the outcome (response) variable. The causality-aware counterfactual versions of X and M are represented, respectively, by X* and M*. We adopt the subscripts (or superscripts) tr and ts to represent the training and test sets, respectively.
Following [28] and [43], we adopt a mechanism-based approach to causation, where the statistical information encoded in the joint probability distribution of a set of variables is supplemented by a directed acyclic graph (DAG), also known as a causal diagram, describing our qualitative assumptions about the causal relations between the variables. The causal diagram is also described by a set of structural causal models [28], where each variable is represented by a deterministic function of the variable's parents and a random noise term (modeling the aspects of the system that are not captured by the model), and where the random noise variables are assumed to be independent of each other. In the particular case of linear systems, the linear structural causal models are given by V_j = Σ_k θ_{V_j V_k} V_k + U_j, where U_j represents the error term and the sum runs over the set of parents of V_j. Following [36], we denote prediction tasks where the outcome influences the inputs as anticausal prediction tasks, whereas tasks where the inputs influence the outcome are denoted as causal prediction tasks. Fig. 1(a) presents the DAG of a general anticausal predictive task, where X, C, and M are organized into arbitrary DAG subdiagrams, while Fig. 1(b) presents the general causal predictive task.

IV. PROPOSED APPROACH
Here, we describe the causality-aware approach. To facilitate the exposition, we present it first in the univariate case (see Section IV-A) before moving to the general multivariate case (see Section IV-B). In the multivariate case: 1) we describe the linear model reparameterization that allows us to compute the counterfactual inputs without knowledge about their causal relationships (see Section IV-B1); 2) we discuss parameter estimation via least squares and potential identification issues (see Section IV-B2); and 3) we present the multivariate extension of the univariate results (see Section IV-B3). Throughout this section, we describe the application of the causality-aware approach to the training set, where we have access to the outcome data. The application to the test set is described in Section V, where we assume that we do not have access to the outcome data (since this is the quantity we want to predict).

A. Univariate Case
For the sake of clarity, we first describe our approach in the special case where X, C, and M are each composed of a single variable. We describe how to use counterfactual reasoning to simulate features where the association between the response and the features is due exclusively to the causal effects of interest. For simplicity, throughout this section, we assume that the data are generated from a standardized linear model so that the variances of X, C, M, and Y are equal to 1, and the direct causal effect of a variable Z_j on another variable Z_k is represented by the path coefficient [45] θ_{Z_k Z_j}.

Fig. 2. Twin-network approach in the case where the (a) direct or (b) indirect effect is of interest.
The anticausal task presented in Fig. 1(a) is represented (in the univariate case) by the set of structural equations C = U_C, Y = θ_{YC} C + U_Y, M = θ_{MY} Y + U_M, and X = θ_{XY} Y + θ_{XM} M + θ_{XC} C + U_X, where U_C, U_Y, U_M, and U_X are independent background (residual) variables. Using Wright's method of path analysis [45], we have that the total covariance (correlation) between X and Y can be decomposed into the contribution of the direct causal path Y → X, the indirect causal path Y → M → X, and the spurious association generated by the backdoor path X ← C → Y. It is clear that the predictive performance of any ML model trained with data generated by this model will be biased by the influence of the confounder C since the learner will leverage the total association between X and Y during training (assuming that only X is used as an input of the ML model). Now, suppose that our goal is to build an ML model whose predictive performance is only informed by the direct influence of Y on X and is free from the influence of C, as well as from the indirect influence of Y that is mediated by M. To this end, we need to simulate counterfactual data where the association between X and Y is due exclusively to the direct causal effect of Y on X. In other words, we want to simulate counterfactual feature data, X*, such that Cov(X*, Y) = θ_XY. In theory, this could be done by simulating data according to the twin network [5], [28] in Fig. 2(a), where the new counterfactual feature data, X*, are generated from the model X* = θ_XY Y + U_X. (In practice, we can estimate θ_XY and U_X by regressing X on C, M, and Y, and simulate the counterfactual feature data using X* = X − θ̂_XC C − θ̂_XM M = θ̂_XY Y + Û_X. In other words, we can employ a variation of Pearl's "abduction, action, prediction" approach to simulate deterministic counterfactuals [28], [29].) Direct calculation of the covariance between X* and Y shows that

Cov(X*, Y) = θ_XY.    (1)

Similarly, if the goal is to generate counterfactual data where the association between X and Y is due exclusively to the indirect causal effect of Y on X, then we can generate data according to the twin network in Fig. 2(b) so that

Cov(X*, Y) = θ_XM θ_MY.    (2)

Note that our interventions differ from Pearl's atomic "do" interventions and can be seen as a type of soft or stochastic intervention, where the data generation process differs from the natural system only in the mechanism associated with the feature X. Related types of soft/stochastic interventions have been studied in [11], [12], [20], and [25]. Node-splitting transformations in single-world intervention graphs [33] can also be used as an alternative intervention.
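To make the abduction-action-prediction recipe above concrete, the following minimal sketch (illustrative only; the path-coefficient values are arbitrary and not taken from this article) simulates data from a univariate anticausal model, regresses X on C, M, and Y to estimate the coefficients and residuals, and checks that the counterfactual feature X* = X − θ̂_XC C − θ̂_XM M only retains the direct association with Y (i.e., Cov(X*, Y)/Var(Y) ≈ θ_XY).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative path coefficients (not from the article):
# C -> Y, C -> X, Y -> M -> X, and Y -> X.
theta_YC, theta_MY, theta_XY, theta_XM, theta_XC = 0.5, 0.6, 0.4, 0.3, 0.7
C = rng.normal(size=n)
Y = theta_YC * C + rng.normal(size=n)
M = theta_MY * Y + rng.normal(size=n)
X = theta_XY * Y + theta_XM * M + theta_XC * C + rng.normal(size=n)

# Abduction/action/prediction: regress X on (C, M, Y) to recover the
# path coefficients and the residual (background) term U_X.
Z = np.column_stack([C, M, Y])
coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
theta_hat_XC, theta_hat_XM, theta_hat_XY = coef

# Counterfactual input that keeps only the direct effect Y -> X.
X_star = X - theta_hat_XC * C - theta_hat_XM * M

print("theta_XY            :", theta_XY)
print("Cov(X*, Y) / Var(Y) :", np.cov(X_star, Y)[0, 1] / np.var(Y))
```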

B. Multivariate Case
Here, we extend the results from Section IV-A to the multivariate case, where the nodes X, C, and M in Fig. 1(a) represent arbitrary DAG subdiagrams. However, first, we describe how we can always reparameterize linear structural causal models in a way that, in practice, we do not need to know how the DAG subdiagrams are organized in order to estimate the causal effects and the residuals employed in the computation of the counterfactual data.
1) Reparameterization in Linear Models: For linear structural causal models, we can always reparameterize any arbitrary DAG model to a simpler model where the covariance structure between the observed variables is "pushed" to the unobserved error terms. Fig. 3 provides an illustrative example of this well-known fact in the structural equations modeling literature [2], [42].
The DAG in Fig. 3(a) represents the actual data generation process for the variables X = (X_1, X_2, X_3)^T, where the error terms U_X = (U_{X_1}, U_{X_2}, U_{X_3})^T are independent, whereas the DAG in Fig. 3(b) shows the reparameterized model with correlated error terms W_X = (W_{X_1}, W_{X_2}, W_{X_3})^T. The set of linear structural causal models describing the DAG in Fig. 3(a) is given by X = Λ_XX X + U_X, which can be reparameterized as X = W_X, where W_X = (I − Λ_XX)^{−1} U_X.

Next, we describe the above reparameterization for the arbitrary anticausal predictive task. From the DAG in Fig. 1(a), we have that the joint distribution of the anticausal prediction task factors as P(C, Y, M, X) = P(C) P(Y | C) P(M | C, Y) P(X | C, Y, M), where each component of this factorization is described, respectively, by the structural causal models

C = Λ_CC C + U_C
Y = Λ_YC C + U_Y
M = Λ_MM M + Λ_MY Y + Λ_MC C + U_M
X = Λ_XX X + Λ_XY Y + Λ_XM M + Λ_XC C + U_X

where U_C, U_Y, U_M, and U_X are vectors of independent error terms with zero mean and finite variance; Λ_CC, Λ_MM, and Λ_XX represent, respectively, square matrices containing the path coefficients connecting the confounders among themselves, the mediators among themselves, and the features among themselves; and Λ_YC, Λ_MC, Λ_MY, Λ_XC, Λ_XM, and Λ_XY represent rectangular matrices of path coefficients connecting variables from separate sets. (For instance, Λ_XC corresponds to an n_X × n_C matrix of path coefficients connecting confounder variables to input variables.) Using simple algebraic manipulations, we can rewrite the above linear structural models as

C = W_C
Y = Γ_YC C + W_Y
M = Γ_MY Y + Γ_MC C + W_M
X = Γ_XY Y + Γ_XM M + Γ_XC C + W_X

where W_C = (I − Λ_CC)^{−1} U_C; W_Y = U_Y and Γ_YC = Λ_YC; W_M = (I − Λ_MM)^{−1} U_M and Γ_MV = (I − Λ_MM)^{−1} Λ_MV for V ∈ {Y, C}; and W_X = (I − Λ_XX)^{−1} U_X and Γ_XV = (I − Λ_XX)^{−1} Λ_XV for V ∈ {Y, M, C}. Note that, because model X = W_X is just a reparameterization of model X = Λ_XX X + U_X, we have that the association structure between the X_j variables is still the same after the model reparameterization. We point out, as well, that, for any arbitrary DAG, the matrix (I − Λ_XX) is always invertible.
To see why, note that we can always rearrange the order of the variables so that Λ_XX is a lower triangular matrix (with zeros along its diagonal). Now, because the determinant of a triangular matrix is given by the product of its diagonal elements, and a triangular matrix is invertible if and only if none of its diagonal elements is zero, we see that (I − Λ_XX) is always invertible since its diagonal elements are always equal to 1.
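As a quick numerical sanity check of the reparameterization (a sketch with arbitrary path coefficients, not code from this article), one can verify that simulating the features recursively from the DAG X = Λ_XX X + U_X and simulating them directly from the reparameterized form X = W_X, with W_X = (I − Λ_XX)^{−1} U_X, yields the same covariance structure among the X_j variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Arbitrary path coefficients for the edges X1 -> X2, X1 -> X3, and X2 -> X3.
l21, l31, l32 = 0.8, 0.3, 0.5
Lam = np.array([[0.0, 0.0, 0.0],
                [l21, 0.0, 0.0],
                [l31, l32, 0.0]])

# (a) Original DAG parameterization with independent errors, simulated recursively.
U = rng.normal(size=(n, 3))
X1 = U[:, 0]
X2 = l21 * X1 + U[:, 1]
X3 = l31 * X1 + l32 * X2 + U[:, 2]
X_dag = np.column_stack([X1, X2, X3])

# (b) Reparameterized model X = W_X, with W_X = (I - Lam)^{-1} U drawn from fresh errors.
U_new = rng.normal(size=(n, 3))
X_repar = U_new @ np.linalg.inv(np.eye(3) - Lam).T

# Both parameterizations imply (approximately) the same covariance matrix for X.
print(np.round(np.cov(X_dag.T), 2))
print(np.round(np.cov(X_repar.T), 2))
```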
2) Estimation of Causal Effects and Residuals in the Reparameterized Model: Estimation is performed as follows. First, we regress each feature X_j, j = 1, ..., n_X, on the set of observed confounders, mediators, and the outcome using the regression equations

X_j = Σ_k γ_{X_j C_k} C_k + Σ_k γ_{X_j M_k} M_k + γ_{X_j Y} Y + W_{X_j}

to estimate the causal effects γ̂_{X_j C_k}, γ̂_{X_j M_k}, and γ̂_{X_j Y} and the residuals Ŵ_{X_j} using least squares. (Here, we assume that the number of samples is larger than the number of covariates in the regression fits and that multicollinearity is not an issue. Note that we do not need to assume Gaussian error terms.) We then generate counterfactual features by adding back the estimated residuals to a linear predictor containing only the causal effects of interest. For instance, in order to estimate the predictive performance that is separately due to direct or indirect causal effects, we generate counterfactual features using, respectively,

X*_j = γ̂_{X_j Y} Y + Ŵ_{X_j}    or    X*_j = Σ_k γ̂_{X_j M_k} M*_k + Ŵ_{X_j}

where the counterfactual mediators M*_k are computed analogously by regressing each mediator on the confounders and the outcome to estimate the causal effects γ̂_{M_j C_k} and γ̂_{M_j Y} and the error terms Ŵ_{M_j}. Importantly, note that, for linear models, the Markov and faithfulness assumptions tie vanishing correlations and partial correlations to the graphical and causal structure so that the correlational structure can be used as a guide to the causal structure [43]. Since the coefficients in a multiple regression model are proportional to partial correlations, it follows that, when we regress a given variable on its nondescendants, only the regression coefficients associated with the variable's parents will be nonzero (under the assumption that all parent variables have been measured). Hence, when we regress X_j on C, M, and Y, only the coefficients associated with the parents of X_j in the reparameterized model will be statistically different from zero (for large enough sample sizes). Therefore, in practice, we do not need to know beforehand which variables are the parents of X_j in the reparameterized model. The parent set will be learned automatically from the data by the regression model fit (assuming, of course, the absence of unmeasured confounders). In this respect, note that, under the assumption that all the confounders and mediators are observed, we can identify the direct and indirect causal effects of the outcome on the inputs. If the confounders are observed but the mediators are unobserved, we can still identify total causal effects. (For instance, we have that X = Γ̃_XC C + Γ̃_XY Y + W̃_X, where Γ̃_XC = Γ_XC + Γ_XM Γ_MC and Γ̃_XY = Γ_XY + Γ_XM Γ_MY represent total causal effects.) On the other hand, if the mediators are observed, but some of the confounders are unobserved, then neither the direct, the indirect, nor the total causal effects are identifiable, and the predictions generated by the causality-aware approach will still be confounded. Finally, while, so far, we have discussed confounding of the feature/response relationship, it is also possible that the causal relations between features and mediators or between mediators and response are also influenced by confounders. If these confounders are unobserved, then we cannot identify the causal effects Λ_XM and Λ_MY. It is clear that, in the presence of unobserved confounding, the causality-aware predictions will be biased whenever the causal effects of interest are not identifiable.
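The following sketch illustrates the estimation step just described for a toy setting with one confounder, one mediator, and two features (all coefficient values are hypothetical and only meant to exercise the recipe). Each feature is regressed on C, M, and Y; the residuals are then added back to the direct-effect or indirect-effect linear predictors to produce the two types of counterfactual features.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Illustrative data: one confounder C, one mediator M, and two features X1, X2.
C = rng.normal(size=n)
Y = 0.7 * C + rng.normal(size=n)
M = 0.6 * Y + 0.2 * C + rng.normal(size=n)
X = np.column_stack([
    0.5 * Y + 0.4 * M + 0.3 * C + rng.normal(size=n),
    0.2 * Y + 0.6 * M - 0.4 * C + rng.normal(size=n),
])

def ols(Z, v):
    """Least-squares coefficients and residuals of v regressed on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return coef, v - Z @ coef

# Counterfactual mediator: keep only the effect of Y on M.
(g_MC, g_MY), W_M = ols(np.column_stack([C, Y]), M)
M_star = g_MY * Y + W_M

# Counterfactual features retaining only direct (Y -> X_j) or indirect
# (Y -> M -> X_j) associations.
X_direct = np.empty_like(X)
X_indirect = np.empty_like(X)
for j in range(X.shape[1]):
    (g_XC, g_XM, g_XY), W_X = ols(np.column_stack([C, M, Y]), X[:, j])
    X_direct[:, j] = g_XY * Y + W_X
    X_indirect[:, j] = g_XM * M_star + W_X

# Cov(X*_direct, Y)/Var(Y) should approach the direct effects (0.5 and 0.2).
print([np.cov(X_direct[:, j], Y)[0, 1] / np.var(Y) for j in range(2)])
```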

3) Multivariate Extension of the Results in Section IV-A: The multivariate versions of the results presented in (1) and (2) are presented in Result 1 below.
Result 1:
a) For causal effects generated by the paths in Y → X, if X* is given by X* = Γ_XY Y + W_X, then Cov(X*, Y) = Γ_XY Var(Y).
b) For causal effects generated by the paths in Y → M → X, if X* is given by X* = Γ_XM M* + W_X, and M* = Γ_MY Y + W_M, then Cov(X*, Y) = Γ_XM Γ_MY Var(Y).

The above result, together with the estimation approach described in Section IV-B2, shows that, by generating causality-aware counterfactual inputs, X*, and then training and evaluating ML learners on these counterfactual data, we are able to leverage only the associations generated by the causal mechanisms of interest. Because the counterfactual data are estimated from the reparameterized model, the approach does not require full knowledge of the causal graph. It suffices to know which variables are confounders and which are mediators.

V. CONFOUNDING ADJUSTMENT-AN ALGORITHMIC DESCRIPTION
As described in Section I, the causality-aware approach can be used to generate stable predictions and remove the influence of sensitive variables from ML predictions. While both confounders and mediators can represent sensitive variables, in this section, we focus only on confounding adjustments. Here, we provide an algorithmic description of the method, while Section VI describes an investigation of the approach's stability under dataset shifts generated by selection biases.
When the goal is confounding adjustment, the causality-aware features are generated according to Algorithm 1.

Remarks: Next, we make a few remarks regarding Algorithm 1 (a sketch of the adjustment implied by these remarks is given after the list below).
1) The algorithm requires test set confounder data but not the test set outcomes. Note that, for large sample sizes, and under the assumption that the causal effects are stable between the training and test sets, we have that β̂^tr_{X_j C_i} ≈ β̂^ts_{X_j C_i} so that we can estimate the test set counterfactual features without using test set outcomes since X*_{j,ts} = X_{j,ts} − Σ_i β̂^tr_{X_j C_i} C_{i,ts}.
2) The algorithm does not require standardized data. (In Section IV, we present the analytical results for standardized data to simplify the derivations.) For unstandardized data, β_UV represents an unstandardized path coefficient (i.e., a causal effect).
3) As long as the inputs are numerical variables, the algorithm can still be used for classification tasks (with numerical and/or categorical confounders).
4) In applications where we want to remove the influence of confounders and mediators, we compute the training and test causality-aware inputs as X*_{j,s} = X_{j,s} − Σ_i β̂^tr_{X_j C_i} C_{i,s} − Σ_k β̂^tr_{X_j M_k} M_{k,s}, for s = {tr, ts}.
5) In applications where the mediators are unobserved or are not sensitive variables, we can use Algorithm 1 directly with the understanding that β̂_{X_j Y} will represent the total causal effect of Y on X_j (i.e., the combination of the direct effect and the indirect effects transmitted through the mediators).
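The remarks above imply the following sketch of the confounding adjustment (hypothetical function and variable names; confounders only, with mediator terms omitted; an illustration consistent with the remarks, not a verbatim transcription of Algorithm 1): fit the training regressions of each feature on the confounders and the outcome, remove the estimated confounder contributions from the training features, and reuse the training coefficients to adjust the test features without ever touching the test outcomes.

```python
import numpy as np

def causality_aware_adjustment(X_tr, C_tr, y_tr, X_ts, C_ts):
    """Sketch of the confounding adjustment implied by the remarks above.

    Each training feature X_j is regressed on the confounders C and the
    outcome y; the estimated confounder contributions are then removed from
    both the training and test features.  Test outcomes are never used.
    """
    n_tr, p = X_tr.shape
    n_conf = C_tr.shape[1]
    Z_tr = np.column_stack([np.ones(n_tr), C_tr, y_tr])   # intercept, confounders, outcome

    X_star_tr = np.empty_like(X_tr)
    X_star_ts = np.empty_like(X_ts)
    for j in range(p):
        beta, *_ = np.linalg.lstsq(Z_tr, X_tr[:, j], rcond=None)
        beta_C = beta[1:1 + n_conf]                        # confounder coefficients only
        X_star_tr[:, j] = X_tr[:, j] - C_tr @ beta_C
        X_star_ts[:, j] = X_ts[:, j] - C_ts @ beta_C       # reuse training coefficients
    return X_star_tr, X_star_ts
```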

VI. STABILITY UNDER DATASET SHIFTS OF P(C, Y ) GENERATED BY SELECTION BIASES
In anticausal prediction tasks, dataset shifts in the joint distribution of the confounders and outcome variable, P(C, Y), are often caused by selection biases. The confounded anticausal prediction task influenced by selection bias is described by the causal graph in Fig. 4, where the auxiliary variable S indicates the presence of a selection mechanism contributing to the association between C and Y. (Here, S represents a binary variable assuming value 1 if a selection mechanism is present, and the square frame around S indicates that our dataset is generated conditional on S being set to 1. Note that the application of the d-separation criterion [28] to the causal graph shows that, because S is a collider, we have that, conditional on S = 1, the additional path C → S ← Y is open and, therefore, contributes to the association between C and Y.)

While it might seem intuitive that training a learner on an unconfounded training set will prevent it from learning the confounding signal and, therefore, lead to more stable predictions in shifted target populations, here, we show that adjusting the training data alone is insufficient, and better stability can be achieved by deconfounding the test set features as well. (Examples of approaches that only adjust the training data include preprocessing techniques to reduce discrimination in ML [7], [19].) To illustrate this point, we present an analysis of this issue using a toy linear model example. (Here, we focus on a toy model for the sake of simplicity. The general results associated with arbitrary linear models are presented in the Appendix.) Consider the toy model C = U_C, Y = β_YC C + U_Y, and X = β_XY Y + β_XC C + U_X, where the U_V are independent error terms with variance σ²_V, for V = {C, Y, X}. Assume, without loss of generality, that the data have been centered. Let Ŷ = X_ts β̂_tr represent the test set prediction from a linear regression model, where β̂_tr represents the coefficient estimated with the training data, and X_ts represents the test set input. By definition, the expected mse is given by

E[mse] = E[(Y_ts − X_ts β̂_tr)²] = β̂²_tr Var(X_ts) − 2 β̂_tr Cov(X_ts, Y_ts) + Var(Y_ts)

where the expectation is w.r.t. the test set so that β̂_tr is a fixed constant w.r.t. the expectation (see (11) for the general result). For any confounding adjustment approach that does not adjust the test set inputs, we have that

Var(X_ts) = β²_XY Var(Y_ts) + β²_XC Var(C_ts) + 2 β_XY β_XC Cov(Y_ts, C_ts) + σ²_X
Cov(X_ts, Y_ts) = β_XY Var(Y_ts) + β_XC Cov(Y_ts, C_ts)

showing that Var(X_ts) and Cov(X_ts, Y_ts) depend on Var(C_ts), Cov(Y_ts, C_ts), and Var(Y_ts) (so that E[mse] will be unstable under dataset shifts of these quantities) (see (12)-(14) for the general results).
On the other hand, for the causality-aware approach, the adjusted test inputs are given by X*_ts = β_XY Y_ts + U_X so that

Var(X*_ts) = β²_XY Var(Y_ts) + σ²_X
Cov(X*_ts, Y_ts) = β_XY Var(Y_ts)

do not depend on Cov(Y_ts, C_ts) or Var(C_ts) so that the E[mse] will be stable w.r.t. dataset shifts on these two quantities (although, as shown by (4), it will still be influenced by dataset shifts on Var(Y_ts)) (see (15)-(17) for the general results). Note that this is true even when we apply a confounding adjustment to the training set (a situation where the β̂_tr estimate is not influenced by the spurious associations generated by the confounder). This explains why it is not enough to deconfound the training inputs alone. While training a regression model using deconfounded features allows us to estimate deconfounded model weights, β̂_tr, the prediction Ŷ = X_ts β̂_tr is a function of both β̂_tr and the test set input, X_ts. As a consequence, if we do not deconfound the test set inputs, the expected mse will still be influenced by the confounders (since, in anticausal prediction tasks, the original test inputs, X_{j,ts} = β_{X_j Y} Y_ts + β_{X_j C} C_ts + U_{X_j}, are still functions of the confounder variable).
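The following simulation sketch (arbitrary parameter values, not the article's experimental setup) illustrates the point numerically under a variant of the toy model in which C and Y are drawn jointly with a controllable correlation: when only the training inputs are deconfounded, the test mse drifts as Cov(C_ts, Y_ts) shifts, whereas deconfounding the test inputs as well keeps the mse essentially constant (Var(Y_ts) is held fixed here).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
b_xy, b_xc = 1.0, 1.0          # arbitrary path coefficients

def simulate(rho_cy):
    # C and Y drawn jointly with unit variances and correlation rho_cy,
    # mimicking a selection-bias shift in P(C, Y); X is anticausally generated.
    cov = np.array([[1.0, rho_cy], [rho_cy, 1.0]])
    C, Y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    X = b_xy * Y + b_xc * C + rng.normal(size=n)
    return C, Y, X

# Training set: deconfound the input and regress Y on the adjusted input X*.
C_tr, Y_tr, X_tr = simulate(rho_cy=0.8)
b_xc_hat, _ = np.linalg.lstsq(np.column_stack([C_tr, Y_tr]), X_tr, rcond=None)[0][:2]
X_tr_star = X_tr - b_xc_hat * C_tr
beta_tr = np.dot(X_tr_star, Y_tr) / np.dot(X_tr_star, X_tr_star)

# Shifted test sets: only Cov(C, Y) changes; Var(Y) is held fixed.
for rho_ts in [0.8, 0.0, -0.8]:
    C_ts, Y_ts, X_ts = simulate(rho_ts)
    mse_train_only = np.mean((Y_ts - beta_tr * X_ts) ** 2)                    # test inputs not adjusted
    mse_both = np.mean((Y_ts - beta_tr * (X_ts - b_xc_hat * C_ts)) ** 2)      # test inputs adjusted too
    print(f"rho_ts={rho_ts:+.1f}  train-only adj: {mse_train_only:.3f}  train+test adj: {mse_both:.3f}")
```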

A. Synthetic Data Illustrations
We illustrate the above points in synthetic data experiments where we compare the causality-aware approach against three alternative adjustment methods, namely, the baseline, anchor regression [35], and no adjustment approaches. The baseline adjustment removes the causal effect of the confounders from the inputs in the training set alone (our goal is to illustrate how adjusting the training data alone is suboptimal in comparison to adjusting both the training and test sets). Anchor regression [35] represents a new estimation principle, which leverages exogenous variables (the anchors) to improve prediction under dataset shifts while allowing for hidden confounding. The method provides an interpolation between the solutions of ordinary least squares (OLS) and two-stage least squares, and similar to our proposed method, it is easy to compute and use in practice and does not require full knowledge of the causal graph; it only requires the identification of anchor variables (and can be used in both causal and anticausal prediction tasks). For completeness, we also report results based on the "no adjustment" approach, where no adjustments are applied to the training or test sets. (Note that the "no adjustment" approach corresponds to applying OLS to the original data.) We point out that several other approaches proposed in the stable prediction literature are not applicable in our illustrations. For instance, in the context of classification tasks, approaches such as invariant risk minimization [4] or invariant causal prediction [31] rely on data from multiple training sets, while our approach focuses on a single training set. Furthermore, stable prediction approaches [22], [23], which only require a single training set, can only be applied in causal prediction tasks, while our illustrations focus on anticausal tasks. Note, as well, that, while counterfactual normalization [40] can also be applied in anticausal prediction tasks, its application to the particular model (see Fig. 5) used in our simulations would produce identical results to those of the causality-aware adjustment (since the counterfactual variable used by the counterfactual normalization method as the stable set for predicting Y corresponds to the causality-aware input X*_j = X_j − β̂_{X_j C} C in our example).

The confounded data are generated from the model in Fig. 5, where we change the variances and covariance of the error terms U_C and U_Y in order to simulate the effects of selection biases in the joint distribution P(C, Y). (We simulate data using the model in Fig. 5, rather than the model in Fig. 4, for computational convenience. In order to directly generate dataset shifts between the training and test sets using data from the model in Fig. 4, we would need to generate a large number of samples and then select smaller subsets with different degrees of association between the C and Y variables. It is simply more convenient to simulate dataset shifts in P(C, Y) by controlling the σ²_C, σ_YC, and σ²_Y parameters in the model presented in Fig. 5.) The model is described by the following set of linear structural causal equations: C = U_C, Y = U_Y, and X_j = β_{X_j Y} Y + β_{X_j C} C + U_{X_j}, for j = 1, ..., p, where the error terms (U_C, U_Y) follow a bivariate normal distribution with variances σ²_C and σ²_Y and covariance σ_YC, and where the ij-th entry of the covariance matrix of U_X is given by 1 for i = j and ρ^{|i−j|} for i ≠ j.
In our experiments, we simulate training and test set data as follows.
1) Sample the simulation parameters β_{X_j Y}, β_{X_j C}, and ρ from uniform distributions (whose ranges for each experiment are described in Table I).

Because the spurious associations between the response and the input variables generated by a confounder can either increase or decrease the strength of the total association between Y and X_j (as described in the caption of Fig. 6), we have that confounding adjustment can either degrade or improve predictive performance in situations where the training and test sets come from the same distribution (while dataset shifts often lead to predictive performance degradation). Furthermore, because E[mse] also depends on Var(Y_ts) (as described in Section V), we performed the following four separate simulation experiments covering all possible combinations of these two factors. For each simulation replication, we generate a training set and nine test sets as described above, and then, the following holds.
1) For the causality-aware approach, we adjust the confounded training set and each of the nine confounded test sets, fit a regression model on the adjusted training set data, use the same trained model to predict on the nine confounded test sets, and evaluate the test set performances using mse.
2) For the baseline approach, we adjust the confounded training set, fit a regression model on the adjusted training set data, use the trained model to predict on the nine confounded test sets, and evaluate the test set performances using mse.
3) For the anchor regression approach, we adjust the training set using the finite sample estimator described in [35] for a grid of γ values ranging from 0 to 100 using the confounder as the anchor variable, use the trained model to predict on the nine confounded test sets, and evaluate the test set performances using mse. (Explicitly, we adopted a grid of γ values of length 100, with the first 50 values equally spaced between 0 and 1 and the remaining 50 values ranging from 2 to 100 by increments of 2. For each fixed γ value, the finite sample anchor regression estimator was computed by first generating synthetic versions of the input and outcome variables according to X̃_tr = X_tr − (1 − √γ) A X_tr and Ỹ_tr = Y_tr − (1 − √γ) A Y_tr, where A = C_tr (C_tr^T C_tr)^{−1} C_tr^T, and then applying OLS to the X̃_tr and Ỹ_tr data. Note that anchor regression is another example of a method that only adjusts the training set. While anchor regression requires the user to select the value of the tuning parameter γ, either by subject knowledge or by cross-validation, in our experiments, we simply report the best case scenario using the γ value that generated the smallest mse for each separate test set. Hence, the results that we report for anchor regression are actually overoptimistic.)
4) For the "no adjustment" approach, we fit a regression model to the confounded training data, use the trained model to predict on the nine confounded test sets, and evaluate the test set performances using mse.

Figs. 7-10 report the results. In all figures, (a) reports the mse scores across 1000 simulation replications for the nine test sets (dots represent the mean across the 1000 replications, while the vertical bars represent one standard deviation around the mean), and (b) presents a comparison of the stability errors, defined as the standard deviation of the mse scores across the nine test sets in each simulation replication, so that lower standard deviations indicate more stable predictions. (The boxplots present the stability errors across the 1000 replications.)

Fig. 7 reports the results for the first experiment. Note that, in this experiment, we have that the spurious association generated by the confounder increases the strength of the association between the response and the features so that the causality-aware adjustment leads to a degradation in predictive performance in situations where there is no dataset shift between training and test sets [note how the blue mean mse for test set 1 is higher than the orange one in (a)]. However, comparisons for the other test sets show how it is still desirable to perform confounding adjustments in order to avoid much larger degradation in performance generated by strong dataset shifts.
Fig. 8, on the other hand, reports the results for the second experiment where the spurious association generated by the confounder decreases the strength of the association between the response and the features. In this case, confounding adjustment leads to improved predictive performance in the absence of dataset shifts [now, note how the blue mean mse for test set 1 is lower than the orange one in (a)]. Observe, as well, that, because, in experiments 1 and 2, we kept Var(Y_ts) constant across the test sets, we see perfect stability for the causality-aware approach. (Note that varying Cov(Y_ts, C_ts) and Var(C_ts) has no influence on the stability of the results since the expected mse for the causality-aware approach only depends on Var(Y_ts), as illustrated by (15)-(17).) Figs. 9 and 10 report the results from the third and fourth experiments, where we increased Var(Y_ts) across the test sets. As expected, we now observe instability in the causality-aware approach too. The causality-aware predictions, however, tend to be more stable than the predictions from the other approaches.
A comparison of the stability of the causality-aware and baseline adjustments tended to favor the causality-aware adjustment. (Interestingly, the baseline approach showed even greater instability than the no adjustment approach in experiments 1 and 2.) In all experiments, the causality-aware and the anchor regression methods outperformed the no adjustment (OLS) approach in terms of stability. (Note that the better performance of anchor regression relative to OLS is not surprising, given that OLS represents a special case of anchor regression when γ is set to 1, and in our experiments, we selected the γ value that minimized mse in a grid of 100 values, including γ = 1.) A comparison of the causality-aware and anchor regression methods shows that the causality-aware approach was more stable in all experiments and also tended to outperform anchor regression in terms of predictive performance (note the smaller mse scores throughout the experiments with the exception of test sets 1 and 2 in experiments 1 and 3, where the confounder increases the associations between inputs and outcome, and confounding adjustment degrades the predictive performance). At this point, it is important to clarify that anchor regression was developed to handle a different type of dataset shift than the shifts investigated in these experiments. Namely, anchor regression assumes shift interventions, i.e., interventions that shift numerical variables by a certain amount, which then propagate through the system. The present experiments, on the other hand, simulate a very different type of dataset shift generated by selection biases. Hence, it is not surprising that anchor regression is outperformed by the causality-aware approach in a setting that it was not developed to handle, but that is easily addressed by the causality-aware adjustment. In Section VI-A1, however, we investigate the performance of the causality-aware approach under conditions that violate the assumptions made by the methodology. We will see that anchor regression tended to outperform the causality-aware approach in these additional "failure mode" experiments.
1) Failure Mode Experiments: In Section VI-A, we presented experiments illustrating dataset shifts in the joint distribution P(C, Y), where our key assumption that P(X | C, Y) is stable across training and test sets still held. In this section, we present "failure mode" experiments that investigate how violations of this assumption impact the performance of the causality-aware approach.
To this end, we performed two additional simulation experiments where we generated data according to the simple linear structural causal models C = U_C, Y = β_YC C + U_Y, and X = β_XC C + β_XY Y + U_X (with independent error terms U_C, U_Y, and U_X), as represented by the causal graph in Fig. 11, where the causal effect of C on X in the test set, β^ts_XC, is different from the respective effect in the training set, β^tr_XC. (We did, however, still adopt stable causal effects of Y on X and of C on Y between the training and test sets, i.e., β^tr_XY = β^ts_XY and β^tr_YC = β^ts_YC.) Because β^tr_XC differs from β^ts_XC, the conditional distribution P(X | C, Y) will differ between training and test sets. Similarly, the covariance between C and X will also differ between the training and test sets since Cov(X, C) = β_XC + β_XY β_YC. (Observe, nonetheless, that, because we set β^tr_YC = β^ts_YC, it follows that P(C_tr, Y_tr) = P(C_ts, Y_ts) so that these additional experiments investigate a different type of dataset shift than the previous experiments.) Once again, we considered nine test sets, and each experiment was composed of 1000 replications, with both the training and test sets containing n = 1000 examples. The simulation parameter values for each of the additional experiments were selected as follows.
1) Failure Mode Experiment 1: β^ts_XC set to 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, and 3 across the nine test sets.
2) Failure Mode Experiment 2: β^ts_XC set to 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, and 3 across the nine test sets.
As before, the failure mode experiments 1 and 2 represent, respectively, the cases where the spurious associations generated by C either increase or decrease the strength of the total association between X and Y in the training set.
Figs. 12 and 13 report the results comparing the four adjustment approaches evaluated in Section VI-A (which were implemented as described previously). In both experiments, anchor regression was more stable than the causality-aware approach and outperformed all methods. Observe, as well, that, in the failure mode experiment 2, even the no adjustment approach (i.e., OLS) outperformed the causality-aware approach, illustrating that the stability of P(X | C, Y) is an important assumption of the proposed method.

VII. EXTENSION TO ADDITIVE MODELS
In order to add greater flexibility to our modeling approach, we can replace linear models with more flexible additive models [14], which are able to capture nonlinear relationships between the variables. In this section, we first describe how to extend the causality-aware approach to additive models and then use synthetic data experiments to provide empirical evidence for the stability of causality-aware additive models.

A. Causality-Aware Additive Models
For the additive-model causality-aware approach, we fit the following additive model to the data:

X_j = Σ_i f_{X_j C_i}(C_i) + f_{X_j Y}(Y) + U_{X_j}

and compute the training set causality-aware features as

X*_{j,tr} = X_{j,tr} − Σ_i f̂^tr_{X_j C_i}(C^tr_i) = f̂^tr_{X_j Y}(Y_tr) + Û^tr_{X_j}

where Û^tr_{X_j} represents the training set residuals, and the f̂^tr terms represent the scatterplot smoothers estimated in the training data. The causality-aware test set features are computed as

X*_{j,ts} = X_{j,ts} − Σ_i f̂^tr_{X_j C_i}(C^ts_i)

where f̂^tr_{X_j C_i}(C^ts_i) represents the evaluation of the test set confounder data, C^ts_i, using the respective scatterplot smoother estimated in the training set.
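A minimal Python sketch of this additive adjustment is given below. The article fits the additive models with the gam package in R; here, purely for illustration, regression splines (scikit-learn's SplineTransformer) play the role of the scatterplot smoothers, and the function and parameter names are assumptions of the sketch rather than part of the original methodology.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import SplineTransformer

def additive_causality_aware(X_tr, C_tr, y_tr, X_ts, C_ts, n_knots=8):
    """Sketch of the additive-model adjustment using spline smoothers.

    For each feature X_j, we fit an additive regression on spline expansions of
    the confounders plus a linear term in the outcome, and then subtract the
    fitted confounder contributions from the training and test features.
    """
    spline = SplineTransformer(n_knots=n_knots, degree=3).fit(C_tr)
    B_tr, B_ts = spline.transform(C_tr), spline.transform(C_ts)

    X_star_tr = np.empty_like(X_tr)
    X_star_ts = np.empty_like(X_ts)
    for j in range(X_tr.shape[1]):
        reg = LinearRegression().fit(np.column_stack([B_tr, y_tr]), X_tr[:, j])
        f_C_tr = B_tr @ reg.coef_[:-1]      # estimated confounder contribution, training
        f_C_ts = B_ts @ reg.coef_[:-1]      # same smoothers evaluated on test confounders
        X_star_tr[:, j] = X_tr[:, j] - f_C_tr
        X_star_ts[:, j] = X_ts[:, j] - f_C_ts
    return X_star_tr, X_star_ts
```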

B. Synthetic Data Experiments
Here, we evaluate and compare the stability of the causality-aware additive model adjustment against the linear model-based adjustment in two synthetic data experiments. Again, we evaluate the cases where the spurious associations generated by C either increase or decrease the strength of the total association between X_j and Y in the training set.
The data for these experiments were generated as described in Section VI-A, with the exception that: 1) the linear model in (7) is replaced by the nonlinear additive model (10) for j = 1, ..., p; 2) we set Var(C) = Var(Y) = 1 in both experiments; and 3) we sample the β_XC and β_XY parameters according to β_XC ∼ U(1, 3) and β_XY ∼ U(1, 3) in the first experiment and according to β_XC ∼ U(−3, −1) and β_XY ∼ U(1, 3) in the second. The additive models were fit with the gam package [14] in R [32].

Fig. 14 reports the results from the first experiment. It shows that the additive model-based adjustment is more stable than the linear model-based adjustment [compare the blue boxplots in (a) against the green ones in (b), as well as in (c)]. Observe that, as before, in this experiment, the spurious association generated by the confounder increases the strength of the association between the response and the features so that the causality-aware adjustments lead to a degradation in predictive performance in situations where there is no dataset shift between training and test sets [note how the mse values for test set 1 tend to be higher for the causality-aware adjustment compared to the no adjustment approach in (a) and (b)]. However, once again, the dataset shifts in the other test sets lead to strong predictive performance degradation when no adjustments are performed.

Fig. 15 reports the results from the second experiment. It again shows that the additive model-based causality-aware adjustment is more stable than the linear one. Observe as well that, in this experiment, the spurious associations generated by the confounder (in the test set 1 and training sets) decrease the strength of the association between the response and the features so that the causality-aware adjustments lead to an improvement in predictive performance in the absence of dataset shifts between training and test sets [note how the mse values for test set 1 tend to be lower for the causality-aware adjustment compared to the no adjustment approach in (a) and (b)].
Finally, note that, in both figures, the additive model-based causality-aware adjustment generates lower mse values than the linear model-based adjustment, suggesting that the additive models provide a better fit to the additive model features generated from (10) compared to the linear models.

VIII. REAL DATA ILLUSTRATIONS
We also illustrate our stability results using real data from the PD Digital Biomarker (PDDB) Dream Challenge [39]. In this ML challenge, participants were asked to submit feature sets extracted from raw accelerometer data collected by the walking activity in the mPower study [3], [26], where both PD participants and non-PD participants were asked to walk in a straight line for 30 s holding their smartphones in their pockets. The submitted feature sets were scored by the challenge organizers on their predictive ability to classify PD versus non-PD participants. Note that this application represents an anticausal prediction task since having PD affects the gait of a person. In other words, the accelerations recorded by the smartphone accelerometers represent "symptoms" of the disease.
Challenge participants were free to generate feature sets using either signal processing techniques (by manually engineering features) or deep learning models (to automatically learn the features from the raw data). Here, we illustrate the methodology using a feature set generated by a deep neural network (DNN) model since the causality-aware approach can often be employed to deconfound feature representations learned by a DNN classifier [10]. (To see why, observe that, by training a highly accurate DNN using softmax activation at the classification layer, we have that, by construction, the feature representation learned by the last layer prior to the output layer will fit a logistic regression model well, as the softmax activation used to classify the outputs of the DNN is essentially performing logistic regression classification. As a consequence, the association between the feature representation learned by an accurate DNN and the labels can often be modeled by simple linear models.) In our experiments, we adopted the features learned by the top-performing deep learning team in the PDDB Dream Challenge. (This team took second place in the challenge; an ensemble method secured first place.) Observe as well that, as pointed out in Remark 3 in Section V, we can still apply the causality-aware adjustment whenever the input variables are numeric.
As pointed out by Chaibub Neto et al. [8], age is an important confounder in this dataset, and our goal here is to investigate the stability of the proposed approach under dataset shifts in the joint distribution of the age confounder and the PD labels. In order to generate test sets that are shifted w.r.t. the training data (denoted "shifted test set") and test sets that are not shifted (denoted "no shift test set"), we discretize the age confounder into three levels ("young-age," "middle-age," and "senior-age") and then adopted the following data partitioning strategy to simulate the effects of selection bias on P(C, Y). First, we split the original data (n = 4734) into a "shifted test set" (n = 900) and a "remaining dataset" (n = 3834) by taking a subsample of the original data where the association between the (discretized) age confounder and the disease labels is flipped relative to the original data. Second, we split the remaining data into a training set (n = 2933) and an identically distributed test set, the "no shift test set" (n = 901) (see Fig. 16).
For our analyses, we performed 100 distinct data splits using this procedure. For each data split, we fit logistic regression classifiers and evaluated predictive performance using the area under the receiver operating characteristic curve (AUROC) metric.
In these experiments, we compared seven distinct adjustment approaches: 1) the linear model-based causality-aware approach; 2) the "linear model baseline" approach (where only the training set features are adjusted); 3) the additive model-based causality-aware approach; 4) the "additive model baseline" approach (where only the training set features are adjusted); 5) the matching approach (applied to the training set alone), where we remove the association between the (discretized) age confounder and the PD status labels by randomly selecting a perfectly balanced subsample of the training set; 6) the approximate inverse probability weighting approach (applied to the training set alone), where we remove the association between (discretized) age and PD labels by oversampling from the training data according to weights derived from standard propensity scores (estimated with logistic regression) in order to generate a perfectly balanced augmented training set; 7) the "no adjustment" approach, where no adjustment is applied to the training or test set.
Note that, for the additive model causality-aware and additive model baseline adjustments, we used the (continuous) age variable as the confounder, whereas, for all other adjustments, we used the discretized age as the confounder.
The results are presented in Fig. 17 and illustrate the better stability (i.e., the considerably smaller gap between the performances in the "shifted" versus "no shift" test sets) of both causality-aware adjustments relative to the other approaches (with the additive model causality-aware adjustment being the most stable). Note that, in this example, confounding adjustment reduces predictive performance.
Observe, as well, that, for the matching (and approximate inverse probability weighting) approaches, we only adjust the training data since, in practice, we do not have access to the label data in the target populations where a classifier will be deployed (the labels are the quantities we want to predict), but these methods require label data for performing the adjustment. Furthermore, because we are interested in the case where P(C, Y) changes in unknown ways in the test sets, we cannot use simple balancing adjustments that balance the training data in a way that P(C_tr, Y_tr) matches the target/test set distribution, P(C_ts, Y_ts). Once again, the above results illustrate how adjusting the training data alone can be insufficient in anticausal prediction tasks, and adjusting both the training and test sets can generate more stable predictions.
Finally, despite the positive results of the proposed approach on this real data illustration, it is important to point out that the dataset shifts used in these illustrations were generated in an artificial manner and represent a strong shift from the training set to the test set.While our results suggest that the method might be effective in the real world under natural dataset shifts, further investigations are still warranted.

IX. DISCUSSION
This article makes several contributions. First, we describe a novel counterfactual approach to train "causality-aware" predictive models, which leverages only the associations generated by the causal mechanisms of interest. Second, by leveraging a reparameterization of the linear structural causal models (as described in Section IV-B1), we show that the approach does not require full knowledge of the data generation process. It suffices to know which variables are confounders and mediators, without knowing how these variables are causally related. This represents an important practical advantage of the method relative to alternative approaches that require knowledge of the full causal graph. Third, we investigate the stability properties of the method w.r.t. dataset shifts generated by selection biases. We show that E[mse] for adjustment approaches that fail to deconfound the test set features will be unstable w.r.t. shifts in Cov(C, Y), even when the ML models are trained with unconfounded data [and there are no shifts in Var(Y_ts)]. This is an important observation that (we feel) is not well appreciated in the ML community. Fourth, we extend the causality-aware approach to more flexible additive models [14], which are able to capture nonlinear relationships between the variables. (It is important to point out, however, that, while this extension relaxes the linearity assumption, it still relies on the additivity assumption. Extension to more flexible nonlinear models that can handle nonadditive relationships between confounders, inputs, and outcome variables is still warranted.)

As illustrated by the synthetic data experiments presented in Section VI, a key assumption made by the causality-aware approach is that the conditional distribution P(X | C, Y) is stable between training and test sets. We point out, nonetheless, that this assumption is reasonable in several application domains. For instance, in health-related applications (such as in the real data illustrations in Section VIII) where the goal is to predict disease severity using the disease symptoms as inputs, and where demographic variables, such as race, gender, and age, are potential confounders, the instability of P(X | C, Y) implies that there are biological/physiological differences between the individuals in the training and test set subpopulations (e.g., the effects of race, gender, and age on the symptoms of the disease would be different for individuals in different training and test sets). While such differences are possible in theory, in practice, dataset shifts in P(C, Y) seem to be more commonly observed in health applications since they can be easily generated by selection mechanisms during data collection (which is arguably a ubiquitous problem in applied ML work).
In this article, we have focused on anticausal prediction tasks. For causal prediction tasks, we can improve the stability of ML predictions by simply performing "residualization" of the model inputs, where we regress the input data on the observed confounders and use the residuals of the regression fits as the new inputs for the ML algorithms. Observe, however, that, for anticausal prediction tasks, the residualization approach performs the wrong confounding adjustment: by failing to include the outcome variable as a covariate in the regression model, it removes not only the direct causal influence of the confounders on the inputs but also the indirect influences that are mediated by the outcome variable [9].
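To make the distinction concrete, the following minimal sketch (not code from this article; the coefficients and variable names are illustrative assumptions) contrasts residualization with the causality-aware adjustment in a toy anticausal task with a single confounder.

```python
# Toy anticausal task: C -> Y, C -> X, Y -> X (single input, single confounder).
# Illustrative sketch only; coefficients below are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
b_yc, b_xc, b_xy = 0.8, 0.7, 1.0

C = rng.normal(size=n)
Y = b_yc * C + rng.normal(size=n)              # outcome influenced by the confounder
X = b_xy * Y + b_xc * C + rng.normal(size=n)   # anticausal input

# Residualization (suited to causal tasks): regress X on C only and keep the residuals.
coef_c = np.linalg.lstsq(np.c_[np.ones(n), C], X, rcond=None)[0]
X_resid = X - np.c_[np.ones(n), C] @ coef_c

# Causality-aware adjustment (anticausal tasks): regress X on C *and* Y,
# then subtract only the estimated confounder contribution, keeping the Y -> X signal.
coef_cy = np.linalg.lstsq(np.c_[np.ones(n), C, Y], X, rcond=None)[0]
X_star = X - coef_cy[0] - coef_cy[1] * C

print("Cov(X_resid, Y):", np.cov(X_resid, Y)[0, 1])  # attenuated: part of the Y signal is removed with the confounding
print("Cov(X_star,  Y):", np.cov(X_star, Y)[0, 1])   # ~ b_xy * Var(Y): direct Y -> X effect retained
# Regressing X_star on (C, Y) shows no remaining direct dependence on C.
print("C coefficient for X_star:", np.linalg.lstsq(np.c_[np.ones(n), C, Y], X_star, rcond=None)[0][1])
```

In the sketch, the residualized feature loses part of the Y → X signal that travels through the confounder, whereas the causality-aware feature retains the direct effect of Y and no longer loads directly on C.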
Finally, one drawback of the causality-aware approach is its reliance on observed confounders. We leave extensions for dealing with unobserved confounders for future work.

APPENDIX

Consider the arbitrary anticausal prediction task model in Fig. 18, where the double arrows connecting the variables {U_{X_1}, . . ., U_{X_p}} (and {U_{C_1}, . . ., U_{C_m}}) represent the fact that these error terms are correlated. Without loss of generality, assume that the data have been centered so that the linear structural causal models describing the data generation process are given by

C_j = U_{C_j},   Y = Σ_i β_{YC_i} C_i + U_Y,   X_j = β_{X_jY} Y + Σ_i β_{X_jC_i} C_i + U_{X_j},

for j = 1, . . ., p and i = 1, . . ., m. The causality-aware features are estimated as X*_j = X_j − Σ_i β̂_{X_jC_i} C_i and converge asymptotically to X*_j = β_{X_jY} Y + U_{X_j}. Below, we show that the E[mse] of models applied to the unadjusted test set features depends on Cov(Y_ts, C_{i,ts}), Var(C_ts), and Cov(C_{i,ts}, C_{i',ts}), whereas the E[mse] based on the causality-aware features does not, so that the approach will be stable against shifts in these quantities.

Fig. 1. (a) DAG of a general anticausal prediction task (i.e., where the outcome affects the input variables). (b) DAG of a general causal prediction task (i.e., where the input variables influence the outcome variable).

Fig. 5. Model used to simulate data for the synthetic data experiments.
As a result, the covariance between X and Y will differ between training and test sets. Similarly, the covariance between C and X will also differ between the training and test sets since Cov(X, C) = β_XC + β_XY β_YC. (Observe, nonetheless, that, because we set β^tr_YC = β^ts_YC, it follows that P(C_tr, Y_tr) = P(C_ts, Y_ts), so that these additional experiments investigate a different type of dataset shift than the previous ones.) Once again, we considered nine test sets, and each experiment was composed of 1000 replications, with both the training and test sets containing n = 1000 examples. The simulation parameter values for each of the additional experiments were selected as follows. 1) Failure Mode Experiment 1:

Fig. 14. Additive model experiment 1. Case where the confounder improves predictive performance in the absence of dataset shifts. (a) Linear model. (b) Additive model. (c) Stability error.

Fig. 15. Additive model experiment 2. Case where the confounder degrades predictive performance in the absence of dataset shifts. (a) Linear model. (b) Additive model. (c) Stability error.

Fig. 16. Associations between discretized age and PD status. Note how the association pattern is flipped in (c) relative to (a) and (b). (That is, most young participants tend to be non-PD cases in the training and "no shift" test sets but tend to be PD cases in the "shifted" test set.) (a) Training set. (b) No shift test set. (c) Shifted test set.

Fig. 6. Assume that the data have been standardized so that the variances of the variables equal 1 and the covariances equal the correlations. In (a), the total correlation between X and Y can be decomposed as Cov(X, Y) = θ_XY + θ_XC σ_YC, where the contribution of the direct causal effect of Y on X is given by the θ_XY component, while the spurious association contributed by C is quantified by the θ_XC σ_YC component. In (b), on the other hand, the total correlation between the adjusted feature, X*, and Y is given by Cov(X*, Y) = θ_XY. Now, observe that the spurious association contributed by C in the unadjusted data can either increase or decrease the strength of the correlation between X and Y, |Cov(X, Y)|, depending on the signs of θ_XY, θ_XC, and σ_YC. For instance, if θ_XY > 0, σ_YC > 0, and θ_XC > 0, then |Cov(X, Y)| = θ_XY + θ_XC σ_YC > θ_XY = |Cov(X*, Y)|, so that the decrease in the strength of Cov(X*, Y) leads to a degradation of the predictive performance when using the adjusted inputs X*. On the other hand, if θ_XY > 0 and σ_YC > 0, but θ_XC < 0 and θ_XY > |θ_XC σ_YC|, then |Cov(X, Y)| = θ_XY + θ_XC σ_YC < θ_XY = |Cov(X*, Y)|, so that the increase in the strength of Cov(X*, Y) leads to improved predictive performance when using the adjusted inputs X*.

TABLE I. SIMULATION EXPERIMENT PARAMETERS.

1) Sample the values of β_{X_jY} and β_{X_jC}, for j = 1, . . ., p, and ρ, where the error terms U_C and U_Y are distributed according to (8).
2) Given the fixed values for Var(C_tr), Cov(Y_tr, C_tr), and Var(Y_tr), and the sampled ρ value, sample the error terms U^tr_C and U^tr_Y according to (8) and the error terms U^tr_X according to (9).
3) Simulate the (confounded) training set according to the models described in (5)-(7).
4) Simulate nine distinct confounded test sets (indexed by ts_k, for k = 1, . . ., 9). Each test set is simulated as the training set, except that the values of Var(C_{ts_k}), Cov(Y_{ts_k}, C_{ts_k}), and Var(Y_{ts_k}) differ from the respective training set values. Note that we allow Var(C), Cov(Y, C), and Var(Y) to differ between training and test sets in order to generate dataset shifts in P(C, Y). Observe, however, that we use the same sampled values of β_{X_jY}, β_{X_jC}, and ρ in the generation of the training and test sets to maintain the stability of P(X | C, Y).
Each of the four simulation experiments was composed of 1000 replications, where the synthetic data were generated according to the simulation parameter values described in Table I. Both the training and test sets were composed of n = 1000 examples, and we adopted p = 5 (i.e., we simulated five inputs). Experiments 1 and 3 illustrate the case where confounding adjustment degrades the predictive performance since Cov(Y, C), β_XC, and β_XY are positive in both training and test sets. Similarly, experiments 2 and 4 illustrate the case where confounding adjustment improves the predictive performance since σ_YC = Cov(Y, C) and β_XC have opposing signs.
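For concreteness, the sketch below mimics the simulation scheme just described. The coefficients, variances, and shift magnitudes are illustrative assumptions rather than the Table I values, and (C, Y) is drawn directly from a bivariate normal rather than through (5)-(8), which is observationally equivalent in the linear Gaussian case.

```python
# Illustrative sketch of the simulation scheme above (not the Table I values):
# P(X | C, Y) is held fixed across datasets, while P(C, Y) shifts via Cov(Y, C).
import numpy as np

rng = np.random.default_rng(1)
n, p, rho = 1000, 5, 0.3                      # sample size, number of inputs, error correlation
b_xy = rng.uniform(0.5, 1.0, size=p)          # beta_{X_j Y}, shared by training and test sets
b_xc = rng.uniform(0.5, 1.0, size=p)          # beta_{X_j C}, shared by training and test sets
cov_u = (1 - rho) * np.eye(p) + rho * np.ones((p, p))   # correlated input error terms

def simulate(var_c, cov_yc, var_y):
    """One dataset with the requested P(C, Y); the X | C, Y mechanism never changes."""
    cy = rng.multivariate_normal([0.0, 0.0], [[var_c, cov_yc], [cov_yc, var_y]], size=n)
    C, Y = cy[:, 0], cy[:, 1]
    U_X = rng.multivariate_normal(np.zeros(p), cov_u, size=n)
    X = Y[:, None] * b_xy + C[:, None] * b_xc + U_X
    return C, X, Y

train = simulate(var_c=1.0, cov_yc=0.5, var_y=1.0)
test_sets = [simulate(var_c=1.0, cov_yc=c, var_y=1.0)      # nine test sets with shifted Cov(Y, C)
             for c in np.linspace(-0.8, 0.8, 9)]
```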
Fig. 18. Confounded anticausal prediction task example.

Let Ŷ = X_ts β̂_tr represent the prediction of a linear regression model, where X_ts represents the test set features and β̂_tr represents the regression coefficients estimated from the training set. By definition, the expected mse of the prediction is given by E[mse] = E[(Y_ts − Ŷ)²] = E[Y_ts²] + E[Ŷ²] − 2 E[Ŷ Y_ts] = Var(Y_ts) + E[Ŷ²] − 2 Cov(Ŷ, Y_ts), since E[Y_ts] = 0. Direct computation shows that E[Ŷ²] depends on Var(X_{j,ts}) and Cov(X_{j,ts}, X_{k,ts}), and that Cov(Ŷ, Y_ts) depends on Cov(X_{j,ts}, Y_ts), so that

E[mse] = Var(Y_ts) + Σ_j Σ_k β̂^tr_j β̂^tr_k Cov(X_{j,ts}, X_{k,ts}) − 2 Σ_j β̂^tr_j Cov(X_{j,ts}, Y_ts),    (11)

where Cov(X_{j,ts}, X_{j,ts}) = Var(X_{j,ts}). Next, we derive the expressions for Var(X_{j,ts}), Cov(X_{j,ts}, X_{k,ts}), and Cov(X_{j,ts}, Y_ts) and show that they still depend on Cov(Y_ts, C_{i,ts}). Direct computation shows that

Var(X_{j,ts}) = σ²_{X_j} + β²_{X_jY} Var(Y_ts) + Σ_i β²_{X_jC_i} Var(C_{i,ts}) + 2 Σ_{i<i'} β_{X_jC_i} β_{X_jC_i'} Cov(C_{i,ts}, C_{i',ts}) + 2 β_{X_jY} Σ_i β_{X_jC_i} Cov(Y_ts, C_{i,ts}),    (12)

Cov(X_{j,ts}, X_{k,ts}) = β_{X_jY} β_{X_kY} Var(Y_ts) + β_{X_jY} Σ_i β_{X_kC_i} Cov(Y_ts, C_{i,ts}) + β_{X_kY} Σ_i β_{X_jC_i} Cov(Y_ts, C_{i,ts}) + Σ_i Σ_{i'} β_{X_jC_i} β_{X_kC_i'} Cov(C_{i,ts}, C_{i',ts}),    (13)

Cov(X_{j,ts}, Y_ts) = β_{X_jY} Var(Y_ts) + Σ_i β_{X_jC_i} Cov(Y_ts, C_{i,ts}),    (14)

showing that these three quantities still depend on Cov(Y_ts, C_{i,ts}) [in addition to depending on Var(Y_ts), Var(C_ts), and Cov(C_{i,ts}, C_{i',ts})]. This observation implies that the E[mse] will still be unstable with respect to shifts in these quantities, even when the regression model is trained on unconfounded data (a situation where the estimates β̂^tr_j are not influenced by spurious associations generated by the confounders).

The expected mse of models trained with test set features processed according to the causality-aware approach, on the other hand, does not depend on Cov(Y_ts, C_{i,ts}), Var(C_ts), or Cov(C_{i,ts}, C_{i',ts}) since the approach also deconfounds the test set features. Note that direct computation of Var(X*_{j,ts}), Cov(X*_{j,ts}, X*_{k,ts}), and Cov(X*_{j,ts}, Y_ts) based on the causality-aware features, X*_{j,ts} = β_{X_jY} Y_ts + U^ts_{X_j}, shows that

Var(X*_{j,ts}) = β²_{X_jY} Var(Y_ts) + Var(U^ts_{X_j}),    (15)

Cov(X*_{j,ts}, X*_{k,ts}) = β_{X_jY} β_{X_kY} Var(Y_ts) + Cov(U^ts_{X_j}, U^ts_{X_k}),    (16)

Cov(X*_{j,ts}, Y_ts) = β_{X_jY} Var(Y_ts),    (17)

which no longer depend on Cov(Y_ts, C_{i,ts}), Var(C_ts), or Cov(C_{i,ts}, C_{i',ts}), so that the E[mse] based on the causality-aware features is stable against shifts in these quantities.
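As a sanity check on these identities, the short sketch below (single confounder, arbitrary illustrative coefficients, not values from the article) verifies numerically that Cov(X_j, Y) follows (14) and therefore tracks Cov(Y, C), whereas the causality-aware feature satisfies (17) and does not.

```python
# Numerical sanity check of (14) and (17) with a single confounder.
# Coefficients are arbitrary illustrative choices, not values from the article.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
b_yc, b_xc, b_xy = 0.6, 0.9, 0.8

C = rng.normal(size=n)
Y = b_yc * C + rng.normal(size=n)
U_X = rng.normal(size=n)
X = b_xy * Y + b_xc * C + U_X        # raw (confounded) feature
X_star = b_xy * Y + U_X              # population-level causality-aware feature

var_y, cov_yc = Y.var(), np.cov(Y, C)[0, 1]
print(np.cov(X, Y)[0, 1], b_xy * var_y + b_xc * cov_yc)   # (14): depends on Cov(Y, C)
print(np.cov(X_star, Y)[0, 1], b_xy * var_y)              # (17): free of Cov(Y, C)
```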