Targeted-BEHRT: Deep learning for observational causal inference on longitudinal electronic health records

Observational causal inference is useful for decision making in medicine when randomized clinical trials (RCTs) are infeasible or not generalizable. However, traditional approaches often fail to deliver unconfounded causal conclusions in practice. The rise of "doubly robust" non-parametric tools, coupled with the growth of deep learning for capturing rich representations of multimodal data, offers a unique opportunity to develop and test such models for causal inference on comprehensive electronic health records (EHR). In this paper, we investigate causal modelling of an RCT-established null causal association: the effect of antihypertensive use on incident cancer risk. We develop a dataset for our observational study and a Transformer-based model, Targeted BEHRT, which, coupled with doubly robust estimation, estimates the average risk ratio (RR). We compare our model to benchmark statistical and deep learning models for causal inference in multiple experiments on semi-synthetic derivations of our dataset with various types and intensities of confounding. To further test the reliability of our approach, we evaluate our model in situations of limited data. We find that our model provides more accurate estimates of RR (lowest sum absolute error from ground truth) compared to benchmarks for risk ratio estimation on high-dimensional EHR across experiments. Finally, we apply our model to the original case study, the effect of antihypertensives on cancer, and demonstrate that it generally captures the validated null association.


I. INTRODUCTION
The growing availability of routinely collected administrative clinical records databases with linked Electronic Health Records (EHR), describing large numbers of individuals with numerous variables at a population level, provides an encouraging opportunity to conduct observational studies in the medical domain [1]. These administrative EHR datasets, such as the Clinical Practice Research Datalink (CPRD) in the UK, offer a multitude of temporal and static variables in addition to data linkages to other datasets [2], ultimately providing substantive variables for confounding adjustment. Since counterfactual outcomes are missing in our observational dataset, ground truth RR is inaccessible in our study of this association, making model comparisons difficult. Thus, we first construct semi-synthetic derivations of our observational dataset, and then apply our model against statistical and deep learning benchmarks in several experiments to identify the model with the best RR estimation. Second, to test our model in situations of limited data, we demonstrate the utility of T-BEHRT compared to other models in finite-sample estimation experiments.
Lastly, after validating our model on semi-synthetic derivations of routine clinical observational data, we apply it to investigate the effect of ACEIs on cancer relative to other drug classes. Where traditional statistical models have produced conflicting results, these associations have been deemed null in numerous RCTs [10], [11], with narrow confidence intervals, across a wide range of patient groups and multiple cancer subtypes.

A. Background
Traditionally, statistical models like logistic regression have been staple models for observational causal inference in epidemiology. However, these models have known limitations [12]. First, they require careful manual feature engineering, which is useful for modelling known confounders but impractical for unknown or interacting variables. Furthermore, because causality is estimated from finite samples of populations, these models are susceptible to finite-sample bias in causal estimation [13]. One proposed solution is to adjust solely for the variables associated with exposure via propensity score modelling [14]; however, propensity score-based methods require correct specification of the exposure prediction model, which may not be guaranteed [15]. More recent work in semi-parametric estimation theory, namely "doubly robust" estimation theory, circumvents the problems of these modelling strategies. Doubly robust estimators rely on the consistency of either the propensity score prediction or the outcome prediction to produce unbiased causal effect estimates [16], [17], and examples such as Targeted Maximum Likelihood Estimation (TMLE) and derivatives such as Cross-Validated TMLE (CV-TMLE) have been prolifically used to explore causal inference problems of average treatment effect (ATE).
More recently, there have been advances in deep learning for causal inference. Models like TARNET, Dragonnet, CEVAE, and others have been tested on synthetic and semi-synthetic derivations of static tabular data [7], [18]-[20]. However, many of these datasets have a limited number of variables and limited complexity [21], [22]. Furthermore, these models have not been specifically tested in data and experimental settings involving routinely collected, multivariate, comprehensive EHR data. And even though deep learning can incorporate multimodal variables, few approaches model both temporal and static variables for causal inference, and fewer still develop environments to objectively test proposed solutions against benchmarks. Lastly, the considerable literature on deep learning for causal inference investigates ATE, conditional ATE, and Individualised Treatment Effect (ITE) almost exclusively; methods have rarely been evaluated for robustness in RR, a metric preferred by clinicians since RR captures risk relative to baseline risk (i.e., risk in the control cohort).
Our model, T-BEHRT, combines advances in deep EHR modelling and causal inference by utilising an expanded BEHRT for modelling temporal and static EHR data. Furthermore, T-BEHRT models propensity, thus allowing for estimation correction with CV-TMLE in order to mitigate finite-sample estimation bias. Lastly, we model auxiliary unsupervised tasks in tandem with learning the causal objective to ultimately aid the causal adjustment process. Previous works [23]-[25] demonstrate that auxiliary unsupervised learning 1) adds an additional inductive bias, ultimately improving generalizability, and 2) helps to learn representations shared with or beneficial for the main task: confounding adjustment and causal inference. We hypothesize that incorporating unsupervised representation learning alongside propensity modelling will help provide more accurate estimates of RR.

II. METHODS

A. Dataset and patient selection
For our investigations, we used a cut of CPRD, which has been described previously [5]: it entails records from 1 January 1985 to 31 December 2015 and is linked to national administrative databases including hospitalisations (Hospital Episode Statistics, HES) and death registration (from the Office for National Statistics).
The data for the investigations were restricted to patients in the database who met the following criteria: (1) registered with the general practice for at least 12 months; (2) aged ≥16 years at registration; (3) registered with a practice considered to provide 'up-to-standard' data to CPRD; (4) individual data marked by CPRD to be of 'acceptable' quality for research purposes; and (5) registered with a practice that provided consent for linking the data with national databases for hospitalisations and the death registry. We mapped diagnosis and medication codes to a homogenized format for machine readability. This led to a dataset of 6,777,845 patients, which was used for general representation learning (shown in Fig. 1) for deep learning models.
For our causal inference investigation (i.e., investigating the effect of antihypertensives on incident cancer), a dataset containing five subpopulations had to be selected, one for each class of antihypertensives: ACEIs, diuretics, CCBs, Beta Blockers (BBs), and Angiotensin II Receptor Blockers (ARBs). Patients were assigned to one of these groups based on the first class of antihypertensive medication recorded before 2009, provided they were free of any cancer report before this first prescription; the year 2009 was chosen to ensure sufficient follow-up time for the occurrence of potential cancers. The date of this first prescription was defined as 'baseline' (a date between 1985 and 31 December 2008). Patients were then followed up from baseline until cancer diagnosis (including cancer diagnoses recorded as cause of death) or the end of the five-year follow-up period. The learning period included each patient's entire medical record up to a random point between 6 and 12 months before baseline; this accounts for potential inaccuracies in the timing of prescription (or the decision to prescribe) and avoids the possibility of the antihypertensive prescription itself influencing model training. CPRD Product codes are used for identifying classes of antihypertensives, and the set of codes was obtained from a dataset published by the University of Bristol [26]. Codes for cancer are found in Supplementary Table 1 and derived from a clinically established publication of codes [27].
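The group-assignment logic above (first antihypertensive prescription before 2009 defines the exposure class and baseline date; patients with an earlier cancer report are excluded) might be sketched as follows. This is an illustrative reconstruction, not the paper's code: the column names (`patid`, `date`, `drug_class`) and class labels are hypothetical.

```python
import pandas as pd

# Hypothetical drug-class labels for the five antihypertensive classes
ANTIHTN_CLASSES = {"ACEI", "Diuretic", "CCB", "BB", "ARB"}

def assign_exposure_groups(prescriptions: pd.DataFrame, cancers: pd.DataFrame) -> pd.DataFrame:
    """Assign each patient to the class of their first antihypertensive
    prescription recorded before 2009, excluding patients with any cancer
    record on or before that first prescription (the 'baseline' date).

    prescriptions: columns [patid, date, drug_class]
    cancers:       columns [patid, date]
    """
    ah = prescriptions[prescriptions["drug_class"].isin(ANTIHTN_CLASSES)]
    ah = ah[ah["date"] < pd.Timestamp("2009-01-01")]
    # the first antihypertensive prescription per patient defines baseline
    first = ah.sort_values("date").groupby("patid", as_index=False).first()
    first = first.rename(columns={"date": "baseline"})
    # drop patients with a cancer report on or before baseline
    merged = first.merge(cancers, on="patid", how="left")
    prior_cancer = merged.loc[merged["date"] <= merged["baseline"], "patid"].unique()
    return first[~first["patid"].isin(prior_cancer)].reset_index(drop=True)
```

In practice the follow-up window (baseline to cancer diagnosis or five years) and the truncated learning period would be derived from the same `baseline` column.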

B. Semi-synthetic data derivation
Generating sequential, temporal variables is a difficult task, and currently there is no medically validated method of generating realistic EHR medical history. Thus, we utilised the given medical history in observational data and exclusively simulated binary factual and counterfactual outcomes.
Inspired by other semi-synthetic data simulations [20], [28], we first modelled the association between a medical history variable Z_i (e.g., some diagnosis/medication) and exposure A_i with the empirical propensity in the dataset: p_i = P(A_i = 1 | Z_i). If Z_i was associated with the exposure (p_i ≠ 0.5), we generated the conditional outcomes Y_i^{A=1} and Y_i^{A=0} as a function of Z_i and exposure A_i = 1 and A_i = 0 respectively. In this way, semi-synthetic outcomes arose from an association between Z_i and the exposure and between Z_i and the outcome; thus, the relationship between exposure and outcome is confounded by Z_i. While the empirical RR (the proportion of the outcome in one exposure group divided by the same in the other) would yield confounded causal conclusions, effectively adjusting for Z_i would yield an identifiable causal association between exposure and outcome.
In addition, to test models' adjustment ability in situations of varying confounding intensity, we weighted the contribution of the confounding with a factor β: the greater the β, the greater the confounding. More details of the semi-synthetic data generative process and the functions modelled are in Supplementary Methods: Semi-synthetic data generation.
In our work, we present investigations on semi-synthetic data utilising two forms of confounders: persisting and transient confounding. We define persisting confounding as arising from confounders assigned at birth that persist through one's life course, e.g., ethnicity, sex, genes, and other variables assigned at birth that associate with variables later in life. We define transient confounding as arising from confounders that manifest at a point or period of one's life and affect events downstream in time, e.g., disease diagnoses, age itself, prescriptions, and other variables not assigned at birth. These two distinctions of confounding are presented in this work because they naturally capture prevalent forms of confounding seen in population health databases [29]. A visual depiction can be found in Fig. 2.
From our observational dataset, we investigated two exposure groups, ACEIs and diuretics, and noticed that female sex was associated with diuretic exposure status; we therefore chose it as a persistent confounder and generated conditional outcomes. For another pair of exposures, ARBs and CCBs, we identified an association between CCBs and the incidence of at least one of heart failure, hypertension, ischemic heart disease, and diabetes mellitus. We thus labelled the occurrence of at least one of these diseases "cardiometabolic disease" and utilise it as a transient confounder for the second set of semi-synthetic data experiments. For confounding intensity, we chose β values of [1, 5, 10] and [25, 50, 75] for the experiments with sex and cardiometabolic disease as confounder respectively, totalling six experiments on semi-synthetic data. In sum, with this confounding generation method, models' confounding adjustment ability is tested with two forms of confounding at multiple β intensities.
On the dataset with cardiometabolic disease confounding at β = 75, we additionally conducted finite-sample causal estimation experiments. Since estimators are known to be unstable in many finite-sample cases (e.g., inverse probability weight based estimators) despite asymptotic guarantees [30], we wished to assess our model's finite-sample estimation ability. Thus, we investigated the estimation ability of our proposed model and other deep learning models by applying them to random subsamples of this dataset: 2.5%, 5%, 10%, 25%, 50%, and finally the entire dataset.
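The subsampling scheme above can be sketched in a few lines. This is a minimal illustration: the paper does not specify the random seed or whether the subsamples are nested, so both are assumptions here.

```python
import numpy as np

def subsample_indices(n_patients, fractions=(0.025, 0.05, 0.10, 0.25, 0.50, 1.0), seed=0):
    """Draw one random patient subsample (without replacement) per fraction
    for the finite-sample estimation experiments."""
    rng = np.random.default_rng(seed)
    out = {}
    for f in fractions:
        k = max(1, int(round(f * n_patients)))
        out[f] = rng.choice(n_patients, size=k, replace=False)  # patient indices
    return out
```

Each model is then fit and evaluated on every subsample, so that estimation error can be compared across dataset sizes.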

C. Feature Selection and Pre-Processing
The modalities of CPRD considered for modelling were sex, region, diagnoses from both primary and secondary care, medications, systolic blood pressure (BP) measurements, and smoking status, marked chronologically across patient timelines.
We mapped Read codes from primary care and ICD-10 codes from secondary care to 1,471 unique ICD-10 diagnostic codes [31], [32] to homogenize disease codes in the dataset; unmapped codes were included for completeness. Furthermore, we mapped medication codes to 426 codes in the British National Formulary (BNF) coding format [33]. Systolic BP measurements were grouped into 16 categories based on prespecified boundaries ([90, 116], (116, 121], (121, 126], ..., (181, 186], >186). Furthermore, we utilised calendar year and age information for the temporal modalities. Each patient p had n_p encounters, or instances of the modalities: diagnoses, medications, and systolic BP measurements. Smoking status at baseline, region, and sex were static variables included in modelling.
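The 16-category systolic BP discretisation above can be written as a small lookup. The bin edges are reconstructed from the stated boundaries; how readings below 90 mmHg are handled is not specified in the text, so folding them into the lowest band is an assumption of this sketch.

```python
import numpy as np

# Upper bounds of the first 15 categories: [90, 116], (116, 121], ..., (181, 186]
# (the 16th category is > 186)
EDGES = [116] + list(range(121, 187, 5))

def bp_category(sbp: float) -> int:
    """Map a systolic BP reading (mmHg) to one of 16 ordinal categories.
    Category 0 is [90, 116]; categories 1..14 are the 5-mmHg bands
    (116, 121] ... (181, 186]; category 15 is > 186."""
    return int(np.searchsorted(EDGES, sbp, side="left"))
```

Each categorised reading is then embedded like any other token in the patient's timeline.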

D. Proposed Model Development
Our model, T-BEHRT, utilises a modified BEHRT feature extractor to capture both static and temporal medical history variables and produces initial estimates of RR. Downstream, we use CV-TMLE to correct for bias in the initial RR estimate and compute the corrected RR (Fig. 3).
Intuitively, T-BEHRT first extracts latent EHR features from static covariates and fixed sub-sequences of medical history with BEHRT. Second, the model predicts the propensity of exposure and the conditional outcomes using these extracted features, similarly to the Dragonnet model. Third, by additionally conducting auxiliary unsupervised learning, the model trains on reconstruction of both static and temporal data with two-part Masked EHR Modelling (MEM).
The propensity prediction model is a one-hidden-layer multilayer perceptron (MLP), and for each conditional outcome, we use a two-hidden-layer MLP with Exponential Linear Unit (ELU) activations.
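The prediction heads described above might look as follows in PyTorch. This is a sketch, not the paper's implementation: the hidden width, the propensity head's activation, and the use of one outcome head per exposure arm (as in TARNET/Dragonnet) are assumptions here.

```python
import torch
import torch.nn as nn

class CausalHeads(nn.Module):
    """Prediction heads on top of the pooled BEHRT representation:
    a 1-hidden-layer MLP for propensity and, per exposure arm, a
    2-hidden-layer MLP with ELU activations for the conditional outcome."""

    def __init__(self, d_model: int, d_hidden: int = 64):
        super().__init__()
        self.propensity = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ELU(), nn.Linear(d_hidden, 1))

        def outcome_head():
            return nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.ELU(),
                nn.Linear(d_hidden, d_hidden), nn.ELU(),
                nn.Linear(d_hidden, 1))

        self.q0, self.q1 = outcome_head(), outcome_head()  # one head per arm

    def forward(self, h):
        # h: pooled BEHRT representation, shape (batch, d_model)
        g = torch.sigmoid(self.propensity(h)).squeeze(-1)  # P(A=1 | x)
        q0 = torch.sigmoid(self.q0(h)).squeeze(-1)         # P(Y=1 | x, A=0)
        q1 = torch.sigmoid(self.q1(h)).squeeze(-1)         # P(Y=1 | x, A=1)
        return g, q0, q1
```

The per-arm outcome predictions q0 and q1 are what the downstream RR estimators consume.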
With parameters Θ, propensity prediction head g(x_i), and conditional outcome prediction heads Q(x_i, a_i) for input x_i and exposure a_i for patient i, the supervised loss is:

L_pred(x_i; Θ) = CE(Q(x_i, a_i; Θ), y_i) + CE(g(x_i; Θ), a_i),

where CE denotes the cross-entropy loss and y_i the factual outcome. Next, we conduct MEM for two-part unsupervised learning: (1) temporal variable and (2) static variable modelling. The first part, unsupervised learning on temporal data, functions similarly to masked language modelling (MLM) in Natural Language Processing [34]. In MLM, the model receives a combination of masked, replaced, and unperturbed tokens (temporal or textual data), and the task is to predict the masked or replaced encounters. We do the same but additionally enforce another constraint: when replacing encounters, we do not replace encounters with those that define the exposure or outcome, i.e., antihypertensives and cancer in the current set of experiments. With encounter j for patient i represented as w_{i,j}, masked/replaced encounters represented as w̃_{i,j}, BEHRT feature extractor B, temporal unsupervised prediction network M, and neural network parameters Θ_temporal, the objective function is:

L_temp(x_i; Θ_temporal) = Σ_{j ∈ masked} CE(M(B(x̃_i))_j, w_{i,j}).

For the second part of MEM, static data modelling, we chose a VAE for unsupervised learning, owing to the cumulative literature empirically demonstrating its strength in representation learning and the utilisation of VAE structures in other causal deep learning models such as CEVAE [18]. As stated, we model region, smoking status at baseline, and sex as static variables; the variables are embedded in high-dimensional categorical embeddings, and the information is concatenated to the BEHRT encounter embeddings. Thus, the BEHRT model functions as feature extractor for the temporal variables and encoder for the VAE. The temporal variables interact with the static variables through the multi-head self-attention mechanism of the BEHRT architecture [5]. For training the VAE, similarly to the temporal modelling, we mask some variables in the input and use a variable-specific decoder to decode the variable (if masked). Specifically, for static variable s_{v,i} of a total of V static variables for patient i, with q_enc(z | x_i) representing the encoder and p_dec(s_{v,i} | z) representing the multivariate Bernoulli decoder for variable v, the VAE loss is:

L_static(x_i) = -E_{q_enc(z | x_i)}[ Σ_{v=1}^{V} log p_dec(s_{v,i} | z) ] + KL(q_enc(z | x_i) || p(z)).

The complete objective function to be minimized is:

L = L_pred + γ (L_temp + L_static),

with hyperparameter γ weighting the contribution of the unsupervised MEM loss terms.

E. Benchmarks and causal estimation
Before pursuing the causal investigations with deep learning modelling, we pre-trained contextualised EHR embeddings and network weights through MEM on the pre-training dataset. This MEM task trains weights on all patients in CPRD (6,777,845 patients; Fig. 1) before progressing to causal modelling.
For the semi-synthetic investigations and the first routine clinical investigation, we implemented statistical and deep learning models to serve as benchmark comparison models for causal inference. The benchmarks include Bayesian Additive Regression Trees (BART) [35], Logistic Regression (LR) and its L1/L2 regularization variants, and logistic regression with Targeted Maximum Likelihood Estimation (TMLE) [36]. We chose the covariates for these models to be baseline age, smoking status, sex, region, incidence of 33 curated disease groups, and additionally prescription of four medication groups. While inclusion of baseline variables in epidemiological observational studies is standard practice, we specifically include the disease and medication groups to enable a fairer comparison to deep learning modelling. Furthermore, diagnoses and medications are known confounders in observational studies, so adjustment for these variables is important for causal estimation. To ensure that the diagnosis and medication groups are medically valid clusters of diseases and medications respectively, we utilised groups compiled by past medical research [26], [27]. A deeper explication of statistical model development is given in Supplementary Methods: Statistical model development.
To serve as deep learning benchmarks, we implemented staple models from the causal deep learning literature for estimating average exposure effect: TARNET, TARNET + MEM (i.e., with the unsupervised MEM component), and Dragonnet, each with the BEHRT feature extractor and the embedding format presented in Fig. 3A. We initialised these models with pre-trained weights. After implementing and evaluating the benchmarks, we implemented T-BEHRT with pre-trained network weights where applicable and pursued modelling of the semi-synthetic data investigations.
For the semi-synthetic data experiments, we did not feed the variables denoting cardiometabolic disease and sex respectively as input; we wished the statistical and deep learning models to infer confounding from the remaining input variables. In routine clinical data, observational studies often do not have access to all confounding variables; thus, it is important to test models' ability to adjust for confounding given limited input variables.
For all investigations, we conducted experiments with five-fold cross-validated causal estimation. We calculated RR on the test dataset for each fold, as advised by Chernozhukov et al. [16], and computed 95% Confidence Intervals (CI) over the five folds. For TARNET, TARNET-MEM, LR (and its L1/L2 regularization variants), and BART, we computed RR with the naïve plug-in estimator on a finite sample:

RR = E_n[Q(x, 1)] / E_n[Q(x, 0)],

where E_n denotes the empirical mean over the test fold. For Dragonnet and T-BEHRT, we use CV-TMLE for estimation of RR. For more information on the CV-TMLE method, its advantages over TMLE, and implementation, please refer to Supplementary Methods: CV-TMLE. For models that utilised predicted propensity scores, we conducted propensity score trimming and excluded patients with a predicted propensity score greater than 0.97 or less than 0.03 [37] before pursuing RR calculation.
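The plug-in RR estimator with the propensity-trimming step described above can be sketched as follows (a minimal illustration on one test fold; fold handling and CI computation are omitted):

```python
import numpy as np

def naive_rr(q1, q0, g=None, lo=0.03, hi=0.97):
    """Plug-in risk-ratio estimate RR = E_n[Q(x,1)] / E_n[Q(x,0)] on one
    test fold. If predicted propensity scores g are supplied, patients
    with g < lo or g > hi are trimmed before estimation."""
    q1, q0 = np.asarray(q1, float), np.asarray(q0, float)
    if g is not None:
        keep = (np.asarray(g) >= lo) & (np.asarray(g) <= hi)
        q1, q0 = q1[keep], q0[keep]
    return q1.mean() / q0.mean()
```

Repeating this per fold and summarising the five fold-wise estimates yields the reported mean RR and 95% CI.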
We identified the superior model as the one with the least Sum Absolute Error (SAE) over the three β values for each confounding experiment. We give the Standard Error (SE) for the SAE, calculated using additive propagation of error [38]. For deep learning models, we also demonstrate the change in SAE as modular additions are incorporated into the model.
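The SAE and its standard error by additive propagation of error (the SE of a sum is the root of the sum of squared SEs) can be computed as:

```python
import numpy as np

def sae_with_se(estimates, ses, truths):
    """Sum Absolute Error across the beta levels of one confounding
    experiment, with its standard error via additive propagation:
    SE(sum) = sqrt(sum of SE^2)."""
    estimates, ses, truths = map(np.asarray, (estimates, ses, truths))
    sae = np.abs(estimates - truths).sum()
    se = np.sqrt((ses ** 2).sum())
    return sae, se
```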

F. Implementation
We developed all statistical and deep learning models in Python. The deep learning models were implemented with PyTorch [39]. Hyperparameters for the BEHRT feature extractor are found in Supplementary Table 2. For training all deep learning models, we used the Adam optimizer [40] with an exponential decay scheduler (decay rate = 0.95) to ensure training convergence. For TARNET-MEM and T-BEHRT, we pre-trained for 5 epochs exclusively on the MEM task before initiating joint MEM-causal task training.

III. RESULTS

A. Population Statistics
In the dataset for the investigation of antihypertensives on incident cancer, we identified 186,709, 150,098, 128,597, 28,991, and 21,970 patients for ACEIs, BBs, CCBs, diuretics, and ARBs respectively, totalling 516,365 patients. We demonstrate population statistics in Table 1.

B. Semi-synthetic Data Experiments
In the semi-synthetic experiments with cardiometabolic disease and sex as confounders, we tested T-BEHRT against several statistical and deep learning benchmarks. In Fig. 4A/B, we show SAE with SE measures calculated over all β-specific semi-synthetic data experiments. We include more detailed experimental results in Supplementary Table 3.
We found that our proposed model, T-BEHRT, outperforms all of the deep learning and statistical benchmarks in terms of SAE whilst maintaining a narrow SE, as shown in panels A and B (Fig. 4). Additionally, across both experiments, we show that deep learning models for EHR benefit from the inclusion of propensity score modelling. This is seen in the superior performance of both Dragonnet and T-BEHRT in comparison with TARNET, which does not include propensity score modelling. However, by investigating the inclusion of various modules appended to the chassis of TARNET in our module inclusion analysis (Fig. 4C), we see that inclusion of MEM may improve RR estimation in a parallel way: TARNET with MEM (SAE reduction of 0.463) does approximately as well as Dragonnet + CV-TMLE (SAE reduction of 0.371), averaged over the experiments of persistent and transient confounding. Regardless, combining both MEM and propensity modelling to form T-BEHRT demonstrates the greatest SAE reduction, 0.676.
In the finite-sample estimation experiments shown in Fig. 5, we demonstrate that T-BEHRT outperforms other models in RR estimation on individual subsamples and across data subsamples. While the improvement of T-BEHRT over Dragonnet is less pronounced than over other models, panel B shows that T-BEHRT still demonstrates superior RR estimation performance with respect to the deep learning benchmarks. Furthermore, we show that inclusion of MEM aids more precise estimation of RR: TARNET-MEM and T-BEHRT perform better than TARNET and Dragonnet respectively across all dataset sizes. However, we note that the application of CV-TMLE is more important than MEM in smaller datasets, as seen in the superior performance of Dragonnet + CV-TMLE relative to TARNET + MEM. Furthermore, models equipped with CV-TMLE maintain relatively stable SAE across subsampling fractions, while TARNET and its derivatives suffer in RR estimation on smaller datasets. Lastly, we see that as dataset size increases, SAE across models begins to converge. Theoretically, as the number of samples increases, finite-sample bias is slowly mitigated, and thus the performance of TARNET and its derivatives should approach that of models assisted by propensity modelling, as also noted by Shi et al. [7].

Fig. 4. Experiments on semi-synthetic data with sex (A) and cardiometabolic disease (B) as confounders; module inclusion analysis of causal modules (C). We show Sum Absolute Error (SAE) between ground truth risk ratio (RR) and estimated RR with standard error measures in both panels. The x axis shows the models implemented on these datasets, and the y axis shows the SAE (lower is better). We present the numerical values and standard error measures underneath the model names. In C, we present the transformation from TARNET into the other benchmarks and our proposed model (deep learning models); the change in average SAE across the experiments of transient and persistent confounding is shown in green as the model transforms through the inclusion of the various modules.

We apply our model to the routine clinical data study of the effect of ACEIs on incident cancer with respect to the other antihypertensive drug classes and demonstrate the results in Fig. 5. Across all four drug class comparisons, while the empirical RR often tends away from the null, implying a preventive or harmful effect, our model's 95% confidence interval for RR covers the null (RR of 1.0) in all drug class comparisons with the exception of CCBs.

IV. DISCUSSION
In this paper, by utilising large-scale comprehensive EHR and deep learning methods, we have developed a model for observational causal inference for the clinical sciences. We validated our model against benchmarks across six semi-synthetic experiments in addition to a finite-sample estimation study and found that T-BEHRT demonstrates superior estimation in cases of both persistent and transient confounding. Finally, we applied our model to a routine clinical data observational study.
Our work contributes to the field of EHR-based deep learning research. First, the model consolidates multiple static and temporal data embeddings into a unified embedding structure, thus allowing adjustment over multiple datatypes in rich EHR. Second, T-BEHRT conducts novel MEM unsupervised learning, using MLM- and VAE-based representation learning in tandem with the causal inference objective. We demonstrate the benefits of unsupervised learning in the context of average RR estimation in multiple experiments as well. To our knowledge, this is the first work conducting causal inference incorporating unsupervised learning on multiple data types. Third, we introduce CV-TMLE estimation correction for less biased RR estimation with deep learning causal models on EHR data. While propensity modelling with CV-TMLE is only about as effective as MEM modelling for RR estimation at larger dataset sizes, we found in our finite-sample estimation experiments that the former is far more critical for accurate RR estimation. Finally, in our observational study, we show that our model can be applied to test a clinical hypothesis in an observational setting.
Our work has some limitations and scope to grow as well. First and most fundamentally, we note that EHR data may not completely capture the confounding variables in our observational studies. Latent confounding can be investigated in future works by applying latent variable modelling techniques [8] to rich EHR. Furthermore, while we have included the data modalities of diagnoses, medications, smoking, sex, and systolic BP, better confounding adjustment might manifest with fuller utilisation of the modalities that rich databases like CPRD have to offer. In this work, we have included sex as an explicit feature in deep learning modelling to adhere to conventions of adjustment in clinical/epidemiological research; however, it has been shown that BEHRT naturally captures sex [5] without explicit inclusion as a variable. Also, in terms of data curation, we have allocated patients to an exposure group based on the first prescription of a class of antihypertensives; subgroup investigations stratified by intensity and duration of drug class should be pursued in future studies. Furthermore, T-BEHRT estimated a null association in most drug comparisons in the routine clinical data study, but the comparison with CCBs deviated from the null. While findings from RCTs generally demonstrate that antihypertensives have a null effect on cancer, the evidence regarding CCBs is still conflicting, and further research is required [11]. Lastly, in our modelling, we cannot entirely rule out residual confounding: a variety of variables affecting the outcome may remain unadjusted for (explicitly or through latent representation modelling), and, as discussed, further modality inclusion is necessary in future work.

V. CONCLUSION
To conclude, we have developed a deep learning model for EHR data for superior and reliable estimation of RR. T-BEHRT performed best in semi-synthetic data experiments with both persistent and transient confounding and can be applied to observational studies on routine clinical data. In the future, this model should be further tested and applied to investigate other causal hypotheses using routine EHR.

A. Supplementary Methods: Semi-synthetic data simulation
Generating sequential, temporal variables is a difficult task, and currently there is no medically validated method of generating realistic EHR medical history, exposure assignment, and outcome. Thus, we limit data generation to factual/counterfactual outcome generation and utilise the routine clinical data components: (1) medical history with balanced exposure groups and (2) exposure status for an observational investigation of an association. To create this semi-synthetic dataset, we first form the dataset for the investigation of the effect of antihypertensives on incident cancer, giving us access to components (1) and (2). Since outcome generation using only (2), exposure status, would be unconfounded and thus too rudimentary a function to test the tolerance of models, we force the relationship between treatment and outcome to be confounded. Since confounding often manifests partly due to imbalanced variables between exposure groups, we find an imbalanced variable Z_i in (1), routine medical history. We then force this imbalanced variable Z_i to be a confounder and generate conditional outcomes from a sampling function:

Y_i ~ Bernoulli( σ(a β Z_i + m A_i + c) ),

where A_i is the exposure for patient i, Y_i is the outcome for patient i, σ is the sigmoid function, and β is the magnitude of confounding. Variables a, m, and c are coefficients weighting the importance of each term in the function. Intuitively, we first model the association between a variable Z_i and exposure (P(A_i | Z_i)) with p_i. Next, we generate Y_i from two variables: the variable Z_i and the exposure (P(Y_i | Z_i, A_i)). In this way, we form an association between Z_i and the exposure and between Z_i and the outcome; with an association to both, we force Z_i to be a confounder in this data-generating process. This process synthesizes controlled confounded observational data; by generating the outcome with this function, we control confounding through the confounder Z_i. Thus, we can generate factual/counterfactual outcomes and consequently the ground truth RR. Lastly, we can modify the β value to vary the degree of confounding in the data generation process.
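The outcome-generating process described in this section might be sketched as follows. The coefficient values (a, m, c) below are illustrative placeholders, not the paper's values, and the exact functional form is a reconstruction from the surrounding description.

```python
import numpy as np

def simulate_outcomes(Z, A, beta, a=1.0, m=0.5, c=-2.0, seed=0):
    """Draw Bernoulli outcomes with success probability
    sigmoid(a*beta*Z + m*A + c). Both the factual outcome (at the
    observed exposure A) and the counterfactual (at 1 - A) are drawn,
    so the ground-truth RR is computable from the potential outcomes."""
    rng = np.random.default_rng(seed)
    Z, A = np.asarray(Z), np.asarray(A)

    def draw(a_vec):
        p = 1.0 / (1.0 + np.exp(-(a * beta * Z + m * a_vec + c)))  # sigmoid
        return rng.binomial(1, p)

    y_factual = draw(A)
    y_counter = draw(1 - A)
    # assemble potential outcomes under A=1 and A=0 for the ground-truth RR
    y1 = np.where(A == 1, y_factual, y_counter)
    y0 = np.where(A == 0, y_factual, y_counter)
    return y_factual, y_counter, y1.mean() / y0.mean()
```

Varying `beta` reproduces the different confounding intensities used in the experiments; with `beta = 0` the generated association is unconfounded.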

B. Supplementary Methods: CV-TMLE
After using T-BEHRT to compute initial estimates, we use CV-TMLE [41] to correct these estimates. We refer readers to the source material for the theory behind TMLE and its cross-validated form, CV-TMLE [17], [41]. In brief, the original formulation of the CV-TMLE algorithm requires k targeting steps, one for each of the k folds, for each of the iterations pre-defined in the iterative version of TMLE. However, Levy presents a simpler construction of CV-TMLE which is less computationally cumbersome: the advised method is to pool the initial estimates across folds and compute corrected estimates via a standard TMLE update step [41]. Albeit procedurally different, the original formulation and Levy's more recent formulation of CV-TMLE are mathematically and functionally identical. To our knowledge, this is the first work utilising CV-TMLE paired with deep learning methods.
CV-TMLE provides a host of benefits for observational causal inference. As recommended by Chernozhukov [42], with validation in a k-fold framework, CV-TMLE is a form of TMLE that is robust to issues of fold-wise overfitting while conducting k-fold cross-validation [41]. Furthermore, previous works show that the CV-TMLE estimator is more robust than other cross-validated estimators (e.g., CV-AIPTW) under violations of the overlap assumption [43].
The TMLE was developed using two logistic regression models: one for outcome prediction and the other for exposure prediction. The outcome prediction model adjusted for the covariates and exposure variable listed above, and the exposure prediction model used only the covariates. The TMLE algorithm was fit and tested using five-fold validation. The TMLE RR estimates were calculated on the testing dataset in each fold, and the mean RR estimate and 95% confidence intervals (CI) for the estimates were derived.
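The per-fold procedure above (two logistic working models, a targeting step, and fold-wise RR estimates averaged over five folds) can be sketched as below. This is a simplified illustration, not the pooled Levy construction used with T-BEHRT and not the authors' implementation: the function name `cv_tmle_rr`, the clipping constant, and the BFGS-based fluctuation fit are our own assumptions, and the initial estimators here are plain logistic regressions rather than the deep model.

```python
import numpy as np
from scipy.special import expit, logit
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cv_tmle_rr(X, A, Y, n_splits=5, clip=1e-3):
    """Fold-wise TMLE risk-ratio sketch for binary exposure A and outcome Y."""
    fold_rr = []
    for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(X):
        # Initial estimators: outcome model Q(A, X) and propensity model g(X)
        Q = LogisticRegression(max_iter=2000).fit(
            np.column_stack([A[tr], X[tr]]), Y[tr])
        g = LogisticRegression(max_iter=2000).fit(X[tr], A[tr])

        Xe, Ae, Ye = X[te], A[te], Y[te]
        ge = np.clip(g.predict_proba(Xe)[:, 1], clip, 1 - clip)
        Q1 = np.clip(Q.predict_proba(
            np.column_stack([np.ones(len(Ae)), Xe]))[:, 1], clip, 1 - clip)
        Q0 = np.clip(Q.predict_proba(
            np.column_stack([np.zeros(len(Ae)), Xe]))[:, 1], clip, 1 - clip)
        QA = np.where(Ae == 1, Q1, Q0)

        # Clever covariates for the two exposure arms
        H1, H0 = Ae / ge, (1 - Ae) / (1 - ge)
        off = logit(QA)

        def nll(eps):
            # Fluctuation (targeting) step: logistic loss with logit(QA) offset
            p = np.clip(expit(off + eps[0] * H1 + eps[1] * H0), clip, 1 - clip)
            return -np.sum(Ye * np.log(p) + (1 - Ye) * np.log(1 - p))

        e1, e0 = minimize(nll, np.zeros(2), method="BFGS").x
        Q1s = expit(logit(Q1) + e1 / ge)        # targeted E[Y | A=1, X]
        Q0s = expit(logit(Q0) + e0 / (1 - ge))  # targeted E[Y | A=0, X]
        fold_rr.append(Q1s.mean() / Q0s.mean())
    return float(np.mean(fold_rr))
```

The targeting step nudges the initial outcome predictions along the clever covariates so that the plug-in RR estimate satisfies the TMLE efficient-influence-function condition on each held-out fold; the fold-wise RRs are then averaged as described in the text.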

Fig. 1. Representation learning data selection pipeline. We use the Clinical Practice Research Datalink (CPRD) and extract diagnoses, medications, blood pressure, smoking, and sex records. We homogenize codes from ICD-10 and Read into one format. Unmapped Read codes were kept for completeness.

Fig. 2. Confounding forms investigated in semi-synthetic data investigations. We pursue semi-synthetic data generation with confounding variable Z. The two types we define are persistent and transient confounding. The former is a potential confounder assigned at birth that affects all variables in one's life course. Transient confounding is a potential confounder manifesting at a certain point in time after birth and only confounding downstream associations.

Fig. 3. Targeted BEHRT and embedding structure. A. Above, the model is shown. Generally, an input x (static and temporal variables) is fed to a feature extractor, which outputs a dense latent state (for EHR modelling, this feature extractor is BEHRT). The output of the final layer of the BEHRT feature extractor is fed to the Masked EHR Modelling (MEM) prediction head to predict any masked encounters. The last token's latent state (TN+1) is fed to a Variational Autoencoder (VAE) neural network to predict masked static variables. The latent state of the first token (T1) is fed to a pooling layer to predict propensity and conditional outcomes with multiple prediction heads. The loss consists of the unsupervised loss from the two MEM components (temporal and static unsupervised data training) and the supervised loss of the propensity and factual outcomes. B. Below, the embedding structure for modelling rich EHR data is shown. Clinical encounters timestamped by age/year/relative position are converted to vector representations and fed to the model as temporal variables. Static data variable embeddings (patient sex, region in the UK, and smoking status) are concatenated to the temporal variable embeddings.

Fig. 4. Experiments on semi-synthetic data with sex (A) and cardiometabolic disease (B) as confounders; module-inclusion analysis of causal modules (C). We show the Sum Absolute Error (SAE) between the ground-truth risk ratio (RR) and the estimated RR with standard-error measures in both panels. The x axis shows the models implemented on these datasets, and the y axis shows the SAE (lower is better). We present the numerical value and standard-error measures underneath the model names. In C, we present the transformation from TARNET into the other benchmarks and our proposed model (deep learning models). We show, in green, the change in average SAE across the transient and persistent confounding experiments as the model is transformed by including the various modules.

Fig. 5. Finite-sample experiments on semi-synthetic data. A. We conduct experiments on finite subsamples of the semi-synthetic dataset for cardiometabolic confounding (Beta 75). The subsampling fraction of the dataset is shown on the x axis. The y axis shows the error from the ground-truth risk ratio (RR). The models (TARNET, TARNET with Masked EHR Modelling, Dragonnet + CV-TMLE, and T-BEHRT) estimate RR on the fractional samples of the dataset. The point estimate is the mean value over five-fold cross-validation, and the error bars represent 95% confidence intervals for those point estimates of RR. B. The Sum Absolute Error (SAE) across the seven subsamples of the dataset is shown for each model (denoted by colour). The four models are represented by the four bars, with intervals defined by the Standard Error (SE); the colour scheme is the same as in part A.

Fig. 6. Application of T-BEHRT on routine clinical data: effect of ACEIs on incident cancer with respect to BBs, CCBs, diuretics, and ARBs. This forest plot has four parts, one for each of the antihypertensive groups. We show CV-TMLE risk ratio (RR) estimates with 95% Confidence Intervals (CI) from our T-BEHRT model. In addition, we show the empirical RR in the observational cohort selected for these experiments. The ground truth is assumed to be 1.0 (null), as validated by multiple RCTs. BBs: beta blockers; CCBs: calcium channel blockers; ACEIs: angiotensin-converting-enzyme inhibitors; ARBs: angiotensin receptor blockers; RR: risk ratio.

Provided in Supplementary Table 1, we have ICD-10th-revision codes for cancer stratified by type.

Provided in Supplementary Table 2, we have the hyperparameters used for the BEHRT architecture models. Provided in Supplementary Table 3, we have the raw risk ratio estimates across the semi-synthetic data experiments.

SUPPLEMENTARY TABLE 3. RISK RATIO ESTIMATES ACROSS SEMI-SYNTHETIC DATA EXPERIMENTS. This table shows the risk ratio and standard deviation (five-fold) for statistical and deep learning models over the two semi-synthetic experiments with cardiometabolic diseases and sex as confounders (top and bottom, respectively). Confounding experiments are conducted over various values of Beta. The ground-truth risk ratio is calculated and displayed for both experiments. The risk ratio and 95% confidence interval for each model are presented in the table. The sum absolute error from the ground-truth risk ratios over all the confounding experiments, with 95% confidence interval, is shown in the far-right column. Bolded models are the best statistical and deep learning models. LR: Logistic Regression; LR-L1: Logistic Regression with L1 penalty; LR-L2: Logistic Regression with L2 penalty; TMLE: Targeted Maximum Likelihood Estimation; BART: Bayesian Additive Regression Trees; T-BEHRT: Targeted BEHRT.