An Explainable Transformer-Based Deep Learning Model for the Prediction of Incident Heart Failure

Predicting the incidence of complex chronic conditions such as heart failure is challenging. Deep learning models applied to rich electronic health records may improve prediction but remain unexplainable, hampering their wider use in medical practice. We aimed to develop a deep learning framework for accurate yet explainable prediction of 6-month incident heart failure (HF). Using 100,071 patients from longitudinal linked electronic health records across the UK, we applied a novel Transformer-based risk model using all community and hospital diagnoses and medications, contextualized within the age and calendar year of each patient's clinical encounters. Feature importance was investigated with an ablation analysis, comparing model performance when features were alternately removed, and by comparing the variability of temporal representations. A post-hoc perturbation technique was applied to propagate changes in the input to the outcome for feature contribution analyses. Our model achieved 0.93 area under the receiver operating characteristic curve and 0.69 area under the precision-recall curve on internal 5-fold cross-validation, and outperformed existing deep learning models. Ablation analysis indicated that medication is important for predicting HF risk and that calendar year is more important than chronological age, which was further reinforced by the temporal variability analysis. Contribution analyses identified risk factors that are closely related to HF. Many of them were consistent with existing knowledge from clinical and epidemiological research, but several new associations were revealed that had not been considered in expert-driven risk prediction models. In conclusion, the results highlight that our deep learning model, in addition to high predictive performance, can inform data-driven risk factor identification.


Introduction
Heart failure (HF) remains a major cause of morbidity, mortality, and economic burden. 1 Despite recent evidence suggesting improvements in the quality of clinical care that patients with HF receive, and favorable trends in prognosis, 2 the incidence of HF has changed little. 3 Indeed, as a consequence of population growth and ageing, the absolute burden of HF has been increasing, with incidence rates similar to the four most common causes of cancer combined. 3 These observations reinforce the need for fuller implementation of existing strategies for HF prevention and further investigation of risk factors. Several statistical models have been developed to predict the risk of incident HF; however, the predictive performance of these models has been largely unsatisfactory. 1 The growing availability of comprehensive clinical datasets, such as linked electronic health records (EHR) with extensive clinical information from a large number of individuals, together with advances in machine learning, offers new opportunities for developing more robust risk-prediction models than conventional statistical approaches. 4,5 Such data-driven approaches can also potentially discover new associations that are less dependent on expert knowledge. However, empirical evidence of robust prediction of complex chronic conditions, such as HF, has been limited. Prominent deep learning architectures have shown only modest performance in large-scale, complex EHR datasets. 6 Furthermore, owing to their high level of abstraction, these deep learning models have typically had poor explainability, which limits their trustworthiness, their contribution to risk factor discovery, and their wider application in clinical settings. 7 In this study, we aimed to develop and validate a model for predicting incident HF, leveraging a state-of-the-art, deep sequential architecture applied to temporal and multi-modal EHR.

Dataset
We used the UK Clinical Practice Research Datalink (CPRD), one of the largest deidentified longitudinal population-based EHR databases, nationally representative in terms of age, sex, and ethnicity. 8 This was used with approval from the Independent Scientific Advisory Committee (ISAC protocol: 17_224R2). It contains primary care data from general practices (GP) since 1985, and links to secondary care and other health and area-based administrative databases 4,8 (e.g. Hospital Episode Statistics 9 ).
We selected GP records that met certain quality standards for research (as assessed by CPRD) and those with full data linkage to secondary care, to retain a complete patient history between 1985 and 2015. Male and female patients were included with records from 16 years of age. Afterwards, a two-step process was applied to select a cohort for general representation pre-training and a sub-cohort for incident HF prediction (Figure 1). Here, incident HF is defined as the first HF diagnosis code for each patient during the study period.
Firstly, we included patients with at least five visits to ensure sufficient contextual information for representation learning. This cohort included 1,609,024 patients in total and is referred to as dataset A. Secondly, to develop models for predicting incident HF, we selected a subset of dataset A with richer medical information. More specifically, we kept patients with i) at least 10 visits to their GP or hospital, ii) at least three years of records, and iii) at least 10 unique recorded codes. For patients with at least one diagnosis of HF, information up to six months before the first record of HF was used for learning; this process ensures there are no HF diagnoses in the learning period. For each patient without HF, we randomly selected a time stamp and considered all prior records as the learning period. This led to the selection of a cohort of 100,071 patients, with 13,050 (13%) cases of incident HF, henceforth referred to as dataset B.

Model architecture
We used the BEHRT model architecture reported by Li et al, 10 inspired by Transformer 11 models, and extended it to meet the study objectives. In brief, the model captures disease and medication associations within their temporal context (Figure 1E) to bolster predictive performance. BEHRT works robustly with large-scale, sequential data and outperforms other classical machine and deep learning models on subsequent-visit prediction tasks (see appendix section 1 for more information). 10 We included encounters (diseases and medications), age, and calendar year as input information. Each of the three modalities is represented by a trainable embedding matrix, 10 a two-dimensional matrix with each instance as a vector (Figure 1C). Each encounter embedding is summed with its respective age and calendar year embeddings to form a single predictor in the model.
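The three-way embedding summation can be sketched as follows. This is a minimal NumPy stand-in for trainable PyTorch embedding tables; apart from the hidden size of 120 used in the paper, all table sizes and indices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 120  # hidden size reported in the paper

# One embedding table per modality (NumPy arrays stand in for trainable
# nn.Embedding layers; row counts are assumptions for illustration).
code_emb = rng.normal(size=(726, hidden))   # 299 diagnoses + 426 medications + "UNKNOWN"
age_emb = rng.normal(size=(1320, hidden))   # age in months
year_emb = rng.normal(size=(31, hidden))    # calendar years 1985-2015

def input_representation(codes, ages, years):
    """Sum the three embedding lookups so that each encounter is
    contextualized by the age and calendar year at which it was recorded."""
    return code_emb[codes] + age_emb[ages] + year_emb[years]

seq = input_representation([1, 2], [840, 841], [25, 25])
print(seq.shape)  # (2, 120)
```

Each row of `seq` is the single summed predictor vector for one encounter, as in Figure 1C.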

Statistical analysis
To assess and validate model performance for HF prediction, we selected 60%, 20%, and 20% of the patients in dataset B as training, testing, and external held-out validation cohorts, respectively. Five-fold cross-validation was applied for internal validation. Model performance was assessed using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), with 95% confidence intervals (CI) over the five folds.
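As a minimal illustration of the evaluation metrics, assuming scikit-learn (the predictions and per-fold scores below are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def summarize_folds(fold_scores):
    """Mean and normal-approximation 95% CI across cross-validation folds."""
    s = np.asarray(fold_scores, dtype=float)
    half = 1.96 * s.std(ddof=1) / np.sqrt(len(s))
    return s.mean(), (s.mean() - half, s.mean() + half)

# Hypothetical labels and predicted probabilities for one fold.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.8, 0.9, 0.2, 0.7]
auroc = roc_auc_score(y_true, y_prob)            # area under ROC curve
auprc = average_precision_score(y_true, y_prob)  # area under PR curve
print(auroc, auprc)  # 1.0 1.0 (all positives ranked above all negatives)
```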
We conducted an ablation study to assess the importance of different data modalities (diagnoses (D), medications (M), age (A), and calendar year (Y)) by alternately removing each of them. More specifically, we devised six experiments with the following combinations of modalities as input into the models: D, DA, DAY, DM, DMA, and DMAY, with letters corresponding to modalities; a visualized analysis for the time-related modalities (i.e. age and year) is included in appendix section 2. We also replicated a state-of-the-art deep learning model, RETAINEX, 12 which has outperformed other models such as RETAIN, logistic regression, Deepr, Deep Patient, and eNRBM, 12 on our dataset and compared it directly with our model.
Next, we aimed to develop ways of quantifying encounter contributions to the prediction of incident HF, as a way of making models explainable. For this, we used a perturbation technique inspired by the work of Guan et al 13 on the summed predictor embeddings representing each disease or medication, as seen in Figure 1H. The fundamental concept is to measure the change in predictive probability after perturbing the input, which indicates the contribution of predictors. If large perturbations of a predictor minimally change the outcome probability, the predictor is unimportant for prediction. However, if a minimal perturbation greatly changes the outcome probability, the predictor is highly important. In this work, we proposed an asymmetric loss function to prioritize encounters that enhanced HF/non-HF predictions. More details can be found in appendix section 3.
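The intuition can be sketched with a simple finite-difference analogue; the actual method learns the perturbations via a loss function (appendix section 3), and the toy linear risk model below is purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perturbation_contribution(predict, x, eps=1e-3):
    """Perturb each predictor embedding in turn and measure the change in
    outcome probability: predictors whose small perturbation changes the
    probability most are most important for the prediction."""
    base = predict(x)
    scores = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        x_pert = x.copy()
        x_pert[i] += eps  # small perturbation of one predictor
        scores[i] = abs(predict(x_pert) - base) / eps
    return scores

# Toy linear "risk model" over three 1-D predictor embeddings.
w = np.array([2.0, 0.1, -1.0])
predict = lambda x: sigmoid(w @ x)
scores = perturbation_contribution(predict, np.array([0.5, 0.5, 0.5]))
print(scores)  # largest for the predictor with the largest |weight|
```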
In the case of repeat diagnoses/prescriptions for the same patient, we only considered the contribution of the first recorded instance. We calculated the average contribution of a particular disease/medication across all HF and non-HF patients (Figure 1H). We analyzed predictors whose respective disease codes occurred in at least 5% of patients to ensure sufficient training of the input variables. Additionally, we considered established risk factors from statistical models for HF prediction even if their prevalence was less than 5%. 14,15 We calculated the relative contribution (RC) with 95% CI 16 of a particular disease/medication by dividing the average contribution in HF patients by the average contribution in non-HF patients, with values greater than 1.0 and less than 1.0 implying that the disease/medication is positively associated with HF and non-HF, respectively.
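The RC computation reduces to a ratio of group means; a minimal sketch with hypothetical per-patient contribution values:

```python
import numpy as np

def relative_contribution(contrib_hf, contrib_non_hf):
    """RC = mean contribution among HF patients / mean among non-HF patients.
    RC > 1 implies a positive association with HF; RC < 1 with non-HF."""
    return np.mean(contrib_hf) / np.mean(contrib_non_hf)

# Hypothetical contributions of one predictor in the two patient groups.
rc = relative_contribution([0.4, 0.5, 0.6], [0.2, 0.25, 0.3])
print(rc)  # 2.0: the predictor is positively associated with HF
```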

Dataset preparation
Our dataset included diagnostic codes (299 codes in Caliber 17 ) and medications (426 codes) at each encounter, as well as patient age in months and calendar year. Both disease and medication codes with unknown mapping were mapped to an "UNKNOWN" category. Of the 100,071 patients for incident HF prediction (dataset B), the median age at baseline was 70 years, 52% were men, 65% had a history of hypertension, 9% a prior myocardial infarction, and 5.1% an ischemic stroke. The median duration of follow-up (date of the first record to baseline) was 9 years. More details for the training, test, and external validation sets, as well as the code phenotyping and, specifically, heart failure phenotyping processes, can be found in Table S1, Figure S1, and appendix section 5.

Model performance
We implemented BEHRT to predict incident HF. 10 Our model, with inclusion of all four modalities (DMAY), showed the best performance for incident HF prediction, with an AUROC of 0.93 and AUPRC of 0.70 in the external validation cohort (Figure 2). A comparison of our model with RETAINEX 12 (in the full cohort) showed a noticeable improvement in predictive ability: 3%/9% absolute improvement in AUROC/AUPRC (Figure 2). Figure 2 also shows that the model with calendar year as the contextualizing variable demonstrates substantial improvements in AUROC/AUPRC over the one with age, indicating that calendar year is more informative for contextualizing predictors than chronological age (visualized analysis in appendix section 6). Lastly, after evaluating BEHRT (DMAY) on the combined test and validation sets of the external validation dataset, we found the best-performing quintiles of predictive probability were the first and fifth, with quintile-AUROC of 0.84 and 0.71, respectively. Based on model predictive performance, we formed dataset C for contribution analysis with patients falling into the first and fifth AUROC quintiles (Figure S3). Tables S2 and S3 show RC for diseases and medications, respectively. We found that diseases such as bacterial diseases, lower respiratory tract infections, myocardial infarction, and pleural effusion, and medications such as "corticosteroid/antibacterial drugs", "bronchodilators", and "acne and rosacea drugs", were all positively associated with HF.
Furthermore, our age-stratified RC analysis for the ten most important (i.e. highest-contribution) diagnoses (Figure 3A) and medications (Figure 3B) showed that contributions were generally higher for those aged 50-60 years and lower at older ages, implying little individual contribution of the risk factors to HF at older ages (consistent with evidence from epidemiological studies). For some predictors (e.g. left bundle branch block), however, CIs were too wide to allow firm conclusions about any differential RC by age (Tables S4 and S5).
Additionally, many of the medications that showed a high RC to HF (Figure 3B) are treatments for the diagnoses in Figure 3A. This implies the model identified diagnoses and treatments that were at least contemporaneous and often causally associated. For example, dermatitis may be treated with corticosteroids, which may be linked to cardiovascular risk, 18 as may depression, and therefore its treatments. 19 "Cough preparations", "bronchodilators", and "antibacterial drugs" are often linked to lower respiratory tract infection, asthma, and chronic obstructive lung disease. 20,21 This could signal delayed diagnosis of HF due to misattribution of HF symptoms to respiratory diseases, 21 or direct effects of drugs such as non-steroidal anti-inflammatory drugs (NSAIDs) might be at least in part responsible for causing HF. 22 Moreover, we captured RC for established HF risk factors (Table S6). Ischemic stroke, myocardial infarction, diabetes, hypertension, and atrial fibrillation/flutter all had RC >1. The age-stratified RC for these diseases (Figure 4/Table S7) showed a similar pattern to the top-ten diagnoses and medications analysis, again implying limited discriminatory contribution of these predictors individually in older people. We further explored factors that were strongly associated with non-HF (lowest RC values, RC <1). To illustrate, Figure 5A and Table S5 show that treatments for established risk factors for HF, such as hypertension, diabetes, and atrial fibrillation, including digoxin, were associated with a lower risk of HF.
To disentangle the relationships between disease and medication pairs, we repeated the analyses only for patients who were not treated for a particular established risk factor. For instance, hypertension is shown to have RC <1 at ages 70-80 years in Figure 4 (suggesting that people diagnosed with hypertension at an older age are at lower risk of HF, whereas the opposite is found in younger people). Upon further investigation, we see that 74.5% of patients with hypertension were treated with "hypertension drugs," which had the lowest RC values (Table S5). Repeating the analysis in people who were not treated for hypertension, we indeed find no evidence of a 'protective' effect of hypertension in older age groups; rather, the general trend of RC converging to 1.0 at older ages is preserved: 1.06 (95% CI 0.82, 1.37) and 0.98 (95% CI 0.78, 1.23) for the 70-75 and 75-80 age groups, respectively.
Lastly, we stratified by calendar year groups to investigate historical trends in treatments (Figure 6A). In Figure 6B, between 1990 and 2010, we show the number of times a medication was first prescribed to patients from dataset A (not counting repeat prescriptions).
Throughout the 1990s, timolol, a beta blocker, was a common topical treatment for glaucoma, 24 but with known cardiovascular side effects such as bradycardia, with the potential to exacerbate HF. 25 With the introduction of new medications in the 2000s, the use of ophthalmic timolol started to decline 26 (Figure 6B). BEHRT captures this change over time in Figure 6A: treatments for glaucoma prior to 2000 had a high RC to HF, while forms of treatment after 2000 had little contribution. Specifically, following 2000, BEHRT identifies that treatment for glaucoma, namely prostaglandin analogues, has RC <1, indicating potentially protective effects on HF incidence. Prostaglandins and analogues such as prostaglandin I2 27,28 and others 29 have vasodilating properties with the potential to reduce cardiovascular risk, although large-scale randomized trials investigating the preventative effects of this vasodilator are currently lacking. 27,28 Also, we note that prescription of digoxin wanes after 2005; however, the RC for this positive inotropic drug remains stable in every year stratum, further supporting the hypothesis that digoxin could play a role in the prevention of HF. Discussion of calendar-year temporal RC trends of analgesics is provided in appendix section 7.

We investigated the contribution and importance of various modalities and found that diseases and medications were strong predictors. Also, compared with age, calendar year improved patient representation substantially. We further developed an explainable framework for discovering factors contributing to the risk of HF. This confirmed the relative importance of several established risk factors 15 and provided insights into medications that might contribute negatively or positively to the HF prediction.
Our work offers several novel discoveries and methodological improvements. Our expansion of perturbation-based techniques 13 provides a method for making deep learning models explainable, with findings supporting the importance of context in prediction. The RC method identified several disease-medication pairs, with one or both members of the pair being potential risk factors for cardiovascular diseases and, in some cases, specifically HF.
Also, we saw that our findings were broadly consistent with prior knowledge of HF and corroborated disease and medication risk factors of the disease.
With this approach, we included many potentially predictive variables not previously considered in epidemiological studies. In addition, although age is usually incorporated as a predictor in risk models, 6,12 our analysis found that incorporating calendar year provided additional and stronger information for accurate prediction of incident HF. A potential explanation for this observation was provided by our perturbation analysis stratified by year.
This showed that occurrences of medications in different years made quantitatively different contributions to disease prediction, which would be missed if temporal context were not included in the models. Changes in such predictors over time, or more subtle changes in disease patterns, for instance due to advances in technologies leading to more accurate and frequent diagnosis, are well known to clinicians; BEHRT enables the incorporation of such information for better prediction. With regard to disease-medication contexts, a cursory analysis might lead to the false conclusion that, for instance, treatments for hypertension increase the risk of HF. However, this conclusion is biased by indication; the correct interpretation is that the medication serves as a proxy for hypertension, which has a strong effect on HF incidence. Overall, the RC analysis illuminates potentially protective medications that warrant further analysis in future studies.
Additionally, through age- and year-stratified analyses, BEHRT identifies medications with potentially preventative effects on HF. In the case of digoxin and prostaglandin analogues, the stable RC <1 in both the age and year stratifications signals potentially preventative effects of these drugs. However, as with standard statistical models, a causal interpretation should be made with great caution. Rather, our method provides a way of making models explainable and generates hypotheses which, depending on the totality of evidence from this work and other sources, should provide the impetus for additional confirmatory studies.

Study Limitations
Our study has some additional limitations. The phenotyping method for diagnoses maps codes to 299 disease categories, 30 losing information in the original granularity of the disease encoding and potentially reflecting an expert's preferences. We developed a superior model for prediction of incident HF using routine EHR data, providing a promising avenue for research into the prediction of other complex conditions.
Incorporating BEHRT into routine EHR systems could alert clinicians to those at risk, enabling more targeted preventive care or recruitment into clinical trials. In addition, we highlight a data-driven approach for the identification of potential risk factors that generates new hypotheses requiring causal exploration. We note that there are several medications that contribute negatively to the HF prediction. Not only are many of them used to treat established risk factors of HF, but others have not been tested for such an indication and might provide a starting point for drug repurposing studies. The model and analysis could be applied to more deeply phenotyped populations for the discovery of new disease mechanisms and patterns in other complex conditions.

Supplementary Material 1. Details for BEHRT model
In the NLP literature and BERT, 1 words in sentences are considered "tokens" and sentences are separated from one another by a separation element. Similarly, we conceptualized medical events such as diagnoses and medications in a doctor/hospital visit as encounters (or tokens) and separated visits by a separation element ("SEP"). As in the original BERT model, we implemented a positional annotation ordering the sequential medical history data. Furthermore, we added layers of information encoding the age and calendar year of each encounter.
Thus, the total input comprises three layers of information for each encounter: the encounter itself (diagnoses and/or medications), age, and calendar year.

Visualised importance analysis for age and year
To further validate and visualise the contribution of the two time-related modalities (i.e. age and year), we measured similarity (cosine similarity 2 ) across the summation of embeddings for encounters and calendar years/ages, to observe whether there is any significant difference in their representations. The larger the distance, the higher the dissimilarity, and higher dissimilarity can directly imply the importance of a modality.
Embedding here means a matrix trained to represent each element in a modality, such as hypertension in diagnoses or 2011 in calendar year. We chose four diseases with high occurrence in the dataset to ensure these embeddings were well-trained. Calendar year showed substantial dissimilarity across years (from 0 to 0.8). By contrast, the dissimilarity arising from variation in age in months was less pronounced (from 0 to 0.3). In other words, the representation of diseases (or predictors) among individuals was more sensitive to variation in the year in which they were recorded than to the age of the patient. This suggests that calendar year, or a 'birth cohort effect', was more informative for incident HF prediction than chronological age, as it captures more of the contextual, time-sensitive information necessary for HF prediction.
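A minimal sketch of the similarity measurement; the embedding vectors below are random stand-ins for trained embeddings, and the small dimensionality is for illustration only.

```python
import numpy as np

def cosine_dissimilarity(u, v):
    """1 - cosine similarity between two summed embedding vectors:
    larger values mean the two representations differ more."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical: one disease embedding contextualized by two calendar years.
rng = np.random.default_rng(1)
disease = rng.normal(size=8)
year_a, year_b = rng.normal(size=8), rng.normal(size=8)
d = cosine_dissimilarity(disease + year_a, disease + year_b)
print(round(d, 3))  # dissimilarity between the two contextualized vectors
```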

Details of perturbation methods
In addition to Guan et al.'s method, 3 we developed an asymmetric loss function to prioritize learned perturbations alongside the information entropy-based loss term. The equations below give the full description of the loss function computed over one patient's medical history. To describe Eq. (1) and (2) in words: for heart failure patients, we prioritize perturbations that increase the outcome probability over those that decrease it, and we prioritize the opposite for non-heart failure patients.
To do so, we penalize with the alpha(y, x', s) constant. Asymmetric losses are often used in scenarios where an error in one direction (say, positive) is more costly than an error in the opposite direction. 4,5 This perturbation method delivers a learned ϵ = [ϵ1, ϵ2, …, ϵn], with one ϵi per predictor i: the trained, allowable variance for the predictor, with the maximum variance defined by a user-defined hyperparameter (set to 0.5 in our work). To assess the contribution of a predictor, we transform ϵi to 0.5−ϵi to reflect the inverse relationship: the lower the ϵi, the higher the contribution to heart failure prediction, and vice versa. As seen in Figure 1H, we first establish the patient-level contribution, i.e. 0.5−ϵi for a particular encounter (disease/medication) in time.
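One plausible reading of the asymmetric weighting can be sketched as follows; the alpha value and the exact functional form are illustrative assumptions, not the paper's formulation.

```python
def asymmetric_weight(y, p_orig, p_pert, alpha=2.0):
    """Sketch of an asymmetric penalty: for HF patients (y=1),
    perturbations that increase the outcome probability are weighted
    more heavily than those that decrease it; for non-HF patients (y=0),
    the opposite. (alpha=2.0 is a hypothetical choice.)"""
    increased = p_pert > p_orig
    favoured = (y == 1 and increased) or (y == 0 and not increased)
    return alpha if favoured else 1.0

def asymmetric_mse(y, p_orig, p_pert, alpha=2.0):
    # Weighted squared change in outcome probability under perturbation.
    return asymmetric_weight(y, p_orig, p_pert, alpha) * (p_pert - p_orig) ** 2

print(round(asymmetric_mse(1, 0.6, 0.8), 2))  # 0.08 (favoured direction, weighted)
print(round(asymmetric_mse(1, 0.6, 0.4), 2))  # 0.04 (unfavoured direction)
```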

Discussion of Dataset C
The encounter contribution analysis was conducted on a subset of patients from the external testing and validation datasets on which BEHRT (data modalities: DMAY) performed best, named dataset C. Since BEHRT has learnt the latent space most effectively for these patients, the resulting contribution analysis on their medical histories is more reliable. We avoided selecting incorrectly predicted patients for this analysis; even if the perturbation method assigns theoretically correct derivations of predictor contributions to the output probability, misclassification of the outcome would invalidate the population-wise and age-stratified analyses described. Thus, data from these best-performing groups were selected for contribution analysis.

Details of disease and medication phenotyping
In our study, we used all diagnostic codes and medications at each encounter, as well as patient age in months and calendar year. Encounters represent each individual's time-stamped recording of a diagnosis or medication.
For medication specifically, we included all available prescription records as new encounters in the dataset.
Because CPRD records medications in six-week increments, medications make up the largest number of encounters.
In the UK, primary care and secondary care use different coding systems for diagnosis (i.e. Read Code 6 in GP and the 10th revision of the International Statistical Classification of Diseases and Related Health Problems [ICD-10] Code 7 in Hospital Episode Statistics). We mapped these codes using a dictionary provided by CPRD. This led to 56,624 unique codes, which were then mapped to 299 clinically meaningful disease categories using Caliber, a previously published and clinically validated phenotyping method. 8 Medication codes were classified using the British National Formulary hierarchical coding format. 9 We used codes at the section level prevalent in the population, leading to 426 unique medication group codes. Both disease and medication codes with unknown mapping were mapped to an "UNKNOWN" category. The "heart failure" phenotype is defined by Caliber as a collection of Read and ICD-10 codes. We considered only strict diagnosis codes (as opposed to codes of historical diagnoses). The codes can be found at https://www.caliberresearch.org/portal/phenotypes/heartfailure, using incident codes from primary (Read) and secondary care (ICD-10).

Details of model training
The model was implemented using PyTorch. 10 We applied Bayesian optimization 11 for model hyperparameter tuning over the number of layers, hidden size, and intermediate size. After 20 iterations of searching the parameter space, we chose the optimal hyperparameters for the model: number of layers, 4; hidden size, 120; number of attention heads, 6; and intermediate size, 108. We pre-trained BEHRT's weights with the Masked Language Modelling 1 pre-training task using dataset A: we randomly masked some encounters in the medical history of each patient and predicted the masked encounters. This task is unsupervised and is undertaken to let the model gain a general understanding of the predictors and their temporal relationships in the longitudinal data. After pre-training, we applied the model to the heart failure prediction task on dataset B.
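The masking procedure can be sketched as follows; the 15% masking rate is an assumption borrowed from the original BERT rather than a detail stated here, and the encounter codes are hypothetical.

```python
import random

MASK = "MASK"
rng = random.Random(1)  # fixed seed for reproducibility

def mask_encounters(history, mask_prob=0.15):
    """Sketch of masked language model pretraining on encounter sequences:
    randomly hide encounters (never "SEP" separators) and keep labels so
    the model can be trained to recover the hidden encounters."""
    inputs, labels = [], []
    for code in history:
        if code != "SEP" and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(code)   # the model must predict this encounter
        else:
            inputs.append(code)
            labels.append(None)   # no loss on unmasked positions
    return inputs, labels

# Hypothetical patient history: diagnoses (D#), medications (M#), visit separators.
inp, lab = mask_encounters(["D1", "M3", "SEP", "D7", "M2", "SEP"] * 5)
print(inp.count(MASK) == sum(l is not None for l in lab))  # True
```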

Temporal trends of analgesics
For analgesics, we see a similar declining trend as calendar years increase. In Figure 6B, we first show that prescriptions of analgesics were more often non-opioid than opioid prior to 1996. BEHRT parallels this generational shift in drug component prescription through RC: non-opioid analgesics are primarily composed of NSAIDs, which generally increase the risk of cardiovascular events. 12,13 Thus, RC prior to 1995 is shown to be quite high. Tracing the gradual shift in majority preference towards opioid-based analgesics after 1995, RC captures this change in prescription behaviour and, as a result, attenuates over the following decades.

Figure 2. Model performance evaluation and temporal modality analysis. A and B represent 5-fold internal validation; C and D represent external validation. A and C present receiver operating characteristic (ROC) curves, and B and D present precision-recall curves (PRC). D (diagnosis), M (medication), A (age), and Y (year) represent the modalities used for model training and evaluation.

Figure 3. Age-stratified relative contribution analyses for diseases and medications identified by the model. A and B represent the top 10 relative contributions of diseases and medications to heart failure prediction, respectively. The x and y axes represent age groups in years and relative contribution (mean and shaded 95% confidence interval), respectively. A relative contribution of 1 implies equal contribution to heart failure and non-heart failure predictions. The figures show that the relative contribution of risk factors to heart failure prediction tends to attenuate at older ages. The strongest contributors in the medication category tend to have a corresponding disease, supporting the relatedness of the medication-disease pairs. The black dotted line denotes a relative contribution of 1.0.

Figure 4. Age-stratified relative contribution analyses for established risk factors. The x and y axes represent age groups in years and relative contribution. A relative contribution greater than 1 implies more contribution to the heart failure prediction, while less than 1 implies more contribution to the non-heart failure prediction. Each relative contribution is presented as a mean with 95% confidence interval. The black dotted line denotes a relative contribution of 1.0.

Figure 5. Lowest age-stratified relative contributions for medications, and contextuality of medications and diagnoses. A, bottom 10 relative contributions of medications to heart failure prediction; the x and y axes represent age groups in years and relative contribution (mean and 95% confidence interval), and the black dotted line denotes a relative contribution of 1.0. B, relative contribution (mean; 95% confidence interval) for established risk factors stratified by prescription status. The first row (baseline) shows the relative contribution of all patients with the disease denoted by the column (left Venn circle shaded). Subsequent rows describe the relative contribution calculated on the subpopulation of patients who have not taken the drug denoted in a particular row (left Venn subsection of circle shaded). The red color denotes an increase in relative contribution with respect to the baseline disease relative contribution. "-" denotes no measurement due to insufficient size of those subgroups (too few patients with diabetes not receiving hypoglycemic therapy).
We show in Figure 5B that in untreated subgroups, there is a general increase in the RC for each disease (20 of 25 cases). This change in RC in people with or without treatment might also explain some of the unexpected patterns observed: although the overall and age-specific patterns of RC scores of established HF risk factors roughly concur with pre-existing epidemiological evidence, hypertension is shown to have RC <1 at ages 70-80 years in Figure 4.

Figure 6. Year-stratified relative contribution of medications. A, relative contributions (mean and shaded 95% confidence interval) of three medications to heart failure prediction, stratified by year; the x and y axes represent year group and relative contribution, respectively, and a line denotes a relative contribution of 1.0. B, frequency of drugs by component in different year groups in dataset A; the x and y axes represent the year group and counts of first-time prescriptions of each drug component, respectively. Individual drug components are represented by bars in different colors.
Also, the current work used limited information available within the EHR and validated model prediction performance on a single dataset, CPRD. Future studies could explore whether other records, such as measurements, blood tests, and other demographic information (e.g. ethnicity, sex), would improve model accuracy and explainability, as well as model transferability to other datasets. Additionally, during cohort selection we kept patients with sufficient records to make robust predictions; this can potentially compromise the model's generalizability for prediction in low-risk groups who have fewer clinical encounters.

Figure 1. Central Illustration. (A, D) Dataset A is used for general contextual unsupervised pre-training. (B, E) With additional cohort selection criteria, we further design dataset B for incident heart failure prediction. (C) Predictors are represented as summed embeddings. Codes not phenotyped are shown as "UNK"; "D#" and "M#" represent hypothetical diagnoses and medications, respectively; "SEP" represents visit separation. (F) We assessed the utility of modalities for predictive performance. (G) We form dataset C by selecting patients who are predicted most accurately (AUROC bins) for relative contribution analysis. (H) We derive population-based relative contributions by aggregating the individual-level predictor contributions.

Fig S2A and Fig S2B show two cosine similarity matrices for each pair of instances in the embeddings of calendar year and age, respectively.

Notation for Eq. (1) and (2): the perturbed input encounter embeddings (versus the original embeddings); the number of encounters in the patient's medical history; the output state of the original input (without perturbation); the output state of the perturbed input; the weight hyperparameters (if they are equal, the loss function is symmetric); the heart failure label; the mean squared error term described in Guan et al; 3 the asymmetric weight function; and the information entropy-based loss term described in Guan et al. 3
