A Transformer-Based Model Trained on Large Scale Claims Data for Prediction of Severe COVID-19 Disease Progression

In situations like the COVID-19 pandemic, healthcare systems are under enormous pressure as they can rapidly collapse under the burden of the crisis. Machine learning (ML) based risk models could lift the burden by identifying patients with a high risk of severe disease progression. Electronic Health Records (EHRs) provide crucial sources of information to develop these models because they rely on routinely collected healthcare data. However, EHR data is challenging for training ML models because it contains irregularly timestamped diagnosis, prescription, and procedure codes. For such data, transformer-based models are promising. We extended the previously published Med-BERT model by including age, sex, medications, quantitative clinical measures, and state information. After pre-training on approximately 988 million EHRs from 3.5 million patients, we developed models to predict Acute Respiratory Manifestations (ARM) risk using the medical history of 80,211 COVID-19 patients. Compared to Random Forests, XGBoost, and RETAIN, our transformer-based models more accurately forecast the risk of developing ARM after COVID-19 infection. We used Integrated Gradients and Bayesian networks to understand the link between the essential features of our model. Finally, we evaluated adapting our model to Austrian in-patient data. Our study highlights the promise of predictive transformer-based models for precision medicine.

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2) that arose in December 2019. Since its emergence, 628 million people have been infected, and 6.58 million have died (https://coronavirus.jhu.edu/map.html, accessed 25.10.2022). In such pandemic circumstances, healthcare systems face a tremendous challenge as they can quickly collapse under the burden of this unprecedented crisis. Despite countermeasures such as testing, lockdowns, and vaccinations, the pandemic temporarily put immense stress on global healthcare systems. Decision support systems such as patient-level risk models can assist with the critical task of quickly and efficiently identifying high-risk patients so that the existing resources are best distributed and vulnerable patient subgroups are effectively protected.
Structured Electronic Health Records (EHRs) offer great opportunities for the efficient development of such risk models as they are routinely collected in many healthcare systems in large quantities. They contain data on diagnoses, prescriptions, procedures, and quantitative clinical measurements, such as vital values from bedside monitoring. Additionally, demographic data such as age, gender, and geographical region may be included. Models trained on such data could be used to better understand risk factors, such as comorbidities and medications, in addition to predicting a patient's risk of severe disease development. However, these data present significant challenges due to their high dimensionality, heterogeneity, temporal dependence, sparsity, and irregularity, making them difficult to fully exploit [1].
Furthermore, the coding of diagnoses is frequently biased for economic reasons. Since there is no unique mapping of a physician's diagnosis to a coding scheme such as ICD, there is a tendency to select the code that delivers the greatest economic benefit from among several possible codes. Concerning medications, it is noteworthy that categorization often occurs at the product level as opposed to the chemical substance level and that several medications may contain the same chemical substance.

TABLE I COMPARISON OF EXMED-BERT WITH OTHER MODELS
In the past, many ML approaches have been taken to work with structured EHR data. Simpler methods often restricted the use of time information and worked only with a one-hot encoding (OHE) of diagnoses and prescriptions, which allowed the application of standard ML techniques, such as logistic regression, random forest (RF), XGBoost (XGB), and Bayesian methods [2]. Recently, more studies have focused on the use of time-series information. Methods for such an approach include autoencoders, convolutional neural networks [3], and sequential models like recurrent neural networks (RNN) [4] or transformer-based models [5], [6], [7], [8], [9]. Transformer-based models originate from natural language processing (NLP) and have recently gained much attention since they have achieved excellent results in many areas [10], [11], [12], [13]. A principal advantage of transformer models is the ability to train them in a parallel fashion and to weigh different parts of a time series differently due to their inbuilt attention mechanism. Transformer-based models typically undergo two-stage training: pre-training for generic representation learning and transfer learning (fine-tuning) for application-specific prediction. This approach enables sharing pre-trained models, often based on large datasets like the entire Wikipedia or protein sequences, with a broader community. These models can then be fine-tuned for various unforeseen tasks, highlighting the transformer-based approach's versatility and strength.
Variants of the Bidirectional Encoder Representations from Transformers (BERT) [14] model have recently been applied to structured EHR data. For instance, Shang et al. developed a graph-augmented transformer model named G-BERT to encode the medical history of single medical appointments and used the generated embeddings for a medication recommendation task [9]. Later, Li et al. developed BERT for EHR (BEHRT), which generated a patient embedding based on the history of diagnoses and used it for disease prediction in different time windows [5]. Since BEHRT, like most transformer-based models, is limited with respect to the maximum sequence length, the authors later developed a hierarchical BEHRT variant (HI-BEHRT), which can process longer medical histories [6]. Meng et al. presented another model in 2021, the Bidirectional Representation Learning model with a Transformer architecture on Multimodal EHR (BRLTM), which employed a strategy similar to BEHRT but incorporated a larger vocabulary, including diagnoses, medications, and procedures [8]. Med-BERT, another transformer-based model for structured EHR data, is closely related to BRLTM, but it features an even larger vocabulary and slightly different training objectives [7]. A comprehensive comparison of these models is provided in Table I. Unfortunately, none of the above-mentioned models is publicly available in a pre-trained form, and thus none is usable by the broader community.
Our contribution is an extension of the Med-BERT approach by including information about prescribed medications and demographic information such as state of residence, gender, and age as well as quantitative clinical measurements. We pre-trained our model, named ExMed-BERT, on 987,846,612 EHRs collected between 2010 and 2021, stemming from 3.5 million US patients in the IBM Explorys Therapeutic dataset. As a showcase, we subsequently used data from 80,211 COVID-19 patients to develop ML models for predicting the risk of acute respiratory manifestation (ARM) within three weeks after a confirmed COVID-19 diagnosis. This time frame was chosen because, on the one hand, a COVID-19 infection typically lasts 10 to 14 days. On the other hand, the timestamp of the COVID-19 diagnosis provided in the data may only be accurate up to a weekly resolution. The aim was thus to capture a serious event that could be time-wise related to the previously reported infection.
We compared our ExMed-BERT models with three baseline models, which included the RNN-based RETAIN model [15] as well as two models (RF [16] and XGBoost [17]) that ignore time information. We then used explainable AI methods to gain insights into the underlying mechanisms of our models. A specific contribution is the use of Bayesian networks (BNs) to disentangle the relationship between the most predictive features. Finally, we explored how our ExMed-BERT models could be adapted to external data from an Austrian hospital group (KAGes) via transfer learning strategies. In contrast to previous work, we make our ExMed-BERT model available to the scientific community.

A. General Overview
The work in this article consists of four phases (Fig. 1):
1) Pre-training of a transformer-based model for structured EHR data: Initially, we prepared a dataset of large-scale claims data and pre-trained a transformer-based model called ExMed-BERT for structured EHR data.
2) Development of risk models for COVID-19 disease progression: Subsequently, we used our newly trained model to develop risk models for predicting severe COVID-19 disease progression, namely ARM, and compared their performances with RF, XGB, and RETAIN models.
3) Interpretation of developed risk models: Then, we used the Integrated Gradients approach in conjunction with Bayesian Networks to offer detailed explanations for model predictions.
4) Evaluation of the adaptation of our models to data obtained from an Austrian hospital group within a transfer learning approach.
In the following, we describe our approach in more detail.

B. Data Preprocessing

1) Preparation of Data for Modeling:
This study used the IBM Explorys Therapeutic dataset (https://www.ibm.com/products/explorys-ehr-data-analysis-tools), which comprises EHRs and insurance claims from 4.5 million patients from all over the USA from 2010 until mid-2021. Records consist of prescribed drugs, diagnoses, performed procedures, and a few quantitative clinical measures (e.g., blood pressure). We focused on demographic data, drugs, diagnoses, and the available quantitative clinical measures. We excluded patients with fewer than five observations. This led to a reduced dataset of 3.5 million patients with 987,846,612 recorded diagnoses and drugs, which we used for pre-training a transformer model (details described later). The intent behind the pre-training of a transformer model is to learn a suitable vector representation of timestamped structured EHRs, irrespective of any later clinical use case. The fit of the model to a dedicated clinical endpoint is then performed within a subsequent fine-tuning/transfer learning step, for which we selected only patients with a confirmed COVID-19 diagnosis, defined by the use of the International Classification of Diseases (ICD10) [18] code U07.1 or a set of Logical Observation Identifier Names and Codes (LOINC) [19] codes (see Supplementary Section A) (n = 80,211). We corrected the diagnosis or observation dates of the records by subtracting seven days to get an approximation of the index date of infection. Then we focused on the ARM endpoint, which was considered met if at least one of the following diagnoses appeared within three weeks after the COVID-19 infection was reported (n = 10,743):
- Pneumonia due to coronavirus disease 2019 (J12.82)
- Acute bronchitis due to other specified organisms (J20.8)
- Unspecified acute lower respiratory infection (J22)
- Bronchitis, not specified as acute or chronic (J40)
- Acute respiratory distress syndrome (J80)
- Respiratory failure, not elsewhere classified (J96)
- Other specified respiratory disorders (J98.8)
For fine-tuning, we used one year of medical history of the COVID-positive patients prior to their infection. Patients who fulfilled these criteria for the ARM endpoint were labeled as positives. Supplementary Fig. A.1 depicts the filtering process in further detail.
To identify negatives while adjusting for the potentially confounding effects of age and gender, we used the technique of Inverse Probability of Treatment Weighting (IPTW) [20], [21], [22]. We used the Python package psmpy [23] (version 0.2.8) to calculate propensity scores (PS). Subsequently, the IPTW weights for each patient sample were calculated from these scores and used in the fine-tuning process.
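The weighting step can be sketched as follows. This is a minimal illustration rather than the psmpy-based pipeline used in the study: it assumes the common unstabilized IPTW form, in which a positive sample receives the weight 1/PS and a negative sample the weight 1/(1 − PS), and it takes the propensity scores (assumed here to be the estimated probability of belonging to the positive group given age and gender) as already computed. The function name `iptw_weights`, the clipping constant, and the toy values are hypothetical.

```python
def iptw_weights(labels, propensity_scores, eps=1e-6):
    """Unstabilized IPTW weights (assumed form, not taken from the paper):
    1/PS for positives and 1/(1 - PS) for negatives."""
    weights = []
    for label, ps in zip(labels, propensity_scores):
        ps = min(max(ps, eps), 1.0 - eps)  # clip to avoid division by zero
        weights.append(1.0 / ps if label == 1 else 1.0 / (1.0 - ps))
    return weights


# Toy example: four patients with made-up labels and propensity scores.
labels = [1, 0, 0, 1]
scores = [0.25, 0.25, 0.5, 0.8]
weights = iptw_weights(labels, scores)
```

Such per-sample weights would typically enter the training loss as case weights, so that under-represented age and gender strata count more strongly.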

2) Mapping of Drug and Diagnosis Codes:
The IBM Explorys Therapeutic dataset includes information about diagnoses encoded as ICD9 and ICD10 codes and administered or prescribed drugs as RXNorm [24] identifiers. To harmonize the two versions of ICD diagnosis codes, we mapped them to Phecodes provided by the Phenome-wide association study (PheWAS) [25]. Due to the lower number of Phecodes, the problem of a non-unique mapping between a physician's diagnosis and the ICD coding scheme is reduced. Hence, we reduced potential coding biases and the feature space from 59,709 to 1,850 codes. Similarly, we mapped the provided RXNorm identifiers (RxCUI) to the fourth level of the Anatomical Therapeutic Chemical (ATC) [26] classification system for chemical compounds and thus addressed the sparse use of some RxCUIs by reducing the feature space from 23,801 to 630 codes.
3) Input Representation for the Models: For the Random Forest (RF) and XGBoost (XGB) models, we employed a one-hot encoding approach to represent all categorical features. In this scheme, a diagnosis or drug recorded in the one-year medical history was denoted as one and zero otherwise. We applied the same encoding technique to the state of residence and sex variables. However, the data formatting requirements for the RETAIN model and our transformer-based model differ significantly from those of the RF and XGB baseline models. Owing to their design, which is tailored to handle sequential data, these models necessitate the representation of each patient's entire medical history as a sequence. As illustrated in the lower part of Fig. 2, we created separate sequences for each modality, with each element of the sequence being an integer corresponding to one vector of the embedding matrices. Quantitative clinical measures were only considered during the fine-tuning phase.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
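The difference between the two input formats can be illustrated with a small sketch. It is a simplified assumption of how such inputs might be built, not the ExMed-BERT preprocessing code: the code names (`DX_A`, `RX_A`, and so on), the dictionary layout of a visit, and both function names are hypothetical, and id 0 is assumed to be reserved for padding.

```python
def one_hot_history(history, vocabulary):
    """Return a 0/1 vector: 1 if the code occurs anywhere in the history."""
    seen = {code for visit in history for code in visit["codes"]}
    return [1 if code in seen else 0 for code in vocabulary]


def sequence_inputs(history, code_to_id):
    """Flatten visits into parallel integer sequences:
    code ids, visit index, and age in months for every recorded code."""
    code_ids, visit_ids, ages = [], [], []
    for visit_index, visit in enumerate(history, start=1):
        for code in visit["codes"]:
            code_ids.append(code_to_id[code])
            visit_ids.append(visit_index)
            ages.append(visit["age_months"])
    return code_ids, visit_ids, ages


# Hypothetical vocabulary of diagnosis (DX_*) and drug (RX_*) codes.
vocabulary = ["DX_A", "DX_B", "RX_A", "RX_B"]
code_to_id = {code: i + 1 for i, code in enumerate(vocabulary)}  # 0 = padding

# Toy medical history with two visits.
history = [
    {"age_months": 600, "codes": ["DX_A", "RX_A"]},
    {"age_months": 602, "codes": ["DX_B"]},
]

ohe = one_hot_history(history, vocabulary)
code_ids, visit_ids, ages = sequence_inputs(history, code_to_id)
```

The one-hot vector discards order and timing, whereas the parallel sequences keep one entry per recorded code together with its visit and age context.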

C. Model Training and Evaluation
In this section, we provide a comprehensive overview of the methods used in our study, detailing the pre-training and fine-tuning of various models across multiple experiments. We begin by outlining the structure and pre-training of our novel transformer-based model before describing the development of machine learning-based risk models using RF, XGBoost, RETAIN, and our new model. However, we note that recent models such as BEHRT or Med-BERT could not be included in our comparison as the pre-training data and models are not publicly available. Finally, we describe the integration of quantitative clinical measurements with the patient's medical history. As previous works have extensively described the model architectures, we refer interested readers to publications by Vaswani et al. and Devlin et al. for Transformer-based models [14], [27], Breiman for Random Forest [16], Chen and Guestrin for further details on the XGBoost approach [17], and Choi et al. for RETAIN [15].

1) Basic Model Structure and Pre-Training of ExMed-BERT:
We followed a similar strategy as the Med-BERT article and focused on an extension of the BERT embedding layer. In addition to the diagnoses that were included in the Med-BERT model, we further extended Med-BERT by adding information on prescribed drugs, the patient's sex, state of residency, and age. We denote our model as Extended Med-BERT (ExMed-BERT). As shown in Fig. 2, we used different embeddings to accommodate the five feature modalities. Diagnoses and drugs were represented in one embedding via Phecodes and ATC codes. The sex and state embeddings contained static information. The age sequence contained the patients' age encoded in months, and lastly, the visit sequence was used to distinguish between each visit in a sequence. Since the order of drugs and diagnoses within one visit was random, we omitted a serialization embedding. Similar to Med-BERT, we did not use CLS and SEP tokens in our input sequences.
Fig. 2. Overview of the model structure. Compared to BERT or other transformer-based models, we employed a multimodal embedding layer for structured EHR data comprising drug, diagnosis and visit information and information about a patient's sex, state of residence, and age. After embedding, the input is passed through 6 transformer layers before a final representation of a patient's medical history is generated with an FFN, LSTM, or GRU head. Subsequently, these patient representations were either concatenated with the quantitative clinical data or directly passed through an FFN head for classification.

We used the same hyperparameters and training objectives as Med-BERT and pre-trained the model on all available data of the 3.5 million patients in the pre-training cohort. If sequences exceeded the maximum sequence length of 512 diagnosis and drug codes, we split the sequences and processed the samples individually. We used the following joint training objectives to pre-train our model:
- Masked language modeling (MLM): This task is identical to the BERT approach, and we followed the Med-BERT strategy in masking only one of the codes at a time. In 80% of the cases, the masked code was replaced with [MASK], in 10% it was replaced with another code, and in the remaining 10% it remained unchanged. The model's task was to predict the correct code based on the information provided by the remaining sequence.
- Prediction of prolonged length of stay (PLOS) in hospital: Following Rasmy et al. [7], we also predicted whether a patient had a prolonged stay in a hospital (>7 days) throughout his or her medical history. This task requires assessing the severity of a patient's health condition throughout their medical history.
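The 80/10/10 corruption rule of the MLM objective can be sketched as below. This is an illustrative assumption of how a single code might be selected and corrupted, not the ExMed-BERT training code: the function name `mask_one_code`, the uniform choice of position and replacement, and the code names are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_one_code(sequence, vocabulary, rng):
    """Pick one position and corrupt it with the 80/10/10 rule.

    Returns (corrupted sequence, selected position, original code)."""
    position = rng.randrange(len(sequence))
    target = sequence[position]
    corrupted = list(sequence)
    draw = rng.random()
    if draw < 0.8:
        # 80%: replace the selected code with the mask token
        corrupted[position] = MASK_TOKEN
    elif draw < 0.9:
        # 10%: replace it with a different code from the vocabulary
        alternatives = [code for code in vocabulary if code != target]
        corrupted[position] = rng.choice(alternatives)
    # remaining 10%: leave the selected code unchanged
    return corrupted, position, target


# Hypothetical codes; real inputs would be Phecode and ATC tokens.
vocabulary = ["DX_A", "DX_B", "DX_C", "RX_A", "RX_B"]
sequence = ["DX_A", "RX_A", "DX_B", "RX_B"]
corrupted, position, target = mask_one_code(sequence, vocabulary, random.Random(0))
```

During pre-training, the model would receive `corrupted` and be asked to recover `target` at `position`.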
2) Machine Learning-Based Risk Models: In this study, we aimed to predict severe COVID-19 disease progression by developing ML-based risk models to determine whether a patient would develop ARM within three weeks of their COVID-19 diagnosis. To achieve this, we utilized one year of the patient's medical history prior to diagnosis. To adjust for potential confounding effects of age and gender, we employed the IPTW approach described before.
During the fine-tuning of our ExMed-BERT model, we evaluated different classification head variants. We trained three different models using a feed-forward network (FFN), long short-term memory (LSTM), and gated recurrent unit (GRU) head. We split the data into training, validation, and testing sets in a stratified manner (70/10/20%) and used Bayesian hyperparameter optimization (optuna [28], version 2.10.0) to tune model parameters such as the learning rate, batch size, warmup ratio, weight decay, and in the case of RNNs, also the number of RNN layers. Similarly, we trained RF, XGB, and RETAIN classifiers and optimized several model hyperparameters. A detailed list of all optimized parameters can be found in Supplementary Table B.1.
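A stratified 70/10/20% split keeps the share of ARM-positive patients roughly equal across the three sets. The following sketch shows one simple way to obtain such a split using only the standard library; it is an assumption for illustration, not the procedure used in the study, and the function name `stratified_split`, the seed, and the toy label vector are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split sample indices into train/validation/test per class,
    so each subset preserves the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_valid = round(fractions[1] * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Toy cohort: 20 positives and 80 negatives.
labels = [1] * 20 + [0] * 80
train, valid, test = stratified_split(labels)
```

In this toy cohort, each subset contains 20% positives, mirroring the overall class ratio.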
For RETAIN, we used a Keras-based implementation (footnote 1) with slight modifications described in the next section. The input was similar to our ExMed-BERT model, using PheWAS and ATC codes and visit/date information but excluding a patient's age, sex, and state of residency, as these were not part of the original RETAIN approach.

3) Combination With Quantitative Clinical Measurements:
In this study, we also assessed the integration of diagnosis and prescription codes with numerical clinical data, such as blood pressure readings. Our analysis focused on data documented in the two weeks leading up to the revised index date, excluding features with over 60% missing data. Given the considerable sparsity of this data, we were left with only eight features for our investigation: weight, body mass index (BMI), body surface area (BSA), height, body temperature, diastolic and systolic blood pressure, and heart rate. The number of patients with available numerical data for each feature is displayed in Table II (n = 23,949). To impute the numerical data for all patients, we employed a Random Forest (RF)-based approach, utilizing only the training data (missingpy [29], version 0.2.0).
TABLE II CHARACTERISTICS OF THE PRE-TRAINING AND FINE-TUNING DATASETS

Footnote 1: [Online]. Available: https://github.com/Optum/retain-keras

The imputed numerical clinical features were combined with one-hot encoded (OHE) data for the RF experiments. In the case of the XGBoost (XGB) model, no prior imputation was carried out. Instead, we fused the numerical clinical features with the OHE data, relying on the imputation mechanism built into XGB.
In terms of model modifications, no alterations were required for the RF and XGB models. However, we needed to adjust the ExMed-BERT and RETAIN model architectures to accommodate numerical clinical features. Specifically, we utilized the same ExMed-BERT model as before to produce a patient's medical history embeddings. We then combined these vector-based representations with the numerical input before feeding it into a final classification head. The overall model architecture and the concatenation approach are detailed in Fig. 2. Similarly, we produced the latent encodings in the RETAIN model and combined them with the numerical clinical features before the classification layer.
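The concatenation idea can be sketched in a few lines. This is a deliberately reduced illustration under stated assumptions: a single logistic unit stands in for the FFN classification head, the embedding and the (already standardized) measurements are made-up numbers, and the function name `classify_with_measurements` is hypothetical.

```python
import math


def classify_with_measurements(history_embedding, measurements, weights, bias):
    """Concatenate the patient embedding with numeric clinical features
    and apply one logistic unit as a stand-in for the classification head."""
    features = list(history_embedding) + list(measurements)
    if len(weights) != len(features):
        raise ValueError("one weight per concatenated feature is required")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Made-up values: a 3-dimensional embedding and 2 standardized measurements.
embedding = [0.2, -0.4, 0.1]
measurements = [0.5, -1.0]
weights = [1.0, 0.5, -0.3, 0.8, 0.2]
probability = classify_with_measurements(embedding, measurements, weights, bias=-0.1)
```

Because the numeric features bypass the transformer, the pre-trained encoder can be reused unchanged while only the head sees the additional inputs.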

D. Model Interpretation

1) Feature Importance:
To better understand the ExMed-BERT models, we used the Integrated Gradients (IG) [30] approach to determine which drugs and diagnoses had the strongest influence on the model predictions. The IG method is an axiomatic model interpretability technique that awards, in the case of the ExMed-BERT models, an attribution score for each diagnosis or drug in the medical history. In addition to an input sample ($x \in \mathbb{R}^n$), the IG method requires a baseline input ($x' \in \mathbb{R}^n$), which we constructed using a sequence of padding tokens. The IGs are then approximated by summing the gradients at points along the path from the specified baseline to the input using the following formula:

$$\mathrm{IG}_i(x) \approx (x_i - x'_i) \cdot \frac{1}{m} \sum_{k=1}^{m} \frac{\partial F\left(x' + \frac{k}{m}\,(x - x')\right)}{\partial x_i} \qquad (2)$$

Here, $F$ is a differentiable function ($F : \mathbb{R}^n \to [0, 1]$) that represents our ExMed-BERT model, and $m$ is the number of approximation steps. We performed 50 steps to approximate the integrated gradients.
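The approximation can be illustrated numerically with a toy model. The sketch below is not the attribution pipeline used for ExMed-BERT: it assumes a small logistic function as a stand-in for F, uses finite differences instead of automatic differentiation, and operates on continuous inputs rather than token embeddings. All names and values are hypothetical.

```python
import math


def model(x):
    """Toy differentiable stand-in for F: R^n -> [0, 1]."""
    weights = [1.5, -2.0, 0.5]
    return 1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(weights, x))))


def gradient(f, x, h=1e-6):
    """Central finite-difference gradient of f at x."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grads.append((f(up) - f(down)) / (2 * h))
    return grads


def integrated_gradients(f, x, baseline, steps=50):
    """Riemann-sum approximation of IG along the straight path baseline -> x."""
    totals = [0.0] * len(x)
    for k in range(1, steps + 1):
        point = [b + (k / steps) * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(gradient(f, point)):
            totals[i] += g
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, totals)]


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model, x, baseline, steps=50)
```

A useful sanity check is the completeness property of IG: the attributions should sum approximately to `model(x) - model(baseline)`, with the gap shrinking as the number of steps grows.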
Initially, we computed IG attributions for all patients in the test dataset. Based on these, we calculated the mean absolute attribution for each diagnosis and drug that occurred at least ten percent of the time to identify the top features for each model. Subsequently, we calculated partial dependency scores using the top 20 features. To do so, we first calculated the probability for each patient for a specific endpoint using our fine-tuned ExMed-BERT models; we refer to this probability as $p_r$. The data for each of the top 20 features were then modified individually by exchanging the respective diagnosis or drug codes with a PAD token. Subsequently, the modified data was used as input for our models to calculate the probability $p_m$. Finally, a fold change (FC) for each feature was calculated from the probabilities obtained for actual ($p_r$) and modified data ($p_m$) to estimate the effect of certain features on the model's prediction.

2) Unraveling Feature Dependencies: To better comprehend the numerous interactions and dependencies between the most influential features, we developed BN models. BNs are probabilistic graphical models that can represent complex multivariate distributions with many variables. They can be graphically depicted with nodes representing random variables and edges expressing conditional statistical relationships. Let $G = (V, E)$ be a directed acyclic graph and $\{X_v \mid v \in V\}$ a set of random variables indexed over nodes in $V$. Then for any BN $B = (X, G)$:

$$p(X) = \prod_{v \in V} p\left(X_v \mid X_{\mathrm{pa}(v)}\right)$$

where $\mathrm{pa}(v)$ denotes the parents of $v \in V$ according to the graph structure $G$. Because of their ability to model (potentially causal) relationships between variables, BNs are frequently employed in many areas of science, including systems biology and medicine.
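The factorization of the joint distribution over the graph can be made concrete with a three-node example. The graph, the node names, and all conditional probabilities below are hypothetical values chosen for illustration; they are not estimates from the study.

```python
from itertools import product


def joint_probability(assignment, parents, cpts):
    """p(x) as the product over nodes v of p(x_v | x_pa(v)) for binary variables."""
    prob = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[node])
        p_true = cpts[node][parent_values]  # P(node = True | parents)
        prob *= p_true if value else 1.0 - p_true
    return prob


# Hypothetical DAG: obesity -> diabetes, and both -> arm.
parents = {"obesity": (), "diabetes": ("obesity",), "arm": ("obesity", "diabetes")}

# Hypothetical conditional probability tables (probability of True).
cpts = {
    "obesity": {(): 0.3},
    "diabetes": {(True,): 0.4, (False,): 0.1},
    "arm": {
        (True, True): 0.3,
        (True, False): 0.2,
        (False, True): 0.2,
        (False, False): 0.1,
    },
}

names = list(parents)
joint_all_true = joint_probability(dict(zip(names, (True, True, True))), parents, cpts)

# The factorized probabilities of all 2^3 assignments must sum to one.
total = sum(
    joint_probability(dict(zip(names, values)), parents, cpts)
    for values in product([True, False], repeat=len(names))
)
```

Each node only needs a table conditioned on its parents, which is what keeps BNs tractable for many variables.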
In this work, we learned the graph structure G of a BN for the 100 most important features (according to the IG method) using the R package bnlearn [31] (version 4.7). We used a one-hot encoding for the respective features to indicate whether each was present in the one-year medical history, similar to the data preparation for the tree-based models. We also provided the patients' age, sex, and endpoint status. The tabu algorithm [32], [33] was used for BN structure learning. This was performed within a non-parametric bootstrap sampling scheme: We randomly subsampled n = 80,211 patients with replacement 1000 times, and for each bootstrap sample we performed a complete network structure learning. We then focused on edges occurring in over half of the 1000 network structures acquired from the non-parametric bootstrap samples.
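The bootstrap-and-threshold logic can be sketched independently of the structure learner. The example below is an assumption for illustration and does not reproduce the bnlearn workflow: `toy_learner` is a hypothetical stand-in that links variable pairs by simple agreement and is not a tabu search, and the data are synthetic.

```python
import random
from collections import Counter


def toy_learner(sample):
    """Hypothetical stand-in for a structure learner (NOT tabu search):
    link two variables if they agree in more than 80% of the rows."""
    names = sorted(sample[0])
    edges = []
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            agreement = sum(row[u] == row[v] for row in sample) / len(sample)
            if agreement > 0.8:
                edges.append((u, v))
    return edges


def stable_edges(data, learn_structure, n_bootstrap=1000, threshold=0.5, seed=0):
    """Keep edges found in more than `threshold` of the bootstrap replicates."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_bootstrap):
        # Resample as many rows as the data contains, with replacement.
        sample = [rng.choice(data) for _ in data]
        counts.update(set(learn_structure(sample)))
    return {edge for edge, count in counts.items() if count / n_bootstrap > threshold}


# Synthetic data: "a" and "b" always agree; "c" agrees with them half the time.
data = [{"a": i % 2, "b": i % 2, "c": (i // 2) % 2} for i in range(40)]
edges = stable_edges(data, toy_learner, n_bootstrap=100)
```

Only relationships that survive resampling are retained, which reduces the chance of interpreting edges that depend on a few individual patients.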

E. Transfer Learning on Austrian Hospital Data
1) Overview of the Data: Data from the Austrian hospital group consisted of pseudonymized in-patient records of 6,335 COVID-19-positive patients, out of which 385 suffered from ARM within a 3-week follow-up period after the initial visit to the hospital. The medication prescriptions were already encoded in ATC, but as ICD9/10 codes were used for diagnoses, these were mapped to Phecodes, akin to the procedure described earlier for the IBM Explorys dataset.
2) Transfer Learning of ExMed-BERT: We continued training the ExMed-BERT model for the ARM endpoint for only five epochs on the Austrian hospital data. This was done due to computational constraints. For the same reason, we did no substantial hyperparameter tuning but used the optimal hyperparameters discovered on the IBM Explorys data. We used 5-fold cross-validation to account for the small amount of available data. Alongside the ExMed-BERT model, we trained a new RF model for comparison.

III. RESULTS
In this study, we predicted severe COVID-19 disease progression based on a patient's medical history. We begin by presenting the pre-training results of our newly created ExMed-BERT model. Then, we show the performances of the developed risk models, and lastly, we interpret our models using an explainable AI methodology.

A. Model Pre-Training
We utilized MLM and PLOS as training objectives for pre-training of the ExMed-BERT model.

B. Evaluation of Risk Models
Following pre-training, we developed and evaluated risk models for predicting the ARM endpoint. Initially, we considered only the medical history without additional quantitative clinical measures. As shown in Table III, all ExMed-BERT models performed better than the RF, XGB, and RETAIN variants on unseen test data. Without quantitative clinical data, the ExMed-BERT models scored roughly 78% AUROC for the ARM endpoint, and the AUPR varied between 36.7% and 38.2%. The RF model, on the other hand, only achieved an AUROC of 73.4% and an AUPR of 29.1%. The XGB model had a slightly lower AUROC of 72.4% and AUPR of 28.2%. The RETAIN model achieved the lowest performance, with an AUROC of 68.5% and an AUPR of 26.8%.
The results of nearly all models improved when quantitative clinical measurements were integrated. The ExMed-BERT model with the GRU classification head integrating quantitative data gave the overall best result, with an AUROC of 79.8% and an AUPR of 38.7%, which is significantly higher than all other models.
When only patients with fully recorded quantitative clinical measurements were used, all models performed worse. This means that the potential negative effect of imputing missing values was far smaller than the benefit of including additional data.

C. Model Explanation
To better understand model predictions, we used an explainable AI methodology, namely IG, to calculate attribution scores for all features in the best-performing ExMed-BERT model. We calculated the IG attributions and used them to identify the 20 most important features by ranking them based on their mean absolute value. Fig. 3 shows all the IG attributions and FC scores, which are in agreement with each other. We found that the presence of diagnoses for chronic airway obstruction, congestive heart failure, cough, dementia, edema, obesity, shortness of breath, spondylosis, and type 2 diabetes in the medical history has a large impact on the prediction of a patient's risk for ARM. Similarly, the prescriptions of angiotensin II receptor blockers, biguanides, dihydropyridine derivatives, and thiazides have a substantial positive impact on our models' predictions.
Of course, these prescriptions and diagnoses could be correlated with each other, and thus, not all of them might have a direct impact on the ARM endpoint. Hence, we learned the graph structure of a BN to determine how the significant diagnoses or drugs could be related to one another. The overall network structure is provided as a graphml file, an XML-based data format for graph representation, as supplementary material to this article. Fig. 4 shows two excerpts of the BN graph structure. Fig. 4(a) focuses on Angiotensin II receptor blockers and their relationship to other drugs and diagnoses. Angiotensin II receptor blockers are used to treat hypertension, kidney diseases, and heart failure [34]. Furthermore, our graph shows a connection to essential hypertension and several ATC subgroups, namely ACE inhibitors, Dihydropyridine derivatives, HMG CoA reductase inhibitors, and Thiazides.
Fig. 4(b) depicts morbid obesity and other diagnoses and drugs in its immediate neighborhood. There is a link to the class of Biguanides, which includes the drug Metformin, commonly used to treat diabetes [35]. Furthermore, morbid obesity is linked to hypertension, type 2 diabetes, obstructive sleep apnea, and obesity.
We aimed for an understanding of the statistical and potential causal effects on the endpoint of those features that were either among the 20 most important features or sink nodes in the BN. The latter are nodes without outgoing connections and, therefore, do not influence any other features, according to our BN analysis. For each of those features, we performed a univariate logistic regression analysis while using IPTW case weights to correct for potential confounding effects of age and gender. Our analysis shows significant effects of several prior diagnoses on the ARM onset, namely, type 2 diabetes, obesity, dementia, cardiovascular diseases, and respiratory diseases (see Table IV). These morbidities have previously been reported as risk factors for severe COVID-19 disease progression [36], [37], [38], [39], [40], [41], and the underlying molecular mechanisms have also been discussed [42], [43]. In addition, we found significant associations of constipation, screening for malignant neoplasms, and infectious/parasitic diseases with the ARM onset. This might be explained by the fact that such procedures are more frequently executed in older patients with bad health conditions, resulting in a higher risk of severe COVID-19 progression. In the same type of patients, constipation is also a frequent problem, e.g., due to lifestyle.

D. Transfer Learning on Austrian Hospital Data
Fine-tuning of ExMed-BERT on a small set of pseudonymized in-patient data from an Austrian hospital group resulted in a prediction performance almost identical to the one observed for an RF trained de novo on the same data (Supplementary Table B.2). At the same time, prediction performances (AUC ≈ 60%) were significantly lower than the ones observed on IBM Explorys. We will elaborate on potential reasons in the subsequent discussion.

IV. DISCUSSION
Pandemics such as COVID-19 pose immense challenges to global healthcare systems. Utilizing patient-level risk models to support doctors and clinics is one way to maximize the use of available resources. Following previous research, we trained a transformer-based model on structured EHR data in this study. In contrast to prior approaches, such as BEHRT or Med-BERT, we incorporated additional data modalities and developed risk models for COVID-19 disease progression. Prediction performances achieved by our ExMed-BERT model are altogether superior to those reported by Lazzarini et al. for the closely related endpoint of acute respiratory distress syndrome (ARDS) [44]. The authors trained an XGB model based on US administrative claims data from 290,000 patients and achieved an AUROC of 69% and an AUPR of 7%. For comparison, using data from intensive care units (ICUs), Bendavid et al. reported an AUROC of 83% for an XGB model trained to predict the initiation of invasive mechanical ventilation [45], and Singhal et al. achieved an AUROC of 89% for predicting the onset of ARDS [46]. Importantly, ICU data are structurally and content-wise very different from the data used in our study, which comprises in-patient as well as out-patient information over a more extended period (here: one year), but only contains limited quantitative information. Altogether, our findings align with previous studies [5], [7], [8], showing that transformer-based models are well-suited for structured EHR data similar to ours. Even without additional quantitative information, our ExMed-BERT outperformed the RF, XGB, and RETAIN models. With the inclusion of quantitative clinical measures, our ExMed-BERT models further increased in prediction performance. For that purpose, we proposed a novel approach to combine quantitative clinical measures with the embeddings of EHR codes learned by ExMed-BERT, which resulted in the overall best-performing model. Our results thus demonstrate the importance of combining diagnosis and prescription codes with quantitative clinical measures for developing risk models. Even though these quantitative clinical measures were only taken at a single time point in the two weeks preceding the COVID-19 infection and not for every patient, using these data could provide better performance.
Using a combined strategy consisting of feature importance analysis, BN structure learning, and statistical hypothesis testing, we were able to identify diagnoses and prescriptions that have a significant impact on model prediction and may causally influence the endpoint. Our analysis supports that socioeconomic and psycho-social health risks play an important role in addition to well-known risk factors such as obesity, diabetes, cardiovascular diseases, and dementia, which have been reported for severe COVID-19 disease progression in several studies [36], [39], [47], [48], [49]. This confirms the validity of our approach, which can be applied to other datasets as well.
Our work demonstrates the potential of a transformer-based pre-training/fine-tuning strategy to develop risk models for precision medicine. This strategy provides the chance to perform transfer learning of our model on data from other organizations and thus use the pre-trained ExMed-BERT as a basis for future model development. Our experiment with data from an Austrian hospital group demonstrated the potential as well as the limitations of such an approach: The data from the Austrian hospital group only comprises in-patient information, and the number of patients is far smaller than during the fine-tuning phase on the IBM Explorys data (6,335 patients instead of 80,211). Furthermore, the ratio of ARM-positive patients is significantly lower (6.1% instead of 13.4%). Notably, there could also be different medical coding practices in the two countries. Finally, constraints on the technical equipment within the Austrian hospital group only allowed us to fine-tune our model for a small number of epochs and without hyperparameter tuning. Due to all these factors, our ExMed-BERT model fine-tuned on the Austrian data achieved a performance that was comparable to an RF model trained de novo on the same data but significantly lower than prediction performances achieved on US data. We thus conclude that having a sufficiently large dataset with a number of patients in a range comparable to the IBM Explorys data would be a prerequisite to obtaining better models in a transfer learning setting. Furthermore, appropriate technical equipment is important. Finally, the integration of in-patient and out-patient data is required, at least for our model.
Another limitation is that previously published transformer-based models, such as BEHRT or Med-BERT, and the associated data were not available to us, which hindered direct comparison with our model. As the pre-training of transformer-based models is computationally extremely expensive, it is often not feasible to run comprehensive ablation studies. Despite these limitations, our model was rigorously assessed by comparing it against established approaches (RF, XGBoost, and RETAIN), and its potential was adequately demonstrated. By making our model publicly available, we enable future studies to use it as a foundation for further development.

V. CONCLUSION
Our work demonstrates the potential of customized transformer-based models for analyzing structured EHR data. We showed that it is possible to integrate quantitative clinical data into such models, which can significantly improve prediction performance. Furthermore, we introduced a general approach for explaining ExMed-BERT model predictions.
Transfer learning strategies open the possibility of leveraging our pre-trained ExMed-BERT model for the prediction of clinical endpoints different from the one addressed within this article. For that purpose, we allow users to apply for access to our pre-trained ExMed-BERT model on https://doi.org/10.5281/zenodo.7324178 or by sending an email to the corresponding author. Our code is available at https://github.com/SCAI-BIO/ExMed-BERT.

A Transformer-Based Model Trained on Large Scale Claims Data for Prediction of Severe COVID-19 Disease Progression
Manuel Lentzen, Thomas Linden, Sai Veeranki, Sumit Madan, Diether Kramer, Werner Leodolter, and Holger Fröhlich
Manuscript received 12 December 2022; revised 17 May 2023; accepted 15 June 2023. Date of publication 22 June 2023; date of current version 6 September 2023. This work was supported by the Fraunhofer

Fig. 1. Study overview. First, we pre-trained a transformer model on about 988 million EHR records from 3.5 million patients. Then, we developed patient-level risk models for COVID-19 disease progression. Next, we interpreted our developed risk models using Integrated Gradients in conjunction with Bayesian networks. Finally, we evaluated the possibility of adapting the models to external data.
After 4.5 M steps (epoch 37), the MLM accuracy increased to around 51% and the PLOS F1-score to 70%. Following the inclusion of 61 missing ATC codes and the corresponding changes to the embedding, we began training for 750 K steps. Finally, we achieved an MLM accuracy of 67% and a PLOS F1-score of 66% (epoch 42, Supplementary Fig. B.1).

Fig. 3. Integrated Gradients Attributions for ExMed-BERT GRU. Depicted are all calculated fold changes (FC) and IG attributions for the 20 most important features for the prediction of ARM onset. The dashed blue lines indicate neutral attributions. Everything greater than the neutral value positively affects the prediction and vice versa.
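To illustrate how the sign of an Integrated Gradients attribution can be read, the following Python sketch computes IG for a small logistic model by a midpoint Riemann sum along the straight path from a baseline to the input. The toy model, its weights, the all-zero baseline, and the function names are assumptions chosen for illustration only; they do not reproduce the ExMed-BERT GRU model or the software used in the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Toy logistic model standing in for a risk model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def gradient(weights, bias, x):
    """Analytic gradient of the logistic output with respect to the input."""
    p = predict(weights, bias, x)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(weights, bias, x, baseline, steps=200):
    """Approximate IG attributions with a midpoint Riemann sum."""
    diffs = [xi - bi for xi, bi in zip(x, baseline)]
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * d for bi, d in zip(baseline, diffs)]
        for i, g in enumerate(gradient(weights, bias, point)):
            totals[i] += g
    # Scale the averaged gradient by the input-baseline difference.
    return [d * t / steps for d, t in zip(diffs, totals)]


# Hypothetical weights: feature 0 raises the risk, feature 1 lowers it,
# feature 2 is ignored by the model.
weights = [1.2, -0.8, 0.0]
bias = -0.5
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(weights, bias, x, baseline)
print(attributions)
```

Positive attributions push the prediction above the baseline output and negative ones push it below, while an unused feature receives an attribution of zero. The attributions also sum, up to numerical error, to the difference between the model output at the input and at the baseline.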

Fig. 4. Selected features and their direct neighborhood in the inferred Bayesian network. The numbers indicate the bootstrap strength of the respective edges in percent. A bootstrap strength of 100 thus indicates that the corresponding edge was found in each of the 1000 BN reconstructions learned from different bootstrap samples.
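The bootstrap strength described in this caption can be sketched as the percentage of reconstructed networks that contain a given directed edge. The example below is a minimal Python illustration; the edge names and the four toy reconstructions are hypothetical, and the structure learning step that would produce each network is not shown.

```python
from collections import Counter


def bootstrap_strength(edge_sets):
    """Percentage of reconstructions in which each directed edge appears.

    edge_sets: one collection of (source, target) edges per bootstrap run.
    """
    counts = Counter()
    for edges in edge_sets:
        counts.update(set(edges))  # count each edge at most once per run
    total = len(edge_sets)
    return {edge: 100.0 * n / total for edge, n in counts.items()}


# Hypothetical edge sets from four bootstrap reconstructions.
reconstructions = [
    {("obesity", "ARM"), ("age", "dementia")},
    {("obesity", "ARM")},
    {("obesity", "ARM"), ("age", "dementia")},
    {("obesity", "ARM")},
]

strength = bootstrap_strength(reconstructions)
for edge, value in sorted(strength.items()):
    print(edge, value)
```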

TABLE III
EVALUATION RESULTS OF THE RISK MODELS FOR PREDICTING ARM

TABLE IV
FEATURES WITH A SIGNIFICANT STATISTICAL EFFECT ON ARM