Hi-BEHRT: Hierarchical Transformer-Based Model for Accurate Prediction of Clinical Events Using Multimodal Longitudinal Electronic Health Records

—Electronic health records (EHR) represent a holistic overview of patients' trajectories. Their increasing availability has fueled new hopes of leveraging them to develop accurate risk prediction models for a wide range of diseases. Given the complex interrelationships between medical records and patient outcomes, deep learning models have shown clear merit in achieving this goal. However, a key limitation of current studies remains their capacity to process long sequences, and long-sequence modelling and its application in the context of healthcare and EHR remain underexplored. Capturing the whole history of medical encounters is expected to lead to more accurate predictions, but the inclusion of records collected over decades and from multiple sources can inevitably exceed the receptive field of most existing deep learning architectures. This can result in missing crucial long-term dependencies. To address this gap, we present Hi-BEHRT, a hierarchical Transformer-based model that significantly expands the receptive field of Transformers and extracts associations from much longer sequences. Using a multimodal, large-scale, linked longitudinal EHR, Hi-BEHRT exceeds state-of-the-art deep learning models by 1% to 5% in area under the receiver operating characteristic curve (AUROC) and 1% to 8% in area under the precision-recall curve (AUPRC) on average, and by 2% to 8% (AUROC) and 2% to 11% (AUPRC) for patients with a long medical history, for 5-year heart failure, diabetes, chronic kidney disease, and stroke risk prediction. Additionally, because pretraining for hierarchical Transformers is not well established, we provide an effective end-to-end contrastive pre-training strategy for Hi-BEHRT using EHR, improving its transferability for predicting clinical events with relatively small training datasets.

In this section, we provide more information on modalities that are not commonly included in the modelling. More specifically, we will introduce procedures and tests.
1) Procedure: Procedure is CPRD linked data collected from Hospital Episode Statistics (HES) Admitted Patient Care (HES APC) data. It is recorded at the point of admission to, or attendance at, NHS healthcare providers. All procedure information is coded using the U.K. Office of Population, Censuses and Surveys Classification (OPCS) 4.6, and procedures that are not covered by an OPCS code are not included in the system. Each record in the system is specified with a start date and an end date, as well as an event date. We used the OPCS code and event date to structure the timeline of a patient's EHR history for modelling.
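As a minimal sketch of the timeline-structuring step, coded events can be ordered by their event date to form the patient's EHR sequence. The field names (`code`, `event_date`) and the example OPCS-style codes are illustrative assumptions, not the actual CPRD/HES schema:

```python
from datetime import date

def build_timeline(records):
    """Order coded events (e.g., OPCS-4 procedures) by event date
    to form a patient's chronological EHR sequence."""
    return [r["code"] for r in sorted(records, key=lambda r: r["event_date"])]

# Hypothetical records; code values are illustrative only.
records = [
    {"code": "K40.1", "event_date": date(2015, 3, 2)},
    {"code": "U05.1", "event_date": date(2012, 7, 9)},
]
```

Here `build_timeline(records)` would place the 2012 event before the 2015 one.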

C. Model Stratified By Baseline Age
We evaluated model performance stratified by baseline age. The comparison was conducted on three subgroups of patients: 1) patients with a baseline age between 35 and 50 years (young adults); 2) patients with a baseline age between 50 and 70 years (middle-aged adults); and 3) patients with a baseline age between 70 and 90 years (older adults). Table IX shows that the hierarchical BEHRT model performs better across all subgroups, and it substantially outperforms the BEHRT model on the HF and diabetes risk prediction tasks, especially for younger patients.
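The subgroup assignment above can be sketched as follows. The half-open boundaries are an assumption, since the text does not state how patients aged exactly 50 or 70 are assigned:

```python
def age_subgroup(baseline_age):
    """Map a baseline age to the evaluation subgroup.
    Boundary handling (half-open intervals) is an assumption."""
    if 35 <= baseline_age < 50:
        return "young adult"
    if 50 <= baseline_age < 70:
        return "middle-aged adult"
    if 70 <= baseline_age <= 90:
        return "older adult"
    return None  # outside the evaluated age range
```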

D. Size and Overlap of Sliding Window
For the Hi-BEHRT model, we used a sliding window to split the raw EHR into segments. As shown in Table X, when the window size is relatively small (i.e., 50), the stride size does not have a significant impact on predictive performance, and a bigger stride can reduce the number of segments and hence the model complexity. However, for the larger window size (i.e., 100), the stride size becomes more important, and some overlap between segments is necessary. Without any overlap for window size 100, the AUPRC decreases by 4% compared with the model with stride size 50. Additionally, the analysis shows that a larger window size is not always the better choice. For instance, the AUPRC of window size 100 without overlap is 2% lower than that of window size 50 without overlap. Without overlap, a larger window leads to a shorter segment-level sequence, so a balance between window size and segment-level sequence length may be preferable in the hierarchical structure.
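A minimal sketch of the segmentation step, assuming a sequence shorter than the window forms a single segment and any trailing tokens not reached by a full window are dropped:

```python
def segment_sequence(tokens, window=50, stride=10):
    """Split a token sequence into fixed-size, possibly overlapping segments.
    Segment count follows floor((len - window) / stride) + 1."""
    if len(tokens) <= window:
        return [tokens]
    return [tokens[s:s + window] for s in range(0, len(tokens) - window + 1, stride)]
```

With window 100 and stride 100 the segments are disjoint (no overlap), which corresponds to the no-overlap setting discussed above; stride 50 with window 100 gives 50% overlap.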

E. Hyper-Parameter Tuning
We set up hierarchical BEHRT with hyper-parameters similar to those of the BEHRT model and used it as a reference model to tune the hidden size and intermediate size of the Transformer. More specifically, we applied a grid search over hidden sizes [90, 150, 240] and intermediate sizes [108, 256]. All experiments were conducted on the 5-year HF risk prediction task. Table XI shows that hidden size 150 and intermediate size 108 achieve performance similar to that of the models with larger sizes.
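The grid search can be sketched as below. `evaluate_config` is a hypothetical stand-in for training Hi-BEHRT with the given sizes and returning its validation AUROC on the 5-year HF task; it is not part of the original work:

```python
from itertools import product

def run_grid_search(evaluate_config):
    """Evaluate every (hidden size, intermediate size) combination
    and return the best configuration plus all scores."""
    grid = list(product([90, 150, 240], [108, 256]))  # values from the paper
    scores = {cfg: evaluate_config(*cfg) for cfg in grid}
    return max(scores, key=scores.get), scores
```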

F. Evaluation for Multiple Levels of Hierarchy
In this section, we investigated how the number of levels of hierarchy in Hi-BEHRT influences model performance in risk prediction. Specifically, we compared the performance of Hi-BEHRT with two and three levels of hierarchy. Each additional level substantially reduces the sequence length: for instance, a sequence with maximum length 1225 reduces to a sequence of length 118 with window size 50 and stride size 10 after the first level of hierarchy, and further reduces to 7 after the second level. Therefore, our dataset limited the number of levels we could investigate, and it would not make sense to investigate Hi-BEHRT with more than three levels of hierarchy. We encourage future work to replicate our study and investigate Hi-BEHRT with more levels of hierarchy more comprehensively. In our experiment, we only modified the feature extractor and kept the total number of layers in the feature extractor the same for both comparators. More specifically, the two-level Hi-BEHRT had one level of hierarchy with a four-layer Transformer extractor, while the three-level Hi-BEHRT included two levels of hierarchy with a two-layer Transformer for each. Both comparators used window size 50 and stride size 10, and the remaining hyper-parameters were the same as reported in the manuscript. The results show that both models achieved an AUROC of 0.96 and an AUPRC of 0.76 for HF risk prediction, i.e., no material difference between two-level and three-level Hi-BEHRT on our dataset.
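The length reductions quoted above follow from the sliding-window segment count applied at each level of hierarchy. A quick check with window 50 and stride 10:

```python
import math

def num_segments(seq_len, window=50, stride=10):
    """Number of segments produced by one level of hierarchy
    over a sequence of length seq_len."""
    if seq_len <= window:
        return 1
    return math.floor((seq_len - window) / stride) + 1

first = num_segments(1225)    # level one: 1225 tokens -> 118 segment embeddings
second = num_segments(first)  # level two: 118 -> 7
```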
2) Test: Tests are recorded in the CPRD test table and coded as Read codes. They include information on history/symptoms, examination/signs, diagnostic procedures, and laboratory procedures. In the experiment, we only used information at the Read-code level, which represents which examinations or procedures were carried out. More detailed quantitative information was excluded.

TABLE VI: CODES USED TO IDENTIFY PATIENTS WITH DIABETES IN HOSPITAL DISCHARGE RECORDS AND GENERAL PRACTICE RECORDS

TABLE VII: ICD-10 CODES USED TO IDENTIFY PATIENTS WITH CKD IN HOSPITAL DISCHARGE RECORDS AND GENERAL PRACTICE RECORDS

TABLE VIII: ICD-10 CODES USED TO IDENTIFY PATIENTS WITH STROKE IN HOSPITAL DISCHARGE RECORDS AND GENERAL PRACTICE RECORDS