GenHPF: General Healthcare Predictive Framework for Multi-Task Multi-Source Learning

Despite the remarkable progress in the development of predictive models for healthcare, applying these algorithms on a large scale has been challenging. Algorithms trained on a particular task, based on specific data formats available in a set of medical records, tend to not generalize well to other tasks or databases in which the data fields may differ. To address this challenge, we propose General Healthcare Predictive Framework (GenHPF), which is applicable to any EHR with minimal preprocessing for multiple prediction tasks. GenHPF resolves heterogeneity in medical codes and schemas by converting EHRs into a hierarchical textual representation while incorporating as many features as possible. To evaluate the efficacy of GenHPF, we conduct multi-task learning experiments with single-source and multi-source settings, on three publicly available EHR datasets with different schemas for 12 clinically meaningful prediction tasks. Our framework significantly outperforms baseline models that utilize domain knowledge in multi-source learning, improving average AUROC by 1.2%P in pooled learning and 2.6%P in transfer learning while also showing comparable results when trained on a single EHR dataset. Furthermore, we demonstrate that self-supervised pretraining using multi-source datasets is effective when combined with GenHPF, resulting in a 0.6%P AUROC improvement compared to models without pretraining. By eliminating the need for preprocessing and feature engineering, we believe that this work offers a solid framework for multi-task and multi-source learning that can be leveraged to speed up the scaling and usage of predictive algorithms in healthcare.

Patient medical records, which are regularly accumulated in the form of Electronic Health Records (EHR), have opened up new opportunities for data-driven models that can improve the quality of patient care. With the rapid adoption of artificial intelligence (AI) in healthcare, healthcare providers continue to develop models for different applications such as predicting patient outcomes [1]-[3], optimizing hospital operations [4], [5], and diagnosing diseases [6]-[8].
Until now, traditional model development methods have been constrained by their reliance on task-specific feature engineering, wherein preprocessing techniques are predominantly tailored for individual tasks or applications. For instance, predictive modeling tasks for patient care, benchmarking, and quality improvement require this approach. Consequently, each health system or research institute is compelled to employ its own data experts to meticulously preprocess medical records to suit specific tasks. This process can be time-consuming and expensive, ultimately restricting the range of potential applications [9].
Furthermore, this problem is exacerbated by the increasing number of tasks, which significantly burdens hospitals in terms of developing and managing task-specific models. For example, clinicians may need to simultaneously perform various prediction tasks, such as mortality and readmission, for the same patient. To address this challenge, a comprehensive framework is required that can be applied to multiple tasks [10] with minimal preprocessing, thereby minimizing the need for a meticulous design of input features.
This problem contributes to the inequality in healthcare AI, as algorithms are developed and used by large institutions (typically academic data centers) with access to large data and research capabilities. In reality, typical EHR datasets do not follow a single data format, particularly across geographies and multiple EMR providers. Each health system could store data according to its own needs, which consequently requires a level of manual harmonization.
Specifically, different EHR systems adopt different medical code standards (e.g., ICD-9, ICD-10, raw text), and use distinct database schemas to store patient records [11]-[13]. These discrepancies in medical codes and schemas prevent healthcare institutions from conducting multi-source learning, such as fine-tuning a model that has been previously trained on a different EHR dataset (i.e., transfer learning) or developing a unified model with data pooled from multiple hospitals (i.e., pooled learning).
Fig. 1. The conventional approach for building predictive models uses domain-specific knowledge to preprocess data for each hospital (or health system) and task. In contrast, our proposed framework uses text input features, eliminating the need for preprocessing and feature engineering specific to each hospital. This allows us to train a unified model for two multi-source learning scenarios: 1) conventional supervised learning for multi-task learning and 2) self-supervised pretraining with unlabeled data. Our framework allows each trained model to be transferred to any hospital, irrespective of data format differences, thereby ensuring general adaptability across healthcare systems.
In summary, the major challenges encountered by current healthcare prediction models are as follows: 1) models are specifically developed for each prediction task via feature engineering with task-specific domain knowledge, and 2) procuring a large amount of unified data is difficult, which is a critical problem for developing the aforementioned general-purpose multi-task prediction model. The main objective of this study is to propose a framework that addresses these two challenges.

Related work
Previous healthcare prediction models with EHR have been focused on increasing the prediction performance by utilizing domain knowledge and various architectures such as recurrent neural networks (RNN) [1], [14], convolutional neural networks [15], and transformer-based models [16]-[19]. Although each study makes a distinct contribution, none addresses the two major aforementioned challenges.

Multi-Task Learning
MIMIC-Extract [20], for example, performs domain-knowledge-based feature engineering, such as grouping semantically similar concepts into a clinical taxonomy as data structures that are directly usable in common multi-task time-series prediction pipelines. Based on handcrafted features, McDermott et al. [21] proposed a benchmark for ten healthcare predictive tasks (multi-task learning) and reported their prediction performances. Because of their specialized nature, these approaches are designed to work exclusively for specific datasets, making them inapplicable to multiple EHR datasets that may vary in diversity and heterogeneity.
An alternative approach, proposed by Rajkomar et al. [22], involves a framework that incorporates all features of the EHR, that is, all column values in all the EHR tables. This allows the same model to be used for four different tasks. However, since this approach uses Fast Healthcare Interoperability Resources (FHIR) [23], which is a form of Common Data Model (CDM), to manually standardize different EHR data into a uniform format, there is a significant overhead for multi-source learning. This process of standardizing EHR formats demands considerable domain knowledge and extensive manual effort, making the integration of a large number of datasets with diverse formats impractical.

Resolving EHR Heterogeneity without manual efforts
To address the lack of scalability in previous works, AutoMap [24] conducts medical code mapping via self-supervised learning using a predefined medical ontology. This study aims to develop a solution to the current lack of a unified EHR system through a direct code-to-code mapping of two different medical institutions. However, since AutoMap requires a standardized medical ontology, manual effort is still necessary.
Fig. 2. Overview of GenHPF. On the top, a patient's medical events occur over time. Each medical event $M_i$ consists of event-related features $A_i^k$, including feature names and their values. These features, prepended with event type $e_i$, are converted to corresponding descriptions and tokenized into a sequence of sub-words. Then, an event encoder $f$ converts the sequence (i.e., event input) into an embedding $m_i$, which is then passed to the event aggregator $g$, which makes a prediction $\hat{y}$.
In another study, DescEmb [25] aimed to overcome the heterogeneity of medical codes by utilizing the clinical descriptions linked to each code, thereby partially enabling multi-source learning. Despite its text-based embedding to avoid the manual code mapping process, this approach still necessitates domain experts to conduct EHR system-specific preprocessing to select compatible and meaningful features from the EHRs. Overcoming schema heterogeneity across different institutions poses a challenge when selecting universally applicable features with consistent formats from multiple datasets. None of the aforementioned studies adequately address the dual challenges of utilizing multi-task models on heterogeneous EHRs.

Self-supervised pretraining in EHR
Self-supervised learning (SSL), which involves pretraining on large-scale unlabeled datasets and fine-tuning for prediction tasks, has demonstrated success in various applications [26]-[28], including predictive models based on EHR [29]-[34]. Previous studies on SSL using EHR data have primarily focused on pretraining and fine-tuning models exclusively for identical EHR systems, limiting their applicability to other EHR systems. As the proposed framework resolves EHR heterogeneity, training it via SSL produces a general-purpose pretrained model that can be fine-tuned for any task in any EHR system.
This study makes three contributions. First, to address both challenges (task-specific model development process and EHR heterogeneity) simultaneously, we propose the General Healthcare Predictive Framework (GenHPF) (Figure 1), which is applicable to multiple patient record systems. GenHPF resolves heterogeneity in medical codes and schemas by converting medical records into a hierarchical textual representation while incorporating as many features as possible. This framework reflects the common data structure of medical records, allowing different structures to be utilized without code and schema harmonization processes.
Second, to demonstrate the efficacy of GenHPF empirically, we conduct extensive experiments using three publicly available EHR datasets with different schemas (MIMIC-III, eICU, MIMIC-IV) for twelve clinically meaningful prediction tasks. Our framework achieves comparable or higher prediction performances on single-domain learning compared with other frameworks, while consistently outperforming all other frameworks in terms of pooled learning and transfer learning.
Lastly, we combine several SSL methods with GenHPF, demonstrating best practices that benefit GenHPF when it is pretrained in a self-supervised manner with unlabeled data. This will enable researchers and engineers in this field to use a pretrained GenHPF as a general-purpose foundation model for diverse prediction tasks, regardless of the EHR schema. Our findings provide insights for further research on the multi-source learning of EHR. Figure 1 overviews the proposed framework.

A. Structure of Electronic Health Records
This section describes and summarizes the EHR structure and notations used throughout this paper. In typical EHR data, each patient $P$ can be represented as a sequence of medical events $[M_1, \ldots, M_N]$, where $N$ is the total number of events throughout the entire patient visit history. The $i$-th medical event of a patient, $M_i$, can be expressed as a set of event-associated features $\{A_i^1, \ldots, A_i^{|M_i|}\}$. Each feature $A_i^k$ can be seen as a tuple of a feature name and its value, $(n_i^k, v_i^k) \in \mathcal{N} \times \mathcal{V}$, where $\mathcal{N}$ and $\mathcal{V}$ are each a set of unique feature names (e.g., {"drug name", "drug dosage", ...}) and feature values (e.g., {"vancomycin", "10.0", ...}), respectively.
In addition, each medical event $M_i$ has its corresponding event type $e_i \in \mathcal{E}$, which denotes the type of the event (e.g., $\mathcal{E}$ = {"lab test", "prescription", ...}). Lastly, since the recorded time is also provided with $M_i$, we can measure the time interval $t_i$ between $M_i$ and $M_{i+1}$.
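The notation above maps naturally onto a small data structure. The following is a minimal sketch, not the authors' implementation: the class names `MedicalEvent` and `Patient`, the field names, and the choice of hours as the time unit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MedicalEvent:
    """One medical event M_i: an event type e_i plus (name, value) features."""
    event_type: str                   # e_i, e.g. "lab test"
    features: List[Tuple[str, str]]   # [(n_i^k, v_i^k), ...]
    time: float                       # recorded time, in hours (assumed unit)


@dataclass
class Patient:
    """A patient P as a time-ordered sequence of events [M_1, ..., M_N]."""
    events: List[MedicalEvent]

    def time_intervals(self) -> List[float]:
        # t_i is the gap between M_i and M_{i+1}, so N events give N-1 intervals.
        times = [e.time for e in self.events]
        return [later - earlier for earlier, later in zip(times, times[1:])]


# Toy values for illustration only.
patient = Patient(events=[
    MedicalEvent("lab test", [("lab name", "glucose"), ("value", "95.0")], time=0.0),
    MedicalEvent("prescription", [("drug name", "vancomycin"), ("drug dosage", "10.0")], time=1.5),
    MedicalEvent("lab test", [("lab name", "creatinine"), ("value", "1.2")], time=4.0),
])
intervals = patient.time_intervals()
```

Note that feature values are kept as strings here, which matches the textual treatment of values used later in the framework.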

B. General Healthcare Predictive Framework
In this section, we present GenHPF, a general framework for EHR-based prediction based on the following three principles, and describe how to implement each principle: (1) text-based embedding, (2) employing the entire features of EHR, and (3) medical event aggregation. Figure 2 depicts the overall architecture.
Text-based embedding. A conventional EHR embedding method begins by assigning a unique embedding for each element in $\mathcal{V}$ via a linear map (i.e., lookup table) $f_V$ [17], [21], [22], [35], [36], so that $v_i^k$ can be converted to a vector $\mathbf{v}_i^k \in \mathbb{R}^{d_v}$, typically followed by pooling multiple feature values $(\mathbf{v}_i^1, \mathbf{v}_i^2, \ldots)$ to obtain $m_i \in \mathbb{R}^{d_m}$, the embedding of $M_i$.³ This conventional embedding, however, usually requires a different $f_V$ for each medical institution due to the $\mathcal{V}$ heterogeneity, namely each institution using a different $\mathcal{V}$. For example, MIMIC-III [13], an open-source EHR dataset, uses the ICD-9 diagnosis codes for recording diagnostic information, while eICU [11], another open-source EHR dataset, uses in-house diagnosis codes. Therefore, the conventional embedding is not the most suitable foundation on which to build a general EHR framework.
DescEmb [25] proposed to resolve this problem by suggesting a text-based embedding, where hospital-specific feature values are first converted to textual descriptions (e.g., "401.9" → "unspecified essential hypertension"), then a text encoder paired with a sub-word tokenizer is used to obtain $m_i$ [37]. With this approach, the model can learn the language of the underlying medical text rather than memorize a unique embedding for each hospital-specific feature value, thereby overcoming the $\mathcal{V}$ heterogeneity, as the same text encoder can be used for all institutions that use the same language. We adopt this code-agnostic embedding method and use feature names as well as feature values in the event representation.
We extend the previous approach by applying the text-based embedding philosophy to event types $e_i$ and feature names $n_i^k$, in addition to feature values $v_i^k$, as follows:
$$m_i = f\big(S(e_i), S(n_i^1), S(v_i^1), \ldots, S(n_i^{|M_i|}), S(v_i^{|M_i|})\big)$$
where $S$ is a sub-word tokenizer, and $f$ is an event encoder that takes a sequence of sub-word tokens and returns $m_i$. Although $f$ can be implemented with any sequence encoder (e.g., a pretrained language model as in DescEmb, a randomly initialized Transformer encoder, or even a single-layer RNN), we use a 2-layer Transformer in this work.
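The text-based event representation described here can be illustrated with a minimal sketch. It is not the authors' code: `event_to_text` and `toy_tokenize` are hypothetical helpers, and whitespace splitting is only a stand-in for a real sub-word tokenizer $S$. The encoder $f$ is omitted.

```python
from typing import List, Tuple


def event_to_text(event_type: str, features: List[Tuple[str, str]]) -> str:
    """Serialize e_i followed by every (name, value) pair of the event."""
    parts = [event_type]
    for name, value in features:
        parts.extend([name, value])
    return " ".join(parts)


def toy_tokenize(text: str) -> List[str]:
    """Stand-in for the sub-word tokenizer S: lowercase and split on whitespace."""
    return text.lower().split()


text = event_to_text(
    "prescription",
    [("drug name", "vancomycin"), ("drug dosage", "10.0")],
)
tokens = toy_tokenize(text)
```

Because the feature names travel with the values, two schemas that name the same quantity differently still yield readable text, and no feature matching is needed before encoding.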
Employing the entire features of EHR.
To develop a general predictive framework, in addition to the $\mathcal{V}$ heterogeneity, we must consider the schema heterogeneity, namely each medical institution using a different database schema. When developing a conventional predictive model, medical domain experts are typically involved to define $M'_i \subset M_i$, a subset of task-specific features among $M_i$ according to each EHR system. This process must be carried out repeatedly whenever they encounter a different EHR schema. Moreover, in multi-source learning, medical domain experts must select and match compatible features between distinct EHR systems. For instance, in the Lab event of eICU, the feature named "labResult" should be paired with the "VALUENUM" feature in MIMIC-III's LABEVENTS event. Assessing database schemas of multiple sources and matching compatible features, although inevitable in a conventional approach, is time-consuming and prone to human errors.
³ Previous EHR embedding methods do not typically use the feature name $n_i^k$.
Therefore, to leverage multiple heterogeneous EHR sources, features that share the same meaning must be matched. To avoid this costly procedure, our framework exploits the entire features of medical events, effectively resolving the schema heterogeneity. As described in Eq. II-B, the entire set of features in medical events is embedded into one unified embedding $m_i$. Since this approach utilizes all features, feature selection is not required. Additionally, in multi-source learning, our framework is not constrained by the features that are present in each schema since both the name $n_i^k$ and the value $v_i^k$ of the feature are used. A comparison of the conventional approach, DescEmb [25], and our approach for obtaining $m_i$ is provided below:
Conventional approach: the values $v_i^k$ of the selected features $M'_i$ are embedded with the lookup table $f_V$ and pooled, where pool is typically implemented as a concatenation or summation of the elements.
DescEmb: the values $v_i^k$ of the selected features $M'_i$ are converted to text and embedded with $S$ and $f$.
GenHPF: the event type $e_i$ and the names $n_i^k$ and values $v_i^k$ of all features in $M_i$ are embedded with $S$ and $f$ (Eq. II-B).
Note that GenHPF differs from previous approaches in that it is the only approach to exploit all available information in a medical event, including the event type, all feature names, and all feature values. Therefore, GenHPF provides a general solution applicable to any EHR system with a different schema, making it schema-agnostic, without requiring medical domain knowledge. DescEmb [25] still cannot resolve schema heterogeneity since it exploits only the feature value $v_i^k$: it does not let the model learn the semantics of column names, and therefore still requires the selection of compatible features.
Medical event aggregation. To leverage the EHR structure characteristics, where $P$ consists of a sequence of $M_i$ and each $M_i$ consists of a set of $A_i^k$, we design a hierarchical model consisting of the event encoder $f$ and the event aggregator $g$.
As each $M_i$ is converted into $m_i$ according to Eq. II-B, we can obtain $p \in \mathbb{R}^{d_p}$, the vector representation of $P$, by applying $g$ to the sequence of event embeddings $[m_1, \ldots, m_N]$ together with the time information $t$. Here, $g$ is an embedding function that takes a sequence of event embeddings, and $t$ is a timestamp that is applied following [38], weighting the attention according to the time interval between adjacent events. Note that $g$ can be implemented with any sequence encoder, such as a Transformer encoder or a single-layer RNN. Then, feeding $p$ through a softmax layer (a sigmoid layer for binary prediction) gives the final prediction $\hat{y}$.
In addition, p can be obtained by employing a flattened model architecture rather than a hierarchical one, where subword tokens from all features of all medical events are passed to the sequence model h at the same time.We confirm that the hierarchical approach, which reflects the structure of EHR data, indeed outperforms the flattened approach.
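The difference between the hierarchical and flattened layouts can be made concrete with a small sketch of the inputs alone. This is an illustration under stated assumptions rather than the authors' pipeline: the function names and toy tokens are invented, and the encoders $f$, $g$, and $h$ themselves are not modeled.

```python
from typing import List

# Each inner list holds the sub-word tokens of one event (toy values).
event_tokens: List[List[str]] = [
    ["lab", "test", "glucose", "95.0"],
    ["prescription", "vancomycin", "10.0"],
    ["lab", "test", "creatinine", "1.2"],
]


def hierarchical_input(events: List[List[str]]) -> List[List[str]]:
    """Keep event boundaries: f encodes each event, g aggregates N embeddings."""
    return [list(tokens) for tokens in events]


def flattened_input(events: List[List[str]]) -> List[str]:
    """Drop event boundaries: one long token sequence for a single model h."""
    return [token for tokens in events for token in tokens]


hier = hierarchical_input(event_tokens)
flat = flattened_input(event_tokens)

# Sequence lengths each encoder sees.
longest_event = max(len(tokens) for tokens in hier)   # seen by f
num_events = len(hier)                                # seen by g
flat_length = len(flat)                               # seen by h
```

In the hierarchical layout no single encoder sees more than one event's tokens or one patient's event count, whereas the flattened model must attend over every token of every event at once.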

C. Self-supervised pretraining
Building upon the premise that self-supervised pretraining may enhance downstream task performance, the proposed framework enables multiple heterogeneous EHRs to be used during the self-supervised pretraining process. Our investigation focuses on determining the efficacy of various self-supervised pretraining approaches when applied to GenHPF. In this study, we test four well-known SSL methods as follows:
SimCLR [26]: We execute a two-step process of (1) EHR data augmentation and (2) contrastive pretraining inspired by SimCLR. For data augmentation, we create a pair of views per patient by halving the time-series data based on the number of events and randomly masking the tokens in the events at a fixed ratio. The contrastive pretraining objective is to maximize the similarity of the representation vectors created from two views of the same patient (i.e., positive pair) while minimizing the similarity of the vectors created from the views of different patients (i.e., negative pair) in accordance with the SimCLR settings [26].
Wav2Vec 2.0 [27]: We execute the Wav2Vec 2.0 [27] pretraining process, which consists of (1) feature encoder output quantization and (2) contrastive learning on mask-selected patient event timesteps. During the quantization stage, continuous latent vectors (i.e., event encoder outputs) are quantized by mapping the vectors to discrete entries of a trainable codebook. Gumbel softmax is used to map each latent vector to the codebook entries. During the second stage, a proportion of the latent vectors are randomly masked before being fed into the event aggregator. For each mask-selected position, the overall pretraining objective is to maximize the similarity between the event representation vectors (i.e., event aggregator outputs) and their corresponding quantized vector, while minimizing the similarity with other quantized vectors. The loss terms follow the definitions in Wav2Vec 2.0. We use the event encoder as the feature encoder instead of the convolutional blocks used in the original study.
MLM and SpanMLM [28], [39]: For MLM pretraining, we randomly mask a fixed ratio of tokens among the whole patient event history, and the pretraining objective is to predict the masked tokens based on bidirectional attention. For SpanMLM pretraining, we apply event-level random masking, where all tokens included in the sampled events are masked. This is intended to learn the context of the EHR time-series by predicting whole events rather than isolated, randomly masked sub-words of a description. Note that both MLM and SpanMLM are based on predicting the raw text (i.e., tokens), which prevents us from using the hierarchical textual representation (Figure 4). Therefore, we use a flattened textual representation for these two methods; the Methods section describes this representation further.
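The contrast between token-level masking (MLM) and event-level masking (SpanMLM) can be sketched as follows. This is a hedged illustration only: the function names, the `Position` type, and the 0.25 ratio are assumptions, and the actual masking ratio used in the study is not stated here.

```python
import random
from typing import List, Set, Tuple

Position = Tuple[int, int]  # (event index, token index within that event)


def mlm_positions(events: List[List[str]], ratio: float, rng: random.Random) -> Set[Position]:
    """Token-level masking: sample a fixed ratio of tokens over the whole history."""
    all_positions = [(i, j) for i, ev in enumerate(events) for j in range(len(ev))]
    k = round(ratio * len(all_positions))
    return set(rng.sample(all_positions, k))


def span_mlm_positions(events: List[List[str]], ratio: float, rng: random.Random) -> Set[Position]:
    """Event-level masking: sample events, then mask every token inside them."""
    k = round(ratio * len(events))
    chosen = rng.sample(range(len(events)), k)
    return {(i, j) for i in chosen for j in range(len(events[i]))}


rng = random.Random(0)
events = [["tok"] * 5 for _ in range(4)]      # 4 toy events, 5 tokens each
token_mask = mlm_positions(events, 0.25, rng)
event_mask = span_mlm_positions(events, 0.25, rng)
```

With equal-length events both strategies mask the same number of tokens, but SpanMLM concentrates them in whole events, so the model cannot recover a masked token from its neighbors within the same event.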

A. Datasets
We use three publicly available datasets: MIMIC-III [13], MIMIC-IV [12], and eICU [11]. The MIMIC-III database consists of clinical data of over 40,000 patients admitted to the intensive care units (ICU) at the Beth Israel Deaconess Medical Center. MIMIC-IV is an enhanced version of MIMIC-III that incorporates additional data sources, including admission date. The eICU consists of ICU records from multiple US-based hospitals, with 140,000 unique patients. All three datasets contain patient medical events including lab tests, prescriptions, and input events (e.g., drug injection), which are processed as inputs for the experiments. Each event is marked with a timestamp. We build patient cohorts of patients over the age of 18 years who remained in an ICU for over 24 hours. To ensure reliable experiments and analyses, we randomly split each dataset into training, validation, and test sets in an 8:1:1 ratio.
Minimal preprocessing applicable to any EHR is performed in three steps. First, we eliminate features whose values consist only of integers. This approach ensures that all continuous-valued features (e.g., lab test results) and textual features (e.g., lab test names) are used, while omitting features such as the patient ID. Second, we split numeric values digit by digit and assign a unique token to each digit place, a method known as digit place embedding, which was first introduced in DescEmb [25]. Third, we tokenize all features and prepare them as text input features using the bio-clinical-bert tokenizer [40]. Table I summarizes the general characteristics of the three datasets, including the size and feature dimensions. Each feature is embedded either as a categorical feature (code-based embedding) or as the text itself.
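The first two preprocessing steps can be sketched on a toy table. This is a minimal illustration under assumptions: the column names and values are examples, `drop_integer_columns` and `split_digits` are hypothetical helpers, and the sketch only separates digits without reproducing the exact digit-place token assignment of DescEmb.

```python
from typing import Dict, List


def is_integer_only(values: List[str]) -> bool:
    """True when every value in the column is a plain integer string."""
    return all(v.lstrip("-").isdigit() for v in values)


def drop_integer_columns(table: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Step 1: keep continuous-valued and textual columns, drop integer-only ones."""
    return {name: col for name, col in table.items() if not is_integer_only(col)}


def split_digits(value: str) -> str:
    """Step 2 (first half): separate a numeric string into single characters."""
    return " ".join(value)


# Toy table; column names and values are illustrative only.
table = {
    "SUBJECT_ID": ["10006", "10011"],      # integer-only, removed
    "LABEL": ["glucose", "creatinine"],    # textual, kept
    "VALUENUM": ["95.0", "1.2"],           # continuous, kept
}
kept = drop_integer_columns(table)
digits = split_digits("95.0")
```

The integer-only rule is a schema-independent heuristic: it needs no knowledge of which columns are identifiers, which is what lets the same preprocessing run on any of the datasets.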
For the pretraining dataset, we prepare an unlabeled dataset from multiple ICUs without an observation window and sample medical events with a maximum length of 150, excluding the test set of the downstream task. For medical sequences exceeding 150 events, we shift the starting point of sampling by 30 events, thereby altering the sample while maximizing data inclusion.
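The windowed sampling of long event sequences can be sketched as a sliding window with a length of 150 and a shift of 30. The function name `pretraining_windows` is hypothetical, and the handling of the tail (only full windows are kept for long sequences) is an assumption, since the text does not specify it.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pretraining_windows(events: Sequence[T], max_len: int = 150, stride: int = 30) -> List[Sequence[T]]:
    """Cut one event sequence into samples of at most max_len events."""
    if len(events) <= max_len:
        return [events]
    # Assumption: stop at the last start that still yields a full window.
    starts = range(0, len(events) - max_len + 1, stride)
    return [events[s:s + max_len] for s in starts]


windows = pretraining_windows(list(range(200)))
short = pretraining_windows(list(range(40)))
```

Overlapping windows reuse most events several times, which increases the number of pretraining samples drawn from long ICU stays.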

B. Prediction tasks
To fairly evaluate our framework for various healthcare predictive tasks, we utilize open-source prediction tasks that can be applied in an ICU setting. We adopt eight prediction targets (Mort, LMort, Readm, LOS3, LOS7, Dx, Fi_ac, and Im_disch), as described by McDermott et al. [21]. Additionally, to demonstrate the efficacy of GenHPF in a broader range of tasks, we formulate four prediction targets for lab values, which serve as proxy indicators for sepsis or acute kidney injuries [41]. All tasks are based on ICU stays, and the performance is evaluated using the area under the receiver operating characteristic curve (AUROC). Each task is defined as follows:
• Mortality (Mort) (binary): A sample is labeled positive for mortality if the discharge state was "expired" within a prediction window of 48 hours during the stay. In addition, for longer-term mortality prediction, we use death within 2 weeks (abbreviated as LMort).
• Length-of-Stay (LOS) (binary): The length of stay prediction for ICU stays can be categorized into two cases: determining whether a given stay lasted longer than 3 days (LOS3), and determining whether it lasted longer than 7 days (LOS7).
• Readmission (Readm) (binary): Given a single ICU stay, we consider it a positive case if the ICU stay is followed by another (readmission) during the same hospital stay.
We use medical event information from the initial 12 hours after ICU admission and apply a 12-hour time gap across all tasks. The time gap, which is designed to exclude any data close to the prediction time, is implemented to maintain the task difficulty and prevent potential data leakage. Excluding any ICU stays shorter than 24 h allows for both a 12-h observation window and a 12-h gap. For diagnosis, we categorize diagnosis labels into 18 distinct classes based on the CCS ontology [42]. We use ICD9 for MIMIC-III, ICD10 for MIMIC-IV, and text format diagnostic labels for the eICU. We perform an additional mapping process for Fi_ac and Im_disch owing to the different labeling sources across datasets. For the lab value prediction tasks, we adopt the approach of Gyawali et al. [43], defining them from the SOFA score, which assesses the severity of sepsis from specific lab values. Hence, each lab value is assigned a categorical value based on its corresponding SOFA score. Statistics for prediction tasks are shown in Tables V, VI, and VII.
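The observation window, the time gap, and the two length-of-stay targets defined above can be sketched as follows. This is an illustration under assumptions: the helper names are invented, times are expressed as hours since ICU admission, and the treatment of events recorded exactly at the 12-hour boundary is a choice made here rather than one stated in the text.

```python
from typing import Dict, List, Tuple

OBS_HOURS = 12.0   # observation window after ICU admission
GAP_HOURS = 12.0   # gap between the observation window and prediction time


def observed_events(events: List[Tuple[float, str]]) -> List[Tuple[float, str]]:
    """Keep events from the first OBS_HOURS; items are (hours since admission, text)."""
    return [(t, e) for t, e in events if t < OBS_HOURS]


def los_labels(stay_hours: float) -> Dict[str, int]:
    """Binary length-of-stay targets; stays under 24 h are excluded from the cohort."""
    if stay_hours < OBS_HOURS + GAP_HOURS:
        raise ValueError("stay shorter than observation window plus gap")
    return {"LOS3": int(stay_hours > 3 * 24), "LOS7": int(stay_hours > 7 * 24)}


inputs = observed_events([(1.0, "lab test glucose 95.0"), (13.0, "prescription vancomycin")])
labels = los_labels(100.0)
```

Only the first event is used as model input in this example; the second falls inside the gap and is discarded, which is what keeps information close to the prediction time out of the input.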

C. Baselines and implementation details
Baselines. As there is no previous work, to our knowledge, that tackles exactly the same goal as ours, we modified well-known general-purpose EHR embedding frameworks. By comparing GenHPF with baselines, we systematically evaluate the components that can influence prediction performance in multi-source learning settings. We analyzed these frameworks based on two options: feature utilization (selective or utilizing all) and embedding method (code-based or text-based). In addition, all models were provided with both $n_i^k$ and $v_i^k$ for a fair comparison with GenHPF.
• SAnD: This uses the conventional embedding, selected features $M'_i$, and the flattened architecture, similar in spirit to SAnD [17]. Note that feature embeddings from all medical events $[M_1, \ldots, M_N]$ are directly fed to the sequence encoder $h$ instead of being pooled to obtain individual $m_i$.
• Rajkomar: This uses the conventional embedding, entire features $M_i$, and the hierarchical approach, similar in spirit to [22] except the CDM standardization. Note that the feature embeddings from each $M_i$ are fed to $f$ to obtain individual $m_i$, which is then fed to $g$.
• DescEmb: This uses the text-based embedding, selected features $M'_i$, and the hierarchical approach, similar in spirit to DescEmb [25].
• AutoMap: This uses the same embedding method and features as Rajkomar [22]. It trains $M_i$ by automatically mapping medical codes using ontology-level alignment with an unsupervised learning method.
• Muse: This uses the same embedding method and features as Rajkomar [22]. It trains $M_i$ using skip-grams and aligns the embedding space between bilingual dictionaries.
Implementation. For a fair comparison, $f$ and $g$ were implemented with a randomly initialized 2-layer Transformer encoder and a 4-layer Transformer encoder, respectively, making all models equivalent in terms of the number of trainable parameters ($d_v = 128$, $d_m = 128$, $d_p = 128$). Although all frameworks share the same sequence of medical events, the selection of features and the embedding approach employed can vary across frameworks. The selected features $M'_i$ follow DescEmb [25].
To maintain the same input information for both hierarchical and flattened models, we limit the number of events per sample. Owing to computational resource constraints, the flattened models are limited to a maximum sequence length of 8192, and a correspondingly adjusted number of events is used as input for the hierarchical model, which includes the same events as the flattened model.
Training details. All experiments are conducted using five random seeds, which are used to initialize the model parameters and to split the dataset. Performance is evaluated based on the area under the receiver operating characteristic curve (AUROC) averaged over twelve tasks. We conduct all experiments in a multi-task learning setting, as our main interest is to develop a single model that performs multiple tasks using multiple EHR datasets simultaneously. For multi-source learning, we train on the combined dataset and validate on each individual dataset separately. Early stopping is enforced according to the validation AUROC for each dataset, and the best model is saved per dataset. Subsequently, each saved model is tested on the corresponding dataset.
Hyperparameters. We explored various hyperparameters to determine the optimal setting for each framework. However, we found that the impact of these hyperparameters on the results was not significant. Consequently, we use a unified set of hyperparameters for all cases, thereby simplifying the experiment while maintaining the performance for each model. The final hyperparameters are a dropout of 0.3, a batch size of 64, and a learning rate of 1e-4. For pretraining, we apply token masking with the same fixed ratio to SimCLR, MLM, and SpanMLM, in which 80% of the randomly chosen token positions are replaced with the [MASK] token, 10% of the positions are replaced with a random token, and the remaining 10% of the positions are unmodified. We apply Wav2Vec settings with 2 codebooks, 320 entries per codebook, a masking ratio of 65%, and a feature gradient multiplication of 0.1, which slows down the event encoder gradient update. For the codebook diversity loss weight, we use 0.1, 0.3, 0.1, and 0.5 for MIMIC-III, eICU, MIMIC-IV, and the pooled domain, respectively.
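The 80/10/10 replacement rule for chosen token positions can be sketched as below. This is a minimal illustration, assuming the positions to corrupt have already been selected; the function name `corrupt_selected` and the toy vocabulary are hypothetical.

```python
import random
from typing import List


def corrupt_selected(tokens: List[str], selected: List[int], vocab: List[str],
                     rng: random.Random, mask_token: str = "[MASK]") -> List[str]:
    """Apply the 80/10/10 rule to the positions already chosen for masking."""
    out = list(tokens)
    for pos in selected:
        draw = rng.random()
        if draw < 0.8:
            out[pos] = mask_token            # 80%: replace with [MASK]
        elif draw < 0.9:
            out[pos] = rng.choice(vocab)     # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return out


rng = random.Random(0)
corrupted = corrupt_selected(["a", "b", "c", "d"], [1, 3], vocab=["x", "y"], rng=rng)
```

Positions outside `selected` are never touched, and the prediction target at each selected position remains the original token regardless of which of the three branches was taken.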

D. Experimental Design
To assess the efficacy of GenHPF in various aspects, we developed a series of prediction tasks across four distinct scenarios: (1) single-domain learning, (2) pooled learning, (3) transfer learning, and (4) self-supervised learning. For pooled learning and transfer learning, we follow the settings from DescEmb [25]. For single-domain learning, models are trained and tested on a single dataset. This part tests GenHPF for single-domain learning, although its primary aim is multi-source learning. Pooled learning leverages the wealth of EHR data collected from multiple EHR systems for prediction tasks. Each framework is trained simultaneously on all three datasets and evaluated separately on each dataset. We compare the performance of single-domain learning and pooled learning to show that training on multiple datasets enhances predictive performance compared with models trained on a single dataset.
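In the pooled setting, one model is trained on the combined data while a separate best checkpoint is tracked per dataset. The sketch below shows only that bookkeeping; the validation values are made-up placeholders rather than reported results, the function name is hypothetical, and patience-based stopping is not modeled.

```python
from typing import Dict, List


def best_epoch_per_dataset(val_auroc: Dict[str, List[float]]) -> Dict[str, int]:
    """Return, for each dataset, the epoch with the highest validation AUROC."""
    return {name: max(range(len(scores)), key=scores.__getitem__)
            for name, scores in val_auroc.items()}


# Hypothetical validation curves of one pooled model, one value per epoch.
val_auroc = {
    "MIMIC-III": [0.70, 0.74, 0.73],
    "eICU":      [0.68, 0.71, 0.72],
    "MIMIC-IV":  [0.72, 0.71, 0.70],
}
best = best_epoch_per_dataset(val_auroc)
```

The checkpoints selected for different datasets may come from different epochs of the same training run, so each dataset is tested with the model state that validated best on it.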
Next, in transfer learning, we aim to show that GenHPF can be beneficial when trained on a specific dataset and directly tested on other datasets (zero-shot learning) or when further trained on limited data (few-shot learning). In practice, a single deep-learning model is typically trained on a large-scale hospital dataset and subsequently transferred to individual institutions, which could enable small hospitals to benefit from models trained on a large scale. Apart from acquiring large and representative datasets, this also entails ensuring compatibility between code and data schemas across different EHR systems, akin to what is necessary in pooled learning. In this scenario, each model is first trained on a source dataset and then either directly evaluated on the target dataset (i.e., zero-shot) or further trained (i.e., fine-tuned) on the target dataset.
Finally, we investigate which SSL method with unlabeled data exhibits a performance improvement when fine-tuning the pretrained model on the prediction task.To demonstrate the benefit of our approach, we compare three models: 1) a randomly initialized model trained on a single dataset; 2) a pretrained model and fine-tuned on a single dataset; 3) a pretrained model on the multi-source (pooled) dataset and finetuned on a single dataset.Additionally, we assess the impact of pretraining on different fine-tuned data size settings, namely sample data, and full data, assuming that a pretrained model can be fine-tuned on a smaller hospital or a similar-sized hospital.
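As a rough illustration of the supervised scenarios above, the following sketch enumerates the training and evaluation configurations. It is not the authors' code: the function names are hypothetical, the target fractions are taken from the x-axis values quoted in the transfer learning results, and it assumes every ordered source-target pair is evaluated.

```python
from itertools import permutations

DATASETS = ["MIMIC-III", "eICU", "MIMIC-IV"]
# Assumption: the fractions quoted for the transfer-learning plots.
TARGET_FRACTIONS = [0.0, 0.5, 1.0]


def single_domain_runs():
    """Train and evaluate on the same dataset."""
    return [{"train": [d], "eval": d} for d in DATASETS]


def pooled_runs():
    """Train on all datasets together, evaluate on each one separately."""
    return [{"train": list(DATASETS), "eval": d} for d in DATASETS]


def transfer_runs():
    """Train on a source, then evaluate zero-shot (fraction 0.0) or after
    fine-tuning on a fraction of the target training data."""
    return [
        {"source": src, "target": tgt, "target_fraction": frac}
        for src, tgt in permutations(DATASETS, 2)
        for frac in TARGET_FRACTIONS
    ]


if __name__ == "__main__":
    for run in single_domain_runs() + pooled_runs() + transfer_runs():
        print(run)
```

Under these assumptions there are 3 single-domain runs, 3 pooled evaluations of one jointly trained model, and 18 transfer configurations (6 ordered pairs times 3 fractions) per framework.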

A. Single-domain learning
Figure 3 shows the single-domain learning results. On average across the 12 tasks, GenHPF shows comparable or higher prediction performance than the other frameworks that use domain knowledge (+0.8%P AUROC on average against all frameworks on all three datasets; Fig. 3 (a), circle marks). Appendix I-A provides the comparison results of GenHPF with a baseline involving more feature engineering.
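To make the "%P" (percentage point) notation concrete, here is a minimal sketch that computes a binary AUROC from scratch and the percentage-point gap between two sets of per-task AUROCs. The function names are hypothetical, and the sketch covers only the binary case; how the multi-class and multi-label AUROCs are averaged in the paper is not specified here.

```python
def auroc(labels, scores):
    """Binary AUROC: probability that a random positive is scored above a
    random negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def percentage_point_gap(aurocs_a, aurocs_b):
    """Mean AUROC of A minus mean AUROC of B, in percentage points (%P)."""
    mean_a = sum(aurocs_a) / len(aurocs_a)
    mean_b = sum(aurocs_b) / len(aurocs_b)
    return 100.0 * (mean_a - mean_b)


if __name__ == "__main__":
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(percentage_point_gap([0.80, 0.90], [0.79, 0.89]))
```

A gap of +0.8%P therefore means the average AUROC is higher by 0.008 on the 0 to 1 scale, not by a relative 0.8%.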

B. Pooled learning
The results reveal that GenHPF exhibits a significant improvement in pooled learning when trained simultaneously on all three datasets, outperforming all other frameworks (+1.2%P, Figure 3 (a) triangles). This highlights the advantages of GenHPF, which utilizes the textual representations of all features. Compared with single-domain learning results, text-based embedding models (DescEmb and GenHPF) consistently demonstrate higher performances when trained on pooled datasets from all three sources. In contrast, conventional embedding models (SAnD and Rajkomar) show decreased or unchanged performances for pooled learning. In addition, among the text-based embedding models, GenHPF outperforms DescEmb in most cases when all three data sources are pooled together.

C. Transfer learning
Figure 4 presents the transfer learning results. To evaluate how the performance of the frameworks varies with the target dataset size, we use different proportions of the target dataset: x=0.0 indicates zero-shot learning, x=0.5 means fine-tuning with half of the target dataset, and x=1.0 is for fine-tuning with the entire target dataset. For zero-shot learning, the text-based embedding methods (DescEmb and GenHPF) consistently outperform the code-based embedding methods (SAnD and Rajkomar) across all source and target pairs. GenHPF demonstrates higher performance than the other models in most cases (+2.6%P, red line over other lines). As the sample size of the target dataset decreases, the strength of GenHPF becomes more apparent (+12.5%P, performance at x=0.0). After further fine-tuning on the full target dataset (x=1.0), the code-based embedding models perform worse than the single-domain learning performance of GenHPF (dashed line) in most cases. In contrast, GenHPF exhibits comparable or higher performance than single-domain learning, except when the model is trained on MIMIC-III and transferred to eICU. Next, we introduce two additional baselines capable of automatically mapping different code systems between two EHR datasets using unsupervised learning. GenHPF outperforms these unsupervised code-mapping methods, as shown in Table IV.
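The target-fraction protocol can be sketched as follows. This is an illustrative helper, not the authors' sampling code: the function names and the seeded shuffle are assumptions. The second helper mirrors the standard error over seeds that the Figure 4 caption reports as shading.

```python
import random
from math import sqrt
from statistics import mean, stdev


def sample_target_subset(target_train_ids, fraction, seed):
    """Return a reproducible random subset of the target training IDs.

    fraction=0.0 yields an empty list (zero-shot: no fine-tuning data),
    fraction=1.0 yields every ID (full fine-tuning).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    ids = list(target_train_ids)
    random.Random(seed).shuffle(ids)
    return ids[: round(fraction * len(ids))]


def mean_and_standard_error(aurocs_per_seed):
    """Mean and standard error (sample std / sqrt(n)) across seeds."""
    n = len(aurocs_per_seed)
    if n < 2:
        raise ValueError("need at least two seeds for a standard error")
    return mean(aurocs_per_seed), stdev(aurocs_per_seed) / sqrt(n)


if __name__ == "__main__":
    subset = sample_target_subset(range(100), fraction=0.5, seed=0)
    print(len(subset))
    print(mean_and_standard_error([0.80, 0.82, 0.84]))
```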

D. Self-supervised pretraining
Table II presents the results. Pretraining sources (PT Source) come in two settings: single (same as the fine-tuning dataset) and multi-source (MIMIC-III + eICU + MIMIC-IV). Fine-tuning (FT) data sizes are varied by sampling (10%, 30%, full). ⋆ indicates the p-value from the t-test between the randomly initialized model and pretrained model results. The highest performance for each fine-tuning source, corresponding to the size of the fine-tuning data, is highlighted in bold, and its p-value is indicated in parentheses.
The results show that GenHPF coupled with self-supervised pretraining methods (except SpanMLM) improves the prediction performance in most cases compared to models without pretraining. Among the pretraining methods, SimCLR consistently outperforms the others, exhibiting the highest prediction AUROC for both the sample-data and full-data scenarios. In particular, SimCLR exhibits an average increase of 0.1%P and 0.6%P in the AUROCs for single- and multi-source pretraining, respectively. The sample-data results show that pretraining has a larger effect on the downstream prediction tasks when the quantity of pretraining data exceeds that of the fine-tuning data by a wider margin.
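The significance markers in Table II come from a t-test between randomly initialized and pretrained runs. The text does not say which variant was used, so the sketch below assumes Welch's unequal-variance form and computes only the test statistic and degrees of freedom with the standard library; the p-value would then come from a t-distribution, for example via SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)`. The function name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    a, b: per-seed AUROCs of two models (at least two values each, and
    not both constant, otherwise the statistic is undefined).
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    print(welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```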

V. DISCUSSION
In this work, we addressed the dual challenges of multi-task prediction models for heterogeneous EHRs by proposing and investigating GenHPF for single-domain learning, pooled learning, transfer learning, and self-supervised pretraining. The results show that GenHPF achieves comparable or higher performances without relying on medical domain knowledge and by simply using all features as textual descriptions.
In particular, for single-domain learning, a comparison between GenHPF and Rajkomar suggests that assigning unique embeddings to all feature names and values is unnecessary, since treating them as textual descriptions leads to a comparable performance. Moreover, a comparison between GenHPF and DescEmb implies that, by utilizing all available information in a medical event, GenHPF can better capture the underlying semantics of distinct EHR sources than DescEmb. That is, applying medical domain knowledge to select a subset of meaningful features does not necessarily lead to a higher performance compared with simply using all possible features. Overall, the single-domain learning results show that GenHPF achieves comparable or higher performances, even without relying on medical domain knowledge, by simply using all features as textual descriptions. The AUROC improvement achieved without significant feature engineering makes this evident.
In pooled learning, both text-based embedding models (DescEmb and GenHPF) significantly improved the prediction performance compared with conventional code-based embedding models. This gap arises because the MIMIC and eICU datasets do not share codes: training conventional code-based embedding models on the pooled dataset expands the number of embeddings required for feature names and values, which prevents the models from leveraging the larger amount of training data. Conversely, text-based embedding models can take advantage of the extensive volume of various sources since the sub-words of medical descriptions are common, even among entirely dissimilar EHR systems. Furthermore, even within the text-based embedding models, GenHPF outperforms DescEmb in most cases, although DescEmb uses manually selected features from each dataset. This highlights the advantage of GenHPF: it does not rely on any domain knowledge but rather uses all features in a textual form regardless of the EHR schema used.
In transfer learning, we observe a pattern similar to that in pooled learning; text-based embedding models consistently outperform code-based embedding methods. Through these experiments, we demonstrate that GenHPF effectively resolves two challenges (multi-task learning and multi-source learning). For multi-task learning, GenHPF outperforms SAnD and DescEmb, which employ feature selection based on domain knowledge. Regarding multi-source learning, GenHPF demonstrated better performance than conventional embedding models such as Rajkomar and SAnD.
The self-supervised pretraining results show that SimCLR consistently outperforms the other methods. We conjecture that SimCLR's pretraining process effectively facilitates prediction in downstream tasks by learning patient-level representations, whereas the other pretraining methods focus on learning either token-level or event-level representations within the same patient. Furthermore, the performance improvement of GenHPF with multi-source pretraining provides insights into the necessity of pretraining on pooled heterogeneous EHRs, which we believe is essential for large-scale EHR modeling.
Implementing GenHPF in a real-world hospital requires appropriate hardware resources, including GPUs connected to the EHR database. Once operational, the framework minimally preprocesses patient data for various prediction tasks. A key advantage of GenHPF is that it can be integrated into any EHR system without requiring specific modifications, thereby significantly reducing both time and implementation costs. However, this minimal-preprocessing approach results in a larger input size, which increases the computational requirements.
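The argument about shared sub-words can be illustrated with a toy example. Everything below is hypothetical: the codes and descriptions are invented for illustration and do not come from MIMIC or eICU, and whitespace tokenization stands in for the sub-word tokenizer a text-based model would actually use.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def tokens(descriptions):
    """Lower-cased whitespace tokens (a stand-in for sub-word tokens)."""
    return {tok for d in descriptions for tok in d.lower().split()}


# Hypothetical code -> description dictionaries for two EHR sources.
source_a = {
    "A-001": "creatinine blood",
    "A-002": "bilirubin total",
    "A-003": "platelet count",
}
source_b = {
    "LAB_77": "creatinine serum",
    "LAB_12": "total bilirubin",
    "LAB_40": "platelet count",
}

# Code vocabularies are disjoint, so pooled code embeddings share nothing.
code_overlap = jaccard(source_a.keys(), source_b.keys())
# Text vocabularies overlap heavily, so pooled text embeddings are shared.
token_overlap = jaccard(tokens(source_a.values()), tokens(source_b.values()))

if __name__ == "__main__":
    print(code_overlap, token_overlap)
```

In this toy setting the code vocabularies share no entries while five of the seven distinct description tokens are shared, which is the mechanism the paragraph above attributes the pooled-learning gap to.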

VI. LIMITATION
Although GenHPF demonstrated promising results, it still has limitations. First, since GenHPF utilizes as many features as possible from EHR events, computational constraints must be considered. Therefore, we used a subset of EHR events (lab tests, prescriptions, and input events) in this work. Better performance is expected if we exploit all EHR event types using more memory-efficient models [45], [46].
Second, as the current framework for multi-source learning relies on textual representation, it is limited to EHRs that share the same language. Lastly, we used only tabular data in the EHR; thus, future studies should consider incorporating additional modalities (e.g., radiographic images) into the framework.

VII. CONCLUSION
In conclusion, our study illustrates the potential of GenHPF for various learning scenarios, including single-domain learning, pooled learning, transfer learning, and self-supervised pretraining. The effectiveness of the framework without relying on medical domain knowledge and its ability to capture the underlying semantics of distinct EHR sources make it a promising approach for large-scale EHR modeling in the future. Furthermore, with the advent of large language models (LLMs) such as ChatGPT, feeding text-based EHRs into an LLM via the GenHPF framework (with its ability to handle any EHR in text form) would allow for EHR predictions, either by fine-tuning the LLM or by using in-context learning. This would open up a wide set of applications, such as predicting patient outcomes, intervention, and personalizing patient care, that could reduce complications and improve patient care with less reliance on EHR schemas and feature engineering.

A. Comparison of GenHPF with Benchmark [21]
We compare GenHPF with Benchmark [21], as shown in Table III. The best performances for each dataset are in bold. While Benchmark offers an expert-designed, feature-engineered prediction pipeline, comparing it with GenHPF allows us to assess the effectiveness of our method, which operates without domain-specific knowledge. Benchmark originally used all tables, including lab tests and chart events. Due to the high computational demands of the numerous chart events, we limited our comparison to the lab test table. This ensures a fair comparison, as both our method and Benchmark then use only the lab test events. GenHPF exhibits a higher performance than Benchmark in most prediction tasks.

B. Comparison of GenHPF with unsupervised learning methods in transfer learning
AutoMap [24] and Muse [44] use the same model architecture as Rajkomar [22] but can leverage embeddings learned through unsupervised pretraining of code features between the source and target datasets. We use these baselines for a fair comparison when transferring code-based embedding models, providing them with pretrained rather than randomly initialized embeddings. Results are shown in Table IV. The two unsupervised learning methods for code mapping do not exhibit improvement over Rajkomar in the full dataset performance. This indicates that pretraining with code mapping between two sources using different EHR code schemes does not yield a performance improvement; note that the original paper did not conduct experiments across different EHRs. However, GenHPF, which utilizes text-based embedding, outperforms the baselines (AutoMap, Muse, and Rajkomar) in both zero-shot learning and full dataset fine-tuning.

APPENDIX II STATISTICS FOR PREDICTION TASKS
This section presents the statistics for the prediction tasks. All numbers represent the composition ratios as percentages.

Binary Classification Tasks Table V
The tasks include predicting mortality, long-term mortality, los3, los7, and readmission.

Multi-class Classification Tasks Table VI
The tasks include predicting the final acuity, imminent discharge, and several lab values (creatinine, bilirubin, platelets, and WBC). For laboratory values, 'Null' denotes ICU samples that involve dialysis. The loss is not computed for this null class during training phases. For the final acuity and imminent discharge, samples outside the predefined classes are marked as 'Null'.

Multi-label Classification Tasks Table VII
Each row represents a class label, and the corresponding percentages denote the proportion of instances assigned to each class in the respective dataset. Class unification across the datasets follows DescEmb [25].
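The statement that no loss is computed for the 'Null' class in the multi-class tasks can be sketched as a masked cross-entropy. This is a plain-Python illustration under assumptions: the sentinel value `NULL_LABEL` and the function name are hypothetical, and the authors' training code may implement the masking differently (the same effect is commonly obtained with the `ignore_index` argument of PyTorch's `torch.nn.CrossEntropyLoss`).

```python
from math import exp, log

NULL_LABEL = -1  # hypothetical sentinel marking 'Null' samples


def masked_cross_entropy(logits_batch, labels):
    """Mean cross-entropy over samples whose label is not NULL_LABEL.

    logits_batch: list of per-class logit lists, one per sample.
    labels: list of integer class indices, or NULL_LABEL to skip a sample.
    Returns 0.0 when every sample in the batch is masked out.
    """
    total, count = 0.0, 0
    for logits, label in zip(logits_batch, labels):
        if label == NULL_LABEL:
            continue  # 'Null' samples contribute nothing to the loss
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + log(sum(exp(v - peak) for v in logits))
        total += log_norm - logits[label]
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    batch = [[0.0, 0.0], [5.0, 1.0]]
    print(masked_cross_entropy(batch, [0, NULL_LABEL]))
```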

• Final Acuity (Fi_ac) (multi-class): Predicts the patient's discharge location at the end of their hospital stay, including patient expiration.
• Imminent Discharge (Im_disch) (multi-class): Predicts whether the patient will be discharged within a prediction window of 48 h and, if so, the discharge destination.
• Diagnosis (Dx) (multi-label): Predicts all diagnosis (Dx) codes accumulated during an entire hospital stay. We group Dx codes into 18 Dx classes using Clinical Classification Software (CCS) for the ICD-9-CM criteria [healthcare2016hcup].
• Lab values (multi-class): Four distinct laboratory values are predicted. Three of them [Creatinine (Crt), Bilirubin (Blr), Platelets (Plt)] are categorized into 5 classes based on their corresponding ranges. These classes are derived using the thresholds employed to determine the Sequential Organ Failure Assessment (SOFA) scores. White blood cell (Wbc) is categorized into 3 classes.
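As an illustration of SOFA-based binning for the lab-value tasks, the sketch below maps creatinine and platelet measurements to five classes. The cutoffs are the standard SOFA renal (creatinine, mg/dL) and coagulation (platelets, 10^3/µL) thresholds; the text above does not list the exact bin edges or units it uses, so treat these values and the function names as assumptions. The 'Null' (dialysis) class and the 3-class Wbc task are not covered.

```python
from bisect import bisect_right

# Assumed cutoffs: standard SOFA renal thresholds, serum creatinine in mg/dL.
CREATININE_CUTS = [1.2, 2.0, 3.5, 5.0]
# Assumed cutoffs: standard SOFA coagulation thresholds, platelets in 10^3/uL.
PLATELET_CUTS = [20, 50, 100, 150]


def creatinine_class(value_mg_dl):
    """Class 0 (< 1.2) up to class 4 (>= 5.0); higher means more severe."""
    return bisect_right(CREATININE_CUTS, value_mg_dl)


def platelet_class(value_k_per_ul):
    """Class 0 (>= 150) up to class 4 (< 20); lower counts are more severe."""
    return len(PLATELET_CUTS) - bisect_right(PLATELET_CUTS, value_k_per_ul)


if __name__ == "__main__":
    print([creatinine_class(v) for v in (0.9, 1.2, 2.5, 4.0, 6.1)])
    print([platelet_class(v) for v in (200, 120, 75, 30, 10)])
```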

Fig. 3. Comparison of single-domain learning and pooled learning prediction performances. (A) Results of the average AUROC on 12 prediction tasks. The data sources used for the evaluation are at the top of each graph. The y-axis indicates the AUROC. Each dot represents a model (color) and the source datasets used for training (shape), following the legends. Note that "Single" refers to the same data source as the evaluation dataset. The blue dashed line separates models into conventional embedding models (left: SAnD, Rajkomar) and text-based embedding models (right: DescEmb, GenHPF). Stars indicate the p-value of the t-test conducted to assess the significance between single-domain learning and pooled learning. (B) Results of each prediction task using MIMIC-III as the source dataset.

Fig. 4. Transfer learning results. The source data used for training and target data used for evaluation with zero-shot or few-shot learning are indicated at the top of each graph. The source dataset is on the left side of the arrow, and the target is on the right. The y-axis indicates the AUROC and the x-axis is the portion of the target dataset used for zero-shot or few-shot learning. Shading around the lines indicates the standard error from five seed experiments. For comparison, the single-domain learning performances of GenHPF are marked with the dashed line.

TABLE III COMPARISON WITH BENCHMARK MODEL (ONLY LAB FEATURES)

TABLE VII STATISTICS FOR MULTI-LABEL CLASSIFICATION TASK (DX)