Self-Supervised Forecasting in Electronic Health Records With Attention-Free Models

Yogesh Kumar, Alexander Ilin, Henri Salo, Sangita Kulathinal, Maarit K. Leinonen, and Pekka Marttinen

Despite the proven effectiveness of Transformer neural networks across multiple domains, their performance with electronic health records (EHRs) can be nuanced. The unique, multidimensional sequential nature of EHR data can sometimes make even simple linear models with carefully engineered features more competitive. Thus, the advantages of Transformers, such as efficient transfer learning and improved scalability, are not always fully exploited in EHR applications. Addressing these challenges, we introduce SANSformer, an attention-free sequential model designed with specific inductive biases to cater to the unique characteristics of EHR data. In this work, we aim to forecast the demand for healthcare services by predicting the number of patient visits to healthcare facilities. This challenge is amplified when dealing with divergent patient subgroups, such as those with rare diseases, which are characterized by unique health trajectories and are typically smaller in size. To address this, we employ a self-supervised pretraining strategy, generative summary pretraining (GSP), which predicts future summary statistics based on the past health records of a patient. Our models are pretrained on a health registry of nearly one million patients and then fine-tuned for specific subgroup prediction tasks, showcasing the potential to handle the multifaceted nature of EHR data. In evaluation, SANSformer consistently surpasses robust EHR baselines, with our GSP pretraining method notably amplifying model performance, particularly within smaller patient subgroups. Our results illuminate the promising potential of tailored attention-free models and self-supervised pretraining in refining healthcare utilization predictions across various patient demographics.


Impact Statement: Large neural networks have demonstrated success in various predictive tasks using EHRs. However, their performance in small, divergent patient cohorts, such as those with rare diseases, often falls short of simpler linear models due to the substantial data requirements of large models. To address this, we introduce an attention-free sequential model together with a self-supervised pretraining strategy that improves predictive performance for such small subgroups.

I. INTRODUCTION
The remarkable success of Transformer-based architectures [66] in natural language processing (NLP) and computer vision benchmarks [15], [67] has extended to certain applications within the realm of electronic health records (EHRs). In instances where EHRs encompass natural language input, such as discharge notes written by healthcare professionals, Transformers have achieved impressive results [30], [69]. Nonetheless, the unique structure and characteristics of EHR data, especially sequences of clinical codes, present challenges that can sometimes hinder Transformers from consistently outperforming simpler models with carefully engineered features [1], [6], [32]. While Transformers excel in many areas, their performance with EHR data can be nuanced, often requiring extensive pretraining on large datasets. This can render them less practical for applications with smaller or specialized datasets.
Clinical code sequences in EHRs exhibit unique properties in contrast to typical text or image data [54]. For example, while consecutive words in a sentence or pixels in an image are usually strongly correlated, consecutive visits in an EHR dataset can be disparate in terms of time and context. These visits may relate to different health issues and can be spaced years apart. This discrepancy has resulted in situations where simpler models such as linear regression and recurrent neural networks (RNNs) still remain competitive [11], [33], [55]. Motivated by this observation, we hypothesize that the computationally and memory intensive self-attention mechanism in Transformers might be too complex for this application, leading us to explore a simpler yet effective alternative. In this article, we propose a sequential architecture for EHR analysis inspired by recent works that eliminate the need for recurrent, convolution, or self-attention mechanisms [41], [64].
Our model, which we name SANSformers (sequential analysis with nonattentive structures for electronic health records), is specifically designed with inductive biases that accommodate the unique features of EHR data. In particular, we incorporate axial mixing to address the inherent multidimensionality of clinical codes within each visit, and Δτ embeddings to accurately capture the time lapses between visits. This work thus marks a critical departure from conventional Transformer applications and offers a novel and potentially more efficient way to analyze EHR data.
Our overarching motivation is to construct a model capable of predicting future healthcare utilization, a measure of an individual's use of medical services (such as hospital stays, doctor visits, and medical procedures), based on their disease history and other pertinent factors. Such prediction models are key to efficient management and planning in healthcare [49], [63] and are, for example, already employed in numerous countries to allocate healthcare resources [45]. One significant challenge lies in making accurate predictions for patients belonging to divergent subgroups. These subgroups are characterized by differences in the distribution of the dependent variable. For instance, patients diagnosed with a specific disease often exhibit health history trajectories that deviate from those of other patients, as well as from their own past patterns. This divergence can be particularly pronounced in cases of severe or chronic illnesses. Existing health economics research has addressed this challenge by developing models for specific subgroups [18], [61]. However, building specialized models for each subgroup is not feasible when the subgroup size is small [21], and training a single model for the entire population could yield inaccurate results for divergent subgroups.
Addressing this issue, we borrow a successful strategy from NLP and computer vision, where models pretrained on larger corpora show improved sample efficiency when fine-tuned for specific tasks [7], [9], [16], [17], [52]. EHR datasets are known to be noisy [48], [57], posing a potential issue for models such as GPT [52] that are trained to predict the next token. In the presence of noise, these models may end up fitting the noise instead of the actual structure of the data. To mitigate this issue, we apply the principle of pretraining in the EHR context, introducing a self-supervised regime, generative summary pretraining (GSP), that predicts summary statistics for a future window in the patient's history, such as the number of visits in the next year, based on current and past inputs. By predicting summary statistics, the effects of noise are alleviated, allowing for more reliable pretraining.
In our evaluation, we focus on patients diagnosed with type 2 diabetes (T2D), bipolar disorder (BP), and multiple sclerosis (MS), whose visitation patterns diverge strongly from the broader population (p < 0.001, t-test), as illustrated in Fig. 1, which features the BP subgroup. This underscores the complexity and variability of health trajectories across different subgroups, highlighting the need for flexible and adaptable predictive models, such as our SANSformers.
Our main contributions are summarized as follows.
1) We introduce SANSformers, a novel, attention-free sequential model specifically designed for EHR data, supplemented with inductive biases such as axial decomposition and Δτ embeddings to cater to the unique challenges posed by the EHR domain.
2) We conduct extensive comparisons of SANSformers with strong baseline models on two real-world EHR datasets. Our results highlight the superior data efficiency and prediction accuracy of our model compared to the existing baselines.
3) We demonstrate the value of self-supervised pretraining on a larger population for predicting future healthcare utilization of smaller, distinct patient subgroups. We introduce GSP, a self-supervised pretraining objective for EHR data, predicting future summary statistics. This offers considerable potential in improving healthcare resource allocation predictions, an application of significant importance in healthcare management.

II. RELATED WORK

A. Deep Learning on EHR
Deep learning has been a focal point in EHR research, with diverse approaches being proposed. For instance, Lipton et al. [40] used long short-term memory (LSTM) networks [27] for EHR phenotyping, applying sequential real-valued measurements of 13 vital signs to predict one of 128 diagnoses. Choi et al. [12] proposed a multitask learning framework using gated recurrent units (GRUs) [10] for phenotyping, showcasing the benefits of knowledge transfer from larger to smaller datasets. They also developed a bidirectional attention-based model, the reverse time attention model (RETAIN), to enhance model interpretability [11]. Harutyunyan et al. [22] contributed a benchmark framework for evaluating EHR models.
Notably, simpler models, such as linear ones, often exhibit competitive performance on EHR data [1], [4], [48], [55]. These models usually rely on manual feature engineering to construct patient state vectors from longitudinal health data, as opposed to training end-to-end from raw EHR data. In contrast, SANSformers seeks to directly leverage raw longitudinal EHR data, aiming to circumvent the need for extensive feature engineering while maintaining the simplicity and efficacy of nonattention-based architectures.

B. Attention-Less MLP Models
A recent wave of research has questioned the necessity of attention mechanisms in Transformers, especially within the domain of computer vision. For instance, the MLP-mixer proposed by Tolstikhin et al. [64] replaced attention with "mixing" operations, and Liu et al. [41] introduced spatial gating mechanisms to further simplify the architecture. Other studies echo these findings [37], [46], [65]. These developments challenge the indispensability of expensive self-attention mechanisms and motivate the extension of such models to EHR data. SANSformers represents, to our knowledge, the first endeavor to apply attention-less Transformers to EHR data. Fig. 2 highlights the distinctions between SANSformers and regular Transformers. While we replaced the self-attention mechanism, retaining other proven components from the Transformer model was a deliberate decision to preserve stability and performance in EHR applications.
C. Pretraining on EHR Data

Applying pretrained Transformers to EHR data has shown promise, as evidenced by studies from Li et al. [39] and Rasmy et al. [54], who achieved improved performance over RNNs using pretrained BERT Transformers. Their work only considered single discrete sequences, specifically diagnosis codes. Meng et al. [47] further extended this to multimodal data by using a topic model to also utilize the patient history from clinical notes. Our proposed SANSformer model broadens this approach by incorporating diagnoses, procedures, and patient demographics. While the direct application of BERT-based masked language model (MLM) pretraining to clinical code data has yielded encouraging results, this strategy lacks customization for EHR data or the specific tasks at hand. Consequently, we propose our GSP strategy, which is purposefully designed to better leverage the unique structure and characteristics of EHR data.
Shang et al. [58] introduced a unique approach, G-BERT, for medication recommendation. Their model combined graph neural networks (GNNs) with BERT to efficiently represent medical codes, accounting for hierarchical structures. G-BERT was pretrained on single-visit patient data, then fine-tuned on longitudinal data, achieving superior performance on the medication recommendation task. This model aligns with SANSformer's self-supervised pretraining approach, although SANSformer focuses on attention-less architectures.
Moreover, the strategy of reverse distillation, which initializes deep models using high-performing linear models, has shown notable success in clinical prediction tasks. The self-attention with reverse distillation (SARD) architecture [33], designed explicitly for insurance claims data, employs a mix of contextual and temporal embeddings along with self-attention mechanisms, achieving superior performance on a variety of clinical prediction tasks, largely attributable to reverse distillation.
An extensive review by Krishnan et al. [34] underscored the immense potential of self-supervised learning methods in healthcare. These methods can effectively leverage large-scale unannotated data across various modalities such as EHRs, medical images, bioelectrical signals, and gene and protein sequences. The review highlights the value of these methods in handling multimodal datasets and addresses the challenges related to data bias, reinforcing the suitability of self-supervised learning for EHR data and its promising role in the advancement of medical AI.

D. Handling Intravisit Data in EHR
Handling intravisit data, the clinical codes arising from a single patient visit, presents a critical challenge in deep learning on EHR data. Common methods include summing or flattening codes across the intravisit axis, but this may lead to a loss of nuanced information [39], [47]. Choi et al. [13] proposed a convolutional transformer model that incorporates an inductive bias for modeling interactions between different codes within a single visit. This concept was extended by Kumar et al. [35], who used a 1 × 1 convolution to model these interactions, preserving more details from the intravisit data. Our approach, SANSformers, tackles this issue by leveraging axial decomposition, allowing us to model interactions within intravisit data without resorting to flattening or summing. This approach retains critical information that could enhance prediction accuracy.

III. DATA

A. Patient Cohorts
We used two EHR data sources for our experiments: a confidential dataset (Pummel) that was sourced from the Care Register for Health Care and the Register of Primary Health Care visits maintained by the Finnish Institute for Health and Welfare (THL), and the smaller, publicly available MIMIC-IV dataset, on which readers can run our method and replicate our findings. Table I lists some basic statistics of both datasets.
1) Pummel: The Pummel dataset encompasses the pseudonymized EHRs of all Finnish citizens aged 65 or older who have engaged with either primary or secondary healthcare services. These data span seven years (2012 to 2018) and capture a wide array of interactions with healthcare facilities, from scheduled appointments and phone consultations to nursing home visits and hospital admissions.
The raw EHR data comprise a multitude of variables. These include medical diagnostic codes, as categorized by the International Classification of Diseases (ICD-10) and the International Classification of Primary Care (ICPC-2), surgical procedure codes [38], and patient demographic details such as age and gender. Information about each visit, including the specialty of the attending physician, is also incorporated in the dataset. To construct a sequential model of patient history, we transformed the tabular data into a sequence of visits. Following this transformation, our dataset comprised 1 050 512 patient sequences spanning from 2012 to 2018, with an average of approximately 58 visits per patient.
2) MIMIC: The MIMIC-IV v1.0 dataset [19], [31] comprises EHRs from approximately 250 000 patients admitted to the Beth Israel Deaconess Medical Center (BIDMC). This extensive dataset includes detailed patient data, such as vitals, doctor notes, diagnoses, procedure and medication codes, discharge summaries, and more, collected during both hospital and ICU admissions.
For the purpose of this study, we focused solely on hospital admissions. Relevant patient information was extracted from the patients, admissions, diagnoses_icd, procedures_icd, drgcodes, and services tables. Visit information was grouped by the admission identifier hadm_id, while all visits were grouped by the patient identifier subject_id. Our modified dataset thus comprised 256 878 patients, averaging 2.04 visits per patient.

B. Prediction Tasks
To evaluate our model's performance, we established prediction tasks on both the Pummel and MIMIC-IV datasets. For the Pummel dataset, our primary goal was to predict the healthcare utilization of specific patient subgroups, determined by diagnosed diseases. Rather than directly estimating monetary demand, which might vary between countries, we formulated two tasks indicative of healthcare utilization, to serve as a proxy for actual healthcare costs. Using one-year EHR histories, we predicted the following variables for each patient for the subsequent year.
1) Task 1, Pummel Visits: the number of physical visits to healthcare centers (y_count).
2) Task 2, Pummel Diagnoses: the counts of physical visits due to six specific disease categories (y_diag).

Both of these tasks are intrinsically related to healthcare costs. The Pummel Diagnoses task specifically focuses on six disease categories identified as significant healthcare resource consumers in Finland [25]. These include cancer (ICD-10 codes starting with C and some D), endocrine and metabolic diseases (E), diseases of the nervous system (G), diseases of the circulatory system (I), diseases of the respiratory system (J), and diseases of the digestive system (K).
For both tasks, we modeled the outcome as a Poisson distribution and used a neural network to estimate the expected event rate λ for each patient. During the training process, the negative log-likelihood of the Poisson distribution is minimized, thereby serving as our loss function.
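As a concrete illustration, a minimal PyTorch sketch of this objective follows (PyTorch is the framework used in our experiments, Section V-A); the tensor shapes and variable names are illustrative rather than taken from our implementation.

```python
import torch
import torch.nn as nn

# Illustrative batch: the network predicts log(lambda) per patient for
# numerical stability, hence log_input=True in the loss below.
log_rate = torch.randn(32, requires_grad=True)        # predicted log event rates
visit_counts = torch.randint(0, 36, (32,)).float()    # observed next-year visit counts

# Negative log-likelihood of the Poisson distribution (up to a constant).
poisson_nll = nn.PoissonNLLLoss(log_input=True)
loss = poisson_nll(log_rate, visit_counts)
loss.backward()
```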
In the MIMIC dataset, we predicted the following.

1) Task 3, MIMIC Mortality: the probability of inpatient mortality (y_death).

For the MIMIC Mortality task, the hospital_expire_flag feature from the MIMIC-IV admissions table is utilized as an indicator of patient mortality during the current hospitalization episode. To enhance the predictive nature of this task, the two most recent visits from each patient's history were excluded. As a result, task 3 effectively estimates the probability of mortality following the subsequent two visits. The loss function employed for this task was binary cross entropy.

IV. METHODS
Here, we describe our method. Section IV-A describes how the raw sequential patient data are converted into an input tensor. Section IV-B outlines the components of the SANSformer model: an adaptation of axial decomposition to decompose the input tensor according to intervisit and intravisit dimensions (Section IV-B-1), mixers that replace self-attention to update the token representations based on the decomposed intravisit and intervisit information (Section IV-B-2), and details of the positional and temporal encodings (Section IV-B-3). Section IV-C describes the GSP approach for self-supervised learning.

A. EHR Input Transformation Pipeline
EHRs present a substantial challenge for machine learning applications due to their variable structure, heterogeneous content, and large size. We apply a multistep transformation to convert the complex raw data into a form suitable for modeling. Let the sequential patient history be represented by v_{i,t}, which denotes the collected visit records for each patient i at time-steps t = 1, 2, ..., T_i. Notably, here t is the index of the visit rather than absolute time, and T_i is the total number of visits to a healthcare center for patient i. During training, we employ zero-padding to adjust T_i to match the length of the longest sequence within the batch (T). For example, with a batch of two patients with 8 and 12 visits, respectively, we pad the shorter sequence with four pad tokens, resulting in sequences of length T = 12.
Our primary task is to predict a patient's healthcare utilization, defined as the number of visits in the year following the first recorded diagnosis of a specific disease. As input, we use the one-year patient history leading up to this first diagnosis, which we refer to as the "fixed time frame." For example, suppose a patient is diagnosed with T2D on 5 April 2013. The input for the model in this case would be the sequence of visits from 2012 until the end of 2013. The target that the model is trained to predict would then be the number of hospital visits that the patient made in 2014. We describe below how we form the input tensor from the sequential patient history in the fixed time frame.
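A minimal sketch of this windowing step is given below, assuming visits are stored as (date, codes) pairs; the record layout and variable names are illustrative, and the window boundaries follow the calendar-year example above.

```python
from datetime import date

# Hypothetical visit records for one patient: (visit_date, clinical codes).
visits = [
    (date(2012, 3, 1), ["I10", "Z01"]),
    (date(2013, 4, 5), ["E11"]),          # first recorded T2D diagnosis
    (date(2013, 11, 2), ["E11", "H25"]),
    (date(2014, 6, 9), ["E11"]),
]
first_dx_year = 2013

# Input: visits from the year preceding the first diagnosis up to the end
# of the diagnosis year (the "fixed time frame" in the example above).
history = [v for v in visits if first_dx_year - 1 <= v[0].year <= first_dx_year]

# Target: the number of visits in the following year.
y_count = sum(1 for v in visits if v[0].year == first_dx_year + 1)
```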
Each visit record, v_{i,t}, in our EHR dataset is characterized by a multivariate sequence of discrete tokens. These tokens signify various types of healthcare interactions, such as diagnosis codes, procedure codes, and visit specialties. To accommodate these tokens within our computational framework, we perform a two-step transformation: numerical encoding followed by projection into a denser vector subspace. First, we map the discrete tokens onto a single shared vocabulary, W, and represent them as one-hot-encoded (OHE) vectors, c ∈ R^{|W|}. Subsequently, we project these OHE vectors into dense vectors, X_emb ∈ R^E, by multiplying each OHE vector c with an embedding matrix, W_emb ∈ R^{|W|×E}. Postprojection, we reshape the patient data to conform to a T × V × E structure. Here, T denotes the total number of visits, V represents the number of codes per visit (the intravisit dimension), and E corresponds to the dimension of the embedding vectors.
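In PyTorch, the one-hot multiplication is equivalent to an embedding lookup; a minimal sketch with illustrative dimensions follows.

```python
import torch
import torch.nn as nn

T, V, E = 12, 8, 64        # visits, codes per visit, embedding dimension
vocab_size = 10_000        # size of the shared vocabulary W (illustrative)

# Integer-encoded codes for one patient, zero-padded to a T x V grid
# (index 0 is reserved for the pad token).
codes = torch.randint(1, vocab_size, (T, V))

# Dense projection: equivalent to multiplying OHE vectors c with W_emb.
embed = nn.Embedding(vocab_size, E, padding_idx=0)
x_emb = embed(codes)       # shape: (T, V, E)
```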
Given the variability in patients' visit frequencies, the "fixed time frame" approach might yield sparse histories for some patients. To address this, we enrich the temporal representation of visit records by incorporating both the sequence time-step t and the absolute time τ. Additionally, we calculate the time difference Δτ between visits (expressed in days), a strategy that has proven effective in previous studies [3], [11], [12], [39], [54]. The details of the positional encodings are given in Section IV-B-3. This approach optimally exploits the available temporal data, thereby enhancing the predictive capacity of our model.

B. Components of the SANSformer Model
The SANSformer maintains several components from the traditional Transformer model, such as positional encodings, skip-connections [23], and layer normalization [5], due to their demonstrated efficacy in handling sequential data and aiding model training. While the self-attention mechanism is replaced with the mixing mechanism, preserving other aspects of the Transformer model ensures that SANSformers benefit from the robustness and stability that have been empirically validated in numerous applications. The architecture of one SANSformer layer is shown in Fig. 4.
1) Adapting Axial Decomposition for EHR: Deep learning on EHR data encounters significant computational challenges when handling multidimensional input sequences. Flattening these sequences along the time axis and applying traditional attention mechanisms often results in a computational complexity of O(T^2 V^2), where T and V denote the number of visits and the intravisit size, respectively. We propose an axial SANSformer variant that incorporates axial attention, a technique that separately applies mixing along each axis of the input tensor, reducing the complexity to O(V^2 T + V T^2) [26]. For example, with T = 100 visits of V = 10 codes each, this reduces the number of pairwise interactions from roughly 10^6 to roughly 10^5. This mechanism was originally developed for image data, where the idea is to decompose the attention across the whole image into rowwise and columnwise attention, attending to all pixels on the same row/column as the pixel whose representation is being updated.
The axial attention technique for image data requires careful adaptation to the unique structure and semantics of EHR data. In its naive form, axial mixing along the time axis would be equivalent to attending to tokens in previous visits at a specific intravisit index (analogous to attending to rowwise pixels). Although meaningful for images, this loses semantic relevance in the EHR context, where the order of codes within the previous visits may be arbitrary. We address this issue and modify the axial mixing operation by first representing each visit as the sum of its tokens before applying the mixing, allowing us to capture the comprehensive context of previous visits rather than isolated codes. This modification for the Causal Time Mixer is shown in Fig. 3. The intravisit mixer, on the other hand, is similar to columnwise mixing for image pixels.

Fig. 3. Axial decomposition. Illustration of the distinct processing of data components in the axial SANSformer model. When updating a specific token (or code) within a visit, the model independently considers all previous visits (through the Causal Time Mixer) and all other tokens within the same visit (via the Visit Mixer). A prior visit is represented as the summation of its constituent tokens. These separately processed components are combined using a weighted addition mechanism, controlled by the parameter α_axial, highlighting our approach's adaptability to the unique attributes of each prediction task.
Additionally, we refine the combination of visitwise and codewise attended tensors, replacing the simple addition used in the original axial attention mechanism with a weighted average. A parameter, α_axial, is introduced and tuned via cross validation to balance the contributions of these tensors, catering to the unique requirements of specific prediction tasks. These adaptations enhance the axial SANSformer's effectiveness and interpretability when dealing with complex EHR data. By setting α_axial to zero, we get a special case of the model, which we call the additive SANSformer.
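To make the dataflow concrete, below is a schematic PyTorch sketch of one axial block. The mixer submodules stand in for the SGU-based mixers of Section IV-B-2, and the exact weighting of the two branches (here, α_axial on the intravisit branch, so that α_axial = 0 recovers the additive variant) is an illustrative assumption rather than our exact implementation.

```python
import torch
import torch.nn as nn

class AxialBlock(nn.Module):
    """Sketch of one axial SANSformer block (module names are illustrative)."""

    def __init__(self, time_mixer: nn.Module, visit_mixer: nn.Module,
                 alpha_axial: float):
        super().__init__()
        self.time_mixer = time_mixer    # causal mixing across visits
        self.visit_mixer = visit_mixer  # mixing across codes within a visit
        self.alpha = alpha_axial        # hyperparameter tuned via cross validation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, V, E) -- batch, visits, codes per visit, embedding dim.
        B, T, V, E = x.shape
        visit_sum = x.sum(dim=2)               # each prior visit as the sum of its tokens
        time_ctx = self.time_mixer(visit_sum)  # (B, T, E), causal over visits
        intra = self.visit_mixer(x.reshape(B * T, V, E)).reshape(B, T, V, E)
        # Weighted addition of the two branches, broadcast over the codes of a visit.
        return self.alpha * intra + (1.0 - self.alpha) * time_ctx.unsqueeze(2)
```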
2) Introducing Mixers for EHR: After the axial decomposition, each patient is represented by two matrices with dimensions T × E and V × E, corresponding to mixing over time (i.e., across visits) and mixing over tokens (i.e., intravisit), respectively. The SANSformer then performs a nonlinear operation that transforms an input tensor of shape B × T × E, where B is the batch size, T is the sequence length, and E is the embedding dimension, into an output tensor of the same dimensions. Here, we describe only the mixing over time, corresponding to the upper branch in Fig. 4; mixing over tokens is similar except for the causal masking (lower branch in Fig. 4).
In Transformer models, cross-time interaction is crucial and is conventionally achieved by self-attention mechanisms. However, recent research [41], [64], [65] has shown that similar results can be obtained by implementing feedforward layers across the time axis, a mechanism we refer to as "mixers." These facilitate cross-token interaction along a specific axis through matrix multiplication, functioning similarly to 1 × 1 convolutions where the number of channels is equal to the size of the hidden dimension. Despite capturing only second-order interactions, compared to the third-order interactions encompassed by self-attention [41], we hypothesize that mixers provide a satisfactory level of interaction for most applications in the realm of EHRs.
Our method of achieving cross-token interaction through mixing draws inspiration from the spatial gating unit (SGU) introduced in [41]. Given an input X ∈ R^{T×E} with sequence length T and embedding dimension E, we transform it into an output Y of identical dimensions:

Z = GELU(XU),
Y = SGU(Z) V.

Here, U ∈ R^{E×2P} and V ∈ R^{P×E} are trainable weight matrices, while P denotes the projection dimension, usually larger than E. We use Gaussian error linear units (GELUs) [24], a preferred activation function in modern Transformer models, including BERT. The SGU function carries out cross-time mixing by dividing Z ∈ R^{T×2P} along the projected dimension into two portions, Z_1, Z_2 ∈ R^{T×P}. An affine transformation is then applied to one of these segments:

SGU(Z) = Z_1 ⊙ (W Z_2 + b).

Here, W ∈ R^{T×T} is another trainable weight matrix, b is a trainable bias, and ⊙ denotes elementwise multiplication. For simplicity, normalization operations and skip-connections have been left out of the equations above. The GELU activation is used in all mixer components. To facilitate autoregressive model training, we introduce causal masking on the SGU weight matrix, W, to prevent future time-steps from leaking information. This is efficiently done by zeroing all upper-triangular elements of the matrix before the multiplication with Z_2.
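A minimal PyTorch sketch of the causal SGU mixer described above follows; initialization and normalization details are omitted, and the class and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSGUMixer(nn.Module):
    """Sketch of a causal spatial gating unit over the time axis."""

    def __init__(self, seq_len: int, emb_dim: int, proj_dim: int):
        super().__init__()
        self.u = nn.Linear(emb_dim, 2 * proj_dim)  # U in R^{E x 2P}
        self.v = nn.Linear(proj_dim, emb_dim)      # V in R^{P x E}
        self.w = nn.Linear(seq_len, seq_len)       # W in R^{T x T} plus bias b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, E)
        z = F.gelu(self.u(x))                      # Z = GELU(XU), shape (B, T, 2P)
        z1, z2 = z.chunk(2, dim=-1)                # split into Z1, Z2 of shape (B, T, P)
        # Causal masking: zero the upper triangle of W so that position t
        # only mixes information from positions <= t.
        w = torch.tril(self.w.weight)
        gate = torch.einsum("ts,bsp->btp", w, z2) + self.w.bias[None, :, None]
        return self.v(z1 * gate)                   # Y = SGU(Z) V
```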
3) Incorporating Positional and Temporal Encodings: While SANSformers are effective at learning complex patterns, they are intrinsically order-invariant, which means they do not inherently consider the temporal sequence of tokens. To integrate this sequence information, we use positional encodings, which are implemented using sinusoidal functions with varying wavelengths. These encode different positions along the embedding axis, providing unique representations for each token's position:

PE(t, 2i) = sin(t / 10000^{2i/d_model}),
PE(t, 2i + 1) = cos(t / 10000^{2i/d_model}).

Here, PE represents the positional encoding for position t and dimension 2i or 2i + 1. The variable d_model denotes the embedding dimensionality, while t and i denote the position and dimension, respectively.
While these encodings uniquely identify positions, i.e., visit indices, they are insufficient for EHR data due to its temporal aspects. Specifically, the relevance of two visits typically diminishes as the time gap between them increases. To address this, we introduce Δτ embeddings, which represent the elapsed days between consecutive visits. Consequently, the enhanced input to the model is the elementwise addition of the original input, the positional encodings, and the Δτ encodings. This approach incorporates both static and dynamic elements from a patient's medical history, providing a more comprehensive representation for the model.
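The sketch below combines the sinusoidal positional encodings with Δτ encodings; passing the day gaps through the same sinusoidal map is one plausible realization, not necessarily our exact implementation.

```python
import torch

def sinusoidal_encoding(positions: torch.Tensor, d_model: int) -> torch.Tensor:
    """PE(t, 2i) = sin(t / 10000^(2i/d_model)); PE(t, 2i+1) = cos(...)."""
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = positions[:, None].float() / torch.pow(10_000.0, two_i / d_model)
    pe = torch.zeros(positions.shape[0], d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

T, E = 12, 64
x = torch.randn(T, E)                      # visit-level embeddings (illustrative)
t = torch.arange(T)                        # visit indices
delta_tau = torch.randint(0, 365, (T,))    # days between consecutive visits

# Enhanced input: elementwise addition of input, positional, and delta-tau encodings.
x = x + sinusoidal_encoding(t, E) + sinusoidal_encoding(delta_tau, E)
```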

C. Generative Summary Pretraining (GSP)
Our focus on divergent patient subgroups, defined as patients with a certain rare disease diagnosis, necessitates considering patient history only up until the first incidence of the given diagnosis. This context imposes constraints on the data accessible for predictive tasks, in terms of both the size of disease-specific subgroups and the temporal data available. Furthermore, our task confronts an additional challenge often faced in healthcare resource allocation: predictions for a specific year typically rely on just the preceding year's history. Factors contributing to this practice include limited patient histories, patients moving and changing healthcare providers, and the annual cycle of resource allocation [18]. However, we note that the Pummel dataset, with its seven-year history per patient, has previously demonstrated potential to enhance predictive accuracy further by using longer histories when available [35].
To optimize the use of available data and address these challenges, we propose the GSP method. GSP is a self-supervised pretraining strategy designed to utilize the abundant patient data in the general population outside the target patient subgroups. Furthermore, it would be possible to capitalize on the temporal data in the years before the first diagnosis for the target patients, which is otherwise discarded. However, to maintain the integrity of our model's predictions and prevent information leakage between the pretraining and fine-tuning phases, in this work we restrict the pretraining phase to patients who are not part of any specific subgroup considered in our fine-tuning tasks. This strategy ensures a clear separation of data between the two phases, fortifying the reliability of our approach. The training objective for GSP is to generate a summary prediction of healthcare utilization for the upcoming years in a sequential fashion (in the general population, i.e., outside the target subgroups), based on the health records of the preceding year, a target that aligns closely with our primary task.
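Schematically, the GSP objective can be sketched as below, assuming a causal model that emits a rate prediction at every visit; the tensor layout and masking are illustrative.

```python
import torch
import torch.nn as nn

B, T = 32, 50
# Assumed model outputs: log(lambda_t) at each visit t, where lambda_t is the
# predicted rate of visits in the year following visit t.
log_rates = torch.randn(B, T, requires_grad=True)

# Self-supervised targets computed from the raw records: for each position t,
# the number of visits the patient made in the year after visit t.
future_counts = torch.randint(0, 36, (B, T)).float()
pad_mask = torch.ones(B, T)                 # zero for padded positions

nll = nn.PoissonNLLLoss(log_input=True, reduction="none")
loss = (nll(log_rates, future_counts) * pad_mask).sum() / pad_mask.sum()
loss.backward()
```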

V. RESULTS
In this section, we evaluate the performance of the SANSformer model across the tasks outlined in Section III-B, demonstrating its ability to handle the complexities of EHR data effectively. Our primary objective is to examine its predictive capabilities across various patient subgroups. Through this comprehensive performance analysis, we aim to showcase the model's predictive strength, scalability, and applicability in real-world healthcare contexts.
Our baselines are the RETAIN model [11], which uses RNNs for EHR; Transformer-based models designed for clinical code sequences, namely bidirectional encoder representations from transformers (BEHRT) [39], SARD [33], and the bidirectional representation learning model with a transformer architecture on multimodal EHR (BRLTM) [47]; and traditional lasso/logistic regression models, which serve as linear benchmarks. These comparisons are conducted on both the Pummel and MIMIC datasets. After that, we demonstrate the advantages offered by our GSP strategy, studying its potential to improve prediction accuracy in three divergent subgroups within the Pummel dataset: patients diagnosed with T2D, BP, and MS.

A. Details of the Experiments
For the Pummel dataset, we utilized data from patients diagnosed with the specific disease between the years 2012 and 2015 to construct the training set, while data from those diagnosed in 2016 were reserved for the test set. We subdivided the training data, allocating 80% for model training and the remaining 20% for validation purposes. For the MIMIC dataset, after preprocessing, we held out 20% of the data as the test set. The remaining data were split, with 80% used for training and the residual 20% utilized as a validation set. Our baseline models, which include L1-regularized logistic regression, RETAIN [11], BEHRT [39], BRLTM [47], and SARD [33], are compared with our proposed axial and additive SANSformer variants. We average the hidden tokens from each time-step to obtain the logits for SARD on regression tasks.
All models are implemented using PyTorch [50] and trained with the Rectified Adam optimizer [42] with a cyclical learning rate schedule [62] and linear decay. The computation was performed using a single NVIDIA Tesla V100 GPU. The models were trained for 20 epochs with a batch size of 32.
Hyperparameters were tuned on the validation set, and the search ranges are provided in the supplementary material.
The performance of our models on the Pummel dataset is evaluated using Spearman's rank correlation and the mean absolute error (MAE) for visit count predictions (task 1), metrics widely used, for example, in the risk adjustment literature [14], [36], [60]. For Pummel Diagnoses (task 2), we present metrics averaged across the six disease category counts. For the MIMIC dataset, the performance of mortality prediction (task 3) is reported using the area under the receiver operating characteristic curve (AUC). Table V compares performance in the BP (N = 827) and MS (N = 123) subgroups. In each table, the results from the randomly initialized models are shown at the top, and the corresponding results from models pretrained using GSP, MLM, and reverse distillation (RD) [33] are shown at the bottom. Mean performance ± standard deviation from five random restarts is reported. Bold indicates the best score in the corresponding column.
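For reference, the two Pummel metrics can be computed as in the following sketch (using scipy and numpy; the arrays are illustrative).

```python
import numpy as np
from scipy.stats import spearmanr

y_true = np.array([3, 0, 12, 7, 1])            # observed next-year visit counts
y_pred = np.array([2.5, 0.4, 9.8, 6.1, 1.7])   # predicted event rates (lambda)

rho, _ = spearmanr(y_true, y_pred)             # Spearman's rank correlation
mae = np.abs(y_true - y_pred).mean()           # mean absolute error
```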

B. Baseline Comparisons Without Pretraining
In this section, we examine the effectiveness of the proposed SANSformer model against the selected baseline models without pretraining. This provides an assessment of our models in a controlled setup, where each model is trained from scratch for the task at hand, minimizing external influences.
The results from the Pummel dataset (from the diabetes subgroup consisting of approximately 41 000 training samples) and the MIMIC dataset, corresponding to the Pummel Visits, Pummel Diagnoses, and MIMIC Mortality tasks, are presented in Tables II-IV. By isolating the effects of pretraining, this comparison reveals the inherent capabilities of the SANSformer model in processing EHR data and provides a preliminary insight into its predictive performance.
A close examination of the results shows that our SANSformer model consistently outperforms the baseline models across tasks and metrics, except for the MAE metric in task 2, where BRLTM is slightly better. This is a strong indication of the model's potential, confirming that the inductive biases we have integrated into it contribute significantly to its performance in handling EHR data. These foundational results establish the context for our main experiment on the larger Pummel dataset, leveraging the GSP strategy.

C. Transfer Learning for Divergent Subgroup Prediction
We next investigate the transfer learning capabilities of the SANSformer model when applied to three distinct subgroups in the Pummel dataset: T2D (N = 41 761), BP (N = 827), and MS (N = 123). The model is first pretrained, facilitated by GSP, on the broader population data (N ≈ 1 million), following which it is fine-tuned on these individual subgroups. This methodology permits us to evaluate the model's adaptability and effectiveness in transfer learning, as well as its performance on smaller subgroups. Data efficiency is evaluated by manipulating the size of the training dataset from the T2D subgroup, which provides a sufficiently large sample. This approach addresses the challenge that reduced neural network performance is often associated with smaller training data sizes [20]. Fig. 5 presents the results from pretrained models utilizing various data sizes from the T2D subgroup. The results demonstrate that our axial SANSformer model is the most data efficient, providing superior performance in both the Pummel Visits and Pummel Diagnoses tasks.
We further contrast the performance of GSP-pretrained models with models that have been randomly initialized. This comparison is conducted on the MS and BP subgroups, which are smaller and present a dual challenge due to their size and highly skewed target histograms (as illustrated in Fig. 1). As shown in Table V, GSP provides a significant boost in results across all model architectures when compared to random initialization. In these smaller, divergent subgroups, the axial SANSformer model consistently either outperforms or matches the best-performing models, highlighting the robustness and versatility of GSP in handling diverse and complex tasks.
Examining challenging scenarios, such as the extremely small and skewed MS subgroup (N = 123), brings out the robustness of the GSP strategy. Even under such conditions, models pretrained with GSP, including our axial SANSformer, retain competitive performance. This emphasizes the value of GSP in maintaining model effectiveness across diverse situations, as opposed to relying on random initialization or other strategies such as reverse distillation. For a more in-depth understanding of the nuances of the model performance on this subgroup, we have undertaken a thorough error analysis, which is detailed in Section VI-E of the supplementary material.

D. Scalability in SANSformers
To evaluate the parameter scalability of our SANSformers, we conduct experiments on the MIMIC dataset with task 3, systematically increasing the number of parameters in the models. As depicted in Fig. 6, both the additive and axial versions of SANSformer exhibit better scalability compared to traditional Transformers when the number of parameters is increased. Throughout these experiments, the size of our training data remains unchanged. As a result, larger models could tend toward overfitting, as was noted in the RETAIN and SARD baselines. However, even with their increased parameter count, SANSformers resist this overfitting trend, signifying their robustness and suggesting enhanced generalization capabilities. Results from the scalability study on the Pummel dataset are included in Section VI-D of the supplementary material.
1) Summary of Results: Throughout our comprehensive evaluations, the SANSformer models consistently exhibit superior performance. They either surpass or match closely with the baselines in the nonpretraining experiments, emphasizing the value of the integrated inductive biases, such as axial decomposition and Δτ embeddings. A detailed ablation study on their effectiveness across both the Pummel and MIMIC datasets is presented in Section VI-C. Furthermore, in the divergent subgroup analysis, neural network models derive substantial benefits from an initial pretraining step. Notably, the introduction of the GSP pretraining strategy enhances model performance, achieving gains beyond other strategies such as MLM and RD.

VI. CONCLUSION
This article introduced attention-free MLP models [41], [64], which have previously demonstrated competitive performance in computer vision and NLP tasks, to the EHR domain. The proposed SANSformer model, in particular, exhibited robust performance across various datasets and tasks. Regardless of whether it was initialized randomly or pretrained, it consistently outperformed other strong baselines, thereby consolidating its advantages as discussed in Sections V-B to V-D.
When predicting for divergent subgroups with pretraining, SANSformers consistently match or outperform the other competitive baselines, as depicted in Fig. 5 and Table V. The advantages of axial mixing over additive visit summarization are most evident in tasks that require capturing complex interactions between tokens, such as task 2 in Table V and Fig. 5. Without pretraining, SANSformer models consistently outperform other baselines even when randomly initialized, as seen in Tables II-IV. This highlights that SANSformer models are more parameter-efficient than Transformers while maintaining their scalability, as illustrated in Fig. 6.
These results underline the substantial potential of self-supervised pretraining on the general population to amplify prediction accuracy across model architectures in divergent subgroups, providing a boost to subgroup-specific prediction models that are essential in allocating valuable healthcare resources. Interpreting the features learned by SANSformer models and comparing them with features engineered by domain experts could be an interesting extension of this work.

A. Limitations
Our proposed SANSformer models introduce a notable advancement in utilizing attention-free MLP models within the EHR domain. However, there are inherent limitations to be considered. First, SANSformers encounter a restriction related to sequence length, defined by the dimensions of the weight matrix W in the SGU (Section IV-B-2). However, such a limitation is also observed in RNNs and Transformers, where, despite their theoretical capability to process indefinitely long sequences, practical challenges, such as the "vanishing gradient" problem and memory limitations, respectively, inhibit their effectiveness for processing lengthy inputs. Second, while our model effectively utilizes inpatient encounters as an indicator to anticipate healthcare utilization, it does not directly compute the costs required by some applications [56], relying instead on a proxy. Consequently, further adaptations and enhancements would be required to derive the actual expected healthcare cost. Lastly, the exclusion of the self-attention mechanism from SANSformers ostensibly reduces the model's interpretability. While attention maps from the self-attention mechanism provide a degree of insight into model explainability, the clarity and reliability of such interpretability are the subject of ongoing debate within the research community [29], [68]. Applying alternative model-agnostic interpretability methods, such as SHAP [44], to furnish insightful and trustworthy explanations of SANSformer's decision-making process in healthcare contexts is an exciting direction for future work.

ACKNOWLEDGMENT

For grammatical coherence and LaTeX formatting, OpenAI's ChatGPT was used to assist in polishing the language and structure of the manuscript.

Fig. 1. Highlighting the challenge of predicting with a divergent subgroup. The histograms depict the number of hospital visits for two groups: the general population (top) and individuals diagnosed with bipolar disorder (BP, bottom). The observed bimodality in the histogram of counts results from the application of a topcap measure, a mechanism designed to address the long-tail problem by capping the maximum count value at 36 visits. This highlights the disparate visitation patterns within subgroups, demonstrating the complexity of predicting future healthcare utilization. A Student's t-test indicates a significant difference between the distributions of the general population and BP (p-value < 0.001).

Fig. 2. SANSformer architecture. The figure provides a schematic representation of a single SANSformer layer. To emphasize the specific modifications introduced in our approach, we mirror the schematic of the Transformer architecture from Vaswani et al. [66]. Alterations from the conventional Transformer layer are highlighted in red, particularly the introduction of Δτ embeddings and the replacement of the self-attention mechanism with our attention-free mixers. A side-by-side comparison on the right shows the conventional self-attention and mixer, underscoring the efficiency achieved with fewer projection operations. Unlike self-attention's three projections (to query, key, and value), the mixer employs just a channel and a spatial projection. For an exhaustive schematic of the entire architecture, please refer to Fig. 4.

Fig. 4. Detailed schematic of the axial SANSformer model. The original tensor, comprising embedding size (E), intravisit size (V), and time-steps (T), is axially decomposed to yield two tensors with dimensions T × E and V × E. The model's dataflow bifurcates into two distinct branches: a visitwise branch (top) and an intravisit branch (bottom). The visitwise branch aggregates tokens from preceding visits (enabled by causal masking), while the intravisit branch focuses on tokens from the current visit. A spatial gating unit (SGU) facilitates the cross-token interactions for each branch. The two branches are integrated using a scalar weight α_axial, which is optimized using cross validation. Consequently, when α_axial = 0, the model simplifies to the additive SANSformer variant.

Fig. 5. Data efficiency through pretraining on the Pummel dataset (T2D subgroup). The figure evaluates the performance of pretrained models across varying training data sizes. Row 1 depicts the Spearman's rank correlation for the Pummel Visits and Pummel Diagnoses tasks, while Row 2 reports the mean absolute error (MAE) for the same tasks. For each data point, the mean and standard deviation computed from three random restarts are shown. The performance of the randomly initialized RETAIN (RI) model has been included for reference.

Fig. 6. Parameter scalability on MIMIC. This figure plots model performance on task 3 against the number of trainable parameters, illustrating the scalability of each model. Impressively, both the additive and axial SANSformer models maintain strong performance even as the number of parameters is increased. This is particularly noteworthy given the constant training data size, as the SANSformer models resist overfitting and showcase their superior generalization capacity.

TABLE I
BASIC STATISTICS OF THE TWO EHR DATASETS

TABLE V
PERFORMANCE ON DIVERGENT SUBGROUPS IN THE PUMMEL DATASET