A Calibrated Ensemble Algorithm to Address Data Heterogeneity in Machine Learning: An Application to Identify Severe SLE Flares in Lupus Patients

Motivated to address the inconsistency between the essential i.i.d. assumption in machine learning theory and the data heterogeneity in real-world applications, we propose a novel calibrated ensemble (CE) algorithm to facilitate learning with diverse data subgroups. Unlike the traditional ensemble framework in which each learner is trained independently using the entire dataset, our method exploits the strengths of various machine learning models by training them simultaneously and forming model-ergonomic data subgroups as part of the training process. Consequently, each learner is calibrated to a unique subset of data based on their individualized predictive strength. Clinically, we can interpret each model as an expert specializing in treating patients with particular disease manifestations. We evaluate the CE model in our motivating domain of identifying lupus patients with severe SLE flares using 1541 clinical encounters in the Mass General Brigham (MGB) Lupus Cohort. Our experimental results demonstrate the efficacy of our CE model across seven performance evaluation metrics compared to five individual machine learning models and regular ensemble approaches. We further utilize ANOVA and Tukey HSD post-hoc statistical analysis to discover characteristic features of individual model clusters for clinical interpretations.


I. INTRODUCTION
Machine learning (ML) has attracted a significant amount of interest in recent years and has become a rapidly emerging field in artificial intelligence. Conceptually, ML can be viewed as discovering the underlying pattern in a large collection of data (i.e., training examples), guided by various learning algorithms. The effectiveness of this process relies on the assumption that the collected data are drawn independently from the same distribution. For example, logistic regression (LR) [1], neural networks [2], and many other standard algorithms implicitly make this assumption in their learning processes. However, this assumption is often violated in practice. Practitioners applying machine learning to real-world data often find themselves in a common predicament: data are col-The associate editor coordinating the review of this manuscript and approving it for publication was Vishal Srivastava. lected from heterogeneous sources and, consequently, have different underlying distributions. This is particularly true in the medical domain where patients may belong to distinctive subgroups with altered disease characteristics, or different physicians can introduce biases due to subjective interpretations of the clinical test or lab results [3], [4].
The subgroups of comparable subjects which can be effectively modeled as coming from the same distribution are notoriously difficult to identify. Many efforts, such as multiple-task learning [5], multi-view learning (MVL) [6], and transfer learning [7], have been made to address the idiosyncrasies in subsets of training data. Another typical approach is to group the data using descriptive features guided by domain knowledge. However, this option could lead to reduced training data, and the domain knowledge may not always be available. We provide a brief survey of these related studies in Section II.
In our study, we observe that different machine learning algorithms tend to focus on a different set of features in their decision-making processes, suggesting that they are targeting patients with different disease manifestations. Based on our findings, we propose an iterative approach in which an ensemble of k ML models dynamically partitions the training data into k subgroups. Model parameters and subgroups of patients best suited for fitting by individual learners are adjusted in each iteration to maximize the models' performance. Unlike the traditional ensemble framework in which each learner is trained independently using the entire dataset, our method exploits the strengths of various machine learning models by training them simultaneously and forming model-ergonomic data subgroups as part of the training process. We term our model calibrated ensemble (CE) because it is an ensemble of multiple learners and each learner is calibrated to a unique subset of data.
In addition, instances in each algorithmically induced data cluster can be interpreted as patients exhibiting certain homogeneous traits that are apprehensible to a particular learned model (i.e., expert) but relatively opaque to others. To further understand these latent disease subgroups, we utilize the ANOVA [8], Bartlett's [9], and Tukey HSD post-hoc [10] statistical tests to identify individual model clusters' characteristic features and study their clinical interpretations.
The main contribution of this paper is a novel ensemble approach to explicitly address multiple underlying data distributions in building predictive models. In the clinical setting, the resulting models can be interpreted as experts and the data associated with each model is a patient subgroup in which the expert (i.e., the model) specializes. Our method can be extended to any dataset that is a mixture of multiple distributions, which is typical in real-world applications.
Another contribution of our work is a new method to discover latent disease subgroups in clinical data. In contrast to the majority of existing approaches where data clusters are typically identified via domain knowledge before the onset of model training, our algorithm iteratively forms model-ergonomic subsets in the model training process. Subsequently, the clinical interpretation of these patient groups can be inferred by performing statistical analysis on each feature across different model clusters to identify characteristic traits, and thus, leading to potential discoveries of patient subgroups that are not yet established in the clinical setting.
We demonstrate the efficacy of our CE algorithm in our motivating domain of identifying disease flares in patients with systemic lupus erythematosus (SLE), a heterogeneous disease characterized by a range of clinical manifestations and laboratory abnormalities. The heterogeneous nature of the disease lends difficulty to accurately predicting disease flares, as does the irregular nature of real-world clinical observational data with varying distributions.
Leveraging experts' domain knowledge, Zhao et al. introduced a domain induced Dirichlet mixture of Gaussian processes (DI-DPMGP) model to address the patient subgroups and physician subjectivity in predicting the disease course for multiple sclerosis patients [4]. In their approach, data subgroups generated by a k-means algorithms served as hierarchical constraints to a non-parametric model. Ross et al. [11] proposed a novel clustering with constraints method to identify new and clinically relevant categories of lung disease. In particular, they introduced a new way of looking at subtyping/clustering by recasting it in terms of discovering associations between individuals and disease trajectories.
In the MVL domain, Liu et al. explored multi-view learning [6] in classifying mild cognitive impairment (MCI), an early stage Alzheimier's disease. They proposed an effective method to enhance the feature representation of multi-modal MRI data by combining multi-view information to improve the performance of MCI classification. Serra et al. [12] proposed a multi-view genomic data integration methodology, in which the information from different data layers (views) is integrated at the levels of the results of each single view clustering iteration.
In the MTL and TL domain, Hu et al. applied transfer learning to generate individualized patient models, grounded in the wealth of population data, while also detecting and adjusting for inter-patient variabilities based on each patient's own histologic data [13]. Zhao et al. [3] applied transfer learning techniques to address human subjectivity in predicting disease course for chronic progressive diseases.
It is worth noting that all of the aforementioned methodologies require pre-defined criteria for data groups or subtasks/views. Nevertheless, this information can often be unavailable. Our proposed method is motivated to address this limitation by identifying the latent data clusters algorithmically as part of an ensemble model's training process. Consequently, the resulting learners are calibrated to specialize in distinct subsets of data based on their individualized predictive strengths.

III. METHODS
This section illustrates our proposed calibrated ensemble model, which will iteratively partition the patients into subgroups during its model training process, leveraging individual learners' strengths.

A. CALIBRATED ENSEMBLE MODEL 1) MODEL TRAINING
We denote our training data as . , x d i ) and y i ∈ {0, 1} (i = 0, 1, 2., . . . , n) are the attributes and the corresponding observed class labels for instance i, respectively.
Our CE algorithm starts with training k baseline models, denoted as m 1 , m 2 , . . . , m k , using the entire dataset D. The training process entails all necessary steps to obtain the best performing model, including 10-fold cross-validation, imbalanced data treatment, and nested 10-fold cross-validation for hyper-parameter selection as described in Section IV-C1.
Next, we will form k data clusters for the corresponding models. The cluster membership of each training instance will be set to the model group with the highest probability score for the class label. For example, assuming our CE algorithm employs three learners, m 1 , m 2 and m 3 and P(m( x i ) = y i ) denotes the probability score for class y i when model m is applied to instance x i . If P(m( x i ) = y i ) are 0.4, 0.7, 0.6 from m 1 , m 2 , and m 3 respectively, then x i will be assigned to the m 2 group. In the case of a tie, the group membership is set randomly among the equal performing algorithms.
Once we have formed the initial clusters, we will retrain each model using its own group data, followed by reassigning the group membership of each instance according to the probability scores after applying the retrained models. This model training and membership assigning process is repeated until convergence, i.e., when there is no group membership change for all the data points. To prevent degenerative models, we ensure a minimum of 15 instances from each class in each cluster. As a result, a data point will not be moved to a higher performing group if its reassignment will violate the minimum instance requirement. The convergence proof of the CE algorithm is given in Section III-B. We provide the outline of the algorithm in Figure 1.
It is worth noting that our CE algorithm bears some resemblance to a k-means algorithm [14]. However, there are two fundamental differences between the two algorithms. First, k-means is an unsupervised clustering algorithm aiming to discover the underlying structure of the data, whereas CE is a supervised algorithm that exploits the strength of individual machine learning algorithms. Second, the k-means algorithm acts on a set of descriptive attributes of the dataset and the clusters are formed using similarity measures, whereas the clusters in CE are formed based on models' performance. Indeed, the descriptive characteristics of each cluster can be inferred afterward by performing statistical analysis on the obtained subgroups (see Section IV-E for details).

2) MODEL INFERENCE
We perform model inference by applying each learner to the new data point and select the prediction with the highest probability. If we consider each model as an expert specializing in treating a certain patient type, then the highest probability corresponds to the highest confidence. Assuming that each expert is reasonably skilled (i.e., better than random guessing), this approach is consistent with the design principle of the CE algorithm, i.e., for each instance, the best learner is the one with the highest predicted probability score in the corresponding label class. With this inference method, our CE model demonstrates significant performance gains over individual baseline models, as well as regular ensemble leaners. Detailed performance comparisons are presented in Section IV.
We also experimented with another intuitive inference approach, which is trying to place a new data point into the ''right'' algorithmic group and then apply the corresponding learned model. Nevertheless, assigning the correct group membership to a new instance can be challenging because the data clusters are formed algorithmically and there are no specific descriptions of each group. To this end, we experimented with both centroid-based and prediction-based methods. The centroid-based method calculates the cluster centroid for each algorithmic group by taking the average of all data instances in the group. A new instance is assigned to the most similar group measured by the Euclidean distances between the data point and the centroids. The prediction-based approach builds a 3-class classification model based on the class memberships of the training data. Both methods demonstrated worse outcomes than the above highest probability approach.

B. PROOF OF CONVERGENCE
We show that our CE algorithm will converge within a finite number of steps. Following the same notion as in Figure 1, the cost (C) of the CE algorithm is defined as follows: where −k is the total number of ML algorithms employed by the CE ensemble.
−g j is the data cluster associated with model j. −y is the ground-truth label of instance x. −P(m j ( x) = y) is the probability score of class y when model m j is applied to instance x. Thus, C is the total deficiency in predicted probability scores of all instances with respect to their ground-truth labels. We claim that the cost function C is strictly decreasing in each iteration of steps 4) to 6) in Figure 1. This can be shown as follows.
First, in step 6), we observe that the group membership for x changes from cluster g to g only if P(m ( x) = y) > P(m( x) = y), which means model m makes a more accurate prediction for x. This improved performance can either correct a wrong prediction or result in a higher predicted probability score towards the ground-truth class. In both cases, the membership reassignments will decrease the algorithm's total cost C.
In step 5), we observe that all models in iteration t (i.e., m t i ) will be retrained using the adjusted corresponding data clusters (i.e., g t i ) to obtain m t+1 i , ∀i ∈ {1, 2, . . . , k}. If the cost incurred by m t+1 i is higher than that of m t i for instances in the cluster g t i , we will simply keep the old model by setting m t+1 i = m t i . Thus, the cost C is non-increasing in step 5). Since the C is strictly decreasing throughout the iterations and is lower-bounded by 0, the CE algorithm converges within finite steps.

IV. EXPERIMENTAL RESULTS
In this section, we first describe our motivating task of predicting lupus flares. We then demonstrate the efficacy of the CE algorithm by comparing its performance to that of individual baseline models and regular ensemble learners. Lastly, we illustrate the interpretation of the data clusters identified by the CE model's individual learners under the clinical setting.

A. PREDICTING LUPUS FLARES
Lupus is a chronic autoimmune disease with a prevalence of at least five million people worldwide [15]. Patients suf-fer from various symptoms, including pain, extreme fatigue, hair loss, cognitive issues, and physical impairments that affect every facet of their lives. Systemic lupus erythematosus (SLE) is the most common form of lupus, affecting approximately 70% of lupus patients [15]. The clinical course of SLE is heterogeneous and characterized by disease flares which can range from mild to life-threatening, affecting various organ systems [16]- [18]. Such flares can lead to irreversible organ damage and lower health-related quality of life, as well as considerable economic costs.
Identifying severe SLE disease flares in real-world data could provide unprecedented insight into nuanced patterns underlying disease activity. The stratification of patients by risk for SLE flares could lead to improved clinical monitoring and targeted treatment. However, accurate assessment of lupus flares is critical but problematic in clinical trials [19]. A gold standard measure is the SELENA-SLEDAI flare index (SFI) [20], a cumulative and weighted index used to assess disease activity across 24 different disease descriptors in patients with SLE. In practice, a revised SFI (i.e., rSFI) [21], [22] is preferred, which further incorporates additional information as an expert domain driven rule-based algorithm and classifies SLE patients into mild-flare, moderate-flare, severe-flare, and no-flare categories.
Our study applies machine learning techniques to identify patients in the most severe-flare category (class 1) against the remaining mild/medium/no flare patients (class 0) based on their rSFI scores, engendering a binary classification task. This undertaking is valuable because hospitalizations are needed for the most severe SLE flares, occurring in 7% of individuals with SLE per year and accounting for most of the direct costs of SLE care [23], [24].

B. DATA AND PREPROCESSING
Our data comes from the longitudinal EHR-based Mass General Brigham (MGB) Lupus Cohort, including patients with SLE from two large academic medical centers and multiple community hospitals. These subjects were identified by a previously validated SLE phenotype study and have been followed longitudinally between 2016-2020 [25]. Our dataset consists of 1,541 clinical encounters over this period. This study was approved by the Mass General Brigham Institutional Review Board, and informed consent was waived.
We extracted a total of 203 features from patients' encounter information, lab results, and medication records. Categorical features were further processed using one-hot encoding, a technique in which an integer encoded categorical variable is converted to a set of binary variables, each of which indicates a unique value in the category [26]. One-hot encoding eliminates the artificial ordering introduced by the integer values that a machine learning algorithm could exploit erroneously.

1) IMBALANCED DATA
Our dataset is highly imbalanced with a class 1 (severeflare) to class 0 (mild/medium/no flare) ratio of 332 to VOLUME 10, 2022 1209. Applying standard machine learning algorithms to an imbalanced dataset leads to insufficient performance on the minority class, which is often the more interesting and important class under investigation. Indeed, the primary interest of our classification task is to accurately predict patients with severe SLE flares. To address the class imbalance issue, we employed the cost-sensitive learning [27] technique in our model training process. Specifically, a higher cost (i.e., weight) is assigned to all minority instances to facilitate a larger penalty when they are misclassified. For each algorithm, the best weight was selected as a hyper-parameter using a nested 10-fold cross-validation detailed in Section IV-C1.

2) MISSING VALUE IMPUTATION
There are 30 features with missing values (MVs) in our dataset, with the missing percentage ranging from 0.13% to 97%. We removed 14 features missing in over 40% of patients. For the remaining features, we imputed the missing values using the mean or mode for the numeric and categorial features, respectively. Finally, we applied z-score normalization to standardize the data.

C. EXPERIMENTAL FRAMEWORK 1) BASELINE MODELS
We employ five baseline machine learning methods: Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), Naive Bayes (NB), and XGBoost. Figure 2 illustrates our framework for training and evaluating each baseline model. Specifically, all experiments are conducted using an outer 10-fold (black box) cross-validation. Therein, we divide the training data into ten disjoint partitions (i.e., folds), and train/evaluate each classifier ten times with different training and test data. At each iteration t, (t = 1, 2, . . . 10), fold i will be designated as the test data, and the remaining nine folds will be designated as the training data. We report the average performance of the ten test folds. Optimal hyper-parameters are selected using a grid search [28] and an inner 10-fold cross-validation (red box in Figure 2), aiming at the highest validation area under the receiver operator curve (AUC) [29]. The optimal parameter combination is then used to perform a final training on the complete 9-fold of data. The model performance measures are computed from the ground-truth and predicted class memberships based on the predictive probabilities. The performance is evaluated using seven metrics: overall accuracy, recall, specificity, PPV, NPV, F1, and AUC.

2) LEARNERS FOR THE CALIBRATED ENSEMBLE MODEL
We selected RF, LR, and NB to be the learners for our CE model based on a study of baseline models' principal predictors. Specifically, in our experiments, we observed that three (i.e., RF, LR, NB) out of these five algorithms exhibit notably different principal predictors, while DT and XGboost have a significant number of common predictors as the RF algorithm. Thus, we conjectured that RF, LR, and NB focus on different disease characteristics and target different patient subgroups. To capitalize on our findings, we chose RF, LR, and NB for our CE algorithm.

D. MODEL PERFORMANCE EVALUATION
Our CE model converged after eight iterations with 135, 634, and 622 instances in the RF, NB, and LR clusters, respectively. The total cost function, as defined in Equation (1), monotonically decreased from 171.02 to 24.38 across the iterations (Figure 3). Table 1 presents the main results of our study. Each model's performance is evaluated using seven metrics shown in columns 2-7.
From the row labeled ''Average 1'', we observe that the five individual models achieved an average overall performance of 67% test accuracy with 77% and 64% in Recall and Specificity, respectively. The average PPV and NPV scores are 0.38 and 0.91; their large discrepancy can be explained by the highly imbalanced data in our study. The average F1 score and AUC are 0.50 and 0.71, respectively.
Compared to the individual model, our CE methods demonstrated significant advantage across all seven evaluation metrics as shown in the row labeled ''Gain (%) over Average 1'' in the CE category. Specifically, the improvement in overall accuracy is 10% with 12% and 11% in class 1 and class 0, respectively. The gains in PPV, NPV, F1 scoare, and AUC are 45%, 2%, 35%, and 11%, respectively.
In addition to individual models, we further compared our CE model's performance to a traditional ensemble of the five baseline learners. To this end, we employed two types of inference for the regular ensemble model. The ''hard'' inference performs a majority vote from the base learners' final classification decisions (i.e., class 1 or 0), while the ''soft'' inference calculates the average predicted probability scores from five base learners and thresholds it at 0.5 to make the decision. We observe from Table 1, row ''Average 2'', that the regular ensemble learner made a noticeable improvement over the average of individual models in predicting class 1 (i.e., Recall, 8%), but the gain in class 0 is limited (Specificity, 2%). Other improvements are in NPV (3%), AUC (5%), and F1 score (2%). There is a 3% drop in NPV.
Last, we compare our CE model's performance to that of the ensemble approach. From row ''Gain (%) over Average 2'', We observe that the CE method outperformed the traditional ensemble approach with a 7% improvement in overall accuracy, 4% and 9% in class 1 and class 0, respectively. Most noticeably, the CE model offered a 49% gain in PPV with a marginal 2% trade-off in NPV, resulting in a 32% improvement in F1 score. The AUC improvement over the average ensemble approach is 6%.

E. CHARACTERISTIC FEATURE ANALYSIS
In this section, we present our study of the characteristics of patient subgroups formed by the CE model. We first aimed to identify features whose cluster means were statistically different among the model-specific groups. To this end, for each feature, we applied a one-way Analysis of Variance (ANOVA) test [8], which is an extension of the Student t-test for more than two groups. Specifically, ANOVA compares the means among the groups and determines whether any of those means are statistically significantly different from each other. Formally, for our application, it tests the null hypothesis: where µ i RF , µ i NB , and µ i LR denote the average value of feature x i over instances in the RF, NB, and LR subgroups, respectively. We further applied Bartlett's test [9] to ensure the homogeneity of variances assumption in the ANOVA analysis. A statistically significant result (i.e., rejecting hypothesis H 0 ) from an ANOVA analysis indicates at least one group differs from the other groups. However, the omnibus test does not inform where the significance lies. To further analyze the pattern of difference between means, we performed the Tukey HSD (''Honestly Significant Difference'') post-hoc test [10] for those statistically significant features. The Tukey HSD test is similar to a pairwise t-test, but more reliable for data with more than two independent groups. Table 2 presents the statistically significant features (Column 2) of each model-specific subgroup (Column 1) with a 95% confidence interval for all three tests. The cluster means and p-values for each feature are displayed in Columns 3-5 and Columns 6-10, respectively. In particular, the last three columns in Table 2 present the pairwise p-values of the three clusters, from which we inferred the features unique to each algorithmic cluster. For example, we observe that the pairwise p-values for the feature ''Initial -pleuritis'' are 0.001, 0.001, and 0.9 for RF vs. NB, RF vs. LR, and NB vs. LR, respectively. Thus, this feature is statistically insignificant between the NB and LR clusters (p-value = 0.9) but significant for RF vs. NB (p-value = 0.001) and RF vs. LR (p-value = 0.001), suggesting it is a characteristic feature for the RF subgroup. As illustrated in Table 2, there is a total of 10, 7, and 12 features unique to the RF, NB, and LR subgroups, respectively. Lastly, two features (i.e., ''arthralgias only'' in arthritis details and current symptom of inflammatory arthralgias/arthritis, are used in all three clusters.

F. CLINICAL INTERPRETATIONS OF DATA SUBGROUPS
We observe in Table 2 that the patients in the NB cluster have nearly twice higher initial manifestation of nephritis (RF:0.10, NB:0.18, LR:0.09) than the other clusters This group also has a considerably higher usage of blood thinning medication warfrain (RF:0.03, NB:0.08, LR:0.02). A comparison of these features is presented in Figure 4(a).

V. CONCLUSION
In this work, we proposed a new calibrated ensemble (CE) approach to address the heterogeneity in real-world data, which often violates the fundamental i.i.d. assumption in machine learning theory. Unlike the traditional ensemble framework in which each learner is trained independently with the entire dataset, our method exploits various ML models' strengths by training them simultaneously and instituting model-specific data subgroups as part of the training process. As a result, each learner is calibrated to a unique subset of data based on their individualized proficiencies. Clinically, we can interpret each model as a specialist for patients with a particular set of disease manifestations.
We evaluated the CE model in our motivating domain of identifying lupus patients with severe SLE flares in 1,541 clinical encounters. Our experimental results demonstrated consistent efficacy of our CE model across seven evaluation metrics when compared to five individual ML models and regular ensemble methods. We further conducted statistical analysis to identify characteristic features of each model-specific patient subgroup and examined their clinical interpretations.
Two factors could have contributed to the success of our CE algorithm. The first is the enforcement of the i.i.d. assumption for each ML algorithm. In particular, by restricting data to a high-performing subset for each learner, the i.i.d. assumption is enhanced for each algorithm, thereby allowing greater potential for success. The second factor is that different ML algorithms can be most effective for different disease manifestations due to their intrinsic designs. For example, the discriminative (e.g., RF, LR) and generative approaches (e.g., NB) are fundamentally different in model structure and learning principle. The CE algorithm exploits each learner's strength and dynamically selecting an ergonomic subset for each model as part of its training process. Our method can be extended to other applications where the collected data may come from multiple underlying distributions.