Computational Simulation of Virtual Patients Reduces Dataset Bias and Improves Machine Learning-Based Detection of ARDS from Noisy Heterogeneous ICU Datasets

Goal: Machine learning (ML) technologies that leverage large-scale patient data are promising tools predicting disease evolution in individual patients. However, the limited generalizability of ML models developed on single-center datasets, and their unproven performance in real-world settings, remain significant constraints to their widespread adoption in clinical practice. One approach to tackle this issue is to base learning on large multi-center datasets. However, such heterogeneous datasets can introduce further biases driven by data origin, as data structures and patient cohorts may differ between hospitals. Methods: In this paper, we demonstrate how mechanistic virtual patient (VP) modeling can be used to capture specific features of patients’ states and dynamics, while reducing biases introduced by heterogeneous datasets. We show how VP modeling can be used for data augmentation through identification of individualized model parameters approximating disease states of patients with suspected acute respiratory distress syndrome (ARDS) from observational data of mixed origin. We compare the results of an unsupervised learning method (clustering) in two cases: where the learning is based on original patient data and on data derived in the matching procedure of the VP model to real patient data. Results: More robust cluster configurations were observed in clustering using the model-derived data. VP model-based clustering also reduced biases introduced by the inclusion of data from different hospitals and was able to discover an additional cluster with significant ARDS enrichment. Conclusions: Our results indicate that mechanistic VP modeling can be used to significantly reduce biases introduced by learning from heterogeneous datasets and to allow improved discovery of patient cohorts driven exclusively by medical conditions.

However, the more data-driven models are applied in healthcare settings, the more the issue of impaired performance on different datasets, i.e., poor generalizability of such models, is becoming apparent [5], [10], [11], [12], [13].If ML models are developed on one dataset, they learn data distributions which are specific or characteristic for this particular dataset and perform worse on data obtained from other sources with potentially different distributions [14], [15], [16].Moreover, attempts to apply models developed in a single hospital to patients from another hospital have also already revealed significant limitations [17], [18].In medicine generally, but particularly in the ICU setting, there are multiple reasons why data from different hospitals can differ significantly, e.g., different admission strategies, guidelines for treatment, patients' baseline values, protocols on settings of medical support devices or definitions of cut-off values [19], [20], [21].
On the one hand, the issue of poor generalizability of developed models cannot be solved by blindly increasing the size of the training dataset, as this does not necessarily guarantee a good performance of a model on another dataset [10].On the other hand, pooling of data from diverse origins for development of AI/ML tools introduces further biases driven by data origin.This can represent a challenge for the application of both supervised and unsupervised AI/ML methods, as relevant medical information can be hidden behind biases introduced by different datasets [22].
A potential solution to these challenges is to exploit models that allow to infer the core information approximating a patient's status.Such computer models, which are complex enough to model heterogeneous human pathophysiological states, are often referred to as "virtual patient (VP) models" or "in silico" patients [23].These mechanistic models rely on real patient data and represent a specific pathophysiological state of a patient.Therefore, they can be considered a "digital twin" of a real patient at a given point in time.VP models aim to capture specific features of patient dynamics while avoiding excessive detail.They are based on well accepted and understood physiological principles and can be adapted to represent individual patients [24].VP modeling, therefore, enables data augmentation through identification of individualized model parameters in the matching procedure of the VP model to real patient data.These model-derived parameters represent an approximation of a disease state of a patient and potentially should not depend on the assessment protocols of the underlying dataset.Therefore, models integrating these parameters are expected to be generalizable across different application sites.In the area of in silico clinical trials encouraging results support this hypothesis.Thus, the responses of the matched VP cohorts to the insulin therapy were generalizable across different hospitals once they were compared to the responses of original cohorts in corresponding hospitals [25].Moreover, previous applications of hybrid approaches incorporating both mechanistic and data-based modeling have already resulted in successes in other areas of research.Thus, model-derived parameters of individual patients were used to infer important clinical covariates for a patient state [26] or stratify patients [27].
In this paper, we investigate how a mechanistic VP model can be employed to infer model-derived individualized parameters from ICU data pooled from diverse hospitals.We show that such data augmentation allows a reduction in the bias introduced by diverse datasets, and provides clinically meaningful information from noisy heterogeneous data, for instance from data pooled from different hospitals, which allows improved discovery of patient subpopulations through clustering.We demonstrate our approach on a cohort of patients with suspected acute respiratory distress syndrome (ARDS)a potentially life-threatening condition assessed from multiple hospitals in Germany as part of the ASIC project [28].
During the development of ARDS, due to an inflammatory process and a diffuse damage of alveolar-capillary membrane, protein-rich fluid enters the alveolar space impairing gas exchange.The weight of such a "wet lung" leads to an increased gravitational pressure on the lower, dependent lung compartments.This pressure in combination with the already present edema leads to the formation of atelectases, especially under mechanical ventilation (MV) with inadequate settings [29], [30], [31].This leads to respiratory insufficiency with relevantly impaired pulmonary gas exchange and possible multi-organ failure and fatal outcomes [32], [33].Despite the existence of an explicit clinical definition (the Berlin definition [34]), significant numbers of patients with ARDS are unrecognized or recognized late by clinicians [35], [36], [37].Thus, diagnosis is difficult and often delayed resulting in incomplete adherence to guideline-based therapy and high morbidity and mortality rates [32], [33].Failure to recognize ARDS in a timely fashion leads to failure to use strategies that improve survival [37].Early diagnosis of ARDS may facilitate measures to avoid progression of the lung injury, including protective mechanical ventilation, fluid restriction, and adjunctive measures proven to improve survival such as prone positioning.
Therefore, there is an urgent need for methods that could assist clinicians in early recognition of ARDS in the ICU setting.Several ML models have been developed for the early diagnosis of ARDS in the ICU [4].However, insufficient quality of ARDS labeling in retrospective datasets, which is caused by under-recognition of ARDS by clinicians [35], [36], [37] and by the ambiguities in the use of the Berlin definition [4], represents an important challenge for successful development of applicable ML models, as they must be trained on properly labeled ARDS events.In this paper we provide a way to address this issue.We show that a mechanistic VP model can be used to infer a set of model-derived parameters approximating disease states of individual patients from raw data, which can be used to identify non-diagnosed ARDS patients, providing a route to improved ML model development for early ARDS recognition.

A. COMPUTATIONAL MODEL
The simulator used in this study includes a comprehensive model of the pulmonary system based on mechanistic models of ventilation and gas exchange [38].It was later extended to include cardiovascular components [39].The simulator has already been validated using real patient data [40], [41].Internally, the model is constructed as a system of differential algebraic equations obtained from published literature, experimental data, and observational studies, that quantitatively represent established physiological processes.The equations are solved iteratively, with the solutions of one iteration at a time point used as inputs to the iteration at the next time step.This allows accurate representation and observation of gradual changes in several parameters that are otherwise difficult to estimate.The simulator consists of different modules representing the airways, the lung as a collection of ventilated alveolar compartments coupled to mechanical ventilator, anatomical shunt, dead space and the tissue compartment.The lung is modeled using 100 alveolar compartments, each of which may have different properties such as flow resistance, vascular resistance, compliance, etc.Thus, ventilation-perfusion mismatch can be modeled, allowing the simulation of conditions such as ARDS [42], [43], [44].
The simulator represents a dynamic cardiopulmonary state in vivo that is initialized with numerous input parameters.Some of these parameters are routinely measured in intensive care setting, such as blood gas analysis (BGA) measurements or respirator settings (the full list of parameters used as inputs for the model is given in the Supplementary List I).Others, however, are rarely measured, such as cardiac output, anatomical shunt or biophysical characteristics of individual alveolar compartments, and thus these must be estimated using optimization procedures.

B. CREATION OF A VIRTUAL PATIENT COHORT
To fully define each of the virtual patients, the simulator was fitted to individual patient data using advanced global optimization algorithm [45], [46], [47].The model parameters that were identified in the optimization procedure included 2 groups of parameters.Firstly, rarely measured physiological parameters (anatomical shunt, respiratory quotient, anatomical dead space volume, metabolic rate of O2, cardiac stroke volume, and inspiration to expiration ratio), were determined through optimization if they were missing in patient data.Parameters defining distributions of properties of alveolar compartmental parameters (vascular resistance and flow resistance of compartments) were also identified in the optimization process.To model ARDS development, another main parameter was introduced to the optimization procedure -the number of closed alveolar compartments (n cc ), accounting for the formation of atelectases and modeled through increased external pressure on the compartment leading to no ventilation and complete alveolar shunt.The optimization problem was formulated to find a configuration of model parameters that minimizes the difference between the model outputs and the observed patient data (arterial blood gas values at all time points in a window).Further details on the optimization procedure are given in the Supplementary File.
The optimization procedure was performed in two time windows relative to the onset of ARDS (t 0 ): from t 0 -2d to t 0 -1d (window 1) and from t 0 to t 0 + 1d (window 2), where d stands for 1 day.We assumed a patient to be in a steady non-ARDS state in the window 1 and in a steady ARDS state in the window 2. The one day interval between the two windows was assumed to represent a transient state and was excluded from the optimization.The optimal parameterization of the simulator for each patient in the window 1 comprised a VP configuration.To model ARDS development, in the window 2 optimization was performed exclusively for the n cc keeping the VP configuration found in the first window intact.
After fitting the simulator to individual patients, a list of parameters was calculated based on simulator outputs and parameters found in the optimization procedure in both time windows for each of the patients.These parameters, among others, included n cc , ventilation and shunted blood fraction (the full list of optimized and simulation output parameters is given in the Supplementary List II).For each of the patients, these parameters comprised model-derived data consisting of 18 features.

C. DATA
Four German hospitals (later referred to as Hosp A, Hosp C, Hosp D and Hosp E) provided retrospective, fully depersonalized data on ICU patients collected during the project "Algorithmic surveillance of ICU patients with acute respiratory distress syndrome" (ASIC) [28] of the SMITH consortium, which is part of the German Medical Informatics Initiative.The ASIC project was approved by the independent Ethics Committee (EC) at the RWTH Aachen Faculty of Medicine (local EC reference number: EK 102/19, date of approval: 26.03.2019).The ASIC project was registered at the German Clinical Trials Register (Registration Number: DRKS00014330).The Ethics Committee waived the need to obtain Informed consent for the collection and retrospective analysis of the de-identified data as well as the publication of the results of the analysis.Additionally, a historical dataset from one of the participating hospitals was included into the analysis (Hosp B).It comprised fully depersonalized data of ICU patients that were extracted according to the same rules as within the ASIC project.The time period for the historical dataset started with the introduction of the patient data management system in the ICU of the respective hospital and ended with the start of the ASIC project and covered a period of 10 years.Patient inclusion criteria were age above 18 years and a cumulative duration of invasive MV of at least 24 hours.There were no explicit exclusion criteria.Each patient's data included routinely charted ICU parameters collected over the whole ICU stay, biometric data and ICD-10 codes.The full list of parameters used in this study is given in Supplementary List I. Data from all five datasets were brought to the same units of measurement and were checked for consistency.During depersonalization, the concept of k-anonymity was applied to several parameters that posed a risk to privacy including age, height, weight, and BMI.These parameters were binned into intervals and the number of patients in each interval and in each combination of intervals of 4 parameters was assessed.If there were less than 8 patients in a particular interval or less than 10 patients in any combination of intervals including this interval, all patients of this interval were excluded from the analysis.Due to this, not all datasets of patients who initially met the inclusion criteria could be extracted from the respective hospital and included in the final dataset.The overall number of patients in the final dataset comprised 29,275 patients.
The criteria for the diagnosis of ARDS are defined in the Berlin criteria [34].As medical imaging data were missing in our dataset, only suspected ARDS onset time could be determined according to the Berlin criteria.It was defined as the timepoint when the ratio of arterial partial pressure of oxygen (PaO 2 ) and the inspired fraction of oxygen (FiO 2 ), also known as P/F ratio or Horovitz index, dropped below 300 mmHg for the first time and stayed below this threshold for at least 24 hours.Moreover, to be able to fit a simulator to the ICU data and create a cohort of virtual patients, only patients having specific MV, blood gas analysis and other parameters charted both before and after the suspected ARDS onset were selected.The final number of patients fulfilling these criteria comprised 1007 patients.The initial and final number of patients in corresponding hospitals is given in Table 1.A full description of data preparation is given in the Supplementary File.

D. CONSENSUS CLUSTERING AND ENRICHMENT ANALYSIS
We generated two datasets from the patient data representing the individual disease status to be used in the clustering algorithm.The first dataset comprised mean values of original measured parameters, which were used as inputs to the simulator, calculated on time windows 1 and 2 (before and after suspected ARDS onset respectively, see Supplementary List III).The second dataset comprised model-derived data: simulator outputs and parameters found in the optimization procedure (see Supplementary List II).The former dataset thus represented data from the cohort of original patients, while the latter represented the model-derived data, i.e., data from the virtual patient cohort.
Consensus k-means clustering was performed for different number of clusters in each of the cases.Consensus clustering is based on repeated multiple times (1000 times) clustering of the sampled data from the original dataset and is known to produce robust clusters [48].To further increase robustness of discovered clusters, another step was introduced to the clustering procedure.It was allowed to assign an outlier label to some patients, if they could not be securely assigned to any of observed clusters.In the clustering procedure, quality of clustering was assessed using mean cluster's consensus, as described in [48].This metric is introduced based on consensus matrix D: where M (h) is a connectivity matrix of the perturbed dataset obtained in the h-th resampling of the original dataset and M (h) (i, j) is equal to 1, if items i and j belong to the same cluster in h-th clustering repetition and 0 otherwise.I (h) is the (N × N) indicator matrix such that its (i, j)-th entry is equal to 1 if both items i and j are present in the perturbed dataset and 0 otherwise.Then, a cluster's consensus m(k) is defined as the average consensus index between all pairs of items belonging to the same cluster k: where I k is the set of indices of items belonging to cluster k and N k is a number of items in cluster k.Finally, the mean cluster's consensus is the cluster's consensus averaged over all clusters.This metric is a summary statistic which reflects the mean stability of clusters discovered in the consensus clustering algorithm and represents the overall robustness of discovered configuration of clusters.Mean clustering quality with 95 % confidence intervals was calculated by repeated (100 times) clustering on subsamples (80%) of dataset.A full description of the clustering procedure is given in the Supplementary File.
For each of the discovered clusters, enrichment with respect to clinical conditions and to each of the 5 underlying hospitals was evaluated using one-sided hypergeometric test for enrichment with a significance level of α = 0.05 [49].Analogously to gene set enrichment analysis, this method allows to identify clinical conditions (or hospitals) that are over-represented in a particular cohort (cluster) of patients compared to the whole population.For instance, if patients of Hosp A are encountered in a particular cluster more frequently than in the overall patient population formed of 5 hospitals, then that cluster is enriched with patients of Hosp A. Observed statistical significance values for each of conditions under consideration were corrected for multiple testing using Benjamini-Hochberg correction [50].

E. MODULES USED IN THE STUDY
In this study, the RBFOpt package [39] was used for fitting the VP model to real patient data in the optimization procedure.The following Python programming language [47] implementations were used in the study: scikit-learn [48] implementation of k-means clustering was used in the consensus clustering algorithm (sklearn.cluster.KMeans); scipy [49] implementations of hierarchical clustering were used in the consensus clustering algorithm (scipy.cluster.hierarchy,scipy.spatial.distance);statistical analysis was performed with scipy library (scipy.stats.hypergeom,scipy.stats.ttest_ind).Clustering results were compared using a two-tailed Student's t-test with a significance level of α = 0.05.

A. CREATION OF A VIRTUAL PATIENT COHORT
Fitting quality of the optimization procedure for all patients is shown in Fig. 1.Acceptable quality of fitting (simulator outputs within 2 standard deviations of measured data) was observed for 95.9% patients in the window before suspected ARDS onset and for 84.5% patients in the time window after suspected ARDS onset.Acceptable quality of fitting in both windows was observed for 81.7% or 823 patients.Thus, reliable model-derived data were obtained for 823 patients, which were used in the subsequent analysis.

B. CLUSTERING RESULTS
Clustering quality for different configurations of the number of clusters is shown in Fig. 2. For original measured data the best clustering quality was observed for 2 clusters, followed by a steep decrease in clustering quality for 3 clusters and gradual decrease of clustering quality for clustering configurations with a cluster number larger than 5.
In contrast to the clustering on the original measured data, the clustering quality on model-derived data was found to be significantly higher for all configurations of number of clusters (see Fig. 2 for the results of clustering and Table 2 for the results of the t-test).While on the original measured data, the quality decreased significantly already after increasing the number of clusters to 3, in the case of the model-derived data, the quality remained high for 2, 3 and 5 clusters.However, a cluster number above 5 also resulted in a steep decrease in clustering quality in this dataset.Thus, the number of clusters for further investigation was fixed to 5 for both clustering on original and model-derived data.
In case of clustering on original data each of the 5 discovered clusters had certain clinical conditions, which were over-represented in the respective clusters.However, all clusters were found to be driven by data from one or several particular hospitals, i.e., significant enrichment with respect to the hospital was found.Furthermore, 4 out of 5 clusters were dominated by significant over-representation of underlying hospitals, i.e., the highest enrichment was observed with respect to the hospital and not to the clinical condition, see Fig. 3(a).Enrichment results are given in Supplementary Table 1.Finally, none of the discovered clusters had significant enrichment of diagnosed ARDS patients (according to ICD-10 code J80.x).
In contrast, clustering on model-derived data revealed 2 mixed clusters, i.e., clusters without over-representation of any underlying hospital.In the remaining 3 clusters, although such an over-representation could be observed, it was significantly lower than in the clustering on measured original data, see Fig.   Additionally, clustering on model-derived data was able to discover a cluster with significant ARDS over-representation of diagnosed ARDS patients.This group of patients exhibited

FIGURE. 3. Significance of enrichment of clinical conditions and underlying hospitals in discovered clusters for clustering on original measured data (a) and model-derived data (b). The highest enrichment in each of the clusters is shown both for enrichment of clinical conditions (green bar) and for enrichment with respect to a hospital (red bar). In clustering on original data, all 5 discovered clusters are significantly enriched with data from some hospitals. In clustering on model-derived data, 2 clusters without enrichment for a hospital are observed and overall magnitude of enrichment with respect to a hospital is decreased.
multiple properties which are specific for ARDS patients.These encompass the lowest Horovitz index among all clusters, the lowest number of ventilation-free days and the highest mortality.Finally, this cluster showed the largest increase in number of closed alveolar compartments (n cc ) among all clusters.

IV. DISCUSSION
Data which are gathered in the ICU setting consist of global indices and parameters that reflect the state of the lung, such as BGA values or MV settings.However, these features in reality represent surrogate markers for the real pathophysiological state of the patient, leading to a significant simplification of clinical reality.In essence, ICU data are based on systematic monitoring of the enormous complexity of mechanisms accompanying the occurrence and progression of acute syndromes in individual patients.The development of complex syndromes is controlled not only by the core processes of disease progression (often molecular), but also by a large number of covariates arising from a diverse genetic background, lifestyle, exobiotic stress factors, and comorbidities.Another important factor is the large number of medical interventions in the context of intensive care, such as drug administration or MV.All these factors form highly complex feedback systems, in which the patient's condition causes and influences the interventions to be performed, which in turn influence the patient's condition.Such interventions can differ significantly among diverse hospitals introducing additional bias to the datasets [54], [55].Subsequently, relevant medical signals about a patient's state are often disturbed by noise or are missing completely.For instance, the human lung has inhomogeneous characteristics such as structural asymmetries and regional variations in ventilation and perfusion that cannot be captured by standard diagnostic methods.
To be able to infer relevant patient information, approaches of systems medicine and computational physiology can be used.Systems medicine aims to describe, model, and simulate living, medically relevant systems using methods similar to those used for complex technical processes.The main goal of computational physiology as a part of systems medicine is the adequate description of these relationships in a computationally efficient manner and the development of models that consider unique properties of the living organisms in response to their environment [23], [56].One of the pillars of computational physiology is VP modeling.The overall VP approach relies on the ability to determine parameters from data that are both patient-specific and time-varying, accounting for variability within and between patients.The ability of VP models, when appropriately adapted, to create a digital twin for a real patient also enables assessment of patient-specific parameters that are not readily measurable (e.g., vascular resistances, transpulmonary pressures, anatomic shunt, etc.).These unmeasurable parameters contain potentially important information about the patient's health status, which cannot be extracted from routinely measured ICU data due to the previously mentioned reasons [24].
In this paper, we demonstrate how a VP modeling framework can be applied to large ICU patient cohorts pooled from different hospitals to reduce dataset bias and to infer parameters approximating patients' disease states.First, we show how a mechanistic VP model can be used to derive model parameters of individual patients with suspected which comprise model-derived data.Secondly, we show how these data can be further utilized to improve clustering quality and discover medically relevant patient subpopulations.
A comprehensive physiological model, that was used in this study was already validated against real patient data [40], [41].However, in the current study, the simulator was firstly used to create a large (>1000 patients) cohort of virtual patients based on the retrospective observational data pooled from different hospitals.VP model fitting to real ICU patients showed a reasonable fitting quality.Acceptable fit in both time windows was observed for 81.7% of the patients in the cohort.The larger ratio of patients with acceptable quality of fitting in the first window can be explained by the fact that 11 parameters were optimized in the window 1, whereas only 1 parameter, namely n cc , was determined in the window 2. Therefore, reliable model-derived data were obtained for 823 patients.The optimization was performed separately for 2 time windows, which allowed to parameterize a patient in a steady non-ARDS state (window 1), and then track the ARDS development by changes in the number of closed compartments.The optimization using the data from both time windows together to parametrize a VP would potentially enable a better average fit in the time windows.However, this parametrization would correspond to an "average" state and would not allow to follow the progression of the ARDS.Moreover, the optimization of VP parameters other than n cc in the window 2 would potentially allow a better fitting quality in that window.Thus, in the future studies our modeling approach can be improved by allowing other VP parameters to vary within physiologically meaningful ranges during ARDS development, which might improve quality of ARDS modeling.The cohort of patients for whom acceptable fitting quality could not be achieved is of particular interest for further research.On the one hand, our approach for ARDS simulation integrates several assumptions and cannot guarantee an accurate approximation of all pathophysiological processes of ICU patients.On the other hand, the virtual patient model itself may be limited and fail in modeling certain states of ICU patients.For instance, we found that the cohort of patients with low fitting quality is characterized by significantly lower end-inspiratory pressures in the window 2. However, no clinical condition was found to be enriched in this cohort.Nevertheless, further research is needed to fully inspect reasons for low fitting quality.
To demonstrate the utility of the obtained model-derived data, we used a classic unsupervised learning approach, namely clustering.We compared the clustering on original data vs. clustering on inferred model-derived data.Intermediate clustering quality was observed in the clustering on original data, meaning that the consensus clustering method was struggling to split a full cohort into homogeneous groups and find a stable configuration of clusters.In contrast, clustering on model-derived data revealed significantly better clustering quality for all configurations of number of clusters.
More importantly, clustering based on the original data was strongly affected by the diversity of underlying hospitals.In all discovered clusters, patients from a particular hospital were significantly over-represented.In 4 out of 5 clusters, such enrichment was found to be the most significant for that cluster.These observations indicate that clustering on observed data is dominated more by the hospital source and much less by underlying medical conditions.Therefore, clustering on the pooled data is biased by the data source and does not allow to find mixed subgroups of patients.This finding is even more striking given the fact that we did not use external ICU datasets, e.g., MIMIC, HiRID, or AmsterdamUMCdb, for this study, which could have covered different patient populations.All patients in this study satisfied the same strict inclusion criteria and were later filtered and chosen according to uniform rules.For instance, chest X-ray data were not available during the study, which represented the main limitation for the retrospective ARDS diagnosis in the cohort.However, clustering on model-derived data obtained from each of the virtual patients allowed us to find 2 clusters of mixed hospital origin, i.e., clusters without over-representation of any underlying hospital.Moreover, although significant enrichment with respect to the hospital was still present in 3 out of 5 clusters, its magnitude was much less than in the clustering on original data (see Fig. 3).
These findings support the main characteristic of the VP models, namely the ability to identify relevant data patterns and infer individualized model parameters approximating the disease state from underlying data by leveraging mechanistic physiological principles while simultaneously avoiding an excessive level of detail.
Another interesting observation was that clustering on original measured data was not able to find a subgroup of "true" diagnosed ARDS patients.Partially, these patients were uniformly distributed among discovered clusters and did not form a separate group with typical ARDS properties, e.g., an impaired oxygenation or high driving pressures for MV.In contrast, clustering on model-derived data was able to discover a cluster with significant ARDS over-representation and clinical properties, which resemble those of ARDS patients.This finding is especially important in the context of unreliable ARDS labeling in retrospective data.Insufficient quality of labeling represents an additional factor that contributes to impaired generalization of AI/ML models developed on retrospective ICU data.the proper development of ML models for ARDS diagnosis and prediction, such models have to be trained on reliably labeled data.On the one hand, patients labeled with ARDS ICD codes still represent a lower bound on the number of true ARDS cases, as large numbers of ARDS patients are not diagnosed [35], [36], [37].On the other hand, reliable retrospective labeling constitutes a challenging task, since diagnosis according to the Berlin definition requires the clinical appraisal of certain conditions, such as hypervolemia, which are not assessable retrospectively.This lack of data is a critical point also for the future work on ARDS.A formalization of fluid overload is a challenging task, since there is no metric which is measured routinely to classify the fluid status of a patient.For instance, a cumulative fluid balance is not suitable to conclude on a hypervolemia.Thus, it remains a clinical appraisal which needs to be assessed at the bedside.Datasets containing this information are highly desirable for the future work on ARDS.However, they are not available yet and their generation would be quite laborious.Thus, it is questionable if they will ever reach the required size to be used in ML algorithms.Moreover, medical imaging data are frequently lacking in retrospective databases with observational ICU data.However, even if imaging data are available, reliable identification of the ARDS event remains a challenge due to a high interrater variability in chest imaging [57].Finally, studies on the development of AI models for ARDS are utilizing diverging rules to retrospectively label ARDS patients [58], [59], [60].
All patients in the cohort under consideration had a time point (suspected ARDS onset), when a part of the Berlin definition which accounts for the impaired oxygenation was satisfied.Presence of "true" ARDS patients in the cohort was guaranteed by the fact, that some patients had ICD-10 code for diagnosed ARDS.However, some of the patients might have had ARDS, but were not diagnosed and therefore lacked the ICD-10 code for ARDS, since it is known that a relevant number of ARDS cases stays undiagnosed.Therefore, the "true" ARDS cohort would have consisted of these two groups of patients: the "true positives" and "false negatives".Our hypothesis was that the patients from these two groups would be similar to each other and form a shared cluster in the clustering procedure.However, that was not the case for the clustering on original measured data, as none of the discovered clusters was enriched with diagnosed ARDS patients.Clustering on measured data was therefore not able to differentiate between ARDS patients and patients with other conditions, that could have led to decreased Horovitz index.In contrast, through clustering on model-derived data we were able to discover a cluster with significant ARDS over-representation and clinical properties, which resemble those of ARDS patients.At the same time this cluster was not enriched with other pathological conditions, which often have similar clinical picture, such as for instance Heart Failure [61].Furthermore, this ARDS cluster had the largest increase in the number of closed compartments (n cc ) in the model, which fully supports our approach of modeling ARDS by introducing closed alveolar compartments.Our findings suggest that the identified ARDS cluster might also include those ARDS patients which were not diagnosed by the ICU staff.Therefore, this approach could be additionally used to identify non-diagnosed ARDS patients, although further research and retrospective validation is needed to prove this hypothesis.
Our study has some limitations that have to be considered.First, as the actual ARDS clinical diagnosis time was not present in underlying data, the ARDS onset was identified retrospectively based on the Horovitz index.Potential availability of the ARDS diagnosis time would allow precise identification of the time windows for fitting of the VP model (at least for the diagnosed ARDS patients) enabling identification of more reliable VP configurations in future studies.However, to the best of our knowledge, no available database of clinical data contains clinical diagnosis timestamps.Therefore, datasets containing this information will have to be created from the ground up.Second, parameters of the virtual patients that were identified in the window before suspected ARDS onset were assumed to stay constant in the observation window of 2 days.This is only partially true, as most of the identified parameters are changing with time.Therefore, our approach to model ARDS development represents a significant simplification of the complex pathophysiological processes, which are happening during this critical condition.However, in our opinion, it covers the most important clinical manifestation of ARDS and can be used as the first approximation for the modeling.Moreover, our ARDS modeling approach was validated by the fact that the ARDS cluster, which was discovered in the data, had the largest increase in number of closed compartments, as expected.Nevertheless, VP modeling has the potential to infer additional information about the patient status which was not used in this study.For instance, by introducing physiologically meaningful changes in other VP parameters during ARDS development, one might significantly improve quality of ARDS modeling.However, it should be noted that model-derived parameters represent a virtual entity.Therefore, detailed clinical evaluation and validation should be performed before they are used in any support systems at the bedside.
Extensive data requirements and complexity of the fitting process of the VP model constituted additional limitations of the study.The former did not allow us to use all available patient data and was the reason for the significantly lower number of patients in the final analysis cohort compared to the initial cohort (see Table 1).It must be considered that to reach the aim to create a sufficiently large dataset for the analysis, not only data collected during the current project but also a historical dataset (Hosp B) were included.It cannot be ruled out that patient populations or therapeutic concepts have changed over the years introducing additional bias into the analysis.However, this limitation reflects the real-world situation, as ML models are mostly developed on retrospective datasets with some temporal separation from datasets, where such models are intended to be used.Furthermore, this limitation does not influence the overall conclusions of the study, as enrichment of a similar magnitude was observed with respect to the Hosp B and the other 4 hospitals (see Supplementary Table 1).The latter limitation required the use of the computing cluster the optimization procedure.Although our approach was limited only to the identification of at most 11 parameters for each of the virtual patients, it required the use of advanced global optimization algorithm and significant computational resources.Matching of the simulator to individual patient data and further analysis was performed on the computational cluster of the RWTH Aachen University using 10 nodes with 40 cores each, 2.66 GHz, 4 GB RAM.The longest runtime for one simulation comprised 5 min.Optimization for each patient required repetitive (100 iterations) simulation for multiple time points in each of the 2 windows.Therefore, the overall matching procedure took on average several days of computational time.All this still tremendously complicates a straightforward implementation of such methods at the bedside.
In general, VP modeling possesses further limitations, restraining its applicability in real-world setting.First, it requires complex validation of the developed models.Second, VP models are usually limited to an organizational level of the human body and do not consider the influence of exogenous covariates, e.g., preexisting diseases, lifestyle, genetic predispositions, or environmental influences [24].

V. CONCLUSION
In this study we have shown how a mechanistic VP model can be used to infer parameters approximating disease states of individual patients with suspected ARDS from observational data of mixed origin.Our results support the hypothesis that mechanistic modeling can be used to significantly reduce biases in data, introduced by pooling of data from different hospitals and to allow a discovery of patient cohorts driven exclusively by medical conditions.Overall, the continuous development of hybrid modeling approaches integrating diverse computational technologies, continuing increases in computational power, and ever-growing numbers of available datasets leads to the expectation that these technologies will make a significant contribution to precision medicine, with benefits for patients, physicians, and the healthcare system as a whole.

FIGURE. 1 .
FIGURE. 1. Quality of fitting the simulator to real patient in the time window before suspected ARDS onset (a) and after suspected ARDS onset (b) Cohort of 1007 patients with suspected ARDS.Acceptable quality of fitting (simulator outputs within 2 standard deviations of measured data) was observed for 95.9% patients in the window before suspected ARDS onset and for 84.5% patients in the time window after suspected ARDS onset.

FIGURE. 2 .
FIGURE. 2. Clustering quality for different numbers of clusters for clustering on original measured data (orange line) and model-derived data (blue line) data.Mean clustering quality with 95 % confidence intervals over repeated (100 times) clustering on subsample (80%) of dataset is shown.Mean clustering quality and results of a two-tailed Student's t-test for mean quality of clustering are given in Table2.
3(b) and Supplementary Table