Real-Time Prediction for Neonatal Endotracheal Intubation Using Multimodal Transformer Network

Neonates admitted to neonatal intensive care units (NICUs) are at risk for respiratory decompensation and may require endotracheal intubation. Delayed intubation is associated with increased morbidity and mortality, particularly in urgent unplanned intubation. By accurately predicting the need for intubation in real-time, additional time can be made available for preparation, thereby increasing the safety margins by avoiding high-risk late intubation. In this study, the probability of intubation in neonatal patients with respiratory problems was predicted using a deep neural network. A multimodal transformer model was developed to simultaneously analyze time-series data (1–3 h of vital signs and Fi<inline-formula><tex-math notation="LaTeX">$\text{O}_{2}$</tex-math></inline-formula> setting value) and numeric data including initial clinical information. Over a dataset including information of 128 neonatal patients who underwent noninvasive ventilation, the proposed model successfully predicted the need for intubation 3 h in advance (area under the receiver operator characteristic curve = 0.880 <inline-formula><tex-math notation="LaTeX">$\pm$</tex-math></inline-formula> 0.051, F1-score = 0.864 <inline-formula><tex-math notation="LaTeX">$\pm$</tex-math></inline-formula> 0.031, sensitivity = 0.886 <inline-formula><tex-math notation="LaTeX">$\pm$</tex-math></inline-formula> 0.041, specificity = 0.849 <inline-formula><tex-math notation="LaTeX">$\pm$</tex-math></inline-formula> 0.035, and accuracy = 0.857 <inline-formula><tex-math notation="LaTeX">$\pm$</tex-math></inline-formula> 0.032). Moreover, the proposed model showed high generalization ability by achieving AUROC 0.890, F1-score 0.893, specificity 0.871, sensitivity 0.745, and accuracy 0.864 with an additional 91 dataset for testing.


I. INTRODUCTION
E NDOTRACHEAL intubation is used for sustaining life in premature and term neonates with respiratory difficulty in the neonatal intensive care units (NICUs). However, intubation is accompanied by serious adverse events, including increased morbidity, reintubation, subglottic stenosis, laryngeal injury, ventilator-associated pneumonia events, swallowing and speech impairment, and tracheobronchitis [1], [2], [3], [4]. In particular, for neonates, attempting endotracheal intubation is dangerous compared to adults because they have a small airway [3], hence, it needs to be cautious when attempting endotracheal intubation in neonates. Meanwhile, predicting endotracheal intubation is challenging in neonates because they are in a physiologically unstable period with underdeveloped immunity and fluctuating cardiopulmonary status. Even though several studies on adult intubation have been already performed [1], [5], [6], [7], [8], considerable efforts are still required to develop models for neonates. Because clinical characteristics for adults have a very different range and pattern from those for neonates [9], [10], [11], applying an adult endotracheal intubation model to neonates has limitations. Clark et al.'s study [12] on the prediction of endotracheal intubation in neonates focused on predicting intubation within a relatively broad period (> 24 h), which did not provide precise information on when to perform intubation. Therefore, the purpose of this study is to provide real-time prediction on neonatal intubation within the next 3 h. By accurately predicting the need for intubation in real-time, additional time may be made available for preparation, which can help increase the safety margins by avoiding late intubation for high-risk neonates [1].
To develop real-time prediction model, the multimodal transformer based on deep neural networks was utilized to analyze both 1-3 h of time-series data and numeric data including risk factors for RDS [12], [13], [14] such as prematurity, cesarean section, pregnancy-induced hypertension, male sex, maternal diabetes mellitus, and multiple births. The deep neural networks have been widely applied for prognosis prediction in NICUs with great success [15], [16], [17], [18]. Feng et al. [15] used long short-term memory (LSTM) networks to predict the real-time mortality risk of preterm neonates in the initial stage of NICU hospitalization. Jia et al. [16] proposed a convolutional neural network (CNN)-based model that used routinely recorded neonatal patient information to predict weaning from mechanical ventilation. Recently, the transformer with the selfattention mechanism, introduced by Vaswani et al. [19], enables the model to compute contextualized representations of the input sequence, while the feed forward neural network is used to apply non-linear transformations to the representations. The transformer has achieved state-of-the-art results in a variety of natural language processing tasks, and has been particularly useful for time-series analysis tasks such as forecasting, anomaly detection, and classification. The transformer networks successfully utilized to efficiently capture disease-medication associations within their temporal context, and outperformed the CNN and LSTM in forecasting time series problems [20], [21], [22], [23], [24], [25].
In addition, the multimodal approach was used to enhance the model performance by providing a comprehensive understanding of clinical data [26], [27], [28], [29], [30]. Shamout et al. [26] proposed a multimodal approach to predict the deterioration of COVID-19 patients in the emergency department. The multimodal used a DNN that learns from chest X-ray images and a gradient boosting model that learns from routine clinical variables, and this model outperformed each single-modal prediction. Sano et al. [31] developed a sleep pattern detector that uses multimodal data acquired from smartphone and wearable technologies to detect sleep patterns with the bidirectional LSTM, and the results were statistically superior to those of non-temporal machine learning algorithms and state-of-the-art actigraphy software.
In this paper, we conducted a retrospective study with datasets of 128 neonates in NICU who experienced respiratory distress and developed a multimodal network to distinguish neonates who require intubation 3 h in advance. The multimodal network consisted of two subnetworks: a multilayer perceptron (MLP) block for numeric data analysis, and a transformer block for time-series data analysis. The feature vectors obtained by two blocks were concatenated and fed to the fully connected layer to calculate the intubation probability. The experimental results demonstrated that the average metrics for the proposed model were as follows: area under the receiver operating curve (AUROC) = 0.880, F1-score = 0.864, sensitivity = 0.886, specificity = 0.849, and accuracy = 0.857. The contributions of this research can be summarized as follows:

A. Dataset
A dataset was obtained from Chungbuk National University Hospital (Cheong-Ju, Korea) between June 1, 2020, and March 11, 2023. This retrospective study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Chungbuk National University Hospital (IRB no. CBNUH 2021-02-034-001), and informed consent was waived. The dataset collected in this study includes the personal information of neonates and their mothers. This dataset will only be used if approved by the corresponding author. The dataset on neonatal patients was manually collected from electronic medical records (EMRs). As described in Fig. 1, we found a list of 577 neonatal patients who were admitted to the NICU at Chungbuk National University Hospital. Patients without respiratory distress (288), patients who were admitted more than 48 hours after birth (53), and patients who had already undergone intubation at the time of admission (73) were excluded from the list. Then, we collected initial clinical and time-series data for 163 patients on the list, and excluded five patients who had attempted intubation after 12 h of admission and those with missing data, defined as more than 2 numeric data or 10% of time-series data, resulting in a total of 128 patients' data included. Because the initial tabular data such as body temperature and blood gas analysis results (pH, PCO2, PO2, BE, and lactate) could gradually recover or worsen over time, and may not be representative of the patient's condition in the long term (> 12 h). Accordingly, of the 128 patients, 36 neonates with intubation records were classified as intubated (or positive) patients, and the remaining 92 data were classified as non-intubated (or negative) patients. Before analysis, all positive and negative data were randomly shuffled and subsequently utilized for the analysis of the prediction model. The results of the statistical comparison between intubated and non-intubated neonates showed that the intubated patients had lower pH and pulse oximetry (SpO 2 ), higher PCO 2 and fraction of inspired oxygen settings (FiO 2 ), and faster heart rates ( Table I).
The dataset includes 19 numeric features (gestational age (GA), birth weight, Apgar scores at 1 and 5 min, sex, cesarean delivery, antenatal steroid use, pregnancy-induced hypertension, gestational diabetes mellitus, premature membrane rupture, outborn delivery, multiple births, initial body temperature, clinical risk index for babies (CRIB-II) score and parameters in initial capillary blood gas analysis (CBGA) including PO 2 , PCO 2 , base excess (BE), lactate, and pH) and four time-series features (HR, RR, FiO 2 , and SpO 2 ). The time-series data points were recorded in 1 h intervals and included records before the intubation attempts (or the last vital sign record). If intubation was performed less than 3 h after admission, the data recorded immediately after the admission were considered. In the control case, we randomly selected the total observation time and gathered time-series data as in the intubation case.
The numeric and time-series data were preprocessed as follows: The missing values in the numeric data and time-series data were filled with the average values and the most recent data, respectively. Then, the data were normalized via (1), wherex is the normalized value, and μ x and σ x denote the average value and standard deviation of x, respectively. B. Model Architecture 1) Multimodal Transformer: This section describes the multimodal transformer to predict the intubation probability of NICU patients. The proposed model uses two types of data: numeric data including initial clinical information, and time-series data including the vital signs and FiO 2 setting value of a patient. Fig. 2 schematically illustrates the multimodal transformer. The model has an MLP block and a transformer block to analyze numeric data and time-series data, respectively. The feature vectors obtained through the MLP and transformer blocks are concatenated and fed to the additional MLP block to calculate the intubation probability.
Each MLP block consists of a dense layer (D MLP = 32), a batch normalization layer, an activation layer (ReLU for numerical analysis and sigmoid activation for outcome prediction), and a dropout (0.2) layer. The transformer block consists of a transformer encoder including the input embedding network, multi-head attention network, and feedforward network, as illustrated in Fig. 2. In the input embedding network, the time-series data (input i ) are processed by the input embedding network to encode the time sequence of each data point. Specifically, input i is multiplied by the positional encoding function (P E i ). P E i is represented as sine and cosine functions with different frequencies, where i and d denote the position and dimension of the time-series data, respectively, and the embedding model dimension is D embed = 32 (2) and (3).
The positional-encoded feature vector and time-series data processed via the dense layer followed by Gaussian error linear unit (GELU) activation are added at the end of the stack in the input embedding network. In the multi-head attention network, the embedded feature vector is copied as the query (Q), key (K), and value (V ). The dot product of Q and K is obtained and scaled to determine the attention weight via the softmax function. The value is multiplied by V (4) to obtain a new sequence (z), where α ij is the weight calculated by softmax, and W q , W k , W v ∈ R d×d are the layer-specific weights for Q, K, and V, respectively. Each attention head processes an input sequence x = (x 1 , . . ., x n ) of n elements, with x i ∈ R d , and calculates the output sequence z = (z 1 , . . ., z n ) of the same length, where z i ∈ R d (5).
C. Model Training 1) Model Parameters: In a binary classification problem, if one class has significantly more instances than the other, then the model may be biased toward the majority class. To address the problem of data imbalance, one approach is to use loss weighting factors [32], [33], [34]. The idea is to assign different weights to the loss function for each class based on the class frequency in the dataset. Since the imbalanced dataset including 36 positive cases and 92 control cases was utilized in this study, we adjusted the loss weighting factor to prevent learning bias towards the more frequent class. Accordingly, we assigned a weight of 0.8 to the positive class and 0.2 to the negative class during training based on Bayesian Optimization. In addition, we used the Adam optimizer with β 1 = 0.9, β 2 = 0.999, and an exponential learning rate decay with 100 steps. The initial learning rate was 5 × 10 −5 , and the decay rate was set as 0.96. The model was trained for 2,000 epochs.

3) Performance Evaluation:
We performed 4-fold crossvalidation to ensure the reliability and consistency of the model performance. In each fold, the dataset was split into training (75%) and validation (25%). In the training and validation datasets, the positive/negative ratios were maintained as approximately 1:4. We evaluated the AUROC, f1-score, sensitivity, specificity, accuracy, ROC curve, and confusion matrix over the validation datasets. The statistical calculation was performed using a scikit-learn library (https://scikit-learn.org/).

A. Data Optimization 1) Optimization of Time-Series Data:
The proposed model is trained to predict the need for intubation a certain period cutoff time (t c ) in advance. For constructing the dataset, data at a certain time point were labeled as either intubation (positive) or non-intubation (negative). The criterion for the labeling was whether the data were included in t c before each neonatal intubation in NICU or not. If data were included in t c , the data were labeled as positive (Fig. 3). To evaluate the effect of t c on the model performance, we conducted extensive experiments in which t c was set as 1, 3, 5, and 12 h. In our experiment, the model performance was improved as t c was set to shorter. The t c with 1 h corresponded to the best performance (AUROC = 0.892, F1-score = 0.875, sensitivity = 0.858, specificity = 0.853, and accuracy = 0.853), followed by the t c with 3 h (AUROC = 0.880, F1-score = 0.864, sensitivity = 0.886, specificity = 0.849, and accuracy = 0.857) (Fig. 4). The model exhibited a poor  (Table II).
2) Multimodality Data: We used two subnetworks to simultaneously process the numeric and time-series data. To analyze the effect of each subnetwork on the model performance, we performed an ablation study, the results of which are summarized in Table III. The model using only the numeric data without CBGA scores exhibited a poor performance (AUROC = 0.710, F1-score = 0.673, sensitivity = 0.826, specificity = 0.643, and accuracy     CBGA scores was comparable to that of the model using CBGA scores (AUROC = 0.862, F1-score = 0.818, sensitivity = 0.902, specificity = 0.780, and accuracy = 0.807).
The LR and SVM models with a Gaussian radial basis function (RBF) were performed using scikit-learn library. The XGBoost regressor was implemented using XGBoost library. The subsample ratio of columns and max-depth were set as 0.8 and 8, respectively. For the model training, the learning rate was set as 10 −4 , and the tree estimator was set to 100. To address the problems of data imbalance and overfitting issues, the weights of the positive class and gamma were set as 2 and 1, respectively. These hyperparameters were optimized through a grid search method with 4-fold cross-validation. Before being fed to the LR, SVM, and XGBoost models, the time-series data were flattened and concatenated with the numeric data. For the MLP and LSTM models, the transformer block was replaced  Fig. 6). These models were implemented with the Keras library. The MLP block consisted of a hidden dense layer (D MLP = 128) with rectified linear unit (ReLU) activation. The time-series data were flattened and fed to the input MLP block used instead of the transformer block. The LSTM block included an LSTM layer (D LSTM = 128). Table IV highlighted that the proposed model achieved the best performance (AUROC = 0.880, F1-score = 0.864, sensitivity = 0.886, specificity = 0.849, and accuracy = 0.857). This Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

IV. DISCUSSION
As coronavirus disease 2019 (COVID-19) pandemic spreads around the world, respiratory-related studies have been introduced using machine learning and deep learning. To be specific, Varzaneh et al. [8] developed a decision tree-based model to predict the intubation risk of hospitalized patients with an accuracy of 0.93. Bolourani et al. [50] suggested the XGBoost model that predicts respiratory failure in admitted patients within 48 h of admission using a dataset of the emergency department with the highest mean accuracy of 0.919. Siu et al. [1] also successfully predicted adult intubation using RF model with a freely available Medical Information Mart for Intensive Care (MIMIC) dataset with an AUROC of 0.87. However, only scarce literature targeting neonates or infants has been reported. Previous studies [12], [51] proposed simple logistic models to predict intubation (an AUROC of 0.84) or late respiratory support (AUROCs of 0.801-0.881) for very low birth weight infants. Since the recent studies have focused only on whether to attempt intubation or respiratory support, developing a real-time model was needed to identify the timing of intubation, which requires a high burden and cost for medical staff. Furthermore, the Clark et al. [12] requires vital sign collected at 2-second intervals requiring automatic extraction features from electronic medical records (EMRs), but the different clinical data formats (CRFs) between hospitals and countries makes it impractical for use in the majority of hospitals. Additionally, our study targets all term and preterm neonates with respiratory diseases, thus it is inevitable to collect additional data and develop an optimized model accordingly. To develop a real-time prediction model for intubation, the advanced model architecture needs to be applied Compared to the traditional machine learning methods such as logistic regression, SVM, and RF, the transformers have shown a great ability for interactions in sequential data, and have widely adapted to a variety of time-series tasks such as forecasting [52], [53], [54], anomaly detection [55], [56], and classification [57], [58]. Therefore, in this study, a transformer architecture [19] was adapted for analyzing time-series data and an MLP block was added for comprehensive analysis with numeric data, resulting in the development of a multimodal transformer capable of effectively analyzing both time-series data and numeric data simultaneously.
In this study, the proposed model provides prediction of intubation probabilities using multi-modal information including 1-3 h of vital signs with FiO 2 setting value and initial clinical variables. To develop and evaluate this model, we established a dataset including the information of 128 neonatal patients under NIV support and validate the multimodal model that could classify neonates requiring intubation 3 h in advance. We focused on developing a real-time system capable of predicting short-term prediction because our dataset included 75% of intubated neonatal patients who attempted intubation 3 hours after NICU admission. Fig. 5 demonstrates that the predicted intubation probability reflects the historical trends of time-series data. In the Fig. 5(a), the respiratory rate (RR) increased to 104 beats/min (normal range of RR: 30-60 beats/min) between 9-10 h after NICU admission. During this period, the predicted intubation probabilities increased (50%-56.5%). In Fig. 5(b), the FiO 2 drastically increased to 35-41% (FIO 2 is typically maintained at 21%) between 3-5 h after NICU admission, and the predicted intubation probability approached 85.4%. Fig. 5(c) represents the probability increased slightly to 54.5% at 1 h, and sharply increased to 86.4% at 4 h after admission to the NICU, which is similar to the trend of time-series data. The RR was temporarily decreased by 20 beats/min at 1 h, and the FiO 2 increased rapidly to 40% after 4 hours of admission to the NICU. Fig. 5(d) shows that the predicted intubation probability was maintained at 70.7% or higher. During this period, the RR (69-89 beats/min) and FiO 2 (30-35%) were out of the normal range. In addition, we also present the prediction results of nonintubated patients (Fig. 5(e)-(h)). Fig. 5(e) shows fluctuation of the HR (112-160 beats/min) (normal range of HR: 120-160 breaths/min). Fig. 5(f)-(h) shows the RR values were often appeared to deviate normal range. However, we observe stable trends in the FiO 2 and SpO 2 , and confirm that the predictive probability of intubation was also stable.
The ablation study was conducted to optimize the model architecture, and the model using both numeric and time-series data achieved the highest performance. The model with 18 clinical data without CBGA scores (PO 2 , PCO 2 , BE, lactate, and pH) also showed comparable predictive performance (AUROC = 0.862) with one using CBGA scores (AUROC = 0.880). This model could be also practically used in institutions and hospitals because the CBGA requires expensive analyzing instruments and trained technicians.
We did further study to test the proposed model using a recently obtained test dataset from 91 neonatal patients, including 21 intubated patients and 70 non-intubated patients, in addition to the existing data of 128 patients used for model development.
The proposed model that achieved the highest performance was utilized for testing. Using this model, we performed endotracheal intubation predictions for 91 neonatal patients at hourly intervals. This result was validated by checking whether the neonates had a record of endotracheal intubation within the next 3 hours. As a result, the model achieved a performance of AUROC 0.890, F1-score 0.893, specificity 0.871, sensitivity 0.745, and accuracy 0.864. These results, as compared by the cross-validation results, demonstrate that the proposed model has high generality ability, though additional research may be needed to improve sensitivity.
For future work, ensemble learning [65], [66] could be considered to improve the accuracy and robustness of a decision-making system by combining the outputs of multiple models. In addition, state-of-art feature selection technologies [67], [68], [69] could be applied to reduce the dimensionality of the dataset, simplify the model, and improve its accuracy and generalization ability.

V. CONCLUSION
We have developed a real-time prediction system using a multimodal transformer network to predict neonatal endotracheal intubation 3 hours in advance. The proposed model has outperformed traditional machine learning models with AUROC = 0.880, F1-score = 0.864, sensitivity = 0.886, specificity = 0.849, and accuracy = 0.857, and reflected the historical trends of vital signs and FiO 2 setting value well. Furthermore, the proposed model demonstrated high generalization ability by showing AUROC 0.890, F1-score 0.893, specificity 0.871, sensitivity 0.745, and accuracy 0.864 with an additional 91 collected data for testing. To find the optimal model architecture, we have conducted the ablation study. It has been desirable to simultaneously process the numeric and time-series data compared to the single-modal network. In addition, we have performed extensive experiments to select the most appropriate cutoff time among t c = 1, 3, 5, and 12 h, and the appropriate t c was selected as 3 h. We expect that the proposed model could be practically applied to neonates at high risk for respiratory failure by using 23 factors readily available in hospitals.

APPENDIX A ARCHITECTURE OF MULTIMODAL MLP AND LSTM
The multimodal MLP (Fig. 6(a)) and the multimodal LSTM ( Fig. 6(b)) utilize two-type inputs: (i) the numeric data passes through a multilayer perceptron (MLP) block, and (ii) the time-series data passes through MLP and LSTM block, respectively. The representations extracted by the two branches are concatenated into a feature vector, which is passed through an MLP model.