Machine Learning Methods for Neonatal Mortality and Morbidity Classification

Preterm birth is the leading cause of mortality in children under the age of five. In particular, low birth weight and low gestational age are associated with an increased risk of mortality. Preterm birth also increases the risks of several complications, which can increase the risk of death, or cause long-term morbidities with both individual and societal impacts. In this work, we use machine learning for prediction of neonatal mortality as well as neonatal morbidities of bronchopulmonary dysplasia, necrotizing enterocolitis, and retinopathy of prematurity, among very low birth weight infants. Our predictors include time series data and clinical variables collected at the neonatal intensive care unit of Children’s Hospital, Helsinki University Hospital. We examine 9 different classifiers and present our main results in AUROC, similar to our previous studies, and in F1-score, which we propose for classifier selection in this study. We also investigate how the predictive performance of the classifiers evolves as the length of time series is increased, and examine the relative importance of different features using the random forest classifier, which we found to generally perform the best in all tasks. Our systematic study also involves different data preprocessing methods which can be used to improve classifier sensitivities. Our best classifier AUROC is 0.922 in the prediction of mortality, 0.899 in the prediction of bronchopulmonary dysplasia, 0.806 in the prediction of necrotizing enterocolitis, and 0.846 in the prediction of retinopathy of prematurity. Our best classifier F1-score is 0.493 in the prediction of mortality, 0.704 in the prediction of bronchopulmonary dysplasia, 0.215 in the prediction of necrotizing enterocolitis, and 0.368 in the prediction of retinopathy of prematurity.


I. INTRODUCTION
Over 15 million babies are born preterm every year, and while their mortality and morbidity rates have been decreasing in recent decades, preterm birth is still the worldwide leading cause of childhood mortality under the age of five [1].Increased risk of mortality and morbidity among neonates is associated with low birth weight and low gestational age [2].Very low birth weight (VLBW) infants, that is, those with birth weight under 1500 g, which are treated in neonatal intensive care units (NICUs) in the Western Europe and in the USA, have a mortality rate around 11% [3].Furthermore, The associate editor coordinating the review of this manuscript and approving it for publication was Gang Li .many of the survivors develop severe complications such as neonatal sepsis [4], bronchopulmonary dysplasia (BPD) [5], or necrotizing enterocolitis (NEC) [6].These and other complications can also inflict long-term or permanent morbidities such as persistent pulmonary dysfunction in the case of BPD [5], gastrointestinal and neurodevelopmental problems in the case of NEC [6], or blindness in the case of retinopathy of prematurity (ROP) [7].Early detection of neonatal morbidities is of paramount importance to halt the progression of the disease, and for preventing further complications or even death [8].
In order to better assess the risk of mortality and neonatal illness, several scores have been proposed, such as the Apgar [9], SNAP [10], and SNAPPE [11] scores, and more recently the SNAP-II and SNAPPE-II [12] scores.Modern NICUs monitor the vital signs of neonates in an automated fashion, which allows for the collection of time series data of the physiological measurements.In our preliminary work, we observed that machine learning methods can enhance the performance of traditional SNAP-II and SNAPPE-II scores by leveraging time series data.The methodology leads to improved prediction of neonatal mortality [13] and neonatal morbidities of BPD, NEC, and ROP [14].In this study, we continue the mentioned preliminary work by presenting a comprehensive comparison of machine learning methods for prediction of mortality, BPD, NEC, and ROP in VLBW neonates.
The contributions of this article are i) to present a systematic study of machine learning methods for the prediction of neonatal mortality, BPD, NEC, and ROP, ii) to propose and analyze F1-score as an evaluation measure to AUROC in order to improve the sensitivity of the classifiers, iii) to analyze the methods using various lengths of time series data, and iv) to analyze the relative importance of different features using the random forest classifier, which we found to have generally the best performance in the classification tasks.This study was approved by the Ethics Committee of Helsinki University Hospital 115/13/03/00/14, dated April 8, 2014.

II. BACKGROUND
Bronchopulmonary dysplasia (BPD) is a severe chronic lung complication among the preterm infants resulting from the immaturity of the developing preterm lung and injuries associated with external effects, such as maternal intra-amniotic infection, mechanical ventilation, and excessive oxygen [15].The initial injury is often caused by respiratory distress syndrome (RDS) [15] or acute respiratory distress [5].However, oxygen and positive-pressure ventilation system used for treating these conditions worsens the injury and initiates the development of BPD [15].
Necrotizing enterocolitis (NEC) is a serious disease of the developing gastrointestinal tract.Early diagnosis of NEC is important in order to prevent the inflammation to progress to bowel necrosis and perforation, and eventually to death [8].There is also an economic aspect, as it has been estimated that the annual costs of the disease are between $500 million and $1 billion in the USA [16].
Retinopathy of prematurity (ROP) is a retinal disease associated with the combination of underdeveloped retinal vessels of preterm neonates and supplementary oxygen given to them [17].In addition, low gestational age and birth weight are known risk factors.Currently, the screening for ROP is performed by eye examinations and treated with laser [7] or with anti-vascular endothelial growth factor treatment [18].
SNAP-II and SNAPPE-II [12] are scores for predicting neonatal mortality and they are based on a simple logistic regression model.SNAP-II uses features extracted from the physiological measurements and laboratory results, such as mean blood pressure and lowest serum pH, during a 12 hour recording period.SNAPPE-II score additionally uses the birth weight, the gestational age, and the Apgar score.The Apgar score is reported at 1 minute and 5 minutes after birth for all infants, and at 5-minute intervals thereafter until 20 minutes for infants with a score less than 7, to assess the status of the infant and the response to resuscitation if needed.The Apgar score is a sum of five indices measuring the heart rate, respiratory effort, muscle tone, reflex irritability, and color of the infant [19].
Using machine learning for neonatal morbidity and mortality prediction has been previously explored in literature.Saria et al. [20] presented a model for predicting binary label of low/high morbidity where, high morbidity was defined as any of the following complications: sepsis, pulmonary hemorrhage, pulmonary hypertension, acute hemodynamic instability, moderate or severe BPD, ROP, NEC, intraventricular hemorrhage, or death.Their model utilized aggregated nonlinear Bayesian models and logistic regression, and their selected features included mean values, baseline variances, and residual variances of 3 hours of recorded heart rate, respiratory rate, and oxygen saturation.In addition, the gestational age and birth weight were used as inputs to the model.Our selected features are similar, as we use the mean and standard deviation of the heart rate and oxygen saturation, the gestational age, and the birth weight.However, our selected signals differ in that we do not use respiratory rate and we use additional features including systolic, diastolic, and mean blood pressure, as well as the SNAP-II and SNAPPE-II scores.We also consider a larger variety of popular classifiers for tabular data.Saria et al. [20] also considered the importance of physiological features.However, we analyze the features from the perspective of random forest feature importance, while in Saria et al. the importance was estimated using ablation analysis.We have also chosen to predict mortality and different morbidities separately as opposed to the combined high morbidity label used in the article.
In a more recent work, Podda et al. [21] presented multiple machine learning methods for predicting neonatal mortality in the terms of the probability of survival: logistic regression, k-nearest neighbor, random forest, gradient boosting machine, support vector machine, and neural network.Our work also considers logistic regression, k-nearest neighbor, random forest, and support vector machine, but we also use five additional classifiers and evaluate our classifiers in predicting different morbidities.In addition, our features differ from the ones used in Podda et al. [21].The models presented in Podda et al. [21] do not use physiological time series, but instead, they use the following data collected up to 5 minutes after birth: gestational age, birth weight, Apgar scores 1 minute and 5 minutes after the birth, sex, multiple gestation, mode of delivery, prenatal care, intra-amniotic infection, maternal hypertension, ethnicity, and antenatal steroids.Our dataset includes data up to 72 hours, from which we use multiple subsets in order to examine how the predictive performance of the models vary with the time series length.
Our work is a continuation of Rinta-Koski et al. [14] and Rinta-Koski et al. [13].These works presented preliminary results obtained for a small number of classifiers for mortality, BPD, NEC, and ROP prediction on a dataset collected in the NICU of Children's Hospital of Helsinki University Hospital.In Rinta-Koski et al. [14], Gaussian process classifiers were trained to predict BPD, NEC, and ROP, and the results were compared to the standard medical scores SNAP-II and SNAPPE-II.In Rinta-Koski et al. [13], neonatal mortality prediction was considered.The classifiers compared in the study were Gaussian process classifier, with three different kernels, support vector machine classifier, linear probit model, SNAP-II, and SNAPPE-II.

III. DATA
Our dataset consists of pseudonymized temporal and static data collected in the NICU of Children's Hospital, Helsinki University Hospital between years 1999 and 2013.This dataset is partially the same dataset used in [14] and [13].However, in this study, we have access to additional patients.We included all the infants admitted to NICU with birth weights under 1500 g (VLBW) in this study.Patients who died or were discharged before the age of 72 hours were excluded in order to prevent the explicit vital sign decay to affect the predictions.Also, patients with less than 50 measurements of any of the time series predictors were excluded.Patients with severe congenital anomalies were not excluded, as we wanted an inclusive sample and birth defects are an important cause of neonatal morbidity and mortality among the VLBW infants.The study included 977 patients, who fulfilled the inclusion criteria.Descriptive statistics of the set of included patients are presented in Table 1.
Temporal data includes physiological variables, such as heart rate and blood pressure, in the form of time series.Static data includes clinical information, such as gestational age and birth weight, medical scores SNAP-II and SNAPPE-II, diagnoses of the patients for BPD, NEC, and ROP, as well as information on the survival.There are up to 111 different sensor measurements, however, many, or even all measurements were missing for a large portion of the patients, and thus only a subset of the features was selected.The predictors were chosen to be the same as in [13]: systolic, diastolic and mean blood pressure, oxygen saturation and heart rate for temporal variables, and for static variables SNAP-II, SNAPPE-II, birth weight, and gestational age at birth.
We use these predictors because our preliminary work shows that the best result can be obtained when all these predictors are used in the classification.It is to be noted, that there is slight overlap in variables, as SNAP-II and SNAPPE-II scores also include information about the mean blood pressure, oxygen saturation, and birth weight.Relationship between SNAP-II and SNAPPE-II is also very close.However, the information in SNAP-II and SNAPPE-II is nonlinear in nature, due to the thresholding of the scores, and thus it is not redundant to include the constituent features.Using SNAP-II and SNAPPE-II is in a sense inclusion of prior medical knowledge on how the constituent variables should be processed.Analysing the individual constituents of SNAP-II and SNAPPE-II would be beneficial, as these scores might not be computed in every hospital, making the approach more general.This analysis is, however, left for future work.
The temporal variables were collected using sensor measurements throughout the stay of the patient in the NICU.The gestational age was expressed as completed days from mother's last menstrual period to delivery.In our clinical practice, voluntary early ultrasonographic examination is offered to all mothers, which is used for estimating the expected gestational age.If the difference between the gestational age, calculated from last menstruation period, and the expected gestational age is more than five days, gestational age is adjusted.
The sampling of the temporal variables is irregular in the hospital environment, meaning that the sampling of the sensor measurements is not synchronized.Our dataset also has a large number of missing values.We applied multiple preprocessing steps in order to standardize the data into a format suitable for machine learning methods, and also to examine the behavior of these algorithms in various circumstances.The non-synchronous data was transformed into another set, by splitting the time into 2-minute intervals and by filling missing data with the closest observation before the start of the interval, thus creating a synthetic synchronous dataset.
VLBW infants undergo an adaptation period after birth when their physiological recordings differ from the steady state attained later.We hypothesized that these initial adaptation period recordings might affect the classifier performance.As the age, when VLBW infants were admitted to NICU varied, we created two new sets from both the nonsynchronous and synchronous datasets with the first 6 hours of measurements excluded, in order to examine if the more VOLUME 8, 2020 123349 stable post-adaptation period recordings improve the performance of the classifiers.
At the end, we created multiple subsets from the previous 4 sets, by dividing the time series into 12, 18, 24, 36, 48, and 72 hours of data.Further exclusion of patients was conducted on the basis of observed sensor measurements.Measurements out of range of plausible values, defined by medical experts, were removed, and, again, at least 50 valid measurements per temporal variable were required to include the patient into the dataset.This minimum 50-valid-measurements rule causes different sets to have a different number of patients.
Detailed descriptions of the datasets produced by the different preprocessing steps are presented in the Supplementary material in Table 1a for mortality, Table 1b for BPD, Table 1c for NEC, and Table 1d for ROP.From the patients in our dataset, approximately 6-7% died, 28-29% had BPD, 3% NEC, and 8% ROP.
We also experimented with undersampling of the majority class during the training of the classifiers to overcome the class imbalance problem.This procedure is explained in the next section.Lastly, our preprocessing pipeline includes feature extraction and data standardization.Similarly to [13] and [14], we chose to use the means and standard deviations as our extracted features from the time series.We also normalized the data to have zero mean and unit variance, for each variable independently, using the mean and the standard deviation of the training set of each fold in order to prevent leakage between the sets.

IV. CLASSIFIER METHODS
Our selected methods include logistic regression, linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbor, support vector machine, three different Gaussian processes, and random forest classifier.
In logistic regression (LR), a linear model is combined with logistic sigmoid function to model the posterior probabilities of each class [22].The resulting model is nonlinear and there is no closed-form solution, hence iterative optimization methods like Newton-Raphson [23] or gradient descent [24] have to be used.The probabilities of the classes can be defined for a binary classification task as is presented in Equation (1) [23], bias term included in w for clarity: Linear discriminant analysis (LDA) classifier [25] is based on the assumption that the class-conditional densities p(x | y = 1) and p(x | y = 0) are Gaussian.In addition, LDA further assumes that the density functions share a common covariance matrix.Under these assumptions, the classification problem can be formulated as finding the optimal 1D projection for the data, governed by the class-conditional means and common covariance matrix, and finding the optimal classification threshold on this line.The equations for the optimal projection vector and the classification threshold on the projection, given uniform class priors, are presented in Equations ( 2) and (3), respectively, for binary case [25]: In these equations, µ n denotes the mean of the samples from class n, and is the common covariance matrix.Quadratic discriminant analysis (QDA) classifier [25] is a generalization of the LDA classifier, where the requirement of the common covariance matrix is relaxed into class specific covariance matrices.This relaxation also causes the decision boundaries between classes to have a quadratic form.The posterior probabilities of classes can be computed as in Equation (4), for uniform priors again for clarity [25]: In this equation, i denotes the index of the class, c of which is the predicted class, and µ i and i are the mean vector and covariance matrix of the class i, respectively.k-nearest neighbor (KNN) classification algorithm is a machine learning method, using a distance metric, such as Euclidean distance, to find the closest examples in the training set to a query point, and then classifying points based on these neighboring points [22].The posterior probability of a query point belonging to a specific class can be estimated as the proportion of neighbors belonging to that class.The class associated with the highest posterior probability can then be used as the prediction of the model [23].
Support vector machine (SVM) [22] is based on finding a hyperplane that separates the samples between the classes with maximal margin.If the classes are not linearly separable, so-called slack variables are needed for each data point, such that some may fall on the wrong side of the separating hyperplane.The slack variables are constrained to be non-negative, and the separating hyperplane is selected such that the sum of the slack variables is minimized.So-called kernel trick can be used to lift the problem into a higher-dimensional space [22].However, we do not consider the kernel SVMs in this work.
Gaussian processes (GP) [26] are Bayesian non-parametric machine learning methods, which can be used in classification and regression tasks.Gaussian process defines a Gaussian distribution over functions and it is parametrized by a mean function and a covariance function.In our experiments we use zero mean, and thus the covariance function defines our GP model.
In this work, three covariance functions have been utilized, each consisting of a sum of constant kernel, linear kernel, and one of the following: squared exponential kernel (k RBF (•, •)), Matérn kernel with ν = 3/2 (k Matérn32 (•, •)), or Matérn kernel with ν = 5/2 (k Matérn52 (•, •)).The three latter kernels are presented in Equations ( 5)-( 7) with parameter ν substituted with the relevant value: Bootstrap aggregation of multiple decision trees gives arise to a model called random forest (RF) [22].In RF, each tree is given a random bootstrap sample of the training dataset for which splitting rules are learned.RF algorithm also includes random sampling of the features at each split, in order to reduce the correlation between the trained trees.The predictions of RF are obtained by aggregating all the treepredictions together by averaging them [22].

V. EVALUATION CRITERIA
We evaluate our models by using two primary evaluation measures: the area under the receiver operating characteristics curve (AUROC) and the F1-score, also called the Sørensen-Dice coefficient and the similarity coefficient.AUROC [27] is a common measure used in machine learning method evaluation within the medical field, see for example [13], [14], [28], [29], and is therefore selected for evaluation.However, our dataset has a high class imbalance, and thus in this study, we propose the F1-score as an alternative evaluation measure to better reflect the classifier performance in detecting the positive class, that is, the minority class in our case.
As an example of using the AUROC measure for classifier selection with high class imbalance, the best classifier in the detection of mortality in our previous study [13] had a high AUROC of 0.948, however, the sensitivity was 0.463.This means that nearly 54% of the deaths are undetected by the classifier.Similarly in the detection of ROP in our previous study [14], the best classifier had a moderately high AUROC of 0.84 and a very low sensitivity of 0.05, which results in 95% of the ROP cases to be left undetected.We will show (see Section VI), when we have a class imbalance problem, selecting the best classifier based on the highest F1-score will significantly improve the sensitivity with possibly only a slight decrease in the AUROC value.The F1-score is computed as the harmonic mean of positive predictive value (PPV/precision) and sensitivity (recall) on a single operating point.
We also present results on the mean F1-score, defined as the mean of F1-score calculated for detection of positive class and for the detection of negative class, PPV, sensitivity, specificity, and accuracy.The equations for calculating the evaluation measures and their relevant intermediate results, other than AUROC, are the following Equations ( 8)-( 15): Here TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.NPV is abbreviation for negative predictive value.
Due to the small number of samples in our dataset, cross-validation was used.We used 8-fold cross-validation, repeated 8 times, similarly as in our preliminary work [13].Stratification was also used in the cross-validation, which ensures that a similar proportion of positive and negative samples are selected into each fold.Due to the high class imbalance present in our dataset, we performed additional experiments with class sub-sampling in our experiments.In this setting, we modified the training set of each crossvalidation fold to have the same amount of negative and positive examples by sampling the majority class without replacement.Predictions were performed on the entire test fold.
Parameter selection was conducted using a nested crossvalidation procedure, where the standard cross-validation training set was used in another inner cross-validation loop.The validation set of this inner loop was then used to estimate the generalization performance with the selected parameters.With this procedure we searched for the best parameters for RF and KNN.For RF these parameters were the number of randomly selected variables for each split, number of trees within the forest, and minimum number of observations for each leaf node.For KNN, the parameter was the number of neighbors.These parameters are presented in Table 2.
We used MATLAB's inbuilt functions to implement our experiments, except for the Gaussian process classifier, for which we used the GPstuff MATLAB toolbox [30].Evaluation measures were calculated utilizing Python framework Scikit-learn [31] for each model.

A. CLASSIFICATION EXPERIMENTS
In this section, we present the results of our experiments.We first present results with classifier selection based on the AUROC, similar manner to our previous studies.We continue by presenting results with classifier selection based on the F1-score and show how classifiers selected based on this measure relate to the classifiers selected with AUROC.Full classification results for classifiers selected based on the best AUROC are presented in Table 3, and for classifiers selected based on the best F1-score are shown in Table 4. Full graphical comparison between these experiments is presented in Figure 1.
In mortality classification (Table 3a), all classifiers except the QDA achieved over 0.9 AUROC.The RF had the highest AUROC of 0.922, however, with only a small margin to the Gaussian process classifiers with the GP-RBF having AUROC of 0.920, GP-M32 with 0.919, GP-M52 with 0.919, and to the KNN with AUROC of 0.918.The RF had also the highest F1-score of 0.477, in which the margin was larger, as the second best F1-score of 0.384 was achieved by the KNN.Graphical illustration of the results is presented in Figure 1a.
In BPD classification (Table 3b), no classifier achieved over 0.9 AUROC, the highest being the Gaussian process classifiers with 0.899 AUROC, however, only a small margin from the RF with 0.884 AUROC.A small performance gap can be seen in the F1-score, as the RF had the highest F1score of 0.704 and the second highest F1-score was achieved by the GP-M52 classifier with 0.687 F1-score.All the Gaussian process classifiers had virtually the same performance in this task.The logistic regression classifier had a similar F1-score as the Gaussian process classifiers with 0.686 value.Graphical illustration of the results is presented in Figure 1b.
In NEC classification (Table 3c), the RF had the best AUROC of 0.806, which was the only result over 0.8.The RF also had the best F1-score of 0.189.Interestingly, all the Gaussian process classifiers and the SVM classifier failed to detect any positive examples.Also, all the other classifiers except for the RF, Gaussian process, and SVM had over 0.6 sensitivity and less than 0.1 PPV, indicating high amount of false positives.Graphical illustration of the results is presented in Figure 1c.
In ROP classification (Table 3d), the Gaussian process classifiers and the RF classifier had all the best AUROC Experiment1 denotes the first experiment, where we selected the classifier for maximal AUROC, and the Experiment2 denotes the second experiment, where we selected the classifier for maximal F1-Score. of 0.846.However, the RF had slightly less variation in AUROC between the cross-validation folds and repetitions compared to the Gaussian process classifiers, indicated by TABLE 3. Classification results for representative classifier selected based on AUROC.The results are presented as the mean and in parentheses the standard error of each evaluation measure, over the cross-validation repetitions.Descriptors under the name of the method are in format: length of time series (hours) -irregularly or regularly sampled (I/R) -all data or exclude first 6 hours (A/E) -class distribution empirical or resampled to uniform (E/U).GP denotes the Gaussian process classifier, with RBF denoting the squared exponential kernel, M32 the Matérn kernel with ν = 3/2, and M52 the Matérn kernel with ν = 5/2.RF denotes the random forest classifier, KNN the k-nearest neighbor classifier, LR the logistic regression classifier, LDA the linear discriminant analysis classifier, QDA the quadratic discriminant analysis classifier, and SVM the support vector machine classifier.the 0.01 lower standard error.Similar to the NEC prediction, the Gaussian process classifiers failed to detect any positive examples and also the SVM had difficulties in detecting positive examples, reflected by the low sensitivity (0.005) and F1-score (0.009).On the other hand, the RF achieved the highest F1-score of 0.368 in addition to having the highest AUROC.Graphical illustration of the results is presented in Figure 1d.
In our second classification experiment we used the F1-score for classifier selection.The complete results for these experiments are presented in Tables 4a-4d.In general, when the classifiers were selected based on the highest F1-score, the AUROC values decreased slightly.However, more balanced PPV and sensitivity can be observed, due to the F1-score being the geometric mean of these.
In mortality classification (Table 4a), an example of the balanced sensitivity and PPV can be seen in the GP-RBF classifier, as sensitivity is increased from 0.157 to 0.845, with the cost of PPV decreasing from 0.493 to 0.253.Similar trade-off can also be seen in the SVM.The RF retains the highest performance in also this experiment with 0.915 AUROC and 0.493 F1-score, however, only a small margin in AUROC in comparison to the Gaussian process classifiers with 0.913 AUROCs.Graphical illustration of the results is presented in Figure 1a.
In BPD classification (Table 4b), the GP-M52 has the highest AUROC of 0.884, however, similar to the mortality classification, the margin is small between the different GP classifiers (0.879) and the RF classifier (0.881).The RF classifier has the highest F1-score with value of 0.704, similar to the results obtained when the AUROC was used for the classifier selection.However, the gap between the RF and the other classifiers has been decreased, as the Gaussian process classifiers and the logistic regression classifiers have around 0.69 F1-scores.Graphical illustration of the results is presented in Figure 1b.
In NEC classification (Table 4c), the RF has the highest AUROC of 0.785 and the highest F1-score of 0.215.We can see that when the Gaussian process classifiers and the SVM classifier were selected based on the highest F1-score, they no longer predict every example as negative, but have generally high sensitivities, the SVM having the highest sensitivity, VOLUME 8, 2020 123353 TABLE 4. Classification results for representative classifier selected based on F1-score.The results are presented as the mean and in parentheses the standard error of each evaluation measure, over the cross-validation repetitions.Descriptors under the name of the method are in format: length of time series (hours) -irregularly or regularly sampled (I/R) -all data or exclude first 6 hours (A/E) -class distribution empirical or resampled to uniform (E/U).GP denotes the Gaussian process classifier, with RBF denoting the squared exponential kernel, M32 the Matérn kernel with ν = 3/2, and M52 the Matérn kernel with ν = 5/2.RF denotes the random forest classifier, KNN the k-nearest neighbor classifier, LR the logistic regression classifier, LDA the linear discriminant analysis classifier, QDA the quadratic discriminant analysis classifier, and SVM the support vector machine classifier.
albeit with fairly low PPV.These high sensitivity Gaussian processes and SVM were trained using the class subsampling method, which achieves uniform class distribution.We can also see that when these classifiers were selected based on the AUROC, the selected classifiers were trained using empirical distribution.This shows that the type of preprocessing applied to the classifier can improve sensitivity significantly, which may be of special interest in clinical environment.Graphical illustration of the results is presented in Figure 1c.
In ROP classification (Table 4d), the same RF classifier which was the best in AUROC (0.846), also had the highest F1-score (0.368).The margin in F1-score is decreased in relation to the other classifiers, however, it still is relatively high.Similar large improvement in sensitivity can be seen in this task for the Gaussian process classifiers and for the SVM classifier as is seen in the NEC classification.Indeed, the GP-RBF has the highest sensitivity in this task.These high sensitivity classifiers were, again, trained using the class sub-sampling method.Graphical illustration of the results is presented in Figure 1d.

B. IMPACT OF TIME SERIES LENGTH
In this section, we evaluate the performance of the classifiers with respect to the length of the time series.Figures 2 and 3 visualize model performances, evaluated by AUROC and F1score, and averaged over the cross-validation repetitions and class-related sampling methods.
From the illustrations, we can see that the exclusion of the first 6 hours of data does not significantly improve or degrade the performance of the classifiers.This indicates that the mean and the standard deviation of the time series can be used for prediction of mortality, BPD, NEC, and ROP without exclusion of the initial measurements.
We can see that in mortality (Fig. 2a), BPD (Fig. 2b), and ROP (Fig. 3b) detection most classifiers have only small benefits from increasing the length of the time series.However, in NEC detection all classifiers seem to benefit from increasing the length of the time series, most noticeably in terms of AUROC.Especially the QDA improves by approximately 0.15 in AUROC.The F1-score of the RF classifier is significantly improved by approximately 0.10 when a longer time series is used.Our analysis suggests that in the prediction of mortality, BPD, and ROP, even 12 hours of time series data could be used for accurate classification, providing earlier detection of these outcomes.In NEC prediction, most classifiers benefit from using longer time series in terms of AUROC, and some also in terms of F1-score, most notably the RF classifier.

C. FEATURE IMPORTANCE
As the final stage, we used the RF classifier to rank the importance of all extracted features in order to have a better insight to the crucial factors influencing the RF classifier in the detection of neonatal mortality and morbidity.It should be noted that since the RF classifier performs the classification in a nonlinear manner, it is hard to determine the mechanism of how the change in one feature affects the detection performance.However, the feature importances still provide useful information on which features the RF classifier found to be the most informative.
The RF classifier consists of many individual decision trees, each trained using a subset of the patients and features.The out-of-bag error means the average error of each tree computed on the subset of patients which were not included in training the tree.The out-of-bag error can also be used for evaluating which features are the most influential for the detection performance.This is done by randomly permuting one feature across the different patients and then calculating the our-of-bag error.The feature with the highest out-of-bag error is judged to be the most important one, as randomizing its value caused the highest error.The features with a small positive or negative importance value can be judged to be not important, as randomizing their values does not change the error much [22].
The classifier was trained for each task using all the available data and the longest time series of 72 hours.Feature importances are graphically presented in Figure 4.The importance values have been normalized by the maximum value of all importances.In the case of mortality, birth weight has the largest importance out of all the features, an unsurprising result given the well-known contribution of low birth weight to neonatal mortality [32].Variation in the blood oxygen saturation has nearly as high importance as birth weight.Mean of systolic and mean arterial blood pressure are the third and fourth most important features.Other features, starting from gestational age, have a large drop in the importance values.
In BPD classification, gestational age has the largest importance by a wide margin.Birth weight and the medical score SNAPPE-II come the second.After the fourth featuremean of diastolic blood pressure-the feature importances decrease in a linear manner.Low birth weight and gestational age have been previously discovered to contribute to increased risk of BPD [33].Also, increased SNAPPE-II has been associated with BPD [34].Mechanical ventilation and supplementary oxygen are thought to play a major factor in the development of BPD [33], which might explain the blood oxygen concentration being in the top half of the features.
In NEC classification, blood oxygen concentration, birth weight and the mean of systolic arterial blood pressure were the three most important features, after which a large drop in the numerical value of importances is seen.In ROP classification, the standard deviation of diastolic arterial blood pressure had the largest importance, by a relatively wide margin.The following features in descending order were the standard deviation of systolic blood pressure, the mean of blood oxygen, the mean of mean arterial blood pressure, the standard deviation of mean arterial blood pressure, and the mean of systolic blood pressure.However, they had similar importances.After these features, a large drop in importance is seen.
Interesting observation can be seen when comparing mortality, BPD, and NEC feature importances with the ones of ROP.Within the first group, birth weight is either the most or the second most important feature, however, it is the least important feature in ROP classification.The most important feature in ROP classification-the standard deviation of diastolic blood pressure-is also the third least important feature in mortality and NEC classification, and the sixth least important feature in BPD classification.

VII. CONCLUSIONS
In this article, we have presented a systematic analysis of 9 different classifiers in the tasks of neonatal mortality, BPD, NEC, and ROP detection, using an NICU dataset.Our preprocessed datasets included irregular and regularized time series, time series of different lengths, and time series with initial 6 hours of observations excluded.Our experiments also included majority class sub-sampling for uniform class distribution.In our experiments the RF classifier had generally the best performance, when evaluated using our primary evaluation measures of AUROC and F1-score.
Our results show that the features proposed in the preliminary work [13], [14] are robust to the choice of classifiers in AUROC and F1-score measures, when the training data has been preprocessed in an optimal fashion.Our findings also show that the performance of the classifiers is unaffected by the exclusion of the first 6 hours of data.This finding indicates that the possible heteroscedasticity present in the neonatal adaptation period measurements does not degrade the performance of the classifiers.
In comparison to our preliminary work, presented in [14] and [13], we have shown that our preprocessing methods and the proposed F1-score for classifier selection improve the sensitivity of the same classifiers in the task of neonatal mortality prediction, albeit this methodology decreased the AUROC values slightly.In the task of BPD, NEC, and ROP classification, we have shown that the RF classifier can reach the best or the second best results in AUROC, and the best results in F1-score in each task.However, other classifiers also reach competitive F1-scores when the preprocessing scheme is tuned to maximize F1-score.
Our results suggest that in a clinical setting, where the sensitivity can have a high priority, the AUROC evaluation measure might not provide a good standard for comparison.Instead, the F1-score could provide a useful measure for comparison.The overall results also show that when sensitivity is important, sub-sampling the majority class can lead to increased sensitivity.In majority of the cases and in majority of the classifiers, the best results were obtained in each primary evaluation measure when the longest 72 hour period of data was used.This suggests that the long term mean and standard deviation of the time series provide better discriminative features than the short term counterparts.However, this effect was not very significant in other tasks than NEC classification.
Lastly, we acknowledge some limitations of this study, namely i) the dataset was collected in the same hospital, possibly introducing challenges to generalization to other hospital settings with different devices for physiological signal measurements, and ii) our algorithm is designed in a retrospective manner after the measurements for a certain time period are taken.This requires some adaptations for real-time applications, which possibly leads to a drop in the performance.

FIGURE 1 .
FIGURE 1. Visualization of our results as bar graphs.Experiment1 denotes the first experiment, where we selected the classifier for maximal AUROC, and the Experiment2 denotes the second experiment, where we selected the classifier for maximal F1-Score.

FIGURE 2 .
FIGURE 2. Effect of time series length on the classification performance for (a) Mortality and (b) BPD.The postfix ''−6h'' in the legend denotes that the first 6 hours have been removed from the time series.

FIGURE 3 .
FIGURE 3. Effect of time series length on the classification performance for (a) NEC and (b) ROP.The postfix ''−6h'' in the legend denotes that the first 6 hours have been removed from the time series.

FIGURE 4 .
FIGURE 4. Feature importances in each task, given by the RF classifier.Here µ denotes mean, σ standard deviation, and ABP denotes arterial blood pressure with postfix D, M and S correspond to diastolic, mean and systolic, respectively.Larger values suggest higher importance.These values are normalized into maximum of 1.0.

TABLE 1 .
Descriptive statistics of the set of included patients.

TABLE 2 .
Selected parameters for random forest and KNN (k-nearest neighbor) for each classification task.PPS denotes predictors per split and OPL observations per leaf.