Ensemble Learning Using Individual Neonatal Data for Seizure Detection

Objective: Sharing medical data between institutions is difficult in practice due to data protection laws and official procedures within institutions. Therefore, most existing algorithms are trained on relatively small electroencephalogram (EEG) data sets, which is likely to be detrimental to prediction accuracy. In this work, we simulate a case where the data cannot be shared by splitting the publicly available data set into disjoint sets representing data in individual institutions. Methods and procedures: We propose to train a (local) detector in each institution and aggregate their individual predictions into one final prediction. Four aggregation schemes are compared, namely, the majority vote, the mean, the weighted mean and the Dawid-Skene method. The method was validated on an independent data set using only a subset of EEG channels. Results: The ensemble reaches accuracy comparable to a single detector trained on all the data when a sufficient amount of data is available in each institution. Conclusion: The weighted mean aggregation scheme showed the best performance; it was only marginally outperformed by the Dawid-Skene method when the local detectors approached the performance of a single detector trained on all available data. Clinical impact: Ensemble learning allows training of reliable algorithms for neonatal EEG analysis without a need to share the potentially sensitive EEG data between institutions.


Introduction
Seizures are common during the perinatal period [13], and management of neonatal seizures requires timely detection and treatment to reduce ensuing brain damage [4]. The current gold standard for neonatal seizure detection is visual analysis by a human expert using a full-montage video electroencephalogram (EEG) [38]. Since such a service is rarely available in neonatal intensive care units (NICUs), there is an urgent clinical need for an automated neonatal seizure detection algorithm (NSDA) with human expert level accuracy.
Deep neural networks (DNNs) generally require a large amount of training data [24]. However, building a large and diverse enough neonatal EEG data set with high quality seizure annotations is time consuming, ambiguous [27,41] and often limited due to strict regulations (e.g. the Privacy Rule of the U.S. Health Insurance Portability and Accountability Act (HIPAA), or the European General Data Protection Regulation (GDPR)), making data sharing between institutions difficult, if not impossible [10,54]. Challenges in sharing data have triggered growing interest in distributed approaches to statistical learning [23].
One approach that requires minimal sharing of information is model ensembling, i.e. models are trained locally at each institution and predictions on new data are aggregated (ensembled) from predictions made by the local models. This requires sharing only the models across the network of institutions rather than sharing the potentially sensitive, original biosignals. However, the procedures in model sharing need to be planned so that they mitigate the impact of possible inadvertent leaks of training data through a model [12,56]. One solution to this problem is to have a trusted agent in charge of the models and the aggregation procedure. Compared to federated learning [25], ensembling does not require communication between the institutions during the training phase (which may be difficult to set up) and it does not require the institutions to use the same model architecture. One institution could, e.g., use a DNN, another an SVM and a third a decision tree classifier.
Once predictions on new data have been made, there are a number of techniques by which they can be ensembled. If predictions are accompanied by probabilities they can be averaged [7,51]; if not, a commonly used method for label aggregation is to simply select the most frequent label, referred to as majority vote in the following. One could also put more weight on some predictions if they are a priori more trustworthy; otherwise, an estimate of each annotator's performance can be used [45,49,52]. In 1979 Dawid and Skene [8] used an expectation-maximization (EM) algorithm [9] to estimate annotator performance and provide consensus labels.
Ensemble learning has previously been used in neonatal seizure detection. In [35] stacking is used, where different model types trained on the same data are combined. In [44] three identical NSDAs are trained on the same EEG data but using labels from different experts. In this work we use ensemble learning on disjoint data sets, to simulate the situation where institutions train NSDAs on locally available data. Depending on the training data available at each institution and its similarity to new data to be labelled, the local NSDAs are expected to vary in performance. The main contribution and novelty of this work is in the discovery of how such locally trained models can be aggregated with the aim of achieving performance comparable to a single state-of-the-art NSDA trained on the union of all local training data sets. For aggregation we compared the majority vote, the mean, the weighted mean (via stacking) and the Dawid-Skene expectation-maximization algorithm. We show that the weighted mean outperforms the other methods if the NSDAs in the ensemble are trained on very few patients, and that Dawid-Skene marginally outperforms the other methods when the local NSDAs are not much worse than the state-of-the-art NSDA. The NSDAs and ensembles are further validated on an independent data set consisting of more than 2100 hours of EEG recorded from a small subset of the channels used to train the classifiers.

Methods
Multiple local models, referred to as local NSDAs in the following, are trained on disjoint subsets of multi-channel EEG recordings, simulating a scenario where several hospitals train NSDAs individually, without sharing patient data. The trained detectors are then shared with a trusted agent. To classify a short EEG segment from a new patient as seizure/non-seizure, the trusted agent sends the segment through all the local NSDAs and the predictions are aggregated using one of the following schemes: majority vote, mean, weighted mean or the Dawid-Skene method. The methodology is summarized in figure 1.
For local NSDAs, we used DNNs which take EEG segments as input. The networks share the same architecture but have different network weights since they were trained on disjoint training sets.

Aggregation schemes
In the following we consider a binary classification problem where the classes are labeled 0 and 1. Let D be a set of N predictions from R independent models, where $p_i^j$ is the probability, estimated by model j, that instance i belongs to class 1. By setting the threshold between the classes to 0.5, the predicted label of model j for instance i is given by
$$y_i^j = \begin{cases} 1, & \text{if } p_i^j \geq 0.5, \\ 0, & \text{otherwise.} \end{cases}$$
A simple way to aggregate multiple predictions for instance i, when models do not output their confidence (e.g. class probabilities), is to use majority vote, i.e. select the most frequent label. Here we use the mean of predicted labels,
$$\mu_i^{MV} = \frac{1}{R} \sum_{j=1}^{R} y_i^j. \tag{1}$$
When the models output class probabilities, which is e.g. the case when the models correspond to neural networks, the predictions can be aggregated by taking the mean probability
$$\mu_i^{M} = \frac{1}{R} \sum_{j=1}^{R} p_i^j. \tag{2}$$
As some of the models might perform better than others, a weighted mean can be used to emphasize the more accurate models. To get the final prediction in a range between 0 and 1, we used logistic regression
$$\mu_i^{WM} = \sigma\Big(w_0 + \sum_{j=1}^{R} w_j p_i^j\Big), \tag{3}$$
where $\sigma(x) = 1/(1 + e^{-x})$ and $w_0$ is the intercept. The weights $w_j$ are learned on a held out data set (see section 2.4).
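The first three aggregation schemes can be illustrated with a minimal Python sketch for a single instance. The function names, the example weights and the intercept argument `bias` are hypothetical, and treating a probability of exactly 0.5 as class 1 is an assumption made here for illustration.

```python
import math

def predicted_labels(probs, threshold=0.5):
    """Hard labels for one instance: 1 where a model's probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def majority_vote(probs):
    """Mean of the hard labels; a value of at least 0.5 means most models voted 'seizure'."""
    labels = predicted_labels(probs)
    return sum(labels) / len(labels)

def mean_probability(probs):
    """Unweighted mean of the class-1 probabilities."""
    return sum(probs) / len(probs)

def weighted_mean(probs, weights, bias=0.0):
    """Logistic function applied to a weighted sum of the probabilities.
    The weights and bias would be learned on held-out data; here they are given."""
    z = bias + sum(w * p for w, p in zip(weights, probs))
    return 1.0 / (1.0 + math.exp(-z))

# Three hypothetical local detectors scoring one EEG segment.
probs = [0.9, 0.6, 0.2]
print(majority_vote(probs), mean_probability(probs))
print(weighted_mean(probs, weights=[2.0, 1.0, 0.5], bias=-1.5))
```

In each case the aggregated value is compared with 0.5 to obtain the final seizure/non-seizure decision.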
The fourth aggregation method evaluated here is the Dawid-Skene method. The method estimates the sensitivity and specificity of each model, together with consensus predictions $\mu^{DS}$. For details of the method see appendix A. To predict the absence/presence of seizures from the above aggregation schemes, a threshold of 0.5 is used.
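To give an intuition for the Dawid-Skene method, the following is a minimal sketch of the classic binary formulation applied to hard labels. It is not necessarily the variant described in appendix A; the function name, the initialisation with the label mean, the fixed number of iterations and the clipping constant `eps` are all assumptions.

```python
def dawid_skene_binary(labels, n_iter=50, eps=1e-6):
    """labels[i][j] in {0, 1} is the hard label of model j for instance i.
    Returns (mu, se, sp): consensus probabilities per instance and the
    estimated sensitivity and specificity per model."""
    n, r = len(labels), len(labels[0])

    def clip(x):
        # Keep estimates away from 0 and 1 so the likelihoods stay positive.
        return min(max(x, eps), 1.0 - eps)

    # Initialise the consensus with the mean of the labels (soft majority vote).
    mu = [sum(row) / r for row in labels]
    for _ in range(n_iter):
        # M-step: prevalence and per-model sensitivity/specificity, weighted by mu.
        pos = sum(mu)
        neg = n - pos
        prev = clip(pos / n)
        se = [clip(sum(m * row[j] for m, row in zip(mu, labels)) / max(pos, eps))
              for j in range(r)]
        sp = [clip(sum((1 - m) * (1 - row[j]) for m, row in zip(mu, labels)) / max(neg, eps))
              for j in range(r)]
        # E-step: posterior probability that each instance belongs to class 1.
        new_mu = []
        for row in labels:
            a, b = prev, 1.0 - prev
            for j, y in enumerate(row):
                a *= se[j] if y == 1 else 1.0 - se[j]
                b *= (1.0 - sp[j]) if y == 1 else sp[j]
            new_mu.append(a / (a + b))
        mu = new_mu
    return mu, se, sp

# Two consistent models and one noisy model (third column), toy data.
labels = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
mu, se, sp = dawid_skene_binary(labels)
print([round(m, 3) for m in mu], se, sp)
```

In this toy example the noisy third model receives lower estimated sensitivity and specificity, so its votes carry less weight in the consensus.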
The second, proprietary, data set (the 3-channel DS), consisting of EEG recordings of 28 neonates, is used as a held out test set to evaluate the aggregation schemes in a real world setting, i.e. detectors are trained on the 18-channel DS and tested on this data set. The data set is also used in [46] and is a subset of the data set used in [31]. The Institutional Research Review Board of the HUS diagnostic center approved the use of this data, including a waiver of consent due to the study's retrospective and observational nature. Each recording spans from 19 hours to 7 days. The recordings were obtained using 4 needle electrodes (F3, F4, P3 and P4) with a common reference, instead of the full set of 19 electrodes used in the training data set. Neonatal recordings are typically performed with this reduced electrode set to allow easier maintenance during long-duration brain monitoring [6]. The three bipolar derivations (F3-P3, F4-P4 and P3-P4) are used both by the two human expert annotators and as the detector input.
Additional attributes of the data sets are given in table 2 in appendix B.
Each EEG recording is cut into 16 sec long segments with 12 sec overlap. Out of the 79 (28) recordings in the 18-channel DS (the 3-channel DS), 38 (24) contain at least one seizure longer than 16 sec identified by three (two) human experts, meaning each of these recordings contains at least one consensus seizure segment. Segments containing more than 1 sec of zero voltage in at least one channel (disconnected electrode or pause in the recording) are left out of the training and test sets. The signals are filtered with a 6th order Chebyshev Type 2 band-pass filter with cut-off frequencies of 0.5 Hz and 16 Hz, down-sampled to 32 Hz and rescaled to 16-bit integers. This is similar to the pre-processing in [47,21].
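The segmentation and the zero-voltage exclusion rule can be sketched as follows, assuming the signal has already been down-sampled to 32 Hz. The function names are hypothetical, and the filtering and rescaling steps are not shown.

```python
def segment_starts(n_samples, fs=32, seg_sec=16, overlap_sec=12):
    """Start indices (in samples) of 16 s segments with 12 s overlap, i.e. a 4 s stride."""
    seg_len = seg_sec * fs
    step = (seg_sec - overlap_sec) * fs
    return list(range(0, n_samples - seg_len + 1, step))

def has_flat_interval(channel, fs=32, max_flat_sec=1):
    """True if the channel contains a run of zero samples longer than max_flat_sec."""
    run = 0
    for value in channel:
        run = run + 1 if value == 0 else 0
        if run > max_flat_sec * fs:
            return True
    return False

def keep_segment(segment_channels, fs=32):
    """A segment is kept only if no channel contains a long zero-voltage interval."""
    return not any(has_flat_interval(ch, fs) for ch in segment_channels)

# A 32 s recording at 32 Hz yields five overlapping 16 s segments.
print(segment_starts(32 * 32))
```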

Neonatal seizure detection algorithm
Each NSDA is a neural network consisting of three components: a feature extractor, an attention layer and an output layer. The feature extractor is a CNN from [32]. The features are extracted from each EEG channel separately and are combined into a single feature channel by the attention layer [21]. The attention layer is used since expert labels are not specific to individual channels and neonatal seizures tend to be partial [38], i.e. localized in a small area of the brain and therefore only present in a subset of the recorded channels. The attention layer is also independent of the number of input feature channels, making the detector independent of the number of recorded EEG channels. The output layer is a fully connected layer with two output nodes representing the two classes. A detailed description of the network architecture is given in appendix C.
To compare the aggregation schemes to current state-of-the-art NSDAs, we trained a neural network using all the recordings in the 18-channel DS containing at least one consensus seizure longer than 16 sec (P). This NSDA is referred to as the baseline NSDA in the following.
The local NSDAs use the same neural network architecture as the baseline NSDA but differ in the data used for training. The patients in P (patients with a consensus seizure) are partitioned into k = 3, 4, ..., 10 subsets representing data sets in individual institutions. Partitioning is random such that each patient is in exactly one subset and there are at least three patients in every subset. The union of the k subsets is then P, the data set used as a training set for the baseline NSDA. By excluding patients without consensus seizures we ensure each subset has patients with seizures and eliminate the varying number of EEGs with normal brain activity in individual subsets, making the analysis more straightforward. As there can be a big difference between the training set sizes, we obtain local NSDAs with different generalisation strengths and consequently with different performance on unseen data. This is expected in practice. Even though the acquisition equipment is subject to international standards and the electrodes are positioned according to the 10-20 system, the EEG signals may vary considerably depending on the patient cohorts, as the signals differ between neonates of different ages and conditions [17,18]. Therefore, the detectors are expected to perform differently on unseen data.
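One way to obtain such a random partition is sketched below. The text does not specify the sampling procedure, so the approach used here (give every subset three patients first, then assign the remaining patients uniformly at random) and the function name are assumptions.

```python
import random

def partition_patients(patients, k, min_size=3, rng=None):
    """Randomly split patients into k disjoint subsets with at least min_size patients each."""
    rng = rng or random.Random()
    if k * min_size > len(patients):
        raise ValueError("not enough patients for k subsets of the minimum size")
    shuffled = list(patients)
    rng.shuffle(shuffled)
    # Every subset first receives its minimum number of patients ...
    subsets = [shuffled[i * min_size:(i + 1) * min_size] for i in range(k)]
    # ... and the remaining patients are assigned to subsets at random,
    # which produces training sets of unequal size.
    for patient in shuffled[k * min_size:]:
        rng.choice(subsets).append(patient)
    return subsets

# 38 patients with consensus seizures split among k = 5 simulated institutions.
subsets = partition_patients(range(38), k=5, rng=random.Random(0))
print([len(s) for s in subsets])
```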

Training
After partitioning the training set, each NSDA (baseline NSDA and local NSDAs) is trained on 16 sec long EEG segments corresponding to the consensus seizures and non-seizure segments. To avoid complications due to class imbalance [21,22], the training sets are balanced prior to training by sub-sampling the non-seizure segments. Segments with disagreements between the human experts and partly seizure/non-seizure segments are not included in the training sets. Cross entropy is used as the loss function. The Adam optimizer is used to optimize the network weights using an initial learning rate of 0.001, which is then halved every 10 epochs. The NSDAs are trained for 30 epochs with a mini-batch size of 32. The hyper-parameters (learning rate and number of epochs) are tuned empirically, from observing the behavior of the loss function during the training of the baseline NSDA. A small mini-batch size is chosen due to the small amount of data used in some local NSDAs. For the weighted mean aggregation scheme, the weights $w_j$, j ∈ {1, 2, ..., R}, are learned using a stacking classifier [52]. A logistic regression classifier is trained using the data from one randomly selected local NSDA in each experiment. This local NSDA is not used in the ensemble for making predictions on a test patient. Therefore, non-overlapping data sets are used for training the local NSDAs and the logistic regression classifier. Also, the training data of the local NSDAs would not need to be shared in practice, as the input of the logistic regression classifier is just a set of seizure probabilities estimated by the local NSDAs and these can be provided by the trusted agent.
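The learning rate schedule and the class balancing step can be written down in a few lines of plain Python. This is a sketch with hypothetical function names; it assumes epochs are counted from zero and that non-seizure segments are the majority class. In PyTorch the same schedule would typically be expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)`, although the text does not state which scheduler was used.

```python
import random

def learning_rate(epoch, initial=0.001, halve_every=10):
    """Step schedule: the initial rate is halved after every `halve_every` epochs."""
    return initial * 0.5 ** (epoch // halve_every)

def balance_by_subsampling(seizure, non_seizure, rng=None):
    """Keep all seizure segments and draw an equally sized random sample
    (without replacement) of the non-seizure segments."""
    rng = rng or random.Random()
    n = min(len(seizure), len(non_seizure))
    return list(seizure), rng.sample(list(non_seizure), n)

# Learning rates at the start of each 10-epoch block of a 30-epoch run.
print([learning_rate(e) for e in (0, 10, 20)])
```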
All the deep learning code used in the experiments is implemented using PyTorch 1.7.1 [36] and run on an NVIDIA GTX 1080 Ti GPU. For logistic regression, we use the scikit-learn [37] implementation with default hyper-parameters.

Performance
Two sets of performance metrics are calculated: metrics based on the success/failure in classifying individual 16 sec long segments, and event-based metrics which indicate whether a seizure is detected at all, or whether a seizure is falsely reported. The segment-based metrics are sensitivity (SE), specificity (SP) and the area under the receiver operating characteristic curve (AUC). These metrics are calculated from segments without disagreements between human experts and with either seizure or non-seizure activity for the whole segment duration. The event-based metrics are seizure detection rate (SDR), false detections per hour (FD/h) and the mean false detection duration (MFDD) [48]. A consensus seizure is considered to be detected if it is detected at any point in time, and a detected seizure is considered a false detection if it does not overlap with any seizure (consensus or not) labelled by the human experts. Definitions of the metrics are provided in appendix D. Metrics calculated on each patient separately are summarized by their means and medians.
Before the event-based metrics are calculated, a post-processing step is needed since segments overlap. Apart from a few segments at the beginning and end of each recording, each 4 sec long segment is covered by 4 overlapping 16 sec long segments. The prediction for a 4 sec segment is obtained by averaging the predictions from the overlapping 16 sec long segments [20,34]. Seizures with duration less than 10 sec are excluded and considered normal brain activity, as by definition seizures are longer than 10 sec [50].
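The post-processing step can be sketched as follows, assuming the 16 sec segments are ordered in time with a 4 sec stride, so that segment i covers the 4 sec blocks i to i + 3. The function names and the representation of events as (start, end) tuples in seconds are hypothetical.

```python
def average_overlapping(seg_probs, seg_blocks=4):
    """seg_probs[i] is the prediction for the 16 s segment starting at 4 s block i.
    Returns one averaged prediction per 4 s block."""
    n_blocks = len(seg_probs) + seg_blocks - 1
    sums = [0.0] * n_blocks
    counts = [0] * n_blocks
    for i, p in enumerate(seg_probs):
        for b in range(i, i + seg_blocks):
            sums[b] += p
            counts[b] += 1
    # Blocks near the start and end are covered by fewer than four segments.
    return [s / c for s, c in zip(sums, counts)]

def detected_events(block_probs, block_sec=4, threshold=0.5, min_sec=10):
    """Runs of consecutive above-threshold blocks as (start_sec, end_sec);
    runs shorter than min_sec are discarded."""
    events, start = [], None
    # A trailing below-threshold sentinel closes a run that reaches the end.
    for b, p in enumerate(list(block_probs) + [0.0]):
        if p >= threshold and start is None:
            start = b
        elif p < threshold and start is not None:
            if (b - start) * block_sec >= min_sec:
                events.append((start * block_sec, b * block_sec))
            start = None
    return events

print(average_overlapping([1.0, 0.0]))
print(detected_events([0.9, 0.9, 0.1, 0.8, 0.8, 0.8]))
```

With 4 sec blocks, runs of one or two blocks (4 or 8 sec) fall below the 10 sec limit and are dropped, while a run of three blocks (12 sec) is kept.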
Performance on the 3-channel DS is evaluated in the same manner as for the 18-channel DS, i.e. the metrics are calculated for each patient separately and then summarized with the mean and the median.

Results
To assess the clinical usefulness of the aggregation schemes, they are compared to a baseline NSDA which is trained on data from all 38 patients in P (in a leave-one-subject-out setting for evaluation on the 18-channel DS). The baseline NSDA thus corresponds to the situation where a single agent has access to all the training data (P), a situation which is expected to be favorable compared to aggregating predictions from multiple models trained on disjoint subsets of the same data.

Baseline NSDA
Table 1 compares the performance of the baseline detector to other NSDAs found in the literature. All detectors are neural networks and were trained or tested using the 18-channel DS. The difference between the mean (0.92) and median (0.98) AUC values for the baseline NSDA calculated on the 18-channel DS is mainly due to the presence of respiratory and heart rate artefacts and low seizure burden in some of the recordings. The performance of an NSDA on an independent test set is usually worse than performance estimates obtained from held-out training data. Such a decrease can be attributed to several factors, including differences in patient cohorts, seizure prevalence, the number of available EEG channels, the human experts that annotated the EEG [5], and training data not representing the general population. For example, the mean AUC decreased from 0.97 to 0.92 in [21] and from 0.99 to 0.96 in [33]. We observe a similar drop in performance when the baseline detector is tested on the proprietary 3-channel DS. Detailed validation of the NSDA performance is available in table 3 in appendix E.
In summary, the baseline NSDA gives comparable results to the state-of-the-art NSDAs and performs well on recordings which include only a small subset of the channels used in training.

Aggregation schemes
Here we evaluate the different aggregation schemes and compare them to the baseline NSDA and to the average performance of the local NSDAs. If the baseline performance can be reached with an aggregation scheme, it would indicate that the data does not need to be shared during the training of an NSDA to obtain a detector with state-of-the-art performance. The four aggregation schemes, majority vote, mean, weighted mean and the Dawid-Skene method, were evaluated on the 18-channel DS and the 3-channel DS for k = 3, 4, ..., 10 local NSDAs. Results for the majority vote are not shown since in all cases majority vote was slightly outperformed by the mean aggregation scheme (see figure 7 in appendix E).
With an increasing number of local NSDAs the average performance of an individual detector gradually gets worse (figure 2). This is explained by the fact that the number of patients behind each local NSDA is becoming smaller since the total number of patients in the combined training sets is constant (37 for the 18-channel DS and 38 for the 3-channel DS). Consequently there is an increased risk of overfitting in individual detectors. The size of the local training sets is quantified with the mean median number of patients in the training set. E.g., if four local NSDAs are used and the mean median is 8.1, then on average there are at least nine patients in the training of two of the local NSDAs. Figure 2 shows that the AUC, seizure detection rate and false detection rate behave similarly across both data sets for all the aggregation schemes, but there is considerably more variability for the 3-channel DS. All the aggregation schemes give AUC values that are similar to the baseline value. However, the aggregation schemes differ in terms of seizure detection rate and false detections per hour.
Figure 3 shows the seizure probability estimates returned by local NSDAs for an hour-long recording, together with probability estimates obtained with the ensemble methods. All the aggregation schemes result in AUC close to one, although they detect only 3 out of 7 consensus seizures. The missed seizures are short in duration and they are clearly visible in the figure (as white bands) since the corresponding probabilities are higher than for the non-seizure segments.
The SDR in figure 2 behaves similarly for both data sets. For all values of k tested, the Dawid-Skene method is comparable to the baseline NSDA, while for the mean and the weighted mean aggregation schemes, fewer seizures were detected with an increased number of local NSDAs. Recall that when there are few NSDAs, each NSDA detects almost as many seizures as the baseline detector. The mean aggregation scheme performed slightly worse than the weighted mean and both performed notably worse than the Dawid-Skene method for more than four local detectors. Moreover, in figure 2 we observe that all aggregation schemes result in a lower number of FD/h than the average local NSDA. The average FD/h of the local NSDAs is noticeably higher for the 3-channel DS than for the 18-channel DS. One possible explanation is that the recordings in the 3-channel DS are much longer and on average just 3.5 % of a recording corresponds to seizure activity. The mean aggregation scheme has a lower false detection rate than the baseline NSDA and the FD/h decreases steadily with an increasing number of local NSDAs. This may be a result of the low level of agreement between the local NSDAs for large k (figure 5 in appendix E). So, even though an individual local NSDA falsely detects a large number of seizures, the aggregated prediction filtered them out or was below the 0.5 threshold. This may, on the other hand, have caused problems for the Dawid-Skene method, i.e. the FD/h increased slowly on the 18-channel DS and rapidly on the 3-channel DS with an increasing number of local NSDAs. In contrast, the logistic regression classifier determining the weights for the weighted mean aggregation scheme successfully identified local NSDAs with high/low false detection rates for all k tested.

Expert annotations
We observed low false detection rates for the mean and weighted mean aggregation schemes and therefore investigated whether the false detections are short or long in duration. We did not observe big differences between the aggregation schemes (10-30 sec) or between different numbers of local NSDAs (figure 6 in appendix E).
To summarise, all aggregation schemes tested here performed better than the average local NSDA and were comparable to the baseline NSDA for k ∈ {3, 4}. This shows that the overfitting by local models noted earlier is offset by aggregating their predictions. This is in line with published reports on ensemble methods such as Random Forests, which aggregate predictions from multiple models individually overfitting the data. The decrease in performance for larger values of k is mainly a result of training the local NSDAs on smaller training sets that do not capture the general population. The (weighted) mean aggregation scheme detects fewer seizures than the baseline detector; however, the false detection rate is comparable, if not lower. The Dawid-Skene method successfully detects the same number of seizures as the baseline NSDA for any number of local NSDAs, but the false detection rate is compromised for k ≥ 6. Predictions obtained with the Dawid-Skene method are difficult to explain [19,55]; only a few local NSDAs with poor performance may have caused unexpected and undesired aggregated predictions [28].

Conclusion
In this work we have shown that an NSDA based on a convolutional neural network together with an attention layer can accurately detect seizures, even if the data is obtained with different types of electrodes (scalp vs needle) and a significantly lower number of channels than was used for training. All the performance metrics of the NSDAs unsurprisingly dropped when training sets contained data from only a few patients. For aggregation of such NSDAs the weighted mean aggregation scheme performed best. Compared to the Dawid-Skene method, it successfully identified local NSDAs with high false detection rates, and its seizure detection rate was not as compromised as it was for the mean aggregation scheme. When a larger number of patients was included in the training of individual local NSDAs, i.e. when the number of local NSDAs was small, the Dawid-Skene method marginally outperformed the other aggregation schemes. It had a higher seizure detection rate and its number of false detections per hour was comparable to that of the (weighted) mean aggregation scheme. Independent of the number of local NSDAs, the majority vote was slightly outperformed by the mean aggregation scheme and all aggregation schemes performed better than the average individual (local) NSDA.
The experiments suggest that data does not need to be shared between institutions. It takes approx. 15 seconds to process one hour of 18-channel EEG with 10 local detectors, which is fast enough to be used in an online setting in the clinic. By utilizing GPU optimized code in the preprocessing steps and a fast version of the Dawid-Skene aggregation method [40], one hour of EEG could be processed in less than 2 seconds.
To confirm the findings reported here in a real-world setting, data from multiple institutions would be required.

B Data information

C Architecture of the NSDA
In this work the NSDAs are deep neural networks consisting of three components: a feature extractor [32], an attention layer [21] and an output layer (figure 4). We used the PyTorch implementation of layers for the feature extractor and for the output layer. Using PyTorch notation, the attention layer was implemented as follows. If an input to the attention layer is of size $(N, C_{in}, L)$ then the output is of size $(N, L)$ and is computed using the learnable parameters $V \in \mathbb{R}^{L \times \text{<inner size>}}$ and $w \in \mathbb{R}^{L \times 1}$.

Segment-based metrics
Segment-based metrics were calculated based on 16 sec long EEG segments. A true positive (TP) is a seizure segment correctly predicted as seizure, a true negative (TN) is a non-seizure segment correctly predicted as non-seizure, a false positive (FP) is a non-seizure segment incorrectly predicted as seizure and a false negative (FN) is a seizure segment incorrectly predicted as non-seizure.
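• Sensitivity (SE): $\mathrm{SE} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$.
• Specificity (SP): $\mathrm{SP} = \mathrm{TN} / (\mathrm{TN} + \mathrm{FP})$.
• Area under the receiver operating characteristic curve (AUC). The receiver operating characteristic curve plots SE against 1 − SP as the decision threshold is varied.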
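The segment-based metrics can be computed as in the following sketch, which uses the standard definitions of sensitivity and specificity and computes the AUC as the probability that a randomly chosen seizure segment receives a higher score than a randomly chosen non-seizure segment (ties count one half). The function names are hypothetical and both classes are assumed to be present.

```python
def sensitivity_specificity(y_true, y_pred):
    """SE = TP / (TP + FN) and SP = TN / (TN + FP) from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores.
    Quadratic in the number of segments, which is fine for an illustration."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(sensitivity_specificity(y_true, y_pred), auc(y_true, scores))
```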

Event-based metrics
Event-based metrics, in contrast to the segment-based metrics, focus on each predicted seizure and not just on 16 sec long segments. Three event-based metrics were used [48]:
• Seizure detection rate (SDR): $\mathrm{SDR} = \mathrm{DS} / \mathrm{CS}$, where DS is the number of detected consensus seizures and CS is the number of consensus seizures. A seizure was considered to be detected if it was detected at any time during its duration.
• False detections per hour (FD/h): $\mathrm{FD/h} = \mathrm{IDS} / D$, where IDS is the number of incorrectly detected seizures and D is the duration of the data in hours. A seizure was considered to be incorrectly detected if it did not overlap with any seizure annotated by the experts.
• Mean false detection duration (MFDD):
$$\mathrm{MFDD} = \begin{cases} 0, & \text{if } \mathrm{IDS} = 0, \\ \mathrm{DIDS} / \mathrm{IDS}, & \text{otherwise,} \end{cases}$$
where DIDS is the sum of the durations of incorrectly detected seizures in seconds and IDS is the number of incorrectly detected seizures.
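A minimal sketch of the event-based metrics is given below. Events are represented as hypothetical (start, end) tuples in seconds, SDR is returned as a fraction, and at least one consensus seizure is assumed to be present.

```python
def overlaps(a, b):
    """True if the intervals a = (start, end) and b = (start, end) share any time."""
    return a[0] < b[1] and b[0] < a[1]

def event_metrics(detections, consensus, all_annotated, hours):
    """Return (SDR, FD/h, MFDD).
    detections: events predicted by the detector,
    consensus: seizures agreed on by all experts,
    all_annotated: every seizure marked by any expert (consensus or not),
    hours: duration of the recording in hours."""
    # A consensus seizure counts as detected if any detection overlaps it.
    n_detected = sum(1 for c in consensus if any(overlaps(c, d) for d in detections))
    # A detection is false only if it overlaps no expert-annotated seizure.
    false = [d for d in detections if not any(overlaps(d, a) for a in all_annotated)]
    sdr = n_detected / len(consensus)
    fd_per_hour = len(false) / hours
    mfdd = sum(end - start for start, end in false) / len(false) if false else 0.0
    return sdr, fd_per_hour, mfdd

consensus = [(20, 40), (300, 330)]
all_annotated = consensus + [(105, 110)]
detections = [(10, 30), (100, 120), (200, 215)]
print(event_metrics(detections, consensus, all_annotated, hours=1.0))
```

In this example one of two consensus seizures is detected, the detection at 100-120 sec is not counted as false because one expert marked a seizure there, and the single false detection lasts 15 sec.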

Figure 1: A schematic diagram of the proposed method. Each data set is used to train a local NSDA or weights that are shared with a trusted agent. The trusted agent makes predictions on new data. Seizure predictions for new data are obtained a) by aggregating predictions made by R NSDAs using the majority vote, the mean or the Dawid-Skene method, or b) by aggregating predictions made by R − 1 local NSDAs using the weighted mean (weights are learned on the R-th data set).
To avoid overlap between training and test data when evaluating classifier performance on the 18-channel DS, leave-one-subject-out cross-validation is used. This entails training 38 baseline NSDAs, 38 sets of local NSDAs and 38 sets of logistic regression classifiers, leaving out data from one subject (patient) at a time. The experiment is repeated 10 times, resulting in $10 \cdot 38 \cdot (3 + 4 + \dots + 10) = 19760$ local NSDAs and $10 \cdot 38 \cdot (1 + 1 + \dots + 1) = 10 \cdot 38 \cdot 8 = 3040$ logistic regression classifiers. Data from each left-out patient is sent through the corresponding baseline NSDA and local NSDAs. Predictions from the baseline NSDAs are compared to human expert labels to obtain performance metrics. Predictions from the local NSDAs are first aggregated using one of the aforementioned aggregation schemes, majority vote (1), mean (2), weighted mean (3) or the Dawid-Skene method (appendix A), to obtain the final predictions, and these are then compared to human expert labels.
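The model counts follow directly from the experimental design and can be checked with a few lines of Python (the variable names are illustrative):

```python
runs = 10           # repetitions of the experiment
folds = 38          # leave-one-subject-out folds, one per patient
ks = range(3, 11)   # numbers of local NSDAs: k = 3, 4, ..., 10

# Each fold and run trains k local NSDAs for every value of k ...
n_local_nsdas = runs * folds * sum(ks)
# ... and one logistic regression classifier per value of k.
n_logreg = runs * folds * len(ks)

print(n_local_nsdas, n_logreg)
```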

Figure 2: Average area under the curve (AUC), seizure detection rate (SDR) and false detections per hour (FD/h) as a function of the number of local NSDAs used in the aggregation schemes. The solid lines represent the medians of ten runs together with interquartile ranges denoted with vertical lines. The grey dashed line represents the average metric of the baseline NSDA. The average (across ten runs) mean median number of patients in each NSDA is shown in parentheses.

Figure 3: An example of aggregated predictions from eight local NSDAs. The area under the curve is 1.0 for the mean and the weighted mean and 0.99 for the Dawid-Skene method. All aggregation schemes detect 42.9 % of consensus seizures and they do not falsely detect any seizure.

Figure 6: Average mean false detection duration (MFDD) as a function of the number of local NSDAs used in the aggregation schemes. The solid lines represent the medians of ten runs together with interquartile ranges denoted with vertical lines. The grey dashed line represents the average MFDD of the baseline NSDA.

Figure 7: Average area under the curve (AUC), seizure detection rate (SDR), false detections per hour (FD/h) and mean false detection duration (MFDD) as a function of the number of local NSDAs used in the aggregation schemes. The solid lines represent the medians of ten runs together with interquartile ranges denoted with vertical lines. The grey dashed line represents the average metric of the baseline NSDA. The average (across ten runs) mean median number of patients in each NSDA is shown in parentheses.

Table 1: Comparison of the area under the curve (AUC) values found in the literature. Each reference uses a different proprietary data set. All NSDAs, except [21], were trained using the 18-channel DS. Superscript L denotes leave-one-subject-out testing and superscript C denotes the AUC value on concatenated recordings from the data set.
A large data set would also allow a detailed study of the number of local NSDAs needed to reach the desired classification performance and of whether a mixture of different types of NSDAs improves or degrades the overall performance.

Table 2: A summary of the data sets used in the study. Numbers inside parentheses represent standard deviation. Means for recordings are calculated across patients containing at least one consensus seizure longer than 16 sec (duration of one EEG segment).