Automatic Segmentation of Stroke Lesions in Non-Contrast Computed Tomography Datasets With Convolutional Neural Networks

Non-contrast computed tomography (NCCT) is commonly used for volumetric follow-up assessment of ischemic strokes. However, manual lesion segmentation is time-consuming and subject to high inter-observer variability. The aim of this study was to develop and establish a baseline convolutional neural network (CNN) model for automatic NCCT lesion segmentation. A total of 252 multi-center clinical NCCT datasets, acquired from 22 centers, and corresponding manual segmentations were used to train (204 datasets) and validate (48 datasets) a 3D multi-scale CNN model for lesion segmentation. Post-processing methods were implemented to improve the CNN-based lesion segmentations. The ﬁnal CNN model and post-processing method was evaluated using 39 out-of-distribution holdout test datasets, acquired at seven centers that did not contribute to the training or validation datasets. Each test image was segmented by two or three neuroradiologists. The Dice similarity coefﬁcient (DSC) and predicted lesion volumes were used to evaluate the segmentations. The CNN model achieved a mean DSC score of 0.47 on the validation NCCT datasets. Post-processing signiﬁcantly improved the DSC to 0.50 (P < 0.01). On the holdout test set, the CNN model achieved a mean DSC score of 0.42, which was also signiﬁcantly improved to 0.45 (P < 0.05) by post-processing. Importantly, the automatically segmented lesion volumes were not signiﬁcantly different from the lesion volumes determined by the expert observers (P > 0.05) and showed excellent agreement with manual lesion segmentation volumes (intraclass correlation coefﬁcient, ICC = 0.88). The proposed CNN model can automatically and reliably segment ischemic stroke lesions in clinical NCCT datasets. Post-processing techniques can further improve accuracy. As the model was trained and evaluated on datasets from multiple centers, it is broadly applicable and is publicly available.


I. INTRODUCTION
Non-contrast computed tomography (NCCT) is the most common imaging modality for volumetric assessment of stroke lesions [1], [2]. Manual lesion segmentation in NCCT images is time consuming and associated with high inter-observer variability. Semi-automatic lesion segmentation tools have been developed [3], [4]

but still
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei . require human interaction, which could introduce a bias, while previous work on automatic NCCT lesion segmentation is limited [5], [6].
Deep convolutional neural networks (CNNs) have shown superior performance for various segmentation tasks in medical imaging because of their ability to learn complex patterns and relationships in the data [7]. Convolutional kernels in CNNs enable the learning of non-localized spatial relationships. The use of multi-scale features and three-dimensional (3D) kernels [8] allows an automated seg- VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ mentation algorithm to take advantage of the spatial contiguity of stroke lesions while maintaining localized context. However, for stroke lesion segmentation, these methods have only been applied to magnetic resonance imaging (MRI) [9]- [11] or computed tomography perfusion and angiography datasets [12], [13]. To date, multi-scale 3D CNNs for a fully automatic stroke lesion segmentation have not been evaluated in NCCT datasets, despite its common application in stroke imaging [1], [2]. Thus, the aim of this work was to train and evaluate a multi-scale 3D CNN model for stroke lesion segmentation in follow-up NCCT datasets. For further improvement of the CNN segmentations, post-processing methods were investigated. We tested the model's generalizability by evaluating it on an out-of-distribution holdout test set, using multi-center datasets acquired at entirely different centers that did not contribute to the training and validation sets.

A. DATASETS
A total of 291 clinical follow-up NCCT datasets from the ESCAPE (252 datasets) [14] and ERASER (39 datasets) [15] multi-center trials were used. These datasets were acquired across 29 centers and corresponding manual segmentations were available. Expert observers manually segmented patient lesions in three orthogonal planes simultaneously using ITK-SNAP [16]. The in-slice resolution ranged from 0.355 to 0.637 mm, the slice thickness ranged from 1.00 to 10.0 mm, and the number of slices ranged from 10 to 141.
Approval and informed consent for the datasets from the two trials was approved by the respective ethics board at each site contributing to the two trials. All datasets used in this retrospective secondary study were made available after complete anonymization. Thus, additional ethics approval and informed consent were not required.

1) TRAINING AND VALIDATION SETS
The ESCAPE datasets collected from 22 centers were used for training and validation of the 3D CNN-based lesion segmentation model. Out of the 252 ESCAPE datasets, 204 were used for training and 48 were used for validation.

2) OUT-OF-DISTRIBUTION HOLDOUT TEST SET
The 39 ERASER datasets used for the out-of-distribution holdout test set were collected from seven centers that did not contribute to the training or validation sets. Experienced neuroradiologists manually segmented the test datasets. Each example was segmented by two (19 datasets) or three (20 datasets) observers with multiple years of dedicated experience in stroke imaging. By using out-of-distribution datasets that were segmented by multiple expert observers, this completely independent test set provides a stronger estimate of the model's generalization performance.

B. NCCT SCAN PRE-PROCESSING
As NCCT images were acquired from multiple centers with differing scanners and imaging protocols, training a CNN directly on NCCT images without pre-processing resulted in very poor performance on the training set (data not shown). Thus, the datasets were pre-processed to ensure consistency between NCCT images collected from different centers.
First, the bone structures were removed from each dataset, retaining only the brain tissue in the images. To remove the bone structures, which have high Hounsfield values, a six-step procedure following the approach described by Muschelli et al. [17] was performed in a slice-wise manner. This approach was implemented using the Insight Segmentation and Registration Toolkit (ITK) [18]. Briefly described, a Gaussian filter with a variance of 4 pixels is used to smooth each slice. In the next step, the intensities are thresholded between 0 and 100 Hounsfield units, which removes most of the artifacts from bone and other high-intensity tissue. After this, a circular structural element with a radius of 1 pixel is used to erode the resulting segmentation. Subsequently, the largest connected component in each slice is extracted and a circular structural element with a radius of 1 pixel is used to dilate this component in order to create a brain mask for the slice. After performing the these three steps in each slice, the masks from each slice are combined into a final mask for the entire volume and any holes in this final mask are filled using the ''Voting Binary Hole Filling Image Filter'' in ITK. Finally, the images were thresholded again between 0 and 100 Hounsfield units to remove the remaining high-intensity tissue artifacts resulting from the morphological erosion and dilation. The images are then normalized to zero mean and unit variance to account for potential differences in scanner tube potential and different reconstruction algorithms. All images in the training, validation, and holdout test datasets underwent the same pre-processing procedure.

C. CNN ARCHITECTURE
The CNN used in this work is based on the DeepMedic model proposed by Kamnitsas et al. [8] and modified for NCCT stroke lesion segmentation. The network parameters were optimized with cross-validation. We used a total of 11 layers. The first eight layers consist of three parallel convolutional pathways for processing the images at multiple scales. The multi-scale pathways were created by using down-sampled versions of the NCCT images (by factors of 3× and 5×) as inputs to the parallel convolutional pathways, in addition to the original image. Each parallel pathway has eight convolutional layers consisting of 30, 30, 40, 40, 40, 40, 50, and 50 feature maps and uses convolutional kernels of size 3 × 3 × 3. Additionally, residual skip connections between layers two and four, between layers four and six, and between layers six and eight are used in each parallel pathway. The ninth layer combines the three multi-scale pathways together by using the concatenated outputs from layer eight of each parallel pathway. Layer nine uses 3×3 × 3 convolutional kernels and has 250 feature maps. Layer ten is a fully-connected convolutional layer with 1 × 1 × 1 convolutional kernels and 250 feature maps. Additionally, a residual skip connection between layers eight and ten was used. The final softmax  classification layer, layer eleven, produces the lesion probability maps. A threshold of >0.5 is used to binarize the probability map to a final lesion segmentation.

D. CNN TRAINING
All CNN model training was performed in Python 2.7 on Compute Canada and Calcul Quebec computing clusters. The DeepMedic framework (v.0.7.3), available from https://github.com/deepmedic/deepmedic, was used for model training. The DeepMedic framework performs model training on image segments extracted from the original image, rather than the entire image. In this work, segments of 37 × 37 × 37 were used. The network was trained for 35 epochs with a batch size of 10. Each epoch was divided into 20 sub-epochs, within which 1000 image segments were extracted and used for model training. An initial learning rate of 0.001, which decreases through training using a polynomial decay function, was employed. Root mean square propagation was used as the optimizer. L1 and L2 regularizations of 10 −6 and 10 −4 were used, respectively. Data augmentation consisted of mirroring along the sagittal axis. The CNN model achieved a mean Dice similarity coefficient (DSC) of 0.52 in the training set evaluated by 10-fold crossvalidation.
The DSC scores and lesion volumes for the automatic segmentations of the validation and holdout test sets were obtained using a single CNN model that was trained on the entire training data.

E. POST-PROCESSING OF CNN SEGMENTATIONS
The CNN-based binary lesion segmentations were postprocessed to further improve the segmentation accuracy (Fig. 1). The implemented post-processing consists of a connected component analysis to exclude small lesion components, most likely caused by noise artifacts, and an automatic hole-filling approach. Post-processing was performed using VOLUME 8, 2020 the ITK toolkit. Using a connected components analysis, components smaller than an empirically determined cut-off were removed. The exception was in segmentations where the largest connected component was smaller than the cut-off value, in which case no cutoff was applied. Afterwards, a hole-filling algorithm was used to fill gaps within the segmentation.
The validation dataset was used to estimate the optimum minimum object size cut-off and the hole-filling kernel radius. The minimum object size cut-off was optimized first, by varying the cut-off range from 0.3 cm 3 to 2.5 cm 3 . The cut-off that maximized the DSC was 1.5 cm 3 and was used for post-processing.
Using this minimum object size cut-off, the hole-filling radius was optimized next using values of 2, 3, 5, 7, and 10 voxels. As hole-filling causes the segmented lesion volumes to grow, and subsequently increased the error in lesion volume estimates, both the DSC and lesion volume error were considered when choosing the optimal value. More precisely, the DSC was maximized while the lesion volume error was minimized. The optimal radius was found to be 3 voxels and was used for post-processing.

F. SEGMENTATION EVALUATION
The DSC was used as the primary outcome measurement for evaluation of the automatic lesion segmentations. The DSC measures the overlap between two segmentations and is defined between 0 and 1, where 1 indicates perfect consensus. The DSC is calculated as: where A and B are two binary segmentations of the same dataset.
The DSC scores in the validation set, for which only a single manual segmentation was available for each dataset, were computed by comparing the manual segmentations to the CNN-based methods (CNN or CNN + post-processing).
The DSC scores for CNN-based methods in the holdout test set, for which two or three manual lesion segmentations were available for each dataset, were computed independently for each observer (e.g. CNN vs. observer A, CNN vs. observer B, for two observers) and averaged together using the arithmetic mean. This average value was then used as the DSC score for the CNN-based methods (CNN or CNN + post-processing) vs. Observers.
The inter-observer DSC for the holdout test set was calculated for pairs of observers. In examples with three observers, three pair-wise DSC scores (e.g. observer A vs. observer B, observer A vs. observer C, observer B vs. observer C) were calculated and averaged together using the arithmetic mean to obtain a single, average, inter-observer DSC score.
Lesion volumes were calculated by multiplying the number of lesion voxels in the binary segmentation mask by the volume of each voxel. As multiple observers segmented the holdout test set, an average lesion volume estimate was obtained for each example. This was calculated as the arithmetic mean of the lesion volumes segmented by the individual observers.
Intra-class correlation coefficients (ICC) were used to assess inter-rater reliability in lesion volume estimates [19]. ICC was calculated for absolute agreement between observers for manual lesion volume estimates, and for absolute agreement between manual lesion volume estimates (average of observers) and automated lesion volume estimates (CNN or CNN with post-processing) [20]. ICCs above 0.75 were considered as excellent inter-rater reliability, following the guidelines established by the American Psychological Association [21].

G. STATISTICS
Results are reported as mean ± standard deviation (SD) or median [interquartile range] as appropriate. The Wilcoxon signed-rank test or Friedman test with Dunn's multiple comparison post-hoc correction was used for comparisons. Correlation was quantified using Spearman's rank correlation. Statistical significance was set as P<0.05. Significance in figures is denoted by * (P < 0.05) and * * (P < 0.01). All statistical analyses were performed using Graphpad Prism 8.4.

III. RESULTS
The median lesion volumes for the training and validation sets were 40.4 [14.1-96.3] cm 3 and 41.5 [20.0-107.1] cm 3 , respectively. As the out-of-distribution holdout test set was segmented by multiple observers, the manual segmentation lesion volume for each example was defined as the average volume calculated across observers. It was not possible to ensure the volume distribution of the holdout test set is similar to the training and validation sets. This is because the test set was drawn from an entirely different multi-center trial and completely independent of the training and validation sets. The median lesion volume for the test set was 20.9 [9.7-63.7] cm 3 , which is considerably lower compered to the training and validation sets.

A. VOXEL-WISE AGREEMENT
The CNN model achieved a mean DSC score of 0.47 ± 0.22 in the validation set. Post-processing significantly improved the DSC to 0.50 ± 0.23 (Wilcoxon signed-rank test, P < 0.01; Fig. 2).
The model's generalizability was assessed using the outof-distribution holdout test set (Fig. 1). These datasets were manually segmented by multiple independent observers and had an inter-observer DSC score of 0.73 ± 0.13 (Fig. 3). The CNN lesion segmentations had a DSC score of 0.42 ± 0.25 compared to manual segmentations (average of observers), which was lower than the inter-observer DSC (Friedman test with Dunn's multiple comparisons, P < 0.01). Post-processing of CNN-based segmentations significantly improved the DSC to 0.45 ± 0.26 (Friedman test with Dunn's multiple comparisons, P < 0.05).
The decreased DSC score on the out-of-distribution holdout test data may be partly attributable to an abundance  of smaller lesions in the test set, as the manual vs. automated DSC scores showed a strong positive correlation with manually segmented lesion volumes (before post-processing: Spearman's ρ = 0.77, P < 0.01; after post-processing: Spearman's ρ = 0.74, P < 0.01, Fig. 4). Indeed, even the inter-observer agreement was lower on smaller lesions and showed a strong positive correlation with lesion volume (Spearman's ρ = 0.68, P < 0.01).

B. LESION VOLUME ESTIMATES
Median lesion volume estimates for the holdout test datasets calculated from CNN lesion segmentations before    Fig. 5). Bland-Altman analysis reflected a tendency for the model to over-predict lesion volumes, though this bias was minimal (Fig. 6).
Importantly, lesion volume estimates from CNN segmentations showed excellent agreement with manual segmentations. The ICC for CNN lesion segmentations was 0.86, which was further improved by post-processing to 0.88 (Fig. 7). The agreement between observers for manual segmentations was lower, with an ICC of 0.80. VOLUME 8, 2020

IV. DISCUSSION
In this study, we developed an automatic method for clinical NCCT stroke lesion segmentation using a 3D multiscale CNN. Volumetric assessments from the automatic CNN-based method were in excellent agreement with multiple neuroradiologists. We used an out-of-distribution test set, which was completely independent from the training and validation data, in order to obtain a reliable estimate of the method's generalizability and establish a reproducible baseline for automated NCCT stroke lesion segmentation with CNNs. As NCCT is a standard imaging procedure available in most stroke centers for follow-up assessment, a generalizable automatic lesion segmentation pipeline for this modality is of high demand.
CNN models for automatic follow-up lesion segmentation have primarily been developed and investigated for MRI [8]- [11], [22]- [24]. Though reported DSC scores for lesion segmentation in MRI are typically higher (0.67-0.79) than seen in our study, the proposed method is nevertheless a promising approach for NCCT segmentation. Lesion segmentation in NCCT is more challenging compared to typical MRI follow-up sequences such as diffusion-weighted MRI [25], as the ischemic changes in NCCT images are more subtle.
To prevent over-fitting of the CNN model to a specific dataset and imaging protocol, it was trained using multi-center NCCT datasets. Testing the model on out-ofdistribution datasets with multiple manual expert observer segmentations from seven completely independent centers revealed that our model generalizes well. Decreases in voxel-wise agreement in the holdout test set, compared to the validation set, may be attributable to multiple factors. First, as previously mentioned, the test dataset came from independent centers that did not contribute to the training or validation datasets. This stands in contrast to the often used method of creating a test set by sampling from the same data source as the training set, which increases the risk of over-fitting the model to a specific data distribution and inflating model performance. Evaluating the model on test data from independent centers as done in this study may provide a more reliable and reproducible estimate of model's generalizability.
Second, the test dataset was segmented by multiple observers who did not contribute segmentations to the training or validation sets. The inter-observer voxel-wise agreement was variable, with an inter-observer DSC of 0.73. This variability in the ground truth segmentations can be expected to impact the voxel-wise agreement of automatically extracted segmentations. However, evaluating the model using segmentations from multiple observers who did not contribute to the training set segmentations may further strengthen the estimate of model's generalizability.
Other recent studies on automatic NCCT segmentation of stroke lesions have used single-scale CNN models [5], [6], reporting mean DSC scores (0.54-0.57) and volumetric ICCs (0.88) comparable to ours (DSC = 0.45, ICC = 0.88). However, these studies only evaluated their models on test data that belonged to the same distribution as their training set and were only compared against single reference segmentations [5], [6]. How these methods generalize to datasets outside their training distribution and against multiple expert observers, important for evaluating their broader applicability, was not evaluated in detail in the original studies. As the developed models are not publicly available, it was not possible to evaluate them in this study. In contrast to this, our model was tested on out-of-distribution datasets from seven completely independent centers with segmentations from multiple expert observers, which demonstrated that our model generalizes well.
Finally, the test dataset contained a greater number of small lesions. Due to the higher surface area to volume ratio, a high voxel-wise agreement for small lesions is harder to achieve. This was also reflected in the fact that manual segmentations also showed less agreement for smaller lesions. A similar result was seen for CNN segmentations of NCCT images by Barros et al., where the DSC scores were lower for subtle injuries with smaller stroke lesions (0.37) [5].
Though the voxel-wise agreement of the CNN-based segmentations was inferior to the inter-observer agreement, the corresponding lesion volumes were not significantly different from manual segmentations and had excellent agreement with them. The agreement between automatically and manually segmented lesion volumes was higher than the inter-observer agreement between manual segmentations. This suggests a potential application of the model for consistent volumetric assessment of follow-up lesions in multi-center studies. Providing consistent results is an advantage of automatic segmentation algorithms, thereby reducing variability between sites or studies.
To improve the CNN-based lesion segmentations, simple post-processing techniques were used to correct for the speckly nature of CNN segmentations. However, more sophisticated analyses such as hidden Markov random fields may achieve further improvements [26].
Though promising, this preliminary study has some limitations. First, the CNN architecture used was originally designed for MRI images [8]. While this DeepMedic model establishes a baseline for NCCT segmentation with multi-scale 3D CNNs, future work may investigate modifications such as applying the well-known U-Net [27] for improved multi-scale segmentations. Second, segmentations of smaller lesions need to be improved. Future work may investigate training models specifically for segmenting small lesions, whether through CNN architecture modifications or using a small lesion training dataset. As clinical datasets may be difficult to acquire, data augmentation methods such as generative adversarial networks may be explored [28]. Third, the training of deep learning models is stochastic in nature. Future work may investigate ensembling multiple models trained on subsets of the training data, as this may allow averaging out deficiencies in single models [5], [10], [24]. Addressing these limitations may further improve CNN-based lesion segmentations in NCCT datasets.

V. CONCLUSION
This study demonstrated the successful use of a CNN-based automated method for follow-up stroke lesion segmentation in clinical NCCT datasets. This lays the foundation for developing more advanced automatic lesion analysis tools for NCCT images and can contribute toward consistent and high-throughput analysis of large multi-center studies.
To facilitate further development of NCCT lesion segmentation methods and to provide a baseline for future evaluations, the trained CNN model is publicly available [29].