Introduction
Magnetic resonance imaging (MRI) is commonly used to diagnose various neurological diseases due to its superior ability to provide detailed images of the brain’s anatomy. It offers excellent spatial and contrast resolution without exposing the patient to radiation. In daily clinical settings, radiologists interpret MRIs qualitatively, providing reports on disease-related findings visible in the images. On the other hand, numerous efforts have been made to derive quantitative measures from brain MRIs to characterize any disease-related changes in the brain and to understand its physiological status in relation to development, aging, and sex differences. Deep learning methodologies are frequently used in this regard, offering diagnostic labels for various conditions, including Alzheimer’s disease [1], [2], [3], detecting tumors [4], [5], [6], and facilitating image-based searches for similar cases [7], [8], [9].
Such studies using MRI often require a preprocessing step known as skull stripping (SS). This process involves isolating the brain parenchyma from the whole head MRI by eliminating non-brain tissues such as the skull, skin, fat, and eyeballs. Manually extracting brain parenchyma from 3D MRIs is an extremely labor-intensive task. As a result, a variety of automated SS methods have been proposed to simplify this process.
The classical SS methods proposed in the first decade of this century [10], [11], [12] are expected to perform well for parameter-optimized datasets. However, they have been reported to be less robust to changes in local anatomy and the type of disease under study and to perform significantly less accurately for different datasets [13]. Since the beginning of the 2010s, parameter-robust and versatile methods [14], [15], [16], [17], [18] and several open source software packages [19], [20] that can provide different kinds of image analysis, such as SS and anatomical segmentation, have been proposed. While these approaches achieve relatively high SS performance, they are often associated with long processing times.
On the other hand, several high-performance, rapid SS methods have recently been proposed as deep learning technology has advanced. U-Net [21], a symmetric arrangement of convolutional neural networks (CNNs) with bypass connections between layers of the same resolution, achieves accurate SS without requiring prior anatomical expertise in region extraction [22], [23], [24], [25], [26], [27], [28], [29], [30]. These methods can be categorized into two groups: those that vertically stack the 2D SS results obtained for each slice of the brain MR image [22], [23], [24], [25], and those that extend the U-Net architecture to three dimensions [26], [27], [28], [29], [30]. Although these SS methods based on deep learning show excellent performance, there are still significant concerns regarding their robustness across datasets. In many previous studies, the images used for training and evaluation were obtained from the same facility or dataset [22], [25], [26], [29]. Images taken within the same dataset or facility generally share common characteristics such as scanner, imaging protocol, subject posture, and other conditions. Therefore, training and evaluating a model on a single dataset is easier than evaluating it on completely unknown cases. Robust performance in unknown environments is important in analyzing large-scale data that are collected across multiple sites or from multiple datasets. However, few studies have systematically investigated the influence on SS performance of implicit differences in characteristics between datasets, which can be caused by variations in imaging environments.
One of the significant challenges in achieving robust SS is dealing with variability of the subject posture, which can differ significantly depending on different imaging environments and disease status between datasets. This variability can lead to geometric differences between the training and test data, making accurate extraction of brain structure more difficult, even with deep learning-based SS methods [23], [24]. Despite this known issue, no SS methodologies have been proposed that adequately account for the diversity of subject postures.
In this paper, we propose posture correction skull stripping (PCSS), a highly accurate and robust SS method that takes into account the diversity in subject postures. PCSS is a framework based on U-Net with the following four extensions: (i) preprocessing to estimate and correct the angle and position of the subject’s head to suppress posture variation; (ii) a weighted loss function, which considers the imbalance between the brain region and other tissues [22]; (iii) a discriminator network for adversarial training introduced in generative adversarial networks (GANs) [31] and used in an SS study [24]; and (iv) an ensemble of three-way segmentation for the brain [29]. For a rigorous evaluation across multiple datasets, we use five T1-weighted brain MRI public datasets (ADNI, CC-12, LPBA40, NFBS, and OASIS) and discuss the impact of different datasets on SS performance, which has not been addressed previously. In addition to the effects of (i) posture correction, which is the main proposal in this paper, each of the technical elements introduced in (ii), (iii), and (iv) is also evaluated to discuss the key techniques involved in achieving a robust SS.
The main contributions in this paper are as follows:
Proposal for posture estimation of subjects (head angle and position) and a connected correction method for constructing highly robust SS.
Clarification of the elemental techniques required to realize accurate and robust SS based on appropriate evaluations.
Proposal of a practical SS method with high speed (8.07 sec/case) and high accuracy (Dice score = 96.95) based on these effective elemental technologies.
The code and results are published at https://github.com/IyatomiLab/Posture-Correction-Skull-Stripping.
Related Work
Among the open source software employed for automated processing of MRIs, 3dSkullStrip is provided as a component of the Analysis of Functional NeuroImages (AFNI) [32] package. It uses a modified version of the Brain Extraction Tool (BET) [11]. FreeSurfer [19] is another open source software package for automated processing of MRIs that includes SS among its functionalities. Within this package, the Hybrid Watershed Approach (HWA) [12] is used for SS; its efficacy has been evaluated with extensive datasets [33]. However, the process typically takes several hours to complete on a standard desktop computer. MRICloud [20], which is currently recognized as one of the most accurate SS methods, generates the segmentation mask using multi-atlas label fusion and arbitration algorithms. However, the segmentation and SS process for a single case typically requires approximately one hour to execute.
Salehi et al. [22] proposed a network architecture called Auto-Net, which introduced an auto-context CNN for SS of 3D brain MRIs. Auto-Net, based on the U-Net [21] framework, incorporates 26 convolutional layers. Their method can implicitly learn from 3D images without using computationally expensive 3D convolution: it performs SS on each cross section and stacks the results vertically to achieve SS for the entire brain MRI. The auto-context CNN in Auto-Net uses the auto-context algorithm [34]. In this approach, the posterior probabilities obtained from the segmentation in the first training stage are incorporated as contextual information by concatenating them in the channel direction in the second training stage. The same procedure is applied during evaluation. Salehi et al. achieved good results using Auto-Net with Dice scores of 97.73 and 97.62 on the LPBA40 and OASIS datasets, respectively. However, there is a possibility of overfitting because the same dataset was used for both training and evaluation. In addition, those authors did not separately evaluate the impact of the weighted loss function (cross entropy) that they introduced to account for the ratio of the non-brain region to the brain region.
Jiang et al. [24] proposed SS using Wasserstein GAN (WGAN) [35] in conjunction with O-Net, which incorporates an attention module into the U-Net, establishing a new shortcut for the corresponding mapping between encoders and decoders. This approach effectively preserves detailed image features while leveraging deep semantic information to emphasize target regions for each channel. WGAN improves SS performance based on GAN by introducing the discrimination of SS results generated by O-Net. WGAN+O-Net, trained on the LPBA40 dataset, achieved a Dice score of 95.51 on the IBSR18 dataset. However, it was acknowledged that the IBSR18 dataset was of low quality with heavy artifacts, and the evaluation was not performed on a variety of high-quality datasets. In addition, only 18 images were evaluated.
Fatima et al. [25] proposed MVU-Net, which performs separate SS on the coronal, sagittal, and transverse sections of input 3D MRIs and generates an ensemble probability map of the brain region. This approach reduces the ambiguity of SS that uses only a single section. The architectural design of MVU-Net was inspired by U-Net and SCU-Net [36]. Its structure enables efficient training with a relatively small number of parameters (1.4 million). MVU-Net performed well on the NFBS and IBSR datasets, achieving Dice scores of 96.81 and 91.84, respectively. However, as in Salehi et al., there is a potential risk of overfitting because the same datasets were used for both training and evaluation.
Isensee et al. [28] proposed nnU-Net, a deep learning-based segmentation method that automatically configures preprocessing, network architecture, training, and post-processing. nnU-Net determines three types of parameters (fixed, rule-based, and empirical) to construct model training. Fixed parameters include network architecture and training plan, rule-based parameters include normalization and resampling, and empirical parameters include ensemble and post-processing. Given a new segmentation task, nnU-Net determines these parameters and builds a pipeline connecting them, allowing users to generate segmentation models easily without domain knowledge. Theirs is one of the few papers to conduct a rigorous evaluation using 23 datasets of biomedical image segmentation and show robust and high performance. Therefore, we apply nnU-Net to SS and use it as a comparison in this paper.
The SS method that Isensee et al. [28] proposed has shown a certain effectiveness under specific circumstances. However, for the practical assessment of SS performance, it is important to replicate and evaluate diverse imaging environments using multiple datasets, just as those authors did. The impacts of the weighted loss function in Salehi et al. [22], training with a discriminator in Jiang et al. [24], and ensembling techniques in Fatima et al. [25] must be evaluated equally under practical conditions. In this paper, we also evaluate the effects of these techniques on SS performance.
Isensee et al. [27] developed HD-BET, an accurate skull stripping technique, using the large EORTC-26101 dataset collected from 37 institutions. HD-BET performs skull stripping with a U-Net trained on detailed GTs obtained by manual correction of BET [11] results. HD-BET outperformed six previous SS methods in a rigorous evaluation in which the evaluation data (3,419 images from 12 locations) were obtained from different locations than the training data (6,586 images from 25 locations), as well as on the untrained CC359, LPBA40, and NFBS datasets. In addition, HD-BET outperformed the competition not only on T1-weighted images but also on other scan sequences such as contrast-enhanced T1-weighted, T2-weighted, and FLAIR. In our experiments, HD-BET was compared with our proposal and the other comparative methods as a reference, although it was trained on more than ten times as many images.
Posture Correction Skull Stripping
In this paper, we propose posture correction skull stripping (PCSS), an SS method that is designed to be robust to variations in the position and angle of the subject’s head across different datasets. Figure 1 shows the overview of PCSS, which consists of two phases: (1) posture correction phase and (2) skull stripping (SS) phase. In the posture correction phase, a posture estimation network (PENet) consisting of CNNs is used to estimate and correct the head posture in MRIs. In the SS phase, the skull stripping network (SSNet), based on U-Net, which has been widely used for segmentation tasks involving deep learning in recent years, is used to extract actual brain regions from original images.
The main contribution of this paper is to propose a robust and accurate SS method that includes posture correction. In addition, we compare and evaluate several technical elements used in the construction of SSNet using various datasets to provide guidance on the effective model structure for SS.
A. Posture correction phase
MRIs may also vary due to differences in the subject’s head position at the time of imaging, which can significantly reduce SS performance. Therefore, in the posture correction phase, the position and angle of the head are estimated and corrected to prevent SS performance degradation. Figure 2 shows an overview of the process in this phase.
The variation in head position is pronounced in the pitch direction (
The reference neck line is represented by
First, the posture correction phase estimates the tilt
B. Skull stripping phase
The 3D brain MRIs corrected in the posture correction phase are passed to SSNet in the SS phase, which extracts the brain regions. Figure 3 shows an overview of the SS phase, including SSNet. SSNet is based on U-Net and introduces three techniques: weighted cross entropy as a loss function to account for the volume ratio between brain and non-brain regions [22] (weighted loss function); the discriminator networks used in GANs [24] to improve SS results (adversarial training); and an ensemble of segmentations from three sections of the brain [25] (ensemble). The details of these technical components are described in Section III-C.
It is difficult to collect large numbers of medical images such as brain MRIs because of privacy issues and acquisition costs. Moreover, the cost of annotating these images is very high, making it challenging to prepare a sufficient amount of 3D training data. For this reason, SSNet, like many other SS methods, performs 2D SS on each cross section of a 3D brain MRI and vertically stacks the results (and, if necessary, ensembles the SS results for each section) to achieve the final SS.
In the case of general T1-weighted brain MRIs of
SSNet takes a 2D cross sectional image
C. Assessment of effective techniques for Skull Stripping
In addition to posture correction, the proposed PCSS introduces three techniques (weighted loss function, adversarial training, and ensemble) used with U-Net in recent years to achieve accurate, robust, and practical SS. PCSS is available in two architectures, PCSS-1 and PCSS-3, depending on the objective. PCSS-1 is an SS model that includes two machine learning technical elements (weighted loss function and adversarial training) in addition to posture correction; high-speed inference is expected because SS can be obtained by estimating only one section. PCSS-3 is an SS model that includes all elements (posture correction, weighted loss function, adversarial training, and ensemble). Because it ensembles SS results from three directions at run time, higher accuracy can be expected, but at the expense of longer execution time. Each of the machine learning technical elements used in PCSS is described in detail below. In this paper, we evaluate and discuss the effectiveness of these techniques.
1) Weighted loss function
Since brain regions account for only about 10% of the volume of many original brain MRIs, the segmentation task can be viewed as an imbalanced classification problem. However, no paper has specifically discussed and evaluated the effect of this imbalance. Therefore, we compared the SS performance obtained by training with weighted cross entropy, a loss function commonly used for imbalanced data, with that obtained using ordinary binary cross entropy.
\begin{align*} \mathcal {L}_{\text {ss}} &= -\mathbb {E}_{x_{i} \sim x}[\alpha \cdot m(y_{i})\cdot \log p(x_{i}) \\ &\quad +(1-m(y_{i})) \cdot \log (1-p(x_{i}))]. \tag{1}\end{align*}
The value of
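As an illustration of Eq. (1), the following minimal NumPy sketch computes the weighted cross entropy for a predicted brain probability map and a binary GT mask. The function name `weighted_bce`, the clipping constant `eps`, and the value of `alpha` in the usage lines are assumptions made for this example only; they are not the settings used in this paper.

```python
import numpy as np


def weighted_bce(prob, mask, alpha, eps=1e-7):
    """Weighted binary cross entropy in the form of Eq. (1).

    prob  : predicted brain probability p(x_i), values in [0, 1]
    mask  : binary GT mask m(y_i), 1 for brain and 0 for non-brain
    alpha : weight on the brain (positive) term; hypothetical value
    eps   : clipping constant to avoid log(0); an assumption of this sketch
    """
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    mask = np.asarray(mask, dtype=float)
    per_pixel = alpha * mask * np.log(prob) + (1.0 - mask) * np.log(1.0 - prob)
    return float(-per_pixel.mean())


# Usage with toy data: alpha = 1 reduces to ordinary binary cross entropy.
toy_prob = np.array([[0.9, 0.2], [0.4, 0.1]])
toy_mask = np.array([[1, 0], [1, 0]])
plain = weighted_bce(toy_prob, toy_mask, alpha=1.0)
weighted = weighted_bce(toy_prob, toy_mask, alpha=5.0)  # illustrative weight
```

Setting `alpha` above 1 increases the penalty for missing brain pixels relative to non-brain pixels, which is the mechanism by which the weighted loss addresses the class imbalance.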
2) Adversarial training
Recently, models such as pix2pix [37], which applies the adversarial training introduced by GAN [31], have been proposed and reported to improve the performance of U-Net-based segmentation techniques. This approach improves the performance of the original model (generator;
The loss function for the model with the discriminator added combines the reconstruction error $\mathcal {L}_{\text {ss}}$ of (1) and the adversarial loss $\mathcal {L}_{\text {Adv}}$, weighted by $\lambda$: \begin{equation*} \mathcal {L}=\mathcal {L}_{\text {ss}}+\lambda \mathcal {L}_{\text {Adv}}, \tag{2}\end{equation*}
\begin{align*} \mathcal {L}_{\text {Adv}}&=\mathbb {E}_{x_{i} \sim x, y_{i} \sim y}[\log D(x_{i}, y_{i}) \\ &\quad +\log (1-D(x_{i}, G(x_{i})))]. \tag{3}\end{align*}
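Eqs. (2) and (3) can be evaluated numerically as in the sketch below. It assumes that the discriminator outputs for GT masks, D(x, y), and for generated masks, D(x, G(x)), are already available as arrays of probabilities. The function names, the clipping constant, and the value of `lam` are hypothetical. The sketch only evaluates the expressions as written; the alternating optimization of generator and discriminator is outside its scope.

```python
import numpy as np


def adversarial_term(d_real, d_fake, eps=1e-7):
    """Value of Eq. (3): E[log D(x, y) + log(1 - D(x, G(x)))].

    d_real : discriminator outputs for (image, GT mask) pairs
    d_fake : discriminator outputs for (image, generated mask) pairs
    eps    : clipping constant to avoid log(0); an assumption of this sketch
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real) + np.log(1.0 - d_fake)))


def total_loss(loss_ss, d_real, d_fake, lam):
    """Value of Eq. (2): L = L_ss + lambda * L_Adv."""
    return loss_ss + lam * adversarial_term(d_real, d_fake)


# Usage with toy numbers; lam = 0.1 is an illustrative value only.
example = total_loss(loss_ss=0.25, d_real=[0.8, 0.7], d_fake=[0.3, 0.4], lam=0.1)
```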
3) Ensemble of each section
Ensemble is a fundamental element of machine learning technology that improves performance by using multiple independent weak learners. The SSNet in this experiment obtained final SS results by stacking 2D SS results. We investigate the effect of the ensemble of SS results from different 2D sections on the final SS capability. Specifically, SS results were compared between stacks of only transverse sections and ensemble of three sections (i.e., the average of the brain probability maps). In the brain probability maps obtained for both approaches, regions with a probability of 50% or greater were extracted as the brain.
The ensemble typically combines the prediction results of separately trained models. However, preliminary experimental results showed no significant difference between the ensemble model trained with cross sections separately and the model trained with all three cross sections together. The ensemble adopted the latter implementation because it is undesirable in terms of computational resources and execution speed to invoke the three different models individually at runtime.
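The stacking and three-section ensemble described above can be sketched as follows. The example assumes a single 2D predictor, here the placeholder `predict_slice`, that maps a slice to a brain probability map of the same shape, in line with the single-model implementation adopted here. Which array axis corresponds to the transverse, coronal, or sagittal section depends on the image orientation and is an assumption of this sketch, as are all function names.

```python
import numpy as np


def predict_along_axis(volume, predict_slice, axis):
    """Apply a 2D predictor to every slice along `axis` and restack to 3D."""
    slices = np.moveaxis(volume, axis, 0)
    probs = np.stack([predict_slice(s) for s in slices], axis=0)
    return np.moveaxis(probs, 0, axis)


def ensemble_mask(volume, predict_slice, axes=(0, 1, 2), threshold=0.5):
    """Average the stacked probability maps over `axes`, then threshold.

    With axes=(k,) this reduces to stacking a single section; with three
    axes it averages the three brain probability maps. Voxels with a mean
    probability of at least `threshold` (50%) are labeled as brain.
    """
    maps = [predict_along_axis(volume, predict_slice, a) for a in axes]
    mean_prob = np.mean(maps, axis=0)
    return mean_prob >= threshold


# Usage with a stand-in predictor (intensity thresholding, not a trained model).
toy_volume = np.random.default_rng(0).random((4, 5, 6))
toy_predictor = lambda s: (s > 0.5).astype(float)
toy_mask = ensemble_mask(toy_volume, toy_predictor)
```

Because the same predictor is called for all three directions, only one set of weights needs to be loaded at run time, which matches the motivation given above for avoiding three separately trained models.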
Dataset
Table 1 shows an overview of the datasets used in this paper. The five public 3D brain MRI datasets used in this study are the Alzheimer’s Disease Neuroimaging Initiative 2 (ADNI2) dataset [38], the Calgary-Campinas-12 (CC-12) dataset [39], the LONI Probabilistic Brain Atlas (LPBA40) dataset [40], the Neurofeedback Skull-stripped (NFBS) dataset [41], and Disc-1 and Disc-2 of the Open Access Series of Imaging Studies (OASIS) dataset [42]. Each image was acquired in NIfTI format.
For the training of the proposed PCSS (PENet and SSNet), images from ADNI2 were used, because it has the largest amount of recorded data and is a practical dataset with a wide range of head tilt, size, intensity, and other factors. SS performance was evaluated on the ADNI2 images held out in each fold of 5-fold cross validation and on the remaining four datasets, which were used only for evaluation.
In addition, only one case per patient was used for all datasets. Data from the same patient often have similar characteristics, and their inclusion may introduce bias in training and evaluation, affecting the validity and generalizability of the results. Moreover, if images of the same patient are included in both the training and evaluation data, the model may overfit to that patient, so the measured scores overestimate actual performance and the evaluation becomes incorrect. Therefore, in this experiment, such additional cases were excluded to ensure rigorous evaluation.
Because ADNI2 does not have a ground truth (GT) for the brain regions, the SS results from MRICloud [20], currently considered one of the most accurate SS methods, were used as the GT, recognizing that they may not be accurate in some cases. The evaluation of the proposed method is described below; we have ensured the validity of the proposal by also evaluating it on the CC-12, LPBA40, and NFBS datasets, which have manually annotated GTs.
The GTs provided in the LPBA40 and NFBS datasets consist of manually edited masks based on automatic SS results from the BET [11] and brain extraction using nonlocal segmentation technique (BEaST) [16], respectively. The GT provided in the OASIS dataset is obtained by FreeSurfer (HWA) [19] rather than manual results. Although this result may contain a certain percentage of inaccuracies, it was used as a reference result since it has been used in many other studies for performance evaluation.
Experiment
In this experiment, we quantitatively evaluated the performance of the posture correction proposed to achieve an accurate and robust SS. To verify the effectiveness of the proposed PCSS and various SS techniques, we compared SS performance with previously reported methods using publicly available datasets (ADNI [38], CC-12 [39], LPBA40 [40], NFBS [41], and OASIS [42]).
A. Preprocessing
The datasets with a resolution of approximately 1 mm
B. Details of PENet and its evaluation
The PENet that performs the posture correction consists of three convolution layers and two fully connected layers. The details of the PENet configuration are described in Figure 4. Note that the network models presented in this paper are not limited to the configuration shown. The PENet took as input the sagittal slice with the largest number of non-zero (non-background) pixels. Candidate slices were restricted to a range of 15% on each side of the center of the matrix (i.e., 38 slices on each side of the 128th slice) to eliminate the possibility that the detected slice would be outside the brain region. The selected slices were reduced to
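The slice selection rule for the PENet input can be sketched as below. The example assumes that the sagittal direction is the first array axis, uses a zero-based center index (`n // 2`), and rounds the 15% half-width to the nearest integer, which gives 38 slices for a 256-slice matrix. The function name and these indexing conventions are assumptions of this sketch.

```python
import numpy as np


def select_sagittal_slice(volume, axis=0, fraction=0.15):
    """Return the index of the slice with the most non-zero pixels,
    searching only within `fraction` of the matrix size on each side of
    the central slice."""
    n = volume.shape[axis]
    center = n // 2
    half = int(round(fraction * n))
    lo = max(center - half, 0)
    hi = min(center + half, n - 1)
    window = np.moveaxis(volume, axis, 0)[lo:hi + 1]
    counts = np.count_nonzero(window.reshape(window.shape[0], -1), axis=1)
    return lo + int(np.argmax(counts))


# Usage on a toy volume: the densest slice outside the window is ignored.
toy = np.zeros((256, 4, 4))
toy[10] = 1.0          # many non-zero pixels, but far from the center
toy[100, :2, :2] = 1.0  # inside the search window
chosen = select_sagittal_slice(toy)
```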
To evaluate the performance of PENet, the mean absolute error (MAE) was calculated between the estimated angle
In some MRIs, the facial region may be blacked out for patient privacy reasons, or the additional regions below the head may have already been removed. To achieve robust posture correction even for such images, two forms of data augmentation were used: the Cutout approach [43], applied to between 0.6% and 25% of the whole image, and the use of images whose skull had already been stripped by MRICloud, with a probability of 20% (i.e., 60% were normal images). In addition, a random rotation of −30 degrees to +30 degrees and a random shift of −20 pixels to +20 pixels were added as data augmentation with a probability of 70% to account for the imaging environment. Early stopping was used to avoid overfitting of the PENet: training ends when the validation loss shows virtually no improvement for 100 epochs. The PENet was trained on an RTX3090 with the Adam optimizer, using a learning rate of
To evaluate the robustness of PENet, evaluations were also performed when the amount of training data was reduced to
C. Details of SSNet and its evaluation
The U-Net on which SSNet is based consists of 16 layers of encoder and decoder CNNs; the detailed structure of the U-Net is shown in Figure 5. SS on the ADNI2 dataset was evaluated by 5-fold cross validation, while all images from the ADNI2 dataset were used as training data for the evaluation of the other datasets (CC-12, LPBA40, NFBS, and OASIS).
When training the model (i.e., training on the transverse cross sections), online data augmentation was added by randomly rotating the image from −20 degrees to +20 degrees and shifting it by −20 pixels to +20 pixels to mimic the variation in the subject’s posture during actual imaging. The same augmentation was applied in each cross section when training the ensemble model. For the sagittal cross sections, however, a wider range of random rotation of −30 to +30 degrees was applied, as it was assumed that the range of motion for the posture at the time of imaging was wider.
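The rotation-and-shift augmentation described above can be sketched with NumPy alone as follows. Nearest-neighbor resampling, zero filling outside the image, uniform sampling of the parameters, and the sign convention of the rotation are assumptions of this sketch; the interpolation actually used is not specified here. The same `rotate_shift` operation, applied with an estimated angle and offset instead of random ones, is also the kind of transform a posture correction step would perform.

```python
import numpy as np


def rotate_shift(image, angle_deg, shift_rc=(0, 0)):
    """Rotate a 2D image about its center, then translate it by
    (rows, cols) = shift_rc, using nearest-neighbor inverse mapping."""
    h, w = image.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # For each output pixel, undo the shift and the rotation to find its source.
    y = rows - cy - shift_rc[0]
    x = cols - cx - shift_rc[1]
    src_r = np.rint(np.cos(theta) * y + np.sin(theta) * x + cy).astype(int)
    src_c = np.rint(-np.sin(theta) * y + np.cos(theta) * x + cx).astype(int)
    valid = (src_r >= 0) & (src_r < h) & (src_c >= 0) & (src_c < w)
    out = np.zeros_like(image)
    out[valid] = image[src_r[valid], src_c[valid]]
    return out


def random_augment(image, rng, max_angle=20.0, max_shift=20):
    """Random rotation in [-max_angle, +max_angle] degrees and random shift
    in [-max_shift, +max_shift] pixels along each image axis."""
    angle = rng.uniform(-max_angle, max_angle)
    shift = rng.integers(-max_shift, max_shift + 1, size=2)
    return rotate_shift(image, angle, shift)


# Usage: transverse-style settings; max_angle=30.0 would mimic the sagittal case.
rng = np.random.default_rng(0)
augmented = random_augment(np.ones((64, 64)), rng, max_angle=20.0, max_shift=20)
```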
A five-layer fully convolutional network (FCN) was used as the discriminator to introduce GAN-style adversarial training into SS. The detailed structure of the discriminator is shown in Figure 6.
D. Evaluation of skull stripping performance
To discuss the effectiveness of the proposed PCSS (PCSS-1 and PCSS-3), we compared its SS performance on the five datasets described above with that of Auto-Net [22], MVU-Net [25], and HD-BET [27], three state-of-the-art methods that have reported the best SS performance. In addition, for comparison with a 3D U-Net, nnU-Net [28] was configured with a 3D U-Net architecture. Auto-Net and nnU-Net were reproduced using the authors’ public implementations, and MVU-Net was reproduced to the best of our ability based on the original paper. HD-BET, a publicly available model trained on 6,586 images (about ten times more than PCSS) from 25 sites, was also included in the comparison. The GT used to train this model was manually modified based on BET [11] and is not publicly available. Therefore, we cannot compare its performance under fair conditions using the same training images, but we included it for reference. Note that the ADNI2 result from HD-BET is not based on 5-fold cross validation.
Table 2 summarizes the technical elements of PCSS. To evaluate the technical components of the proposed PCSS, U-Net + posture correction (+PC), U-Net + weighted loss function (+W), U-Net + discriminator (+Adv), and U-Net + ensemble (+Ens) were evaluated (i.e., an ablation study). To keep the experimental conditions fixed, U-Net, +PC, +W, +Adv, and +Ens were given the same data augmentation as PCSS.
Recall, precision, and their harmonic mean (the Dice score) were used as evaluation metrics. To quantitatively evaluate the number of SS failures, the number of cases with a Dice score less than 0.95 was tabulated.
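A minimal sketch of these metrics for a pair of binary masks is given below. The function names are chosen for illustration, and returning 0.0 when a denominator is zero is a convention of this sketch rather than something defined in this paper.

```python
import numpy as np


def ss_metrics(pred, gt):
    """Return (recall, precision, dice) for two binary masks of equal shape."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    tp = int(np.count_nonzero(pred & gt))    # brain voxels correctly extracted
    fp = int(np.count_nonzero(pred & ~gt))   # non-brain voxels labeled as brain
    fn = int(np.count_nonzero(~pred & gt))   # brain voxels that were missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Dice equals the harmonic mean of recall and precision.
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return recall, precision, dice


def count_failures(dice_scores, threshold=0.95):
    """Number of cases whose Dice score falls below the failure threshold."""
    return sum(1 for d in dice_scores if d < threshold)


# Usage on a toy pair of masks.
toy_recall, toy_precision, toy_dice = ss_metrics([1, 1, 1, 0], [1, 1, 0, 0])
```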
Results
A. Result of posture correction
Figure 7 shows four examples of posture correction results for ADNI2 test cases. The top row compares the predicted reference neck line for the input image (red dashed line) with the GT displayed as a reference (yellow line), and the bottom row shows the results with the image rotated and shifted so that the predicted reference neck line coincides with the alignment neck line (light blue line); that is, the final posture correction result.
In the ADNI2 dataset, the GT measurements of the reference neck line had a standard deviation of 7.81 degrees in angle and 14.69 pixels (equivalent to 29 mm in physical size) in vertical position. These findings indicate the presence of such variations in head posture in the ADNI2 datasets.
The proposed PENet accurately estimated the reference neck line, with an average error of 3.61 ± 2.72 degrees in the head angle and 6.42 ± 5.17 pixels (equivalent to 13 ± 10 mm in physical size) in the vertical direction, based on the average of 5-fold cross validation. As a result, the variation in the angle and position of the head was significantly reduced. The rightmost image in Figure 7 shows an example with a large estimation error compared with the other three examples.
Figure 8 shows example results of posture correction for the CC-12, LPBA40, NFBS, and OASIS datasets in panels (a) to (d), respectively. Although a quantitative evaluation is not available for these datasets, it was visually confirmed that the posture correction was performed appropriately, even for the datasets that were not used for training. In addition, this posture correction improved the SS performance in the later stage for all datasets, indirectly suggesting that the posture correction was also successful. Details are discussed below.
Figure 9 shows the relationship between the amount of training data for the PENet (612 cases in total) and the posture correction error (red, angle error; blue, misalignment). This result shows that PENet can accurately estimate posture even when trained with only about 150 cases (i.e.,
B. Result of skull stripping
Table 3 shows the scores for all methods compared. In the evaluation on the ADNI2 dataset, where the training and evaluation data are from the same source, there was little difference among the scores of the methods. Only the proposed PCSS showed lower precision than the other methods, resulting in slightly lower Dice scores. As discussed below, this is not due to inherently low SS performance; rather, PCSS was judged to overdetect when evaluated against the GT defined by MRICloud. This apparent overdetection arises because the GT does not treat the region of the falx cerebri (cerebral sickle), whose boundary is very difficult to identify, as part of the brain. See the discussion below for details.
On the other hand, the results for LPBA40, CC-12, NFBS, and OASIS show that the proposed PCSS achieves significantly higher SS performance than existing methods, including the state-of-the-art Auto-Net, MVU-Net, and nnU-Net. In particular, PCSS has a very small number of cases that could be considered SS failures (#Dice < 0.95). In addition, PCSS performs almost as well as HD-BET, which is trained on about ten times more data, and outperforms HD-BET on CC-12 and NFBS, which have manual labels.
The performance difference between PCSS-1 and PCSS-3 was only 0.36 points on average in terms of Dice score, while the time required for SS was 8.07 and 20.24 seconds for PCSS-1 and PCSS-3, respectively. Of these, the time required for posture correction of one section was 1.96 seconds (approximately 24.2% and 9.6% of the total SS processing, respectively).
Figure 10 compares examples of SS results using the baseline Auto-Net [22] and the proposed PCSS-3 observed from the sagittal plane. Note that Auto-Net offered the best performance of the three state-of-the-art methods that were used for comparison and were reproduced in the authors’ implementation. The red line shows the predicted mask by PCSS-3, and the yellow line shows the mask by GT. The proposed PCSS effectively performed SS for the images in all datasets. In particular, recall is significantly improved compared to the existing methods, and the reproducibility of brain regions in PCSS-3 is confirmed to be very high. The relatively low numerical results in OASIS for all methods are due to incomplete GT by FreeSurfer, as mentioned above; PCSS actually detects the appropriate regions.
Example of SS results by PCSS; from top to bottom, SS results for U-Net, Auto-U-Net, and PCSS-3.
Table 4 summarizes the effect of each technical element introduced in this paper on final SS performance. These are the averages of the differences in SS performance, as measured by Dice score, between the baseline (U-Net) and the case where each element X is implemented alone (i.e., +X). The average is the macro average of the scores of the datasets, excluding ADNI2 (used for training) and OASIS (which has GT reliability concerns). We could confirm that our posture correction proposal contributed to the improvement of SS performance for all datasets. In addition, we confirmed that the other three techniques also contributed to the improvement of SS performance.
Discussion
A. Effects of posture correction
The proposed posture correction method is a simple and robust method that requires only a small amount of training data (Figure 9). Our posture correction reduced head angle variability (i.e., standard deviation) by 4.2 degrees, from 7.81 degrees to 3.61 degrees, and vertical deviation by 16 mm, from 29 mm to 13 mm, each to less than 50% of its pre-correction magnitude. In other words, correcting for the pitch direction, which varied widely across the datasets, standardized the appearance across slices. As a result, SS performance improved for datasets that were not used for training, which shows that our posture correction contributed to the improvement in SS performance. In addition, this process is computationally efficient (1.96 sec/case; approximately 24.2% of the entire SS process in PCSS-1).
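As a quick check of these figures, treating the post-correction estimation errors (3.61 degrees and 13 mm) as the residual variability, as is done above, the reductions and the remaining fractions are:
\begin{align*} 7.81 - 3.61 &= 4.20~\text{degrees}, & 3.61 / 7.81 &\approx 0.46, \\ 29 - 13 &= 16~\text{mm}, & 13 / 29 &\approx 0.45. \end{align*}
Both ratios are below 0.5, which is consistent with the statement that each quantity fell to less than 50% of its pre-correction magnitude.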
B. Performance of skull stripping
1) Discussion on GT and performance evaluation
The proposed PCSS (PCSS-1 and PCSS-3) generally achieved the best SS performance (Dice scores of 96.95 and 97.31, respectively, excluding OASIS) but scored numerically lower than the other methods on the test cases of the ADNI2 dataset, which was used for training.
First, we discuss why PCSS has lower SS scores (specifically, precision) only on the ADNI2 test cases. Figure 11 shows, from left to right, the five cases with the worst Dice scores (and the worst precision) in one fold for PCSS-3. The red line shows the region predicted by PCSS-3, and the yellow line shows the region given by MRICloud, which was used as GT. In case (a), which has the lowest Dice score, MRICloud failed to detect a portion of the medulla oblongata and the cerebellum’s tonsil (1). In contrast, PCSS-3 accurately extracted this region. Hence, the lower Dice score is attributable to an error in the GT, not a failure of PCSS-3. In cases (b)-(e), the disparities between the brain masks generated by PCSS-3 and MRICloud primarily concerned the inclusion or exclusion of the cerebral longitudinal fissure, a space occupied by cerebrospinal fluid and the falx cerebri, a membrane that separates the left and right cerebral hemispheres. While MRICloud generally excluded this space and membrane from the brain mask, PCSS-3 was inclined to include them. Delineating this space, which has a complex boundary, can be challenging even for neuroimaging experts. Considering that the GT of the other databases includes this area in the brain mask, we believe a Dice score slightly lower than that of nnU-Net (97.04 vs. 98.88) is practically acceptable. Even in case (b), the worst among cases (b)-(e), the Dice score is 96.44, which represents sufficient accuracy for the ADNI2 dataset.
2) Comparison and discussion with other methods
The proposed PCSS achieves the best SS performance among all compared methods, including state-of-the-art 2D and 3D methods, on two datasets not used for training. In addition, when HD-BET, which uses about ten times more training data than PCSS, is excluded from the comparison, PCSS shows the best performance on all four datasets. This is evidence of its inherently high SS performance. PCSS shows robust and accurate SS results for these unseen datasets, indicating that it does not overfit to the GT given by MRICloud, which is also confirmed by the results in Figure 10.
For LPBA40 in particular, which has manual GT, most of the cases (31 of 40) in the baseline U-Net have a Dice score of less than 0.95 due to under-detection (low recall). By contrast, PCSS achieved extremely accurate SS for almost all cases despite using the same training data. In addition, LPBA40 has a resolution of 0.86 mm and 0.88 mm per pixel, which differs from the resolution used in training. This result shows that PCSS is robust to data whose resolution differs from that of the training data. The same trend is observed for the other datasets, which confirms the excellent SS performance and robustness of PCSS.
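To make the link between under-detection, recall, and the Dice score concrete, the sketch below computes the three overlap measures from their standard definitions on a toy example. The masks are represented as sets of voxel indices and are synthetic; nothing here is taken from the paper’s data or code.

```python
def overlap_metrics(pred, gt):
    """Return (dice, precision, recall) for two non-empty sets of voxel indices."""
    if not pred or not gt:
        raise ValueError("both masks must contain at least one voxel")
    tp = len(pred & gt)                    # voxels labeled brain in both masks
    precision = tp / len(pred)             # fraction of the prediction that is brain
    recall = tp / len(gt)                  # fraction of the brain that was found
    dice = 2 * tp / (len(pred) + len(gt))  # harmonic mean of precision and recall
    return dice, precision, recall


gt_mask = set(range(100))    # toy GT brain mask: 100 voxels
under_mask = set(range(90))  # prediction misses 10 brain voxels, no false positives

dice, precision, recall = overlap_metrics(under_mask, gt_mask)
print(f"Dice={dice:.4f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the prediction contains no false positives, so precision is 1.0, yet missing 10% of the brain voxels alone is enough to push the Dice score below 0.95.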
The advanced SS methods Auto-Net [22], MVU-Net [25], and nnU-Net [28] show good results for the ADNI2 dataset. However, their results for the other datasets are lower than those of PCSS, mainly because of lower recall (by an average of 4.68, 5.94, and 4.39 points, respectively, excluding OASIS). One likely reason is that Auto-Net aims to improve accuracy by feeding the trained probability maps of brain regions back to the input, which may lead to overfitting. In addition, MVU-Net employs an architecture with a small number of parameters to improve efficiency, but this architecture may not have been expressive enough to achieve accurate SS across multiple datasets. Meanwhile, nnU-Net uses 3D spatial information through a 3D U-Net. It therefore has the best Dice score on the training dataset, ADNI2, but overfitting is thought to prevent it from handling images, such as those in NFBS, that are cut off below the neck.
HD-BET [27] shows the most robust SS results among previous studies. Its low Dice score on the ADNI2 dataset is due to the difference in GT between HD-BET and MRICloud: the GT of MRICloud does not include the cerebral longitudinal fissure, whereas that of HD-BET does. As a result, the measured accuracy of HD-BET on ADNI2 is reduced. As described earlier, it is difficult even for experts to create an accurate GT for this space. Therefore, the low score of HD-BET on ADNI2 is not inherently a problem with its SS performance, and the results on the other datasets suggest that HD-BET is capable of highly accurate and robust SS for diverse data. However, its authors used 6,586 images as training data and semi-manually created masks for 1,568 T1-weighted images. Creating each mask takes 15 minutes, which makes building a larger network in the future very expensive. PCSS achieved almost the same performance as HD-BET even though it was trained on only 612 images. This result shows that PCSS can achieve highly robust SS with little training data.
Although the original papers [22] and [25] reported better performance than other comparison methods when SS ability was evaluated by splitting within the same dataset, our results raise concerns about their robustness to data from different environments. These differences in performance across datasets are thought to arise from stronger overfitting caused by differences in subjects’ postures. By contrast, the proposed PCSS, which introduces posture correction and other techniques, is able to achieve robust and accurate SS results for a large number of datasets.
C. Discussion of technical elements for skull stripping
The main proposal in this paper, posture correction, improves SS performance on all five datasets and is an important factor in achieving robust and accurate SS. The introduction of weighted loss functions to address the imbalance between brain and non-brain regions and the introduction of the discriminator to further improve SS performance also contribute significantly to this achievement. In particular, on the three datasets with manual labels (i.e., excluding ADNI2 and OASIS), these two elements yielded average improvements of 1.42 and 0.71 points, respectively. These results show that PCSS suppresses overfitting to the GT of ADNI2 and contributes to making SS robust to unseen data from different environments. This suppression of overfitting also holds for posture correction. In addition, the PCSS results in Table 3 show that these elements perform better when combined.
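The idea behind weighting a loss to counter the imbalance between brain and non-brain voxels can be sketched with a class-weighted binary cross-entropy. This is only one common form and is an assumption here: the exact loss and weights used in PCSS are not stated in this section, and the function name, weights, and toy inputs below are hypothetical.

```python
import math


def weighted_bce(probs, labels, w_brain, w_background, eps=1e-7):
    """Class-weighted binary cross-entropy averaged over voxels.

    probs  : predicted brain probabilities in [0, 1]
    labels : 1 for brain voxels, 0 for non-brain voxels
    The weights are hypothetical; inverse class frequency is one common choice.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -w_brain * math.log(p)
        else:
            total += -w_background * math.log(1.0 - p)
    return total / len(labels)


# Toy patch: 1 brain voxel among 4, with an uninformative prediction of 0.5.
labels = [1, 0, 0, 0]
probs = [0.5, 0.5, 0.5, 0.5]

plain = weighted_bce(probs, labels, w_brain=1.0, w_background=1.0)
weighted = weighted_bce(probs, labels, w_brain=3.0, w_background=1.0)
print(f"unweighted={plain:.4f} weighted={weighted:.4f}")
```

With a brain weight of 3, the single brain voxel contributes as much to the loss as the three background voxels together, so errors on the minority class are no longer outvoted.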
PCSS-3, which uses an ensemble of three sections, improves the Dice score by only 0.36 points on average compared to PCSS-1. On the other hand, PCSS-3 takes about 2.5 times longer than PCSS-1 per SS case. For maximum performance, PCSS-3 is preferred. However, PCSS-1 also outperforms previous state-of-the-art SS methods thanks to the proposed posture correction. Therefore, PCSS-1 is generally preferable in practice because of its combination of speed and performance, which makes it suitable for high-throughput brain MRI analysis.
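One simple way to combine three per-section predictions is to average the voxel-wise probabilities and threshold the result, as sketched below. This is an assumption for illustration: the section does not specify how PCSS-3 fuses its three outputs, and the plane names, threshold, and probability values are hypothetical.

```python
def ensemble_mask(prob_maps, threshold=0.5):
    """Average per-voxel probabilities from several models, then threshold.

    prob_maps : list of equally long lists of brain probabilities,
                one list per model (flattened volumes).
    """
    n_models = len(prob_maps)
    n_voxels = len(prob_maps[0])
    if any(len(m) != n_voxels for m in prob_maps):
        raise ValueError("all probability maps must have the same length")
    mean = [sum(m[i] for m in prob_maps) / n_models for i in range(n_voxels)]
    return [1 if p >= threshold else 0 for p in mean]


# Hypothetical outputs of three single-section models for four voxels.
axial = [0.90, 0.40, 0.20, 0.70]
coronal = [0.80, 0.70, 0.10, 0.20]
sagittal = [0.95, 0.60, 0.30, 0.30]

mask = ensemble_mask([axial, coronal, sagittal])
print(mask)  # [1, 1, 0, 0]
```

The second and fourth voxels show the effect: a single model that disagrees with the other two is overruled by the average, which is one reason an ensemble can add accuracy at the cost of running several models per case.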
Limitation
The evaluation of PCSS in this study was limited to T1-weighted images. However, MRI produces multiple image types, including contrast-enhanced T1-weighted, T2-weighted, and FLAIR images. There is also an MRI technique called fat saturation, which suppresses the fat signal to show other tissues and structures more clearly. We plan to evaluate the performance and usefulness of PCSS for these different image types and acquisition techniques.
Conclusion
In this paper, we proposed and published posture correction skull stripping (PCSS), a highly accurate and robust skull stripping method for T1-weighted brain magnetic resonance imaging that accounts for the diversity of subjects’ postures. Using five publicly available datasets, we confirmed that PCSS outperforms existing state-of-the-art methods and verified the effectiveness of each of its technical components. This paper evaluates and discusses SS using larger and more diverse data than previous SS papers, thus setting a standard for future SS papers. We hope that our published PCSS will contribute to future research on brain MRI.
ACKNOWLEDGMENT
Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: https://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
The MRI data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research and Development LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Company Inc.; Meso Scale Diagnostics LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education and the study is coordinated by the Alzheimer’s Therapeutic Research Institute, University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging, University of Southern California.
Data were provided by OASIS Cross-Sectional: Principal Investigators: D. Marcus, R. Buckner, J. Csernansky, J. Morris; P50 AG05681, P01 AG03991, P01 AG026276, R01 AG021910, P20 MH071616, and U24 RR021382.