Introduction
Ultrasound (US) imaging is a routinely used modality to monitor fetal growth and development. Measurement of fetal brain structures, including the cerebellum, cerebrum, midbrain, and thalamus, on US images forms part of the fetal anomaly screening performed at 18–21 weeks of gestation. Studies have found that alterations in cerebellum development are linked to neurodevelopmental impairments involving general motor function and mental development, and to disorders such as autism [1]–[6]. The cerebellum is highly conserved in its developmental stages and clearly demarcated from surrounding brain structures, and hence is easy to evaluate on routine US images. This makes the cerebellum an important target structure for understanding neurodevelopmental outcomes and identifying perturbations in the antenatal period that affect its development.
Fig. 1 shows a representative trans-cerebellar plane of a US image. Current clinical practice for measuring the cerebellum from US images is based on manual or semi-automatic techniques. Manual measurement requires free-hand annotation by an experienced clinician, whereas semi-automatic techniques involve user input to fix the 'end points' of the cerebellum, which an automated algorithm then uses to produce measurements; semi-automatic results are often followed by manual corrections from an expert to rectify the external boundaries of the cerebellum. Manual measurement of the cerebellum depends on the sonographer's experience and is subject to inter- and intra-observer variability. Manual investigation of US images is also a time-consuming and tedious process, especially for larger sample sizes. Both approaches require considerable clinical expertise, as these assessments involve subtle measurements of the width of the cerebellum from US images whose appearance varies with the presentation of the fetus. In addition, US images inherently suffer from signal dropout, motion artifacts, and non-uniform contrast resolution. Hence, it is important to develop automated US image analysis methods that remove the need for user interaction and reduce inter-observer variability.
Representative trans-cerebellar plane of a US image. (a) A schematic showing the cross-sectional (line) view acquired by a US scanner. (b) Trans-cerebellar plane. (c) Cerebellum depicted in green.
There have been several attempts at automating cerebellum measurements, for which segmentation of the cerebellum region acts as a prior. In three-dimensional (3D) US volumes, Liu et al. [7] used a weighted Hough transform and a constrained randomized Hough transform to detect the fetal brain midline and the skull; this anatomical information was used to locate the cerebellum with a constrained probabilistic boosting tree. Yaqub et al. [8] used random decision forests to segment four fetal structures, including the cerebellum, and noted the need for shape constraints when segmenting regions that lack definite boundaries. Although 3D US is better at depicting fetal brain structures, 2D US remains the dominant imaging modality in clinical practice owing to its availability and the standardization of the clinical protocol [9]. With 2D US images, although certain algorithms have imposed specific shape constraints to segment the cerebellum [10], they incur a high computational load and do not provide satisfactory accuracy.
In recent years, segmentation algorithms based on convolutional neural networks (CNNs) have become the state of the art. In particular, U-Net, a feed-forward network with a U-shaped encoder-decoder architecture, has been successfully applied to medical image segmentation [11]. A U-Net-based CNN architecture was used to segment multiple brain structures, including the cerebellum, using weak labels in 3D US volumes [12]. For 2D US images, Ramos et al. [13] applied the YOLO (You Only Look Once) architecture to detect the cerebellum region with bounding boxes. However, this method only localizes the cerebellum and was evaluated on a small sample of images. Semantic segmentation, the process of assigning each pixel in an image to a class label (here, cerebellum or background), remains the preferred prerequisite for automated fetal biometry.
In this study, we present a semantic segmentation technique based on a deep CNN for automatic segmentation of the cerebellum in US images. Our segmentation method, ResU-Net-c, builds on the U-Net [11] architecture and introduces residual (Res) blocks with dilated convolution units in the last two layers. Both the Res and dilated convolution modules are designed to retain the spatial resolution of the US images, allowing the subtle structure of the fetal cerebellum (c) to be segmented from noisy US images.
The contributions of our work are as follows:
This is the first study to report on the semantic segmentation of the cerebellum structure in 2D fetal US images.
ResU-Net-c is an efficient semantic segmentation network that leverages the strengths of both deep residual learning and the U-Net architecture. The dilated convolutional layers increase the receptive field of the network without losing spatial resolution. This module enables the network to learn both low- and high-level image features to aid in the segmentation of subtle structures in the fetal brain, thereby overcoming the inherent limitations of US image characteristics in the segmentation task.
We benchmarked our algorithm in comparison to the state-of-the-art on a large image dataset.
Materials
The images were acquired using a GE Voluson E8 US machine at the Athena Diagnostics Imaging Centre, Chennai, India. All images were de-identified. The images were obtained using a standard-of-care clinical protocol that is part of routine clinical care; experienced, qualified specialist radiologists followed the protocol established by the International Society of Ultrasound in Obstetrics & Gynaecology Education Committee (2007) for imaging fetal brain structures during the morphology scan at 18–20 weeks. We obtained a dataset of 734 2D US images of the fetal trans-cerebellar plane.
Our experiments used 5-fold cross-validation: of the 734 images, each fold used 588 for training and 146 for testing. The images were obtained in TIFF format. We cropped the images to
Methods
A. U-Net
U-Net is a deep CNN proposed by Ronneberger et al. [11] for semantic segmentation of biomedical images. It aims to label each pixel in the image with a corresponding class. U-Net improves upon the fully convolutional network architecture [14] by expanding the capacity of the decoder module. The U-Net architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. In the proposed method, U-Net serves as the backbone of the architecture, with encoder and decoder paths.
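To make this pattern concrete, the following minimal PyTorch sketch (not the authors' implementation; the depth and channel counts are hypothetical) shows a two-level U-Net-style network with a contracting path, an expanding path, and skip connections that concatenate encoder features into the decoder:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU, the basic U-Net building unit."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(1, 32)               # contracting path
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec2 = conv_block(128, 64)             # expanding path
        self.up1 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, 1, kernel_size=1)  # 1x1 prediction layer

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        # Skip connections: concatenate encoder features with decoder features
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.sigmoid(self.head(d1))
```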
B. Residual Networks
Adding deeper layers to a CNN allows it to progressively learn more complex features, which can be beneficial for discriminatively learning the complex image features of the cerebellum. However, deeper networks can exhibit higher training and test error. He et al. [15] showed that the more complex functions created by additional layers can cause overfitting, which may explain the failure of deeper networks compared to shallow networks. Overfitting can be suppressed with additional algorithms and regularization parameters. However, the failure of deeper networks is also attributed to the vanishing gradient problem arising from the vast exploration of the feature space. This makes a deep network prone to perturbations that may cause it to leave the manifold, requiring additional labeled training data to recover, which is difficult to obtain in medical imaging.
The problem of training very deep networks has been alleviated by residual neural networks, which use skip connections to jump over some layers. ResNet models are typically implemented with double- or triple-layer skips; these layers usually contain rectified linear unit (ReLU) activations and batch normalization. The motivation for skipping layers is to avoid vanishing gradients by reusing activations from earlier layers. Moreover, skipping effectively simplifies the network by using fewer layers in the early training stages and avoids the need for large training data. During training, the weights adapt to mute the upstream layer and amplify the previously skipped layer. As the network learns the feature space, it gradually restores the skipped layers; when all layers are expanded at the end of training, the network stays closer to the manifold, resulting in faster learning. Skip connections have been shown to increase performance in several image recognition tasks with deep networks. In our work, we adapt skip connections to improve segmentation performance.
A basic residual block is shown in Fig. 2. The identity mapping has no parameters and simply adds the output of a previous layer to that of the next layer. When the dimensions of x and F(x) differ, a linear projection W_s is applied to the shortcut to match them: \begin{equation*} y = F(x,\{W_{i}\})+W_{s}x \end{equation*}
The function F(x, {W_i}) represents the residual mapping to be learned.
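As a minimal sketch of this equation (not the authors' implementation), a residual block can be written in PyTorch as follows, with a 1x1 convolution standing in for the projection W_s when the dimensions change:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # F(x, {W_i}): two conv layers with batch normalization and ReLU
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # W_s: projection shortcut, needed only when dimensions change;
        # otherwise the parameter-free identity mapping is used
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # y = F(x, {W_i}) + W_s * x
        return self.relu(self.body(x) + self.shortcut(x))
```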
C. ResU-Net-c Architecture
The architecture of ResU-Net-c is presented in Fig. 3(a). It takes an input of size
The proposed ResU-Net-c. (a) The network architecture. (b) Architecture of a single layer with residual connections. (c) Representation of dilated convolution.
Two types of convolution operations (
Dilated convolutions with rates 2 and 4 are applied at layers 5 and 6, respectively. Dilated convolution broadens the receptive field for a given filter size while preserving the full spatial dimension. Fig. 3(c) shows dilated convolution kernels spaced according to the specified dilation rate. In the proposed ResU-Net-c architecture, the last two pooling layers are replaced with dilated convolutions to maintain the field of view while preventing the loss of spatial information. ReLU is used as the activation function in all layers to introduce non-linearity into the network. Max pooling with a stride of 2 pixels is used in layers 1–3 of the encoder. This down-samples along the spatial dimensions and reduces the number of parameters and computations in the network, helping to avoid overfitting; it also provides a degree of scale invariance in the representation of the input image. Similarly, we used up-sampling layers in the decoder to increase the spatial resolution of the feature maps. The max pooling and up-sampling layers do not have a skip connection between input and output; their outputs are fed to the corresponding encoder and decoder blocks. The prediction layer consists of a single convolution layer with kernel size
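The following short PyTorch sketch, with hypothetical channel counts, illustrates the point: a 3x3 convolution with padding equal to its dilation rate preserves the spatial size while enlarging the receptive field, whereas max pooling halves it:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)           # batch, channels, H, W

pooled = nn.MaxPool2d(2)(x)              # halves the spatial size
dil2 = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)(x)
dil4 = nn.Conv2d(64, 64, kernel_size=3, dilation=4, padding=4)(x)

print(pooled.shape)  # torch.Size([1, 64, 16, 16]) -- resolution lost
print(dil2.shape)    # torch.Size([1, 64, 32, 32]) -- resolution kept
print(dil4.shape)    # torch.Size([1, 64, 32, 32]) -- larger receptive field
```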
D. Loss Functions
A loss function measures, and training minimizes, the difference between the predicted binary output and the binary ground truth. We used the Dice Loss (DL) [16], a combination of DL and Binary Cross-Entropy (BCE) [17], and the Focal Tversky Loss (FTL) [18]. DL was selected to measure the overlap of the segmented regions and BCE to quantify the pixel-wise agreement between the output and the ground truth. FTL was used to address class imbalance when training segmentation models.
1) Dice Loss (DL)
The Dice Score Coefficient (DSC) is an overlap index widely used to assess segmentation performance in medical imaging. The two-class DSC variant for class c is given as \begin{equation*} DSC_{c} = \frac {\sum _{i=1}^{N} p_{ic}g_{ic} + k}{\sum _{i=1}^{N} p_{ic} + g_{ic} + k} \end{equation*} where p_ic is the predicted probability that pixel i belongs to class c, g_ic is the corresponding ground truth label, N is the total number of pixels, and k is a small constant that provides numerical stability.
The DL for class c is then defined as \begin{equation*} DL_{c} = \sum _{c} (1- DSC_{c}) \end{equation*}
2) Combo Loss (CL)
We used a combination of BCE loss and DL [17]. For two-class problems, the BCE loss function can be expressed as \begin{equation*} BCE(g,p) = -\frac{1}{N}\sum _{i=1}^{N} [g_{i}\log(p_{i}) + (1 - g_{i})\log(1 - p_{i})] \end{equation*}
The CL is computed as \begin{equation*} CL = 0.5 \cdot BCE + DL \end{equation*}
The CL is parameterized by an individual weight factor on the BCE term, set to 0.5 in the equation above.
3) Focal Tversky Loss (FTL)
With highly imbalanced data and a small foreground area, false negative (FN) detections need to be weighted more heavily than false positives (FP) to improve the recall rate. The Tversky similarity index (TI) [18] is a generalization of the DSC that allows flexibility in balancing FPs and FNs: \begin{equation*} TI_{c} = \frac {\sum _{i=1}^{N} p_{ic}g_{ic} + k}{\sum _{i=1}^{N} p_{ic}g_{ic} + \alpha \sum _{i=1}^{N} p_{i\overline {c}}g_{ic}+\beta \sum _{i=1}^{N}p_{ic}g_{i\overline {c}} +k} \end{equation*}
Here, α and β control the relative penalties for FNs and FPs, respectively, and the index c̄ denotes the complement of class c (non-cerebellum).
The FTL is obtained by raising the complement of the TI to the power 1/γ, where γ controls the degree of focus on hard examples: \begin{equation*} FTL_{c} = \sum _{c} (1- TI_{c})^{1/\gamma } \end{equation*}
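A consolidated sketch of the three losses in PyTorch, following the equations above, could look as follows; the smoothing constant k and the hyper-parameter values for α, β, and γ are illustrative defaults, not the paper's reported settings:

```python
import torch
import torch.nn.functional as F

def dice_loss(p, g, k=1.0):
    """DL = 1 - DSC, using the two-class DSC variant given above."""
    inter = (p * g).sum()
    return 1.0 - (inter + k) / (p.sum() + g.sum() + k)

def combo_loss(p, g):
    """CL = 0.5 * BCE + DL."""
    bce = F.binary_cross_entropy(p, g)
    return 0.5 * bce + dice_loss(p, g)

def focal_tversky_loss(p, g, alpha=0.7, beta=0.3, gamma=4/3, k=1.0):
    """FTL = (1 - TI)^(1/gamma); alpha weights FNs, beta weights FPs."""
    tp = (p * g).sum()
    fn = ((1.0 - p) * g).sum()        # predicted not-c, ground truth c
    fp = (p * (1.0 - g)).sum()        # predicted c, ground truth not-c
    ti = (tp + k) / (tp + alpha * fn + beta * fp + k)
    return (1.0 - ti) ** (1.0 / gamma)

# Usage on a predicted probability map and a binary ground-truth mask:
p = torch.rand(1, 1, 64, 64)
g = (torch.rand(1, 1, 64, 64) > 0.5).float()
print(dice_loss(p, g), combo_loss(p, g), focal_tversky_loss(p, g))
```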
E. Training Parameters
The Adam optimizer was used in all models, as it outperformed the other optimization algorithms we tried in all training experiments. The model parameters were initialized with random weights. The learning rate was set to
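A self-contained sketch of this optimizer setup is given below; the learning rate is a placeholder (the paper's value is not reproduced here), and the tiny stand-in model and synthetic data are purely illustrative:

```python
import torch

model = torch.nn.Sequential(  # stand-in for ResU-Net-c, purely illustrative
    torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(8, 1, 3, padding=1), torch.nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder lr
bce = torch.nn.BCELoss()

for step in range(10):                 # synthetic batches for illustration
    x = torch.rand(4, 1, 64, 64)
    y = (torch.rand(4, 1, 64, 64) > 0.5).float()
    optimizer.zero_grad()
    loss = bce(model(x), y)
    loss.backward()
    optimizer.step()
```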
F. Evaluation Metrics
Evaluation metrics play an important role in assessing the outcomes of segmentation models. In this work, we analysed our results using the DSC, Hausdorff distance (HD), precision, and recall.
1) Hausdorff Distance (HD)
HD [19] is a measure of segmentation error that quantifies the degree of closeness between two boundaries. It is computed between the boundaries of the predicted (X) and ground truth (Y) segmentations as \begin{equation*}HD(X,Y) = \max(hd(X,Y),hd(Y,X))\end{equation*} where the directed Hausdorff distance is \begin{equation*}hd(X,Y) = \max_{x \in X}\min_{y \in Y}\|x-y\|\end{equation*}
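One way to compute this metric in Python, assuming SciPy is available and using a simple erosion-based boundary extraction (not necessarily the paper's procedure), is:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from scipy.ndimage import binary_erosion

def boundary_points(mask):
    """Return (row, col) coordinates of a boolean mask's boundary pixels."""
    boundary = mask & ~binary_erosion(mask)
    return np.argwhere(boundary)

def hausdorff(pred, gt):
    """HD(X, Y) = max(hd(X, Y), hd(Y, X)) over boundary point sets."""
    x, y = boundary_points(pred), boundary_points(gt)
    return max(directed_hausdorff(x, y)[0], directed_hausdorff(y, x)[0])
```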
2) Precision and Recall
Precision and recall measure segmentation performance in terms of over- and under-segmentation: a low precision score suggests over-segmentation, while a low recall score suggests under-segmentation. A true positive (TP) is a pixel correctly predicted as belonging to the ground truth region, and a true negative (TN) is a pixel correctly predicted as not belonging to it. A false positive (FP) is a pixel incorrectly predicted as belonging to the ground truth region, and a false negative (FN) is a ground truth pixel that the prediction misses.
Precision is defined as \begin{equation*}Precision = \frac {TP}{TP+FP}\end{equation*} and recall as \begin{equation*}Recall = \frac {TP}{TP+FN}\end{equation*}
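These pixel-wise definitions translate directly into a small NumPy sketch:

```python
import numpy as np

def precision_recall(pred, gt):
    """pred, gt: boolean arrays of the same shape."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```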
We compared our ResU-Net-c with well-established segmentation methods that use U-Net as their base structure, namely U-Net [11], Attention U-Net [20], and U-Net++ [17], as well as the proposed ResU-Net without dilation layers and without residual blocks. To rank the models, we used the DSC as the primary evaluation metric.
Results
Table 3 outlines the segmentation performance (for a single fold) of our ResU-Net-c against the comparison methods. The results of 5-fold cross-validated training using the DSC loss are shown in Table 4; ResU-Net-c achieved the highest mean DSC of 86.70%. Fig. 4 illustrates visual segmentation results of the proposed and comparison methods, with the delineated cerebellum overlaid on the original US images. Fig. 5 provides samples of false detections alongside the original US images with ground truth labels. This localization ability can aid visualization of the fetal brain structures. Fig. 6 compares the optimization learning curves of the proposed ResU-Net-c and the comparison methods.
Segmentation results. The rows depict the raw images followed by the segmentation results, and the columns are different US image samples. The ground truth labels are in green and the automated results are labeled in red.
Samples of false predictions. The columns depict the raw images followed by the segmentation results, and the rows are different US image samples. The ground truth labels are in green and the automated results are labeled in red.
Discussion
Table 3 shows that the proposed ResU-Net-c outperforms the comparison methods. This demonstrates the effectiveness of adding residual blocks to the U-Net architecture for US images. With the residual block, low-level features from earlier layers are directly combined with high-level features from deeper layers, which promotes the use of highly efficient features in cerebellum segmentation.
Loss functions played an essential role in determining network performance. The performance metrics for all comparison methods with DL, CL, and FTL are shown in Table 3. DL gave the best results for ResU-Net-c, whereas the combination of DL and BCE performed best among the comparison methods. We investigated the CL of DL and BCE to see whether it changes the training process, and observed that, by leveraging the DL [21], the CL enforces a desired trade-off between false positives and false negatives and avoids getting stuck in bad local minima. The CL also converges considerably faster than BCE during training. FTL has been shown to work well on highly imbalanced datasets, but in our experiments the standard loss functions optimized better. We attribute this to FTL not capturing the uncertainties at boundaries, which in practice yields segmentation maps with high precision but low recall. DL, in contrast, weighs FP and FN detections equally and showed better performance with the proposed method. Thus, a loss function's performance depends on the characteristics of the training images, such as class skewness and inconsistent boundaries.
Table 4 shows that the proposed method significantly outperformed the comparison methods, with a higher DSC. The high accuracy of our method confirms the benefit of residual blocks and dilated convolutions, making it versatile for segmenting the cerebellum in US images.
Our experiments showed that our method segmented the cerebellum more accurately than the comparison methods (see Fig. 4). The contours obtained from the U-Net, U-Net++, and Attention U-Net models did not depict the exact region of the cerebellum because of its weak edges; they were plagued by contour leaks due to the lack of definite boundaries and did not cover the exact region of interest (see Fig. 4(d), 4th column). Applying the proposed method without the dilation layers produced segmented contours that collapsed inside the cerebellum; the boundaries were not captured consistently due to their poor contrast (see Fig. 4(e)). Similar findings in the other comparison methods suggest that dilated convolution is important for cerebellum segmentation. Across all samples, the proposed ResU-Net-c gave the best visual results.
These visual outcomes in Fig. 4 and Fig. 5 correspond to our quantitative findings in Table 3. ResU-Net-c achieved the highest DSC with the DL function. The low precision of the comparison methods indicates a large number of non-cerebellum pixels within the segmented contour, whereas the higher precision of ResU-Net-c confirms fewer FPs in the predicted cerebellum. The superiority of the proposed algorithm over the comparison methods is statistically significant, as demonstrated by comparing the HD values of the proposed method with those of the comparison methods using a two-tailed paired t-test (p < 0.001).
We observed in Fig. 5 that segmentation performance was poor across all methods for images in which the boundary of the fetal head is not fully visible or is discontinuous. We attribute this low performance in part to the poor visualization of the skull structure in the fetal head. We suggest that such images may not be appropriate for automated processing of retrospective US images and should first undergo image quality assessment.
The performance of ResU-Net-c (w/o Res block) and ResU-Net-c (w/o dilation) was higher than that of U-Net but lower than that of the proposed method. We attribute the increased performance of our method to the enriched semantic features learned through the skip connections in the residual blocks. The residual operations significantly improve training and testing performance without increasing the number of network parameters [22].
The ResU-Net-c (w/o dilation) results were lower across all measures, showing the importance of the dilation layers in the proposed method. Dilated convolution filters increase the receptive field of a CNN without introducing additional parameters, which helps avoid overfitting during training. The dilated convolutions in our method support an exponential expansion of the receptive field without loss of spatial resolution [23].
Fig. 6 shows that the proposed method has a steeper learning curve than the other methods. Together, these quantitative findings demonstrate the accuracy, robustness, and reliability of the proposed method for cerebellum segmentation.
With the use of dilation layers and residual blocks, the number of parameters needed to train the model is reduced: ResU-Net-c has 17,640,722 trainable parameters, as shown in Table 2, although U-Net++ has fewer. Likewise, ResU-Net-c needed fewer epochs to converge. The average processing times to test an image with our method and the comparison methods are shown in Table 3; our method was faster than U-Net but slower than Attention U-Net. Optimization of running time was outside the original scope of this research and is left for future work.
Manual cerebellum measurements are easy to perform and semi-automatic techniques are fast; however, both rely on user input, which compromises their robustness and consistency. Furthermore, manual approaches are time-consuming when a large number of images must be processed. Our automated segmentation method enables retrospective studies involving large numbers of US images. Our future work will incorporate image quality assessment as a prior step to automated fetal US image segmentation.
Future work will also include extending our algorithm to prospective cerebellum measurements and quantification. We suggest that our method can help decrease operator dependency in clinical applications for the assessment of fetal health, thereby increasing robustness and reproducibility.
Conclusion
We proposed a new semantic segmentation method for segmenting the cerebellum from 2D US images, using a U-Net architecture combined with residual blocks. All comparison models were evaluated using three loss functions: DL, FTL, and CL. The experimental results demonstrate that the proposed ResU-Net-c model outperformed existing methods in the segmentation task. The method can be extended to semantic segmentation of other fetal brain structures with biometric measurements.
ACKNOWLEDGMENT
(Vishal Singh and Pradeeba Sridar contributed equally to this work.)