FetSAM: Advanced Segmentation Techniques for Fetal Head Biometrics in Ultrasound Imagery

Goal: FetSAM represents a cutting-edge deep learning model aimed at revolutionizing fetal head ultrasound segmentation, thereby elevating prenatal diagnostic precision. Methods: Utilizing a comprehensive dataset–the largest to date for fetal head metrics–FetSAM incorporates prompt-based learning. It distinguishes itself with a dual loss mechanism, combining Weighted DiceLoss and Weighted Lovasz Loss, optimized through AdamW and underscored by class weight adjustments for better segmentation balance. Performance benchmarks against prominent models such as U-Net, DeepLabV3, and Segformer highlight its efficacy. Results: FetSAM delivers unparalleled segmentation accuracy, demonstrated by a DSC of 0.90117, HD of 1.86484, and ASD of 0.46645. Conclusion: FetSAM sets a new benchmark in AI-enhanced prenatal ultrasound analysis, providing a robust, precise tool for clinical applications and pushing the envelope of prenatal care with its groundbreaking dataset and segmentation capabilities.


I. INTRODUCTION
In the transformative realm of biomedical imaging, fetal ultrasound serves as a critical juncture between technology and healthcare [1]. It has revolutionized prenatal care by providing a non-invasive window into the womb, enabling real-time monitoring of fetal growth and development [2]. The significance of standardizing quantification and achieving precise fetal head segmentation goes beyond technicality; it serves as a beacon for advancing prenatal care [3]. Reducing variability in data collection is crucial for reliable monitoring across multiple scans. Furthermore, the cost-effectiveness and widespread accessibility of ultrasound imaging make it a powerful tool for global healthcare improvement [4]. Future integration of AI-enabled tissue segmentation with ultrasound imaging could catalyze earlier interventions, safer deliveries, and healthier beginnings, positioning this research as an essential cornerstone in prenatal care [5].
Accurate estimation of fetal age and weight is pivotal in monitoring standard prenatal development [1]. Ultrasound imaging facilitates the non-invasive observation of the fetus, encompassing the brain structures. Modern advancements in image processing and segmentation methods have paved the way for the automatic delineation and measurement of essential fetal brain regions from ultrasound images [6]. In particular, segmenting the fetal brain, cavum septum pellucidum (CSP), and lateral ventricles (LV) from ultrasound images can offer indispensable biomarkers for gestational age and fetal growth models [1]. The CSP is a crucial midline fluid-filled cavity, the size of which fluctuates significantly during gestation [7], while the size of the LV increases with progressing gestational age [8]. Thus, a quantitative analysis of the CSP and LV dimensions derived from segmented ultrasound images can offer profound insights into the developmental milestones of the fetus. Moreover, the total volume of the brain tissue, determined from the automatic segmentation of the brain, has exhibited a strong correlation with the fetal weight [9]. By combining the measures from the multiple segmented brain structures, robust machine learning models can be devised to determine fetal age and weight from ultrasound images with automation and efficiency [10]. The automation of this analytical procedure has the potential to enhance prenatal risk assessments and improve outcomes by promoting more comprehensive and quantitative surveillance of fetal development [11].
The precise segmentation of fetal head ultrasound images into four paramount categories (background, fetal brain, CSP, and LV) is at the heart of this study. Each category plays a pivotal role in deriving essential biometrics. Automated tools have made it possible to segment vital anatomical structures, such as the fetal skull and brain, paving the way for the accurate and efficient determination of crucial biometric measures, including the head circumference [12], [13]. Refer to Fig. 1 for an illustration that depicts the locations of the Fetal Brain, CSP, and LV in the ultrasound image.
The background class, essentially all ultrasound data outside the fetal skull, allows for the automatic cropping of images to focus on the region of interest, namely the fetal head [14]. The fetal brain class encapsulates all brain tissue within the skull. Accurate delineation of this class enables the measurement of head circumference, a critical indicator of fetal growth and developmental milestones [6]. Further, segmenting internal structures like the CSP and LV provides deeper insights into fetal neurodevelopment [3]. The CSP, a naturally occurring division between the left and right hemispheres, can serve as an early indicator of potential brain abnormalities [15]. The LV, on the other hand, are fluid-filled cavities related to conditions such as ventriculomegaly and neural tube defects [16]. Accurate segmentation of these structures is critical for monitoring fetal brain development, thereby enabling better-informed parental decisions and preparing the clinical team for post-birth interventions.
One of the most salient challenges in fetal ultrasound imaging is the quality of the images themselves [17]. Ultrasound images often suffer from speckle noise, low contrast, and ambiguous boundaries, making it difficult to segment anatomical structures like the brain, CSP, and LV [12], [18]. Wu et al. [19] highlight the importance of CSP, an essential structure for normal fetal brain development, and discuss the challenges of manual measurement. They propose a data-driven system using a novel network called CA-Unet to segment CSP and other related structures, achieving high precision and Dice scores. Similarly, Coronado-Gutiérrez et al. [20] introduce a pipeline to automatically delineate and measure fetal brain structures, offering a comprehensive solution for assessing fetal brain development and anomalies. Alzubaidi et al. [21] introduced the ETLM framework, which leverages ensemble learning for fetal brain segmentation and employs regression models to estimate fetal age and weight based on biometric measurements.
The CSP and LV are particularly challenging due to their small sizes and low visibility, often occupying only a few pixels in 2D ultrasound slices [13]. The fetal brain undergoes rapid development, complicating the segmentation task further due to fluctuations in visibility at different gestational stages [18], [22].
Classical models like active contours and atlas-based approaches have been widely used but often struggle with the aforementioned challenges. For instance, active contours often fail to converge to the correct boundaries [18]. The advent of deep learning has brought a revolution in this domain. U-Net architectures, for example, have shown promise but are also subject to limitations like overfitting and lack of generalizability [18].
In addressing the challenges of fetal brain segmentation, we explore a spectrum of methodologies that benefit significantly from advancements in 2D fetal ultrasound neuroimaging. Recent studies have demonstrated notable success in enhancing segmentation accuracy and reliability.
Zeng et al. [23] introduce a deep learning method specifically tailored for fetal ultrasound image segmentation, achieving remarkable precision in head circumference biometry. Their approach, incorporating attention-gated modules into a V-Net model, underscores the effectiveness of deep supervision and attention mechanisms in focusing on relevant features, thus improving segmentation accuracy.
Similarly, Zhao et al. [24] present TransFSM, a hybrid deep learning framework for automated segmentation and biometric measurement, which employs a convolutional neural network encoder alongside a global transformer module. This innovative combination facilitates the learning of long-range dependencies, enhancing the model's ability to segment multiple fetal anatomies accurately from 2D ultrasound images.
In addition, Wu et al. [19] detail a novel approach for the automatic segmentation and measurement of the CSP using a U-Net-based network augmented with a channel attention module. Their results indicate superior performance in CSP segmentation, contributing valuable insights into fetal neural development assessment.
Drawing on the latest advancements in 2D fetal ultrasound imaging, our study methodically integrates advanced segmentation techniques to address critical challenges within the field. Through the strategic implementation of class weighting, bespoke loss functions, and comprehensive data augmentation, we precisely target dataset imbalances, with a particular focus on the LV class. This multifaceted approach not only capitalizes on foundational research to enhance segmentation accuracy and expand its generalizability but also highlights areas ripe for further exploration, as evidenced by our targeted ablation studies. Our commitment to refining fetal ultrasound segmentation is clear, marking a significant step forward in prenatal diagnostic technologies.
In the realm of fetal brain segmentation, the scarcity of annotated data poses a significant challenge, prompting a shift towards innovative approaches such as weakly supervised and semi-supervised methods. These methods, exemplified by the work of Zheng et al. [25] with their 'shadow confidence maps' and the use of attention mechanisms and RNNs by Yang et al. [26], represent a broader endeavor to refine segmentation accuracy under data constraints. Similarly, hybrid models like those developed by Chen et al. [27] and Yang et al. [28] blend traditional segmentation techniques with deep learning advancements to overcome these limitations.
Despite these efforts, as evidenced by the achievements in Dice scores for CSP and LV segmentation reported by Huang et al. [18], the quest for highly accurate, robust, and generalizable segmentation methods remains ongoing. This backdrop of continuous exploration and innovation in addressing data scarcity and enhancing segmentation precision frames the context for our study. Our work contributes to this dynamic field by introducing FetSAM, a model designed to advance the state-of-the-art in fetal head ultrasound image segmentation. By situating our contributions within this evolving landscape, we aim to underscore the significance of our advancements and the potential for further research to navigate the complexities of fetal brain, CSP, and LV segmentation.
In addressing the significant challenges of fetal brain, CSP, and LV segmentation, such as dataset scarcity, image noise, and the minute size of anatomical structures, our methodology is deliberately designed for precision. By constructing a custom dataset that includes detailed structures of CSP and LV, we tackle the issue of data scarcity head-on. To mitigate inherent noise and enhance segmentation precision, we have implemented an extensive data augmentation strategy, applying ten different techniques that simulate real-world imaging challenges, thus preparing our model for a wide range of imaging conditions. Recognizing the challenges posed by the small size and class imbalance of CSP and LV, we developed a combined loss function that integrates Weighted Dice Loss and Weighted Lovasz Loss. This approach ensures that these critical, yet smaller, structures receive adequate attention during the learning process, significantly boosting segmentation accuracy. Additionally, our use of prompt-based segmentation acts as a magnifying glass, bringing regions of interest into clearer focus, which is crucial for the detailed segmentation of CSP and LV, thereby enhancing the accuracy of biometric measurements. Our comprehensive methodology has been rigorously validated against ten state-of-the-art segmentation models, including those based on modern mixed transformer architectures, ensuring our approach's transparency and superiority in addressing these unique challenges. Guided by these challenges, we have developed a comprehensive approach, outlined below, to not only address these specific issues head-on but also to set new benchmarks in the accuracy and reliability of fetal brain, CSP, and LV segmentation.

II. MATERIALS AND METHODS
Fig. 2 illustrates our optimized pipeline for comparing FetSAM with state-of-the-art models in multi-class segmentation of fetal head ultrasound imaging. The pipeline encompasses five critical stages: Dataset Split and Consistency, Data Augmentation Strategies, Class Weight Calculation and Custom Loss, Model Fine-Tuning and Optimization Functions, and Inference and Evaluation. Starting with our unique fetal head ultrasound dataset, the pipeline integrates advanced techniques such as extensive data augmentation, class weight calculation based on inverse frequency, custom loss mechanisms, and fine-tuning optimization techniques. The pipeline culminates in a comparative analysis of FetSAM against established segmentation models like U-Net, DeepLabV3, and Segformer, assessed using a variety of metrics including DSC and HD. The following subsections will delve into each stage of this pipeline, from dataset description and augmentation strategies to model architecture and evaluation metrics.

A. Dataset Description and Split
Our research utilizes a comprehensive dataset compiled from two public repositories, featuring four distinct ultrasound fetal head planes: Trans thalamic, Trans ventricular, Trans cerebellum, and Diverse Fetal Head images. Comprising 3,832 high-resolution images, this dataset emphasizes critical fetal brain structures, including the CSP and LV. To facilitate broad computational tool compatibility, data is available in 11 widely accepted formats, verified by a rigorous dual-stage process involving multiple domain experts.
For experimental consistency, a custom script was used to split the dataset across the different ultrasound planes, allocating 60% (2,299 images) for training. The remaining 40% was divided equally between a validation set (20%, 766 images) and an untouched test set (20%, 767 images), reserved for future clinical validation. This test set remains unused in the current study to ensure the model's future clinical applicability and effectiveness are assessed independently. The validation set includes a balanced class distribution of background, brain, CSP, and LV instances, offering a solid basis for model evaluation. This distribution and strategic data split underscore our commitment to a methodical and future-oriented approach to research validation. For further details on this dataset, readers are referred to the original publication by Alzubaidi et al. [29].
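A minimal sketch of a plane-wise 60/20/20 split is given below. The splitting script itself is not described here, so the function name `split_by_plane`, the `(image_id, plane)` input format, the fixed seed, and the per-plane rounding are all assumptions made for illustration; per-plane rounding will not necessarily reproduce the exact 2,299/766/767 counts reported above.

```python
import random
from collections import defaultdict


def split_by_plane(items, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split (image_id, plane) pairs into train/val/test within each plane.

    Hypothetical helper: splitting inside each plane keeps the plane
    proportions similar across the three subsets.
    """
    by_plane = defaultdict(list)
    for image_id, plane in items:
        by_plane[plane].append(image_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for plane in sorted(by_plane):
        ids = sorted(by_plane[plane])
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test
    return splits
```

Keeping the test identifiers in a separate list makes it straightforward to leave that subset untouched until a later validation stage.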

B. Data Augmentation Strategies
To bolster the robustness of our model and effectively utilize our training dataset, we employed a comprehensive data augmentation strategy. While the original training dataset consisted of 2,299 images, these augmentation techniques expanded it to a total of 20,691 images (2,299 × 9, one image per transformation listed below). Importantly, the validation dataset was left unaugmented to ensure an unbiased evaluation of the model's performance.
Our data augmentation pipeline comprised a series of transformations, each designed to simulate real-world variations that the model might encounter. Below are the transformations applied:
1) Original Image Resize (Rz): The images were resized to 256 × 256 pixels. No other transformations were applied to maintain the originality of the dataset.
2) Random Crop (RC+Rz): Random crops of 128 × 128 pixels were taken from the resized images, which were then resized back to 256 × 256 pixels.
3) Vertical Flip (VF+Rz): A vertical flip was applied with a probability of 0.3.
4) Horizontal Flip (HF+Rz): A horizontal flip was applied with a probability of 0.3.
5) Rotation (ROT+Rz): The images were rotated within a limit of 30 degrees, applied with a probability of 0.3.
6) Combined Transformations (CMB): A chain of transformations, including random crop, vertical flip, horizontal flip, and rotation, was applied.
7) Random Resized Crop (RRC+Rz): The images were randomly resized within a scale of 1.2 to 1.4, and then cropped to 256 × 256 pixels.
8) Padding and Random Crop (PAD+RC+Rz): Padding was applied to increase the image size, followed by a random crop to bring it back to 256 × 256 pixels.
9) Advanced Transformations (RB+ET+GN+GB): This set of sophisticated techniques was designed to simulate complex variations and artifacts commonly found in ultrasound images. All transformations were applied with a probability of 0.5 and include:
• Brightness and Contrast Adjustments (RB): Random adjustments simulate variations in lighting and visibility.
• Elastic Transformations (ET): Elastic deformations with an alpha parameter of 1 and a sigma parameter of 50 mimic natural variations in tissue elasticity.
• Gaussian Noise (GN): Noise with a variance limit ranging from 10.0 to 50.0 simulates the speckle noise commonly found in ultrasound images.
• Gaussian Blur (GB): A blur with a limit ranging from 3 to 7 simulates the effects of blurring due to factors like patient movement or low-quality imaging devices.
These advanced transformations aim to train the model to be more resilient to real-world variations, thereby enhancing its generalizability and robustness. All transformations were implemented using the Albumentations library for consistency.

C. Class Weight Calculation and Custom Loss Mechanism

1) Class Weight Calculation:
To address the challenges posed by our dataset, we opted for a specific type of class weighting for multiple reasons. First, our dataset is imbalanced; the number of instances for the "background" and "fetal brain" classes significantly outnumbers those for the "CSP" and "LV" classes. Second, the geometric characteristics of the classes also vary; while classes like "CSP" and "LV" are represented by small rectangles, the "fetal brain" class is represented by a larger ellipse. This class-weighting approach aims to compensate for these disparities, enhancing the model's ability to accurately segment all classes.
7) Return Weights: Finally, we return the normalized class weights as a list and optionally as a dictionary mapped to class names.
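The inverse-frequency weighting mentioned in the pipeline overview, together with the final return step above, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the function name `inverse_frequency_weights`, the use of per-class pixel counts as the frequency measure, and normalization so that the weights sum to one are all hypothetical choices.

```python
def inverse_frequency_weights(pixel_counts, class_names=None):
    """Return normalized inverse-frequency weights as a list and, optionally, a dict.

    pixel_counts: number of pixels (or instances) observed per class.
    Assumption: weights are normalized to sum to 1.
    """
    total = sum(pixel_counts)
    if total == 0:
        raise ValueError("pixel_counts must contain at least one non-zero entry")

    # Rare classes get large raw weights; classes never observed get weight 0.
    raw = [total / count if count > 0 else 0.0 for count in pixel_counts]
    norm = sum(raw)
    weights = [value / norm for value in raw]

    named = dict(zip(class_names, weights)) if class_names else None
    return weights, named
```

With counts dominated by background and brain pixels, the CSP and LV entries receive the largest weights, which is the behavior the class-weighting step is meant to produce.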

2) Custom Loss Mechanism:
To address the issue of class imbalance and the distinct geometric characteristics of various classes in our dataset, we have employed the class weights derived from the data analysis phase to construct a bespoke loss function. This function intensifies the model's focus on the smaller and less prevalent classes like CSP and LV. Specifically, with class weights like [0.1, 0.1, 0.9, 0.7], our custom loss function will allocate more attention to classes with larger weights, ensuring that difficult-to-segment classes are prioritized during training. Selecting the most effective loss functions was crucial for the optimization of our model. We assessed several loss functions such as JaccardLoss, DiceLoss, TverskyLoss, FocalLoss, and LovaszLoss, alongside their possible combinations. Through thorough manual hyperparameter tuning and empirical evaluation, we identified that a combination of Weighted Dice Loss and Weighted Lovasz Loss yielded the most promising segmentation results.
In the Weighted Dice Loss L_Dice, y_i and p_i denote the ground truth and the predicted probability for pixel i, w_i represents the class weight for pixel i, and N is the total number of pixels. Lovasz loss builds on the Lovasz extension of submodular set functions to obtain a convex surrogate for the mean intersection-over-union (IoU) measure, which is otherwise discrete and non-differentiable. In the context of binary segmentation, the Lovasz hinge loss leverages the convex closure of the Jaccard loss, allowing the IoU metric to be optimized directly [30].
In the Weighted Lovasz Loss L_Lovasz, Lovasz(y_i, p_i) is the Lovasz hinge loss for the true label y_i and the predicted label p_i, and w_i is the class weight for pixel i.
The Combined Loss L_Combined is a weighted sum of the two losses, in which the hyperparameters α and β modulate the influence of each loss term on the final combined loss. In our fine-tuned model, we have set both α and β to 0.5, reflecting the equal contribution of each loss term to the final loss value. By merging these losses, our goal was to exploit the distinctive advantages of both Dice and Lovasz losses, while considering class weights, to achieve a segmentation model that is both accurate and sensitive to the nuances of each class.
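For reference, one set of expressions consistent with the symbol definitions above is given below. These are a reconstruction from the surrounding text and from the standard form of a weighted Dice loss, not necessarily the authors' exact formulas; in particular, the placement of w_i inside the Dice sums and the absence of a normalizing factor in the Lovasz term are assumptions.

```latex
\begin{align}
L_{\text{Dice}} &= 1 - \frac{2\sum_{i=1}^{N} w_i\, y_i\, p_i}
                           {\sum_{i=1}^{N} w_i\, y_i + \sum_{i=1}^{N} w_i\, p_i}, \\
L_{\text{Lovasz}} &= \sum_{i=1}^{N} w_i \,\mathrm{Lovasz}(y_i, p_i), \\
L_{\text{Combined}} &= \alpha\, L_{\text{Dice}} + \beta\, L_{\text{Lovasz}},
\qquad \alpha = \beta = 0.5 .
\end{align}
```

With α = β = 0.5, the combined loss is simply the arithmetic mean of the two terms.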
3) Optimizer Configuration: During the developmental phase of our model, various optimizers and learning rate schedulers, including the standard Adam optimizer, were extensively explored and tested. Our empirical evaluations, coupled with a rigorous process of trial and error, steered the choice towards the best-performing configuration for our specific application.
In this study, we employ the AdamW optimizer, which showed superior performance in terms of model convergence and generalization. The learning rate and weight decay are both set at 1 × 10^-4. To further refine the optimization process, we utilize a MultiStepLR scheduler that multiplies the learning rate by a gamma value of 0.7 at the predefined epochs [10, 20, 30]. This setup balances swift convergence with robust model generalization, both vital for the task at hand.
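The effect of this schedule on the learning rate can be worked out in closed form. The sketch below assumes the scheduler is stepped once per epoch and uses a hypothetical helper name, `multistep_lr`; it mirrors the step-decay rule (multiply by gamma once for every milestone already reached) rather than calling any training framework.

```python
from bisect import bisect_right


def multistep_lr(epoch, base_lr=1e-4, milestones=(10, 20, 30), gamma=0.7):
    """Learning rate in effect at a given epoch under step decay.

    bisect_right counts how many milestones are <= epoch, i.e. how many
    times the rate has been multiplied by gamma so far.
    """
    return base_lr * gamma ** bisect_right(milestones, epoch)


if __name__ == "__main__":
    for epoch in (0, 9, 10, 20, 30):
        print(epoch, multistep_lr(epoch))
```

Under these settings the rate falls from 1 × 10^-4 to 7 × 10^-5 at epoch 10, to 4.9 × 10^-5 at epoch 20, and to 3.43 × 10^-5 at epoch 30.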

D. Introduction to Fetal Segment Anything Model (FetSAM)
In our study, we leverage the Segment Anything Model (SAM) [31], a promptable foundation model developed for generic image segmentation tasks. SAM consists of three main components: an Image Encoder, a Prompt Encoder, and a Mask Decoder (see Fig. 2). The Image Encoder is responsible for processing the input image and generating a set of image features. Simultaneously, the Prompt Encoder produces a prompt embedding based on the input prompt, usually a spatial or textual clue. The Mask Decoder then utilizes both the image features and prompt embedding to generate a segmentation mask.
To adapt SAM for fetal head ultrasound imaging, we introduce FetSAM (Fetal Segment Anything Model). We fine-tune the mask decoder component of SAM using our custom loss function, which incorporates class weights to address the imbalanced nature of our dataset. We also employ an extensive data augmentation strategy to increase the diversity and size of our training dataset.
The architecture of FetSAM is similar to SAM but includes a few key customizations. The Image Encoder and Prompt Encoder produce embeddings that are fed into the Mask Decoder. Unlike in SAM, our Mask Decoder is trained to be more sensitive to the fetal brain structures of interest, particularly the fetal brain, CSP, and LV, by incorporating our custom loss mechanism. This enables FetSAM to generate more accurate and clinically relevant segmentation masks for prenatal care (Fig. 2).
To generate prompt bounding boxes from the labeled masks, we designed an algorithm that iterates through each class channel in a one-hot encoded mask to compute the minimum and maximum coordinates that define the bounding box for each class.
Algorithm 1 derives prompt bounding boxes from a one-hot encoded mask one_hot_mask with dimensions C × H × W, where C represents the number of classes, H is the height, and W is the width of the mask. Additional inputs to the algorithm include the number of classes num_classes, a threshold value threshold, and an offset offset to adjust the bounding boxes. The algorithm outputs a dictionary bounding_boxes containing bounding boxes for each class.
Initially, bounding_boxes is set as an empty dictionary. The algorithm then iterates over each class i in C to find the y and x indices where the mask value exceeds the threshold. If either set of indices is empty, the algorithm continues to the next iteration. Otherwise, it calculates the minimum and maximum x and y indices and applies the given offset to them. After traversing all the classes in C, the algorithm checks for any classes in num_classes that are missing from bounding_boxes. For any such missing classes, it assigns a default bounding box [0, 0, 0, 0]. Finally, the algorithm returns bounding_boxes, thus providing a comprehensive set of bounding boxes for each class based on the input segmentation mask.
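The steps described above can be sketched in plain Python as follows. The names one_hot_mask, num_classes, threshold, offset, and bounding_boxes come from the text; the function name `mask_to_bounding_boxes`, the [x_min, y_min, x_max, y_max] box layout, the default argument values, and the clamping of offset boxes to the image bounds are assumptions made for this illustration.

```python
def mask_to_bounding_boxes(one_hot_mask, num_classes, threshold=0.5, offset=0):
    """Derive one prompt box per class from a C x H x W one-hot mask.

    one_hot_mask: nested sequence indexed as [class][row][column].
    Returns {class_index: [x_min, y_min, x_max, y_max]}.
    """
    bounding_boxes = {}
    for class_index, channel in enumerate(one_hot_mask):
        # Coordinates of every pixel whose value exceeds the threshold.
        hits = [(y, x)
                for y, row in enumerate(channel)
                for x, value in enumerate(row)
                if value > threshold]
        if not hits:
            continue  # class absent from this mask

        height, width = len(channel), len(channel[0])
        ys = [y for y, _ in hits]
        xs = [x for _, x in hits]
        # Assumption: the offset enlarges the box and is clamped to the image.
        bounding_boxes[class_index] = [
            max(min(xs) - offset, 0),
            max(min(ys) - offset, 0),
            min(max(xs) + offset, width - 1),
            min(max(ys) + offset, height - 1),
        ]

    # Classes with no foreground pixels receive the default box.
    for class_index in range(num_classes):
        bounding_boxes.setdefault(class_index, [0, 0, 0, 0])
    return bounding_boxes
```

A small positive offset loosens each box slightly, which may help when the prompt is expected to cover the whole structure rather than touch its boundary.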

E. Traditional State of the Art Segmentation Models
In this section, we delineate the procedure employed for the fine-tuning of various conventional segmentation models. Our approach involved a systematic evaluation of multiple encoder architectures for each segmentation model to ascertain the most efficacious combinations.

1) Encoder Models:
In this subsection, we present the encoder architectures that were evaluated for their efficacy in segmentation tasks (see Fig. 2). Four distinct encoder models were chosen based on their computational efficiency, performance, and architectural innovations. They are as follows:
1) EfficientNet-B0 (ImageNet) [32]: EfficientNet is a convolutional neural network architecture that scales up CNNs in a principled way using compound scaling. EfficientNet-B0 is a lightweight variant that achieves strong performance on ImageNet while being very computationally efficient.
2) EfficientNet-B0 (AdvProp) [33]: This model retains the EfficientNet-B0 architecture but incorporates the AdvProp training methodology. The latter treats adversarial examples as additional training samples and handles them with separate auxiliary batch normalization, thereby enhancing both robustness and accuracy.
3) MiT-B0 (ImageNet) [34]: The Mix Transformer (MiT) replaces the standard convolutional backbone with a transformer encoder, enabling the modeling of longer-range dependencies in images. The B0 variant maintains high ImageNet accuracy while being computationally efficient.
4) MobileViTv2_075 (ImageNet) [35]: Designed for mobile applications, MobileViTv2 employs an inverted bottleneck convolutional stem and depthwise convolutions in its transformer blocks. The _075 variant further improves efficiency with a width multiplier of 0.75x.

2) Segmentation Model Selection and Optimization:
In this subsection, we delve into the process of fine-tuning traditional segmentation models for our specific task, as illustrated earlier in Fig. 2. We experimented with ten different segmentation architectures, each paired with multiple encoder models. The goal was to identify the optimal configuration that achieves high segmentation performance while maintaining computational efficiency. A brief description of each segmentation model follows:
1) U-Net (Mvit075_Unet) [36]: The U-Net architecture consists of an encoder network followed by a decoder network with skip connections between them. The encoder extracts features and the decoder upsamples and reconstructs the segmentation mask. The skip connections retain fine-grained information. In our case, the model is paired with the MobileViTv2_075 encoder and has an encoder depth of 5.
2) U-Net++ (Efb0_Unet++) [37]: U-Net++ improves on U-Net by redesigning the decoder component with a series of nested, dense skip pathways to better leverage encoder features at multiple scales. This enhances segmentation detail. We optimized this model with the EfficientNet-B0 encoder.
3) MA-Net (Mvit075_Manet) [38]: MA-Net introduces a new attention refinement module between the encoder and decoder that models inter-dependencies between encoder features. This improves feature representation and segmentation accuracy. The model utilizes the MobileViTv2_075 encoder with an encoder depth of 5.
4) LinkNet (Efb0_Linknet) [39]: LinkNet is built from an encoder network followed by a decoder. The key aspect is linking the encoder and decoder directly with residual connections that improve information flow. Our implementation employs the EfficientNet-B0 encoder.
5) FPN (Mvit075_FPN) [40]: Feature Pyramid Network (FPN) creates a pyramid of hierarchical encoder features. Decoder stages then upsample and merge these multi-scale features to capture both local and global context. The model uses the MobileViTv2_075 encoder.
6) PSPNet (Mvit075_PSPNet) [41]: Pyramid Scene Parsing Network uses pyramid pooling modules after the encoder to aggregate different-region contextual information. This global context aids challenging segmentation tasks. It is paired with the MobileViTv2_075 encoder.
7) PAN (Efb0_PAN) [42]: Path Aggregation Network adds a bottom-up path augmentation through adaptive feature pooling to complement the original top-down feature propagation in the decoder. The model uses the EfficientNet-B0 encoder.
8) DeepLabV3 (Mvit075_DLabV3) [43]: DeepLabV3 employs atrous convolution in the encoder and decoder to extract dense feature maps. It also uses pyramid pooling to encode multi-scale context. The model uses the MobileViTv2_075 encoder.
9) DeepLabV3+ (Efb0_DLabV3+) [44]: DeepLabV3+ enhances DeepLabV3 by adding a decoder module to refine the segmentation results, especially along object boundaries. It employs the EfficientNet-B0 encoder pre-trained with AdvProp.
10) SegFormer (Mit_SegFormer_b0) [34]: SegFormer employs a hierarchical Transformer encoder, specifically MiT-B0, in combination with a lightweight decoder. This configuration achieves robust performance without the need for dense prediction layers.

F. Models Evaluation
In this section, we elaborate on the evaluation metrics employed for assessing the segmentation performance of the proposed FetSAM model alongside other traditional segmentation algorithms. Each metric is calculated separately for the four segmented classes: background, Brain, CSP, and LV. Subsequently, the mean values are also calculated to furnish a comprehensive evaluation of the model's performance. All models are validated using the same dataset to ensure a fair comparison.

1) Metrics Optimized for Maximum Values:
• Dice Similarity Coefficient (DSC): The DSC is defined as DSC = 2|A ∩ B| / (|A| + |B|), where A and B are the sets of pixels belonging to the predicted and ground truth masks, respectively, and |A ∩ B| denotes the size of the intersection of the two sets. The DSC ranges from 0 to 1, where a higher value indicates better similarity.
For the ASD, A_border and B_border are the sets of border points in A and B, respectively. A lower ASD indicates a better match.
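The two quantities named above can be illustrated on small pixel sets. The sketch below is an assumption-laden illustration, not the evaluation code used in the study: masks are represented as Python sets of (row, column) tuples, a border pixel is taken to be any pixel with a 4-neighbour outside the mask, and the ASD is computed as the symmetric mean of nearest-border distances, which is one common convention.

```python
import math


def dice(a, b):
    """DSC = 2|A ∩ B| / (|A| + |B|) for two sets of pixel coordinates."""
    if not a and not b:
        return 1.0  # both masks empty: treated as a perfect match
    return 2 * len(a & b) / (len(a) + len(b))


def border(pixels):
    """Pixels of the set that have at least one 4-neighbour outside it."""
    return {(y, x) for (y, x) in pixels
            if any(n not in pixels
                   for n in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)))}


def average_surface_distance(a, b):
    """Symmetric mean distance between the border points of A and B."""
    a_border, b_border = border(a), border(b)
    if not a_border or not b_border:
        raise ValueError("both masks must be non-empty")
    a_to_b = [min(math.dist(p, q) for q in b_border) for p in a_border]
    b_to_a = [min(math.dist(q, p) for p in a_border) for q in b_border]
    return (sum(a_to_b) + sum(b_to_a)) / (len(a_to_b) + len(b_to_a))
```

For two 3 × 3 squares offset by one column, this gives a DSC of 12/18 and an ASD of 0.5 pixels, showing how the overlap score and the boundary distance capture different aspects of the same misalignment.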

G. Experimental Environment
The experiments for this study were conducted on a robust computational setup powered by a 12th Gen Intel(R) Core(TM) i7-12700KF processor, complemented by 128 GiB of system memory. The graphics-intensive tasks were handled by an NVIDIA GeForce RTX 3090 graphics card, equipped with 24 GB of GDDR6X VRAM.
In terms of software frameworks, we primarily relied on PyTorch for constructing the model architectures, training, and evaluation. To streamline the training process and for better experiment tracking, PyTorch Lightning was utilized. For leveraging pre-trained models and additional utilities, we also made use of the Hugging Face Transformers library.
To ensure optimal model performance and avoid overfitting, training was monitored with an early stopping mechanism, configured with a patience of 5 epochs. This approach allowed us to halt the training process when the model's performance ceased to improve on the validation set, thus saving computational resources and time.
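The patience rule can be sketched independently of any training framework. In the illustration below, the class name `EarlyStopping`, the helper `stopping_epoch`, the choice of validation loss as the monitored quantity, and the `min_delta` parameter are assumptions; the text only specifies a patience of 5 epochs on validation performance.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")  # lowest validation loss seen so far
        self.bad_epochs = 0       # epochs since the last improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def stopping_epoch(val_losses, patience=5):
    """Index of the epoch at which training would halt, or None if it never does."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses):
        if stopper.step(loss):
            return epoch
    return None
```

A loss that merely ties the best value counts as no improvement here, so a plateau is treated the same way as a regression.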

III. RESULTS
In this section, we present a comprehensive evaluation of the proposed FetSAM model against ten state-of-the-art segmentation models. The assessment is multi-faceted, involving both quantitative and qualitative metrics to provide a thorough comparison. First, we delve into the performance of each model as gauged by the individual metrics (DSC, HD, and ASD) across the four classes of interest: Background, Brain, CSP, and LV. Next, we aggregate these metrics in a comprehensive table to offer an overall perspective on how FetSAM stands in comparison to other models. Finally, we provide a visual evaluation to supplement the quantitative metrics, showcasing the efficacy of FetSAM in real-world applications. The aim is to furnish a well-rounded view of FetSAM's performance, thereby substantiating its advantages and potential areas for improvement.

A. Quantitative Evaluation

1) Dice Similarity Coefficient (DSC):
DSC serves as our primary metric for evaluating segmentation accuracy, as it directly measures the overlap between predicted and ground truth masks. Fig. 3 visualizes the DSC scores across the four classes (Background, Brain, CSP, and LV) as well as the mean DSC for each model.
Overall performance: Our proposed model, FetSAM, achieves the highest mean DSC score, 0.90117. This is substantially higher than its closest competitor, Efb0_DLabV3+, which has a mean DSC of 0.85322.

Class-wise insights:
• Background: All models exhibit strong performance in segmenting the background, with DSC values above 0.98. FetSAM nevertheless stands out with a nearly perfect DSC of 0.99506.
• Brain: FetSAM leads in the Brain class as well, with a DSC of 0.98508. The nearest competing model, Efb0_DLabV3+, follows with a DSC of 0.97806.
• CSP: This class proves challenging for many models, with DSC scores as low as 0.56203 for Mvit075_PSPNet. FetSAM, however, achieves a DSC of 0.8037, substantially higher than any other model.
• LV: Again, FetSAM leads with a DSC of 0.82084, while the lowest performer, Mvit075_Manet, only achieves a DSC of 0.51384.

Comparative analysis:
The consistently high DSC scores across all classes underline FetSAM's segmentation capabilities. Even in more complex classes like CSP and LV, FetSAM shows a clear margin over other models. The closest competitor, Efb0_DLabV3+, does well but falls short in these challenging classes.
By outperforming all other models in each individual class and on average, FetSAM demonstrates its effectiveness and reliability for ultrasound image segmentation tasks.

2) Hausdorff Distance (HD):
The Hausdorff Distance (HD) is a key metric for evaluating the segmentation models. It measures the largest distance from a point in one set to the nearest point in the other set, and therefore captures the worst-case discrepancy between the predicted segmentation and the ground truth. This makes HD especially useful for understanding how a model behaves when errors occur. Fig. 4 offers a detailed comparison of HD performance across the models and classes.
Outstanding performance of FetSAM: FetSAM registers the lowest mean HD value, 1.86484 across all classes. This is far lower than the second-best model, Mit_SegFormer_b0, which has a mean HD of 38.98057. FetSAM's individual class HD values are 2.67867 for Background, 3.07074 for Brain, 1.05123 for CSP, and 0.65873 for LV.

Role of prompts in FetSAM:
The use of prompts in FetSAM appears to focus the model's attention more effectively around the Region of Interest (ROI). This focused attention is likely a significant factor in FetSAM's low HD.
Comparative analysis: At the other end of the performance spectrum, Efb0_Linknet has the highest mean HD of 96.60033, indicating less accurate segmentations in worst-case scenarios. Other models such as Efb0_DLabV3+ and Mvit075_DLabV3 also show high mean HD values, 81.11064 and 79.6815, respectively.
Potential areas for improvement: Although FetSAM sets the standard in HD performance, there remains room for improvement in the other models. Among them, the LV class consistently shows higher HD values, suggesting a focus area for future model enhancements.

3) Average Surface Distance (ASD):
ASD is another important metric for evaluating segmentation models. It computes the mean distance between the points on the predicted segmentation boundary and their closest points on the ground truth boundary. Fig. 5 shows a bar chart comparison of ASD across the models and classes.
Exceptional performance of FetSAM: FetSAM achieves a mean ASD of just 0.46645. This is a considerable improvement over the next best-performing model, Mit_SegFormer_b0, which has a mean ASD of 5.0018. The individual class ASD values for FetSAM are 0.67628 for Background, 0.67105 for Brain, 0.33891 for CSP, and 0.17955 for LV.
Comparative analysis: Efb0_Linknet and Mvit075_DLabV3 have the highest mean ASD values, 17.20281 and 17.65179, respectively. These models particularly struggle in the LV class, with ASD values of 53.99729 for Efb0_Linknet and 55.44369 for Mvit075_DLabV3, which pushes their mean ASD values significantly higher.
Potential areas for improvement: While FetSAM sets a high standard in ASD performance, other models show a wide range of ASD values, indicating room for improvement. The LV class seems to be the most problematic for many models, suggesting a potential focus area for future optimizations.
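The mean values quoted for FetSAM appear to be unweighted averages over the four classes. The short check below recomputes them from the per-class DSC, HD, and ASD values stated in the text; the helper name `macro_mean` is introduced here only for illustration.

```python
# Per-class FetSAM values as reported in the text.
dsc = {"Background": 0.99506, "Brain": 0.98508, "CSP": 0.8037, "LV": 0.82084}
hd = {"Background": 2.67867, "Brain": 3.07074, "CSP": 1.05123, "LV": 0.65873}
asd = {"Background": 0.67628, "Brain": 0.67105, "CSP": 0.33891, "LV": 0.17955}


def macro_mean(per_class):
    """Unweighted mean over classes (every class counts equally)."""
    return sum(per_class.values()) / len(per_class)


mean_dsc = macro_mean(dsc)  # compare with the reported 0.90117
mean_hd = macro_mean(hd)    # compare with the reported 1.86484
mean_asd = macro_mean(asd)  # compare with the reported 0.46645
```

Because every class carries equal weight in such an average, a single difficult class such as LV can dominate a model's mean HD or ASD.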

B. Comprehensive Model Comparison
In this section, we discuss the overall performance of the ten segmentation models along with FetSAM, based on three key evaluation metrics: DSC, HD, and ASD. Table I and Table II summarize the performance metrics for each model across the different classes: Background, Brain, CSP, and LV.
FetSAM emerges as the standout model, leading on all three metrics. It achieves a mean DSC value of 0.90117, an HD value of 1.86484, and an ASD value of 0.46645.
Among the other models, Efb0_DLabV3+ performs well in terms of DSC but struggles in HD and ASD. In contrast, models like Mvit075_DLabV3 and Efb0_Linknet show generally poor performance across most metrics. Their weaker performance is most evident in the LV class, which lowers their overall mean metric values.
A consistent trend across most models is comparatively good performance in DSC and ASD but a noticeable struggle in HD. This suggests that while these models are generally good at segmentation, they are prone to large worst-case errors. FetSAM, by comparison, maintains low error rates across all of these metrics.
It is worth noting that the LV class is challenging for almost all models, particularly in terms of HD. This could be an area for future research and refinement of segmentation models.
The superior performance of FetSAM is likely due to its use of prompts, which allow the model to focus more on the ROI. In conclusion, while FetSAM sets a high standard in segmentation performance, there is room for improvement in other models, especially in the HD and ASD metrics and in challenging classes like LV.

C. Ablation Studies on Loss Functions and Model Configuration
To assess the effectiveness of the chosen loss functions and the robustness of the FetSAM model under various configurations, extensive ablation studies were undertaken. These studies evaluate the individual contributions of the Dice and Lovasz loss functions and also investigate their combined effect. Additionally, the behavior of FetSAM with different input prompts is analyzed. The findings from these ablation studies are essential to validate the design decisions made in developing FetSAM and to demonstrate its performance in fetal ultrasound image segmentation.

1) Impact of Loss Function on Segmentation Performance:
The ablation study, presented in Table III, aimed to assess the impact of various loss functions on the performance of different segmentation models. The Dice Loss serves as a baseline, while the Lovasz Loss directly targets optimization of the intersection-over-union metric. Our findings suggest that a Combined Loss function, integrating both Dice and Lovasz Losses, significantly outperforms the individual loss functions across all models. This combination notably improves the DSC and reduces the HD and ASD, underscoring its effectiveness in enhancing segmentation precision.
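To make the idea of a combined loss concrete, the following is a forward-only NumPy sketch of a class-weighted soft Dice loss added to a class-weighted Lovasz-Softmax loss. It is not the study's training code: the mixing factor `alpha`, the example class weights, the `eps` smoothing term, and the choice to skip classes absent from the ground truth are all assumptions.

```python
import numpy as np


def dice_loss(probs, onehot, class_weights, eps=1e-6):
    """Class-weighted soft Dice loss; probs and onehot have shape (C, N)."""
    inter = (probs * onehot).sum(axis=1)
    denom = probs.sum(axis=1) + onehot.sum(axis=1)
    dice = (2.0 * inter + eps) / (denom + eps)
    return float(np.sum(class_weights * (1.0 - dice)) / np.sum(class_weights))


def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss for sorted errors."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard


def lovasz_softmax(probs, onehot, class_weights):
    """Class-weighted Lovasz-Softmax loss; probs and onehot have shape (C, N)."""
    losses, weights = [], []
    for c in range(probs.shape[0]):
        fg = onehot[c]
        if fg.sum() == 0:
            continue  # assumption: skip classes absent from the ground truth
        errors = np.abs(fg - probs[c])
        order = np.argsort(-errors)  # largest errors first
        losses.append(np.dot(errors[order], lovasz_grad(fg[order])))
        weights.append(class_weights[c])
    if not losses:
        return 0.0
    return float(np.dot(weights, losses) / np.sum(weights))


def combined_loss(probs, onehot, class_weights, alpha=0.5):
    """alpha * Dice + (1 - alpha) * Lovasz; alpha is a hypothetical mixing factor."""
    return (alpha * dice_loss(probs, onehot, class_weights)
            + (1.0 - alpha) * lovasz_softmax(probs, onehot, class_weights))


# Toy example: 2 classes, 4 pixels, hypothetical class weights.
onehot = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
probs = np.array([[0.9, 0.8, 0.3, 0.1], [0.1, 0.2, 0.7, 0.9]])
weights = np.array([0.3, 0.7])
loss = combined_loss(probs, onehot, weights)
```

Both terms are zero for a perfect prediction and approach one for a fully inverted prediction, so the combined value stays on a comparable scale for any `alpha` between 0 and 1.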

2) Ablation Study on FetSAM's Sensitivity to Prompt Size:
In this ablation study, we examined the influence of prompt size on FetSAM's segmentation performance. This matters for fetal biometrics in prenatal diagnostics, where precision is paramount. We first deployed FetSAM with bounding boxes produced by our algorithm, and then probed its sensitivity to prompt dimensions by setting the bounding box offsets to 0, 10, and 20, respectively.
Table IV shows a clear performance degradation as the offset increases. With zero offset, FetSAM performs best across the board. As the offset grows to 10 and 20, DSC drops and both HD and ASD rise, a trend that is especially pronounced in the LV class, which is crucial for the fidelity of biometric computations. These results show how important prompt precision is for using FetSAM in clinical settings, which demand not only segmentation but also accurate biometric measurement. With precise bounding-box prompts, FetSAM outperforms conventional models and yields better HD and ASD results. The findings also point to the need to optimize FetSAM for variable prompt dimensions, which would strengthen its versatility and clinical viability. This capacity for precise biometric measurement is what makes FetSAM promising for fetal medicine, giving practitioners reliable data for assessing fetal well-being and development.
In addition, the baseline results for FetSAM before fine-tuning, included in Table IV, confirm the significant improvements obtained after fine-tuning. These improvements are substantial and show the impact of fine-tuning in advancing FetSAM from a baseline model to one that sets a new standard in fetal biometric segmentation. As we move towards clinical application, these ablation studies support FetSAM's potential in real-world settings, where precision is a necessity.

D. Qualitative Evaluation and Visual Comparison
Fig. 6 provides a side-by-side comparison of the segmentation masks produced by each model, including our proposed FetSAM model. This qualitative comparison is a critical part of our evaluation, as it shows how each model behaves in the varied scenarios that occur in different trimesters of fetal development. In the first and second images, which focus on the fetal brain during the first trimester, FetSAM stands out for its performance. This is noteworthy because CSP and LV are not yet visible at this stage. The models that follow in terms of accuracy are Efb0_DLabV3+, Mit_SegFormer_b0, Mvit075_Unet, Mvit075_FPN, and Efb0_Linknet.
In images 3, 4, and 5, which capture the fetal head in the second trimester, FetSAM continues to perform well. It accurately predicts the fetal brain and also handles the CSP and LV, which have now become visible. Models like Mit_SegFormer_b0, Mvit075_Unet, Mvit075_DLabV3, and Mvit075_FPN also perform well but lag behind FetSAM. For instance, image 4 reveals that while some models, such as Efb0_Unet++ and Mvit075_Manet, incorrectly predict LV, FetSAM remains consistent with the ground truth. This consistency is also observed in models like Efb0_PAN, Efb0_DLabV3+, and Mit_SegFormer_b0.
The complexity increases with images 7 and 8, representing the third and second trimesters, respectively. Here, many models struggle to predict the complete shape of the fetal brain. FetSAM, along with Mvit075_Unet and Efb0_PAN, handles image 7 well. In image 8, only FetSAM predicts the mask accurately, followed by Efb0_PAN and Efb0_Linknet.
Lastly, image 9 poses a distinct challenge due to the different orientation of the fetal brain, which complicates CSP detection. Nevertheless, FetSAM, along with Efb0_DLabV3+, Efb0_PAN, Mvit075_Unet, and Mvit075_Manet, handles this case effectively.
The visual observations confirm our quantitative findings, further emphasizing FetSAM's robustness and adaptability across different developmental stages and imaging conditions.

IV. DISCUSSION
In this work, we have presented a study of fetal brain segmentation and introduced the FetSAM model. The work relies on a novel dataset created specifically for this study, and no pre-existing benchmarks for it are available in the literature, which makes our results a first reference point for this task.
The FetSAM model has demonstrated superior performance across key quantitative metrics such as the Dice Similarity Coefficient (DSC), Hausdorff Distance (HD), and Average Surface Distance (ASD). We attribute this level of performance to its prompt-based architecture, which facilitates targeted attention on regions of interest (ROI). Such precision is crucial in complex anatomical tasks like fetal brain segmentation, where anatomical structures undergo significant changes across different trimesters.
Our methodological approach incorporated class weighting, custom loss functions, and extensive data augmentation to address the challenges posed by the imbalanced dataset, particularly for the LV class. Despite these measures, areas for improvement remain, particularly in the DSC and HD metrics, indicating that the LV class could be a focal point for future research and optimization.

While models such as Efb0_DLabV3+ and Mvit075_DLabV3 exhibited competitive performance on some metrics, they were consistently outperformed by FetSAM, especially in more challenging scenarios like third-trimester images or varying fetal brain orientations. These observations underscore FetSAM's robustness and its potential for both further research and practical clinical applications.
Notably, all models faced challenges in accurately segmenting the LV class, highlighting a need for future research to focus on achieving a more balanced dataset and possibly developing more tailored attention mechanisms or custom loss functions.
The ablation studies have further validated FetSAM's segmentation capabilities, particularly its use of prompt boxes to enhance segmentation precision. This approach significantly reduces false positives and false negatives, contributing to the model's strong performance in the HD metric. Such improvements are critical for clinical applications, where precise segmentation of edges and small structures directly influences diagnostic outcomes.
The HD improvement, which is larger than the improvement seen in DSC, highlights the method's efficacy in delineating edges and structures. While DSC measures overall overlap, HD is sensitive to errors at object boundaries, so FetSAM's ability to limit segmentation errors in these areas is a key factor in its superior HD performance.
Ultimately, FetSAM's use of prompt boxes not only enhances segmentation accuracy but also illustrates its potential to advance medical image segmentation. Future efforts will aim to further refine these techniques, improving FetSAM's robustness and clinical utility across diverse scenarios.
In conclusion, FetSAM sets a new benchmark in fetal brain segmentation, offering a solid foundation for future research in this field. Future studies might explore alternative attention mechanisms, advanced data augmentation techniques, and specialized loss functions to further refine FetSAM's performance, particularly in areas identified as needing improvement.

V. CONCLUSION
This study advances fetal brain segmentation by introducing the FetSAM model, which is fine-tuned on a novel dataset and compared against a range of segmentation models. FetSAM has demonstrated its robustness and adaptability, excelling in critical quantitative metrics and showing potential for real-world applications through its superior qualitative performance.
Looking ahead, we recognize the importance of transitioning towards fully automated systems that cover the entire process from frame selection to segmentation, in line with emerging trends in the field. The current success of FetSAM illustrates its potential to fit into such an end-to-end automated framework, which would accommodate variations in sonographer expertise, from novices to seasoned practitioners, and enhance the consistency of fetal brain segmentation outcomes.
To further enhance FetSAM, our immediate research goals include experimenting with diverse types of prompts for the model, ranging from textual to point-based inputs, to optimize segmentation accuracy. Moreover, a comparative analysis between FetSAM's performance and that of fetal ultrasound sonographers is envisaged. Such a study would deepen our understanding of FetSAM's utility in clinical scenarios and benchmark its efficacy against the human expertise that currently defines the standard of care.
In sum, the FetSAM model establishes a solid groundwork for both immediate use and future enhancements in fetal brain segmentation. Its performance, together with the scope for further improvement, positions it as a strong candidate for thorough clinical trials and applications in real-world healthcare settings.

Fig. 1. Illustration depicting the location of Fetal Brain, CSP, and LV in the ultrasound image.

Fig. 2. Schematic of the optimized pipeline for comparing FetSAM with state-of-the-art models in multi-class segmentation of fetal head ultrasound imaging.

1) Initialize Class Counts: We initialize an array class_counts with zeros to store the count of each class. The size of this array is the number of classes $n_{\mathrm{classes}}$.
$$\mathrm{class\_counts} = \mathbf{0} \in \mathbb{R}^{n_{\mathrm{classes}}}$$
2) Iterate Over the Dataset: For each batch of masks in the data loader, we convert the one-hot encoded masks to class labels.
masks = arg max(masks_one_hot, dim=1)
3) Update Class Counts: We then count the occurrences of each class in these masks and update class_counts.
unique, counts = np.unique(masks, return_counts=True)
class_counts[unique] += counts
4) Handle Zero Counts: To avoid division by zero, we replace any zero count with one.
class_counts[class_counts == 0] = 1
5) Calculate Raw Weights: The raw class weights are calculated as the inverse frequency of each class normalized by the sum of these inverse frequencies.
$$\mathrm{raw\_weights}_i = \frac{1 / \mathrm{class\_counts}_i}{\sum_{j} 1 / \mathrm{class\_counts}_j}$$
6) Smoothing and Normalization: We smooth the weights to ensure they fall within a specified range [min_weight, max_weight].
$$\mathrm{weights} = \frac{\mathrm{smoothed\_weights} - \min(\mathrm{smoothed\_weights})}{\max(\mathrm{smoothed\_weights}) - \min(\mathrm{smoothed\_weights})}$$
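The class-weighting steps can be sketched in NumPy as follows. The counting and inverse-frequency steps follow the description; the final rescaling into [min_weight, max_weight] is an assumption, since the text does not spell out how the smoothed weights are obtained, and the default bounds as well as the function name `compute_class_weights` are hypothetical.

```python
import numpy as np


def compute_class_weights(mask_batches, n_classes, min_weight=0.1, max_weight=1.0):
    """Inverse-frequency class weights from one-hot masks of shape (B, C, H, W)."""
    class_counts = np.zeros(n_classes, dtype=np.float64)
    for masks_one_hot in mask_batches:
        # One-hot (B, C, H, W) -> integer labels (B, H, W).
        masks = np.argmax(masks_one_hot, axis=1)
        unique, counts = np.unique(masks, return_counts=True)
        class_counts[unique] += counts

    # Avoid division by zero for classes that never occur.
    class_counts[class_counts == 0] = 1

    # Inverse frequency, normalized by the sum of inverse frequencies.
    inverse = 1.0 / class_counts
    raw_weights = inverse / inverse.sum()

    # Assumption: min-max normalize, then map linearly into [min_weight, max_weight].
    span = raw_weights.max() - raw_weights.min()
    if span == 0:
        return np.full(n_classes, max_weight)
    unit = (raw_weights - raw_weights.min()) / span
    return min_weight + unit * (max_weight - min_weight)


# Toy example: one 2x2 label map with three possible classes (class 2 is absent).
labels = np.array([[0, 0], [0, 1]])
one_hot = np.eye(3)[labels].transpose(2, 0, 1)[None]  # shape (1, 3, 2, 2)
weights = compute_class_weights([one_hot], n_classes=3)
```

In this toy case the frequent class 0 receives the lowest weight, while the rare class 1 and the absent class 2 (whose count is clamped to one) receive the highest.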

• Hausdorff Distance (HD): The HD is defined as the maximum of two directed Hausdorff distances:
$$\mathrm{HD}(A, B) = \max\left\{ \sup_{a \in A} \inf_{b \in B} d(a, b),\; \sup_{b \in B} \inf_{a \in A} d(a, b) \right\}$$
where sup and inf denote the supremum and infimum, and d(a, b) is the Euclidean distance between points a and b. A lower HD indicates better similarity.
• Average Surface Distance (ASD): The ASD is the mean of all the surface distances between the border points of the predicted and ground truth masks. It is defined as
$$\mathrm{ASD}(A, B) = \frac{1}{|A_{\mathrm{border}}|} \sum_{a \in A_{\mathrm{border}}} \min_{b \in B_{\mathrm{border}}} d(a, b)$$
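For finite point sets, the supremum and infimum reduce to max and min, so both definitions can be evaluated directly. The sketch below does this on two small hypothetical border point sets; the function names are illustrative, and the ASD follows the one-directional form given above (from the predicted border to the ground truth border).

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from a point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in points_b) for a in points_a)


def hausdorff_distance(points_a, points_b):
    """Symmetric HD: the maximum of the two directed distances."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))


def average_surface_distance(border_a, border_b):
    """Mean distance from each border point of A to its nearest border point of B."""
    total = sum(min(math.dist(a, b) for b in border_b) for a in border_a)
    return total / len(border_a)


# Hypothetical border points (row, col) of a predicted and a ground-truth mask.
pred_border = [(0, 0), (0, 1), (0, 2)]
truth_border = [(0, 0), (0, 1), (0, 5)]

hd = hausdorff_distance(pred_border, truth_border)          # 3.0, driven by (0, 5)
asd = average_surface_distance(pred_border, truth_border)   # (0 + 0 + 1) / 3
```

The example shows why HD reacts strongly to a single outlying point, whereas ASD averages the error over the whole border.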

Fig. 3. Comparison of DSC across models and classes, highlighting the superior performance of FetSAM.

Fig. 4. Comparison of HD across models and classes, highlighting the superior performance of FetSAM.

Fig. 5. Comparison of ASD across models and classes, emphasizing the exceptional performance of FetSAM.

Fig. 6. Comparison of predicted segmentation masks: A side-by-side comparison of the 10 masks produced by each segmentation model, contrasted with the ground truth and our proposed FetSAM model.

Algorithm 1: Algorithm for Deriving Prompt Bounding Boxes.
for each class i in C do
    y_indices, x_indices ← find where one_hot_mask[i] > threshold
    if y_indices.size = 0 OR x_indices.size = 0 then
        continue to the next iteration
    end if
    x_min, x_max ← min(x_indices), max(x_indices)
    y_min, y_max ← min(y_indices), max(y_indices)
    Apply offset to x_min, x_max, y_min, y_max
    bbox ← [x_min, y_min, x_max, y_max]
    bounding_boxes[i] ← bbox
end for
for each class i in num_classes do
    if i not in bounding_boxes then

The resulting bounding box bbox = [x_min, y_min, x_max, y_max] is stored in bounding_boxes under the corresponding class i.
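The recoverable steps of Algorithm 1 can be sketched in NumPy as below. Several details are assumptions rather than the authors' implementation: the offset is taken to expand the box outward and is clipped to the image bounds, the default threshold of 0.5 is hypothetical, and classes without any pixels are assigned `None` because the corresponding branch of the algorithm is not available in the text.

```python
import numpy as np


def derive_prompt_boxes(one_hot_mask, threshold=0.5, offset=0):
    """Return {class_index: [x_min, y_min, x_max, y_max] or None} for a (C, H, W) mask."""
    num_classes, height, width = one_hot_mask.shape
    bounding_boxes = {}

    for i in range(num_classes):
        y_indices, x_indices = np.where(one_hot_mask[i] > threshold)
        if y_indices.size == 0 or x_indices.size == 0:
            continue  # class not present in this mask

        x_min, x_max = int(x_indices.min()), int(x_indices.max())
        y_min, y_max = int(y_indices.min()), int(y_indices.max())

        # Assumption: the offset grows the box outward, clipped to the image.
        x_min = max(x_min - offset, 0)
        y_min = max(y_min - offset, 0)
        x_max = min(x_max + offset, width - 1)
        y_max = min(y_max + offset, height - 1)

        bounding_boxes[i] = [x_min, y_min, x_max, y_max]

    for i in range(num_classes):
        if i not in bounding_boxes:
            # Assumption: the handling of absent classes is not recoverable
            # from the text, so a placeholder is stored here.
            bounding_boxes[i] = None

    return bounding_boxes


# Toy example: class 0 occupies rows 2-3 and columns 3-5; class 1 is absent.
mask = np.zeros((2, 6, 8))
mask[0, 2:4, 3:6] = 1.0
tight = derive_prompt_boxes(mask, offset=0)
loose = derive_prompt_boxes(mask, offset=1)
```

Calling the function with offsets of 0, 10, and 20 would produce the progressively looser prompts examined in the prompt-size ablation, under the stated assumption about how the offset is applied.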

TABLE I
COMPARISON OF SEGMENTATION PERFORMANCE METRICS FOR THE CLASSES OF BACKGROUND, BRAIN, AND CSP ACROSS VARIOUS MODELS INCLUDING FETSAM

TABLE II
COMPARISON OF SEGMENTATION PERFORMANCE METRICS FOR THE CLASSES OF LV AND MEAN ACROSS VARIOUS MODELS INCLUDING FETSAM

TABLE III
ABLATION STUDY RESULTS FOR DIFFERENT LOSS FUNCTIONS

TABLE IV
ABLATION STUDY RESULTS FOR FETSAM MODEL WITH BASELINE AND VARYING PROMPT SIZES