FPUS23: An Ultrasound Fetus Phantom Dataset With Deep Neural Network Evaluations for Fetus Orientations, Fetal Planes, and Anatomical Features

Ultrasound imaging is one of the most prominent technologies to evaluate the growth, progression, and overall health of a fetus during its gestation. However, the interpretation of the data obtained from such studies is best left to expert physicians and technicians who are trained and well-versed in analyzing such images. To improve the clinical workflow and potentially develop an at-home ultrasound-based fetal monitoring platform, we present a novel fetus phantom ultrasound dataset, FPUS23, which can be used to identify (1) the correct diagnostic planes for estimating fetal biometric values, (2) fetus orientation, (3) their anatomical features, and (4) bounding boxes of the fetus phantom anatomies at 23 weeks gestation. The entire dataset is composed of 15,728 images, which are used to train four different Deep Neural Network models, built upon a ResNet34 backbone, for detecting aforementioned fetus features and use-cases. We have also evaluated the models trained using our FPUS23 dataset, to show that the information learned by these models can be used to substantially increase the accuracy on real-world ultrasound fetus datasets. We make the FPUS23 dataset and the pre-trained models publicly accessible at https://github.com/bharathprabakaran/FPUS23, which will further facilitate future research on fetal ultrasound imaging.


I. INTRODUCTION
Ultrasound imaging techniques are used to create an image of organs and tissues inside the human body without the use of radiation, such as X-Rays, or expensive equipment, like Magnetic Resonance Imaging (MRI). Ultrasound technologies are used in day-to-day healthcare clinics to efficiently diagnose diseases like COVID-19 [1], [2] and detect tumors [3]. Ultrasound is also widely used to monitor the development of an unborn human fetus to obtain information regarding its development and overall health. An ultrasound examination is performed at various stages of the fetus' gestation to confirm the pregnancy and determine its location, condition, size, growth, orientation, gestational age, identify potential birth defects and complications, and many other factors relevant to the healthy development and delivery of the fetus. However, the data obtained from such examinations are difficult to understand and require the expertise or training of sonographers or physicians to accurately interpret the data. For instance, as depicted in an ultrasound image of a fetus at 23 weeks gestation (fig. 1), identifying a fetus can be quite easy. However, identifying the orientation of the fetus and evaluating key biometric parameters, like the abdominal circumference or femur length, which are used to ascertain the gestational age of the fetus, requires the level of expertise that is currently offered only by trained sonographers and physicians. This "experience" can be learned and embedded within the deep learning models, which can be deployed in clinical use-cases as assistants to aid healthcare professionals in data interpretation. These models can also be used to build an at-home portable ultrasound-based fetal monitoring platform that can enable the user to understand the data by collating and interpreting the vital information. However, the development of such a deep learning model for analyzing fetal ultrasound data requires investigation of the following key research challenges: (1) ultrasound fetus data fall under the category of healthcare information that is heavily protected by regulatory requirements regarding their generation, storage, usage, etc. to ensure patient privacy; (2) due to these regulations, there are very few openly accessible fetal ultrasound datasets, which can be used to develop such clinical assistants and at-home monitoring platforms; (3) even the state-of-the-art datasets that are accessible are not properly annotated with the relevant information in order to be able to train DNN models, which can be used to infer relevant fetus anatomy information; and (4) the existing datasets are not large enough to enable the DNN model to learn the required features efficiently, despite the use of existing transfer learning approaches.
To address these research challenges, we build the FPUS23 dataset (see fig. 2) by: (1) using a fetus phantom at 23 weeks gestation, instead of actual human fetuses, thereby circumventing the regulations associated with healthcare data; (2) not generating or using any healthcare data in our dataset, so that FPUS23 can be openly accessible to further facilitate future research and advancements in this domain; (3) properly labeling and annotating the dataset with the help of scientists with experience in fetal ultrasound imaging to generate a dataset that can be used to identify (i) diagnostic planes for extracting fetal biometric parameters, (ii) fetus orientation, (iii) fetus anatomies, and (iv) bounding boxes of the fetus anatomies; (4) building a dataset with 15,728 ultrasound image samples that can be used to learn the required information; and (5) extensively evaluating our dataset with appropriate transfer learning approaches, including model compression techniques, as illustrated in Section IV.

II. RELATED WORK
There is an abundance of ultrasound datasets for various use-cases, which can be used to generate DNN-based models for classification and segmentation. For instance, the breast ultrasound image dataset presented by Al-Dhabyani et al. [6] is composed of normal, benign, and malignant images that can be used to train a model to act as a classifier. Similarly, the POCUS dataset, presented by Born et al. [1], and the COVIDX-US dataset, by Ebadi et al. [2], are openly accessible for building DNN-based clinical assistants that can aid in the analytics and diagnosis of COVID-19. Leclerc et al. [7] presented a cardiac ultrasound (echocardiography) dataset containing image sequences with two- and four-chamber views of the heart of 500 patients. Likewise, there is a wide range of ultrasound datasets for diagnosing and analyzing several internal body organs.
Fetal Ultrasound: Deep learning has also been explored for fetal ultrasound imaging, albeit not as widely or comprehensively. [8] proposed a multi-scale self-attention generator that can be used to automatically generate ultrasound images from various segmentation masks, which can then be used for fetal brain segmentation and analysis. [9] proposed the use of deep learning to automatically analyze the fetal heart by encoding and translating spatio-temporal information in order to classify amongst three different fetal heart planes. [10] proposed a CNN-based classifier that can be used to detect cardiac abnormalities in fetal ultrasound images. [11] have proposed the use of CNNs to detect the six standard fetal brain planes on a proprietary dataset containing 30,000 2D fetal ultrasound images gathered between 16 and 34 weeks gestation, to achieve 91% accuracy. [12] presented two DNN models that are used to detect the ideal frame for the fetal head, followed by its segmentation, which is used to measure the head circumference, a key biometric parameter. [13] proposed an improved multi-task learning network that improves the segmentation capabilities of the model when compared to [12]. Although ultrasound examinations of a fetus are very common, there are very few openly accessible datasets for researchers to build DNN-based models that can aid in the analytics of fetal ultrasound images. [14] presented a dataset of fetal ultrasound images with annotations regarding the head circumference, as shown in [12], [13]. [15] presented a fetal ultrasound dataset with over 12,400 images from 1,792 patients, which were categorized into six classes containing the anatomical planes. Most of the other works in this category primarily work on proprietary datasets, which are not accessible for analysis and evaluation.

III. FPUS23: THE FETAL ULTRASOUND DATASET

A. DATA COLLECTION
The required ultrasound fetal data was generated and collected using the "US-7 SPACE FAN-ST" fetus phantom [5], which has been typically used to train sonographers to assess the development and condition of the fetus. Figs. 3(a) and 3(b) illustrate an overview of the oval-shaped phantom abdomen, which mimics the uterus containing a fetus at 23 weeks gestation, and the life-size fetus demonstration model that is placed inside the phantom abdomen, respectively. We propose to use a 23-week-old phantom as a mid-pregnancy scan is typically performed around this time to check for fetus anomalies. The phantom abdomen can be rotated into four different positions to change the orientation and presentation (cephalic or breech) of the fetus phantom (see fig. 3(c)). The fetus model includes full skeletal structure and key organic features that can be observed and used to train the sonographer to assess the fetus' anatomy (like head, arms, legs, abdomen) and internal body organs (like brain, skull, spine, cardiac chambers, stomach, kidney, blood vasculature, etc.). The biometric parameters of the fetus can also be measured/learned using an ultrasound of the fetus phantom at the appropriate positions or the correct diagnostic planes. Besides the aforementioned features, the quantity of amniotic fluid, any potential abnormalities, location of the placenta, fetal posture, etc. can also be learned with the help of this model.
We use the X6-1 xMATRIX array transducer [16], which is interfaced with the Philips Epiq-7 system [4] to collect and process the data to generate the final ultrasound image (see fig. 3(d)). The Anatomically Intelligent Ultrasound (AIUS) imaging technology deploys advanced organ modeling and imaging techniques to generate a two-dimensional image of the fetus phantom using the default settings for the "OB Fetal Echo" imaging option. The imaging depth was set to 12 cm and the frames were captured at a 23 Hz frame rate. Sufficient ultrasound gel is applied on the phantom abdomen to ensure acoustic coupling with the probe, thereby reducing the acoustic impedance mismatch and enabling clear imaging. We executed two protocols to collect the images used in the FPUS23 dataset: (1) Protocol-I: The focus of this protocol is to obtain images at the diagnostic planes used to estimate three biometric parameters: the plane used to estimate the Biparietal Diameter (BPD), the plane used to estimate the Abdominal Circumference (AC), and the femur standard plane, which is used to estimate the fetus' Femur Length (FL).
The correct diagnostic planes were identified using the clinical protocols discussed by Salomon et al. [17] and Bethune et al. [18]. To further enrich the dataset, after the acquisition of several frames at the correct diagnostic plane, we tilt, rotate, or traverse the ultrasound probe in random directions to collect more information. (2) Protocol-II: The focus of this protocol is to obtain images capturing the anatomies of the fetus phantom in the generated images. We do this by navigating the probe to obtain the head, abdomen, arms, and legs, individually or combined, in the picture and move the probe in different directions to obtain a heterogeneous set of images capturing the fetal anatomies. Furthermore, the phantom abdomen was also rotated and placed in the four possible orientations [head up (hu) or down (hd), view front (vf) or back (vb)], when collecting the ultrasound data, to potentially mimic the real-life behavior of fetus orientation and presentation (see fig. 3(c)). Additionally, the probe orientation was also changed between horizontal and vertical, with respect to the abdomen, when the data was collected to enhance the dataset with more information.

B. ANNOTATION
The data streams obtained by the Philips Epiq-7 ultrasound system are converted to PNG image sequences, of dimension 664×388, using custom in-house software for easier labeling, annotating, and processing. The stored PNG files are annotated using a customized version of the Computer Vision Annotation Tool (CVAT) [19], which was chosen primarily for its ease of use and wide range of features. Each acquired ultrasound frame was subsequently annotated by scientists with experience in fetal ultrasound imaging. The sequences obtained using Protocol-I are labeled as a correct diagnostic plane for one of the three biometric parameters (BPD, AC, FL) or as a non-diagnostic plane. Since the number of data samples obtained for each of the diagnostic planes is much smaller than that of the non-diagnostic plane output class, the samples in each of the three diagnostic-plane classes were augmented to ensure equal representation of data across all classes. The data obtained using Protocol-II is, first, labeled with fetus orientations, namely huvf, huvb, hdvf, or hdvb, as discussed earlier, based on the position of the phantom abdomen when the scans are made. Next, the images are tagged with the anatomies present in the image, such as heads, arms, legs, and abdomen. The images are subsequently exhaustively annotated with boxes representing the respective anatomies to determine their bounds and potentially estimate biometric parameters later, such as femur length, at the correct diagnostic plane. Images that do not contain any vital and/or relevant information regarding the fetus are not labeled or annotated. These finalized labels and annotations, for each valid image in the dataset, are extracted as an XML file, which can be used to train various deep learning models as required. Table 1 depicts an overview of our FPUS23 dataset and the number of input samples present in each class for each of the four different super-labels: (1) Diagnostic Plane, (2) Fetus Orientation, (3) Fetus Anatomy, and (4) Anatomy Bounds, using box annotation. The dataset is split in the ratio of 8 : 1 : 1, with respect to training, validation, and testing, respectively, for the first three cases. We split the anatomy bounds data and use 80% of it for training and 20% for validating the model. Fig. 4 depicts a few sample images and their corresponding annotations for the four different super-labels of our FPUS23 dataset.
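The 8 : 1 : 1 split described above can be sketched as follows. This is a minimal illustration, not the procedure used to build FPUS23: the shuffling step, the fixed seed, the `split_dataset` helper, and the file names are all assumptions made for the example.

```python
import random


def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle samples and split them into train/val/test parts by `ratios`."""
    items = list(samples)
    # Assumption: a seeded random shuffle; the paper does not state how frames were assigned.
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the annotated PNG frames.
frames = [f"frame_{i:05d}.png" for i in range(1000)]
train, val, test = split_dataset(frames)
print(len(train), len(val), len(test))
```

For the anatomy bounds data, the same helper could be called with `ratios=(8, 2, 0)` to obtain the 80%/20% training/validation split. Because consecutive ultrasound frames are highly correlated, a split by acquisition sequence rather than by individual frame may be preferable in practice.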

A. EXPERIMENTAL SETUP
The experimental evaluations illustrated in this section, which depict the efficacy of the DNN models trained using our FPUS23 dataset, are primarily completed on a CentOS 7.9 operating system running on an Intel Core i7-8700 CPU with 16 GB RAM and two Nvidia GeForce GTX 1080 Ti GPUs. Our scripts were executed with the following software versions: CUDA 11.5, PyTorch 3.7.4.3, torchvision 0.11.1, and PyTorch Lightning 1.5.1. We use a ResNet34 [20] DNN model, pre-trained on ImageNet [21], and retrain it with our dataset for 15 epochs using the cross-entropy loss function. The initial and final layers of the ResNet34 architecture were adapted to accommodate the custom input data dimensions and output classes, respectively. We use a modified Faster-RCNN [22] with our ResNet34 backbone to build the model used for determining the anatomy bounds of the fetus in our dataset. The learning rate for all models was set to 0.001 using the Adam optimizer, with a learning-rate step size of 20 and γ = 0.1.
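The learning-rate settings above (0.001, step size 20, γ = 0.1) read like a step-decay schedule, in which the rate is multiplied by γ once every `step_size` epochs. The sketch below evaluates that schedule under the assumption that the step size is counted in epochs; the `step_lr` helper is hypothetical and not taken from the released code.

```python
def step_lr(initial_lr, epoch, step_size=20, gamma=0.1):
    """Learning rate in effect after `epoch` completed epochs under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# Learning rate used in each of the 15 training epochs (epochs 0..14).
schedule = [step_lr(0.001, epoch) for epoch in range(15)]
print(schedule[0], schedule[-1])

# The first decay would only take effect from epoch 20 onwards.
print(step_lr(0.001, 20))
```

Under this reading, the learning rate stays at 0.001 for the whole 15-epoch run, since the first decay step would only occur after epoch 20.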
We use the traditional metrics of accuracy, precision, and recall to determine the efficacy of the deep learning models illustrated in this section. Accuracy is the ratio of the number of correctly predicted inputs to the total number of predictions, and is a primary evaluation metric for classification systems. For the anatomy bounds, we use the mean Average Precision (mAP; IoU=[.50:.05:.95]), which reflects the ability of the model to not label a negative sample as positive, and the mean Average Recall (mAR; IoU=[.50:.05:.95]), which denotes the ability of the model to identify all positive instances of each class. The Intersection over Union (IoU) is computed as

$$\mathrm{IoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{p_{i,i}}{\sum_{j=1}^{N} p_{i,j} + \sum_{j=1}^{N} p_{j,i} - p_{i,i}},$$

where $N$ denotes the total number of output classes, $p_{i,i}$ the number of pixels classified as class $i$ and labeled as class $i$, and $p_{i,j}$, $p_{j,i}$ are the number of pixels classified as class $i$ and labeled as class $j$ and vice-versa. We also evaluate the model's F1-score, which is the harmonic mean of the model's precision and recall.
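To make these definitions concrete, the sketch below derives accuracy, per-class precision, recall, F1-score, and IoU from a matrix of counts `p`, where `p[i][j]` is the number of samples (or pixels) classified as class i and labeled as class j, following the p i,j notation above. The `class_metrics` helper and the toy counts are assumptions for illustration; the full mAP/mAR computation over the IoU thresholds .50:.05:.95 is not reproduced here.

```python
def class_metrics(p):
    """Compute accuracy, per-class metrics, and mean IoU from a count matrix.

    p[i][j] = number of items classified as class i and labeled as class j.
    """
    n = len(p)
    total = sum(sum(row) for row in p)
    accuracy = sum(p[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        predicted_i = sum(p[i][j] for j in range(n))  # everything classified as i
        labeled_i = sum(p[j][i] for j in range(n))    # everything labeled as i
        tp = p[i][i]
        precision = tp / predicted_i if predicted_i else 0.0
        recall = tp / labeled_i if labeled_i else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        union = predicted_i + labeled_i - tp
        iou = tp / union if union else 0.0
        per_class.append(
            {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
        )
    mean_iou = sum(c["iou"] for c in per_class) / n
    return accuracy, per_class, mean_iou


# Toy two-class example with made-up counts.
p = [[8, 2],
     [1, 9]]
accuracy, per_class, mean_iou = class_metrics(p)
print(accuracy, per_class, mean_iou)
```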

B. BASELINE MODEL
To illustrate the effectiveness of the FPUS23 dataset, we retrain a modified version of the ResNet34 architecture, which we consider as the baseline, and assess the capability of the network to learn relevant information regarding the classification of labels and the detection of fetal anatomies. Accuracy and F1-score are the two metrics used to determine the quality of the model, whereas the number of floating-point operations (Flops) and memory (MB) are relevant to estimate the hardware and resource requirements of the baseline model and determine its deployability on edge devices. The results of these experiments are illustrated in Table 2. Achieving ∼99% accuracy in the classification of diagnostic planes, fetus orientation, and fetus anatomy gives us the understanding that the models are able to learn the features quite well. Similarly, the modified Faster-RCNN model, embedded with our ResNet34 backbone, is able to detect fetal anatomies at significantly high precision. Note that the significantly high quality of the models can also be attributed to their over-parameterization, which implies that smaller networks, achieving similar output accuracy, can be obtained with Neural Architecture Search (NAS) [23] and model compression techniques [24] (see Section IV-C). Moreover, the use of an inverted probe during an ultrasound exam leads to the generation of an inverted image. To design a robust network model that can extract features and information from the inverted image to ensure correct classification, relevant image samples need to be collected, with the probe inverted, and included in the dataset during the training stage. However, this information can also be added to the model by flipping the collected images along the y-axis, which mimics probe-inverted images, and adding them to the original dataset before training.
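The flipping augmentation mentioned above can be sketched as follows. We assume here that "flipping along the y-axis" means a left-right mirror of the image, and that boxes are stored as (xmin, ymin, xmax, ymax) in pixel coordinates; both are assumptions for illustration, as are the helper names, and the box layout may differ from that of the released annotations.

```python
def flip_image(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def flip_box(box, image_width):
    """Mirror a box given as (xmin, ymin, xmax, ymax) to match a flipped image."""
    xmin, ymin, xmax, ymax = box
    return (image_width - xmax, ymin, image_width - xmin, ymax)


# Toy 2x4 "image" and a box covering its top-left pixel column.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
box = (0, 0, 1, 2)

flipped = flip_image(image)
flipped_box = flip_box(box, image_width=4)
print(flipped)      # [[4, 3, 2, 1], [8, 7, 6, 5]]
print(flipped_box)  # (3, 0, 4, 2)
```

For the anatomy bounds, the box annotations have to be mirrored together with the image; for the classification labels, whether the orientation label should change under mirroring depends on how the view labels are defined, which this sketch does not address.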

C. MODEL COMPRESSION
To further reduce the model's hardware requirements, we use compression techniques like pruning and quantization. We have implemented the technique proposed by Han et al. [25], which proposes to eliminate the smallest x% of total weights and associated connections from the network, followed by a network retraining stage, wherein the model relearns the information on the reduced set of available parameters, to potentially achieve similar output quality as the original model. We analyze the quality and hardware requirements of the models that are 30%, 50%, and 70% pruned. The pruned networks are subsequently quantized to 8-bit integer (INT8) precision, using the quantization-aware training strategy presented by Khudia et al. [26], to further reduce the model's size and improve its computational performance; INT8 computations are several orders of magnitude faster than 32-bit floating-point (FP32) operations [27]. Similar to the regularizing effect illustrated in [28], compression of the baseline models led to potential scenarios where the compressed models outperform the original. This regularizing effect is especially prominent when the ResNet34 classifiers are compressed, due to their heavy over-parameterization. Table 3 illustrates the quality evaluations and the hardware requirements of these models when trained on FPUS23.
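The two compression steps can be illustrated on a flat list of weights. The sketch below zeroes the smallest-magnitude fraction of weights, in the spirit of the pruning step described above, and then applies a plain symmetric INT8 mapping. It is only an illustration under stated assumptions: the retraining stage is omitted, the quantizer is a simple post-hoc mapping rather than the quantization-aware training strategy of [26], and the helper names and weight values are hypothetical.

```python
def magnitude_prune(weights, fraction):
    """Zero out the `fraction` of weights with the smallest absolute value."""
    k = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])  # indices of the k smallest-magnitude weights
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Map weights to integers in [-127, 127] with one symmetric scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


# Made-up weights standing in for one layer of the network.
weights = [0.5, -0.05, 0.3, 0.01, -0.8, 0.2, -0.02, 0.6, 0.1, -0.4]
pruned = magnitude_prune(weights, 0.3)  # 30% pruned
quantized, scale = quantize_int8(pruned)
print(pruned)
print(quantized, scale)
```

In a real pipeline, pruning would be followed by retraining so that the remaining weights can compensate for the removed connections.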
Plenty of research works on automated NAS have demonstrated their efficacy in reducing the number of parameters and computations for achieving similar quality results compared to over-parameterized architectures that are typically designed by hand. Towards this, we first investigate the effectiveness of two state-of-the-art automated NAS approaches presented by Fang et al. [29] and Wang et al. [30] in reducing the number of Flops and memory while retaining the output quality. Both these approaches yield a residual DNN with 10 intermediate feature extraction layers to achieve an output quality similar to that of the baseline. However, while exhaustively generating and exploring smaller networks from scratch, we generated a residual DNN with 8 layers, instead of the 10 proposed by [29], [30], which offers similar output quality while requiring fewer parameters. We use this ResNet8 architecture to build models for classifying the diagnostic planes, fetus orientation, and fetus anatomy. However, we have observed that the quality of the Faster-RCNN model built using a ResNet8 backbone is substantially lower than that of the model using a ResNet10 backbone. Therefore, the Faster-RCNN model is built using a ResNet10 backbone instead. From the results, it is quite evident that the ResNet8 model achieves a similar output quality as the original baseline models, even when both of them are compressed using pruning and/or quantization. For instance, a fetus anatomy classifier built using the ResNet8 architecture, which has been 30% pruned and INT8 quantized, achieves an output quality greater than the baseline ResNet34 model, while requiring less than 10 MB in memory, making it ideal for deployment on resource-constrained processing platforms. The Faster-RCNN built using the ResNet10 backbone also achieves similar quality to the baseline model while substantially reducing the hardware requirements. Table 4 illustrates the quality evaluations and the hardware requirements of the NAS models when trained on the FPUS23 dataset.

D. EVALUATION ON STATE-OF-THE-ART REAL-WORLD DATASET
To further demonstrate the applicability of our dataset in real-world fetal ultrasound use-cases, we fine-tune the models trained using FPUS23 on the training set and evaluate them on the test set presented by [15]. We train two models: Model-1 is the baseline model trained on the FPUS23 dataset before being retrained on the real-world dataset, whereas Model-2 is trained directly on the real-world dataset. Model-1 achieves 91.92% accuracy in detecting the anatomical planes after training for just 1 epoch, whereas Model-2 achieves the same accuracy only after training for more than 16 epochs. Therefore, the models can learn relevant features regarding fetal ultrasounds from the FPUS23 dataset before being deployed for other fetal ultrasound datasets with very little fine-tuning. To re-emphasize this, we perform a small analysis on the networks under consideration. We consider the weights of the FPUS23-trained model before and after fine-tuning on the real-world dataset. The corresponding weights in these two sets are subtracted, and the differences are squared and summed (like a sum of squared errors) to obtain a single value, which denotes the amount of fine-tuning undergone by the model. We do the same for the ImageNet-trained model and compare the two. The FPUS23-trained model yields a value of 2.01 × 10⁻⁷, whereas the ImageNet-trained model yields 3.02 × 10⁻⁶, which is more than 15× larger. This implies that the former model requires less fine-tuning in comparison to the latter. Section IV-E presents a comprehensive analysis of the knowledge retained and transferred by models trained on the FPUS23 dataset using ablation studies, followed by a discussion of the anatomy detection results for the trained DNN model.
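The weight-shift analysis described above amounts to a sum of squared differences over all parameters. A minimal sketch follows, assuming the weights are available as dictionaries that map a layer name to a flat list of values; the `fine_tuning_shift` helper, the layer names, and the toy numbers are hypothetical.

```python
def fine_tuning_shift(before, after):
    """Sum of squared differences between two sets of weights with matching keys."""
    total = 0.0
    for name, w_before in before.items():
        w_after = after[name]
        total += sum((a - b) ** 2 for a, b in zip(w_after, w_before))
    return total


# Toy weights before and after fine-tuning (made-up values).
before = {"conv1": [0.10, -0.20], "fc": [0.30]}
after = {"conv1": [0.11, -0.20], "fc": [0.28]}
shift = fine_tuning_shift(before, after)
print(shift)

# Ratio of the two aggregate values reported in the text.
ratio = 3.02e-6 / 2.01e-7
print(ratio)
```

A smaller value indicates that the weights moved less during fine-tuning; the ratio of the two reported values is roughly 15, in line with the comparison in the text.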

E. ABLATION STUDIES
With no fine-tuning on the state-of-the-art dataset [15], the FPUS23-trained model is able to make predictions with relatively higher accuracy when compared to an ImageNet-trained model, as shown by the results in Table 5.
Next, we provide a class-wise prediction breakdown of the models when fine-tuned on the state-of-the-art dataset. The results of these experiments are presented in Tables 6 and 7. The FPUS23-trained model requires just 1 epoch of fine-tuning, whereas the ImageNet-trained model, shown after 5 epochs of fine-tuning, converges to the same accuracy level only after 15 epochs.
We have also fine-tuned the two models using a 40% subset of the real-world training data. The FPUS23-trained model achieves the same accuracy as the baseline (92%), whereas the ImageNet-trained model is unable to achieve the same accuracy and falls short at 88%.

V. CONCLUSION AND FUTURE WORK
In this paper, we present FPUS23, an ultrasound dataset of a fetus phantom at 23 weeks gestation. The data streams are collected and annotated by scientists with relevant fetal ultrasound experience to obtain information regarding the (1) diagnostic plane, (2) fetus orientation, (3) fetus anatomy, and (4) their bounds, using box annotations. The generated dataset is used to train a variety of deep learning models to illustrate the models' ability to extract vital information, which can be used to accurately distinguish among the classes in different categories and detect the fetus anatomy bounds. Furthermore, to evaluate their deployability on portable resource-constrained devices, we evaluated the capability of a smaller DNN compressed using pruning and quantization, illustrating that smaller DNNs are equally competent at extracting relevant information from the dataset and are capable of execution on resource-constrained devices and embedded platforms. The FPUS23 dataset is open-source and the trained models are accessible online. In our future work, we plan to include annotated data of fetus phantoms at different gestation durations to offer a more comprehensive fetal ultrasound dataset.

FIGURE 1: Identifying anatomical fetal features, such as limbs, which can enable the extraction of biometric parameters that determine the growth of the fetus. Figures (a)-(d) illustrate the ability of an object detection model in detecting various fetus anatomies, across different views, with little fine-tuning on our FPUS23 dataset.

FIGURE 3: Overview of (a) the phantom abdomen in the mother body torso and its possible rotations; (b) the fetus phantom placed in the abdomen; (c) the four possible fetus orientations; (d) the Philips Epiq-7 ultrasound system (adapted from [4] and [5]).

FIGURE 4: Super-labels of the FPUS23 dataset and corresponding samples in each class.

Fig. 5 provides a sample of the anatomy detection results of the Faster-RCNN model with the ResNet34 backbone trained using the FPUS23 dataset. As illustrated by the ground truth and prediction, in figs. 5(a) and 5(b), respectively, the precision achieved by the model is quite high, and the model detects anatomies accurately in most instances, as illustrated by the overall mAP and mAR.

TABLE 1: Breakdown of the number of labeled data samples in each class of our FPUS23 dataset. Of the 15,728 images, not all contain relevant information for labeling, since images obtained during probe movements may not accurately depict anatomies.

TABLE 2: Preliminary quality evaluations (accuracy and F1-score) and hardware requirements (Flops and memory) of the modified ResNet34 model trained using the FPUS23 dataset.

TABLE 3: Exhaustive evaluations of the compressed baseline ResNet34 model.


TABLE 4: Evaluations of the DNN model obtained through exhaustive neural architecture search. As expected, a smaller model achieves the same or similar output quality as the baseline while reducing the hardware requirements by up to 17×.

TABLE 5: Evaluation of the models trained using ImageNet and FPUS23 on the real-world fetal ultrasound dataset [15] without any fine-tuning.

TABLE 6: Class-wise prediction breakdown of the ImageNet-trained model when fine-tuned for the real-world dataset [15] after 5 epochs.